Apply Host Profile for Cell Site Host(s) task hangs or takes more than an hour to complete.
Article ID: 325412

Products

VMware Telco Cloud Automation

Issue/Introduction

- Change TCA Infrastructure Automation behavior to only apply the host profile configuration to newly added cell site hosts in a cell site group as opposed to all hosts in the cell site group domain.
- Ensure TCA host profile apply tasks will complete within 15 minutes.
- Provide customizable parameter to configure a timeout for applying a TCA host profile configuration.

NOTE: If the BIOS or Firmware is not involved, the time to configure a host profile should not exceed 15 minutes. The user can customize this timeout value on the TCA operator side, avoiding the one-hour wait.

NOTE: These patches persist after the TCA Manager is rebooted.

NOTE: The 12GB of allocated memory (as per the TCA deployment guide) should be reserved in its entirety (100% reserved). This memory reservation is for TCA-CP, not both TCA and TCA-CP.


Symptoms:

In Telco Cloud Automation (TCA) 2.0.x and 1.9.5, the Apply Host Profile for Cell Site Host(s) task hangs or takes more than an hour to complete.


Environment

VMware Telco Cloud Automation 2.0.1
VMware Telco Cloud Automation 2.0
VMware Telco Cloud Automation 1.9.5

Cause

If one cell site host is in an unhealthy state and a user tries to add a new cell site host to the same cell site group, a resync is triggered on the failed host, causing the original add-host operation to hang.

When a new host is added to a cell site domain that contains one or more previously provisioned hosts, a combined host profile task is initiated on all hosts in the domain (new and provisioned). As a result, the Apply Host Profile task for the newly added host hangs, because the host profile task may fail on a previously provisioned host if certain conditions are not met (e.g., a node pool was already deployed, the host is unreachable, or the host has NIC issues).

Resolution

Resolved in Telco Cloud Automation 2.1

Workaround:
1. Take snapshot of TCA Manager VM. For more details refer to Take a Snapshot in the VMware Host Client.

2. SSH into the TCA Manager as admin and switch user to root.
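This step can be sketched as follows; <tca-manager> is a placeholder for your TCA Manager hostname or IP:

```shell
# Replace <tca-manager> with the hostname or IP of your TCA Manager.
ssh admin@<tca-manager>
# Once logged in as admin, switch to the root user (prompts for the root password).
su -
```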

3. Download the patch files attached to this article and copy them to the following directory on the TCA Manager:
/home/admin
The following patches are required:
- install_ztp_host_config_patch.tar.gz
- hostconfig-service-jar.tar.gz
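The downloaded patch files can be copied from a workstation to the TCA Manager with scp, for example (assuming the files are in the current directory):

```shell
# Replace <tca-manager> with the hostname or IP of your TCA Manager.
scp install_ztp_host_config_patch.tar.gz hostconfig-service-jar.tar.gz admin@<tca-manager>:/home/admin/
```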
4. Open the ZTP container bash session using:
docker exec -u root -it tcf-manager bash
This will put you in the ZTP container bash under /opt/vmware/tcf.

5. Copy the patch file install_ztp_host_config_patch.tar.gz to the /opt/vmware/tcf directory.
scp admin@<tca-manager>:/home/admin/install_ztp_host_config_patch.tar.gz .

6. Untar the patch using:
tar -xvzf install_ztp_host_config_patch.tar.gz

7. Go to the patch directory:
cd install_ztp_host_config_patch

8. Execute the patch using README.txt present in the folder. Note that the patch must be run from the /opt/vmware/tcf/install_ztp_host_config_patch directory.
8.1. Change the permission for run_ztp_host_config_patch.sh with command:
chmod 777 run_ztp_host_config_patch.sh
8.2 Run run_ztp_host_config_patch.sh
./run_ztp_host_config_patch.sh
8.3. Exit from the container
exit
8.4. Restart tcf-manager
systemctl restart tcf-manager
8.5 Check the tcf-manager service status after restarting it and ensure you can log into the ZTP container and view the Infrastructure Automation UI.
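One way to perform these checks (a sketch, run on the TCA Manager as root):

```shell
# Verify the tcf-manager service restarted cleanly; look for "active (running)".
systemctl status tcf-manager
# Confirm the ZTP container is reachable by opening a bash session in it, then exit.
docker exec -u root -it tcf-manager bash
exit
```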

9. To patch the TCA Manager jar, first ensure you have exited the ZTP container. Switch to the /home/admin directory and untar hostconfig-service-jar.tar.gz. This extracts the hostconfig-service-1.0.jar file.
tar -xvzf hostconfig-service-jar.tar.gz

10. Switch to the /opt/vmware/Services/hostconfig-service_1.0 directory on the TCA Manager:
cd /opt/vmware/Services/hostconfig-service_1.0

11. Backup the existing jar using:
cp hostconfig-service-1.0.jar hostconfig-service-1.0.jar.bck

12. Copy the hostconfig-service-jar that was previously downloaded and extracted from /home/admin to the /opt/vmware/Services/hostconfig-service_1.0 directory. This will overwrite the existing file so make sure you have backed it up as noted in the previous step.
cp /home/admin/hostconfig-service-1.0.jar /opt/vmware/Services/hostconfig-service_1.0

13. Restart App Engine using:
systemctl restart app-engine

14. Check the App Engine service status, ensure it is up and running.
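For example, the service state can be verified with systemctl:

```shell
# Look for "active (running)" in the output.
systemctl status app-engine
```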

15. To patch the hostconfig operator, log in to each TCA-CP as admin and execute the following commands:
a.
export MINIKUBE_HOME=/common/minikube; export KUBECONFIG=/home/admin/.kube/config

b.
kubectl patch deployment -n tca-system hostconfig-operator -p  '{"spec":{"template":{"spec":{"containers":[{"name":"manager","image":"vmwaresaas.jfrog.io/registry/hostconfig-operator:2.0.0.1", "imagePullPolicy": "Always", "args": ["--health-probe-bind-address=:8081","--timeout=15"],"resources":{"limits":{"memory":"600Mi"}}}]}}}}'

16. Check the status of the hostconfig operator POD after running the previous command:
kubectl get po -n tca-system

Note: The timeout period can be customized in this command as needed. The command above sets the timeout to 15 minutes, which is the VMware recommended value. If there is a reason to increase or decrease the timeout, do so only after consulting VMware.
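As a hypothetical illustration, patching with a 30-minute timeout would change only the --timeout argument of the command in step 15 (again, adjust this value only after consulting VMware):

```shell
# Identical to the command in step 15, except --timeout=30 instead of --timeout=15.
kubectl patch deployment -n tca-system hostconfig-operator -p '{"spec":{"template":{"spec":{"containers":[{"name":"manager","image":"vmwaresaas.jfrog.io/registry/hostconfig-operator:2.0.0.1", "imagePullPolicy": "Always", "args": ["--health-probe-bind-address=:8081","--timeout=30"],"resources":{"limits":{"memory":"600Mi"}}}]}}}}'
```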

Additional Information

Impact/Risks:
Telco Cloud Automation fails to identify if the host profile is already applied to a provisioned cell site host(s) and gets hung when trying to apply host profile configurations on a provisioned host(s). This in turn delays the host profile task for any newly provisioned hosts in a healthy state.

Attachments

install_ztp_host_config_patch.tar
hostconfig-service-jar.tar