Provide a method to eliminate the occurrences of DNS timeout errors seen in the Infrastructure Automation pod running in the TCA Management cluster.
This procedure will allow to resume Day 1 Infrastructure Automation operations in the cluster.
Day 1 Infrastructure Automation operations when running on the TCA Management cluster fails due to DNS Timeout issues in the AIRGAP Environment.
Specifically we see the following errors in Infrastructure Automation logs for different appliances indefinitely: "The DNS operation timed out after 5.103232392323231 seconds".
This halts all the deployments like RDC management/workload domain, CDC Workload domain, Compute cluster, Cell Site group, host-profile and other Day 1 Infrastructure Automation operations on the cluster.
NOTE: This issue is specifically seen after TCA state is migrated from Bootstrapper VM to TCA Management cluster
Symptom 1: Multiple Alerts seen in Infrastructure Automation UI for domains representing FQDN resolution errors
Symptom 2: Errors are flooded in tcf_manager.log for the Infrastructure Automation pod running in the TCA Management cluster.
FQDN resolution fails specifically for all the appliances when Infrastructure Automation pod runs in the TCA Management cluster.
Analysis suggests CoreDNS conflicts as DNS resolution python library used in Infrastructure Automation fails to resolve names as required in the cluster.
Step 1:
SSH to TCA Bootstrapper VM as root user.
Step 2:
Download and Execute the patch provided as Shell script (airgap_ztp_patch.sh)
Step 3:
Verify if tca-tcf-manager-0 (Infrastructure Automation pod) is UP and RUNNING using below command:
kubectl get pods -n tca-mgr | grep tca-tcf-manager-0
Once the patch is applied to solve the issue, verify whether the listed issue is solved.
Case 1: Alerts in Infrastructure Automation UI are now cleared for the respective domain.
Case 2: DNS timeout errors previously seen in tcf_manager.log are not seen anymore.
To verify this step:
1) Export kubeconfig
export KUBECONFIG=$(find /opt/vmware/k8s-bootstrapper/ -name kubeconfig | tail -1)
2) Execute below command to enter into shell of Infrastructure Automation pod running in the cluster:
kubectl exec -i -t tca-tcf-manager-0 -n tca-mgr bash
3) As a next step, check /var/log/tcf_manager.log
Log file should not display any DNS timeout errors