Provide a method to eliminate the occurrences of DNS timeout errors seen in the Infrastructure Automation pod running in the TCA Management cluster.
This procedure will allow to resume Day 1 Infrastructure Automation operations in the cluster.
Symptom 1: Multiple Alerts seen in Infrastructure Automation UI for domains representing FQDN resolution errors
Symptom 2: Errors are flooded in tcf_manager.log for the Infrastructure Automation pod running in the TCA Management cluster.
1.x, 2.x
FQDN resolution fails specifically for all the appliances when Infrastructure Automation pod runs in the TCA Management cluster.
Analysis suggests CoreDNS conflicts as DNS resolution python library used in Infrastructure Automation fails to resolve names as required in the cluster.
Step 1:
SSH to TCA Bootstrapper VM as root user.
Step 2:
Download and Execute the patch provided as Shell script (airgap_ztp_patch.sh)
Step 3:
Verify if tca-tcf-manager-0 (Infrastructure Automation pod) is UP and RUNNING using below command:
kubectl get pods -n tca-mgr | grep tca-tcf-manager-0
Once the patch is applied to solve the issue, verify whether the listed issue is solved.
Case 1: Alerts in Infrastructure Automation UI are now cleared for the respective domain.
Case 2: DNS timeout errors previously seen in tcf_manager.log are not seen anymore.
To verify this step:
1) Export kubeconfig
export KUBECONFIG=$(find /opt/vmware/k8s-bootstrapper/ -name kubeconfig | tail -1)
2) Execute below command to enter into shell of Infrastructure Automation pod running in the cluster:
kubectl exec -i -t tca-tcf-manager-0 -n tca-mgr bash
3) As a next step, check /var/log/tcf_manager.log
Log file should not display any DNS timeout errors