DNS Timeout issues seen in Infrastructure Automation running in TCA Management cluster for Airgap Environment
search cancel

DNS Timeout issues seen in Infrastructure Automation running in TCA Management cluster for Airgap Environment

book

Article ID: 314259

calendar_today

Updated On:

Products

VMware VMware Telco Cloud Automation

Issue/Introduction

Provide a method to eliminate the occurrences of DNS timeout errors seen in the Infrastructure Automation pod running in the TCA Management cluster.

This procedure will allow to resume Day 1 Infrastructure Automation operations in the cluster.


Symptoms:

Day 1 Infrastructure Automation operations when running on the TCA Management cluster fails due to DNS Timeout issues in the AIRGAP Environment.

Specifically we see the following errors in Infrastructure Automation logs for different appliances indefinitely: "The DNS operation timed out after 5.103232392323231 seconds".

This halts all the deployments like RDC management/workload domain, CDC Workload domain, Compute cluster, Cell Site group, host-profile and other Day 1 Infrastructure Automation operations on the cluster.

NOTE: This issue is specifically seen after TCA state is migrated from Bootstrapper VM to TCA Management cluster

 

Symptom 1: Multiple Alerts seen in Infrastructure Automation UI for domains representing FQDN resolution errors

FQDN resolution errors

 

Symptom 2: Errors are flooded in tcf_manager.log for the Infrastructure Automation pod running in the TCA Management cluster.

tcf_manager.log errors

 


Environment

VMware Telco Cloud Automation 1.x
VMware Telco Cloud Automation 2.0

Cause

FQDN resolution fails specifically for all the appliances when Infrastructure Automation pod runs in the TCA Management cluster.

Analysis suggests CoreDNS conflicts as DNS resolution python library used in Infrastructure Automation fails to resolve names as required in the cluster.

Resolution

Step 1:

SSH to TCA Bootstrapper VM as root user.
 

Step 2:

Download and Execute the patch provided as Shell script (airgap_ztp_patch.sh)

script execution

 

Step 3:

Verify if tca-tcf-manager-0 (Infrastructure Automation pod) is UP and RUNNING using below command:

kubectl get pods -n tca-mgr | grep tca-tcf-manager-0

get pods

 

Verification:

Once the patch is applied to solve the issue, verify whether the listed issue is solved.

Case 1: Alerts in Infrastructure Automation UI are now cleared for the respective domain.

image2021-12-10_15-53-19.png

 

Case 2: DNS timeout errors previously seen in tcf_manager.log are not seen anymore.

To verify this step:

1) Export kubeconfig

export KUBECONFIG=$(find /opt/vmware/k8s-bootstrapper/ -name kubeconfig | tail -1)

2) Execute below command to enter into shell of Infrastructure Automation pod running in the cluster:

kubectl exec -i -t tca-tcf-manager-0 -n tca-mgr bash

image2021-12-10_15-36-59.png

3) As a next step, check /var/log/tcf_manager.log

Log file should not display any DNS timeout errors

 


Attachments

airgap-ztp-patch get_app