VKS Cluster Management Failed to Install - Auto-Attach Pods ProviderFailed due to Hostname Mismatch


Article ID: 421367


Updated On:

Products

VMware vSphere Kubernetes Service

Issue/Introduction

When attempting to install VKS Cluster Management, the VCF Operations web UI reports that it is unhealthy with the following error message:

Reason: ReconcileFailed. Message: kapp: Error: waiting on reconcile deployment/auto-attach (apps/v1) namespace: svc-auto-attach-domain-c#: Finished unsuccessfully (Deployment is not progressing: ProgressDeadlineExceeded (message: ReplicaSet "auto-attach-<ID>" has timed out progressing.)).

The above error is propagated to VCF Operations from the vCenter Supervisor Management -> Supervisor Services page for the auto-attach Supervisor service.

This error originates from the status of the auto-attach Supervisor service PackageInstall (pkgi) in the Supervisor cluster, which can be queried as follows:

While connected to the Supervisor cluster context:

kubectl get pkgi -n vmware-system-supervisor-services

kubectl describe pkgi -n vmware-system-supervisor-services <auto-attach-pkgi-name>

The above ReconcileFailed error message indicates that the auto-attach pods have failed to reach a healthy, Running state.
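
To see the underlying condition that kapp-controller reports for the failing package, the pkgi status conditions can also be read directly. A minimal sketch, using the same pkgi name placeholder as above (the jsonpath query is one of several equivalent ways to read the status):

kubectl get pkgi -n vmware-system-supervisor-services <auto-attach-pkgi-name> -o jsonpath='{.status.conditions}'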

 

While connected to the Supervisor cluster context, the following symptoms are observed:

  • One or more auto-attach pods are stuck in Pending or ProviderFailed state:

    These pods reach the ProviderFailed state at roughly 120 minutes of age, at which point a new pod is spun up in the Pending state.

    As a result, a long list of auto-attach pods stuck in the ProviderFailed state can accumulate over time (see the sketch after this list for a quick way to count them).

    kubectl get pods -A -o wide | grep "auto-attach"
    
    NAMESPACE                     NAME                 READY   STATUS
    svc-auto-attach-domain-c#    auto-attach-<ID>     0/1     ProviderFailed
    svc-auto-attach-domain-c#    auto-attach-<ID>     0/1     Pending

     

  • Describing an auto-attach pod in ProviderFailed state shows the following:
    kubectl describe pod -n svc-auto-attach-domain-c# <auto-attach-pod-name>
    
    network setup failure: context deadline exceeded

     

  • Describing an auto-attach pod in Pending state shows the following error:
    kubectl describe pod -n svc-auto-attach-domain-c# <auto-attach-pod-name>
    
    cfgAgent returned CONFIG_INEXISTENCE

     

  • Checking NSX-NCP nsx-operator container logs shows error messages similar to the following:
    kubectl logs -n vmware-system-nsx <nsx-ncp pod name> -c nsx-operator
    
    error: node <ESXi host name> not found
    node <ESXi host name> not found yet in NSX side.

    These errors indicate that NSX does not recognize the noted ESXi host(s) and cannot successfully start the auto-attach pod on the ESXi host.

 
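As referenced above, a quick way to gauge the accumulation and confirm the NSX-side errors is to count the failed pods and filter the nsx-operator log. A minimal sketch, assuming the namespace placeholder and the quoted log messages appear exactly as shown above:

# Count auto-attach pods stuck in ProviderFailed:
kubectl get pods -n svc-auto-attach-domain-c# --no-headers | grep -c "ProviderFailed"

# Filter the nsx-operator log for the "not found" host errors:
kubectl logs -n vmware-system-nsx <nsx-ncp pod name> -c nsx-operator | grep "not found"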

Spherelet logs on the ESXi hosts will report failures to monitor the auto-attach pods, similar to the following:

cat /var/log/spherelet.log

failed to retrieve pod status for auto-attach-<ID> - PodNotFound, reason: status=Pending
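
To narrow the spherelet log to only the auto-attach pod failures, the log can be filtered on the host. A minimal sketch, run from an SSH session on the affected ESXi host:

grep "auto-attach" /var/log/spherelet.log | tail -20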

Environment

VMware Cloud Foundation (VCF) 9

vSphere Supervisor

Cause

The errors in the NSX-NCP nsx-operator container indicate that NSX cannot successfully start the auto-attach pod on the corresponding ESXi host because NSX-NCP does not recognize the ESXi host(s).

This is caused by hostname differences for the noted ESXi host(s) between vCenter, NSX, and the Supervisor cluster.

NSX-NCP is the bridge between the Supervisor cluster and NSX.

ESXi node names in the Supervisor cluster are synced from the ESXi host's Default TCP/IP stack configuration in vCenter.

This issue can occur even when there are no typos or misconfigurations in the ESXi host's domain or hostname, but the hostname case differs between systems.

For example, vCenter may show all ESXi hosts in uppercase, whereas NSX-NCP and the Supervisor cluster see them in lowercase.

It is recommended to use all lowercase.
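
A pure case mismatch can be spotted quickly by normalizing names to lowercase before comparing. A minimal sketch, where <ESXi hostname> is a placeholder for the name copied from the vCenter UI:

# Node name as the Supervisor cluster sees it (case-insensitive match):
kubectl get nodes -o name | grep -i "<ESXi hostname>"

# Lowercased form of the name shown in vCenter, for side-by-side comparison:
echo "<ESXi hostname>" | tr '[:upper:]' '[:lower:]'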

Resolution

Initial Checks

Verify whether there is a mismatch between the ESXi hostnames in vCenter, the Supervisor cluster, and NSX:

  1. From the vCenter web UI, navigate to the ESXi host's Configure -> Networking -> TCP/IP Networking and click the three-dot menu (⋮) to Edit the Default stack.
    Verify the ESXi hostname and domain.



  2. While connected to the affected ESXi host via SSH, run the following command to check its hostname:
    hostname

     

  3. From the Supervisor cluster context, check the list of nodes to verify the ESXi hostnames and domain:
    kubectl get nodes

     

  4. From the Supervisor cluster context, check the NSX-NCP pod logs to verify the ESXi hostnames and domain seen by NSX:
    kubectl logs -n vmware-system-nsx <nsx-ncp pod name> -c nsx-operator
    
    error: node <ESXi host name> not found
    node <ESXi host name> not found yet in NSX side.

     

  5. The hostname and domain of each ESXi host must match, including case (uppercase, lowercase), across the Supervisor cluster, vCenter, and NSX (see the consolidated sketch after this list).
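
The checks above can be consolidated into a few commands. A minimal sketch; esxcli system hostname get is an alternative to the plain hostname command on ESXi that also prints the domain and FQDN:

# On the ESXi host (SSH) - hostname, domain, and FQDN:
esxcli system hostname get

# From the Supervisor cluster context - node names as Kubernetes sees them:
kubectl get nodes -o name

# NSX-NCP's view - look for "not found" messages about the host:
kubectl logs -n vmware-system-nsx <nsx-ncp pod name> -c nsx-operator | grep -i "<ESXi hostname>"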

 

Corrective Steps

Correct any mismatches found in the checks above:

  1. Place the affected ESXi host into maintenance mode.

  2. Once the ESXi host is in maintenance mode, navigate in the vCenter web UI to the ESXi host's Configure -> Networking -> TCP/IP Networking

  3. Edit the Default TCP/IP Stack to correct the hostname, domain and case (uppercase, lowercase) as necessary.
    • It is recommended to use all lowercase.

  4. If the Supervisor cluster's ESXi node entry needs to be corrected, ensure that the Default TCP/IP stack above is correct, then delete the node so that Kubernetes recreates the entry:
    While connected to the Supervisor cluster context:
    kubectl get nodes | grep <ESXi host>
    
    kubectl delete node <ESXi host>

     

  5. Take the ESXi host out of maintenance mode.


  6. Once the ESXi host is out of maintenance mode, the corrected ESXi hostname should appear in the Supervisor cluster's list of nodes:
    kubectl get nodes

     

  7. Confirm that NSX-NCP nsx-operator container logs no longer report failures to find the corrected ESXi host:
    kubectl logs -n vmware-system-nsx <nsx-ncp pod name> -c nsx-operator | grep <ESXi host>

     

  8. Repeat the above hostname and domain corrections on all affected ESXi hosts associated with the Supervisor cluster.


  9. Once all ESXi hosts have been corrected, clean up the failed auto-attach pods (see the sketch after this list for a bulk cleanup approach).
    The Pending auto-attach pod can be deleted so that it is recreated on the corrected ESXi host(s).
    kubectl get pods -A -o wide | grep "auto-attach"
    
    kubectl delete pod -n svc-auto-attach-domain-c# <auto-attach-pod-name>

     

  10. Confirm that the newly recreated auto-attach pod reaches Running state:
    kubectl get pods -A -o wide | grep "auto-attach"
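
As referenced in step 9, the accumulated failed pods can also be cleaned up in bulk, and the replacement pod watched until it is Running. A minimal sketch; the field selector assumes the ProviderFailed pods report phase Failed, so verify with kubectl get pods before deleting in bulk:

# Assumption: ProviderFailed pods report phase "Failed" - verify first:
kubectl delete pods -n svc-auto-attach-domain-c# --field-selector=status.phase=Failed

# Watch the replacement auto-attach pod until it reaches Running:
kubectl get pods -n svc-auto-attach-domain-c# -w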


Additional Information

Related KB Article: NCP down with "Failed to get TN ID for node" error and vSphere Pods fail to get created showing 'FailedRealizeNSXResource'