When trying to install VKS Cluster Management, the VCF Operations web UI reports that it is unhealthy with the below error message:
Reason: ReconcileFailed. Message: kapp: Error: waiting on reconcile deployment/auto-attach (apps/v1) namespace: svc-auto-attach-domain-c#: Finished unsuccessfully (Deployment is not progressing: ProgressDeadlineExceeded (message: ReplicaSet "auto-attach-<ID>" has timed out progressing.)).
The above error is propagated to VCF Operations from the vCenter Supervisor Management -> Supervisor Services page for the auto-attach Supervisor service.
This error originates from the status of the auto-attach Supervisor service PackageInstall (pkgi) in the Supervisor cluster, which can be queried as follows:
While connected to the Supervisor cluster context:
kubectl get pkgi -n vmware-system-supervisor-services
kubectl describe pkgi -n vmware-system-supervisor-services <auto-attach-pkgi-name>
The above ReconcileFailed error message indicates that the auto-attach pods have failed to reach a healthy, Running state.
While connected to the Supervisor cluster context, the following symptoms are observed:
One or more auto-attach pods are stuck in Pending or ProviderFailed state:
These pods reach the ProviderFailed state at around 120 minutes of age, at which point a new pod is spun up in the Pending state.
Over time, this can result in a long list of auto-attach pods accumulating in the ProviderFailed state.
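To gauge how many pods have accumulated, the pod listing can be filtered for ProviderFailed entries. A minimal sketch (the helper name and the sample usage are illustrative, not part of the product):

```shell
# Count auto-attach pods stuck in ProviderFailed, reading
# `kubectl get pods`-style output on stdin.
count_provider_failed() {
  grep 'auto-attach' | grep -c 'ProviderFailed'
}

# Usage against a live cluster:
#   kubectl get pods -A -o wide | count_provider_failed
```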
kubectl get pods -A -o wide | grep "auto-attach"
NAMESPACE NAME READY STATUS
svc-auto-attach-domain-c# auto-attach-<ID> 0/1 ProviderFailed
svc-auto-attach-domain-c# auto-attach-<ID> 0/1 Pending
kubectl describe pod -n svc-auto-attach-domain-c# <auto-attach-pod-name>
network setup failure: context deadline exceeded
kubectl describe pod -n svc-auto-attach-domain-c# <auto-attach-pod-name>
cfgAgent returned CONFIG_INEXISTENCE
kubectl logs -n vmware-system-nsx <nsx-ncp pod name> -c nsx-operator
error: node <ESXi host name> not found
node <ESXi host name> not found yet in NSX side.
These errors indicate that NSX does not recognize the noted ESXi host(s) and cannot successfully start the auto-attach pod on the ESXi host.
Spherelet logs on the ESXi hosts will report failures to monitor the auto-attach pods, similar to the following:
cat /var/log/spherelet.log
failed to retrieve pod status for auto-attach-<ID> - PodNotFound, reason: status=Pending
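When the spherelet log is large, the relevant lines can be isolated with a simple filter; a sketch, with the helper name being illustrative:

```shell
# Filter spherelet log output (read on stdin) down to the lines that
# report auto-attach pod monitoring failures.
filter_spherelet() {
  grep 'auto-attach' | grep 'failed to retrieve pod status'
}

# Usage in an SSH session on the ESXi host:
#   cat /var/log/spherelet.log | filter_spherelet
```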
VMware Cloud Foundation (VCF) 9
vSphere Supervisor
The errors in the NSX-NCP nsx-operator container indicate that NSX-NCP does not recognize the ESXi host(s), so NSX cannot successfully start the auto-attach pod on the corresponding host.
This is due to hostname differences between vCenter, NSX, and the Supervisor cluster for the noted ESXi host(s).
NSX-NCP is the bridge between the Supervisor cluster and NSX.
ESXi node names in the Supervisor cluster are synced from the vCenter TCP/IP Configuration Default stack configuration for each ESXi host.
This issue can occur even when there are no typos or misconfigurations in the domain or hostname of the ESXi host, but the hostnames differ in case between systems.
For example, vCenter may show all ESXi hosts in uppercase, whereas NSX-NCP and the Supervisor cluster view the ESXi hosts in lowercase.
It is recommended to use all-lowercase ESXi hostnames.
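The underlying problem is that DNS hostnames are case-insensitive, but these components compare node names as literal strings, so a case-only difference breaks the match. A minimal sketch of normalizing a hostname for comparison (the hostname shown is illustrative):

```shell
# Normalize a hostname to lowercase so comparisons across vCenter,
# NSX, and the Supervisor cluster ignore case differences.
to_lower() { printf '%s' "$1" | tr '[:upper:]' '[:lower:]'; }

to_lower 'ESX01.LAB.LOCAL'   # prints esx01.lab.local
```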
Verify that there is a mismatch between the ESXi hostnames in vCenter, the Supervisor cluster, and NSX:
On the affected ESXi host, check the reported hostname:
hostname
While connected to the Supervisor cluster context, list the node names:
kubectl get nodes
Check the NSX-NCP nsx-operator container logs for the hostname-related errors:
kubectl logs -n vmware-system-nsx <nsx-ncp pod name> -c nsx-operator
error: node <ESXi host name> not found
node <ESXi host name> not found yet in NSX side.
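The comparison between the ESXi-reported hostname and the Supervisor node name can be scripted; a sketch, with the helper name and the example hostnames being illustrative:

```shell
# Classify how an ESXi hostname (as reported by vCenter/ESXi) relates
# to the node name seen in the Supervisor cluster.
check_hostname() {
  vc_name="$1"; node_name="$2"
  vc_lower=$(printf '%s' "$vc_name" | tr '[:upper:]' '[:lower:]')
  node_lower=$(printf '%s' "$node_name" | tr '[:upper:]' '[:lower:]')
  if [ "$vc_name" = "$node_name" ]; then
    echo "match"
  elif [ "$vc_lower" = "$node_lower" ]; then
    echo "case-only mismatch"
  else
    echo "different names"
  fi
}

check_hostname 'ESX01.LAB.LOCAL' 'esx01.lab.local'   # prints case-only mismatch
```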
Correct the hostname mismatches found in the above steps accordingly, then perform the following while connected to the Supervisor cluster context:
Confirm the node entry with the mismatched name:
kubectl get nodes | grep <ESXi host>
Delete the stale node entry so that it can re-register with the corrected hostname:
kubectl delete node <ESXi host>
Verify that the node reappears with the expected name:
kubectl get nodes
Confirm that the NSX-NCP nsx-operator logs no longer report the node as not found:
kubectl logs -n vmware-system-nsx <nsx-ncp pod name> -c nsx-operator | grep <ESXi host>
List the failed auto-attach pods:
kubectl get pods -A -o wide | grep "auto-attach"
Delete the stuck auto-attach pods so that replacements are scheduled:
kubectl delete pod -n svc-auto-attach-domain-c# <auto-attach-pod-name>
Verify that the new auto-attach pods reach a Running state:
kubectl get pods -A -o wide | grep "auto-attach"
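When many pods have accumulated in ProviderFailed, deleting them one by one is tedious. A sketch of generating the delete commands in bulk (the helper name is illustrative; review the generated commands before executing them):

```shell
# Emit a `kubectl delete pod` command for every auto-attach pod in
# ProviderFailed state, reading `kubectl get pods -A --no-headers`
# output on stdin (columns: NAMESPACE NAME READY STATUS ...).
emit_pod_deletes() {
  awk '$2 ~ /auto-attach/ && $4 == "ProviderFailed" { print "kubectl delete pod -n " $1 " " $2 }'
}

# Usage against a live cluster, after reviewing the output:
#   kubectl get pods -A --no-headers | emit_pod_deletes | sh
```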