When Guest Cluster Control Plane nodes experience an outage (degraded api-server or etcd), the Control Plane node IP addresses might not be re-added as endpoints of the Supervisor Cluster control-plane service, even after the nodes recover and show healthy api-server responses.
Symptoms:
This condition occurs when:
Impact:
The VMOP controller cannot add node IPs back to the endpoint, creating a circular dependency; existing nodes cannot be marked healthy and new nodes cannot join without healthy endpoints.
VMware vSphere 7.0 or newer with Tanzu, guest clusters with multiple control plane nodes
Guest Cluster Control Plane health probes are defined by CAPI (Cluster API) and use port 6443, which is the kube-apiserver port on each VM. The VMOP controller on the Supervisor Cluster queries these health probes against the Control Plane VirtualMachine objects to determine endpoint health status.
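For a quick manual check of the same port the probes use, the sketch below queries the kube-apiserver /healthz endpoint on each Control Plane node IP. It is only an illustration and not the VMOP controller's own probe logic. The function names healthz_url and probe_nodes are hypothetical, and the script assumes curl is installed, the node IPs are reachable from where it runs, and /healthz is readable without credentials (the Kubernetes default).

```shell
#!/usr/bin/env bash
# Sketch: probe the kube-apiserver health endpoint on each Control Plane node.
# healthz_url and probe_nodes are hypothetical helper names.

# Build the health URL for one node IP (port 6443, as used by the probes).
healthz_url() {
  printf 'https://%s:6443/healthz' "$1"
}

# Probe every IP passed as an argument and print one status line per node.
probe_nodes() {
  local ip
  for ip in "$@"; do
    # -k skips certificate verification: acceptable for a manual check only.
    if curl -sk --max-time 5 "$(healthz_url "$ip")" | grep -qx 'ok'; then
      echo "$ip healthy"
    else
      echo "$ip unhealthy"
    fi
  done
}

# Example with placeholder IPs:
#   probe_nodes 10.10.5.5 10.10.5.6
```

A node reported as healthy here can still be missing from the control-plane endpoint, which is the condition this article describes.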
When performing health checks, the VMOP controller validates VirtualMachineImage compliance before adding a node IP address back to the endpoint object. If the VMImage of the Guest Cluster Control Plane VM doesn't match the VMImage expected in the KubeadmControlPlane template, the Control Plane node IP addresses will not be added to the control-plane endpoint by VMOP. This occurs despite healthy etcd and api-server responsiveness on the Control Plane nodes when pointing to the local IP address for kubectl commands.
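To see which Control Plane VMs would fail such a comparison, something like the following sketch may help. It filters a list of "VM name, image name" pairs against the image expected by the KubeadmControlPlane template. The helper name find_image_mismatches is hypothetical, and the .spec.imageName field path shown in the comment is an assumption that should be verified against the VirtualMachine objects in your environment.

```shell
#!/usr/bin/env bash
# Sketch: list Control Plane VMs whose image differs from the expected one.
# Input on stdin: one "<vm-name> <image-name>" pair per line.
# Argument 1: the image name expected by the KubeadmControlPlane template.
find_image_mismatches() {
  local expected="$1"
  # Print the VM name for every line whose image column differs.
  awk -v want="$expected" 'NF >= 2 && $2 != want { print $1 }'
}

# Possible way to produce the input on the Supervisor Cluster. The field
# path .spec.imageName is an assumption, not confirmed by this article:
#   kubectl -n <GC_NAMESPACE> get virtualmachine --no-headers \
#     -o custom-columns=NAME:.metadata.name,IMAGE:.spec.imageName \
#     | find_image_mismatches <EXPECTED_IMAGE>
```

Any VM name printed by the function carries an image other than the expected one and is a candidate for the mismatch described above.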
The VMImage mismatch that prevents Control Plane node IPs from being added back to the control-plane endpoint can only be corrected by rolling out new nodes. This creates a deadlock: existing Control Plane nodes, although healthy, cannot be marked healthy for the endpoint update because of the image mismatch, and new Control Plane nodes with the correct VMImage cannot complete kubeadm join operations without existing node IPs in the endpoint.
Workaround:
WARNING: This workaround temporarily blocks VMOP from automatically adding new Control Plane node IPs to the Endpoints on Supervisor Cluster. You must restore the patch privilege after new nodes join the cluster.
1. Find the local node IP:
   ifconfig
2. Back up the admin.conf file:
   cp /etc/kubernetes/admin.conf /etc/kubernetes/admin.conf.bak
3. Edit admin.conf.bak and replace the Server VIP with the local node IP:
   vi /etc/kubernetes/admin.conf.bak
4. Point kubectl at the edited copy:
   export KUBECONFIG=/etc/kubernetes/admin.conf.bak
5. Back up the VMOP ClusterRole:
   kubectl get clusterrole vmware-system-vmop-manager-role -o yaml > cluster-role-vmware-system-vmop-manager-role.yaml
6. Edit the ClusterRole:
   kubectl edit clusterrole vmware-system-vmop-manager-role
   Remove "- patch" from the endpoints resource verbs section, leaving:
   - apiGroups:
     - ""
     resources:
     - endpoints
     verbs:
     - create
     - delete
     - get
     - list
     - update
     - watch
7. Patch the Control Plane node IPs into the endpoints manually:
   kubectl -n <GC_NAMESPACE> patch endpoints <GC_NAME>-control-plane-service -p='{"subsets":[{"addresses":[{"ip":"<GC_NODE_IP>"},{"ip":"<GC_NODE_IP>"}],"ports":[{"name":"apiserver","port":6443,"protocol":"TCP"}]}]}'
   Example:
   kubectl -n test-01 patch endpoints tkc-01-control-plane-service -p='{"subsets":[{"addresses":[{"ip":"10.10.5.5"},{"ip":"10.10.5.6"}],"ports":[{"name":"apiserver","port":6443,"protocol":"TCP"}]}]}'
8. Verify the endpoints:
   kubectl -n <GC_NAMESPACE> get endpoints <GC_NAME>-control-plane-service -o yaml
9. After the new nodes join the cluster, edit the ClusterRole again:
   kubectl edit clusterrole vmware-system-vmop-manager-role
   Add "- patch" back to the endpoints resource verbs.

The Guest Cluster Control Plane health probes are defined by CAPI (Cluster API) and use port 6443, which corresponds to the kube-apiserver on each VM. The VMOP controller on the Supervisor Cluster queries these health probes against the Control Plane VirtualMachine objects to determine endpoint health status.
When performing the VirtualMachineImage compliance check, the VMOP controller validates that the VMImage of each Guest Cluster Control Plane VM matches the VMImage specified in the KubeadmControlPlane template. This validation occurs as part of the process for adding node IP addresses back to the endpoint object after a health probe failure or recovery event.
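Typing the JSON payload for the manual endpoints patch in the workaround by hand is error-prone. As a minimal sketch, the helper below builds the same payload from a list of node IPs and prints the resulting kubectl command for review instead of running it. The function names build_endpoints_patch and print_patch_command are hypothetical; the payload structure, port, and service name suffix are taken from the workaround commands.

```shell
#!/usr/bin/env bash
# Sketch: build the endpoints patch payload from a list of node IPs.
# build_endpoints_patch and print_patch_command are hypothetical helpers.
build_endpoints_patch() {
  local addresses="" ip
  for ip in "$@"; do
    # Append one address object, comma-separated after the first.
    addresses="${addresses:+$addresses,}{\"ip\":\"$ip\"}"
  done
  printf '{"subsets":[{"addresses":[%s],"ports":[{"name":"apiserver","port":6443,"protocol":"TCP"}]}]}' "$addresses"
}

# Print the kubectl command for review; nothing is executed.
# Arguments: namespace, guest cluster name, then one or more node IPs.
print_patch_command() {
  local ns="$1" name="$2"
  shift 2
  printf "kubectl -n %s patch endpoints %s-control-plane-service -p='%s'\n" \
    "$ns" "$name" "$(build_endpoints_patch "$@")"
}

# Example, using the values from the workaround:
#   print_patch_command test-01 tkc-01 10.10.5.5 10.10.5.6
```

Printing the command rather than executing it leaves a review step before the endpoint object is changed.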
Important Notes:
• The workaround creates a temporary state where the VMOP controller cannot automatically manage Control Plane endpoints. This is intentional to break the circular dependency but must be reversed once new nodes successfully join the cluster.
• Control Plane nodes may show as healthy when accessed directly via their local IP addresses (responding to kubectl commands and having functional etcd/api-server), but still fail the VMImage compliance check that prevents endpoint registration.
• This issue specifically manifests when ALL Control Plane nodes simultaneously enter NotReady status while duplicate VirtualMachineImages are being removed, creating the conditions for the VMImage mismatch.