Guest cluster control plane endpoints missing from supervisor cluster after outage

Article ID: 323450

Products

VMware vSphere ESXi
VMware vSphere Kubernetes Service

Issue/Introduction

When Guest Cluster Control Plane nodes experience an outage (for example, a degraded api-server or etcd), the Control Plane node IP addresses might not be restored as endpoints in the Supervisor Cluster control-plane service, even after the nodes recover and their api-servers respond as healthy.

Symptoms:

  • Control Plane nodes remain accessible via SSH and respond to kubectl commands using their local IP addresses
  • Supervisor Cluster fails to recognize recovered nodes as valid endpoints (see the check after this list)
  • New Control Plane nodes cannot join the cluster during rollout operations
  • Guest Cluster remains in a degraded state preventing scale or update operations
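
To confirm this from the Supervisor Cluster, inspect the Endpoints object directly (the service name follows the <GC_NAME>-control-plane-service convention used in the Resolution section):

  kubectl -n <GC_NAMESPACE> get endpoints <GC_NAME>-control-plane-service -o yaml

An Endpoints object with no addresses listed under subsets indicates this condition.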

This condition occurs when:

  • Duplicate VirtualMachineImages exist on the Supervisor Cluster and are removed
  • All Guest Cluster Control Plane nodes simultaneously become inaccessible and enter NotReady status
  • VMImage mismatch exists between Control Plane VMs and the KubeadmControlPlane template

Impact:

The VMOP controller cannot add node IPs back to the endpoints object, creating a circular dependency: existing nodes cannot be marked healthy for endpoint registration, and new nodes cannot join without existing healthy endpoints.

Environment

VMware vSphere 7.0 or newer with Tanzu
Guest clusters with multiple Control Plane nodes

Cause

Guest Cluster Control Plane health probes defined by CAPI (Cluster API) use port 6443, which corresponds to the kube-apiserver on each VM. The VMOP controller on the Supervisor Cluster queries these health probes against the Control Plane VirtualMachine objects to determine endpoint health status.
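
The same probe can be exercised manually; /healthz is a standard kube-apiserver endpoint, served anonymously under default RBAC (the node IP below is a placeholder):

  curl -k https://<GC_NODE_IP>:6443/healthz

A healthy api-server returns "ok".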

When performing health checks, the VMOP controller validates VirtualMachineImage compliance before adding a node IP address back to the endpoints object. If the VMImage of a Guest Cluster Control Plane VM does not match the VMImage expected in the KubeadmControlPlane template, VMOP will not add that node's IP address to the control-plane endpoints. This occurs even when etcd and the api-server on the node are healthy and respond to kubectl commands directed at the local IP address.
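
A minimal way to compare the two images from the Supervisor Cluster, assuming the VirtualMachine objects expose the image under spec.imageName and the KubeadmControlPlane object is named <GC_NAME>-control-plane (both can vary by release):

  kubectl -n <GC_NAMESPACE> get virtualmachines -o custom-columns=NAME:.metadata.name,IMAGE:.spec.imageName
  kubectl -n <GC_NAMESPACE> get kubeadmcontrolplane <GC_NAME>-control-plane -o yaml | grep -i image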

A VMImage mismatch can only be corrected by rolling out new nodes, which creates a deadlock: existing healthy Control Plane nodes cannot be marked healthy for endpoint update because of the image mismatch, while new Control Plane nodes with the correct VMImage cannot complete kubeadm join operations without existing node IPs in the endpoints object.

Resolution

Workaround:

WARNING: This workaround temporarily blocks VMOP from automatically adding new Control Plane node IPs to the Endpoints on Supervisor Cluster. You must restore the patch privilege after new nodes join the cluster.

  1. Verify Guest Cluster Control Plane nodes respond to kubectl commands by connecting via SSH to each Control Plane node:

    1. Gather the local node IP address:

      ifconfig
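
      If ifconfig is not available on the node image, ip is a common alternative:

      ip -4 addr show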

    2. Create a backup of the admin.conf file:

      cp /etc/kubernetes/admin.conf /etc/kubernetes/admin.conf.bak

    3. Edit admin.conf.bak and replace the Server VIP with the local node IP:

      vi /etc/kubernetes/admin.conf.bak
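
      The only change needed is the server: line. A non-interactive equivalent (the VIP and local IP values are placeholders):

      sed -i 's|server: https://<VIP>:6443|server: https://<LOCAL_NODE_IP>:6443|' /etc/kubernetes/admin.conf.bak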

    4. Export the modified kubeconfig:

      export KUBECONFIG=/etc/kubernetes/admin.conf.bak

    5. Test kubectl functionality against the local node.
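
      For example:

      kubectl get nodes
      kubectl -n kube-system get pods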

  2. Connect to the Supervisor Cluster via SSH.

  3. Back up the original VMOP manager cluster role:

    kubectl get clusterrole vmware-system-vmop-manager-role -o yaml > cluster-role-vmware-system-vmop-manager-role.yaml

  4. Remove the patch privilege from the VMOP manager role:

    kubectl edit clusterrole vmware-system-vmop-manager-role

    Remove "- patch" from the endpoints resource verbs section, leaving:

    - apiGroups:
      - ""
      resources:
      - endpoints
      verbs:
      - create
      - delete
      - get
      - list
      - update
      - watch
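
    To confirm the verb was removed (the grep context size below is arbitrary):

    kubectl get clusterrole vmware-system-vmop-manager-role -o yaml | grep -B2 -A10 endpoints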

  5. Manually add the Guest Cluster Control Plane node IPs to the endpoints:

    kubectl -n <GC_NAMESPACE> patch endpoints <GC_NAME>-control-plane-service -p='{"subsets":[{"addresses":[{"ip":"<GC_NODE_IP>"},{"ip":"<GC_NODE_IP>"}],"ports":[{"name":"apiserver","port":6443,"protocol":"TCP"}]}]}'

    Example for Guest Cluster "tkc-01" in namespace "test-01" with node IPs 10.10.5.5 and 10.10.5.6:

    kubectl -n test-01 patch endpoints tkc-01-control-plane-service -p='{"subsets":[{"addresses":[{"ip":"10.10.5.5"},{"ip":"10.10.5.6"}],"ports":[{"name":"apiserver","port":6443,"protocol":"TCP"}]}]}'

  6. Verify the endpoints were updated successfully:

    kubectl -n <GC_NAMESPACE> get endpoints <GC_NAME>-control-plane-service -o yaml
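
    The output should now list the manually added addresses, similar to this abbreviated example (values illustrative):

    subsets:
    - addresses:
      - ip: 10.10.5.5
      - ip: 10.10.5.6
      ports:
      - name: apiserver
        port: 6443
        protocol: TCP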

  7. Monitor new Control Plane node rollout if applicable.
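
    One way to watch the rollout from the Supervisor Cluster is through the CAPI Machine objects in the Guest Cluster's namespace (kubectl get virtualmachines shows the corresponding VMOP objects):

    kubectl -n <GC_NAMESPACE> get machines -w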

  8. CRITICAL: Once a new Control Plane node joins the cluster and reaches Ready status:
    1. Immediately restore the patch privilege to the VMOP manager role:

      kubectl edit clusterrole vmware-system-vmop-manager-role

    2. Add "- patch" back to the endpoints resource verbs.
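
    3. Optionally, compare the restored role against the backup taken in step 3 (metadata fields such as resourceVersion will differ and can be ignored):

      diff cluster-role-vmware-system-vmop-manager-role.yaml <(kubectl get clusterrole vmware-system-vmop-manager-role -o yaml)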

Additional Information

Important Notes:

• The workaround creates a temporary state where the VMOP controller cannot automatically manage Control Plane endpoints. This is intentional to break the circular dependency but must be reversed once new nodes successfully join the cluster.

• Control Plane nodes may show as healthy when accessed directly via their local IP addresses (responding to kubectl commands with functional etcd and api-server), yet still fail the VMImage compliance check, which prevents endpoint registration.

• This issue specifically manifests when ALL Control Plane nodes simultaneously enter NotReady status while duplicate VirtualMachineImages are being removed, creating the conditions for the VMImage mismatch.