vSphere pods stuck in the PodVMAnnotationsMissing or PodVMCreationFailed state in vSphere 8.0 U3b and later environments

Article ID: 389895

Products

VMware vSphere Kubernetes Service

Issue/Introduction

  • vSphere pods being deployed are stuck in the PodVMAnnotationsMissing or PodVMCreationFailed state.
  • The kube-scheduler log shows an admission webhook error when annotating the pod with the VM UUID:

    Failed to add podVM's annotations to the pod <pod-name> in namespace <namespace name>.
    Error: admission webhook "admission.vmware.com" denied the request: Cannot change VMware system annotation 'vmware-system-vm-uuid'. Will retry

  • The wcp-schedext log on a particular Supervisor node shows that it could not log in to vCenter:

    Could not login to vCenter. Error: ServerFaultCode: The object 'vim.VirtualMachine:<vm-xxxx>' has already been deleted or has not been completely created

  • The file /etc/vmware/wcp/wcp-schedext-admission-controller-user-whitelist on that Supervisor is empty.
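The last symptom in this list can be checked with a small shell function. This is a hypothetical helper, not a VMware tool: the name `check_whitelist` is made up for illustration, and the expected prefixes are taken from the whitelist content in the Workaround section. It treats comment and blank lines as non-entries, reports EMPTY when nothing else is left (or the file cannot be read), and otherwise reports any expected prefix that no entry starts with.

```shell
# Hypothetical helper: report whether the schedext whitelist is empty or
# lacks any of the entries written by the workaround in this article.
check_whitelist() {
    file="${1:-/etc/vmware/wcp/wcp-schedext-admission-controller-user-whitelist}"

    # Keep only real entries: drop comment lines and blank lines.
    entries=$(grep -v -e '^[[:space:]]*#' -e '^[[:space:]]*$' "$file" 2>/dev/null)

    if [ -z "$entries" ]; then
        echo "EMPTY: $file is missing or has no whitelist entries"
        return 1
    fi

    missing=0
    # The sso: entry differs per vCenter, so only its "sso:wcp-" prefix is checked.
    for prefix in kubernetes-admin kubeadm 'system:' 'vmware-system-' 'sso:wcp-'; do
        if ! printf '%s\n' "$entries" | grep -q -e "^${prefix}"; then
            echo "MISSING: $prefix"
            missing=1
        fi
    done

    [ "$missing" -eq 0 ] && echo "OK: $file"
    return "$missing"
}
```

Run it on each Supervisor Control Plane node; a non-zero return status marks a node whose whitelist needs to be rewritten.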

 

Environment

VMware vSphere with Tanzu 8.0 U3b and above

Cause

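When a fault such as a power-off or an I/O write failure occurs, the file /etc/vmware/wcp/wcp-schedext-admission-controller-user-whitelist can be truncated and left empty. Without its entries, the schedext admission controller no longer whitelists the system users, so requests that modify VMware system annotations, such as setting 'vmware-system-vm-uuid' on the pod, are denied.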

Resolution

Workaround:

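  • Write the following content to /etc/vmware/wcp/wcp-schedext-admission-controller-user-whitelist on every impacted Supervisor Control Plane node. Before running the command, replace the <machine_id> and <sso_domain> placeholders as described after it: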

    cat <<EOL > /etc/vmware/wcp/wcp-schedext-admission-controller-user-whitelist
    # List of user-prefixes whitelisted by schedext admission controller for
    # creating or updating resources modifying secure annotations or tolerating
    # master/control plane taint.

    kubernetes-admin
    kubeadm
    system:
    sso:wcp-<machine_id>@<sso_domain>
    vmware-system-
    EOL

    Replace <machine_id> with the vCenter machine ID, which is a UUID.
    To find it, run the following command on the Supervisor Control Plane VM:

    grep MACHINE_ID /var/lib/node.cfg

    Replace <sso_domain> with the SSO domain used by vCenter (for example, vsphere.local).
    It is recorded in the same file:

    grep SSO_DOMAIN /var/lib/node.cfg

  • After the file has been updated, restart the wcp-schedext container running on that Supervisor Control Plane node by stopping it; it is then started again automatically:

    root [ ~ ]# crictl ps -a | grep schedext
    <container-id>       9 minutes ago        Running             wcp-schedext                 10                  kube-scheduler-<UUID>
    root [ ~ ]# crictl stop <container-id>
    <container-id>

  • Wait a few minutes, look up the ID of the new wcp-schedext container (it differs from the one that was stopped), and confirm that the following message appears in its log:

    root [ ~ ]# crictl ps | grep schedext
    root [ ~ ]# crictl logs <new-container-id> 2>&1 | grep vCenter
    Successfully connected to vCenter https://<VC-FQDN>:443/sdk


  • Redeploy the vSphere pods; the deployment should now succeed.
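The workaround steps above can be scripted roughly as follows. This is a sketch under stated assumptions rather than a VMware-supplied procedure: `node_cfg_value`, `write_whitelist`, and `schedext_container_id` are hypothetical names; `node_cfg_value` assumes each key in /var/lib/node.cfg is followed on the same line by `=` or `:` and its value; and `schedext_container_id` assumes the container ID is the first column of `crictl ps` output. The whitelist content is copied unchanged from the workaround.

```shell
# Hypothetical helpers that script the workaround steps in this article.

# Print the value of KEY from /var/lib/node.cfg (or from the file in $2).
# ASSUMPTION: key and value share one line, separated by "=" or ":",
# optionally quoted; compare with the output of `grep KEY /var/lib/node.cfg`.
node_cfg_value() {
    key="$1"
    cfg="${2:-/var/lib/node.cfg}"
    sed -n "s/^[[:space:]]*${key}[[:space:]]*[=:][[:space:]]*//p" "$cfg" \
        | head -n 1 | tr -d "\"'"
}

# Overwrite the whitelist with the content from the workaround, filling in
# the <machine_id> and <sso_domain> placeholders ($3 overrides the path).
write_whitelist() {
    machine_id="$1"
    sso_domain="$2"
    file="${3:-/etc/vmware/wcp/wcp-schedext-admission-controller-user-whitelist}"

    # Refuse to write an sso: entry with an empty machine ID or domain.
    if [ -z "$machine_id" ] || [ -z "$sso_domain" ]; then
        echo "write_whitelist: machine ID and SSO domain are required" >&2
        return 1
    fi

    cat > "$file" <<EOL
# List of user-prefixes whitelisted by schedext admission controller for
# creating or updating resources modifying secure annotations or tolerating
# master/control plane taint.

kubernetes-admin
kubeadm
system:
sso:wcp-${machine_id}@${sso_domain}
vmware-system-
EOL
}

# Read `crictl ps` output on stdin and print the ID of the first running
# wcp-schedext container.
# ASSUMPTION: the container ID is the first column, as in the sample output.
schedext_container_id() {
    awk '/wcp-schedext/ && /Running/ { print $1; exit }'
}

# Example run on an impacted Supervisor Control Plane node:
#   write_whitelist "$(node_cfg_value MACHINE_ID)" "$(node_cfg_value SSO_DOMAIN)"
#   crictl stop "$(crictl ps | schedext_container_id)"
#   # wait a few minutes; the restarted container has a new ID
#   crictl logs "$(crictl ps | schedext_container_id)" 2>&1 | grep vCenter
```

Run the example lines one at a time, and compare the values printed by `node_cfg_value` with the grep output from the steps above before writing the file.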

Additional Information