Supervisor Cluster Upgrade Stuck After vCenter 8.0 U3h Due to Missing kubectl-plugin-vsphere Image Manifest


Article ID: 427141


Updated On:

Products

VMware vSphere Kubernetes Service

Issue/Introduction

  • After upgrading vCenter Server to 8.0 Update 3h, the Supervisor cluster upgrade stalled with one or more new Supervisor control-plane VMs stuck during configuration. The vCenter Workload Management UI reported:

System error occurred on Master node with identifier NODE-ID.
Details: Base configuration of node NODE-ID failed as a Kubernetes node. See /var/log/vmware-imc/configure-wcp.stderr on control plane node...

  • This prevents Supervisor upgrade completion and blocks vSphere with Tanzu workload operations.
  • Stuck Supervisor Node Logs: /var/log/vmware-imc/configure-wcp.stdout

YYYY-MM-DDTHH:MM:SS Syncing container images of master VM running pods from https://#.#.#.#:6443
{"error": "Exception", "message": ". Failed to sync images: vmware/kubectl-plugin-vsphere:0.1.11.24507035"}
masterproxy-tkgs-plugin pods (svc-tkg-domain-c45 namespace):

Back-off pulling image "localhost:5000/vmware/kubectl-plugin-vsphere:0.1.11.24507035"
ImagePullBackOff

  • /var/log/vmware/upgrade-ctl-cli.log (skopeo sync failure):

Failed to run command: ['/usr/local/bin/skopeo', ... 'docker://#.#.#.#:5000/vmware/kubectl-plugin-vsphere:0.1.11.24507035']
time="YYYY-MM-DDTHH:MM:SSZ" level=fatal msg="initializing source docker://#.#.#.#:5000/...: manifest unknown"

Environment

  • VMware vCenter Server
  • VMware vSphere Kubernetes Service

Cause

  • vCenter 8.0 U3h introduced a version mismatch in which the Supervisor upgrade controller expects a specific older image tag:

vmware/kubectl-plugin-vsphere:0.1.11.24507035

  • Healthy Supervisor nodes only had the newer version available:

docker.io/vmware/kubectl-plugin-vsphere:0.1.12.24795027

  • A simple ctr images tag operation does not by itself create the required image manifest in the registry; the retagged image must also be pushed.
  • The masterproxy-tkgs-plugin deployment requires the exact tag during node configuration, blocking configure-wcp completion.
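The mismatch can be confirmed with skopeo from a healthy control-plane node before attempting any repair. This is a minimal sketch, assuming the embedded registry listens on localhost:5000 as in the pull errors above (substitute your environment's endpoint) and that skopeo is on the node's PATH:

```shell
# Hypothetical registry endpoint -- substitute the address from your logs.
REGISTRY="localhost:5000"
IMAGE="vmware/kubectl-plugin-vsphere"

if command -v skopeo >/dev/null 2>&1; then
  # The tag the upgrade controller expects fails with "manifest unknown"...
  skopeo inspect --tls-verify=false \
    "docker://${REGISTRY}/${IMAGE}:0.1.11.24507035" || true
  # ...while the newer tag present on healthy nodes resolves normally.
  skopeo inspect --tls-verify=false \
    "docker://${REGISTRY}/${IMAGE}:0.1.12.24795027" || true
fi
```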

Resolution

Step 1: Create the Missing Image Manifest

  • When a new Supervisor node is provisioned during an upgrade, it attempts to pull the necessary plugin images from the internal registry. If the registry has no manifest for the expected tag, the pull fails. You must manually generate this manifest on a healthy control-plane node and push it to the local registry.
  • Tag the existing image for the required version:

ctr -n k8s.io images tag \
  docker.io/vmware/kubectl-plugin-vsphere:0.1.12.24795027 \
  localhost:5000/vmware/kubectl-plugin-vsphere:0.1.11.24507035

  • Push the image to generate the manifest:

ctr -n k8s.io images push --plain-http \
  localhost:5000/vmware/kubectl-plugin-vsphere:0.1.11.24507035

  • Verify the local image list:

ctr -n k8s.io images ls | grep kubectl-plugin-vsphere:0.1.11.24507035
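Note that ctr images ls only inspects containerd's local image store. To confirm the registry itself now serves the manifest, its v2 API can be queried directly. A minimal sketch, assuming the registry endpoint reported in the pull errors (localhost:5000; substitute yours) and that curl is available on the node:

```shell
# Build the v2 tag-list URL for the plugin image; after a successful push
# the response should include 0.1.11.24507035.
REGISTRY="localhost:5000"
TAGS_URL="http://${REGISTRY}/v2/vmware/kubectl-plugin-vsphere/tags/list"
echo "$TAGS_URL"

# Only query when the registry is reachable (it will not be off the node).
if command -v curl >/dev/null 2>&1; then
  curl -s --max-time 5 "$TAGS_URL" || true
fi
```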

Step 2: Remove the Stalled Supervisor VM

  • Once the manifest is available, the "stuck" VM must be redeployed to trigger a fresh configuration attempt.
    • Identify the Agency: In the vSphere UI, navigate to Workload Management > Supervisor > Nodes to find the EAM (ESX Agent Manager) Agency ID associated with the failed node.
    • Delete via EAM: Right-click the stuck node in the inventory and select Delete from Inventory. This method is preferred as it allows the ESX Agent Manager to cleanly reprovision the node while preserving the overall Supervisor state.
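After the deletion, EAM reprovisions the node automatically. As a rough check from the Supervisor context (a sketch, assuming kubectl is configured against the Supervisor cluster), the replacement control-plane node should appear with a young age:

```shell
# Label-key existence selector for control-plane nodes; on older releases
# the key may be node-role.kubernetes.io/master instead.
SELECTOR="node-role.kubernetes.io/control-plane"
echo "listing nodes with label ${SELECTOR}"

if command -v kubectl >/dev/null 2>&1; then
  # The deleted node disappears; a freshly provisioned replacement joins.
  kubectl get nodes -l "$SELECTOR" -o wide || true
fi
```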

Step 3: Monitor Node Recovery

  • After the node is replaced, monitor the following to ensure the upgrade resumes.
  • Run the following command to confirm the plugin pods transition from ImagePullBackOff to Running:

kubectl get pods -n svc-tkg-domain-c45 -l app=masterproxy-tkgs-plugin

  • Monitor the configuration progress on the new node to ensure there are no further synchronization errors:

tail -f /var/log/vmware-imc/configure-wcp.stdout
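The two checks above can be combined into a small polling loop that waits for the plugin pods to settle. A minimal sketch, assuming kubectl access from the Supervisor control plane; the namespace and label match this article's examples, but the svc-tkg-domain-* namespace varies per environment:

```shell
# Namespace and label are environment-specific; these match the article.
NAMESPACE="svc-tkg-domain-c45"
LABEL="app=masterproxy-tkgs-plugin"
echo "waiting for pods with ${LABEL} in ${NAMESPACE}"

if command -v kubectl >/dev/null 2>&1; then
  # Loop while any matching pod reports a status other than Running.
  # Caveat: an empty pod list also exits the loop, so confirm the pods
  # exist first with a plain kubectl get pods.
  while kubectl get pods -n "$NAMESPACE" -l "$LABEL" --no-headers 2>/dev/null \
        | grep -vq ' Running '; do
    sleep 10
  done
  echo "all plugin pods Running"
fi
```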