In a vSphere Kubernetes Cluster, one or more nodes are stuck in Deleting state.
In this scenario, the node is stuck in Deleting state because volumes are failing to detach from it and attach to another node in the cluster; the node was deployed from an image that no longer exists in the environment.
Pods are stuck in Init state, unable to attach their volumes because those volumes are still attached to the node that uses the missing image.
While connected to the Supervisor cluster context, the following symptoms are present:
kubectl get vm -o wide -n <affected cluster namespace>
NAMESPACE     NAME                               POWERSTATE   IMAGE
<namespace>   <cluster-control-plane-a>          poweredOn    <photon-ova-image-1>
<namespace>   <cluster-control-plane-b>          poweredOn    <photon-ova-image-1>
<namespace>   <cluster-control-plane-c>          poweredOn    <photon-ova-image-1>
<namespace>   <cluster-worker-node-nodepool-z>   poweredOn    <photon-ova-image-2>
<namespace>   <cluster-worker-node-nodepool-y>   poweredOn    <photon-ova-image-2>
<namespace>   <cluster-worker-node-nodepool-x>   poweredOn    <photon-ova-image-1>
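One way to pinpoint the missing image is to compare the set of images referenced by the VMs against the set of registered VirtualMachineImages. A minimal bash sketch follows; the two lists are inlined placeholders mirroring the sample output above, and in practice would be populated from the live kubectl commands noted in the comments:

```shell
#!/bin/bash
# Sketch: flag images referenced by VMs that are not registered as
# VirtualMachineImages. The lists below are placeholder data; in practice
# populate them with:
#   kubectl get vm -o wide -n <namespace> --no-headers | awk '{print $NF}'
#   kubectl get virtualmachineimage --no-headers | awk '{print $1}'
referenced="photon-ova-image-1
photon-ova-image-1
photon-ova-image-2"

registered="photon-ova-image-1"

# comm -23 prints lines unique to the first (referenced) list.
missing=$(comm -23 <(printf '%s\n' "$referenced" | sort -u) \
                   <(printf '%s\n' "$registered" | sort -u))
echo "missing image(s): $missing"
```

Any image name this prints is referenced by at least one VM but unknown to the Supervisor cluster, which matches the webhook errors shown below.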
Note that the missing image, <photon-ova-image-2> in this example, does not appear in the output of the following command:
kubectl get virtualmachineimage
kubectl describe cluster <affected cluster name> -n <affected cluster namespace>
message: 'Failed to get VirtualMachineImage <photon-ova-image-2>:
VirtualMachineImage.vmoperator.vmware.com "<photon-ova-image-2>" not found'
error validating image hardware version for PVC: VirtualMachineImage.vmoperator.vmware.com "<photon-ova-image-2>" not found
Waiting for machine 1 of 1 to be deleted
Waiting for node volumes to be detached
While connected to the affected vSphere Kubernetes cluster's context, the following symptoms are present:
kubectl get nodes
kubectl get pods -A -o wide | grep <deleting node name>
kubectl get volumeattachments -A | grep <deleting node name>
kubectl logs -n vmware-system-csi deploy/vsphere-csi-controller -c csi-attacher
error syncing volume "<pvc-pod-persistent-volume>": persistentvolume <pvc-pod-persistent-volume> is still attached to node <deleting node name>
Time out to update VirtualMachines "<deleting node name>" with Error: admission webhook "default.validating.virtualmachine.vmoperator.vmware.com" denied the request: spec.imageName: Invalid value: "<photon-ova-image-2>": error validating image hardware version for PVC: VirtualMachineImage.vmoperator.vmware.com "<photon-ova-image-2>" not found
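To see exactly which volumes are pinning the stuck node, the volumeattachments output can be filtered by node name. A small bash sketch, assuming the usual VolumeAttachment column order (NAME, ATTACHER, PV, NODE, ATTACHED, AGE); the rows and object names below are entirely hypothetical placeholders standing in for live kubectl output:

```shell
#!/bin/bash
# Sketch: list the PVs whose VolumeAttachment still binds them to the node
# that is stuck deleting. The sample rows stand in for the output of:
#   kubectl get volumeattachments | grep <deleting node name>
# Column order assumed: NAME ATTACHER PV NODE ATTACHED AGE.
node="cluster-worker-node-nodepool-z"   # placeholder for <deleting node name>

rows="csi-1111 csi.vsphere.vmware.com pvc-aaaa $node true 10d
csi-2222 csi.vsphere.vmware.com pvc-bbbb cluster-worker-node-nodepool-y true 10d"

# Print column 3 (the PV) for rows whose column 4 (the node) matches.
pinned=$(printf '%s\n' "$rows" | awk -v n="$node" '$4 == n {print $3}')
echo "volumes still attached to $node: $pinned"
```

Each PV printed here corresponds to a pod volume that cannot move to a healthy node until the image issue below is resolved.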
vSphere with Tanzu 7.0
This issue can occur regardless of whether or not the vSphere Kubernetes Cluster is managed by Tanzu Mission Control (TMC)
Duplicate image issues have been resolved in vSphere with Tanzu 8.0 and higher.
The node cannot be deleted because the volumes for the pods that were originally running on it cannot be detached and attached to another node in the cluster.
CSI is unable to attach or detach those volumes because the image for the node it is trying to detach volumes from or attach volumes to cannot be found, so the VirtualMachine update is rejected by the validating webhook.
This is caused by a manual change to the content library attached to the affected cluster's namespace, or by manual removal of the missing image noted above.
Removing a content library or image would not normally trigger this missing-image error, because the Content Library service maintains a cache of the images.
In this scenario, the content library or image was removed so long ago that its cache entry no longer exists.
This issue can also occur if the missing image was renamed in the content library rather than removed; renaming images in content libraries is not supported.
The vmop-controller-manager pod in the Supervisor cluster is responsible for reconciling virtual machine images in the environment and will do so periodically.
To resolve this issue, the missing image needs to be re-added into the environment; the vmop-controller-manager pod in the Supervisor cluster will then pick it up on its next periodic reconcile of virtual machine images.
IMPORTANT: Changes to content libraries and images used by existing VMs will trigger rolling redeployments of every node, across all clusters in the Supervisor cluster, that uses that content library or image.
Identify the content libraries attached to the Supervisor cluster's namespaces:
kubectl get contentsources -A
Confirm which image the VMs in the affected cluster reference:
kubectl get vm -o wide -n <affected cluster namespace>
Confirm that the referenced image is missing:
kubectl get virtualmachineimage
After re-adding the missing image to the content library, restart vmop-controller-manager to trigger an immediate reconcile of the available images:
kubectl rollout restart deploy -n vmware-system-vmop vmop-controller-manager
Verify that the missing image is now available:
kubectl get virtualmachineimage
Monitor the virtual machines and machines in the affected cluster's namespace as the stuck node finishes deleting:
watch kubectl get vm,ma -o wide -n <affected cluster namespace>
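After restarting vmop-controller-manager, the reappearance of the image can be polled rather than checked by hand. A bash sketch follows; check_image is stubbed here (it succeeds on the third poll) purely so the example is self-contained, and in practice it would wrap kubectl get virtualmachineimage as shown in the comment:

```shell
#!/bin/bash
# Sketch: poll until the VirtualMachineImage is registered again.
# check_image is a stub for illustration; in practice use:
#   check_image() { kubectl get virtualmachineimage "$1" >/dev/null 2>&1; }
polls=0
check_image() {
  polls=$((polls + 1))
  [ "$polls" -ge 3 ]   # stub: succeeds on the third poll
}

wait_for_image() {
  local image="$1" tries="$2" i=0
  while [ "$i" -lt "$tries" ]; do
    if check_image "$image"; then
      echo "image $image is registered again"
      return 0
    fi
    i=$((i + 1))
    sleep 1
  done
  echo "timed out waiting for image $image" >&2
  return 1
}

wait_for_image "photon-ova-image-2" 10 && result=ok
```

Once the image is registered again, the webhook stops rejecting VirtualMachine updates and the stuck node's volume detach/attach operations can complete.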