Symptoms:
VMware Tanzu Kubernetes Grid versions 1.x to 1.6.x
Due to a race condition between the detach and delete volume operations, CNS volumes are never detached from the nodes. One scenario in which this can occur is during an upgrade of a cluster that runs stateful workloads using persistent volumes:
Repeated error messages "Failed to find VirtualMachine for node" are logged in the vSphere CSI controller logs:
kubectl logs -n kube-system vsphere-csi-controller-####8d87c-wsml9 vsphere-csi-controller
{"level":"error","time":"2020-09-27T18:15:59.174108121Z","caller":"vanilla/controller.go:569","msg":"failed to find VirtualMachine for node:\"cluster-md-0-5bb7dc9f5c-mbjwl\".
Error: node wasn't found","TraceId":"a5ab0f92-a59e-4b67-9185-a9bd020cc1fb","stacktrace":"sigs.k8s.io/vsphere-csi-driver/pkg/csi/service/vanilla.(*controller).ControllerUnpublishVolume/build/pkg/csi/service/vanilla/controller.go:569
github.com/container-storage-interface/spec/lib/go/csi._Controller_ControllerUnpublishVolume_Handler.func1
/go/pkg/mod/github.com/container-storage-interface/[email protected]/lib/go/csi/csi.pb.go:5200
github.com/rexray/gocsi/middleware/serialvolume.(*interceptor).controllerUnpublishVolume
/go/pkg/mod/github.com/rexray/[email protected]/middleware/serialvolume/serial_volume_locker.go:141
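If the controller runs multiple replicas, the same error can be searched for across all of them by filtering the logs on the message string. A minimal sketch, assuming the controller pods carry the app=vsphere-csi-controller label used by the standard deployment:

kubectl logs -n kube-system -l app=vsphere-csi-controller -c vsphere-csi-controller --tail=1000 | grep "failed to find VirtualMachine for node"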
The application pods are stuck during the container creation or init container phase and constantly error out with "Multi-Attach error for volume" and "Unable to attach or mount volumes: unmounted volumes":
kubectl describe pod <your-failing-pod>
Warning FailedAttachVolume 32m attachdetach-controller Multi-Attach error for volume "pvc-########-658b-4548-ac7c-134fa73df4c2" Volume is already exclusively attached to one node and can't be attached to another
Warning FailedMount 12m (x2 over 21m) kubelet, cluster-md-0-7f67dbbfb8-lthnt Unable to attach or mount volumes: unmounted volumes=[wordpress-persistent-storage], unattached volumes=[istio-certs default-token-wdbzv wordpress-persistent-storage istio-envoy]: timed out waiting for the condition
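The same failures are also recorded as events, so they can be listed cluster-wide without describing each pod individually. A minimal sketch, assuming the event reasons match those shown above:

kubectl get events -A --field-selector reason=FailedAttachVolume
kubectl get events -A --field-selector reason=FailedMount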
This is a known issue, as documented for the VMware vSphere Container Storage Plug-in.
Workaround:
Check the status of the pods and identify those that are not Running:
kubectl get pods -A | grep -v Running

NAME                                   READY   STATUS     RESTARTS   AGE
pod/web-0                              0/2     Init:0/1   0          9h
pod/wordpress-6c6794cb7d-cdnsc         0/2     Init:0/1   0          31m
pod/wordpress-mysql-756d555798-gtvvp   0/2     Init:0/1   0          9h
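As an alternative to filtering with grep, the API server's field selector can list the pods that are not Running directly. A minimal sketch:

kubectl get pods -A --field-selector=status.phase!=Running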
Query the existing volumeattachments and compare the node names they reference with the nodes currently in the cluster:
kubectl get volumeattachments.storage.k8s.io
Comparing the two outputs shows that certain volumeattachment objects refer to nodes that are no longer part of the cluster. Delete these attachments to work around the multi-attach errors. Before deleting a volume attachment, make sure it belongs only to a node that does not appear in the output of kubectl get nodes.
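To make this comparison easier, the attachments can be listed alongside the node each one references and checked against the current node list. A minimal sketch using standard kubectl output options:

kubectl get volumeattachments.storage.k8s.io -o custom-columns=NAME:.metadata.name,NODE:.spec.nodeName,PV:.spec.source.persistentVolumeName
kubectl get nodes -o name

Any attachment whose NODE value does not appear in the node list is a candidate for deletion.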
To delete the attachments, remove the finalizer from each volumeattachment that belongs to an old node:
kubectl patch volumeattachments.storage.k8s.io csi-<uuid> -p '{"metadata":{"finalizers":[]}}' --type=merge
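If several attachments reference the same stale node, the patch can be applied in a loop. A minimal sketch, where <old-node-name> is a placeholder for a node that no longer appears in kubectl get nodes:

OLD_NODE=<old-node-name>
for va in $(kubectl get volumeattachments.storage.k8s.io -o custom-columns=NAME:.metadata.name,NODE:.spec.nodeName --no-headers | awk -v node="$OLD_NODE" '$2 == node {print $1}'); do
  kubectl patch volumeattachments.storage.k8s.io "$va" -p '{"metadata":{"finalizers":[]}}' --type=merge
done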
New volume attachments should be created shortly afterwards, and the new nodes will be able to mount the persistent volumes.
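To confirm the workaround has taken effect, re-check the attachments and pods after a few minutes:

kubectl get volumeattachments.storage.k8s.io
kubectl get pods -A | grep -v Running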