Persistent volumes cannot attach to a new node if previous node is deleted

Article ID: 327470


Products

VMware

Issue/Introduction

Symptoms:
  • You are upgrading your Tanzu Kubernetes Grid clusters
  • Applications on the cluster are using persistent volumes
  • The upgrade is hung or has failed after a long wait
  • Pods that use persistent volumes are unable to attach them


Environment

VMware Tanzu Kubernetes Grid 1.x

Cause

Due to a race condition between the detach and delete volume operations, CNS volumes are never detached from their nodes. This issue affects Tanzu Kubernetes Grid versions 1.x through 1.6.x. One scenario in which it can occur is during an upgrade of a cluster running stateful workloads that use persistent volumes:

  • The TKG upgrade hangs due to a misconfigured PodDisruptionBudget (PDB); a quick check is shown after this list
  • After the PDB errors are resolved, the worker nodes are reconciled and the old workers are deleted
  • The CSI controller repeatedly tries to detach the volume from a node that no longer exists
  • All stateful workloads are stuck in container creation or an init state, depending on the stage at which the volume is mounted
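
A quick way to check whether a misconfigured PDB is blocking node drain during the upgrade (standard kubectl; nothing specific to this environment is assumed):

kubectl get poddisruptionbudgets -A

PDBs whose ALLOWED DISRUPTIONS column shows 0 will prevent their node from draining.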

Repeated error messages "Failed to find VirtualMachine for node" are logged in the vSphere CSI controller logs:


kubectl logs -n kube-system vsphere-csi-controller-76d888d87c-wsml9 vsphere-csi-controller

{"level":"error","time":"2020-09-27T18:15:59.174108121Z","caller":"vanilla/controller.go:569","msg":"failed to find VirtualMachine for node:\"tcp-md-0-5bb7dc9f5c-mbjwl\". 
Error: node wasn't found","TraceId":"a5ab0f92-a59e-4b67-9185-a9bd020cc1fb","stacktrace":"sigs.k8s.io/vsphere-csi-driver/pkg/csi/service/vanilla.(*controller).ControllerUnpublishVolume/build/pkg/csi/service/vanilla/controller.go:569
github.com/container-storage-interface/spec/lib/go/csi._Controller_ControllerUnpublishVolume_Handler.func1
/go/pkg/mod/github.com/container-storage-interface/[email protected]/lib/go/csi/csi.pb.go:5200
github.com/rexray/gocsi/middleware/serialvolume.(*interceptor).controllerUnpublishVolume
/go/pkg/mod/github.com/rexray/[email protected]/middleware/serialvolume/serial_volume_locker.go:141


The application pods are stuck in the container creation or init container phase and repeatedly report "Multi-Attach error for volume" and "Unable to attach or mount volumes: unmounted volumes" events:


kubectl describe pod <your-failing-pod>

Warning  FailedAttachVolume  32m  attachdetach-controller  Multi-Attach error for volume "pvc-c3c29367-658b-4548-ac7c-134fa73df4c2" Volume is already exclusively attached to one node and can't be attached to another
Warning  FailedMount         12m (x2 over 21m)    kubelet, tcp-md-0-7f67dbbfb8-lthnt  Unable to attach or mount volumes: unmounted volumes=[wordpress-persistent-storage], unattached volumes=[istio-certs default-token-wdbzv wordpress-persistent-storage istio-envoy]: timed out waiting for the condition

Resolution

This is a known issue; see the VMware vSphere Container Storage Plug-in release notes:
https://docs.vmware.com/en/VMware-vSphere-Container-Storage-Plug-in/2.5/rn/vmware-vsphere-container-storage-plugin-25-release-notes/index.html#known-issues


Workaround:
  • Delete VolumeAttachment resources only after the corresponding node has been drained and deleted; otherwise the workload might be impacted
  • The procedure below must be repeated for every pod backed by a persistent volume
  • Perform this process in parallel with the upgrade triggered through the Tanzu CLI

Check the status of the pods

kubectl get pods -A | grep -v Running

NAME                                   READY   STATUS     RESTARTS   AGE
pod/web-0                              0/2     Init:0/1   0          9h
pod/wordpress-6c6794cb7d-cdnsc         0/2     Init:0/1   0          31m
pod/wordpress-mysql-756d555798-gtvvp   0/2     Init:0/1   0          9h

Query the existing VolumeAttachment objects and compare their node names with the nodes currently in the cluster:

kubectl get volumeattachments.storage.k8s.io

Comparing the two outputs shows that some VolumeAttachment objects reference nodes that are no longer part of the cluster. These attachments must be deleted to work around the Multi-Attach errors. Before deleting a volume attachment, make sure it references a node that does not appear in the output of kubectl get nodes.
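
For reference, the current node names can be listed alongside the node recorded on each attachment. The custom-columns output below is an optional convenience, not required by the workaround:

kubectl get nodes
kubectl get volumeattachments.storage.k8s.io -o custom-columns=NAME:.metadata.name,NODE:.spec.nodeName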

To delete the attachments, remove the finalizers from all the VolumeAttachment objects that belonged to the old nodes.

kubectl patch volumeattachments.storage.k8s.io csi-<uuid> -p '{"metadata":{"finalizers":[]}}' --type=merge
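
If many attachments reference deleted nodes, a small shell loop such as the following can patch all of them in one pass. This is a sketch only: it assumes standard kubectl access and that every node name missing from kubectl get nodes belongs to a deleted worker, so review the list before patching anything.

# Collect the current node names, then patch every VolumeAttachment whose node is absent.
NODES=$(kubectl get nodes -o jsonpath='{.items[*].metadata.name}')
for va in $(kubectl get volumeattachments.storage.k8s.io -o jsonpath='{.items[*].metadata.name}'); do
  node=$(kubectl get volumeattachments.storage.k8s.io "$va" -o jsonpath='{.spec.nodeName}')
  if ! echo "$NODES" | grep -qw "$node"; then
    echo "Removing finalizers from $va (node $node no longer exists)"
    kubectl patch volumeattachments.storage.k8s.io "$va" -p '{"metadata":{"finalizers":[]}}' --type=merge
  fi
done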

The new volume attachments should soon be created and new nodes will be able to mount the persistent volumes.
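
To confirm the workaround, re-check the attachments and the pods; the stale VolumeAttachment objects should be gone and the pods should move out of the init or container creation state:

kubectl get volumeattachments.storage.k8s.io
kubectl get pods -A | grep -v Running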