vsphere-csi-controller pods in the Supervisor cluster restart at irregular intervals during RWX volume snapshot attempts

Products

Tanzu Kubernetes Runtime

VMware vSphere Kubernetes Service

Issue/Introduction

The vsphere-csi-controller pods within the Supervisor cluster restart at irregular intervals.

The CSI logs show a panic caused by a segmentation violation and nil pointer dereference:

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x1f0bc54] goroutine 10490815 [running]:
sigs.k8s.io/vsphere-csi-driver/v3/pkg/common/cns-lib/volume.updateQueryResult({0x2d015a8?, 0xc0005c4d20?}, 0x1?, 0x0?)
/build/mts/release/bora-24784842/cayman_vsphere_csi_driver/vsphere_csi_driver/src/pkg/common/cns-lib/volume/util.go:163 +0x34
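
The panic is written by the container instance that crashed, so it is usually the previous container's log that has to be searched. The following is a minimal sketch of such a filter. The function name `find_csi_panic` is hypothetical, and the filter only matches the panic signature quoted above.

```shell
# Hypothetical helper: reads log text on stdin and prints each
# nil-pointer panic line together with the three lines that follow it.
find_csi_panic() {
  grep -A 3 'panic: runtime error: invalid memory address or nil pointer dereference'
}
```

As a usage example, the previous log of a restarted controller pod could be piped through it with `kubectl logs -n vmware-system-csi <pod-name> -c vsphere-csi-controller --previous | find_csi_panic`. The container name in that command is an assumption based on the deployment name and may differ in your environment.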

Additionally, running kubectl get volumesnapshot -A shows snapshots with READYTOUSE set to false. A warning event similar to the following, which indicates a volume type mismatch, may also be seen:
default 10m Warning SnapshotContentCheckandUpdateFailed volumesnapshotcontent/snapcontent######## Failed to check and update snapshot content: failed to take snapshot of the volume file:#######: "rpc error: code = FailedPrecondition desc = queried volume doesn't have the expected volume type. Expected VolumeType: block. Queried VolumeType: FILE"

Environment

vSphere Kubernetes Service

vSphere Supervisor

Cause

This issue occurs due to an attempt to snapshot ReadWriteMany (RWX) volumes. In vSphere, RWX volumes are provisioned as vSphere file volumes. Only block volumes support volume snapshot and restore operations; these operations cannot be used with vSphere file volumes.

When a snapshot is attempted on a file volume, it fails with a FailedPrecondition error because the queried volume type is FILE instead of the expected block. This unsupported operation leads to a nil pointer dereference in the vsphere-csi-driver, causing the controller pods to panic and restart.
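
Because the failure is triggered by the access mode of the source PVC, the mode can be checked before a snapshot is requested. The sketch below is one possible way to do that, not part of the documented procedure. The helper name `is_rwx_pvc` and the `KUBECTL` override variable are hypothetical.

```shell
# Command used to reach the cluster; override KUBECTL to stub it out.
KUBECTL="${KUBECTL:-kubectl}"

# Hypothetical helper: returns 0 when the PVC requests ReadWriteMany
# (a file volume, which cannot be snapshotted), 1 when it does not,
# and 2 when the PVC could not be queried.
is_rwx_pvc() {
  ns="$1"
  pvc="$2"
  modes=$("$KUBECTL" get pvc "$pvc" -n "$ns" -o jsonpath='{.spec.accessModes[*]}') || return 2
  case " $modes " in
    *" ReadWriteMany "*) return 0 ;;
    *) return 1 ;;
  esac
}
```

For example, `is_rwx_pvc my-namespace my-pvc && echo "do not snapshot"` would print the warning only for an RWX claim. Both names in that command are placeholders.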

Resolution

To resolve this issue, you must identify and delete the failed volume snapshots that were attempted against RWX Persistent Volume Claims (PVCs).

Optionally, scale the CSI controller down to 0 replicas before starting the snapshot cleanup to stop the repeated crashes. After the cleanup is complete, scale the deployment back to 3 replicas.

kubectl scale deployment -n vmware-system-csi vsphere-csi-controller --replicas=0

Run the following steps from a machine that has jq installed and kubectl access to the cluster:

Identify all PVCs with RWX access mode and save them to a text file:

kubectl get pvc --all-namespaces -o json | \
  jq -r '.items[] | select(.spec.accessModes[] == "ReadWriteMany") | "\(.metadata.namespace)/\(.metadata.name)"' > rwx-pvcs.txt

Identify all volume snapshots that failed to complete (readyToUse=false):

kubectl get volumesnapshot --all-namespaces -o json | \
  jq -r '.items[] | select(.status.readyToUse == false) | "\(.metadata.namespace)/\(.spec.source.persistentVolumeClaimName) \(.metadata.name)"' > failed-snapshots.txt

Cross-reference the failed snapshots with the RWX PVCs to determine which snapshots must be removed:
```
awk 'NR==FNR {rwx[$1]; next} $1 in rwx {split($1, p, "/"); print p[1] "/" $2}' rwx-pvcs.txt failed-snapshots.txt > snapshots-to-delete.txt
```
Review the list of snapshots targeted for deletion:
```
cat snapshots-to-delete.txt
```
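
The file formats involved in the cross-reference can be illustrated with made-up data. The sketch below uses hypothetical namespace, PVC, and snapshot names and writes to a temporary directory. It matches the namespace/PVC key exactly and emits namespace/snapshot-name, which is the form the deletion loop splits on.

```shell
# Made-up sample data in the formats written by the two queries above.
workdir=$(mktemp -d)
printf '%s\n' 'apps/shared-data' 'ml/datasets' > "$workdir/rwx-pvcs.txt"
printf '%s\n' \
  'apps/shared-data snap-shared-1' \
  'apps/shared-data-v2 snap-block-1' \
  'ml/datasets snap-datasets-1' > "$workdir/failed-snapshots.txt"

# First file: remember every RWX namespace/pvc key.
# Second file: keep lines whose first field is exactly one of those keys
# and print them as namespace/snapshot-name.
awk 'NR == FNR { rwx[$1]; next }
     $1 in rwx { split($1, parts, "/"); print parts[1] "/" $2 }' \
  "$workdir/rwx-pvcs.txt" "$workdir/failed-snapshots.txt" \
  > "$workdir/snapshots-to-delete.txt"

cat "$workdir/snapshots-to-delete.txt"
```

With this sample data the result contains `apps/snap-shared-1` and `ml/snap-datasets-1`. The snapshot of `apps/shared-data-v2` is left out because the key is compared exactly; a plain substring match on `apps/shared-data` would have selected it as well.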

Delete the stuck snapshots:

# Each line is expected in the form namespace/name
while IFS=/ read -r ns name; do
  kubectl delete volumesnapshot "$name" -n "$ns"
done < snapshots-to-delete.txt
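
Before deleting anything, the same list can be passed through a client-side dry run. This is a minimal sketch that assumes each line of the input file has the form namespace/name. The helper name `preview_snapshot_deletes` and the `KUBECTL` override variable are hypothetical.

```shell
# Command used to reach the cluster; override KUBECTL to stub it out.
KUBECTL="${KUBECTL:-kubectl}"

# Hypothetical helper: reads namespace/name lines from the given file and
# runs the delete as a client-side dry run, so nothing is removed.
preview_snapshot_deletes() {
  while IFS=/ read -r ns name; do
    # Skip blank or malformed lines that lack either part.
    [ -n "$ns" ] && [ -n "$name" ] || continue
    "$KUBECTL" delete volumesnapshot "$name" -n "$ns" --dry-run=client
  done < "$1"
}
```

Running `preview_snapshot_deletes snapshots-to-delete.txt` reports each snapshot that the real loop would delete, which gives a last chance to catch an unexpected entry.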

Failed snapshots whose source PVC no longer exists are not matched by the cross-reference above. Review failed-snapshots.txt, remove any entries that should be kept, and then delete the remaining failed snapshots. Snapshots that were already deleted in the previous step return a NotFound error, which can be ignored:

# Each line is expected in the form "namespace/pvc-name snapshot-name"
while read -r pvc snapshot; do
  ns=${pvc%%/*}
  kubectl delete volumesnapshot "$snapshot" -n "$ns"
done < failed-snapshots.txt
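
If the controller was scaled down before the cleanup, it has to be returned to 3 replicas afterwards. The following sketch restores the replica count and then waits for the rollout to finish. The helper name `restore_csi_controller`, the `KUBECTL` override variable, and the 5-minute timeout are assumptions chosen for illustration.

```shell
# Command used to reach the cluster; override KUBECTL to stub it out.
KUBECTL="${KUBECTL:-kubectl}"

# Hypothetical helper: restores the controller replica count (default 3)
# and waits until the rollout reports that the pods are available again.
restore_csi_controller() {
  replicas="${1:-3}"
  "$KUBECTL" scale deployment vsphere-csi-controller -n vmware-system-csi --replicas="$replicas" &&
    "$KUBECTL" rollout status deployment/vsphere-csi-controller -n vmware-system-csi --timeout=5m
}
```

Calling `restore_csi_controller` without arguments scales back to 3 replicas. Once the rollout completes, `kubectl get pods -n vmware-system-csi` can be used to confirm that the restart count is no longer increasing.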

Additional Information

For further details on snapshot limitations, see the product documentation: Creating Snapshots in VKS Service Clusters.