The vsphere-csi-controller pods within the Supervisor cluster restart at irregular intervals.
When reviewing the CSI logs, a panic is observed indicating a segmentation violation and a nil pointer dereference:
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x1f0bc54]
goroutine 10490815 [running]:
sigs.k8s.io/vsphere-csi-driver/v3/pkg/common/cns-lib/volume.updateQueryResult({0x2d015a8?, 0xc0005c4d20?}, 0x1?, 0x0?)
/build/mts/release/bora-24784842/cayman_vsphere_csi_driver/vsphere_csi_driver/src/pkg/common/cns-lib/volume/util.go:163 +0x34
Additionally, running kubectl get volumesnapshot -A shows snapshots with READYTOUSE set to false. The following warning event may also be seen, indicating a volume type mismatch:
default 10m Warning SnapshotContentCheckandUpdateFailed volumesnapshotcontent/snapcontent######## Failed to check and update snapshot content: failed to take snapshot of the volume file:#######: "rpc error: code = FailedPrecondition desc = queried volume doesn't have the expected volume type. Expected VolumeType: block. Queried VolumeType: FILE"
vSphere Kubernetes Service
vSphere Supervisor
This issue occurs due to an attempt to snapshot ReadWriteMany (RWX) volumes. In vSphere, RWX volumes are provisioned as vSphere file volumes. Only block volumes support volume snapshot and restore operations; these operations cannot be used with vSphere file volumes.
When a snapshot is attempted on a file volume, it fails with a FailedPrecondition error because the queried volume type is FILE instead of the expected block. This unsupported operation leads to a nil pointer dereference in the vsphere-csi-driver, causing the controller pods to panic and restart.
To resolve this issue, you must identify and delete the failed volume snapshots that were attempted against RWX Persistent Volume Claims (PVCs).
Optionally, scale down the CSI controller pods before starting the snapshot cleanup to avoid repeated crashes. After the cleanup is complete, scale the replicas back to 3.
kubectl scale deployment -n vmware-system-csi vsphere-csi-controller --replicas=0
Follow the steps below from a machine with kubectl and jq access to the cluster:
Identify all PVCs with RWX access mode and save them to a text file:
kubectl get pvc --all-namespaces -o json | \
jq -r '.items[] | select(.spec.accessModes[] == "ReadWriteMany") | "\(.metadata.namespace)/\(.metadata.name)"' > rwx-pvcs.txt
Identify all volume snapshots that failed to complete (readyToUse=false):
kubectl get volumesnapshot --all-namespaces -o json | \
jq -r '.items[] | select(.status.readyToUse == false) | "\(.metadata.namespace)/\(.spec.source.persistentVolumeClaimName) \(.metadata.name)"' > failed-snapshots.txt
Cross-reference the failed snapshots with the RWX PVCs to determine which snapshots must be removed:
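# Match the namespace/PVC field exactly and keep the namespace, so each output line is namespace/snapshot-name
awk 'NR==FNR {rwx[$1]; next} ($1 in rwx) {split($1, p, "/"); print p[1] "/" $2}' rwx-pvcs.txt failed-snapshots.txt > snapshots-to-delete.txt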
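To make the expected file formats concrete, here is a minimal sketch of the cross-reference run against made-up sample data. All namespaces, PVC names, and snapshot names are hypothetical. It uses an awk join on the exact namespace/PVC field rather than a substring match, because a fixed-string grep on app1/shared-data would also select a line for app1/shared-data-2. It emits namespace/snapshot-name, which is the form the deletion loop parses.

```shell
#!/usr/bin/env bash
# Sketch only: every namespace, PVC, and snapshot name below is hypothetical.
workdir=$(mktemp -d)

# Sample rwx-pvcs.txt: one "namespace/pvc" per line.
printf '%s\n' 'app1/shared-data' 'app2/media' > "$workdir/rwx-pvcs.txt"

# Sample failed-snapshots.txt: "namespace/pvc snapshot-name" per line.
printf '%s\n' \
  'app1/shared-data snap-a' \
  'app1/shared-data-2 snap-b' \
  'app3/db snap-c' \
  'app2/media snap-d' > "$workdir/failed-snapshots.txt"

# First file: remember each RWX "namespace/pvc" key.
# Second file: keep lines whose first field is one of those keys,
# and print "namespace/snapshot-name".
awk 'NR==FNR { rwx[$1]; next }
     ($1 in rwx) { split($1, p, "/"); print p[1] "/" $2 }' \
  "$workdir/rwx-pvcs.txt" "$workdir/failed-snapshots.txt" \
  > "$workdir/snapshots-to-delete.txt"

cat "$workdir/snapshots-to-delete.txt"
# prints:
# app1/snap-a
# app2/snap-d
```

Only snap-a and snap-d are selected: snap-b belongs to a different PVC whose name merely starts with the same text, and snap-c belongs to a PVC that is not RWX.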
Review the list of snapshots targeted for deletion:
cat snapshots-to-delete.txt
Delete the stuck snapshots:
# Each line is expected in the form namespace/snapshot-name
while read -r snapshot; do
  ns=$(echo "$snapshot" | cut -d'/' -f1)
  name=$(echo "$snapshot" | cut -d'/' -f2)
  kubectl delete volumesnapshot "$name" -n "$ns"
done < snapshots-to-delete.txt
If there are remaining failed snapshots that no longer have a corresponding PVC, you can extract the remaining snapshot names from your failed-snapshots.txt file and delete them:
Review and delete any remaining failed snapshots:
# Each line is in the form "namespace/pvc snapshot-name"
while read -r line; do
  ns=$(echo "$line" | awk '{print $1}' | cut -d'/' -f1)
  snapshot=$(echo "$line" | awk '{print $2}')
  kubectl delete volumesnapshot "$snapshot" -n "$ns"
done < failed-snapshots.txt
For further details on snapshot limitations, see the product documentation: Creating Snapshots in VKS Service Clusters.
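If the CSI controller was scaled down before the cleanup, it needs to be returned to 3 replicas afterwards. The following is a minimal sketch of that step, assuming the same namespace and deployment name used in the scale-down command. The function name restore_csi_controller and the 300s rollout timeout are illustrative choices, not values from the product documentation.

```shell
#!/usr/bin/env bash
# Sketch only: restore_csi_controller is a hypothetical helper name.
# The namespace, deployment name, and replica count of 3 come from the
# resolution steps; the 300s timeout is an arbitrary assumption.
restore_csi_controller() {
  local ns="vmware-system-csi"
  local deploy="vsphere-csi-controller"

  # Scale the controller back up to its normal replica count.
  kubectl scale deployment -n "$ns" "$deploy" --replicas=3 || return 1

  # Wait until the rollout reports the replicas as available.
  kubectl rollout status "deployment/$deploy" -n "$ns" --timeout=300s || return 1

  # List snapshots again to check that none remain with READYTOUSE=false.
  kubectl get volumesnapshot -A
}
```

Call restore_csi_controller once the deletion loops have finished. If the rollout does not complete within the timeout, the function returns a non-zero status and the controller logs are worth re-checking for the same panic.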