vsphere-csi-controller in CrashLoopBackOff due to vsphere-syncer container crashing with "nil pointer dereference"

Article ID: 434766

Products

VMware vSphere Kubernetes Service
VMware vCenter Server

Issue/Introduction

One of the vsphere-csi-controller pods is constantly in CrashLoopBackOff:

root@423c7fe9845f03bca3aa00e328ca200e [ ~ ]# k get pods -n vmware-system-csi -o wide
NAME                                      READY   STATUS             RESTARTS        AGE     IP           NODE                               NOMINATED NODE   READINESS GATES
vsphere-csi-controller-X   7/7     Running            350 (18m ago)   21h     X   X   <none>           <none>
vsphere-csi-controller-X   6/7     CrashLoopBackOff   5 (101s ago)    4m45s   X   X   <none>           <none>
vsphere-csi-controller-X   7/7     Running            337 (13m ago)   21h     X   X   <none>           <none>

The vsphere-syncer container logs show it crashing due to a nil pointer dereference:

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x0]
 
goroutine 302 [running]:
sigs.k8s.io/vsphere-csi-driver/v3/pkg/syncer.calculateVolumeSnapshotReservedForNamespace({0x0, 0x0}, {0x0, 0x0}, 0x0)
        /build/mts/release/bora-X/cayman_vsphere_csi_driver/vsphere_csi_driver/src/pkg/syncer/metadatasyncer.go:1322 +0x0
sigs.k8s.io/vsphere-csi-driver/v3/pkg/syncer.syncStorageQuotaReserved({0x0, 0x0}, {0x0, 0x0}, 0x0)
        /build/mts/release/bora-X/cayman_vsphere_csi_driver/vsphere_csi_driver/src/pkg/syncer/metadatasyncer.go:1137 +0x0
created by sigs.k8s.io/vsphere-csi-driver/v3/pkg/syncer.initStorageQuotaPeriodicSync.func1 in goroutine 301
        /build/mts/release/bora-X/cayman_vsphere_csi_driver/vsphere_csi_driver/src/pkg/syncer/metadatasyncer.go:1071 +0x0

A StoragePolicyQuota custom resource has a negative used value:

apiVersion: cns.vmware.com/v1alpha2
kind: StoragePolicyQuota
...
  - extensionName: snapshot.cns.vsphere.vmware.com
    extensionQuotaUsage:
    - scQuotaUsage:
        reserved: 819Gi
        used: -36Mi     <--- NEGATIVE
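To see every usage entry in a given quota object, a generic jq filter can collect each object that carries a "used" field, whatever the nesting under .status. This is a minimal sketch; the commented kubectl invocation assumes the quota name and namespace of the affected object.

```shell
# Collect every object in the document that has a "used" field, so a
# negative value stands out regardless of how deeply it nests under .status.
USAGE_FILTER='[.. | objects | select(has("used"))]'

# Against a live cluster (placeholders are the affected quota object and namespace):
#   kubectl get storagepolicyquotas.cns.vmware.com <quota-name> -n <namespace> -o json |
#     jq "$USAGE_FILTER"
```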

Environment

VCF 9.0

Cause

The issue is caused by VolumeSnapshots whose underlying PVC has been deleted. These orphaned snapshots drive the snapshot quota usage negative, which crashes the vsphere-syncer container during its periodic storage-quota sync.

Resolution

1. Find out if there is a StoragePolicyQuota custom resource with a negative used value using the following command:

kubectl get storagepolicyquotas.cns.vmware.com -A -o json | jq -r '
  .items[] |
  select(
    .status | [.. | .used? | select(. != null) | tostring | startswith("-")] | any
  ) |
  "Namespace: \(.metadata.namespace) | Name: \(.metadata.name)"
'


2. List VolumeSnapshots that are not ready, along with the PVCs they reference:

kubectl get volumesnapshots -n <namespace> -o json | jq -r '
  .items[] |
  select(.metadata.deletionTimestamp == null) |
  select(.status == null or .status.readyToUse == null or .status.readyToUse == false) |
  "\(.metadata.name) -> PVC: \(.spec.source.persistentVolumeClaimName)"'


3. For each PVC name returned, verify that it no longer exists:

kubectl get pvc <pvc-name> -n <namespace>


4. Delete each VolumeSnapshot whose source PVC no longer exists:

kubectl delete volumesnapshot <snapshot-name> -n <namespace>

If the deletion hangs, cancel it and patch the VolumeSnapshot to remove the finalizers:

kubectl patch volumesnapshot <snapshot-name> -n <namespace> -p '{"metadata":{"finalizers":[]}}' --type=merge
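Steps 2 through 4 can be combined into a small helper that flags snapshots whose source PVC is gone. This is a minimal sketch: find_orphans is a hypothetical function (not part of the CSI driver tooling), and the commented kubectl lines assume the namespace identified in step 1.

```shell
# find_orphans: reads "snapshot pvc" pairs on stdin plus a space-separated
# list of existing PVC names as $1, and prints snapshots whose PVC is missing.
find_orphans() {
  existing=" $1 "
  while read -r snap pvc; do
    case "$existing" in
      *" $pvc "*) ;;                                  # PVC still exists
      *) echo "orphaned: $snap (missing PVC: $pvc)";; # candidate for deletion
    esac
  done
}

# Against a live cluster (assumes kubectl and jq, and the namespace from step 1):
#   kubectl get volumesnapshots -n <namespace> -o json |
#     jq -r '.items[] | "\(.metadata.name) \(.spec.source.persistentVolumeClaimName)"' |
#     find_orphans "$(kubectl get pvc -n <namespace> -o jsonpath='{.items[*].metadata.name}')"
```

Each snapshot printed by the helper can then be deleted (or patched to drop its finalizers) as shown in step 4.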