vsphere-syncer in CrashLoopBackOff with "panic: runtime error: invalid memory address or nil pointer dereference"

Article ID: 387809

Products

VMware vSphere Kubernetes Service

Issue/Introduction

The vsphere-csi-controller pod repeatedly enters a CrashLoopBackOff state because its vsphere-syncer container keeps crashing:

NAMESPACE             POD NAME                                     READY   STATUS             RESTARTS         AGE
vmware-system-csi     vsphere-csi-controller-xxxxxxx-xxxx          6/7     CrashLoopBackOff   2689 (60s ago)   47d
vmware-system-csi     vsphere-csi-controller-xxxxxxx-xxxx          7/7     Running            2743 (6m10s ago) 83d
vmware-system-csi     vsphere-csi-controller-xxxxxxx-xxxx          7/7     Running            2731 (14m ago)   84d

Describing the pod shows the following warning event:

Warning  BackOff  39s  kubelet  Back-off restarting failed container vsphere-syncer in pod vsphere-csi-controller-xxxxxx-xxxxx
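
If the cluster is still reachable, the previous logs of the crashing container can also be pulled directly from the pod (the pod name below is a placeholder):

kubectl -n vmware-system-csi logs vsphere-csi-controller-xxxxxx-xxxxx -c vsphere-syncer --previous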

The syncer.log file shows that the panic ("invalid memory address or nil pointer dereference") follows a CreateVolume request that fails file volume provisioning:

YYYY-MM-DDThh:mm:ss.000Z stderr F {"level":"info","time":"YYYY-MM-DDThh:mm:ss.000Z","caller":"wcp/controller.go:910","msg":"CreateVolume: called with args {Name:pvc-xxxxxxx-xxxx-xxxx-xxxx-xxxxxxxx CapacityRange:required_bytes:5368709120 ..."}
YYYY-MM-DDThh:mm:ss.000Z stderr F {"level":"error","time":"YYYY-MM-DDThh:mm:ss.000Z","caller":"wcp/controller.go:940","msg":"file volume provisioning is not supported on a stretched supervisor cluster"}

Error messages in the driver logs related to file volume provisioning:

YYYY-MM-DDThh:mm:ss.000Z stderr F {"level":"error","time":"YYYY-MM-DDThh:mm:ss.000Z","caller":"wcp/controller.go:940","msg":"File volume provisioning is not supported on a stretched supervisor cluster",
    "TraceId": "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxx",
    "stacktrace": "sigs.k8s.io/vsphere-csi-driver/v3/pkg/csi/service/wcp."
}
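
When working from a collected log bundle instead of a live cluster, the same failure can be located in syncer.log with a simple search; the pattern below matches both the panic text and the provisioning error:

grep -iE "nil pointer dereference|not supported on a stretched" syncer.log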

There may also be ReadWriteMany PVCs stuck in the Pending or Terminating state:

# kubectl get pvc -A | grep Terminating | wc -l
45
# kubectl get pvc -A | grep Pending | wc -l
19
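
To list the affected ReadWriteMany claims rather than just counting them, a jsonpath filter along these lines can be used (a sketch; it assumes each claim declares a single access mode):

kubectl get pvc -A -o jsonpath='{range .items[?(@.spec.accessModes[0]=="ReadWriteMany")]}{.metadata.namespace}{"\t"}{.metadata.name}{"\t"}{.status.phase}{"\n"}{end}'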

Validate via vCenter that the environment is running a stretched (three-zone) Supervisor Cluster:
(Note: In a vCenter log bundle, the file commands/wcp-db-dump.py.txt contains the same output.)

# /usr/lib/vmware-wcp/wcp-db-dump.py | jq ".dump.cluster_db_configs[].desired_config.DeploymentTarget"
{
  "FaultDomainZones": [
    {
      "ID": "vks-c1",
      "ClusterComputeResources": [
        {
          "type": "ClusterComputeResource",
          "value": "domain-c##"
        }
      ]
    },
    {
      "ID": "vks-c2",
      "ClusterComputeResources": [
        {
          "type": "ClusterComputeResource",
          "value": "domain-c##"
        }
      ]
    },
    {
      "ID": "vks-c3",
      "ClusterComputeResources": [
        {
          "type": "ClusterComputeResource",
          "value": "domain-c##"
        }
      ]
    }
  ]
}
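
As a quick check, counting the fault domain zones in the same dump distinguishes a stretched Supervisor (more than one zone) from a single-zone one; the output of 3 below corresponds to the three-zone example above:

# /usr/lib/vmware-wcp/wcp-db-dump.py | jq '.dump.cluster_db_configs[].desired_config.DeploymentTarget.FaultDomainZones | length'
3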

Environment

VMware vSphere with Tanzu

Cause

File volume provisioning is not supported on a stretched Supervisor Cluster.

See documentation: Supervisor Storage

Resolution

Delete the affected PVs and PVCs (PersistentVolumes are cluster-scoped, so no namespace flag is needed for the PV):

kubectl delete pv <pv-name>
kubectl -n <namespace> delete pvc <pvc-name>
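
If the PV bound to a given claim is not known, it can be read from the claim's spec before deleting (namespace and claim name are placeholders):

kubectl -n <namespace> get pvc <pvc-name> -o jsonpath='{.spec.volumeName}'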

Scale the CSI controller deployment down and back up:

kubectl scale deployment vsphere-csi-controller --replicas=0 -n vmware-system-csi
kubectl scale deployment vsphere-csi-controller --replicas=3 -n vmware-system-csi
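
To confirm that the controller pods come back up cleanly and the vsphere-syncer container stays running:

kubectl -n vmware-system-csi rollout status deployment vsphere-csi-controller
kubectl -n vmware-system-csi get pods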