Workload cluster CSI containers stuck in CrashLoopBackOff: "Failed to connect to the CSI driver"

Article ID: 438609

Updated On:

Products

VMware Tanzu Kubernetes Grid

Issue/Introduction

The vsphere-csi-controller, vsphere-syncer, and liveness probe containers are running in the vmware-system-csi namespace.

The csi-attacher, csi-provisioner, csi-resizer, and csi-snapshotter containers are stuck in CrashLoopBackOff.

The csi-attacher and csi-resizer logs contain entries similar to:

YYYY-MM-DDTHH:MM:SS.MSZ stderr F I0410 HH:MM:SS.MS       1 connection.go:253] "Still connecting" address="unix:///csi/csi.sock"
YYYY-MM-DDTHH:MM:SS.MSZ stderr F E0410 HH:MM:SS.MS       1 main.go:156] "Failed to connect to the CSI driver" err="context deadline exceeded" csiAddress="/csi/csi.sock"

vsphere-csi-controller logs contain entries similar to:

YYYY-MM-DDTHH:MM:SS.MSZ stderr F E0410 HH:MM:SS.MS       1 reflector.go:205] "Failed to watch" err="failed to list topology.tanzu.vmware.com/v1alpha1, Resource=zones: zones.topology.tanzu.vmware.com is forbidden: User \"system:serviceaccount:namespacename:XXX-XXX-pvcsi\" cannot list resource \"zones\" in API group \"topology.tanzu.vmware.com\" in the namespace \"XXXX-NS\"" logger="UnhandledError" reflector="pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:290" type="topology.tanzu.vmware.com/v1alpha1, Resource=zones"

Environment

vSphere with Tanzu 8.0

Cause

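The service account used by the workload cluster's CSI driver (XXX-XXX-pvcsi in the log entry above) is missing the role permissions required to list and watch the zones resource in the topology.tanzu.vmware.com API group within its Supervisor namespace.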
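
The namespace and service account named in the forbidden message are the values needed later in the RBAC patch. As a rough sketch, they can be pulled out of saved log text with sed. The function name extract_forbidden_sa is hypothetical, and the pattern assumes the message has the same shape as the sample entry above.

```shell
# Hypothetical helper: reads vsphere-csi-controller log text on stdin and
# prints each unique "<namespace> <serviceaccount>" pair that appears in an
# "is forbidden" message of the form shown in the sample log entry.
extract_forbidden_sa() {
  sed -n 's/.*is forbidden: User .*system:serviceaccount:\([A-Za-z0-9-]*\):\([A-Za-z0-9._-]*\).*/\1 \2/p' | sort -u
}
```

For example, piping the output of kubectl logs for the vsphere-csi-controller pod into extract_forbidden_sa prints one line per affected service account.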

Resolution

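Complete the following steps to grant the required permissions to the CSI service account. Steps 1 and 2 run against the Supervisor Cluster; steps 3 and 4 run against the Guest Cluster.

1. Create the RBAC patch. Create a Role and RoleBinding that grant the missing topology permissions within the target Supervisor namespace.

Save the following as pvcsi-rbac-patch.yaml, replacing namespacename and XXX-XXX-pvcsi with the Supervisor namespace and service account name reported in the forbidden error: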


apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: pvcsi-topology-reader
  namespace: namespacename
rules:
- apiGroups: ["topology.tanzu.vmware.com"]
  resources: ["zones"]
  verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: pvcsi-topology-reader-binding
  namespace: namespacename
subjects:
- kind: ServiceAccount
  name: XXX-XXX-pvcsi
  namespace: namespacename
roleRef:
  kind: Role
  name: pvcsi-topology-reader
  apiGroup: rbac.authorization.k8s.io


2. Apply the configuration to the Supervisor Cluster:

kubectl apply -f pvcsi-rbac-patch.yaml
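
Before moving on, it may help to confirm that the new Role is effective. The sketch below uses kubectl auth can-i with impersonation to check each verb granted by the patch. The function name check_zone_access is hypothetical, and the approach assumes the current Supervisor user is allowed to impersonate service accounts.

```shell
# Hypothetical helper: checks whether a service account can get, list, and
# watch zones in its Supervisor namespace.
#   $1 = Supervisor namespace (for example, namespacename)
#   $2 = service account name (for example, XXX-XXX-pvcsi)
check_zone_access() {
  for verb in get list watch; do
    printf '%s: ' "$verb"
    # can-i prints "yes" or "no"; it exits non-zero on "no", so ignore the status.
    kubectl auth can-i "$verb" zones.topology.tanzu.vmware.com \
      --as="system:serviceaccount:$1:$2" -n "$1" || true
  done
}
```

Running check_zone_access namespacename XXX-XXX-pvcsi against the Supervisor Cluster should report yes for all three verbs once the patch is applied.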


3. Restart the component. Switch context to the Guest Cluster and force the CSI controller pods to restart.

This clears the active error loop so that the new pods retry the watch with the updated RBAC permissions in place.

kubectl rollout restart deployment vsphere-csi-controller -n vmware-system-csi
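
To wait for the restart to finish before checking logs, kubectl rollout status can be used. This is a minimal sketch; the function name wait_for_csi_controller and the 180s default timeout are assumptions, not values from this article.

```shell
# Hypothetical helper: blocks until the vsphere-csi-controller rollout
# completes or the timeout expires.
#   $1 = optional timeout (assumed default: 180s)
wait_for_csi_controller() {
  kubectl rollout status deployment/vsphere-csi-controller \
    -n vmware-system-csi --timeout="${1:-180s}"
}
```

After it returns successfully, kubectl get pods -n vmware-system-csi shows whether the sidecar containers have left CrashLoopBackOff.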

4. Verify the resolution. Monitor the logs of the newly created vsphere-csi-controller pods in the Guest Cluster to confirm that the watch operation succeeds.

kubectl logs -l app=vsphere-csi-controller -c csivsphere -n vmware-system-csi --tail=50
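
As a quick check, the log output can be filtered for the original error. The sketch below counts remaining forbidden messages for the zones resource; the function name count_zone_forbidden is hypothetical.

```shell
# Hypothetical helper: reads log text on stdin and prints how many lines
# still contain the zones "is forbidden" error.
count_zone_forbidden() {
  # grep -c prints 0 and exits non-zero when nothing matches; ignore the status.
  grep -c -F 'zones.topology.tanzu.vmware.com is forbidden' || true
}
```

Piping the kubectl logs command above into count_zone_forbidden should print 0 once the permissions are in effect.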

Additional Information

Apply Default Pod Security Policy to TKG Service Clusters