vSphere PVC attachment to application pods fails and the PVC stays in the Pending state.
Environment
TCA 2.3, TKG 2.x
Cause
The secret named "xxxx-vsphere-csi-addon" in the namespace "workload_cluster_namespace" within the management cluster had the value "null" for the region and zone parameters, which caused the pod failures and left the PVC in the Pending state. To determine why the region and zone parameters in the secret were cleared to null, consider the following options:
Option No 1: Manual deletion of the secret:
Option No 2: Values set to null during upgrade:
Option No 3: Values not set since the initial deployment:
It is most likely that the values were not set since the deployment of the cluster.
In Kubernetes, the vSphere Container Storage Interface (CSI) driver can provision Persistent Volumes (PVs) with or without topology awareness. Here’s an explanation of the differences between topology-aware and non-topology-aware volume provisioning:
Non-Topology-Aware Volume: Volumes are created without specific node or zone constraints. PVs can be accessed from any node in the cluster, provided the storage backend supports it. StorageClasses for non-topology-aware provisioning do not use the allowedTopologies field in the YAML configuration. Example StorageClass YAML configuration for non-topology-aware volume provisioning:
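A minimal sketch of such a StorageClass; the provisioner csi.vsphere.vmware.com is the vSphere CSI driver, while the StorageClass name and storage policy shown here are illustrative and should be replaced with values from the environment:
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: vsphere-sc-non-topology
provisioner: csi.vsphere.vmware.com
parameters:
  storagepolicyname: "vSAN Default Storage Policy"
reclaimPolicy: Delete
volumeBindingMode: Immediate
Note that there is no allowedTopologies block, so volumes are not constrained to any zone or region.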
Topology-Aware Volume: Topology-aware provisioning creates PVs that are aware of the underlying physical or logical topology of the infrastructure, ensuring that storage resources are aligned with the nodes where the pods are scheduled. Volumes are provisioned with specific node or zone preferences, and the corresponding zone/region configuration must also be done at the vCenter level. StorageClasses for topology-aware provisioning use the allowedTopologies field in the YAML configuration. Example StorageClass YAML configuration for topology-aware volume provisioning:
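A sketch of a topology-aware StorageClass. The topology key shown (failure-domain.beta.kubernetes.io/zone) applies to older vSphere CSI releases; newer releases use topology.csi.vmware.com/k8s-zone and topology.csi.vmware.com/k8s-region. The zone value must match the zone tags configured in vCenter:
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: vsphere-sc-topology
provisioner: csi.vsphere.vmware.com
parameters:
  storagepolicyname: "vSAN Default Storage Policy"
volumeBindingMode: WaitForFirstConsumer
allowedTopologies:
- matchLabelExpressions:
  - key: failure-domain.beta.kubernetes.io/zone
    values:
    - zone-a
WaitForFirstConsumer is commonly used here so that volume placement is decided only after the consuming pod is scheduled.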
To use the topology-aware volume feature of the vSphere CSI driver, the Zone and Region parameters must be configured during the deployment of the vSphere CSI addon, as they are stored in the addon's secret. If these parameters are not set, they default to null. Non-topology-aware PV provisioning will still function with a StorageClass that does not include the allowedTopologies field, but topology-aware volume provisioning will fail. To resolve this, manually set the Zone and Region values in the vSphere CSI addon secret within Kubernetes.
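For illustration only, the relevant portion of the decoded values.yaml in the addon secret might look similar to the following in the failing state; the exact key names and nesting vary between TKG versions, so compare this against the secret decoded from the actual cluster:
vsphereCSI:
  namespace: kube-system
  clusterName: workload-cluster-1
  server: vcenter.example.com
  region: null
  zone: null
The region and zone keys are expected to carry the vCenter tag categories used for topology (for example k8s-region and k8s-zone) rather than null.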
Refer to the vSphere CSI 2.0 documentation for further details and background.
Option No 4: vSphere CSI Bug:
The vSphere CSI driver version 2.6 was released in Nov 2023. Since then, there have been no known issues or reported bugs in which the region/zone values were automatically deleted or set to null.
Resolution
Workaround:
Update the missing values for the region and zone parameters in the vSphere CSI addon secret; the vSphere PVC will then enter the Bound state, resolving the issue. One possible way to do this is sketched below.
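A sketch of one way to apply the workaround, assuming the secret and namespace names shown elsewhere in this article: decode the secret's values.yaml, set the region and zone keys, then patch the re-encoded content back into the secret.
kubectl get secret workload_cluster_name-vsphere-csi-addon -n <namespace> -o jsonpath="{.data['values\.yaml']}" | base64 -d > values.yaml
# edit values.yaml and set region and zone to the vCenter tag categories in use, for example k8s-region and k8s-zone
kubectl patch secret workload_cluster_name-vsphere-csi-addon -n <namespace> --type merge -p "{\"data\":{\"values.yaml\":\"$(base64 -w0 < values.yaml)\"}}"
The base64 -w0 option (GNU coreutils) disables line wrapping so the encoded value is accepted as a single string.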
Additional Information
To provide a more accurate and effective root cause analysis, gather the logs and the exact timestamps (or approximate date/time if the exact time is not known) of when the issue was observed and when the PVC last functioned correctly.
Output of the commands below:
kubectl describe pod -A -l app=vsphere-csi-controller | grep -i driver
There are 6 containers running in the vsphere-csi-controller pod: csi-provisioner, csi-attacher, csi-external-resizer, vsphere-csi-controller, csi-livenessprobe, and vsphere-syncer.
Collect the logs from all containers using the commands below:
kubectl logs <vsphere-csi-controller-pod-name> -n <namespace> --all-containers
kubectl logs <vsphere-csi-controller-pod-name> -n <namespace> --all-containers --previous
kubectl logs <vsphere-csi-controller-pod-name> -n <namespace> -c vsphere-csi-controller
kubectl logs <vsphere-csi-controller-pod-name> -n <namespace> -c csi-provisioner
kubectl logs <vsphere-csi-controller-pod-name> -n <namespace> -c csi-attacher
kubectl logs <vsphere-csi-controller-pod-name> -n <namespace> -c csi-external-resizer
kubectl logs <vsphere-csi-controller-pod-name> -n <namespace> -c csi-livenessprobe
kubectl logs <vsphere-csi-controller-pod-name> -n <namespace> -c vsphere-syncer
kubectl logs <vsphere-csi-controller-pod-name> -n <namespace> -c vsphere-csi-controller --previous
kubectl logs <vsphere-csi-controller-pod-name> -n <namespace> -c csi-provisioner --previous
kubectl logs <vsphere-csi-controller-pod-name> -n <namespace> -c csi-attacher --previous
kubectl logs <vsphere-csi-controller-pod-name> -n <namespace> -c csi-external-resizer --previous
kubectl logs <vsphere-csi-controller-pod-name> -n <namespace> -c csi-livenessprobe --previous
kubectl logs <vsphere-csi-controller-pod-name> -n <namespace> -c vsphere-syncer --previous
kubectl describe/logs/get output of the pod where the PV attachment is failing
kubectl describe/logs of the vSphere CSI node pod on the specific node where the application pod is scheduled (the pod where PV attachment failed)
kubectl get/describe nodes -o wide in the workload cluster
kubectl get/describe sc/pv/pvc
kubectl get/describe sc/pv/pvc -o yaml
kubectl get secrets -A | grep csi
kubectl get secret workload_cluster_name-vsphere-csi-addon -n <namespace> -o yaml
kubectl get secret workload_cluster_name-vsphere-csi-addon -n <namespace> -o jsonpath="{.data['values\.yaml']}" | base64 -d
kubectl get secret vsphere-csi-secret -n <namespace> -o yaml
The vCenter Server log bundle should be taken around the same time as the CSI pod logs so that any issues can be tracked and correlated between the CSI logs and the vSAN logs on the vCenter Server.
Collect the TKG Crashd bundle. Include the control plane nodes of both the management cluster and the workload cluster, and also include the worker nodes where the application pods with the failing PVC were scheduled.
Collect the TCA Manager and TCA-CP log bundles. Be sure to select the DB dump and the Kubernetes logs of both the relevant management and workload clusters.