PVC attachment to application PODs fails due to null Region or Zone parameters
Article ID: 377815

Updated On:

Products

VMware Telco Cloud Automation

Issue/Introduction

vSphere PVC attachment fails and stays in a Pending state when the region/zone parameters in the vSphere CSI configuration are missing or set to null. 

Environment

TCA 2.3
TKG 2.x

Cause

The secret named xxxx-vsphere-csi-addon in the workload_cluster_namespace namespace within the management cluster had null values for the region and zone parameters, which caused the pod failures and left the PVC in the Pending state. To determine why the region and zone parameters in the secret were cleared to null, consider the following options:

  • Option 1: Manual deletion of the secret.
  • Option 2: Values set to null during upgrade.
  • Option 3: Values not set since the initial deployment:
    • The most likely explanation is that the values were never set after the initial deployment of the cluster.
    • In Kubernetes, the vSphere Container Storage Interface (CSI) driver can provision Persistent Volumes (PVs) with or without topology awareness. Here is an explanation of the differences between topology-aware and non-topology-aware volume provisioning:
      • Non-Topology Aware Volume: Volumes are created without specific node or zone constraints. PVs can be accessed from any node in the cluster, provided the storage backend supports it. StorageClasses for non-topology-aware provisioning do not use the allowedTopologies field in the YAML configuration.

        Example StorageClass YAML configuration for non-topology-aware volume provisioning:

        apiVersion: storage.k8s.io/v1
        kind: StorageClass
        metadata:
          name: example-non-topology-aware-sc
        provisioner: csi.vsphere.vmware.com
        parameters:
          storagepolicyname: "example-storage-policy"
      • Topology-Aware Volume: Topology-aware provisioning creates PVs that are aware of the underlying physical or logical topology of the infrastructure, ensuring that storage resources are optimally aligned with the nodes where the pods are scheduled. Volumes are provisioned with specific node or zone preferences, and the corresponding configuration is also done at the vCenter level. StorageClasses for topology-aware provisioning use the allowedTopologies field in the YAML configuration.

        Example StorageClass YAML configuration for topology-aware volume provisioning:

        apiVersion: storage.k8s.io/v1
        kind: StorageClass
        metadata:
          name: example-topology-aware-sc
        provisioner: csi.vsphere.vmware.com
        parameters:
          storagepolicyname: "example-storage-policy"
        allowedTopologies:
        - matchLabelExpressions:
          - key: failure-domain.beta.kubernetes.io/zone
            values:
            - zone1
    • To use the topology-aware volume feature of the vSphere CSI Driver, you must configure the Zone and Region parameters during the deployment of the vSphere CSI addon, as these are stored in the addon’s secret. If you do not set these parameters, they default to null. While non-topology-aware PV provisioning still functions with a storage class that does not include the allowedTopologies field, topology-aware volume provisioning will face issues. To resolve this, manually set the Zone and Region values in the vSphere CSI addon secret within Kubernetes.
    • Refer to the vSphere CSI 2.0 documentation for further details.
  • Option 4: vSphere CSI Issue:
    • The vSphere CSI Driver version 2.6 was released in November 2023. Since then, there have been no known issues or reported bugs in which the region/zone values were automatically deleted and set to null.
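A quick way to confirm this condition is to decode the addon secret's values.yaml and look for null region/zone entries. The sketch below simulates the check on a locally saved copy of the file; the vsphereCSI key layout is an assumption (verify it against your actual secret), and the secret and namespace names in the comment are placeholders:

```shell
# Simulated values.yaml as it might look when the fields were never set
# (the vsphereCSI key layout is an assumption, not taken from this article):
cat > values.yaml <<'EOF'
vsphereCSI:
  region: null
  zone: null
EOF

# Flag the misconfiguration:
if grep -Eq '(region|zone): null' values.yaml; then
  echo "region/zone are null - topology-aware provisioning will fail"
fi

# In a live cluster, the file would come from the management cluster, e.g.:
#   kubectl get secret <cluster-name>-vsphere-csi-addon -n <namespace> \
#     -o jsonpath="{.data['values\.yaml']}" | base64 -d > values.yaml
```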

Resolution

Workaround:

Update the missing values for the region and zone parameters in the addon secret; the vSphere PVC will then enter the Bound state, resolving the issue.
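One way to apply the workaround is to rewrite the values.yaml data key on the addon secret. The sketch below builds the merge-patch payload locally; the secret name, namespace, region/zone values, and the vsphereCSI key layout are placeholders and assumptions, so verify them against the existing secret content before patching:

```shell
# Corrected values.yaml with region/zone set (key names and values are
# assumptions; match them to the existing content of your addon secret):
cat > values.yaml <<'EOF'
vsphereCSI:
  region: k8s-region
  zone: k8s-zone
EOF

# Build a merge-patch payload that replaces the values.yaml data key:
B64="$(base64 < values.yaml | tr -d '\n')"
PATCH="{\"data\":{\"values.yaml\":\"$B64\"}}"
echo "$PATCH"

# Then apply it to the addon secret in the management cluster, e.g.:
#   kubectl patch secret <cluster-name>-vsphere-csi-addon -n <namespace> \
#     --type merge -p "$PATCH"
```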

Additional Information

  • To enable a more accurate and effective Root Cause Analysis, gather the logs and the exact timestamps (or approximate date/time if not known) of when the issue was observed and when the PVC last functioned correctly.
  • Output of the commands below:
  • kubectl describe pod -A -l app=vsphere-csi-controller | grep -i driver
    There are 6 containers running in the vsphere-csi-controller pod: csi-provisioner, csi-attacher, csi-external-resizer, vsphere-csi-controller, csi-livenessprobe, and vsphere-syncer.
    Collect the logs from all containers with the commands below:
    kubectl logs <vsphere-csi-controller-pod-name> -n <namespace>
    kubectl logs <vsphere-csi-controller-pod-name> -n <namespace> --previous
    kubectl logs <vsphere-csi-controller-pod-name> -n <namespace> -c vsphere-csi-controller
    kubectl logs <vsphere-csi-controller-pod-name> -n <namespace> -c csi-provisioner
    kubectl logs <vsphere-csi-controller-pod-name> -n <namespace> -c csi-attacher
    kubectl logs <vsphere-csi-controller-pod-name> -n <namespace> -c csi-external-resizer
    kubectl logs <vsphere-csi-controller-pod-name> -n <namespace> -c csi-livenessprobe
    kubectl logs <vsphere-csi-controller-pod-name> -n <namespace> -c vsphere-syncer
    kubectl logs <vsphere-csi-controller-pod-name> -n <namespace> -c vsphere-csi-controller --previous
    kubectl logs <vsphere-csi-controller-pod-name> -n <namespace> -c csi-provisioner --previous
    kubectl logs <vsphere-csi-controller-pod-name> -n <namespace> -c csi-attacher --previous
    kubectl logs <vsphere-csi-controller-pod-name> -n <namespace> -c csi-external-resizer --previous
    kubectl logs <vsphere-csi-controller-pod-name> -n <namespace> -c csi-livenessprobe --previous
    kubectl logs <vsphere-csi-controller-pod-name> -n <namespace> -c vsphere-syncer --previous
    kubectl describe/logs/get of the pod where PV attachment is failing
    kubectl describe/logs of the vSphere CSI node pod on the specific node where the application pod (the pod whose PV attachment failed) is scheduled
    kubectl get/describe nodes -o wide in the workload cluster
    kubectl get/describe sc/pv/pvc
    kubectl get/describe sc/pv/pvc -o yaml
    kubectl get secrets -A | grep csi
    kubectl get secret workload_cluster_name-vsphere-csi-addon -n <namespace> -o yaml
    kubectl get secret workload_cluster_name-vsphere-csi-addon -n <namespace> -o jsonpath="{.data['values\.yaml']}" | base64 -d
    kubectl get secret vsphere-csi-secret -n <namespace> -o yaml
  • A vCenter Server log bundle should be taken around the same time as the CSI pod logs so that any issues between the CSI logs and the vSAN logs on the vCenter Server can be tracked and correlated.
  • Collect the TKG Crashd bundle. Include the control plane nodes of both the management cluster and the workload cluster. Also include the worker nodes where application pods with the failing PVC were scheduled.
  • Collect the TCA Manager and TCA-CP log bundles. Select the DB Dump and Kubernetes logs options for both relevant management/workload clusters.
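The per-container log collection above can be scripted as one loop over the six container names listed earlier. This sketch only prints the commands (remove the echo to execute them); the pod name and namespace are placeholders for the actual values in your cluster:

```shell
POD=vsphere-csi-controller-xxxx   # placeholder: actual controller pod name
NS=kube-system                    # placeholder: actual namespace may differ
for c in csi-provisioner csi-attacher csi-external-resizer \
         vsphere-csi-controller csi-livenessprobe vsphere-syncer; do
  # Print both the current and previous log commands; drop 'echo' to run them.
  echo "kubectl logs $POD -n $NS -c $c"
  echo "kubectl logs $POD -n $NS -c $c --previous"
done
```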