How to fix Persistent Volume stuck deleting in Easy Supervisor Cluster with NSX

Article ID: 417550

Updated On:

Products

VMware Data Services Manager

Issue/Introduction

The Persistent Volume Claim (PVC) and Pod in the workload cluster are deleted, but the Persistent Volume (PV) is stuck in a deleting state with "Released" status, even though the reclaim policy is set to "Delete".

For example, the database cluster status will show up as follows (i.e. the persistent volume still exists when it should have been removed):

status:
  alertLevel: WARNING
  conditions:
  - lastTransitionTime: "2025-07-29T23:53:07Z"
    message: ""
    observedGeneration: 2
    reason: Deleting
    status: "False"
    type: Ready
  - lastTransitionTime: "2025-07-29T23:54:23Z"
    message: |-
      waiting for volumes to be removed All attempts fail:
      #1: persistent volume pvc-########-####-####-9454-############ still exists when it should have been removed
      #2: persistent volume pvc-########-####-####-9454-############ still exists when it should have been removed
      #3: persistent volume pvc-########-####-####-9454-############ still exists when it should have been removed
      #4: persistent volume pvc-########-####-####-9454-############ still exists when it should have been removed
      #5: persistent volume pvc-########-####-####-9454-############ still exists when it should have been removed
    observedGeneration: 2
    reason: Failed
    status: "False"
    type: Provisioning
  - lastTransitionTime: "2025-07-29T23:31:53Z"


In the workload cluster, the PVC is gone, but the PV still exists:

kubectl get pv
NAME                                                        CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS     CLAIM                                              STORAGECLASS           VOLUMEATTRIBUTESCLASS   REASON   AGE
persistentvolume/pvc-########-####-####-9454-############   20Gi       RWO            Delete           Released   new-namespace/#####-######-######-#####-######-#  <storage class>   <unset>                          13h
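
As a quick check (a minimal example, run against the workload cluster; the PV name is the masked placeholder from the output above), the phase and reclaim policy of the stuck PV can be read directly:

kubectl get pv pvc-########-####-####-9454-############ -o jsonpath='{.status.phase}{" "}{.spec.persistentVolumeReclaimPolicy}{"\n"}'
# Expected output for this issue: Released Delete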

Environment

VMware Data Services Manager 9.x

Cause

A PV in a workload cluster has a corresponding PVC in the Supervisor. Deleting a PV in the workload cluster relies on the corresponding Supervisor PVC being deleted as well. In some rare cases, the Supervisor PVC fails to respond to the deletion event of its workload cluster PV, which leaves the PV stuck in the Released state.
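
As a rough illustration of this relationship (a sketch; the context name and the vSphere Namespace below are placeholders), the leftover Supervisor PVC can still be seen even after the workload cluster PVC and Pod are gone. Its name matches the volumeHandle of the workload cluster PV, as shown in the Resolution below:

# In the Supervisor context, list PVCs in the vSphere Namespace that backs the workload cluster
kubectl --context <supervisor-context> get pvc -n <vsphere-namespace>
# The leftover PVC has the same name as the workload cluster PV's spec.csi.volumeHandle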

Resolution

Commands must be executed in three different environments:

  1. Workload cluster
  2. Supervisor
  3. DSM Provider VM  

(Please ensure that each command is executed in the correct environment.)

 

1) In the workload cluster:

We need to find the Supervisor PVC that corresponds to the stuck PV. Execute the `kubectl get` command below against the workload cluster. In the output, the `volumeHandle` field is the name of the corresponding Supervisor PVC.

kubectl get persistentvolume/pvc-########-####-####-9454-############ -o yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  annotations:
    pv.kubernetes.io/provisioned-by: csi.vsphere.vmware.com
    volume.kubernetes.io/provisioner-deletion-secret-name: ""
    volume.kubernetes.io/provisioner-deletion-secret-namespace: ""
  creationTimestamp: "2025-07-29T23:41:34Z"
  finalizers:
  - kubernetes.io/pv-protection
  - external-attacher/csi-vsphere-vmware-com
  name: pvc-########-####-####-9454-############ 
  resourceVersion: "5508"
  uid: ########-####-####-####-############ 
spec:
  accessModes:
  - ReadWriteOnce
  capacity:
    storage: 20Gi
  claimRef:
    apiVersion: v1
    kind: PersistentVolumeClaim
    name: #####-######-######-#####-######-#
    namespace: <namespaceName>
    resourceVersion: "3153"
    uid: ########-####-####-9454-############
  csi:
    driver: csi.vsphere.vmware.com
    fsType: ext4
    volumeAttributes:
      storage.kubernetes.io/csiProvisionerIdentity: 1753832251571-754-csi.vsphere.vmware.com
      type: vSphere CNS Block Volume
    volumeHandle: ########-####-####-b3f4-############-########-####-####-9454-############
  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: topology.kubernetes.io/zone
          operator: In
          values:
          - domain-c10
  persistentVolumeReclaimPolicy: Delete
  storageClassName: dsm-test-latebinding
  volumeMode: Filesystem
status:
  lastPhaseTransitionTime: "2025-07-29T23:53:12Z"
  phase: Released


In this example, it is `########-####-####-b3f4-############-########-####-####-9454-############`. 
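
If you prefer to pull only that field, a jsonpath query against the same PV returns just the volumeHandle (same placeholder PV name as above):

kubectl get persistentvolume/pvc-########-####-####-9454-############ -o jsonpath='{.spec.csi.volumeHandle}{"\n"}'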

 

2) In the Supervisor:

Next, we need to switch to the Supervisor environment and delete the PVC from there.

(The example name from above is used here; change the PVC name accordingly.)

kubectl delete pvc ########-####-####-b3f4-############-########-####-####-9454-############
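
If the Supervisor kubectl context does not default to the vSphere Namespace that owns this PVC, the namespace must be passed explicitly (the namespace name below is a placeholder):

kubectl delete pvc ########-####-####-b3f4-############-########-####-####-9454-############ -n <vsphere-namespace>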

 

3) In the workload cluster:

The next step is to restart the vsphere-csi-controller. Execute the following command in the workload cluster again:

kubectl rollout restart deployment/vsphere-csi-controller -n vmware-system-csi
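
To confirm that the controller pods come back up after the restart, the rollout can be watched from the same workload cluster:

kubectl rollout status deployment/vsphere-csi-controller -n vmware-system-csi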

 


4) In the DSM Provider VM:

The provisioner running in the Provider VM keeps retrying the reconciliation. After finishing the above steps, wait for a period of time (such as 10-20 minutes) and the database cluster should be cleaned up completely.

If not, SSH into the Provider VM and restart the provisioner process with the command below. This will trigger the reconciliation immediately.

systemctl restart dsm-tsql-provisioner
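
To verify the provisioner restarted cleanly, its status and recent logs can be checked from the same SSH session (these are standard systemd tools, not DSM-specific commands):

systemctl status dsm-tsql-provisioner
journalctl -u dsm-tsql-provisioner --since "10 minutes ago"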

 

After the above steps, the PV that was stuck in the deleting state should be cleaned up completely.
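
As a final verification (run against the workload cluster, using the same placeholder PV name), the previously Released PV should no longer be listed:

kubectl get pv
kubectl get pv pvc-########-####-####-9454-############
# If cleanup succeeded, the second command reports that the persistent volume is not found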