How to fix Persistent Volume stuck deleting in Easy Supervisor Cluster with NSX

Article ID: 417550

Updated On:

Products

VMware Data Services Manager

Issue/Introduction

The Persistent Volume Claim (PVC) and Pod in the workload cluster are deleted, but the Persistent Volume (PV) is stuck in a deleting state with "Released" status, even though the reclaim policy is set to "Delete".

For example, the database cluster status will show up as follows (i.e. the persistent volume still exists when it should have been removed):

status:
  alertLevel: WARNING
  conditions:
  - lastTransitionTime: "2025-07-29T23:53:07Z"
    message: ""
    observedGeneration: 2
    reason: Deleting
    status: "False"
    type: Ready
  - lastTransitionTime: "2025-07-29T23:54:23Z"
    message: |-
      waiting for volumes to be removed All attempts fail:
      #1: persistent volume pvc-########-####-####-9454-############ still exists when it should have been removed
      #2: persistent volume pvc-########-####-####-9454-############ still exists when it should have been removed
      #3: persistent volume pvc-########-####-####-9454-############ still exists when it should have been removed
      #4: persistent volume pvc-########-####-####-9454-############ still exists when it should have been removed
      #5: persistent volume pvc-########-####-####-9454-############ still exists when it should have been removed
    observedGeneration: 2
    reason: Failed
    status: "False"
    type: Provisioning
  - lastTransitionTime: "2025-07-29T23:31:53Z"


In the workload cluster, the PVC is gone, but the PV still exists:

kubectl get pv
NAME                                                        CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS     CLAIM                                              STORAGECLASS           VOLUMEATTRIBUTESCLASS   REASON   AGE
persistentvolume/pvc-########-####-####-9454-############   20Gi       RWO            Delete           Released   new-namespace/#####-######-######-#####-######-#  <storage class>   <unset>                          13h
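
As a quick check (a minimal example, run against the workload cluster; the PV name is the masked placeholder from the output above), the phase and reclaim policy of the stuck PV can be read directly:

kubectl get pv pvc-########-####-####-9454-############ -o jsonpath='{.status.phase}{" "}{.spec.persistentVolumeReclaimPolicy}{"\n"}'
# Expected output for this issue: Released Delete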

Environment

VMware Data Services Manager 9.x

Cause

A PV in a workload cluster has a corresponding PVC in the Supervisor. Deleting a PV in the workload cluster relies on the corresponding Supervisor PVC being deleted as well. In some rare cases, the Supervisor PVC fails to respond to the deletion event of its workload cluster PV, which leaves the PV stuck in the Released state.
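
As a rough illustration of this relationship (a sketch; the context name and the vSphere Namespace below are placeholders), the leftover Supervisor PVC can still be seen even after the workload cluster PVC and Pod are gone. Its name matches the volumeHandle of the workload cluster PV, as shown in the Resolution below:

# In the Supervisor context, list PVCs in the vSphere Namespace that backs the workload cluster
kubectl --context <supervisor-context> get pvc -n <vsphere-namespace>
# The leftover PVC has the same name as the workload cluster PV's spec.csi.volumeHandle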

Resolution

Commands must be executed in three different environments:

  1. Workload cluster
  2. Supervisor
  3. DSM Provider VM  

(Please ensure that each command is executed in the correct environment.)

 

1) In the workload cluster:

We need to find the Supervisor PVC that corresponds to the stuck PV. Execute the `kubectl get` command below against the workload cluster. In the output, the `volumeHandle` field is the name of the corresponding Supervisor PVC.

kubectl get persistentvolume/pvc-########-####-####-9454-############ -o yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  annotations:
    pv.kubernetes.io/provisioned-by: csi.vsphere.vmware.com
    volume.kubernetes.io/provisioner-deletion-secret-name: ""
    volume.kubernetes.io/provisioner-deletion-secret-namespace: ""
  creationTimestamp: "2025-07-29T23:41:34Z"
  finalizers:
  - kubernetes.io/pv-protection
  - external-attacher/csi-vsphere-vmware-com
  name: pvc-########-####-####-9454-############ 
  resourceVersion: "5508"
  uid: ########-####-####-####-############ 
spec:
  accessModes:
  - ReadWriteOnce
  capacity:
    storage: 20Gi
  claimRef:
    apiVersion: v1
    kind: PersistentVolumeClaim
    name: #####-######-######-#####-######-#
    namespace: <namespaceName>
    resourceVersion: "3153"
    uid: ########-####-####-9454-############
  csi:
    driver: csi.vsphere.vmware.com
    fsType: ext4
    volumeAttributes:
      storage.kubernetes.io/csiProvisionerIdentity: 1753832251571-754-csi.vsphere.vmware.com
      type: vSphere CNS Block Volume
    volumeHandle: ########-####-####-b3f4-############-########-####-####-9454-############
  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: topology.kubernetes.io/zone
          operator: In
          values:
          - domain-c10
  persistentVolumeReclaimPolicy: Delete
  storageClassName: dsm-test-latebinding
  volumeMode: Filesystem
status:
  lastPhaseTransitionTime: "2025-07-29T23:53:12Z"
  phase: Released


In this example, it is `########-####-####-b3f4-############-########-####-####-9454-############`. 
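
If you prefer to pull only that field, a jsonpath query against the same PV returns just the volumeHandle (same placeholder PV name as above):

kubectl get persistentvolume/pvc-########-####-####-9454-############ -o jsonpath='{.spec.csi.volumeHandle}{"\n"}'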

 

2) In the Supervisor:

Next, we need to switch to the Supervisor environment and delete the PVC from there.

(The example name from above is used here; change the PVC name accordingly.)

kubectl delete pvc ########-####-####-b3f4-############-########-####-####-9454-############
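
If the Supervisor kubectl context does not default to the vSphere Namespace that owns this PVC, the namespace must be passed explicitly (the namespace name below is a placeholder):

kubectl delete pvc ########-####-####-b3f4-############-########-####-####-9454-############ -n <vsphere-namespace>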

 

3) In the workload cluster:

The next step is to restart the vsphere-csi-controller. Execute the following command in the workload cluster again:

kubectl rollout restart deployment/vsphere-csi-controller -n vmware-system-csi
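
To confirm that the controller pods come back up after the restart, the rollout can be watched from the same workload cluster:

kubectl rollout status deployment/vsphere-csi-controller -n vmware-system-csi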

 


4) In the DSM Provider VM:

The provisioner running in the Provider VM keeps retrying the reconciliation. After finishing the above steps, wait for a period of time (such as 10-20 minutes) and the database cluster should be cleaned up completely.

If not, SSH into the Provider VM and restart the provisioner process with the command below. This will trigger the reconciliation immediately.

systemctl restart dsm-tsql-provisioner
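
To verify the provisioner restarted cleanly, its status and recent logs can be checked from the same SSH session (these are standard systemd tools, not DSM-specific commands):

systemctl status dsm-tsql-provisioner
journalctl -u dsm-tsql-provisioner --since "10 minutes ago"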

 

After the above steps, the PV that was stuck in the deleting state should be cleaned up completely.
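
As a final verification (run against the workload cluster, using the same placeholder PV name), the previously Released PV should no longer be listed:

kubectl get pv
kubectl get pv pvc-########-####-####-9454-############
# If cleanup succeeded, the second command reports that the persistent volume is not found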