VKS Cluster Pods Stuck Init - failed to attach disk to vm

Products

VMware vSphere Kubernetes Service

Tanzu Kubernetes Runtime

Issue/Introduction

In a VKS cluster, pods using persistent volumes are stuck in Init state.

While connected to the VKS cluster context, the following symptoms are observed:

The affected pod is stuck in Init state:
```
kubectl get pods -n <namespace> -o wide
```
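When the namespace holds many pods, the STATUS column can be filtered for Init states. The following is a minimal sketch, not part of the official procedure. The helper name `pods_stuck_in_init` is hypothetical, and it assumes the default `kubectl get pods` column order (NAME, READY, STATUS) when a single namespace is queried.

```shell
# Hypothetical helper: reads `kubectl get pods -n <namespace>` output on stdin
# and prints the name and status of pods whose STATUS starts with "Init:".
# Assumes the default column order NAME, READY, STATUS (no -A flag).
pods_stuck_in_init() {
  awk 'NR > 1 && $3 ~ /^Init:/ { print $1, $3 }'
}

# Example usage (requires access to the VKS cluster):
#   kubectl get pods -n <namespace> -o wide | pods_stuck_in_init
```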

Describing the affected pod shows errors similar to the following, indicating a timeout and a failure to attach the volume:

kubectl describe pod <pod name> -n <pod namespace>

FailedAttachVolume: AttachVolume.Attach failed for volume <pv name> : rpc error: code = Internal desc = Watch on virtualmachine timed out
Unable to attach or mount volumes
Timed out waiting for the condition

The persistent volume and persistent volume claim are present and bound for the affected pod:
```
kubectl get pv,pvc -n <namespace>
```
A volumeattachment exists for the persistent volume on the same node as the affected pod, but its ATTACHED status remains false:
```
kubectl get volumeattachments
```
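To narrow the list to attachments that never completed, the output can be filtered on the ATTACHED column. This is a sketch only. The helper name `unattached_volumeattachments` is hypothetical, and it assumes the default column order NAME, ATTACHER, PV, NODE, ATTACHED, AGE.

```shell
# Hypothetical helper: reads `kubectl get volumeattachments` output on stdin
# and prints NAME, PV and NODE for every row whose ATTACHED column is "false".
# Assumes the default column order NAME, ATTACHER, PV, NODE, ATTACHED, AGE.
unattached_volumeattachments() {
  awk 'NR > 1 && $5 == "false" { print $1, $3, $4 }'
}

# Example usage (requires access to the VKS cluster):
#   kubectl get volumeattachments | unattached_volumeattachments
```

The PV and NODE values printed here should match the persistent volume and node of the affected pod.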

Restarting the CSI controller pods does not resolve the issue.

Storage DRS is enabled on the datastores or datastore cluster used by the VKS cluster.

When comparing the output of GOVC and VCDB for the affected volume(s), there is a discrepancy in datastore placement for the associated disk.

See the "Issue Verification" steps under Workaround below for how to verify this.

Environment

vSphere Supervisor

VKS Cluster

Storage DRS enabled - This is unsupported in VKS environments

Cause

Storage DRS is not supported in VKS and vSphere Supervisor environments.

When Storage DRS automatically moves a volume to another datastore, this breaks the connection and CNS management of the volume.

This can result in discrepancies between where VCDB considers the volume to be located and where the volume is actually located.

Resolution

Resolution:

Disable Storage DRS on the datastore cluster for the datastores used by VKS.

This will prevent the issue from happening again, but the below workaround may need to be performed to fix the datastore discrepancy caused by Storage DRS.

Workaround:

Issue Verification

Connect to the VKS cluster context for the stuck pod
Locate the persistent volume associated with the stuck pod:
1. Describe the stuck pod for the human-readable name of the persistent volume:
```
kubectl describe pod -n <namespace> <stuck pod name>
```
2. Find the associated persistent volume and persistent volume claim based on the human-readable name:
```
kubectl get pv,pvc -n <namespace> | grep <human-readable name for the volume>
```
3. Describe the associated persistent volume for its volumeHandle:
```
kubectl describe pv <pvc-ID> | grep -i volumehandle
```
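If the volumeHandle is needed as a bare value (for example, to paste into the Supervisor query), it can be pulled out of the describe output. This is a sketch only. The helper name `volume_handle_from_describe` is hypothetical, and it assumes the describe output contains a line of the form `VolumeHandle: <value>`.

```shell
# Hypothetical helper: reads `kubectl describe pv <pv-name>` output on stdin
# and prints only the value of the first VolumeHandle line.
# The match on the field name is case-insensitive.
volume_handle_from_describe() {
  awk 'tolower($1) == "volumehandle:" { print $2; exit }'
}

# Example usage (requires access to the VKS cluster):
#   kubectl describe pv <pvc-ID> | volume_handle_from_describe
```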
Connect to the Supervisor cluster context
Document the persistent volume associated with the volumeHandle found in the previous step:
```
kubectl get pvc -A | grep -i <volumeHandle from previous step>
```
Note: Persistent volumes (pv) are named "pvc-<ID>"
In an SSH session to the vCenter Server Appliance (VCSA) VM, query VCDB for the above persistent volume name:
```
/opt/vmware/vpostgres/current/bin/psql -U postgres -d VCDB -c "select * from cns.volume_info where volume_name = '<pvc-name from Supervisor cluster>';"
```
Document the volume ID, datastore and vmdk from the above output.
Query for the previous step's volume ID in GOVC:
1. Download and set up GOVC onto a machine that has connectivity to vCenter:
  1. See section "2. List and delete snapshots" of the KB article "Failed to expand volume because the disk that backs it has snapshots".
2. Establish environment variables:
```
export GOVC_URL=<vCenter FQDN>
export GOVC_USERNAME=<admin User>
export GOVC_PASSWORD=<admin Password>
export GOVC_INSECURE=true
```
3. Search for disks with the volume ID from VCDB:
```
govc disk.ls -k -dc=<datacenter> -ds=<datastore from VCDB> -l <volume ID from VCDB>
```
  - Ensure that the datacenter, datastore, and volume ID in the above command are correct.
  - If the above output returns "ServerFaultCode: The object or item referred to could not be found", this indicates that the volume is no longer in the given datastore. VCDB has a discrepancy with the current datastore location of this volume.
  - Confirm that the volume was moved to a different datastore by changing the datastore in the above GOVC command.
Once you've verified that there's a discrepancy between which datastore the volume is on according to VCDB and GOVC, proceed to the next section.

If there is no discrepancy, you have encountered a different issue from the one described in this KB article.
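The GOVC datastore check above can be repeated across several candidate datastores with a small loop. This is a sketch under stated assumptions, not an official tool. The function name `find_disk_datastore` is hypothetical, the `GOVC_*` environment variables are assumed to be exported already, and it is assumed that `govc disk.ls` exits with a non-zero status when the volume is not found on the given datastore.

```shell
# Hypothetical helper: prints the first datastore on which govc can list the
# given volume ID, or returns non-zero if none of the candidates match.
# Arguments: <datacenter> <volume ID from VCDB> <datastore> [<datastore> ...]
# Assumption: `govc disk.ls` exits non-zero when the disk is not on the datastore.
find_disk_datastore() {
  dc="$1"
  volume_id="$2"
  shift 2
  for ds in "$@"; do
    # Same command as in the verification step, with output suppressed.
    if govc disk.ls -k -dc="$dc" -ds="$ds" -l "$volume_id" >/dev/null 2>&1; then
      printf '%s\n' "$ds"
      return 0
    fi
  done
  return 1
}

# Example usage (requires govc and connectivity to vCenter):
#   find_disk_datastore <datacenter> <volume ID from VCDB> <datastore 1> <datastore 2>
```

If the datastore printed differs from the datastore recorded in VCDB, that is the discrepancy this article describes.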

Reconcile the Datastore

Follow the KB article "Reconciling Discrepancies in the Managed Virtual Disk Catalog" to reconcile the affected datastores.

Depending on the size of the environment, this can take multiple hours to take effect.

Additional Information

vSphere DRS is required in Fully Automated mode for VKS and vSphere Supervisor environments.