vSphere Kubernetes Cluster Windows Pod with PVC Stuck in Unknown State After Windows VM Reboot


Article ID: 387352


Updated On:

Products

VMware vSphere with Tanzu; Tanzu Kubernetes Runtime

Issue/Introduction

After a Windows worker VM is rebooted, a Windows-based pod with a PVC that was running on that VM becomes stuck in Unknown state.

 

While connected to the affected vSphere Kubernetes cluster's context, the following symptoms are observed:

  • The affected Windows-based pod is stuck in Unknown state:
    • kubectl get pods -A | grep <pod name>
  • Performing a describe on the affected Windows-based pod shows FailedMount error messages similar to the following, where the PVC names, volume IDs, and IP addresses vary by environment:
    • Warning  FailedMount  ##m                kubelet  MountVolume.WaitForAttach failed for volume "<pvc-name>" : volume <volume id> has GET error for volume attachment <csi-volume-id>: Get "https://<IP.ADDRESS.A>:6443/apis/storage.k8s.io/v1/volumeattachments/<csi-volume-id>": read tcp <IP.ADDRESS.B>:<PORT>-><IP.ADDRESS.A>:6443: wsarecv: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.
    • Warning  FailedMount  ##m (x# over ##m)  kubelet  MountVolume.MountDevice failed for volume "<pvc-name>" : kubernetes.io/csi: attacher.MountDevice failed to create newCsiDriverClient: driver name csi.vsphere.vmware.com not found in the list of registered CSI drivers
    • Warning  FailedMount  ##m                kubelet  MountVolume.MountDevice failed for volume "<pvc-name>" : rpc error: code = Internal desc = error mounting volume. Parameters: {<volume id> ntfs \var\lib\kubelet\plugins\kubernetes.io\csi\csi.vsphere.vmware.com\<container-id>\globalmount [] false} err: rpc error: code = Unknown desc = error mount volume to path. cmd: Get-Volume -UniqueId "$Env:volumeID" | Get-Partition | Add-PartitionAccessPath -AccessPath $Env:mountpath, output: Add-PartitionAccessPath : The requested access path is already in use.
  • The affected Windows-based pod does not reach Running state even after the pod is recreated.
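As a diagnostic aid, the volume ID embedded in the FailedMount event can be extracted for cross-checking against the node. The following is a minimal sketch, not from the article: the sample event message is abbreviated, and the volume ID `vol-1234` is a placeholder.

```shell
# Abbreviated FailedMount event message (sample; real messages are longer).
msg='MountVolume.MountDevice failed for volume "pvc-example" : rpc error: code = Internal desc = error mounting volume. Parameters: {vol-1234 ntfs \var\lib\kubelet\plugins\globalmount [] false}'

# The volume ID is the first field inside the "Parameters: {...}" block.
vol_id=$(printf '%s' "$msg" | sed -n 's/.*Parameters: {\([^ ]*\) .*/\1/p')
echo "$vol_id"
```

In practice, `msg` would be populated from `kubectl describe pod` output rather than a literal string.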

Environment

vSphere with Tanzu 8.0

Cause

The CSI driver responsible for mounting volumes is attempting to perform two volume mount actions on the affected Windows-based pod.

  • error mount volume to path. cmd: Get-Volume -UniqueId "$Env:volumeID" | Get-Partition | Add-PartitionAccessPath -AccessPath $Env:mountpath, output: Add-PartitionAccessPath : The requested access path is already in use.

The Windows API rejects the CSI driver's request to mount the volume a second time.

Currently, the CSI controller cannot detect that a PVC is already mounted on a Windows node after the node is rebooted.

Resolution

There are different workarounds depending on whether the vSphere Kubernetes cluster was created with one worker node or multiple worker nodes.

For Clusters with a Single Node

This workaround is for clusters with a single worker node.

The worker node pool will need to be scaled up so that the Windows-based pod can be recreated on a different node.

  1. Connect to the Supervisor cluster's context.

  2. Scale up the worker node pool by following the VMware Techdocs on scaling Tanzu Kubernetes cluster node pools.

  3. Allow the new worker node(s) to reach Running state.

  4. Perform the steps below under "For Clusters with Multiple Nodes"
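Scaling the node pool comes down to raising the replica count in the cluster spec. The fragment below is a sketch only: it assumes a TanzuKubernetesCluster with a node pool named "workers" (the pool name and counts are placeholders), and the exact schema varies by cluster API version, so follow the Techdocs for your release.

```yaml
# Fragment of a TanzuKubernetesCluster spec (names are placeholders).
# Raising replicas from 1 to 2 adds a second worker node so the stuck
# pod can be rescheduled elsewhere.
spec:
  topology:
    nodePools:
      - name: workers
        replicas: 2   # previously 1
```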

For Clusters with Multiple Nodes

This workaround is for clusters with multiple worker nodes.

The affected Windows-based pod will need to be recreated on a different node.

  1. Connect to the affected cluster's context.

  2. Identify the node on which the affected Windows-based pod is stuck in Unknown state:
    • kubectl get pods -A -o wide | grep <pod name>
  3. Cordon the node noted above:
    • kubectl cordon <above node>
  4. Delete the affected Windows-based pod to recreate it on a different node:
    • kubectl delete pod -n <pod namespace> <pod name>
  5. Confirm that the affected Windows-based pod reaches Running state on a different node:
    • kubectl get pods -A -o wide | grep <pod name>
  6. Uncordon the node that the Windows-based pod was originally running on:
    • kubectl uncordon <original node>
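The multi-node steps above can be sketched as a single script. This is a sketch, not part of the article: the node, pod, and namespace names are placeholders, and the `run` wrapper only echoes each command so the sequence can be reviewed before execution; replace `echo "+ $*"` with `"$@"` to actually run the commands.

```shell
NODE="wkld-worker-01"   # node the pod is stuck on (placeholder)
POD="win-app-0"         # affected Windows-based pod (placeholder)
NS="my-namespace"       # pod's namespace (placeholder)

# Dry-run wrapper: prints each command instead of executing it.
run() { echo "+ $*"; }

run kubectl cordon "$NODE"                 # keep new pods off the bad node
run kubectl delete pod -n "$NS" "$POD"     # force recreation elsewhere
run kubectl get pods -n "$NS" -o wide      # confirm Running on another node
run kubectl uncordon "$NODE"               # return the node to service
```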