VolumeAttachment Failed to Pod as it was unable to be staged to /var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc.../globalmount

Article ID: 384897


Products

VMware vSphere with Tanzu

Issue/Introduction

Pod is stuck in ContainerCreating state with an error message indicating that MountVolume.SetUp failed for its associated PersistentVolumeClaim (PVC).

While connected to the cluster context where the pod is trying to run, the following symptoms are present:

-Performing a describe on the ContainerCreating pod returns an error message similar to the below, where the PVC name and volume ID vary by environment:

Warning  FailedMount ##s (x# over ##s)  kubelet     MountVolume.SetUp failed for volume "pvc-i-am-an-example-pvc" : rpc error: code = FailedPrecondition desc = volume ID: "this-is-an-example-volume-id-string" does not appear staged to "/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-i-am-an-example-pvc/globalmount"
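-"kubectl describe pod <pod name> -n <pod namespace>" can be used to view the above error message on the pod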


-Describing the volumeattachment associated with the above noted PVC shows that its Attached status is true

-"kubectl get volumeattachment -A | grep <pvc>" can be used to locate the volumeattachment

-The vsphere-csi-controller pods are healthy in Running state on both the Supervisor and affected cluster
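-For example, "kubectl get pods -A | grep csi" can be used in both the Supervisor and affected cluster contexts to locate and check the state of these CSI pods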

-The vsphere-csi-node pod on the same worker node as the pod in ContainerCreating state logs the same "does not appear staged" error message for the same PVC

-"kubectl get pods <pod name> -n <pod namespace> -o wide" can be used to find the worker node that this ContainerCreating pod is running on

 

While directly connected (SSH) to the worker node where the pod is trying to run, the following symptoms are present:

-Performing an ls on the globalmount path from the above "does not appear staged" error message returns an "Input/output error"
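-For example: "ls /var/lib/kubelet/plugins/kubernetes.io/csi/pv/<pvc name>/globalmount", substituting the exact path reported in the error message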

-On a node with a healthy filesystem, the noted globalmount path contains directories for containers of pods on the node.

Environment

vSphere with Tanzu 7.0

vSphere with Tanzu 8.0

This can occur on a vSphere Kubernetes cluster regardless of whether it is managed by Tanzu Mission Control (TMC).

Cause

This is indicative of a filesystem issue on the worker node that the pod in ContainerCreating state is attempting to start on.

The system is unable to properly stage the globalmount directory for the pod's container(s) due to this filesystem issue.

This globalmount directory staging action is necessary for volume attachment and mount setup when starting up a pod.

Resolution

The pod in ContainerCreating state will need to be moved to another worker node with a healthy filesystem.

 

While connected to the cluster context where the pod is trying to run:

  1. Locate the node that the ContainerCreating pod is currently running on:
    • kubectl get pod <pod name> -n <pod namespace> -o wide
  2. Cordon the node:
    • kubectl get nodes
    • kubectl cordon <node name that the ContainerCreating pod is running on>
  3. Confirm that the node now shows "SchedulingDisabled" state:
    • kubectl get nodes
  4. Delete the ContainerCreating pod to recreate it on a different worker node with a healthy filesystem:
    • ContainerCreating state means that the pod has yet to start its containers. Thus, there is no impact from deleting this pod, which has yet to start.
    • kubectl delete pod <pod name> -n <pod namespace>
  5. Confirm that the pod is now in Running state on a different worker node:
    • kubectl get pod <pod name> -n <pod namespace> -o wide
  6. If the pod does not reach Running state, describe the pod and volumeattachment for more information:
    • kubectl describe pod <pod name> -n <pod namespace>
    • kubectl get volumeattachment | grep <pvc associated with the pod>
    • kubectl describe volumeattachment <volumeattachment name>
    • Note: A Multi-Attach error may appear during startup of the pod on the other worker node. This is because the volumeattachment is moving from the previous worker node to the new worker node that the pod is starting up on. This error should not persist beyond the initial startup of the recreated pod.
    • It is possible that there are multiple worker nodes with this globalmount staging error in the cluster. More than one worker node may need to be cordoned and the pod recreated multiple times to get it to run on a worker node with a healthy filesystem.

  7. The filesystem on the worker node that was cordoned will need to be investigated further regarding the Input/Output error affecting its globalmount directory.
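    • As a starting point, the node's kernel logs can be checked for I/O errors with, for example, "dmesg | grep -i 'i/o error'" or "journalctl -k | grep -i 'i/o error'"; the exact commands available may vary by node OS.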