CNS Volume Attachment issues on Kubernetes Worker Nodes configured with PCI Passthrough devices

Article ID: 427499

Updated On:

Products

VMware vCenter Server

Issue/Introduction

  • Kubernetes Pods running on OpenShift or vSphere Kubernetes Service worker nodes fail to attach their Persistent Volume Claims (PVCs) after the worker node VM is abruptly powered off or restarted on a different ESXi host. This issue specifically impacts virtual machines configured with PCI Passthrough devices that were forced to power off after a failed vMotion attempt.
  • Attempts to place an ESXi host into Maintenance Mode hang or fail because a worker node VM with a PCI Passthrough device cannot be live migrated; the VM is then manually powered off to force the host into Maintenance Mode.
  • Upon powering on the worker node VM on the same or a different host, pods remain in a ContainerCreating or Pending state.
  • vsphere-csi-controller pod logs show "Node not found" errors referencing the node UUID: 
    Readonly:false Secrets:map[] VolumeContext:map[storage.kubernetes.io/csiProvisionerIdentity:1769022798126-6281-csi.vsphere.vmware.com type:vSphere CNS Block Volume] ###_NoUnkeyedLiteral:{} ###_unrecognized:[] ###_sizecache:0}","TraceId":"d#######-####-####-####-########13"}
    {"level":"error","time":"YYYY-MM-DDTHH:MM:SS","caller":"node/manager.go:172","msg":"Node not found with nodeName 4######-####-####-####-########b"
  • The cluster returns to a normal state only after the worker node is recreated.
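The stuck pods, the stale VolumeAttachment objects, and the CSI controller errors described above can be confirmed from the cluster. The following is a minimal sketch; the vmware-system-csi namespace and the deployment/container names are typical of vSphere Kubernetes Service deployments and may differ in your environment (for example, OpenShift installs the driver elsewhere):

```shell
# Sketch: confirm the symptoms with read-only queries.
# Namespace and deployment/container names are assumptions typical of
# vSphere Kubernetes Service; adjust for your distribution.
if kubectl cluster-info >/dev/null 2>&1; then
  # Pods stuck in ContainerCreating/Pending across all namespaces
  kubectl get pods --all-namespaces | grep -E 'ContainerCreating|Pending' || true

  # VolumeAttachment objects that may still reference the old worker node
  kubectl get volumeattachments -o wide

  # "Node not found" errors from the CSI controller
  kubectl logs -n vmware-system-csi deploy/vsphere-csi-controller \
    -c vsphere-csi-controller --tail=200 | grep 'Node not found' || true
fi
```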

Environment

VMware vCenter Server
VMware vSphere Kubernetes Service

Cause

This issue is caused by an inherent limitation of PCI Passthrough devices: they bind a VM directly to the physical hardware of a specific ESXi host, so the VM cannot be live migrated. 
When the VM is abruptly powered off to bypass the hang, the vSphere Container Storage Plug-in (CSI) does not receive a graceful detach signal. When the VM is powered on again, the CSI controller still references the old node UUID and believes the volume is still attached to the old node. 
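The identity the CSI driver keys on is the node's provider ID, which embeds the VM instance UUID in the form vsphere://&lt;uuid&gt;. As a hedged sketch, the following lists it per node so it can be compared against the UUID reported in the "Node not found" error:

```shell
# Sketch: list each node's provider ID, which embeds the VM instance UUID
# that the vSphere CSI driver uses as the node identity (vsphere://<uuid>).
if kubectl cluster-info >/dev/null 2>&1; then
  kubectl get nodes \
    -o custom-columns=NAME:.metadata.name,PROVIDER_ID:.spec.providerID
fi
```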

Resolution

Follow these considerations and best practices for GPU-enabled workloads: 

  1. Avoid PCI Passthrough for general-purpose worker nodes when frequent vMotion or host downtime is expected. PCI Passthrough ties the VM to the physical host hardware, and the VM must be fully powered off to move to another host, which disrupts CNS storage attachments and Kubernetes scheduling.
  2. The recommended practice is to use vGPU on the worker nodes instead of PCI Passthrough. vGPU profiles abstract the hardware, allowing the VM to be vMotioned to other compatible hosts without powering off, so CNS volumes remain properly attached during the migration.

Refer to the following documentation for more information on how to configure NVIDIA GPU devices on Kubernetes workloads: Configuring NVIDIA GPU devices for Kubernetes Cluster

Additional Information

If the environment requires PCI Passthrough on the worker nodes, consider the following operational procedure to place the ESXi host into Maintenance Mode: 

  • Drain the Kubernetes worker node gracefully and wait for all pods to terminate and volumes to detach. Refer to the following Kubernetes documentation on how to drain a node: Safely Drain a Node.
  • Gracefully shut down the VM, then proceed with placing the ESXi host into Maintenance Mode.
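The steps above can be sketched as kubectl commands. The node name is a placeholder, and the drain flags shown are common defaults rather than values mandated by this article:

```shell
# Sketch of the drain-then-shutdown procedure. NODE is a placeholder.
NODE=worker-node-1

if kubectl cluster-info >/dev/null 2>&1; then
  # Evict all pods from the node (DaemonSet pods are skipped, not evicted)
  kubectl drain "$NODE" --ignore-daemonsets --delete-emptydir-data --timeout=10m

  # Verify no CNS volumes remain attached before powering the VM off
  kubectl get volumeattachments -o wide | grep "$NODE" || echo "no attachments on $NODE"
fi

# Then gracefully shut down the guest OS of the worker VM (for example,
# "Shut Down Guest OS" in the vSphere Client) before entering Maintenance Mode.
```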