Failed to attach cns volume with error "the resource volume is in use"

Products

VMware Tanzu Kubernetes Grid Integrated Edition

Issue/Introduction

An existing pod is restarting on a new worker node but fails to start because it can't attach its persistent volume.

It fails with "the resource volume is in use" as outlined below.

Warning  FailedAttachVolume  10s (x174 over 3m39s)    attachdetach-controller  AttachVolume.Attach failed for volume "pvc-########-d570-####-####-############" : rpc error: code = Internal desc = failed to attach disk: "########-7ba5-####-####-###########" with node: "########-895f-####-####-########" err failed to attach cns volume: "########-7ba5-####-####-############" to node vm: "VirtualMachine:vm-1 [VirtualCenterHost: host-1, UUID: ########-98ea-####-####-############, Datacenter: Datacenter [Datacenter: Datacenter:datacenter-1, VirtualCenterHost: host-1]]". fault: "(*types.LocalizedMethodFault)(0xc000e863e0)({\n DynamicData: (types.DynamicData) {\n },\n Fault: (*types.ResourceInUse)(0xc000e91040)({\n  VimFault: (types.VimFault) {\n   MethodFault: (types.MethodFault) {\n    FaultCause: (*types.LocalizedMethodFault)(<nil>),\n    FaultMessage: ([]types.LocalizableMessage) <nil>\n   }\n  },\n  Type: (string) \"\",\n  Name: (string) (len=6) \"volume\"\n }),\n LocalizedMessage: (string) (len=32) \"The resource 'volume' is in use.\"\n})\n"

Environment

TKGi with CSI volumes

Cause

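There are two VolumeAttachment objects for the PV. One shows that the PV is still attached to another node (ATTACHED is true), while the attachment for the new node has not completed (ATTACHED is false).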

# kubectl get volumeattachment | grep pvc-########-d570-####-####-##########
NAME                                                        ATTACHER                 PV                                         NODE                                            ATTACHED   AGE
csi-3695#########################   csi.vsphere.vmware.com   pvc-########-d570-####-####-############   ########-6da8-####-####-############   true       100d
csi-136c#########################   csi.vsphere.vmware.com   pvc-########-d570-####-####-############   ########-895f-####-####-############   false       10m
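
When many volumes are in play, it can help to list every PV that has more than one VolumeAttachment. The following is a minimal sketch, not a supported tool: find_multi_attached is a hypothetical helper name, and it assumes the default column layout shown above (NAME, ATTACHER, PV, NODE, ATTACHED, AGE) with the header row suppressed.

```shell
# Hypothetical helper: reads `kubectl get volumeattachment --no-headers`
# output on stdin and prints each PV that has more than one attachment,
# followed by the attachment name, node and attached state of each entry.
# Assumes the default columns: NAME ATTACHER PV NODE ATTACHED AGE.
find_multi_attached() {
  awk '
    {
      count[$3]++
      rows[$3] = rows[$3] "\n  " $1 " node=" $4 " attached=" $5
    }
    END {
      for (pv in count)
        if (count[pv] > 1)
          print pv rows[pv]
    }
  '
}

# Usage (run against the affected cluster):
#   kubectl get volumeattachment --no-headers | find_multi_attached
```

A PV reported with one entry at attached=true and another at attached=false matches the pattern described in this article.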

The csi-attacher log shows that it is failing to detach the volume from the old node because that node's VM is disconnected:

Error processing "csi-3695#########################": failed to detach: rpc error: code = Internal desc = failed to detach disk: "########-7ba5-####-####-############" from node: "########-6daf-####-####-############" err failed to detach cns volume: "########-7ba5-####-####-############" from node vm: VirtualMachine:vm-2 [VirtualCenterHost: host-2, UUID: #########-6da8-####-####-############, Datacenter: Datacenter [Datacenter: Datacenter:datacenter-1, VirtualCenterHost: host-2]]. fault: (*types.LocalizedMethodFault)(0xc000c8c760)({
 DynamicData: (types.DynamicData) {
 },
 Fault: (*types.HostNotConnected)(0xc000c8c7a0)({
  HostCommunication: (types.HostCommunication) {
   RuntimeFault: (types.RuntimeFault) {
    MethodFault: (types.MethodFault) {
     FaultCause: (*types.LocalizedMethodFault)(<nil>),
     FaultMessage: ([]types.LocalizableMessage) <nil>
    }
   }
  }
 }),
 LocalizedMessage: (string) (len=69) "Unable to communicate with the remote host, since it is disconnected."
})
, opId: "08e6ac14"

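In the vSphere UI, the old node's VM is in a disconnected state. All other VMs on the same ESX host are also disconnected, and the ESX host itself is not in a healthy state. Because the volume cannot be detached from the disconnected VM, the attach to the new node fails with "The resource 'volume' is in use."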

Resolution

Engage the ESX team to identify why the ESX host is in a faulty state and why its VMs are disconnected.