A pod in the guest cluster is stuck in ContainerCreating state due to a Block/RWO volume attachment failure.
Article ID: 305322

Updated On:

Products

VMware vSphere ESXi, VMware vSphere Kubernetes Service

Issue/Introduction

 

  • A pod in the guest cluster is stuck in ContainerCreating state.

  • Describing the pod shows a volume attachment failure event because Cloud Native Storage (CNS) fails to retrieve the datastore backing the volume.

Warning FailedAttachVolume 18s               attachdetach-controller AttachVolume.Attach failed for volume "pvc-########-####-####-2222-############" : rpc error: code = Internal desc = observed Error: "ServerFaultCode: CNS: Failed to retrieve datastore for vol ########-####-####-0000-############. (vim.fault.NotFound) {\n  faultCause = (vmodl.MethodFault) null, \n  faultMessage = <unset>\n  msg = \"The vStorageObject (vim.vslm.ID) {\n  dynamicType = null,\n  dynamicProperty = null,\n  id = ########-####-####-0000-############\n} was not found\"\n}" is set on the volume "########-####-####-####-############-########-####-####-2222-############" on virtualmachine "tkgs-cluster-1-worker-nodepool-##-####-########-####"

  • The volume metadata is missing from the Pandora database on the vCenter.

root@vcsa1 [ ~ ]# /opt/vmware/vpostgres/current/bin/psql -d VCDB -U postgres -c "SELECT * from vpx_storage_object_info where id='########-####-####-0000-############'"
 id | name | capacity | datastore_url | create_time | v_clock | backing_object_id | disk_path | used_capacity
----+------+----------+---------------+-------------+---------+-------------------+-----------+---------------
(0 rows)
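When the full event text is available, the volume ID quoted in the fault message ("Failed to retrieve datastore for vol ...") can be pulled out with standard shell tools. A sketch against a placeholder sample message (the ID below is not a real one):

```shell
# Sketch: pull the FCD volume ID out of a FailedAttachVolume fault message.
# The message and ID are placeholders modeled on the event shown above.
msg='ServerFaultCode: CNS: Failed to retrieve datastore for vol 11111111-2222-3333-0000-444444444444. (vim.fault.NotFound)'
vol_id=$(printf '%s' "$msg" | grep -oE 'for vol [0-9a-f-]+' | awk '{print $3}')
echo "$vol_id"
```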

 

Environment

  • VMware vSphere 7.0 with Tanzu
  • TKGm: 2.51
 

Cause

  • The volume is present on the backend datastore but missing from the Pandora database.

  • Due to the para-virtualized architecture of the CSI in the guest clusters, each volume is represented by two names: one is generated by the pvCSI in the guest cluster and the other is generated by the CSI in the supervisor cluster.

    • The latter is the value that is stored in the database. Therefore, don't filter by the volume name that was mentioned in the pod events.

    • Instead, use the volume ID. If the pod events do not mention that ID, see the Workaround section for details on how to obtain it.

Resolution

Resolved in vCenter Server 8.0 Update 3e or above.

Note: The Pandora database is no longer used in vSphere 8.0.

Workaround

  1. Identify the volume ID, if it isn't mentioned in the pod events.

  • The guest clusters utilize a para-virtualized CSI.

  • This means the PV on the guest clusters refers to a PVC on the supervisor cluster.

  • This PVC is bound to a PV whose VolumeHandle is the volume ID.

    1. Describe the problematic volume in the guest cluster and get the VolumeHandle.

      $ kubectl describe pv pvc-########-####-####-2222-############ | grep -i VolumeHandle

      Example output:
      VolumeHandle:   ########-####-####-####-############-########-####-####-2222-############

    2. Go to the supervisor cluster and get the PV that is bound to this PVC.

      # kubectl get pvc -A | grep -i ########-####-####-####-############-########-####-####-2222-############

      Example output:
      tanzu-1   ########-####-####-####-############-########-####-####-2222-############  Bound  pvc-########-####-####-3333-############  1Gi    RWO      tanzu     50d

    3. Get the VolumeHandle of this PV; this is the volume ID.

      root@42320f0e4760472d1a96bbbd0bdaa921 [ ~ ]# kubectl describe pv pvc-########-####-####-3333-############  | grep -i VolumeHandle

      Example output:
      VolumeHandle:   ########-####-####-0000-############

      Note: The ID can also be identified from the pvCSI controller logs or the CNS logs (vsanvcmgmtd.log) if the logs have not rolled over.
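The lookups in steps 2 and 3 can also be scripted: in the supervisor-cluster `kubectl get pvc -A` output, the bound PV name is the 4th column. A sketch against a placeholder sample line (the namespace, IDs, and sizes below are not from a real cluster):

```shell
# Sketch: extract the bound PV name (4th whitespace-separated column) from a
# supervisor-cluster "kubectl get pvc -A" line. The line is a placeholder sample.
line='tanzu-1   aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee-11111111-2222-3333-2222-444444444444   Bound   pvc-11111111-2222-3333-3333-444444444444   1Gi   RWO   tanzu   50d'
pv_name=$(printf '%s' "$line" | awk '{print $4}')
echo "$pv_name"
```

In practice the line would come from `kubectl get pvc -A | grep -i <VolumeHandle>` rather than a hard-coded string.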

  2. Given the datastore name that is backing the volume, get the managed object ID (MOID) of that datastore by running this command on the vCenter:

    # dcli com vmware vcenter datastore list | grep -i Tanzu

    Example output:
    |datastore-2009|Tanzu      |VMFS|5272240128 |241323474944|


    In this example, the MOID is datastore-2009
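Since the dcli table row is pipe-delimited, the MOID can also be cut out directly. A sketch against a sample line mirroring the output above:

```shell
# Sketch: the MOID is the first pipe-delimited field of the dcli table row
# (field $2 for awk, because the line starts with a delimiter).
line='|datastore-2009|Tanzu      |VMFS|5272240128 |241323474944|'
moid=$(printf '%s' "$line" | awk -F'|' '{print $2}')
echo "$moid"
```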

  3. Go to https://<vcenterIp>/mob and log in with the SSO administrator account.

    1. Go to content > VStorageObjectManager > VCenterUpdateVStorageObjectMetadataEx_Task

    2. Insert the volume ID, the datastore MOID, and the metadata KeyValue into the corresponding input fields.

    3. Click Invoke Method.

  4. Verify that the database has been updated successfully.

    # /opt/vmware/vpostgres/current/bin/psql -d VCDB -U postgres -c "SELECT * from vpx_storage_object_info where id='########-####-####-0000-############'"

    Example output:
                      id                  |                   name                   | capacity |                      datastore_url                      |       create_time       | v_clock | backing_object_id |       disk_path        | used_capacity
    --------------------------------------+------------------------------------------+----------+---------------------------------------------------------+-------------------------+---------+-------------------+------------------------+---------------
     ########-####-####-0000-############ | pvc-########-####-####-3333-############ |     1024 | ds:///vmfs/volumes/########-########-1111-############/ | 2022-09-13 21:16:54.459 |      91 |                   | [Tanzu] fcd/#####.vmdk |            -1
    (1 row)

  5. Wait 2-3 minutes and check the pod status. If the pod still does not run, delete it and recreate it.

If the issue persists, there may be discrepancies in the managed virtual disk catalog; refer to Reconciling Discrepancies in the Managed Virtual Disk Catalog.

Additional Information

Check if the vmdk that is backing the volume exists in the datastore.

  1. Take the first 8 characters of the volume ID and split them into space-separated pairs: <## ## ## ##>
  2. SSH to one of the ESXi hosts and search for the vmdk file containing the FCD ID.

    [root@esxi:~] IFS=$'\n'; for i in `find /vmfs/volumes -iname '*.vmdk' -type f | grep -vE "flat|sesparse|delta|rdm|ctk"`; do echo "$i"; grep 'fcd.uuid' "$i"; done | grep -B1 "<## ## ## ##>"
  3. If the vmdk exists, it prints the complete path of the VMDK containing the datastore UUID.

    /vmfs/volumes/########-########-1111-############/fcd/#####.vmdk
    ddb.fcd.uuid = "<## ## ## ##> ## ## ##-## ## ## ## ## ## ##"
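The <## ## ## ##> byte-pair pattern used in step 1 can be generated from the volume ID itself. A sketch with a placeholder ID:

```shell
# Sketch: split the first 8 hex characters of the volume ID into the
# space-separated byte pairs that appear in the vmdk's ddb.fcd.uuid line.
vol_id='11111111-2222-3333-0000-444444444444'
pattern=$(printf '%s' "$vol_id" | cut -c1-8 | sed 's/../& /g;s/ $//')
echo "$pattern"
```

The resulting string is what the `grep -B1` in step 2 searches for.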

Note: Take note of the datastore name; it is needed as part of the workaround.

[root@esxi:~] localcli storage filesystem list | grep -i ########-########-1111-############
/vmfs/volumes/########-########-1111-############  Tanzu                                      ########-########-1111-############  true     VMFS-6  241323474944   5274337280

In this example, the datastore's name is Tanzu.
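The datastore name is the 2nd whitespace-separated field of the localcli row, so it can be captured in a variable. A sketch against a placeholder sample line (UUID and sizes are not real):

```shell
# Sketch: the datastore name is the 2nd field of the
# "localcli storage filesystem list" row. The line is a placeholder sample.
line='/vmfs/volumes/11111111-22222222-1111-333333333333  Tanzu  11111111-22222222-1111-333333333333  true  VMFS-6  241323474944  5274337280'
ds_name=$(printf '%s' "$line" | awk '{print $2}')
echo "$ds_name"
```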