"failed to add disk" error while deploying Supervisor Services or vSphere Pods


Article ID: 385016



Products

VMware vSphere Kubernetes Service
Tanzu Kubernetes Runtime

Issue/Introduction

Installing any Supervisor Service or creating vSphere Pods fails, and the Supervisor Service pods remain stuck.

While connected to the Supervisor cluster context, the following symptoms are observed:

  • The failed supervisor service or vSphere pod is stuck in Pending or ErrImageSetup state:
    kubectl get pods -n <supervisor-service-namespace> -o wide

     

  • Describing the above pod shows the following Events indicating that it successfully pulled the Image, where values in brackets <> will vary by supervisor service:
    kubectl describe pod -n <supervisor service namespace> <failing supervisor service pod>
    
    Events:
      Type    Reason            Age                From               Message
      ----    ------            ----               ----               -------
      Normal  Scheduled         ##s                default-scheduler  Successfully assigned <supervisor service namespace>/<supervisor service pod> to <ESXi host>
      Normal  Image             ##s                image-controller   Image <supervisor service Image> bound successfully
      Normal  SuccessfulUpdate  ##s (x# over ##s)  pod-controller     Pod CR has been successfully updated
      Normal  Pulling           ##s                kubelet            Waiting for Image <supervisor service namespace>/<supervisor service Image>
      Normal  Pulled            ##s                kubelet            Image <supervisor service namespace>/<supervisor service Image>

     

  • Viewing the events for the Supervisor service's namespace may show the below error:
    Imagedisk bind failed: Operation cannot be fulfilled on images.imagecontroller.vmware.com "<Image>": the object has been modified; please apply your changes to the latest version and try again

     

  • Near the top of the describe output for the pod in ErrImageSetup state, the Message field shows an error similar to the following:
    Status: Failed
    Reason: ErrImageSetup
    Message: failed to setup images: failed to add disk [<datastore>] fcd/<vmdk file>.vmdk: VM.AddDevice failed error = context deadline exceeded Post "https://localhost/sdk": context deadline exceeded: ErrImageSetup

     

  • The Image and its ImageDisk from the above Events output both show Ready and a non-zero Size, indicating no issues with the Image or ImageDisk:
    kubectl get image,imagedisk -A
    
    NAMESPACE      NAME                                                          STATUS   IMAGE URI                                                                                                                                           RESOLVED DISK
    <supervisor service namespace>  image.imagecontroller.vmware.com/<supervisor service image>   Ready    <image repository>@sha256:<sha> <imageDisk name>
    
    NAMESPACE                 NAME                                                                         STATUS   DISK                                   SIZE
    vmware-system-kubeimage   imagedisk.imagecontroller.vmware.com/<imageDisk name>   Ready    <disk ID>   <Size>
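
The first symptoms above can be checked with a small filter. The following is a minimal sketch, not part of the original procedure: the function names stuck_pods and has_add_disk_error are hypothetical, and the sketch assumes the default column layout of kubectl get pods for a single namespace (NAME, READY, STATUS, ...). With -A the STATUS column shifts by one.

```shell
#!/bin/sh
# Hypothetical helper: read `kubectl get pods -n <namespace> -o wide` output on
# stdin and print the name and status of pods stuck in Pending or ErrImageSetup.
stuck_pods() {
    # Column 3 is STATUS in the single-namespace layout; NR > 1 skips the header.
    awk 'NR > 1 && ($3 == "Pending" || $3 == "ErrImageSetup") { print $1, $3 }'
}

# Hypothetical helper: read `kubectl describe pod` text on stdin and succeed
# (exit status 0) only if it contains the disk-attach failure from this article.
has_add_disk_error() {
    grep -q 'failed to add disk'
}

# Example usage (needs a Supervisor cluster context, so it is left commented out):
#   kubectl get pods -n <supervisor-service-namespace> -o wide | stuck_pods
#   kubectl describe pod -n <supervisor service namespace> <failing supervisor service pod> \
#       | has_add_disk_error && echo "matches the failed to add disk symptom"
```

Both helpers only read text from standard input, so they can be tried against saved command output before being used on a live cluster.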

     

While connected over SSH to the ESXi host on which the affected Supervisor Service pod or vSphere Pod is trying to start:

  • The following log snippets are present in /var/run/log/spherelet.log, where values in brackets <> will vary by environment and supervisor service:
    "YY-MM-DDTHH:MM:SSZ" No(5) spherelet[<OP_ID>]: time=""YY-MM-DDTHH:MM:SSZ"" level=error msg="unexpected fault: &{{{{<nil> [{{} msg.disk.hotadd.Failed [{{} 1 scsi2:0}] Failed to add disk 'scsi2:0'.} {{} msg.disk.hotadd.poweron.failed [{{} 1 scsi2:0}] Failed to power on 'scsi2:0'.} {{} msg.disk.noBackEnd [{{} 
    
    1 /vmfs/volumes/<Datastore-Name>/fcd/xxxxx.vmdk}] Cannot open the disk '/vmfs/volumes/<Datastore-Name>/fcd/xxxxx.vmdk' or one of the snapshot disks it depends on. } 
    
    {{} msg.disklib.INVALIDMULTIWRITER [] Thin/TBZ/Sparse disks cannot be opened in multiwriter mode} {{} vob.fssvec.OpenFile.file.failed [] File system specific implementation of OpenFile[file] failed} {{} msg.disk.invalidClusterDisk [{{} 1 VMware ESX} {{} 2 /vmfs/volumes/<Datastore-Name>/fcd/xxxxx.vmdk}] 
    
    VMware ESX cannot open the virtual disk \"/vmfs/volumes/<Datastore-Name>/fcd/xxxxx.vmdk\" for clustering. Verify that the virtual disk was created using the thick option. }]}}} Failed to add disk 'scsi2:0'.} taskerror: Failed to add disk 'scsi2:0'." VM-OP=AddDevice namespace=<supervisor service namespace> pod=<supervisor service pod>

     

  • Similar errors can be found in the ESXi host's /var/run/log/hostd.log:
    Hostd[2100504]: [Originator@6876 sub=Vigor.Vmsvc.vm:/vmfs/volumes/<volume id>/<supervisor service pod>/<supervisor service pod>.vmx] Set disk device present message: Failed to add disk 'scsi2:0'.
    YYYY-MM-DDTHH:MM:SS.sssZ Db(167) Hostd[2100484]: --> Failed to power on 'scsi2:0'.
    YYYY-MM-DDTHH:MM:SS.sssZ Db(167) Hostd[2100484]: --> Cannot open the disk '/vmfs/volumes/<volume id>/fcd/<vmdk file>.vmdk' or one of the snapshot disks it depends on.
    YYYY-MM-DDTHH:MM:SS.sssZ Db(167) Hostd[2100484]: --> Thin/TBZ/Sparse disks cannot be opened in multiwriter mode
    YYYY-MM-DDTHH:MM:SS.sssZ Db(167) Hostd[2100484]: --> File system specific implementation of OpenFile[file] failed
    YYYY-MM-DDTHH:MM:SS.sssZ Db(167) Hostd[2100484]: --> VMware ESXi cannot open the virtual disk "/vmfs/volumes/<volume id>/fcd/<vmdk file>.vmdk" for clustering. Verify that the virtual disk was created using the thick option.
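
Both host logs share the "cannot be opened in multiwriter mode" message, so a quick count of that string can confirm whether a host is hitting this issue. This is a minimal sketch: the function name count_multiwriter_errors is hypothetical, and it assumes the log files are readable at the paths given in this article.

```shell
#!/bin/sh
# Hypothetical helper: print how many lines of a log file carry the
# multiwriter failure signature quoted in this article.
count_multiwriter_errors() {
    # $1: path to a log file, for example /var/run/log/spherelet.log
    # grep -c prints the number of matching lines; it exits non-zero when the
    # count is 0, so `|| true` keeps a clean log from looking like a failure.
    grep -c 'cannot be opened in multiwriter mode' "$1" || true
}

# Example usage on the ESXi host (left commented out):
#   count_multiwriter_errors /var/run/log/spherelet.log
#   count_multiwriter_errors /var/run/log/hostd.log
```

A non-zero count on the host that the pod was scheduled to points at the storage policy cause described below.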

 

Environment

vSphere Supervisor

Cause

The Supervisor cluster is configured to use a storage policy for Supervisor Services and vSphere Pods that does not support multiwriter mode.

VMFS is a clustered file system that, by default, prevents multiple virtual machines from opening and writing to the same virtual disk (.vmdk file) at the same time. This protects against more than one virtual machine inadvertently accessing the same .vmdk file.

The issue occurs when the VMFS rule in the storage policy does not set the volume allocation type to "Fully Initialized". As the host log messages above show, thin disks cannot be opened in multiwriter mode, so the disk fails to attach to the pod.

Resolution

Resolution 1: Create a New Storage Policy

  1. In the vSphere web UI, create a new storage policy and select "Fully Initialized" under Volume allocation type.

  2. Navigate to Workload Management > Supervisor > Configure > Storage Policy > Edit and select the newly created storage policy.

  3. Proceed to "Clear out the ImageDisk cache and retry the installation" below.

Resolution 2: Update the existing Storage Policy

  1. Edit the existing storage policy and set the volume allocation type to "Fully Initialized":
    • Edit VM Storage Policy > VMFS rules > Placement > Volume allocation type > Fully Initialized

  2. Proceed to "Clear out the ImageDisk cache and retry the installation" below.

 

Clear out the ImageDisk cache and retry the installation

  1. Connect to the Supervisor cluster context.

  2. Locate the Images associated with the failed Supervisor service or vSphere Pod install:
    1. List all images in all namespaces:
      kubectl get image -A
    2. Note the value in the RESOLVED DISK column, which is the name of the corresponding ImageDisk:
      NAMESPACE      NAME                                                          STATUS   IMAGE URI                                                                                                                                           RESOLVED DISK
      <supervisor service namespace>  image.imagecontroller.vmware.com/<supervisor service image>   Ready    <image repository>@sha256:<sha> <imageDisk name>


  3. Delete each ImageDisk found in the previous step associated with the failed Supervisor service or vSphere Pod:
    kubectl get imagedisk -A | grep <imageDisk name>
    
    kubectl delete imagedisk -n vmware-system-kubeimage <imageDisk name>

    Note: ImageDisks and their corresponding VMDKs are cached objects. Newly created pods automatically trigger the creation of new ImageDisks that use the new volume allocation type.

  4. Retry the Supervisor service or vSphere Pod install.

  5. If the install continues to fail with the errors from the Issue/Introduction section:
    1. Ensure that all corresponding ImageDisks for the Supervisor service or vSphere Pod were cleaned up before the re-install attempt:
      kubectl get image,imagedisk -A


    2. From an SSH session on the vCenter VM, use DCLI commands to confirm which storage policies are assigned to the Supervisor cluster:
      1. Retrieve the cluster ID for the Supervisor cluster (domain-c#):
        dcli com vmware vcenter namespacemanagement clusters list

         

      2. Note down the storage policy IDs for the below fields for the Supervisor cluster:
        dcli com vmware vcenter namespacemanagement clusters get --cluster <domain-c#>
        
        ephemeral_storage_policy
        image_storage: storage-policy
        master_storage_policy

         

      3. Find the corresponding storage policy ID in vCenter:
        dcli com vmware vcenter storage policies list

         

      4. Ensure that the storage policy IDs for ephemeral_storage_policy and image_storage: storage-policy match the intended storage policy, that is, the one whose VMFS rule is set to "Fully Initialized" as described above.
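
Steps 2 and 3 of the cache clean-up can be sketched as a pair of shell filters. This is an illustration under stated assumptions rather than a supported tool: the function names resolved_disks and print_delete_commands are hypothetical, the sketch assumes that each data row of kubectl get image -A has five whitespace-separated fields with RESOLVED DISK last, and it only prints the delete commands instead of running them.

```shell
#!/bin/sh
# Hypothetical helper: read `kubectl get image -A` output on stdin and print
# the RESOLVED DISK value (last field) of every Image in namespace $1.
resolved_disks() {
    # NR > 1 skips the header; NF >= 5 skips Images with no resolved disk yet.
    # sort -u collapses Images that share the same ImageDisk.
    awk -v ns="$1" 'NR > 1 && $1 == ns && NF >= 5 { print $NF }' | sort -u
}

# Hypothetical helper: turn ImageDisk names on stdin into delete commands.
# The commands are printed for review, not executed.
print_delete_commands() {
    # $1: namespace that holds the ImageDisks, as shown by `kubectl get imagedisk -A`
    while IFS= read -r disk; do
        [ -n "$disk" ] && printf 'kubectl delete imagedisk -n %s %s\n' "$1" "$disk"
    done
    return 0
}

# Example usage (needs a Supervisor cluster context, so it is left commented out):
#   kubectl get image -A | resolved_disks <supervisor service namespace> \
#       | print_delete_commands <imagedisk namespace>
```

Printing the commands first leaves room to confirm that each ImageDisk really belongs to the failed Supervisor Service or vSphere Pod before anything is deleted.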