Installing any Supervisor Service or creating vSphere Pods fails, and the Supervisor Service pods remain stuck.
While connected to the Supervisor cluster context, the following symptoms are observed:
kubectl get pods -n <supervisor service namespace> -o wide
kubectl describe pod -n <supervisor service namespace> <failing supervisor service pod>
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled ##s default-scheduler Successfully assigned <supervisor service namespace>/<supervisor service pod> to <ESXi host>
Normal Image ##s image-controller Image <supervisor service Image> bound successfully
Normal SuccessfulUpdate ##s (x# over ##s) pod-controller Pod CR has been successfully updated
Normal Pulling ##s kubelet Waiting for Image <supervisor service namespace>/<supervisor service Image>
Normal Pulled ##s kubelet Image <supervisor service namespace>/<supervisor service Image>
Imagedisk bind failed: Operation cannot be fulfilled on images.imagecontroller.vmware.com "<Image>": the object has been modified; please apply your changes to the latest version and try again
Status: Failed
Reason: ErrImageSetup
Message: failed to setup images: failed to add disk [<datastore>] fcd/<vmdk file>.vmdk: VM.AddDevice failed error = context deadline exceeded Post "https://localhost/sdk": context deadline exceeded: ErrImageSetup
kubectl get image,imagedisk -A
NAMESPACE NAME STATUS IMAGE URI RESOLVED DISK
<supervisor service namespace> image.imagecontroller.vmware.com/<supervisor service image> Ready <image repository>@sha256:<sha> <imageDisk name>
NAMESPACE NAME STATUS DISK SIZE
vmware-system-kubeimage imagedisk.imagecontroller.vmware.com/<imageDisk name> Ready <disk ID> <Size>
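To list every affected pod at once, the pod listing can be filtered on the failure reason. This is a minimal sketch, not an official procedure. It assumes the STATUS column of `kubectl get pods -A` reports `ErrImageSetup` for affected pods, as the Reason field above suggests. The function name `filter_err_image_setup` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: reads `kubectl get pods -A` output on stdin and prints
# "<namespace>/<pod>" for every row whose STATUS column is ErrImageSetup.
# Assumes the default column order: NAMESPACE NAME READY STATUS RESTARTS AGE.
filter_err_image_setup() {
    awk 'NR > 1 && $4 == "ErrImageSetup" { print $1 "/" $2 }'
}

# Usage (run from the Supervisor cluster context):
#   kubectl get pods -A | filter_err_image_setup
```

If the STATUS column shows a different value in your environment, adjust the string compared against `$4`.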
While connected over SSH to the ESXi host on which the affected Supervisor Service pod or vSphere Pod is trying to start:
/var/run/log/spherelet.log, where values in brackets <> will vary by environment and supervisor service:
YY-MM-DDTHH:MM:SSZ No(5) spherelet[<OP_ID>]: time="YY-MM-DDTHH:MM:SSZ" level=error msg="unexpected fault: &{{{{<nil> [{{} msg.disk.hotadd.Failed [{{} 1 scsi2:0}] Failed to add disk 'scsi2:0'.} {{} msg.disk.hotadd.poweron.failed [{{} 1 scsi2:0}] Failed to power on 'scsi2:0'.} {{} msg.disk.noBackEnd [{{} 1 /vmfs/volumes/<Datastore-Name>/fcd/xxxxx.vmdk}] Cannot open the disk '/vmfs/volumes/<Datastore-Name>/fcd/xxxxx.vmdk' or one of the snapshot disks it depends on. } {{} msg.disklib.INVALIDMULTIWRITER [] Thin/TBZ/Sparse disks cannot be opened in multiwriter mode} {{} vob.fssvec.OpenFile.file.failed [] File system specific implementation of OpenFile[file] failed} {{} msg.disk.invalidClusterDisk [{{} 1 VMware ESX} {{} 2 /vmfs/volumes/<Datastore-Name>/fcd/xxxxx.vmdk}] VMware ESX cannot open the virtual disk \"/vmfs/volumes/<Datastore-Name>/fcd/xxxxx.vmdk\" for clustering. Verify that the virtual disk was created using the thick option. }]}}} Failed to add disk 'scsi2:0'.} taskerror: Failed to add disk 'scsi2:0'." VM-OP=AddDevice namespace=<supervisor service namespace> pod=<supervisor service pod>
Hostd[2100504]: [Originator@6876 sub=Vigor.Vmsvc.vm:/vmfs/volumes/<volume id>/<supervisor service pod>/<supervisor service pod>.vmx] Set disk device present message: Failed to add disk 'scsi2:0'.
YYYY-MM-DDTHH:MM:SS.sssZ Db(167) Hostd[2100484]: --> Failed to power on 'scsi2:0'.
YYYY-MM-DDTHH:MM:SS.sssZ Db(167) Hostd[2100484]: --> Cannot open the disk '/vmfs/volumes/<volume id>/fcd/<vmdk file>.vmdk' or one of the snapshot disks it depends on.
YYYY-MM-DDTHH:MM:SS.sssZ Db(167) Hostd[2100484]: --> Thin/TBZ/Sparse disks cannot be opened in multiwriter mode
YYYY-MM-DDTHH:MM:SS.sssZ Db(167) Hostd[2100484]: --> File system specific implementation of OpenFile[file] failed
YYYY-MM-DDTHH:MM:SS.sssZ Db(167) Hostd[2100484]: --> VMware ESXi cannot open the virtual disk "/vmfs/volumes/<volume id>/fcd/<vmdk file>.vmdk" for clustering. Verify that the virtual disk was created using the thick option.
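The log signature above can be checked for quickly on the ESXi host. This is a minimal sketch, assuming a POSIX shell with `grep` available. The function name `has_multiwriter_error` is hypothetical. The `hostd.log` path is an assumption inferred from the Hostd entries; only `spherelet.log` is named in this article.

```shell
#!/bin/sh
# Hypothetical helper: returns 0 if the given log file contains the
# multiwriter failure message quoted in the log excerpts above.
has_multiwriter_error() {
    grep -q 'cannot be opened in multiwriter mode' "$1"
}

# Check both logs; unreadable or missing files are skipped silently.
# /var/run/log/hostd.log is an assumed path, not taken from this article.
for log in /var/run/log/spherelet.log /var/run/log/hostd.log; do
    if [ -r "$log" ] && has_multiwriter_error "$log"; then
        echo "multiwriter disk error found in $log"
    fi
done
```

A match only shows that the signature is present. Confirm the storage policy configuration before making any changes.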
vSphere Supervisor
The Supervisor cluster is configured to use a storage policy for Supervisor Services and vSphere Pods that does not support multiwriter mode.
VMFS is a clustered file system that, by default, prevents multiple virtual machines from opening and writing to the same virtual disk (.vmdk file). This keeps more than one virtual machine from inadvertently accessing the same .vmdk file.
This issue occurs when the VMFS rule set of the storage policy does not have the volume allocation type set to "Fully Initialized".
Resolution 1: Create a New Storage Policy
Resolution 2: Update the existing Storage Policy
Clear out the ImageDisk cache and retry the installation
kubectl get image -A
NAMESPACE NAME STATUS IMAGE URI RESOLVED DISK
<supervisor service namespace> image.imagecontroller.vmware.com/<supervisor service image> Ready <image repository>@sha256:<sha> <imageDisk name>
kubectl get imagedisk -A | grep <imageDisk name>
kubectl delete imageDisk -n vmware-system-kubeimage <imageDiskName>
Note: ImageDisks and their corresponding VMDKs are cached objects. Newly created pods will automatically create new ImageDisks using the new volume allocation type.
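The lookup-and-delete steps above can be sketched as two small filters. This is an illustration under stated assumptions rather than a supported tool. It assumes the column layouts shown in this article. The function names `resolved_disks` and `print_delete_cmd` are hypothetical. The second function only prints the delete command so it can be reviewed first, and it takes the namespace from the `kubectl get imagedisk -A` listing instead of hard-coding one.

```shell
#!/bin/sh
# Hypothetical helper: reads `kubectl get image -A` output on stdin and prints
# the RESOLVED DISK value (last column) for every image in namespace $1.
# Assumes each matching row has a resolved disk, as in the Ready rows above.
resolved_disks() {
    awk -v ns="$1" 'NR > 1 && $1 == ns { print $NF }'
}

# Hypothetical helper: reads `kubectl get imagedisk -A` output on stdin and
# prints (does not run) a delete command for the ImageDisk named $1, using
# the namespace reported in the listing.
print_delete_cmd() {
    awk -v disk="$1" 'NR > 1 {
        n = $2; sub(/^.*\//, "", n)   # strip any "kind.group/" name prefix
        if (n == disk) print "kubectl delete imagedisk -n " $1 " " n
    }'
}

# Usage (run from the Supervisor cluster context):
#   kubectl get image -A | resolved_disks "<supervisor service namespace>"
#   kubectl get imagedisk -A | print_delete_cmd "<imageDisk name>"
```

Printing the command instead of executing it leaves the final decision with the operator, which is the safer default when removing cached objects.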
kubectl get image,imagedisk -A
dcli com vmware vcenter namespacemanagement clusters list
dcli com vmware vcenter namespacemanagement clusters get --cluster <domain-c#>
ephemeral_storage_policy
image_storage: storage-policy
master_storage_policy
dcli com vmware vcenter storage policies list