Nodes in the affected vSphere Supervisor Workload cluster are recreated in a loop every 60 to 120 minutes.
While connected to the Supervisor cluster context, the following symptoms are observed:
kubectl get machines -n <affected workload cluster namespace>
kubectl get virtualmachineimages
NAME CONTENTSOURCENAME VERSION OSTYPE FORMAT AGE
ob-<id>-tkgs-ova-photon-X-<TKR version>---vmware.X-fips.X-tkd7np <content-library-id-a> <TKR version>+vmware.X-fips.X-tkg.X vmwarePhoton64Guest ovf XmXXs
ob-<id>-tkgs-ova-photon-X-<TKR version>---vmware.X-fips.X-tkg.X <content-library-id-b> <TKR version>+vmware.X-fips.X-tkg.X vmwarePhoton64Guest ovf XmXXs
There may be multiple content libraries containing the same TKR osimages attached to any namespace in the Supervisor cluster environment:
kubectl get contentsources -A
From the vSphere Web Client, a content library is assigned on the Tanzu Kubernetes Grid Service card and on the VM Service card of each namespace.
This duplicate image issue can also occur if there is only one content library attached to the Tanzu Kubernetes Grid Service card but this content library contains duplicate TKR osimages.
The content library attached to the Tanzu Kubernetes Grid Service card is shared across all namespaces in the Supervisor cluster environment.
The environment attempts to reconcile and pull images from the content libraries associated with the namespaces in the Supervisor cluster environment.
However, if a content library containing the same TKR osimages is attached to both the Tanzu Kubernetes Grid Service card and the VM Service card in a namespace, this will result in duplicate virtualmachineimages in the environment. The same problem occurs if one content library is attached and assigned on both of the above cards.
In vSphere 7.X, the affected workload cluster nodes will continue to recreate because of the conflicting duplicate images from the above noted content libraries. The system will continue to try to deploy nodes by pulling from the associated content libraries but remain stuck in a loop due to the duplicate images.
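The recreation loop can be observed by listing the Machine objects in the affected namespace: a steady churn of new machine names with low AGE values indicates the loop. The helper below is a hypothetical sketch (the function name is not part of any VMware tooling) that counts Machines younger than an hour from `kubectl get machines` output:

```shell
# Hypothetical helper: count Machines whose AGE (last column) is still in
# seconds or minutes, i.e. machines created within the last hour. Repeatedly
# seeing fresh low-AGE machines is a signal of the recreation loop.
count_young_machines() {
  awk '$NF ~ /^([0-9]+s|[0-9]+m([0-9]+s)?)$/ { n++ } END { print n+0 }'
}

# Example usage against a live Supervisor context:
#   kubectl get machines -n <affected workload cluster namespace> --no-headers | count_young_machines
```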
Note: Due to a race condition in vSphere 7.x, duplicate images can also occur despite there being no duplicate images in the content library and despite the content library being configured properly.
This issue is resolved in vSphere 8.0u2 and higher.
The duplicate virtualmachineimages must be cleaned up from the Supervisor cluster environment. This involves checking for multiple content libraries with the same TKR osimages assigned to the same namespace, checking for duplicate TKR osimages within a single content library, and removing the duplicate references or images from the Supervisor cluster environment.
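A quick way to confirm duplicates is to look for TKR VERSION values that appear more than once in the `kubectl get virtualmachineimages` output. This is a hedged sketch (the function name is illustrative, not a VMware tool), assuming the default column layout where VERSION is the third column:

```shell
# Hypothetical helper: print VERSION values that occur more than once in
# `kubectl get virtualmachineimages` output, i.e. duplicated TKR images.
find_duplicate_versions() {
  # NR > 1 skips the header row; VERSION is the third column
  awk 'NR > 1 { print $3 }' | sort | uniq -d
}

# Example usage:
#   kubectl get virtualmachineimages | find_duplicate_versions
```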
kubectl get cluster -o yaml -n <workload cluster namespace> <workload cluster name> | grep -i "paused"
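The grep above checks whether the workload cluster is paused; reconciliation is typically paused before cleanup so the controllers stop recreating nodes, then unpaused afterwards. A hedged sketch, assuming a Cluster API-style Cluster object that honors a `spec.paused` field (verify against your environment before use):

```shell
# Hypothetical helper: emit a merge patch that sets spec.paused to $1
# (true or false) on a Cluster API-style Cluster object.
pause_patch() {
  printf '{"spec":{"paused": %s}}' "$1"
}

# Pause before cleanup, unpause after:
#   kubectl patch cluster <workload cluster name> -n <workload cluster namespace> --type merge -p "$(pause_patch true)"
#   kubectl patch cluster <workload cluster name> -n <workload cluster namespace> --type merge -p "$(pause_patch false)"
```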
Note: Only one Content Library containing TKRs should be configured in the Supervisor Cluster's Tanzu Kubernetes Grid Service:
In the example above, because duplicate TKRs exist in both Content Libraries, we delete "TKG-CL-2" and leave "TKG-Content-Library" configured in the TKG Service for the existing Supervisor Cluster.
To remove the Content Library:
* Select the Content Library:
* Click on Actions -> Delete
* Select "Yes" after the Warning message shows up.
* Go back to Content Libraries and verify that it has been deleted successfully.
Now that only one Content Library with TKRs is left in vCenter, verify that it's assigned in the Supervisor Cluster's TKG Service:
Workload Management -> Supervisors -> Select the Supervisor Cluster -> Configure -> General -> Tanzu Kubernetes Grid Service -> Content Library
kubectl get virtualmachineimage
kubectl rollout restart deploy -n vmware-system-vmop vmware-system-vmop-controller-manager
kubectl delete virtualmachineimage <duplicate image with extra alphanumerics>
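In the symptom output above, the duplicate image name carries an extra random alphanumeric tail (e.g. a suffix like "tkd7np") where the canonical name ends in "-tkg.<n>". The sketch below is a hypothetical helper built on that naming assumption; always compare the two names yourself before deleting:

```shell
# Hypothetical helper, assuming canonical TKR image names end in "-tkg.<n>":
# returns success (0) when a name looks like the duplicate with extra
# alphanumerics, i.e. a candidate for deletion.
is_duplicate_name() {
  case "$1" in
    *-tkg.[0-9]*) return 1 ;;  # canonical name, keep
    *)            return 0 ;;  # extra suffix, candidate for deletion
  esac
}
```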
kubectl get cluster -o yaml -n <workload cluster namespace> <workload cluster name> | grep -i "paused"
Impact/Risks:
Caution: When modifying or deleting a Content Library from vCenter, make sure it is not actively used by any other Supervisor cluster, Tanzu Kubernetes cluster, etc.