TKGS Cluster is rolling continuously. VMOP pod logs show "VirtualMachine "msg"="patch failed" "


Article ID: 313108


Products

VMware vCenter Server

Issue/Introduction

TKGS Guest clusters roll continuously, VMOP pods fail to pull images, and the VM Service card on a namespace has a different content library that also contains TKrs. This causes the TKrs to flip between the two content libraries.

Environment

VMware vCenter Server 7.0.x

Cause

This is due to the TKr logic pulling images from both the VM Service content libraries and the global TKr (TKGS repo) content library.

This creates duplicate vmimage objects; the duplicate's name ends with a random character suffix (for example, -tkd7np in the output below).

For example:

root@420d148f885b0973b04b21207f304e73 [ ~ ]# kubectl get vmimage
NAME                                                              CONTENTSOURCENAME                      VERSION                          OSTYPE                FORMAT   AGE
ob-22757567-tkgs-ova-photon-3-v1.25.13---vmware.1-fips.1-tkd7np   dc2a2c57-8e48-422b-a762-fddcaddeb5f2   v1.25.13+vmware.1-fips.1-tkg.1   vmwarePhoton64Guest   ovf      8m50s
ob-22757567-tkgs-ova-photon-3-v1.25.13---vmware.1-fips.1-tkg.1    461f6248-8e8d-4c9f-a4cc-61c75dc98ce1   v1.25.13+vmware.1-fips.1-tkg.1   vmwarePhoton64Guest   ovf      26m
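Whether duplicates exist can be confirmed by grouping the output by the VERSION column; any version that appears more than once indicates a duplicate vmimage. A minimal sketch (not part of the documented procedure), using the two sample rows above in place of live `kubectl get vmimage --no-headers` output:

```shell
# Sample rows from this article, standing in for live output of:
#   kubectl get vmimage --no-headers
vmimage_rows='ob-22757567-tkgs-ova-photon-3-v1.25.13---vmware.1-fips.1-tkd7np dc2a2c57-8e48-422b-a762-fddcaddeb5f2 v1.25.13+vmware.1-fips.1-tkg.1 vmwarePhoton64Guest ovf 8m50s
ob-22757567-tkgs-ova-photon-3-v1.25.13---vmware.1-fips.1-tkg.1 461f6248-8e8d-4c9f-a4cc-61c75dc98ce1 v1.25.13+vmware.1-fips.1-tkg.1 vmwarePhoton64Guest ovf 26m'

# Column 3 is VERSION; `uniq -d` prints only versions that occur more
# than once, i.e. versions with duplicate vmimages.
printf '%s\n' "$vmimage_rows" | awk '{print $3}' | sort | uniq -d
```

An empty result means no duplicates; here it prints v1.25.13+vmware.1-fips.1-tkg.1 once, flagging that version as duplicated.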

This can be caused by a few different configurations:

1. A duplicate image was added to the global TKr content library.
2. A different content library containing a TKr was added to the VM Service card.
3. The same content library is configured for both the VM Service card and the global TKr content library.


Note: The global TKr content library refers to the content library configured on the Tanzu Kubernetes Grid Service card and is shared by all Supervisor Namespaces under that Supervisor Cluster.

It can also be configured from the inventory screen at clusterObject->Configure->Supervisor Cluster->General->Tanzu Kubernetes Grid Service.

Changing it in either location has the same effect: the update applies to all Supervisor Namespaces under that Supervisor Cluster.

Resolution

Remove any TKRs from the content libraries listed on the VM Service card.

The VM Service card should only contain content libraries that hold VM Service VMs; otherwise the cluster is susceptible to this issue.

If the cluster is not being used for VM Service VMs, the VM Service card should have 0 associated content libraries.


For example, this namespace has the wrong configuration: multiple content libraries on the VM Service card include TKRs, one of which is also the library configured as the global TKr content library (TKr_contentLibrary_local).

root@420d148f885b0973b04b21207f304e73 [ ~ ]# kubectl get vmimage
NAME                                                              CONTENTSOURCENAME                      VERSION                          OSTYPE                FORMAT   AGE
ob-22757567-tkgs-ova-photon-3-v1.25.13---vmware.1-fips.1-t5clxh   745fae78-c623-475b-9ced-951142f2b271   v1.25.13+vmware.1-fips.1-tkg.1   vmwarePhoton64Guest   ovf      3m27s
ob-22757567-tkgs-ova-photon-3-v1.25.13---vmware.1-fips.1-tkd7np   dc2a2c57-8e48-422b-a762-fddcaddeb5f2   v1.25.13+vmware.1-fips.1-tkg.1   vmwarePhoton64Guest   ovf      24m
ob-22757567-tkgs-ova-photon-3-v1.25.13---vmware.1-fips.1-tkg.1    461f6248-8e8d-4c9f-a4cc-61c75dc98ce1   v1.25.13+vmware.1-fips.1-tkg.1   vmwarePhoton64Guest   ovf      41m


The resolution is to remove the content libraries containing TKRs from the VM Service card, then validate that the duplicate vmimages are gone.

root@420d148f885b0973b04b21207f304e73 [ ~ ]# kubectl get vmimage
NAME                                                             CONTENTSOURCENAME                      VERSION                          OSTYPE                FORMAT   AGE
ob-22757567-tkgs-ova-photon-3-v1.25.13---vmware.1-fips.1-tkg.1   461f6248-8e8d-4c9f-a4cc-61c75dc98ce1   v1.25.13+vmware.1-fips.1-tkg.1   vmwarePhoton64Guest   ovf      49m


If stale images remain, there are two troubleshooting steps to remove them.

1. Restart the VMOP pods. 

root@420d148f885b0973b04b21207f304e73 [ ~ ]# kubectl rollout restart deployment -n vmware-system-vmop vmware-system-vmop-controller-manager
deployment.apps/vmware-system-vmop-controller-manager restarted
root@420d148f885b0973b04b21207f304e73 [ ~ ]#

2. Delete the duplicate images.

root@420d148f885b0973b04b21207f304e73 [ ~ ]# kubectl delete vmimage <Name_of_duplicate_image_ending_with_-xxxxx>

Provided that no duplicate content library is configured on the VM Service card for any Namespace under that Supervisor, the vmimage should not return. If it does return, check the VMOP logs to determine which content library the vmimage is being pulled from. These objects are repopulated by the VMOP pod, so as an additional troubleshooting step it is safe to delete all vmimages; only one should come back.
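Before deleting, it can help to list exactly which image names belong to a duplicated version. A minimal sketch (not part of the documented procedure), using the three-row sample listing from this article in place of live `kubectl get vmimage --no-headers` output:

```shell
# Sample listing from this article, standing in for live output of:
#   kubectl get vmimage --no-headers
listing='ob-22757567-tkgs-ova-photon-3-v1.25.13---vmware.1-fips.1-t5clxh 745fae78-c623-475b-9ced-951142f2b271 v1.25.13+vmware.1-fips.1-tkg.1 vmwarePhoton64Guest ovf 3m27s
ob-22757567-tkgs-ova-photon-3-v1.25.13---vmware.1-fips.1-tkd7np dc2a2c57-8e48-422b-a762-fddcaddeb5f2 v1.25.13+vmware.1-fips.1-tkg.1 vmwarePhoton64Guest ovf 24m
ob-22757567-tkgs-ova-photon-3-v1.25.13---vmware.1-fips.1-tkg.1 461f6248-8e8d-4c9f-a4cc-61c75dc98ce1 v1.25.13+vmware.1-fips.1-tkg.1 vmwarePhoton64Guest ovf 41m'

# Single awk pass: count each VERSION (column 3), then print the NAME
# (column 1) of every image whose version appears more than once, so the
# randomly suffixed duplicates can be reviewed before any delete.
printf '%s\n' "$listing" | awk '
  { count[$3]++; name[NR] = $1; ver[NR] = $3 }
  END { for (i = 1; i <= NR; i++) if (count[ver[i]] > 1) print name[i] }'
```

The printed group includes the legitimate image as well as the suffixed duplicates; only the duplicates (the names ending in a random suffix) should be passed to `kubectl delete vmimage`.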


Please note: while troubleshooting duplicate vmimage issues, TKCs using these images MAY trigger a cluster roll.