vSphere with Tanzu tkg-controller-manager pods go into CrashLoopBackOff after upgrading vCenter from 7.0U2 to 7.0U3e (MP2)

Article ID: 323445

Products

VMware vCenter Server

Issue/Introduction

Symptoms:
This issue occurs only when vCenter Server (VC) is upgraded from 7.0U2 or earlier to 7.0U3e (MP2) and the Supervisor Cluster (SV) is on version 1.19.x. Once the vCenter upgrade finishes, it triggers an automatic upgrade of the vSphere with Tanzu Supervisor Cluster from 1.19.x to 1.20.x, and the tkg-controller-manager pods go into CrashLoopBackOff.

The issue can be avoided by upgrading from 7.0U3c or 7.0U3d to 7.0U3e, or by having the Supervisor Cluster on 1.20.x or higher before updating vCenter.
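
Before a vCenter upgrade, the Supervisor Cluster side of this condition can be checked from the server GitVersion that kubectl version reports. The following is a minimal sketch and not part of the original procedure: the function name sv_is_affected is hypothetical, it only classifies a version string in the form shown in this article (for example v1.19.1+wcp.3), and it does not check the vCenter version you are upgrading from.

```shell
#!/bin/sh
# Hypothetical helper: succeeds (exit status 0) when the given Supervisor
# Cluster GitVersion string is in the 1.19.x range described above.
sv_is_affected() {
    case "$1" in
        v1.19.*) return 0 ;;   # 1.19.x: the auto-upgrade to 1.20.x will be triggered
        *)       return 1 ;;   # 1.20.x or higher, or an unrecognized string
    esac
}

# Demonstration with a version string taken from the sample output in this article.
if sv_is_affected "v1.19.1+wcp.3"; then
    echo "supervisor is on 1.19.x: in the affected range"
else
    echo "supervisor is not on 1.19.x"
fi
```

The string passed to the function would normally be copied from the "Server Version" line of kubectl version on the Supervisor Cluster.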

Environment

VMware vCenter Server 7.0.x

Resolution

Wait for the vSphere with Tanzu Supervisor Cluster (Workload Management) upgrade to complete. The tkg-controller-manager pods should return to a Running state once the SV upgrade completes.
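
To confirm that the pods have recovered, the STATUS column of the pod listing can be checked mechanically. This is a minimal sketch under stated assumptions: the function name tkg_pods_all_running is hypothetical, it reads output in the format of kubectl get pods -n vmware-system-tkg as shown later in this article, and it looks only at the STATUS column, not at READY or restart counts. The second sample line is altered to CrashLoopBackOff purely for illustration.

```shell
#!/bin/sh
# Hypothetical helper: reads "kubectl get pods" output on stdin and succeeds
# only if at least one tkg-controller-manager pod is listed and every such
# pod has STATUS (third column) equal to "Running".
tkg_pods_all_running() {
    awk '
        /tkg-controller-manager/ { seen++; if ($3 != "Running") bad++ }
        END { if (seen > 0 && bad == 0) exit 0; exit 1 }
    '
}

# Against a live Supervisor Cluster (assumes kubectl is already configured):
#   kubectl get pods -n vmware-system-tkg | tkg_pods_all_running && echo healthy

# Self-contained demonstration with sample lines.
sample='vmware-system-tkg-controller-manager-db7f589c8-5r96r 2/2 Running 0 20m
vmware-system-tkg-controller-manager-db7f589c8-m8jzr 2/2 CrashLoopBackOff 7 20m'

if printf '%s\n' "$sample" | tkg_pods_all_running; then
    echo "all tkg-controller-manager pods are Running"
else
    echo "at least one tkg-controller-manager pod is not Running"
fi
```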

If the SV upgrade is stuck, then apply the following workaround.

Workaround:
1. Check the current SV version
kubectl version

Example

root@42170b17880a316fef2786a0b0b1a020 [ ~ ]# kubectl version
Client Version: version.Info{Major:"1", Minor:"19", GitVersion:"v1.19.1+wcp.3", GitCommit:"fb53e772aa2828c695b1f6b9fe83c63f87c27cf6", GitTreeState:"clean", BuildDate:"2021-02-05T05:07:45Z", GoVersion:"go1.15", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"19", GitVersion:"v1.19.1+wcp.3", GitCommit:"fb53e772aa2828c695b1f6b9fe83c63f87c27cf6", GitTreeState:"clean", BuildDate:"2021-02-05T05:02:40Z", GoVersion:"go1.15", Compiler:"gc", Platform:"linux/amd64"}

2. Check installed compatibility documents

kubectl get compatibilities -A

Example
root@4206642f4e48142f37bd17bc3c2a842a [ ~ ]# kubectl get compatibilities -A
NAMESPACE           NAME                       AGE
vmware-system-ucs   gciscompatibility-1.20.8   12m
vmware-system-ucs   vcenter-7.0                73m


3. If the SV version is 1.19 and the gciscompatibility document is 1.20.8, copy the gciscompatibility documents for 1.19.12 and 1.20.8 from the VC

3a. SSH to the VC and scp the gciscompatibility-1.19.12.yaml document to a location from which you have access to the SV cluster
$ scp /etc/vmware/wcp/guestclusters/gciscompatibility-1.19.12.yaml root@<HOST_IP>:

3b. Also copy gciscompatibility-1.20.8.yaml to a location from which you have access to the SV cluster
$ scp /etc/vmware/wcp/guestclusters/gciscompatibility-1.20.8.yaml root@<HOST_IP>:

4. Apply the 1.19.12 document to the Supervisor Cluster
$ kubectl apply -f gciscompatibility-1.19.12.yaml

5. Verify that the document has been created
$ kubectl get compatibilities -A

Example

root@4206642f4e48142f37bd17bc3c2a842a [ ~ ]# kubectl get compatibilities -A
NAMESPACE           NAME                        AGE
vmware-system-ucs   gciscompatibility-1.20.8    7m9s
vmware-system-ucs   gciscompatibility-1.19.12   9s
vmware-system-ucs   vcenter-7.0                 68m

6. Delete the 1.20.8 compatibility document
$ kubectl delete compatibility gciscompatibility-1.20.8 -n vmware-system-ucs

7. Delete the tkg-controller-manager pods so that they are recreated, then make sure the new pods are running and healthy
$ kubectl get pods -n vmware-system-tkg | grep controller

Example

root@4206642f4e48142f37bd17bc3c2a842a [ ~ ]# kubectl get pods -n vmware-system-tkg | grep controller
vmware-system-tkg-controller-manager-db7f589c8-5r96r 2/2 Running 0 20m
vmware-system-tkg-controller-manager-db7f589c8-m8jzr 2/2 Running 0 20m
vmware-system-tkg-controller-manager-db7f589c8-nbt27 2/2 Running 0 20m

8. When the SV upgrade completes, verify the current SV version
$ kubectl version

Example

root@4206642f4e48142f37bd17bc3c2a842a [ ~ ]# kubectl version
Client Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.8+vmware.wcp.1", GitCommit:"e6c9e1282afaff09bfb0f25cddf7bc3f9b0e680d", GitTreeState:"clean", BuildDate:"2021-08-14T16:47:16Z", GoVersion:"go1.15.13", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.8+vmware.wcp.1", GitCommit:"e6c9e1282afaff09bfb0f25cddf7bc3f9b0e680d", GitTreeState:"clean", BuildDate:"2021-08-14T16:42:17Z", GoVersion:"go1.15.13", Compiler:"gc", Platform:"linux/amd64"}

9. If the SV version is 1.20.8, create the gciscompatibility document for 1.20.8 and delete the 1.19.12 compatibility document
$ kubectl apply -f gciscompatibility-1.20.8.yaml
$ kubectl delete compatibility gciscompatibility-1.19.12 -n vmware-system-ucs

10. Verify that the 1.20.8 document is the only gciscompatibility document present
root@4206642f4e48142f37bd17bc3c2a842a [ ~ ]# kubectl get compatibilities -A
NAMESPACE           NAME                       AGE
vmware-system-ucs   gciscompatibility-1.20.8   12m
vmware-system-ucs   vcenter-7.0                73m

11. Verify that TKRs earlier than 1.18.x have been marked as incompatible
$ kubectl get tkr

Example output

root@4206642f4e48142f37bd17bc3c2a842a [ ~ ]# kubectl get tkr

NAME                                VERSION                          READY   COMPATIBLE   CREATED   UPDATES AVAILABLE
v1.16.12---vmware.1-tkg.1.da7afe7   1.16.12+vmware.1-tkg.1.da7afe7   False   False        128m
v1.16.14---vmware.1-tkg.1.ada4837   1.16.14+vmware.1-tkg.1.ada4837   False   False        128m
v1.16.8---vmware.1-tkg.3.60d2ffd    1.16.8+vmware.1-tkg.3.60d2ffd    False   False        128m
v1.17.11---vmware.1-tkg.1.15f1e18   1.17.11+vmware.1-tkg.1.15f1e18   False   False        128m      [1.18.19+vmware.1-tkg.1.17af790]
v1.17.11---vmware.1-tkg.2.ad3d374   1.17.11+vmware.1-tkg.2.ad3d374   False   False        128m      [1.18.19+vmware.1-tkg.1.17af790]
v1.17.13---vmware.1-tkg.2.2c133ed   1.17.13+vmware.1-tkg.2.2c133ed   False   False        128m      [1.18.19+vmware.1-tkg.1.17af790]
v1.17.17---vmware.1-tkg.1.d44d45a   1.17.17+vmware.1-tkg.1.d44d45a   False   False        128m      [1.18.19+vmware.1-tkg.1.17af790]
v1.17.7---vmware.1-tkg.1.154236c    1.17.7+vmware.1-tkg.1.154236c    False   False        128m      [1.18.19+vmware.1-tkg.1.17af790]
v1.17.8---vmware.1-tkg.1.5417466    1.17.8+vmware.1-tkg.1.5417466    False   False        128m      [1.18.19+vmware.1-tkg.1.17af790]
v1.18.10---vmware.1-tkg.1.3a6cd48   1.18.10+vmware.1-tkg.1.3a6cd48   True    True         12h       [1.19.16+vmware.1-tkg.1.df910e2 1.18.19+vmware.1-tkg.1.17af790]
...
...
...

12. If they are not, restart the tkg-controller-manager deployment
$ kubectl rollout restart deployment -n vmware-system-tkg vmware-system-tkg-controller-manager
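
The decision made in steps 1 to 3 (does an installed gciscompatibility document match the current SV version?) can be sketched as a small shell function. This is an illustration only, not part of the documented procedure: the function name compat_status is hypothetical, and treating a document as matching when its name shares the SV major.minor version is an assumption drawn from the 1.19/1.19.12 and 1.20/1.20.8 pairings in this article.

```shell
#!/bin/sh
# Hypothetical helper: prints "ok" when an installed gciscompatibility
# document has the same major.minor as the Supervisor Cluster, otherwise
# prints "mismatch".
#   $1    : server GitVersion, for example v1.19.1+wcp.3
#   stdin : output of "kubectl get compatibilities -A"
compat_status() {
    v=${1#v}             # 1.19.1+wcp.3
    major=${v%%.*}       # 1
    rest=${v#*.}         # 19.1+wcp.3
    minor=${rest%%.*}    # 19
    awk -v want="gciscompatibility-$major.$minor." '
        index($2, want) == 1 { found = 1 }   # NAME column starts with the prefix
        END { print (found ? "ok" : "mismatch") }
    '
}

# Demonstration with the state shown in steps 1 and 2: SV on 1.19 while only
# the 1.20.8 document is installed.
installed='NAMESPACE NAME AGE
vmware-system-ucs gciscompatibility-1.20.8 12m
vmware-system-ucs vcenter-7.0 73m'

printf '%s\n' "$installed" | compat_status "v1.19.1+wcp.3"   # prints: mismatch
```

A result of "mismatch" corresponds to the situation in step 3, where the 1.19.12 document needs to be applied. After step 9 the same check with the 1.20.8 server version should print "ok".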


Additional Information

Impact/Risks:
The tkg-controller-manager pods go into a CrashLoopBackOff state, which makes the guest clusters unmanageable.