CAPV controller manager stops working after some time in TKGm v2.5.1.
In TKGm v2.5.1, after the management cluster has been running for some time, cluster operations (e.g., create, scale, update, delete) stop making progress, and Machines remain stuck in phases such as Provisioning or Deleting:
# kubectl get ma -A
NAME                         CLUSTER     NODENAME                     PROVIDERID               PHASE          AGE   VERSION
cluster-A-controlplane-xxx   cluster-A   cluster-A-controlplane-xxx   vsphere://4235c561-xxx   Running        10d   v1.28.7+vmware.1
cluster-A-md-0-yyy           cluster-A   cluster-A-md-0-yyy           vsphere://42356c5d-yyy   Provisioning   15m   v1.28.7+vmware.1   # <---- stuck!
cluster-A-md-0-zzz           cluster-A   cluster-A-md-0-zzz           vsphere://4235e0d4-zzz   Deleting       10d   v1.28.7+vmware.1   # <---- stuck!
The capv-controller-manager pod still shows a Running status, but it fails to complete its operations.
# kubectl -n capv-system get pods
NAME                                      READY   STATUS    RESTARTS      AGE
capv-controller-manager-d99b6c77c-lgb46   1/1     Running   1 (21h ago)   93d
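To see why reconciliation is not progressing, reviewing the recent logs of the CAPV controller can help. This is a general diagnostic step, not specific to this issue; adjust the --tail value as needed:

# kubectl -n capv-system logs deploy/capv-controller-manager --tail=200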
VMware Tanzu Kubernetes Grid v2.5.1
The issue is caused by the "--enable-keep-alive" flag, which was introduced in CAPV v1.8.8, the CAPV version included in TKGm v2.5.1.
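To confirm that the controller is running with this flag, the container arguments of the deployment can be inspected. This is a quick check only; it assumes the manager is the first container in the pod template:

# kubectl -n capv-system get deploy capv-controller-manager \
    -o jsonpath='{.spec.template.spec.containers[0].args}'

The output should list "--enable-keep-alive" among the arguments.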
The workaround for this issue is to manually edit the "capv-controller-manager" deployment in the management cluster and disable the keep-alive option, as shown in the following steps:
kubectl -n capv-system edit deploy capv-controller-manager
spec:
  ...
  template:
    ...
    spec:
      containers:
      - args:
        - --leader-elect
        - --v=4
        - --enable-keep-alive                     # <---- remove this line
        - --feature-gates=NodeAntiAffinity=true
        - --enable-keep-alive=false               # <---- add this line
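If an interactive edit is not practical, the same change can be scripted. The following is a sketch only, assuming the bare "--enable-keep-alive" argument appears exactly as shown above; it replaces that argument with "--enable-keep-alive=false" and re-applies the deployment, which triggers a new rollout:

# kubectl -n capv-system get deploy capv-controller-manager -o yaml \
    | sed 's/- --enable-keep-alive$/- --enable-keep-alive=false/' \
    | kubectl apply -f -
# kubectl -n capv-system rollout status deploy capv-controller-manager

Once the new capv-controller-manager pod is Running, the stuck Machine operations should resume reconciling.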