CAPV controller manager stops working after some time in TKGm v2.5.1.
In TKGm v2.5.1, after the management cluster has been running for some time, cluster operations (e.g., create, scale, update, delete) appear to stop working. The capv-controller-manager pod still reports a "Running" status, but it no longer completes these operations.
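To confirm the symptom, the pod status and recent controller logs can be inspected from the management cluster context. The following is a minimal sketch: the namespace and deployment name are taken from the workaround in this article, while the function name show_capv_state and the 50-line log tail are illustrative choices, not part of the product.

```shell
# Sketch only: show_capv_state is a hypothetical helper name.
# Namespace and deployment name are those used elsewhere in this article.
NS=capv-system
DEPLOY=capv-controller-manager

show_capv_state() {
  # The pod is expected to report Running even while operations are stalled.
  kubectl -n "$NS" get pods
  # Print the most recent controller log lines for inspection.
  kubectl -n "$NS" logs "deploy/$DEPLOY" --tail=50
}

# Usage: show_capv_state
```

A Running pod combined with cluster operations that never finish matches the behavior described above.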
VMware Tanzu Kubernetes Grid v2.5.1
The issue is caused by the "keep-alive" flag (--enable-keep-alive) introduced in CAPV v1.8.8, the version included in TKGm v2.5.1.
To work around this issue, manually edit the "capv-controller-manager" deployment in the management cluster and disable the keep-alive option by setting --enable-keep-alive=false in the container arguments, as follows:
kubectl -n capv-system edit deploy capv-controller-manager
spec:
  ...
  template:
    ...
    spec:
      containers:
      - args:
        - --leader-elect
        - --v=4
        - --enable-keep-alive
        - --feature-gates=NodeAntiAffinity=true
        - --enable-keep-alive=false
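As an alternative to the interactive edit, the same argument can be appended with a JSON patch and the result checked afterwards. This is a sketch under assumptions: it assumes the controller is the first container (index 0) in the pod template, which should be confirmed before patching, and the function names and the 120-second timeout are illustrative only.

```shell
# Sketch only: function names are hypothetical helpers.
# Assumption: the controller container is at index 0 in the pod template.
NS=capv-system
DEPLOY=capv-controller-manager

# Append --enable-keep-alive=false to the end of the container's args list.
disable_keep_alive() {
  kubectl -n "$NS" patch deploy "$DEPLOY" --type=json \
    -p='[{"op":"add","path":"/spec/template/spec/containers/0/args/-","value":"--enable-keep-alive=false"}]'
}

# Wait for the updated pod to roll out, then print the resulting args.
verify_keep_alive() {
  kubectl -n "$NS" rollout status deploy "$DEPLOY" --timeout=120s
  kubectl -n "$NS" get deploy "$DEPLOY" \
    -o jsonpath='{.spec.template.spec.containers[0].args}'
}

# Usage: disable_keep_alive && verify_keep_alive
```

The printed args should include --enable-keep-alive=false as the last entry, matching the listing above.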