CAPV controller manager stops working after some time in TKGm v2.5.1


Article ID: 370310


Updated On:


Products: Tanzu Kubernetes Grid, VMware Tanzu Kubernetes Grid



In TKGm v2.5.1, after the management cluster has been running for some time, cluster lifecycle operations (e.g., create, scale, update, delete) stop making progress. Machines remain stuck in a transitional phase such as Provisioning or Deleting:

# kubectl get ma -A
NAME                         CLUSTER     NODENAME                     PROVIDERID               PHASE          AGE   VERSION
cluster-A-controlplane-xxx   cluster-A   cluster-A-controlplane-xxx   vsphere://4235c561-xxx   Running        10d   v1.28.7+vmware.1
cluster-A-md-0-yyy           cluster-A   cluster-A-md-0-yyy           vsphere://42356c5d-yyy   Provisioning   15m   v1.28.7+vmware.1 # <---- stuck!
cluster-A-md-0-zzz           cluster-A   cluster-A-md-0-zzz           vsphere://4235e0d4-zzz   Deleting       10d   v1.28.7+vmware.1 # <---- stuck!
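
On a management cluster with many machines, the stuck entries can be hard to spot by eye. The following is a minimal sketch, not part of the original article, that prints only the machines whose phase is not Running. It assumes kubectl is pointed at the management cluster; the function names list_machine_phases and filter_not_running are hypothetical helpers introduced here for illustration.

```shell
#!/bin/sh
# Sketch: list Cluster API machines that are not in the Running phase.
# Assumption: the current kubectl context is the TKG management cluster.

# Print "NAMESPACE NAME PHASE" for every machine, without a header row.
list_machine_phases() {
  kubectl get machines.cluster.x-k8s.io -A --no-headers \
    -o custom-columns='NAMESPACE:.metadata.namespace,NAME:.metadata.name,PHASE:.status.phase'
}

# Keep only the rows whose last column (the phase) is not "Running".
filter_not_running() {
  awk '$NF != "Running"'
}

# Usage:
#   list_machine_phases | filter_not_running
```

A machine that is briefly in Provisioning or Deleting is normal, so a single run proves little. Running the pipeline twice, several minutes apart, and seeing the same machines in the same phase is a stronger hint that the controller has stopped reconciling.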


The capv-controller-manager pod still reports a Running status, but it no longer completes these operations:

# kubectl -n capv-system get pods
NAME                                      READY   STATUS    RESTARTS      AGE
capv-controller-manager-d99b6c77c-lgb46   1/1     Running   1 (21h ago)   93d


VMware Tanzu Kubernetes Grid v2.5.1


The issue is caused by the keep-alive flag (--enable-keep-alive), which was introduced in CAPV v1.8.8, the CAPV version included in TKGm v2.5.1.
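
To confirm whether a given management cluster runs the controller with this flag, one could read the container arguments from the deployment. The sketch below is an illustration under assumptions, not an official procedure: capv_args and keep_alive_state are hypothetical helper names, and the check treats a bare --enable-keep-alive argument as enabled, matching how the flag appears in the deployment shown in this article.

```shell
#!/bin/sh
# Sketch: report how the keep-alive flag is set on the CAPV controller.
# Assumption: the current kubectl context is the TKG management cluster.

# Print the container arguments of the CAPV deployment, one per line.
capv_args() {
  kubectl -n capv-system get deploy capv-controller-manager \
    -o jsonpath='{.spec.template.spec.containers[*].args[*]}' | tr ' ' '\n'
}

# Read arguments on stdin (one per line) and print the keep-alive setting:
# "enabled", "disabled", or "not set" when the flag does not appear at all.
keep_alive_state() {
  awk '
    $0 == "--enable-keep-alive" || $0 == "--enable-keep-alive=true" { state = "enabled" }
    $0 == "--enable-keep-alive=false" { state = "disabled" }
    END { print (state == "" ? "not set" : state) }
  '
}

# Usage:
#   capv_args | keep_alive_state
```

A result of "not set" only means the argument is absent from the deployment; which default the controller then applies is not covered by this article.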


As a workaround, edit the capv-controller-manager deployment in the management cluster and disable the keep-alive option:


  1. Edit the deployment using kubectl:
    kubectl -n capv-system edit deploy capv-controller-manager
  2. Find the keep-alive flag in the .spec.template.spec.containers[].args section:
          - args:
            - --leader-elect
            - --v=4
            - --enable-keep-alive
            - --feature-gates=NodeAntiAffinity=true
  3. Change the --enable-keep-alive line so that the flag is explicitly disabled:
            - --enable-keep-alive=false
  4. Save the file and quit the editor. The deployment then restarts the pod automatically with the new argument, which should resume CAPV cluster operations.
The issue will be fixed in TKGm v2.5.2.
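
Where several management clusters need the same change, the manual edit above could be scripted. The following is a hedged sketch rather than a supported procedure: it fetches the deployment as YAML, rewrites the flag with sed, and writes the object back with kubectl replace. The function names disable_keep_alive and apply_workaround are hypothetical, and the sed pattern assumes the argument sits on its own list line exactly as shown in step 2.

```shell
#!/bin/sh
# Sketch: non-interactive variant of the kubectl edit workaround.
# Assumption: the current kubectl context is the TKG management cluster.

# Read a YAML manifest on stdin and turn a bare or "=true" keep-alive list
# item into "--enable-keep-alive=false". All other lines pass through as-is.
disable_keep_alive() {
  sed -E 's/^([[:space:]]*-[[:space:]]+)--enable-keep-alive(=true)?[[:space:]]*$/\1--enable-keep-alive=false/'
}

# Fetch the deployment, rewrite the flag, write it back, then wait for
# the replacement pod to become ready.
apply_workaround() {
  kubectl -n capv-system get deploy capv-controller-manager -o yaml \
    | disable_keep_alive \
    | kubectl replace -f -
  kubectl -n capv-system rollout status deploy/capv-controller-manager
}

# Usage:
#   apply_workaround
```

Piping the fetched YAML through disable_keep_alive alone and comparing it with the original is a cheap way to review the change before calling apply_workaround.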