CAPV controller manager stops working after some time in TKGm v2.5.1

Article ID: 370310

Products

Tanzu Kubernetes Grid
VMware Tanzu Kubernetes Grid

Issue/Introduction

CAPV controller manager stops working after some time in TKGm v2.5.1.

In TKGm v2.5.1, after the management cluster has been running for some time, cluster operations (e.g., create, scale, update, delete) stop making progress. Machines remain stuck in phases such as Provisioning or Deleting:

# kubectl get ma -A
NAME                         CLUSTER     NODENAME                     PROVIDERID               PHASE          AGE   VERSION
cluster-A-controlplane-xxx   cluster-A   cluster-A-controlplane-xxx   vsphere://4235c561-xxx   Running        10d   v1.28.7+vmware.1
cluster-A-md-0-yyy           cluster-A   cluster-A-md-0-yyy           vsphere://42356c5d-yyy   Provisioning   15m   v1.28.7+vmware.1 # <---- stuck!
cluster-A-md-0-zzz           cluster-A   cluster-A-md-0-zzz           vsphere://4235e0d4-zzz   Deleting       10d   v1.28.7+vmware.1 # <---- stuck!
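
To spot affected machines quickly, the listing above can be reduced to the rows that are not in the Running phase. The following is a minimal sketch and not part of the original workaround: it assumes two-column NAME/PHASE input as produced by kubectl's custom-columns output, and the function name non_running_machines is hypothetical. A machine that has only just entered Provisioning or Deleting is not necessarily stuck, so read the result together with the machine's age.

```shell
# Read "NAME PHASE" rows on stdin and print every machine whose phase
# is anything other than Running. Rows with fewer than two fields are skipped.
# The function name is illustrative, not part of any TKG or kubectl tooling.
non_running_machines() {
  awk 'NF >= 2 && $2 != "Running" { print $1 "\t" $2 }'
}

# Example against a live management cluster (requires kubectl access):
#   kubectl get machines -A --no-headers \
#     -o custom-columns=NAME:.metadata.name,PHASE:.status.phase \
#     | non_running_machines
```

The filter works on plain text, so it can also be applied to saved command output when the cluster is not reachable.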

The capv-controller-manager pod still reports a "Running" status but no longer completes its operations.

# kubectl -n capv-system get pods
NAME                                      READY   STATUS    RESTARTS      AGE
capv-controller-manager-d99b6c77c-lgb46   1/1     Running   1 (21h ago)   93d

Environment

VMware Tanzu Kubernetes Grid v2.5.1

Cause

The issue is caused by the "keep-alive" flag (--enable-keep-alive) introduced in CAPV v1.8.8, the version included in TKGm v2.5.1.

Resolution

As a workaround, manually edit the "capv-controller-manager" deployment in the management cluster and disable the keep-alive option. Follow these steps:

  1. Edit the deployment using kubectl.
    kubectl -n capv-system edit deploy capv-controller-manager
  2. Find the keep-alive setting in the .spec.template.spec.containers[].args section.
    spec:
    ...
      template:
    ...
        spec:
          containers:
          - args:
            - --leader-elect
            - --v=4
            - --enable-keep-alive
            - --feature-gates=NodeAntiAffinity=true
  3. Change the --enable-keep-alive line so that the flag is explicitly set to false.
            - --enable-keep-alive=false
  4. Save the file and quit the editor. The pod is then restarted automatically, which should resolve the issue and allow CAPV cluster operations to resume.
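
The manual edit in the steps above can also be applied without an interactive editor. The following is a minimal sketch under stated assumptions, not a procedure taken from the original article: it assumes the flag appears in the deployment YAML as a bare "- --enable-keep-alive" list item (as shown in step 2), and the function name disable_keep_alive is hypothetical. It uses the general "get, rewrite with sed, replace" pattern for kubectl rather than anything specific to CAPV.

```shell
# Rewrite a bare "- --enable-keep-alive" (or "- --enable-keep-alive=true")
# list item to "- --enable-keep-alive=false". All other lines, including a
# line that is already set to false, pass through unchanged.
# The function name is illustrative only.
disable_keep_alive() {
  sed -E 's/^([[:space:]]*-[[:space:]]+)--enable-keep-alive(=true)?[[:space:]]*$/\1--enable-keep-alive=false/'
}

# Example against a live management cluster (requires kubectl access):
#   kubectl -n capv-system get deploy capv-controller-manager -o yaml \
#     | disable_keep_alive \
#     | kubectl replace -f -
```

If the deployment changes between the get and the replace, kubectl replace may reject the update and the pipeline can simply be rerun. Afterwards, kubectl -n capv-system rollout status deploy/capv-controller-manager can be used to wait for the restarted pod.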
     
The issue will be fixed in TKGm v2.5.2.