CAPV controller manager stops working after some time in TKGm v2.5.1

Article ID: 370310

Products

Tanzu Kubernetes Grid, VMware Tanzu Kubernetes Grid

Issue/Introduction

CAPV controller manager stops working after some time in TKGm v2.5.1.

In TKGm v2.5.1, after the management cluster has been running for some time, cluster operations (for example, create, scale, update, and delete) stop working. The capv-controller-manager pod still shows a status of "Running" but does not complete its operations.

Environment

VMware Tanzu Kubernetes Grid v2.5.1

Cause

The issue is caused by the "keep-alive" flag that was introduced in CAPV v1.8.8, the version included in TKGm v2.5.1.
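
To check whether a management cluster is running with the flag set, the container arguments of the deployment can be inspected. The following is a minimal sketch, assuming kubectl is pointed at the management cluster context. The function names keep_alive_state and check_capv_keep_alive are hypothetical helpers and are not part of TKG or CAPV.

```shell
# Hypothetical helper: classify the keep-alive setting from the container
# args read on stdin (any text that contains the args, e.g. a JSON array).
keep_alive_state() {
  args=$(cat)
  case "$args" in
    *--enable-keep-alive=false*) echo "disabled" ;;
    *--enable-keep-alive*)       echo "enabled" ;;
    *)                           echo "not set" ;;
  esac
}

# Hypothetical wrapper: read the args of the capv-controller-manager
# containers from the cluster and classify them.
check_capv_keep_alive() {
  kubectl -n capv-system get deploy capv-controller-manager \
    -o jsonpath='{.spec.template.spec.containers[*].args}' | keep_alive_state
}
```

Running check_capv_keep_alive prints "enabled" when the flag is present without "=false", "disabled" when it is set to false, and "not set" when the flag does not appear in the args at all.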

Resolution

As a workaround, manually edit the "capv-controller-manager" deployment in the management cluster and disable the keep-alive option by following these steps:

 

  1. Open the deployment for editing with kubectl:
    kubectl -n capv-system edit deploy capv-controller-manager
  2. Find the keep-alive setting in the .spec.template.spec.containers[].args section:
    spec:
    ...
      template:
    ...
        spec:
          containers:
          - args:
            - --leader-elect
            - --v=4
            - --enable-keep-alive
            - --feature-gates=NodeAntiAffinity=true
  3. Modify the --enable-keep-alive line so that it sets the flag to false:
            - --enable-keep-alive=false
  4. Save the file and quit the editor. The deployment then restarts the pod automatically, which should resolve the issue and allow CAPV cluster operations to resume.
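
The manual steps above can also be sketched as a non-interactive script, which may be convenient when several management clusters need the same change. This is only a sketch under stated assumptions: kubectl is pointed at the management cluster context, the flag appears as its own list item in the deployment YAML exactly as shown in step 2, and the function names disable_keep_alive and apply_keep_alive_workaround are hypothetical helpers rather than part of TKG or CAPV.

```shell
# Hypothetical helper: rewrite a bare "- --enable-keep-alive" (or "=true")
# list item in YAML read from stdin to "- --enable-keep-alive=false".
# Lines that already say "=false" are left unchanged, so it is safe to rerun.
disable_keep_alive() {
  sed -E 's/^([[:space:]]*-[[:space:]]+)--enable-keep-alive(=true)?[[:space:]]*$/\1--enable-keep-alive=false/'
}

# Hypothetical wrapper: fetch the deployment, rewrite the flag, write the
# result back, then wait for the restarted pod to become ready.
apply_keep_alive_workaround() {
  kubectl -n capv-system get deploy capv-controller-manager -o yaml \
    | disable_keep_alive \
    | kubectl replace -f -
  kubectl -n capv-system rollout status deploy/capv-controller-manager
}
```

Calling apply_keep_alive_workaround performs the equivalent of steps 1 to 4. Piping the output of disable_keep_alive to a file first, instead of directly to kubectl replace, allows the change to be reviewed before it is applied.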
     
The issue will be fixed in TKGm v2.5.2.