TKG Management clusters fail to upgrade or upgrade slowly when behind Azure internal load balancers.


Article ID: 313087


Updated On:

Products

VMware

Issue/Introduction

Symptoms:
  • TKG upgrades may stall indefinitely or take an unusually long time to complete.

  • The control plane Machine object's EtcdMemberHealthy condition shows a status of Unknown with an error similar to the following:

 

  - lastTransitionTime: "2022-07-01T15:11:30Z"
    message: 'Failed to connect to the etcd pod on the k8smgtlab0-control-plane-jdtmr
      node: could not establish a connection to any etcd node: unable to create etcd
      client: context deadline exceeded'
    reason: MemberInspectionFailed
    status: Unknown
    type: EtcdMemberHealthy


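To locate the affected condition, the Machine objects in the management cluster can be inspected. This is a minimal sketch; the machine name and namespace are placeholders for the values in your environment:

kubectl get machines -A
kubectl get machine <machine-name> -n <namespace> -o yaml

The EtcdMemberHealthy condition is listed under status.conditions in the Machine object.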

Cause

When the kubeadm control plane controller manager pod is scheduled on a control plane node that is also a backend member of an Azure internal load balancer, etcd health check calls that originate from that node, target the load balancer IP, and are routed back to the same node will fail; Azure internal load balancers do not support this hairpin traffic pattern.

The following tolerations are set on the capi-kubeadm-control-plane-controller-manager deployment, which allow the pod to be scheduled on control plane nodes:

tolerations:
- effect: NoSchedule
  key: node-role.kubernetes.io/master
- effect: NoSchedule
  key: node-role.kubernetes.io/control-plane

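To confirm that the controller manager pod is currently running on a control plane node, check the NODE column of the pod listing; this assumes the deployment's namespace, capi-kubeadm-control-plane-system, as referenced in the workaround below:

kubectl get po -n capi-kubeadm-control-plane-system -o wide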
 

Resolution

No resolution is available in current versions.
A node affinity will be added to the kubeadm control plane controller manager deployment in a future release.


Workaround:

After the new CAPI controller is installed, manually edit the KCP (kubeadm control plane) controller deployment to include the following node affinity in the pod spec, using the command:

kubectl edit deployment capi-kubeadm-control-plane-controller-manager -n capi-kubeadm-control-plane-system

 

Pod Node Affinity:

      affinity:
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - preference:
              matchExpressions:
              - key: node-role.kubernetes.io/control-plane
                operator: DoesNotExist
            weight: 100

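As a non-interactive alternative to kubectl edit, the same node affinity can be applied with a merge patch. This is a sketch of an equivalent patch; verify the resulting pod spec after applying it:

kubectl patch deployment capi-kubeadm-control-plane-controller-manager \
  -n capi-kubeadm-control-plane-system \
  --type merge \
  -p '{"spec":{"template":{"spec":{"affinity":{"nodeAffinity":{"preferredDuringSchedulingIgnoredDuringExecution":[{"preference":{"matchExpressions":[{"key":"node-role.kubernetes.io/control-plane","operator":"DoesNotExist"}]},"weight":100}]}}}}}}'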
 

The pod should be terminated and rescheduled to a worker node in the cluster. The node can be verified using the following command:

 

kubectl get po -A -l cluster.x-k8s.io/provider=control-plane-kubeadm -o wide
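To confirm that the node shown in the NODE column is a worker, check its labels; <node-name> below is a placeholder for that node. A worker node does not carry the node-role.kubernetes.io/control-plane label:

kubectl get node <node-name> --show-labels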

 

NOTE: If the pod is not rescheduled on a worker node, verify that the worker nodes have sufficient capacity and availability to run the pod; because the affinity is preferred rather than required, the scheduler can still place the pod on a control plane node when no worker node is available.






Additional Information

Impact/Risks:
This issue is currently affecting Tanzu Kubernetes Grid 1.5.x and 1.6.x.