TKG Management clusters fail to upgrade or upgrade slowly when behind Azure internal load balancers.


Article ID: 313087






  • TKG upgrades may stall indefinitely or take a long time to complete.

  • The control plane machine object’s EtcdMemberHealthy condition shows as Unknown, with an error similar to the following:


  - lastTransitionTime: "2022-07-01T15:11:30Z"
    message: 'Failed to connect to the etcd pod on the k8smgtlab0-control-plane-jdtmr
      node: could not establish a connection to any etcd node: unable to create etcd
      client: context deadline exceeded'
    reason: MemberInspectionFailed
    status: Unknown
    type: EtcdMemberHealthy
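
To check this condition across all machines at once, a sketch along the following lines may help. It assumes the current kubectl context points at the management cluster; the function name etcd_member_health is illustrative only and is not part of TKG.

```shell
# Print namespace/name, status, and reason of the EtcdMemberHealthy
# condition for every Machine object. Machines that do not carry this
# condition (for example worker machines) show empty columns.
# Assumption: kubectl is pointed at the management cluster.
etcd_member_health() {
  kubectl get machines -A \
    -o jsonpath='{range .items[*]}{.metadata.namespace}{"/"}{.metadata.name}{"\t"}{.status.conditions[?(@.type=="EtcdMemberHealthy")].status}{"\t"}{.status.conditions[?(@.type=="EtcdMemberHealthy")].reason}{"\n"}{end}'
}
```

Calling etcd_member_health and looking for rows with status Unknown and reason MemberInspectionFailed should surface the affected control plane machines.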


When the kubeadm control plane controller manager pod is scheduled on a control plane node that is also a backend member of an Azure internal load balancer, its etcd health check calls fail. The calls originate from the control plane node, go to the load balancer IP, and are routed back to the same node, and this round trip does not complete.

The following tolerations are set on the capi-kubeadm-control-plane-controller-manager deployment, so the pod can be scheduled on a control plane node:

- effect: NoSchedule
- effect: NoSchedule
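
The full tolerations currently set on the deployment can be printed with a command like the one below. This is a minimal sketch that assumes the deployment lives in the capi-kubeadm-control-plane-system namespace, as in the kubectl edit command in this article; the function name kcp_tolerations is hypothetical.

```shell
# Print the tolerations from the KCP controller manager pod template.
# Assumption: kubectl is pointed at the management cluster.
kcp_tolerations() {
  kubectl get deployment capi-kubeadm-control-plane-controller-manager \
    -n capi-kubeadm-control-plane-system \
    -o jsonpath='{.spec.template.spec.tolerations}'
}
```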



No resolution within the current versions.
A node affinity will be added to kubeadm control plane controller manager deployments in future releases.


After the new CAPI controller is installed, manually edit the KCP controller deployment and add the following node affinity to the pod spec. Open the deployment for editing with the command:

kubectl edit deployment capi-kubeadm-control-plane-controller-manager -n capi-kubeadm-control-plane-system


Pod Node Affinity:

      affinity:
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - preference:
              matchExpressions:
              - key:
                operator: DoesNotExist
            weight: 100
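
As a non-interactive alternative to kubectl edit, the same affinity could be applied with kubectl patch. The sketch below is not taken from the original article. In particular, the label key is an assumption: the affinity above does not show a key value, and node-role.kubernetes.io/control-plane is used here only because it is the label kubeadm normally places on control plane nodes. Confirm the label on your nodes (kubectl get nodes --show-labels) and override CP_LABEL_KEY if needed. The function names are hypothetical.

```shell
# ASSUMPTION: label key identifying control plane nodes. The article's
# affinity snippet leaves the key blank; verify before use.
CP_LABEL_KEY="${CP_LABEL_KEY:-node-role.kubernetes.io/control-plane}"

# Emit a JSON merge patch that adds a preferred node affinity for nodes
# that do NOT carry the control plane label (weight 100).
affinity_patch() {
  cat <<EOF
{"spec":{"template":{"spec":{"affinity":{"nodeAffinity":{"preferredDuringSchedulingIgnoredDuringExecution":[{"weight":100,"preference":{"matchExpressions":[{"key":"${CP_LABEL_KEY}","operator":"DoesNotExist"}]}}]}}}}}}
EOF
}

# Apply the patch to the KCP controller manager deployment.
# Assumption: kubectl is pointed at the management cluster.
apply_affinity_patch() {
  kubectl patch deployment capi-kubeadm-control-plane-controller-manager \
    -n capi-kubeadm-control-plane-system \
    --type merge -p "$(affinity_patch)"
}
```

Running affinity_patch on its own prints the patch for review before apply_affinity_patch is called. Because the affinity is a preference rather than a requirement, the scheduler can still place the pod on a control plane node if no worker node is suitable.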


After the edit is saved, the pod should be terminated and rescheduled onto a worker node in the cluster. Verify which node the pod is running on with the following command:


kubectl get po -A -l -o wide


NOTE: If the pod is not rescheduled on a worker node, verify that the worker nodes have sufficient capacity and availability to run it.
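
The command above does not show a value for the -l label selector. One way to check placement without a selector is to list the pods in the controller's namespace together with their nodes. This is a sketch only; the namespace is taken from the kubectl edit command in this article, and the function name kcp_pod_placement is hypothetical.

```shell
# Print each pod in the KCP controller namespace with the node it runs on.
# Assumption: kubectl is pointed at the management cluster.
kcp_pod_placement() {
  kubectl get pods -n capi-kubeadm-control-plane-system \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.nodeName}{"\n"}{end}'
}
```

Comparing the node name in the output with the ROLES column of kubectl get nodes shows whether the pod landed on a worker node.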

Additional Information

This issue currently affects Tanzu Kubernetes Grid 1.5.x and 1.6.x.