TKG Management clusters fail to upgrade or upgrade slowly when behind Azure internal load balancers.


Article ID: 313087


Updated On:

Products

VMware

Issue/Introduction

Symptoms:
  • TKG upgrades may stall indefinitely or take an unusually long time to complete.

  • The control plane Machine object's EtcdMemberHealthy condition shows a status of Unknown with an error similar to the following:

 

  - lastTransitionTime: "2022-07-01T15:11:30Z"
    message: 'Failed to connect to the etcd pod on the k8smgtlab0-control-plane-jdtmr
      node: could not establish a connection to any etcd node: unable to create etcd
      client: context deadline exceeded'
    reason: MemberInspectionFailed
    status: Unknown
    type: EtcdMemberHealthy


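To locate the affected condition, the Machine objects in the management cluster can be inspected. This is a minimal sketch; the machine name and namespace are placeholders for the values in your environment:

kubectl get machines -A
kubectl get machine <machine-name> -n <namespace> -o yaml

The EtcdMemberHealthy condition is listed under status.conditions in the Machine object.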

Cause

When the kubeadm control plane controller manager pod is scheduled on a control plane node that is also a backend member of an Azure internal load balancer, etcd health check calls that originate from that node, target the load balancer IP, and are routed back to the same node will fail; Azure internal load balancers do not support this hairpin traffic pattern.

The following tolerations are set on the capi-kubeadm-control-plane-controller-manager deployment, which allow the pod to be scheduled on control plane nodes:

tolerations:
- effect: NoSchedule
  key: node-role.kubernetes.io/master
- effect: NoSchedule
  key: node-role.kubernetes.io/control-plane

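To confirm that the controller manager pod is currently running on a control plane node, check the NODE column of the pod listing; this assumes the deployment's namespace, capi-kubeadm-control-plane-system, as referenced in the workaround below:

kubectl get po -n capi-kubeadm-control-plane-system -o wide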
 

Resolution

No resolution is available in current versions.
A node affinity will be added to the kubeadm control plane controller manager deployment in a future release.


Workaround:

After the new CAPI controller is installed, manually edit the KCP (kubeadm control plane) controller deployment to include the following node affinity in the pod spec, using the command:

kubectl edit deployment capi-kubeadm-control-plane-controller-manager -n capi-kubeadm-control-plane-system

 

Pod Node Affinity:

      affinity:
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - preference:
              matchExpressions:
              - key: node-role.kubernetes.io/control-plane
                operator: DoesNotExist
            weight: 100

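As a non-interactive alternative to kubectl edit, the same node affinity can be applied with a merge patch. This is a sketch of an equivalent patch; verify the resulting pod spec after applying it:

kubectl patch deployment capi-kubeadm-control-plane-controller-manager \
  -n capi-kubeadm-control-plane-system \
  --type merge \
  -p '{"spec":{"template":{"spec":{"affinity":{"nodeAffinity":{"preferredDuringSchedulingIgnoredDuringExecution":[{"preference":{"matchExpressions":[{"key":"node-role.kubernetes.io/control-plane","operator":"DoesNotExist"}]},"weight":100}]}}}}}}'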
 

The pod should be terminated and rescheduled to a worker node in the cluster. The node can be verified using the following command:

 

kubectl get po -A -l cluster.x-k8s.io/provider=control-plane-kubeadm -o wide
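To confirm that the node shown in the NODE column is a worker, check its labels; <node-name> below is a placeholder for that node. A worker node does not carry the node-role.kubernetes.io/control-plane label:

kubectl get node <node-name> --show-labels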

 

NOTE: If the pod is not rescheduled on a worker node, verify that the worker nodes have sufficient capacity and availability to run the pod; because the affinity is preferred rather than required, the scheduler can still place the pod on a control plane node when no worker node is available.






Additional Information

Impact/Risks:
This issue is currently affecting Tanzu Kubernetes Grid 1.5.x and 1.6.x.