Tanzu Kubernetes Cluster Upgrade Stuck - reports EtcdMemberHealthy condition is unknown

Article ID: 319419

Products

VMware vSphere ESXi
VMware vSphere with Tanzu
VMware Tanzu Kubernetes Grid Management

Issue/Introduction

Symptoms:
  • Upgrading a Tanzu Kubernetes cluster via TMC, the Tanzu CLI, or a YAML edit results in no new control plane nodes.
  • Scaling a Tanzu Kubernetes Cluster control plane count results in no new nodes.
  • Machine Health Check of control plane nodes is not replacing a broken control plane node.
  • Checking etcd status on the existing nodes shows that the etcd cluster is healthy and status is running.
  • VMware Carbon Black Cloud Container Operator or another 3rd party security policy controller is deployed to the cluster and not allowing port forwarding in the kube-system namespace.
 
  • Logs from the capi kubeadm control plane manager pod show the following message:

 

Command to get logs from vSphere with Tanzu Supervisor pods:
kubectl logs -n vmware-system-capw capi-kubeadm-control-plane-controller-manager-XXXXXXX -c manager

Command to get logs from TKG Management cluster pods:
kubectl logs -n capi-kubeadm-control-plane-system capi-kubeadm-control-plane-controller-manager-XXXXX -c manager
 

controllers/KubeadmControlPlane "msg"="Waiting for control plane to pass preflight checks" "cluster"="foo-prod" "kubeadmControlPlane"="foo-prod-control-plane" "namespace"="default" "failures"="machine foo-prod-control-plane-g7s2c reports EtcdMemberHealthy condition is unknown (Failed to connect to the etcd pod on the foo-prod-control-plane-g7s2c node: unable to create etcd client: endpoints: [etcd-foo-prod-control-plane-g7s2c], proxy.KubeConfig.Host: https://<KUBEAPI_IP>:6443: context deadline exceeded)"
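On a busy controller, the message above can be hard to spot. The following is a minimal sketch, not part of the original article, of a shell helper that reads controller log lines on standard input and lists the machines reported with an unknown EtcdMemberHealthy condition. The function name list_etcd_unknown_machines is hypothetical, and the pattern assumes the log wording shown above.

```shell
# Hypothetical helper: reads kubeadm control plane controller logs on stdin
# and prints each machine name that is reported with
# "EtcdMemberHealthy condition is unknown", once per machine.
list_etcd_unknown_machines() {
  sed -n 's/.*machine \([^ ]*\) reports EtcdMemberHealthy condition is unknown.*/\1/p' | sort -u
}
```

It can be fed from either of the log commands above, for example by piping the `kubectl logs` output into `list_etcd_unknown_machines`.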
 

Run `etcdctl --cluster=true endpoint health --write-out=table` on the guest cluster.
The output shows that the etcd status of each member is healthy.
 

 

  • Guest cluster control plane logs show entries similar to the following.
The apiserver pod logs might report security policy violations related to port forwarding (the error below appears when the Carbon Black PortBlock security policy is applied to the kube-system namespace):
 
W0312 08:15:35.048653 1 dispatcher.go:161] rejected by webhook "resources.validating-webhook.cbcontainers": &errors.StatusError{ErrStatus:v1.Status{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ListMeta:v1.ListMeta{SelfLink:"", ResourceVersion:"", Continue: "", RemainingItemCount: (*int64) (nil)}, Status: "Failure", Message: "admission webhook \"resources.validating-webhook.cbcontainers\" denied the request: Blocked by Kubernetes security policy \"Kube-system\".\nViolated rule(s): \n Port forward\n", Reason:"", Details: (*v1.StatusDetails) (nil), Code:400}}

On the control plane node, `journalctl -xeu containerd` shows:
 
failure attempting to dial 127.0.0.1:2379 failed to execute portforward in network namespace "host": failed to dial 2379: dial tcp4 127.0.0.1:2379: connect: connection refused


Environment

VMware vSphere 7.0 with Tanzu
VMware vSphere 8.0 with Tanzu
VMware Tanzu Kubernetes Grid 1.x
VMware Tanzu Kubernetes Grid 2.x

Cause

Tanzu Kubernetes Grid and vSphere with Tanzu use an underlying open source component called Cluster API (CAPI). On the Management cluster or Supervisor cluster there is a controller pod called capi-kubeadm-control-plane-controller-manager. This controller requires permission on the workload/guest cluster to port forward to the etcd pods so that it can check etcd cluster health before adding or updating a control plane node. If the controller cannot get the etcd status, it will not proceed, and reconciliation of the control plane stalls indefinitely.

Resolution

Ensure that the VMware Carbon Black Cloud Container Operator or third-party security policy settings do not block port forwarding in the kube-system namespace of the guest cluster. Without port forwarding, the Cluster API controllers on the Management/Supervisor cluster cannot validate etcd cluster health, so control plane upgrades, scaling, and remediation cannot proceed.
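After adjusting the policy, one way to confirm that port forwarding into kube-system is allowed again is to attempt a manual port forward to an etcd pod with a kubeconfig for the guest cluster. The sketch below is an illustration under assumptions rather than a documented procedure: the function name check_port_forward, the local port 12379, and the PF_WAIT_SECONDS variable are hypothetical, and the check relies on `kubectl port-forward` printing a "Forwarding from" line once the forward is established.

```shell
# Hypothetical helper: attempts a port forward to an etcd pod in kube-system
# and reports whether it was established.
# Usage: check_port_forward <etcd-pod-name>
check_port_forward() {
  pod="$1"
  log="$(mktemp)"
  # Local port 12379 is an arbitrary choice for this sketch.
  kubectl -n kube-system port-forward "pod/${pod}" 12379:2379 >"$log" 2>&1 &
  pf_pid=$!
  # Give kubectl a moment to connect (5 seconds unless overridden).
  sleep "${PF_WAIT_SECONDS:-5}"
  if grep -q 'Forwarding from' "$log"; then
    echo "port-forward to ${pod} established"
    result=0
  else
    echo "port-forward to ${pod} failed:"
    cat "$log"
    result=1
  fi
  # Stop the background port forward and clean up the temporary log.
  kill "$pf_pid" 2>/dev/null || true
  rm -f "$log"
  return "$result"
}
```

For example, `check_port_forward etcd-foo-prod-control-plane-g7s2c` uses the pod name from the log message in the symptoms; substitute a real etcd pod name from the guest cluster. If the forward is still rejected, the captured kubectl error is printed so it can be compared with the webhook denial in the apiserver logs.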

Additional Information