Supervisor Cluster is stuck in 'Configuring' due to hostPort collision

Article ID: 414674

Updated On:

Products

Tanzu Kubernetes Runtime

Issue/Introduction

Supervisor Cluster is stuck in a "Configuring" state, and the velero.vsphere.vmware.com service reports a ReconcileFailed error.
The error observed in the UI:

Reason: ReconcileFailed Message: kapp: Error: waiting on reconcile packageinstall/tanzu-cluster-api-bootstrap-kubeadm (packaging.carvel.dev/v1alpha1) namespace: svc-tkg-domain-c2010: Finished unsuccessfully (Reconcile failed: (message: kapp: Error: waiting on reconcile deployment/capi-kubeadm-bootstrap-controller-manager (apps/v1) namespace: svc-tkg-domain-c2010: Finished unsuccessfully (Deployment is not progressing: ProgressDeadlineExceeded (message: ReplicaSet "capi-kubeadm-bootstrap-controller-manager-6b8d76f7d7" has timed out progressing.)))) Service: velero.vsphere.vmware.com. Status: Running

The error message indicates that a deployment has timed out while progressing, and the kapp controller is unable to reconcile the tanzu-cluster-api-bootstrap-kubeadm package.

The port conflict can be identified by running the following command against the Supervisor Cluster's API:

kubectl get po -o yaml -A | grep -i hostport | sort | uniq -c | grep -E '9875|9441|8085'

Example output:

   4 hostPort: 8085
   3 hostPort: 9441
   3 hostPort: 9875
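The count alone does not show which pods hold a port. The following is a minimal sketch, not part of the official procedure, that filters a pod listing down to the pods declaring a given hostPort. The function name pods_on_hostport and the four-column input layout (namespace, name, node, comma-separated hostPorts) are assumptions made for illustration.

```shell
#!/bin/sh
# Hypothetical helper: list pods that declare a given hostPort.
# stdin: a header line followed by rows of "namespace name node hostPorts",
#        where hostPorts is a comma-separated list or <none>.
pods_on_hostport() {
  port="$1"
  awk -v port="$port" '
    NR == 1 { next }                     # skip the header row
    {
      n = split($4, ports, ",")          # a pod may declare several hostPorts
      for (i = 1; i <= n; i++)
        if (ports[i] == port) {          # exact match, so 18085 does not match 8085
          print $1 "/" $2 " on " $3
          break
        }
    }'
}
```

Assuming kubectl access to the Supervisor, a table in this layout can be produced with kubectl get pods -A -o custom-columns='NS:.metadata.namespace,NAME:.metadata.name,NODE:.spec.nodeName,HOSTPORTS:.spec.containers[*].ports[*].hostPort' and piped into pods_on_hostport 8085. Pods that are not yet scheduled show <none> in the node column.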

Environment

VMware Kubernetes Service

Cause

This issue occurs when two or more Supervisor Services attempt to bind to the same hostPort on the underlying ESXi nodes. In this case, both the capi-kubeadm-bootstrap-controller-manager and velero.vsphere.vmware.com deployments compete for hostPort 8085.
If velero and the existing capi-kubeadm pods are already using hostPort 8085 on all available nodes, the new capi-kubeadm pod will be unable to schedule. This scheduling deadlock prevents the deployment from completing, which eventually triggers a ProgressDeadlineExceeded error and causes the Supervisor Cluster to get stuck in a "Configuring" state.
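The deadlock condition described above can be checked with a short sketch: given the same kind of pod table (namespace, name, node, comma-separated hostPorts) and a list of node names, it prints the nodes where the port is still free. If it prints nothing, a new pod requesting that hostPort has nowhere to schedule. The function name nodes_with_free_port and the input layout are assumptions for illustration only.

```shell
#!/bin/sh
# Hypothetical helper: print the nodes on which no pod holds the given hostPort.
# usage: nodes_with_free_port PORT NODE...
# stdin: a header line followed by rows of "namespace name node hostPorts".
nodes_with_free_port() {
  port="$1"; shift
  table=$(cat)                           # keep the table so it can be scanned once per node
  for node in "$@"; do
    if ! printf '%s\n' "$table" | awk -v port="$port" -v node="$node" '
        NR > 1 && $3 == node {
          n = split($4, ports, ",")
          for (i = 1; i <= n; i++) if (ports[i] == port) found = 1
        }
        END { exit found ? 0 : 1 }'      # exit 0 when the port is taken on this node
    then
      echo "$node"                       # port is free on this node
    fi
  done
}
```

Node names can be obtained with kubectl get nodes -o jsonpath='{.items[*].metadata.name}' and passed as the trailing arguments.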

Resolution

To resolve the port conflict and unblock the Supervisor Cluster configuration, follow these steps:

1. To release hostPort 8085, scale the Velero backup-driver deployment to zero replicas.
  kubectl scale deploy/backup-driver -n velero --replicas=0
2. Restart the wcp service on the vCenter Server.
  service-control --restart wcp
3. After the Supervisor Cluster has finished reconciling and is running, scale the Velero backup-driver deployment back to its desired replica count.
  kubectl scale deploy/backup-driver -n velero --replicas=1
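Step 3 can be guarded so that backup-driver is scaled back up only after the previously blocked deployment has rolled out. This is a sketch under assumptions: the function name restore_backup_driver, the KUBECTL override variable, and the 300-second timeout are illustrative, and a successful rollout of capi-kubeadm-bootstrap-controller-manager is only one signal that the Supervisor Cluster has finished reconciling. Confirm the Supervisor state in the UI as well.

```shell
#!/bin/sh
# KUBECTL may be overridden, for example with a stub for a dry run (assumption).
KUBECTL="${KUBECTL:-kubectl}"

# Hypothetical helper: scale backup-driver back up once the blocked deployment is rolled out.
# $1: namespace of capi-kubeadm-bootstrap-controller-manager (svc-tkg-domain-c2010 in the error above)
# $2: desired backup-driver replica count (defaults to 1)
restore_backup_driver() {
  capi_ns="$1"
  replicas="${2:-1}"
  if $KUBECTL rollout status deploy/capi-kubeadm-bootstrap-controller-manager -n "$capi_ns" --timeout=300s; then
    $KUBECTL scale deploy/backup-driver -n velero --replicas="$replicas"
  else
    echo "capi-kubeadm-bootstrap-controller-manager is not ready; leaving backup-driver scaled down" >&2
    return 1
  fi
}
```

For example, restore_backup_driver svc-tkg-domain-c2010 1 waits up to five minutes for the rollout and then runs the scale command from step 3.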
