Supervisor Cluster is stuck in 'Configuring' due to hostPort collision
search cancel

Supervisor Cluster is stuck in 'Configuring' due to hostPort collision

book

Article ID: 414674

calendar_today

Updated On:

Products

Tanzu Kubernetes Runtime

Issue/Introduction

  • Supervisor Cluster is stuck in a "Configuring" state, and the velero.vsphere.vmware.com service reports a ReconcileFailed error.

  • The error observed in the UI:

Reason: ReconcileFailed Message: kapp: Error: waiting on reconcile packageinstall/tanzu-cluster-api-bootstrap-kubeadm (packaging.carvel.dev/v1alpha1) namespace: svc-tkg-domain-c2010: Finished unsuccessfully (Reconcile failed: (message: kapp: Error: waiting on reconcile deployment/capi-kubeadm-bootstrap-controller-manager (apps/v1) namespace: svc-tkg-domain-c2010: Finished unsuccessfully (Deployment is not progressing: ProgressDeadlineExceeded (message: ReplicaSet "capi-kubeadm-bootstrap-controller-manager-6b8d76f7d7" has timed out progressing.)))) Service: velero.vsphere.vmware.com. Status: Running

The error message indicates that a deployment has timed out while progressing, and the kapp controller is unable to reconcile the tanzu-cluster-api-bootstrap-kubeadm package.

  • The port conflict can be identified by running the following command against the Supervisor Cluster's API:

    kubectl get po -o yaml -A | grep -i hostport | sort | uniq -c | grep -E '9875|9441|8085'

          4         hostPort: 8085
          3         hostPort: 9441
          3         hostPort: 9875

Environment

  • VMware Kubernetes Service

Cause

  • This issue occurs when two or more Supervisor Services attempt to bind to the same hostPort on the underlying ESXi nodes. In this case, both the capi-kubeadm-bootstrap-controller-manager and velero.vsphere.vmware.com deployments compete for hostPort 8085.

  • If velero and the existing capi-kubeadm pods are already using hostPort 8085 on all available nodes, the new capi-kubeadm pod will be unable to schedule. This scheduling deadlock prevents the deployment from completing, which eventually triggers a ProgressDeadlineExceeded error and causes the Supervisor Cluster to get stuck in a "Configuring" state.

Resolution

  • To resolve the port conflict and unblock the Supervisor Cluster configuration, follow the below steps:
    1. To release hostPort 8085, scale the Velero backup-driver deployment to zero replicas.
      kubectl scale deploy/backup-driver -n velero --replicas=0

    2. Restart the wcp service on the vCenter Server.
      service-control --restart wcp

    3. After the Supervisor Cluster has finished reconciling and is running, scale the Velero backup-driver deployment back to its desired replica count.
      kubectl scale deploy/backup-driver -n velero --replicas=1