Harbor pods stuck in ProviderFailed state after ESXi host failure

Article ID: 441489

Updated On:

Products

VMware vSphere Kubernetes Service

Issue/Introduction

  • Harbor pods are stuck in an error state within the Supervisor cluster following a hardware failure on an ESXi host.

  • Affected pods remain in a ProviderFailed state, as shown when listing the Harbor namespace pods scheduled on the failed host:

    root@<Supervisor> [ ~ ]# k get pods -n svc-harbor-domain-c# -o wide | grep <hostname>
    harbor-core-####         0/1     ProviderFailed            0       days     <none>         <hostname>   <none>     <none>

Environment

vSphere Kubernetes Service

Cause

  • The Supervisor cluster is experiencing a scheduling deadlock because the Harbor pod VMs are registered to the failed physical host.
  • When an ESXi host suffers a sudden hardware failure or disconnects unexpectedly, the control plane cannot cleanly migrate or terminate the pod VMs running on that host. Because vCenter still tracks these pod VMs as registered to the disconnected host, the Supervisor cluster cannot reschedule them. The pods remain pinned to the failed host because the underlying infrastructure provider is unavailable to execute runtime operations.

Resolution

  1. Disconnect and remove the failed ESXi host from the vSphere cluster.
  2. Manually unregister any orphaned or stale Harbor pod VMs from the vCenter inventory.
    Refer to: Manually removing stale or orphaned virtual machines from vCenter Server
  3. Forcefully delete the stuck Harbor pods within the Supervisor cluster.
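
Step 3 can be sketched as follows. This is a minimal illustration, not a command set taken from this article: the helper name providerfailed_delete_cmds is hypothetical, and it assumes the standard `kubectl get pods -o wide` column layout, where STATUS is the third column. It only prints one force-delete command per ProviderFailed pod and deletes nothing, so the output can be reviewed before it is run.

```shell
# Hypothetical helper (not part of the product): reads `kubectl get pods -o wide`
# output on stdin and prints one force-delete command for each pod whose
# STATUS column (third field) is ProviderFailed. It does not delete anything.
providerfailed_delete_cmds() {
  ns="$1"   # namespace the pods belong to, e.g. the Harbor service namespace
  awk -v ns="$ns" '
    $3 == "ProviderFailed" {
      printf "kubectl delete pod %s -n %s --grace-period=0 --force\n", $1, ns
    }'
}

# Example usage on the Supervisor (the namespace is a placeholder):
#   kubectl get pods -n <harbor-namespace> -o wide \
#     | providerfailed_delete_cmds <harbor-namespace>
```

The --grace-period=0 --force flags tell kubectl to remove the pod object from the API server without waiting for confirmation from the node, which is why the stale pod VMs should already be unregistered (step 2) before the printed commands are run.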