ESXi Upgrade Fails with "Cluster domain-c## is not ready yet" in VCF Operations Manager 9.x

Article ID: 433650

Products

  • VMware vSphere ESXi
  • VMware Cloud Foundation

Issue/Introduction

  • An ESXi host upgrade fails when run through VCF Operations Manager 9.x.
  • In the VCF Operations Manager UI, under Fleet Management > Lifecycle > VCF Instances > SDDC Manager > Workload Domain > Updates, the upgrade fails during the "Cluster image and compatibility checks" stage with the error: "Cluster domain-c## is not ready yet."



  • In the vCenter UI, under Cluster > Updates > Image, an error is displayed.



  • In the vCenter UI, under Inventory > Cluster > Configure > Namespaces > Supervisors, an error is displayed.



  • On vCenter, /var/log/vmware/vmware-updatemgr/vum-server/vmware-vum-server.log contains entries similar to the following:

    YYYY-MM-DDTHH:MM:SS error vmware-vum-server [######] [Originator@#### sub=IO.Http] User agent failed to send request; (null), N7Vmacore17CanceledException 
    YYYY-MM-DDTHH:MM:SS info vmware-vum-server [######] [Originator@#### sub=SsoClient] Successfully acquired token: SamlToken [subject=(Name: vpxd-extension-#########-####-####-####-############; Domain: vsphere.local), groups=[{Name: Users; Domain: vsphere.local}, {Name: SolutionUsers; Domain: vsphere.local}, {Name: SystemConfiguration.Administrators; Domain: vsphere.local}, {Name: ActAsUsers; Domain: vsphere.local}, {Name: ComponentManager.Administrators; Domain: vsphere.local}, {Name: AnalyticsService.Administrators; Domain: vsphere.local}, {Name: LicenseService.Administrators; Domain: vsphere.local}, {Name: ServiceProviderUsers; Domain: vsphere.local}, {Name: vStatsGroup; Domain: vsphere.local}, {Name: Everyone; Domain: vsphere.local}], delegationChain=[], startTime=YYYY-MM-DD HH:MM:SS, endTime=YYYY-MM-DD HH:MM:SS, renewCount=0, delegableCount=10, isSolution=true, type=Saml HOK]
    YYYY-MM-DDTHH:MM:SS info vmware-vum-server [#####] [Originator@#### sub=Telemetry] [TelemetryManager ###] Sending telemetry data: {"@type": "pman_error_report", "taskId": "########-####-####-####-############|########-####-####-####-############", "entityId": "########-####-####-####-############|domain-c##", "parentTaskId": "", "errorMessageId": "vcenter.wcp.cluster.notReady", "errorMessage": "Cluster domain-c## is not ready yet. ", "errorTime": "YYYY-MM-DDTHH:MM:SS"}
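
The telemetry entry above is the most direct indicator of this issue. As a minimal sketch (not an official tool), the following Python helper scans log lines for that entry and reports which clusters were flagged. It assumes the telemetry payload is valid JSON on a single line, as in the sample above; the function name `find_not_ready_clusters` is hypothetical.

```python
import json
import re

# Error identifier taken from the telemetry entry shown in this article.
NOT_READY_ID = "vcenter.wcp.cluster.notReady"

# Assumption: the JSON payload follows "Sending telemetry data: " on one line.
PAYLOAD_RE = re.compile(r"Sending telemetry data:\s*(\{.*\})\s*$")
CLUSTER_RE = re.compile(r"domain-c\d+")


def find_not_ready_clusters(lines):
    """Return the sorted cluster IDs reported as not ready in the log lines."""
    clusters = set()
    for line in lines:
        match = PAYLOAD_RE.search(line)
        if not match:
            continue
        try:
            payload = json.loads(match.group(1))
        except json.JSONDecodeError:
            continue  # skip truncated or malformed telemetry lines
        if payload.get("errorMessageId") != NOT_READY_ID:
            continue
        # entityId has the form "<uuid>|domain-c##"; errorMessage names the cluster too.
        text = f'{payload.get("entityId", "")} {payload.get("errorMessage", "")}'
        clusters.update(CLUSTER_RE.findall(text))
    return sorted(clusters)
```

Passing an open file handle for vmware-vum-server.log to `find_not_ready_clusters` returns the affected `domain-c##` identifiers, which can then be matched against the cluster shown in the vCenter UI.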

Environment

  • VCF 9.x.
  • ESXi 9.x.

Cause

The failure occurs because the Supervisor cluster (WCP) is in an unhealthy or incomplete configuration state.

Resolution

To resolve this issue, follow the steps below to return the Supervisor cluster to a healthy state:

Step 1: Identify the WCP error in the workload domain (WLD) vCenter:

  1. Log in to the vSphere Client.
  2. Navigate to Inventory > Cluster > Configure > Namespaces > Supervisors and review the status and error messages.

Step 2: Validate NSX networking in NSX Manager:

  1. Log in to the NSX Manager UI.
  2. Navigate to System > Fabric > Nodes > Edge Clusters and verify that the Supervisor is functional and that the Control Plane VMs have a valid network configuration.

Step 3: Resolve the "Invalid NSX Edge Cluster" error or the "Configuring" loop so that the Supervisor returns to the Running state. For more information, refer to vSphere Supervisor stuck in configuring status with error NSX Edge Cluster <cluster name> is invalid.

Step 4: Confirm that the Supervisor status has changed to Running or Healthy:

  1. Log in to the vSphere Client.
  2. Navigate to Inventory > Cluster > Configure > Namespaces > Supervisors and check the status.
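
Where several Supervisors need to be checked, the same verification can be scripted. The sketch below is illustrative only: it assumes the Supervisor summaries have already been retrieved as a list of dictionaries with `cluster` and `config_status` keys, and that a healthy Supervisor reports `RUNNING`. These field names and values are modelled on the vCenter namespace-management cluster summary and should be treated as assumptions for your vCenter version. The function name `supervisor_blockers` is hypothetical.

```python
# Assumption: a healthy Supervisor reports this configuration status.
HEALTHY_STATUS = "RUNNING"


def supervisor_blockers(summaries):
    """Return (cluster, status) pairs for Supervisors that are not running.

    `summaries` is a list of dicts, each with a "cluster" identifier
    (for example "domain-c12") and a "config_status" string.
    """
    blockers = []
    for summary in summaries:
        status = summary.get("config_status")
        if status != HEALTHY_STATUS:
            blockers.append((summary.get("cluster", "<unknown>"), status))
    return blockers


# Illustrative input only; real values come from your environment.
sample = [
    {"cluster": "domain-c8", "config_status": "RUNNING"},
    {"cluster": "domain-c12", "config_status": "CONFIGURING"},
]
for cluster, status in supervisor_blockers(sample):
    print(f"{cluster} is not ready: config_status={status}")
```

An empty result means no Supervisor in the supplied list is in a pending or error state, which matches the condition verified manually in this step.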

Step 5: Log in to VCF Operations Manager and restart the ESXi host upgrade.
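
Because a Supervisor can take some time to leave the "Configuring" state after remediation, it may be useful to wait for it to settle before retrying the upgrade. The following is a generic polling sketch, not a VMware-provided utility. `get_status` is a hypothetical callable that you supply and that returns the current Supervisor status as a string; the `"RUNNING"` value, timeout, and interval are assumptions to adjust for your environment.

```python
import time


def wait_until_running(get_status, timeout_s=1800, interval_s=30,
                       sleep=time.sleep, clock=time.monotonic):
    """Poll get_status() until it returns "RUNNING" or the timeout expires.

    Returns the last status observed, so the caller can tell whether the
    Supervisor reached "RUNNING" or was still in another state at timeout.
    """
    deadline = clock() + timeout_s
    while True:
        status = get_status()
        if status == "RUNNING":
            return status
        if clock() >= deadline:
            return status  # timed out; still not running
        sleep(interval_s)
```

Retry the upgrade only when the returned value is `"RUNNING"`. Any other value means the Supervisor still needs attention in Steps 1 through 3.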

Additional Information

  • In this scenario, the NSX Edge Cluster associated with the Control Plane VMs is invalid or misconfigured.
  • Fleet Manager requires all cluster components to be in a "Ready" state to pass pre-upgrade compatibility checks.
  • If the Supervisor configuration is in a pending or error state, the cluster is flagged as "Not Ready," which blocks all lifecycle operations.