VCF Management Services cluster or the VCF Automation cluster does not automatically recover after powering on VMs following a graceful shutdown

Article ID: 440862

Updated On:

Products

VCF Automation

Issue/Introduction

  • VMs power on successfully, but the VCF Management Services UI (Fleet Lifecycle) remains inaccessible after more than 20 minutes.
  • The VCF Automation UI shows components as unavailable or in an error state.
  • Services appear to be at 0 replicas and do not scale up automatically.
  • Kubernetes nodes remain in the Ready,SchedulingDisabled state indefinitely.
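
To confirm the last symptom from the command line, the following is a minimal Python sketch that lists nodes reporting Ready while marked unschedulable. It assumes kubectl access to the affected cluster with a valid kubeconfig, which this article does not describe. The function names and sample data are illustrative. The sketch relies on the standard Kubernetes Node fields spec.unschedulable and the Ready entry in status.conditions.

```python
import json
import subprocess


def cordoned_nodes(node_list: dict) -> list[str]:
    """Return names of nodes that are Ready but marked unschedulable."""
    names = []
    for node in node_list.get("items", []):
        # spec.unschedulable is omitted from the object when it is false.
        unschedulable = node.get("spec", {}).get("unschedulable", False)
        conditions = node.get("status", {}).get("conditions", [])
        ready = any(
            c.get("type") == "Ready" and c.get("status") == "True"
            for c in conditions
        )
        if ready and unschedulable:
            names.append(node["metadata"]["name"])
    return names


def fetch_node_list() -> dict:
    """Run `kubectl get nodes -o json`.

    Assumption: kubectl is installed and pointed at the affected cluster.
    """
    result = subprocess.run(
        ["kubectl", "get", "nodes", "-o", "json"],
        check=True,
        capture_output=True,
        text=True,
    )
    return json.loads(result.stdout)


# Hypothetical sample in the shape of `kubectl get nodes -o json` output.
SAMPLE_NODE_LIST = {
    "items": [
        {
            "metadata": {"name": "node-a"},
            "spec": {},
            "status": {"conditions": [{"type": "Ready", "status": "True"}]},
        },
        {
            "metadata": {"name": "node-b"},
            "spec": {"unschedulable": True},
            "status": {"conditions": [{"type": "Ready", "status": "True"}]},
        },
        {
            "metadata": {"name": "node-c"},
            "spec": {"unschedulable": True},
            "status": {"conditions": [{"type": "Ready", "status": "False"}]},
        },
    ]
}
```

Against a live cluster, cordoned_nodes(fetch_node_list()) would return the node names to look at. An empty list means no node is in the Ready,SchedulingDisabled state.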

Environment

  • VCF Management Services Runtime 9.1.0.0
  • VCF Automation 9.1.0.0

Cause

During a graceful shutdown, the internal cluster management service may mark certain nodes for deletion. After power-on, these nodes remain cordoned. The automatic recovery service waits for all nodes to be usable before it scales services back up, so it never completes and the cluster bring-up deadlocks.
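
The dependency described above can be illustrated with a toy model. This is not the platform's recovery code. The Node type and both function names are hypothetical and only mirror the stated rule that recovery waits until every node is usable.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    ready: bool
    cordoned: bool  # True when the node shows SchedulingDisabled


def recovery_can_proceed(nodes: list[Node]) -> bool:
    """Toy version of the gate: scale up only when every node is usable."""
    return all(n.ready and not n.cordoned for n in nodes)


def blocking_nodes(nodes: list[Node]) -> list[str]:
    """Names of the nodes that keep the gate closed."""
    return [n.name for n in nodes if not n.ready or n.cordoned]


# Hypothetical cluster state after power-on: one node is still cordoned.
EXAMPLE_NODES = [
    Node("node-a", ready=True, cordoned=False),
    Node("node-b", ready=True, cordoned=True),
]
```

In this model a single cordoned node keeps recovery_can_proceed false for as long as nothing uncordons it, which matches the indefinite wait described in the symptoms.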

Resolution

Step 1: Power on the cluster VMs

Note: For VCF Automation, skip to item 5 in the list below. All node VMs act as both Control Plane and Worker nodes, so the power-on order does not matter.

  1. Open the VCF Operations UI and navigate to the Build > Lifecycle Components tab.
  2. Click on the VCF Services Runtime link.
  3. Scroll down the page to the Nodes section.
  4. Identify the Control Plane nodes.
  5. In vCenter, navigate to the VMs and Templates view.
  6. Locate the appropriate VM folder:
    • VCF Management Services cluster: vcf-management-services folder
    • VCF Automation cluster: vcf-automation folder
  7. Select all Control Plane VMs in the cluster.
  8. Right-click and select Power → Power On.
  9. Once the Control Plane VMs are powered on, select the Worker Node VMs and power them on.
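
The ordering rule in the steps above can be sketched as a small Python helper. It only computes the order and does not call vCenter. The role labels "control-plane" and "worker" and the function name are assumptions for illustration. The cluster names reuse the VM folder names from the list above.

```python
def power_on_order(vms: dict[str, str], cluster: str) -> list[list[str]]:
    """Return batches of VM names to power on, first batch first.

    vms maps a VM name to its role, "control-plane" or "worker"
    (hypothetical labels). cluster is "vcf-management-services" or
    "vcf-automation".
    """
    if cluster == "vcf-automation":
        # Every node is both Control Plane and Worker, so one batch is enough.
        return [sorted(vms)] if vms else []
    control = sorted(name for name, role in vms.items() if role == "control-plane")
    workers = sorted(name for name, role in vms.items() if role == "worker")
    # Control Plane VMs come first; empty batches are dropped.
    return [batch for batch in (control, workers) if batch]
```

For a management cluster with VMs cp1, cp2 and w1, the helper returns the Control Plane batch followed by the Worker batch.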

Step 2: Wait for automatic recovery

Allow 15-20 minutes for the automatic recovery process to complete. The platform includes a systemd service that automatically scales services back to their original replica counts.
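
Rather than watching the clock, the wait can be scripted as a bounded polling loop. The following is a minimal sketch, not part of the product: the function name, the 30-second polling interval, and the is_recovered callback are assumptions. The 20-minute default matches the upper end of the window stated above.

```python
import time
from typing import Callable


def wait_for_recovery(
    is_recovered: Callable[[], bool],
    timeout_s: float = 20 * 60,
    interval_s: float = 30,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll is_recovered until it returns True or timeout_s elapses.

    Returns True on recovery and False on timeout. clock and sleep are
    injectable so the loop can be exercised without real waiting.
    """
    deadline = clock() + timeout_s
    while True:
        if is_recovered():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

The is_recovered callback is whatever check fits the environment, for example a function that returns True once the UI responds. A False result means the window passed without recovery.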

Step 3: Validate cluster recovery

Verify the cluster is operational by checking:

  1. UI Access: Navigate to the Fleet Lifecycle Manager UI (for management clusters) or the VCF Automation services UI (for automation clusters). The UI should be accessible and responsive.
  2. Service Status: If the UI is still inaccessible, engage Broadcom Technical Support.
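
The UI access check can also be approximated from a script. The sketch below uses only the Python standard library. The function name and any URL passed to it are assumptions, since this article does not list the UI addresses. A certificate that the client does not trust raises an error and is reported as unreachable.

```python
import urllib.request


def ui_is_reachable(url: str, timeout_s: float = 10.0, opener=urllib.request.urlopen) -> bool:
    """Return True if an HTTP GET to url succeeds with a 2xx or 3xx status.

    opener is injectable for testing; by default it is urllib.request.urlopen.
    """
    try:
        with opener(url, timeout=timeout_s) as response:
            return 200 <= response.status < 400
    except OSError:
        # URLError, HTTPError, timeouts and TLS errors are all OSError subclasses.
        return False
```

A True result only shows that the endpoint answers HTTP requests. It does not replace logging in and confirming that the UI is responsive.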