VCF Management Services cluster or the VCF Automation cluster does not automatically recover after powering on VMs following a graceful shutdown

Article ID: 440862

Updated On:

Products

VCF Automation

Issue/Introduction

VMs power on successfully, but the VCF Management Services UI (Fleet Lifecycle) remains inaccessible more than 20 minutes after power-on.
VCF Automation UI shows components as unavailable or in error state.
Services appear to be at 0 replicas and do not scale up automatically.
Kubernetes nodes remain in Ready,SchedulingDisabled state indefinitely.

Environment

VCF Management Services Runtime 9.1.0.0
VCF Automation 9.1.0.0

Cause

During a graceful shutdown, the internal cluster management service may mark certain nodes for deletion. After power-on, these nodes remain cordoned (Ready,SchedulingDisabled), which prevents the automatic recovery service from completing. The recovery process waits for all nodes to be usable before it scales services back up, so the cluster bring-up deadlocks.

Resolution

Step 1: Power on the cluster VMs

Note: For VCF Automation, skip to sub-step 5 of this step. All node VMs act as both control plane and worker nodes, so the power-on order does not matter.

1. Open the VCF Operations UI and navigate to Build > Lifecycle > Components tab.
2. Click the VCF Services Runtime link.
3. Scroll down the page to the Nodes section.
4. Identify the Control Plane nodes.
5. In vCenter, navigate to the VMs and Templates view.
6. Locate the appropriate VM folder:
   - VCF Management Services cluster: vcf-management-services folder
   - VCF Automation cluster: vcf-automation folder
7. Select all Control Plane VMs in the cluster.
8. Right-click and select Power → Power On.
9. Once the Control Plane VMs are powered on, select the Worker Node VMs and power them on.

Step 2: Wait for automatic recovery

Allow 15-20 minutes for the automatic recovery process to complete. The platform includes a systemd service that automatically scales services back to their original replica counts.
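The wait can also be scripted as a poll with a deadline instead of a fixed timer. The following is a minimal sketch, not something shipped with the platform: wait_until and marker_gone are hypothetical helper names, the sketch needs the breakglass shell access and KUBECONFIG setting described in Step 4, and it assumes that the power-off-marker ConfigMap checked in Step 4 is removed once recovery completes.

```shell
#!/usr/bin/env bash
# wait_until TIMEOUT_SECONDS INTERVAL_SECONDS COMMAND [ARGS...]
# Re-runs COMMAND every INTERVAL_SECONDS until it succeeds or the timeout
# elapses. Returns 0 on success and 1 on timeout.
# Hypothetical helper, not part of VCF.
wait_until() {
  local timeout=$1 interval=$2
  shift 2
  local deadline=$(( SECONDS + timeout ))
  while true; do
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    if (( SECONDS >= deadline )); then
      return 1
    fi
    sleep "$interval"
  done
}

# marker_gone succeeds only when the API server answers and reports that the
# power-off-marker ConfigMap does not exist. An unreachable API server counts
# as "not recovered yet".
# Assumption: absence of the ConfigMap means recovery has finished.
marker_gone() {
  local out
  out=$(kubectl get configmap power-off-marker -n vmsp-platform \
          --ignore-not-found -o name 2>/dev/null) || return 1
  [ -z "$out" ]
}

# Example (commented out so that sourcing this file has no side effects):
# poll every 30 seconds for up to 20 minutes.
# wait_until 1200 30 marker_gone && echo "recovered" || echo "continue with Step 4"
```

The 20-minute limit mirrors the recovery window given above; the 30-second interval is an arbitrary choice.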

Step 3: Validate cluster recovery

Verify the cluster is operational by checking:

UI Access: Navigate to the Fleet Lifecycle Manager UI (for management clusters) or the VCF Automation services UI (for automation clusters). The UI should be accessible and responsive.
Service Status: If UI access is unavailable, engage Broadcom Technical Support.

Step 4: Troubleshoot if automatic recovery fails

If services do not recover automatically after 20 minutes, manual intervention is required.

Access the cluster using breakglass

Identify a control plane node IP address from vCenter
SSH to the control plane node as vmware-system-user
Log in with the breakglass password
Switch to root: sudo -i
Set kubeconfig: export KUBECONFIG=/etc/kubernetes/admin.conf

Validate pod status and service status

# Check node status - look for nodes stuck in SchedulingDisabled state
kubectl get nodes

# Check if power-off-marker still exists (indicates recovery not completed)
kubectl get configmap power-off-marker -n vmsp-platform

# Check pod status across all namespaces
kubectl get pods -A | grep -v Running
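On larger clusters the raw output can be long, so the checks above can be narrowed with small filters. This is a sketch under assumptions: cordoned_nodes and unhealthy_pods are hypothetical helper names, and the column positions assume the default kubectl table layout (STATUS is the second column of kubectl get nodes and the fourth column of kubectl get pods -A).

```shell
#!/usr/bin/env bash
# cordoned_nodes: reads `kubectl get nodes` output on stdin and prints the
# names of nodes whose STATUS column contains SchedulingDisabled.
cordoned_nodes() {
  awk 'NR > 1 && $2 ~ /SchedulingDisabled/ { print $1 }'
}

# unhealthy_pods: reads `kubectl get pods -A` output on stdin and prints
# "namespace/pod status" for pods that are neither Running nor Completed.
# Note: a pod that is Running but not yet ready is not reported here.
unhealthy_pods() {
  awk 'NR > 1 && $4 != "Running" && $4 != "Completed" { print $1 "/" $2, $4 }'
}

# Example usage (commented out so that sourcing this file has no side effects):
# kubectl get nodes   | cordoned_nodes
# kubectl get pods -A | unhealthy_pods
```

Unlike grep -v Running, the pod filter also skips Completed pods, which are normal for finished jobs.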

Manual recovery if in bad state

If nodes are stuck in the Ready,SchedulingDisabled state and the power-off-marker ConfigMap still exists, run the script attached to this article on the same control plane node, with the KUBECONFIG variable set as described above:

cluster-manual-recovery.sh

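Before running the script, the two preconditions above can be checked together. The following is a minimal sketch: should_run_manual_recovery is a hypothetical helper, not part of the attached script, and it only reports whether both conditions hold without invoking anything.

```shell
#!/usr/bin/env bash
# should_run_manual_recovery NODES_OUTPUT MARKER_PRESENT
#   NODES_OUTPUT   - text captured from `kubectl get nodes`
#   MARKER_PRESENT - "yes" if the power-off-marker ConfigMap exists, else "no"
# Succeeds only when the marker exists AND at least one node reports
# Ready,SchedulingDisabled. Hypothetical helper for illustration.
should_run_manual_recovery() {
  local nodes_output=$1 marker_present=$2
  [ "$marker_present" = "yes" ] || return 1
  printf '%s\n' "$nodes_output" |
    awk 'NR > 1 && $2 == "Ready,SchedulingDisabled" { found = 1 } END { exit !found }'
}

# Example wiring (commented out so that sourcing this file has no side effects):
# nodes=$(kubectl get nodes)
# if kubectl get configmap power-off-marker -n vmsp-platform >/dev/null 2>&1; then
#   marker=yes
# else
#   marker=no
# fi
# if should_run_manual_recovery "$nodes" "$marker"; then
#   echo "Preconditions met: run cluster-manual-recovery.sh"
# else
#   echo "Preconditions not met: do not run the script"
# fi
```

If either condition is missing, the symptoms likely have a different cause than the one described in this article.
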
Step 5: Validate final state

After manual recovery, verify:

All pods show Running status (pods that belong to finished jobs may show Completed): kubectl get pods -A
Fleet Lifecycle Manager UI is accessible (for management clusters)
VCF Automation services are accessible (for automation clusters)
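Scanning every pod across all namespaces by eye is error-prone, so a per-status count can make the final check quicker. This is a sketch only: pod_status_summary is a hypothetical helper, and it assumes the default column layout of kubectl get pods -A --no-headers, where STATUS is the fourth column.

```shell
#!/usr/bin/env bash
# pod_status_summary: reads `kubectl get pods -A --no-headers` output on stdin
# and prints one "COUNT STATUS" line per distinct pod status, sorted by status.
pod_status_summary() {
  awk '{ count[$4]++ } END { for (s in count) print count[s], s }' | sort -k2
}

# Example usage (commented out so that sourcing this file has no side effects):
# kubectl get pods -A --no-headers | pod_status_summary
```

A healthy result lists only Running and, where jobs have finished, Completed.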

Attachments

cluster-manual-recovery.sh
