VCF Operations CA cluster failed to come online after a network outage

Article ID: 418417

Updated On:

Products

VCF Operations/Automation (formerly VMware Aria Suite)

Issue/Introduction

  • The datacenter experienced a network outage, but the Continuous Availability (CA) cluster did not recover after networking was restored.
  • Both the primary node and the replica node believe that they are the primary node.

Environment

VCF Operations 9.x

Cause

  • During the network outage in the datacenter, connectivity between Region_A, Region_B, and the witness was lost.
    The witness correctly decided to keep Region_A online and mark Region_B offline:
    {"REGION_A":"true","REGION_B":"false"}
    However, the witness later reported that the shutdown of Region_B failed, as shown in /storage/log/vcops/log/casa/casa.log:
    ERROR casa [com.vmware.workflow.utils.DecisionMakerTask.run:###] - The shutdown of REGION_B timed out
  • This timeout occurred because Region_A attempted to update the cluster membership and cached roles documents to reflect Region_B’s offline status, but all communication attempts to the datacenter failed with network errors, logged in /storage/log/vcops/log/casa/casa.log:
    java.net.SocketTimeoutException: Connect timed out
    org.springframework.web.client.ResourceAccessException: No route to host

As a result, Region_B never received the update and continued running locally.
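
To check whether a casa.log contains the same signatures, the log can be scanned for the three messages quoted above. The following is a minimal Python sketch, not a supported tool: the function names and the labels in SIGNATURES are hypothetical, and the patterns are taken only from the excerpts in this article, so messages worded differently will not be counted.

```python
import re
from collections import Counter

# Patterns copied from the log excerpts in this article.
# The dictionary keys are hypothetical labels chosen for this sketch.
SIGNATURES = {
    "shutdown_timeout": re.compile(r"The shutdown of REGION_[AB] timed out"),
    "connect_timeout": re.compile(r"SocketTimeoutException: Connect timed out"),
    "no_route": re.compile(r"ResourceAccessException: No route to host"),
}


def count_signatures(lines):
    """Count how many lines match each known error signature."""
    counts = Counter()
    for line in lines:
        for label, pattern in SIGNATURES.items():
            if pattern.search(line):
                counts[label] += 1
    return counts


def scan_file(path="/storage/log/vcops/log/casa/casa.log"):
    """Scan a casa.log file; the default path is the one named in this article."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return count_signatures(handle)
```

Calling scan_file() on the witness node and on each region returns a Counter per node. A non-zero shutdown_timeout count together with network errors is consistent with the cause described here, but it is not proof on its own.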

  • The witness node initiated shutdowns for both regions instead of one because of a timing issue in Region_B’s shutdown completion: it did not wait for Region_B’s shutdown workflow to fully complete before proceeding to take Region_A offline. As a result, both regions were marked offline.
  • When connectivity was restored, both regions believed they were primary. This split-brain condition could not recover automatically and required a manual reboot. /storage/log/vcops/log/casa/casa.log shows entries similar to the following:
    • Witness initiated shutdown of Region_B first: 
      2025-10-08T06:41:37.#### INFO casa [...] SLICE_ONLINE_STATE: request: {"online_state":"OFFLINE", ...}
      2025-10-08T06:41:38.#### INFO casa [...] Taking slice offline: Putting OFFLINE CA region
      Region_B (###.###.###.###) received the shutdown request and began going offline.
    • /storage/log/vcops/log/casa/casa.log shows that the witness started the Region_A shutdown before Region_B confirmed completion:
      2025-10-08T06:45:41.### INFO casa [...] Successfully put all nodes OFFLINE
      2025-10-08T06:45:41.165Z INFO casa [...] Update CA Regions state: REGION_A: OFFLINE

      This shows that the witness completed Region_A’s shutdown while Region_B’s workflow was still in progress.

    • Region_B finally reported in /storage/log/vcops/log/casa/casa.log that its shutdown had completed, after Region_A had already been reported as offline:
      2025-10-08T06:45:53.#### INFO casa [...] CA-CLUSTER-OFFLINE-WORKFLOW: Nodes are successfully took Offline, update Region State to OFFLINE.
      2025-10-08T06:45:53.467Z INFO casa [...] Update CA Regions state: REGION_B: OFFLINE
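
The ordering described above can be reconstructed from the "Update CA Regions state" entries. The following Python sketch is illustrative only and makes several assumptions: each entry starts with a timestamp in the form shown in the excerpts (for example 2025-10-08T06:45:41.165Z), the state value is either ONLINE or OFFLINE, and the function names are hypothetical. Lines that do not fit these assumptions are skipped.

```python
import re
from datetime import datetime

# Assumed line layout, based only on the excerpts in this article:
# "<timestamp> INFO casa [...] Update CA Regions state: REGION_X: <STATE>"
STATE_RE = re.compile(
    r"^(\S+)\s.*Update CA Regions state: (REGION_[AB]): (ONLINE|OFFLINE)"
)


def region_state_changes(lines):
    """Return (timestamp, region, state) tuples sorted by time."""
    changes = []
    for line in lines:
        match = STATE_RE.search(line)
        if not match:
            continue
        try:
            stamp = datetime.strptime(match.group(1), "%Y-%m-%dT%H:%M:%S.%fZ")
        except ValueError:
            continue  # timestamp not in the assumed format; skip the line
        changes.append((stamp, match.group(2), match.group(3)))
    return sorted(changes)


def both_regions_offline_at(changes):
    """Return the first time both regions are recorded OFFLINE, else None."""
    state = {}
    for stamp, region, value in changes:
        state[region] = value
        if state.get("REGION_A") == "OFFLINE" and state.get("REGION_B") == "OFFLINE":
            return stamp
    return None
```

Passing the lines of casa.log to region_state_changes() gives the sequence of region state updates. If both_regions_offline_at() returns a timestamp, the log recorded both regions as offline at that point, which matches the sequence described in this article.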

Resolution

  1. Take a snapshot as per the KB article Snapshot Creation in VMware Aria Operations.
  2. Reboot the cluster as per the KB article Rebooting nodes in Aria Operations.