CA cluster deployment stuck at starting the cluster - VCF Operations

Article ID: 435086

Products

VCF Operations

Issue/Introduction

When deploying a new VCF Operations cluster or expanding an existing one with High Availability (HA) or Continuous Availability (CA), you may experience the following symptoms:

  • The cluster status in the Administration UI remains stuck at Starting the Cluster during initial installation.
  • Individual nodes (Primary, Replica, or Data nodes) display a status of Waiting for Analytics.
  • Analytics logs (/storage/vcops/log/analytics-<UUID>.log) show errors such as: com.integrien.analytics.AnalyticsMain.createGemfireCache - Can not connect to gemfire: Problem starting up membership services. 

Environment

VCF Operations 9.x

Cause

This issue is typically caused by incomplete network connectivity between the cluster nodes. VCF Operations requires specific internal ports to be open between all nodes for GemFire membership services and data synchronization. This applies even when the nodes reside on different VLANs or subnets.

Resolution

Ensure the network environment and appliance configuration meet the following requirements:

1. Verify Port Connectivity

Work with your network team to ensure that the following ports are open between the nodes in the cluster (Note: Primary / Replica nodes are also classified as Data Nodes):

Source Node   Destination Node   Port(s)             Usage
Data Node     Primary / Replica  TCP 6061            Communication with Geode Locator
Data Node     Primary / Replica  TCP 20002 - 20010   Geode TCP inter-node failure detection & peer-to-peer communication
Data Node     Primary / Replica  TCP 5433            Communication with Postgres Central DB
Data Node     Data Node          TCP 443             HTTPS
Data Node     Data Node          TCP 10000           Communication with Geode server embedded in Analytics process
Data Node     Data Node          TCP 10002 - 10010   Geode TCP inter-node failure detection & peer-to-peer communication
Data Node     Witness Node       TCP 443             HTTPS
Witness Node  Data Node          TCP 443             HTTPS
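
As a rough way to spot-check the ports listed above from one node to another, the following Python sketch attempts a plain TCP connection to each port. It is not part of the product and does not replace the Netcheck utility; the function names and the grouping of ports into two lists are assumptions made for illustration. A failed connection can also mean that the service is not yet listening on the destination, so a failure here is a hint to investigate rather than proof of a firewall block.

```python
import socket

# TCP ports taken from the table above, grouped by destination role.
PRIMARY_REPLICA_PORTS = [6061, 5433] + list(range(20002, 20011))
DATA_NODE_PORTS = [443, 10000] + list(range(10002, 10011))


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


def failed_ports(host, ports, timeout=3.0):
    """Return the ports on host that could not be reached."""
    return [p for p in ports if not port_is_open(host, p, timeout)]
```

For example, calling failed_ports("primary.example.com", PRIMARY_REPLICA_PORTS) from a Data node would list the ports that did not accept a connection; the hostname here is a placeholder.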

2. Validate with Netcheck Utility

Run the internal Netcheck.py script from the appliance CLI to identify specific connection failures:

/usr/bin/python /usr/lib/vmware-vcopssuite/python/lib/Netcheck.py

If any ports return a FAILED status, the firewall or routing configuration must be adjusted.

3. Configure DNS and Reverse DNS

Ensure that all nodes have forward (A) and reverse (PTR) records configured in your DNS server. Missing PTR records often lead to the Waiting for Analytics state as nodes fail to resolve each other during the join process.
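
The forward and reverse lookups described above can be spot-checked with a short Python sketch such as the one below. It uses the system resolver through the standard socket module, so results may also reflect local hosts-file entries rather than the DNS server alone. The function name check_dns and its injectable resolve and reverse parameters are illustrative assumptions, not part of any VCF Operations tooling.

```python
import socket


def check_dns(fqdn, resolve=socket.gethostbyname, reverse=socket.gethostbyaddr):
    """Check that fqdn resolves to an IP and that the IP resolves back to fqdn.

    Returns (ok, detail), where detail describes the first problem found.
    """
    try:
        ip = resolve(fqdn)
    except OSError as exc:
        return False, f"forward lookup failed for {fqdn}: {exc}"
    try:
        ptr_name, aliases, _ = reverse(ip)
    except OSError as exc:
        return False, f"reverse lookup failed for {ip}: {exc}"
    # Compare case-insensitively and ignore a trailing dot.
    names = {n.rstrip(".").lower() for n in [ptr_name, *aliases]}
    if fqdn.rstrip(".").lower() not in names:
        return False, f"PTR for {ip} returns {ptr_name}, expected {fqdn}"
    return True, f"{fqdn} <-> {ip}"
```

Running check_dns once per node FQDN, from every node in the cluster, would show whether each node can resolve the others in both directions.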

4. Cross-Subnet Routing

If nodes are deployed across different IP networks or VLANs:

  • Confirm that latency between nodes is within supported limits (typically under 5 ms for HA and under 10 ms for CA).
  • Ensure that MTU settings are consistent across the network path to prevent packet fragmentation.
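
To get a rough feel for the two checks above, the following Python sketch times a TCP handshake to another node, which approximates one network round trip, and computes the ICMP payload size that exactly fills a given MTU. It is an approximation under stated assumptions: the function names are hypothetical, port 443 is used because the port table lists it between nodes, and the 28-byte overhead assumes IPv4 without header options.

```python
import socket
import statistics
import time


def tcp_connect_ms(host, port=443, samples=5, timeout=3.0):
    """Return the median TCP connect time to host:port in milliseconds."""
    timings = []
    for _ in range(samples):
        start = time.perf_counter()
        with socket.create_connection((host, port), timeout=timeout):
            elapsed = time.perf_counter() - start
        timings.append(elapsed * 1000.0)
    return statistics.median(timings)


def icmp_payload_for_mtu(mtu):
    """Largest ICMP echo payload that fits in one IPv4 packet of size mtu.

    Subtracts the 20-byte IPv4 header (no options) and the 8-byte ICMP header.
    """
    return mtu - 28
```

The payload size can then be used with a ping that forbids fragmentation, for example ping -M do -s 1472 <node> on Linux for a 1500-byte MTU. If that ping fails while a smaller payload succeeds, some hop on the path likely has a smaller MTU.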