VMware Aria Operations Product and Admin UIs Unresponsive After Login due to P2P SSL Handshake Race Condition

Article ID: 442540

Products

VCF Operations

Issue/Introduction

After a successful login with local admin credentials, the VMware Aria Operations Product UI and Admin UI become unresponsive.

Symptoms include:

  • Product UI: Hangs indefinitely, displaying the message "Redirecting to VMware Aria Operations Web UI..."

  • Admin UI: Hangs indefinitely, displaying the message "Retrieving cluster status..."

  • Users can successfully reach and authenticate at the login pages, but the interfaces fail to load further.

  • The production cluster displays a "loading" status.

Environment

  • VMware Aria Operations 8.18.5

  • Continuous Availability (CA) design enabled

Cause

The root cause is a race condition specific to the Peer-to-Peer (P2P) SSL handshake between cluster members, which results in a JVM deadlock.

Each occurrence of the deadlock leaves threads stuck, and these accumulate silently over time. Once enough threads are stuck (e.g., thousands over a span of several weeks), peer nodes crash, the cluster loses quorum, and the interfaces become unresponsive.
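
Because the stuck threads build up gradually, counting them in periodic JVM thread dumps is one way to watch the trend. The following Python sketch is illustrative only: it assumes a HotSpot jstack-style thread dump and assumes that the affected threads carry the fragment "P2P handshake reader" in their names. That fragment is taken from the wording of this article, not from a confirmed thread name, so adjust it to match what your own dumps show. The sample dump is made up for demonstration.

```python
import re
from collections import Counter

# Header line of a thread entry in a jstack-style dump: "thread name" #id ...
THREAD_HEADER = re.compile(r'^"(?P<name>[^"]*)"')
# State line that follows a header, e.g.
#    java.lang.Thread.State: BLOCKED (on object monitor)
THREAD_STATE = re.compile(r'^\s*java\.lang\.Thread\.State:\s*(?P<state>[A-Z_]+)')


def count_threads_by_state(dump_text, name_fragment):
    """Count threads whose name contains name_fragment, grouped by thread state."""
    counts = Counter()
    current_matches = False
    for line in dump_text.splitlines():
        header = THREAD_HEADER.match(line)
        if header:
            # A new thread entry starts; remember whether its name is of interest.
            current_matches = name_fragment in header.group("name")
            continue
        state = THREAD_STATE.match(line)
        if state and current_matches:
            counts[state.group("state")] += 1
            current_matches = False  # count each thread only once
    return counts


# Fabricated sample dump; thread names and ids are hypothetical.
SAMPLE_DUMP = '''\
"P2P handshake reader@1a2b" #101 daemon prio=5 os_prio=0 tid=0x01 nid=0x10 waiting for monitor entry
   java.lang.Thread.State: BLOCKED (on object monitor)

"P2P handshake reader@3c4d" #102 daemon prio=5 os_prio=0 tid=0x02 nid=0x11 waiting on condition
   java.lang.Thread.State: WAITING (parking)

"main" #1 prio=5 os_prio=0 tid=0x03 nid=0x12 runnable
   java.lang.Thread.State: RUNNABLE
'''

result = count_threads_by_state(SAMPLE_DUMP, "P2P handshake reader")
print(dict(result))
```

A count of matching threads in BLOCKED or WAITING state that keeps growing from one dump to the next would be consistent with the accumulation described above. A single dump on its own does not prove a deadlock.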

Resolution

Permanent Fix: This issue is resolved in VMware Cloud Foundation (VCF) Operations 9.1, which upgrades GemFire to version 10.1.3, where the fix is included.

Workaround: If an immediate upgrade to VCF Ops 9.1 is not possible, apply the following workaround to stabilize the environment.
Note: This workaround is recommended only if your environment's resource count fits within single-node sizing limits (e.g., ~3,000 resources).

  1. Disable Continuous Availability (CA) and shrink the cluster to a single node.

  2. Why this works: A single-node deployment has no inter-node P2P connections. Without these connections, the P2P handshake reader threads are never spawned, so the deadlock cannot occur.

  3. Sizing considerations: A 3,000-resource workload is well within single-node sizing limits for standard hardware (32 GB RAM, 8 vCPU, 15 GB JVM heap). For an added safety margin, the remaining node can be scaled up to a LARGE configuration.

  4. Revert: The trade-off for this workaround is the temporary loss of high availability. Re-enable CA once the environment is upgraded to VCF Ops 9.1.
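
Before shrinking the cluster, it may help to compare the environment's resource count against the sizing limit. The sketch below is a minimal, hypothetical pre-check. The 3,000 figure is only the example given in this article, which describes it as well within single-node limits, so it serves here as a conservative placeholder rather than an official limit. The function and constant names are illustrative and not part of any product API.

```python
# Placeholder threshold taken from the article's example (~3,000 resources).
# Confirm the actual single-node limit in the sizing guidelines for your version.
EXAMPLE_SINGLE_NODE_LIMIT = 3000


def workaround_fits(resource_count, single_node_limit=EXAMPLE_SINGLE_NODE_LIMIT):
    """Return True if resource_count is within the assumed single-node limit."""
    if resource_count < 0:
        raise ValueError("resource_count must be non-negative")
    return resource_count <= single_node_limit


# Example: a 2,800-resource environment fits the placeholder limit,
# while a 12,000-resource environment does not.
print(workaround_fits(2800))
print(workaround_fits(12000))
```

If the check fails against the real sizing limit, the single-node workaround is not recommended, and upgrading to VCF Ops 9.1 remains the resolution.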