NSX Manager VIP is not accessible and individual nodes show alarms for Cluster Degraded/Unavailable.

Products

VMware NSX

Issue/Introduction

The NSX Manager cluster virtual IP (VIP) is intermittently or completely inaccessible through the UI, while the individual NSX Manager node IP addresses remain reachable.
The NSX Manager UI frequently raises alarms indicating that the cluster is "Degraded" or "Unavailable." These alarms often clear on their own.
/var/log/syslog shows the MANAGER service group flapping between down and up, for example:

2025-02-23T13:36:08.826Z NSX_Manager MONITORING [nsx@6876 comp="nsx-manager" entId="node_ID" eventFeatureName="clustering" eventSev="error" eventState="On" eventType="cluster_unavailable" level="ERROR" subcomp="cbm"] All group members ####3542-bebd-67ae-####-d2c6####d45,de0####6-a5e4-40##-97a7-27####6f26aa,1####6e9-9d81-####2-9057-d####cf2a7##of service MANAGER are down.
2025-02-23T13:36:13.826Z ambrcpnsxmg01 NSX 75472 MONITORING [nsx@6876 comp="nsx-manager" entId="node_ID" eventFeatureName="clustering" eventSev="error" eventState="On" eventType="cluster_unavailable" level="ERROR" subcomp="cbm"] All group members ####3542-bebd-67ae-####-d2c6####d45,de0####6-a5e4-40##-97a7-27####6f26aa,1####6e9-9d81-####2-9057-d####cf2a7##of service MANAGER are up.
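
To gauge how often and for how long the cluster flaps, the down/up pairs in syslog can be matched programmatically. The following is a minimal sketch, not an official tool: the regular expression is an assumption inferred from the two sample lines above, and the helper name flap_durations is hypothetical.

```python
import re
from datetime import datetime

# Pattern inferred from the sample cluster_unavailable entries above;
# it is an assumption, not a documented NSX log format.
EVENT_RE = re.compile(
    r'^(?P<ts>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{3})Z .*'
    r'eventType="cluster_unavailable".*'
    r'of service (?P<svc>\w+) are (?P<state>down|up)\.'
)


def flap_durations(lines):
    """Return (service, down_time, seconds_down) for each down/up pair."""
    down_since = {}  # service name -> timestamp of the first unmatched "down"
    flaps = []
    for line in lines:
        m = EVENT_RE.search(line)
        if not m:
            continue
        ts = datetime.strptime(m["ts"], "%Y-%m-%dT%H:%M:%S.%f")
        svc = m["svc"]
        if m["state"] == "down":
            # Keep the earliest "down" if several arrive before an "up".
            down_since.setdefault(svc, ts)
        elif svc in down_since:
            start = down_since.pop(svc)
            flaps.append((svc, start, (ts - start).total_seconds()))
    return flaps
```

The function accepts any iterable of lines, so it can be fed an open file handle for /var/log/syslog copied from a manager node. Many short entries clustered in time would be consistent with the flapping described in this article.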

Environment

VMware NSX

Cause

The NSX Proton service restarts periodically because of GMLE leadership safety violations. High disk I/O latency (confirmed by w_await statistics, fsync timers, and ESX SCSI device warnings) prevents GMLE from renewing its leases in time.

The GMLE violations are recorded in the Proton restart log (/var/log/proton/proton_restart.log):
INFO application-restartor restartor 2894022 - [nsx@6876 comp="nsx-manager" level="INFO" subcomp="manager"] ===== APPLICATION IS GOING RESTART (GMLE leadership safety violation handler triggered for groupType: mp) =====
INFO application-restartor restartor 2678549 - [nsx@6876 comp="nsx-manager" level="INFO" subcomp="manager"] ===== APPLICATION IS GOING RESTART (GMLE leadership safety violation handler triggered for groupType: mp) =====
INFO application-restartor restartor 2700052 - [nsx@6876 comp="nsx-manager" level="INFO" subcomp="manager"] ===== APPLICATION IS GOING RESTART (GMLE leadership safety violation handler triggered for groupType: mp) =====
INFO application-restartor restartor 2711738 - [nsx@6876 comp="nsx-manager" level="INFO" subcomp="manager"] ===== APPLICATION IS GOING RESTART (Disconnected from database) =====
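
To see which restart reasons dominate, the entries in proton_restart.log can be tallied. This is a rough sketch under the assumption that every restart entry follows the "APPLICATION IS GOING RESTART (reason)" shape shown above; the helper name restart_reasons is hypothetical.

```python
import re
from collections import Counter

# Captures the text inside the parentheses of a restart banner line.
# The pattern is inferred from the excerpt above and may not cover
# every message the restart log can contain.
REASON_RE = re.compile(r"APPLICATION IS GOING RESTART \((?P<reason>.*)\) =====")


def restart_reasons(lines):
    """Count Proton restarts per stated reason."""
    counts = Counter()
    for line in lines:
        m = REASON_RE.search(line)
        if m:
            counts[m["reason"]] += 1
    return counts
```

A count dominated by "GMLE leadership safety violation handler triggered" entries would match the cause described in this article.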
Because of the I/O latency, GMLE fails to renew its lease before the lease expires. /var/log/proton/nsxapi.log confirms this:

INFO GmleClientNonBlockingOpsThread-1 RenewingState 3539225 - [nsx@6876 comp="nsx-manager" level="INFO" s2comp="renew-state" subcomp="manager"] Failed to renew lease before lease expiration for service POLICY_SVC_POLICY_UC_DEPLOYMENT_CONFIG on member de####-a5e4-####97a7-####86f#### of group 1####dcf-35####edd-f7####67fe##, invoking the client safety violation handler
INFO GmleClientNonBlockingOpsThread-1 LeaseState 3539225 - [nsx@6876 comp="nsx-manager" level="INFO" s2comp="lease-state" subcomp="manager"] Stopping FSM, transitioning to StoppedState after safety violation for service POLICY_SVC_POLICY_UC_DEPLOYMENT_CONFIG on member d####-a5e4-####-97a7-####6f26## of group 1f###56-7###-35###-bedd-f7a###7fe##

INFO application-restartor ContainerConfigServiceImpl 3539225 - [nsx@6876 comp="nsx-manager" level="INFO" subcomp="manager"] Sending application restart request.
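
To list which services and members lose their leases, the renewal failures in nsxapi.log can be extracted as follows. This is a sketch only: the pattern is inferred from the sample message above, and the helper name lease_failures is hypothetical.

```python
import re
from collections import Counter

# Matches the "Failed to renew lease" message shown in the excerpt above.
# Service and member tokens are taken as runs of non-whitespace characters.
LEASE_RE = re.compile(
    r"Failed to renew lease before lease expiration "
    r"for service (?P<service>\S+) on member (?P<member>\S+)"
)


def lease_failures(lines):
    """Count lease-renewal failures per (service, member) pair."""
    counts = Counter()
    for line in lines:
        m = LEASE_RE.search(line)
        if m:
            counts[(m["service"], m["member"])] += 1
    return counts
```

Comparing the timestamps of these entries with storage latency spikes can help confirm whether the two are correlated.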
The latency is further confirmed by the ESX host vmkernel or vmkwarning logs (/var/run/log):
Wa(180) vmkwarning: cpu71 WARNING: ScsiDeviceIO: 1780: Device naa.60002ac00########0002a## performance has deteriorated. I/O latency increased from average value of 793 microseconds to 16834 microseconds.
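
To rank affected devices by how far latency rose above its average, the ScsiDeviceIO warnings can be parsed as in the sketch below. The pattern is inferred from the single sample line above; the helper name latency_warnings and the min_factor parameter are hypothetical.

```python
import re

# Matches the ScsiDeviceIO "performance has deteriorated" warning shown above.
LATENCY_RE = re.compile(
    r"Device (?P<device>\S+) performance has deteriorated\. "
    r"I/O latency increased from average value of (?P<avg>\d+) microseconds "
    r"to (?P<cur>\d+) microseconds"
)


def latency_warnings(lines, min_factor=1.0):
    """Return (device, avg_us, current_us, factor) for each matching warning.

    factor is current latency divided by the reported average; entries
    below min_factor are skipped.
    """
    results = []
    for line in lines:
        m = LATENCY_RE.search(line)
        if not m:
            continue
        avg, cur = int(m["avg"]), int(m["cur"])
        factor = cur / avg if avg else float("inf")
        if factor >= min_factor:
            results.append((m["device"], avg, cur, factor))
    return results
```

For the sample warning above, the reported latency rises from 793 to 16834 microseconds, roughly a 21-fold increase.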

Resolution

Investigate the storage backend (SAN/NAS/vSAN) for performance bottlenecks, hardware failures, or high-utilization periods (e.g., backup windows).
Repeated service crashes caused by I/O timeouts can occasionally lead to database corruption. Inspect /var/log/corfu/corfu.9000.log on all three manager nodes for errors such as the following:
<Time-stamp> | ERROR | WrapperSimpleAppMain | o.c.infrastructure.CorfuServer | CorfuServer: Server exiting due to unrecoverable error: org.corfudb.runtime.exceptions.DataCorruptionException: Checksum mismatch detected while trying to read file

Or the following error:
<Time-stamp> | ERROR | WrapperSimpleAppMain | o.c.infrastructure.CorfuServer | Failed starting server
<Time-stamp> | ERROR | WrapperSimpleAppMain | o.c.infrastructure.CorfuServer | Failed starting server org.corfudb.runtime.exceptions.DataCorruptionException: Can't parse metadata. Segment File: /config/corfu/log/297779.log. File size: 3872130. File position: 3871666
...
...
Caused by: com.google.protobuf.InvalidProtocolBufferException: Protocol message contained an invalid tag (zero).
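
The log inspection described above can be sketched as a simple scan for the exception names that appear in these excerpts. The marker list is an assumption drawn only from the samples shown here and is not an exhaustive set of corruption signatures; the helper name find_corruption is hypothetical.

```python
# Exception names taken from the log excerpts above. This list is an
# assumption for illustration and is not a complete set of signatures.
CORRUPTION_MARKERS = (
    "DataCorruptionException",
    "InvalidProtocolBufferException",
)


def find_corruption(lines):
    """Return (line_number, text) for each line containing a marker."""
    hits = []
    for lineno, line in enumerate(lines, start=1):
        if any(marker in line for marker in CORRUPTION_MARKERS):
            hits.append((lineno, line.rstrip("\n")))
    return hits
```

Run the scan against a copy of corfu.9000.log from each of the three manager nodes. An empty result from this sketch does not rule out corruption, so the log should still be reviewed manually.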
If the logs confirm data corruption, follow the recovery procedure in the Broadcom Knowledge Base:
Broadcom KB 303324: Corfu recovery procedure for NSX
If no Corfu corruption is found, gather the NSX support bundle and open a support request with Broadcom (see Creating and managing Broadcom cases).

Article ID: 432373
