NSX Manager VIP is not accessible and individual nodes show alarms for Cluster Degraded/Unavailable

Article ID: 432373

Products

VMware NSX

Issue/Introduction

  • The NSX Manager cluster virtual IP (VIP) is intermittently or completely inaccessible through the UI, while the individual NSX Manager node IP addresses remain reachable.
  • The NSX Manager UI raises frequent alarms reporting the cluster as "Degraded" or "Unavailable." These alarms often clear on their own.
  • /var/log/syslog shows the manager service flapping intermittently between down and up:

    2025-02-23T13:36:08.826Z NSX_Manager MONITORING [nsx@6876 comp="nsx-manager" entId="node_ID" eventFeatureName="clustering" eventSev="error" eventState="On" eventType="cluster_unavailable" level="ERROR" subcomp="cbm"] All group members ####3542-bebd-67ae-####-d2c6####d45,de0####6-a5e4-40##-97a7-27####6f26aa,1####6e9-9d81-####2-9057-d####cf2a7##of service MANAGER are down.

    2025-02-23T13:36:13.826Z ambrcpnsxmg01 NSX 75472 MONITORING [nsx@6876 comp="nsx-manager" entId="node_ID" eventFeatureName="clustering" eventSev="error" eventState="On" eventType="cluster_unavailable" level="ERROR" subcomp="cbm"] All group members ####3542-bebd-67ae-####-d2c6####d45,de0####6-a5e4-40##-97a7-27####6f26aa,1####6e9-9d81-####2-9057-d####cf2a7##of service MANAGER are up.
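To gauge how often and for how long the MANAGER group flaps, the down/up events above can be paired with a short script. This is a minimal sketch, not an NSX tool: the function name `find_flaps` and the pairing logic are assumptions for illustration, and the regular expression relies only on the fields visible in the sample lines (the leading timestamp, `eventType="cluster_unavailable"`, and the trailing "are down." / "are up." text).

```python
import re
from datetime import datetime

# Hypothetical pattern built from the sample syslog lines in this article:
# leading ISO timestamp, the cluster_unavailable event type, and the
# trailing "of service <NAME> are down." / "are up." text.
EVENT_RE = re.compile(
    r'^(?P<ts>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{3})Z .*'
    r'eventType="cluster_unavailable".*'
    r'of service (?P<svc>\w+) are (?P<state>down|up)\.'
)


def find_flaps(lines):
    """Pair each 'down' event with the next 'up' event for the same service.

    Returns a list of (service, down_time, up_time, seconds_down) tuples.
    """
    down_since = {}
    flaps = []
    for line in lines:
        m = EVENT_RE.search(line.strip())
        if not m:
            continue
        ts = datetime.strptime(m["ts"], "%Y-%m-%dT%H:%M:%S.%f")
        svc = m["svc"]
        if m["state"] == "down":
            # Keep the earliest unmatched 'down' timestamp per service.
            down_since.setdefault(svc, ts)
        elif svc in down_since:
            start = down_since.pop(svc)
            flaps.append((svc, start, ts, (ts - start).total_seconds()))
    return flaps
```

Passing the lines of a copy of /var/log/syslog to `find_flaps` yields one tuple per outage. Many short outages clustered around the same times of day would be consistent with the periodic restarts described under Cause.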

Environment

VMware NSX

Cause

  • NSX Proton services restart periodically because of GMLE leadership safety violations. High disk I/O latency (confirmed by w_await statistics, fsync timers, and ESX SCSI device warnings) prevents timely lease renewals.

    The Proton restart log (/var/log/proton/proton_restart.log) records the GMLE violations:
    INFO application-restartor restartor 2894022 - [nsx@6876 comp="nsx-manager" level="INFO" subcomp="manager"] ===== APPLICATION IS GOING RESTART (GMLE leadership safety violation handler triggered for groupType: mp) =====
    INFO application-restartor restartor 2678549 - [nsx@6876 comp="nsx-manager" level="INFO" subcomp="manager"] ===== APPLICATION IS GOING RESTART (GMLE leadership safety violation handler triggered for groupType: mp) =====
    INFO application-restartor restartor 2700052 - [nsx@6876 comp="nsx-manager" level="INFO" subcomp="manager"] ===== APPLICATION IS GOING RESTART (GMLE leadership safety violation handler triggered for groupType: mp) =====
    INFO application-restartor restartor 2711738 - [nsx@6876 comp="nsx-manager" level="INFO" subcomp="manager"] ===== APPLICATION IS GOING RESTART (Disconnected from database) =====

  • Because of the I/O latency, GMLE fails to renew its lease before the lease expires. /var/log/proton/nsxapi.log confirms this:

    INFO GmleClientNonBlockingOpsThread-1 RenewingState 3539225 - [nsx@6876 comp="nsx-manager" level="INFO" s2comp="renew-state" subcomp="manager"] Failed to renew lease before lease expiration for service POLICY_SVC_POLICY_UC_DEPLOYMENT_CONFIG on member de####-a5e4-####97a7-####86f#### of group 1####dcf-35####edd-f7####67fe##, invoking the client safety violation handler

    INFO GmleClientNonBlockingOpsThread-1 LeaseState 3539225 - [nsx@6876 comp="nsx-manager" level="INFO" s2comp="lease-state" subcomp="manager"] Stopping FSM, transitioning to StoppedState after safety violation for service POLICY_SVC_POLICY_UC_DEPLOYMENT_CONFIG on member d####-a5e4-####-97a7-####6f26## of group 1f###56-7###-35###-bedd-f7a###7fe##


    INFO application-restartor ContainerConfigServiceImpl 3539225 - [nsx@6876 comp="nsx-manager" level="INFO" subcomp="manager"] Sending application restart request.

  • The ESX host vmkernel or vmkwarning logs (/var/run/log) further confirm the latency:
    Wa(180) vmkwarning: cpu71 WARNING: ScsiDeviceIO: 1780: Device naa.60002ac00########0002a## performance has deteriorated. I/O latency increased from average value of 793 microseconds to 16834 microseconds.
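The evidence listed in this section can be summarized with a small script when the logs are long. The sketch below is illustrative only: the function names `restart_reasons` and `latency_warnings` are hypothetical, and both patterns are derived solely from the sample lines quoted above, so they may need adjusting for other log variants.

```python
import re
from collections import Counter

# Reason text inside "APPLICATION IS GOING RESTART (...)" as it appears
# in the proton_restart.log samples above.
REASON_RE = re.compile(r"APPLICATION IS GOING RESTART \((?P<reason>.+)\) =====")

# ScsiDeviceIO latency warning as it appears in the vmkwarning sample above.
LATENCY_RE = re.compile(
    r"Device (?P<dev>\S+) performance has deteriorated\. "
    r"I/O latency increased from average value of (?P<avg>\d+) microseconds "
    r"to (?P<cur>\d+) microseconds"
)


def restart_reasons(lines):
    """Count Proton restarts per stated reason."""
    counts = Counter()
    for line in lines:
        m = REASON_RE.search(line)
        if m:
            counts[m["reason"]] += 1
    return counts


def latency_warnings(lines):
    """Return (device, average_us, current_us, ratio) per latency warning."""
    results = []
    for line in lines:
        m = LATENCY_RE.search(line)
        if m:
            avg, cur = int(m["avg"]), int(m["cur"])
            ratio = cur / avg if avg else float("inf")
            results.append((m["dev"], avg, cur, ratio))
    return results
```

For the vmkwarning sample above, the reported latency rises from 793 to 16834 microseconds, roughly a 21-fold increase. A restart count dominated by the GMLE reason, together with large latency ratios on the datastore backing the NSX Manager VMs, would support the storage-latency diagnosis.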

Resolution

  • Investigate the storage backend (SAN/NAS/vSAN) for performance bottlenecks, hardware failures, or high-utilization periods (e.g., backup windows).
  • Repeated service crashes caused by I/O timeouts can occasionally lead to database corruption. Inspect /var/log/corfu/corfu.9000.log on all three manager nodes for errors such as the following:

    <Time-stamp> | ERROR | WrapperSimpleAppMain | o.c.infrastructure.CorfuServer | CorfuServer: Server exiting due to unrecoverable error: org.corfudb.runtime.exceptions.DataCorruptionException: Checksum mismatch detected while trying to read file

    Or:

    <Time-stamp> | ERROR | WrapperSimpleAppMain | o.c.infrastructure.CorfuServer | Failed starting server
    <Time-stamp> | ERROR | WrapperSimpleAppMain | o.c.infrastructure.CorfuServer | Failed starting server org.corfudb.runtime.exceptions.DataCorruptionException: Can't parse metadata. Segment File: /config/corfu/log/297779.log. File size: 3872130. File position: 3871666
    ...
    ...
    Caused by: com.google.protobuf.InvalidProtocolBufferException: Protocol message contained an invalid tag (zero).

  • If the logs confirm data corruption, follow the recovery procedure in the Broadcom Knowledge Base:
    Broadcom KB 303324: Corfu recovery procedure for NSX
  • If no Corfu corruption is found, gather the NSX support bundle and open a support request with Broadcom (see "Creating and managing Broadcom cases").
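
The Corfu log check in the steps above can be repeated consistently across all three manager nodes with a short script. This is a sketch under stated assumptions: the function name `corruption_hits` is hypothetical, and the signature list contains only the strings quoted in this article, so an empty result does not prove the database is healthy.

```python
# Corruption signatures taken verbatim from the error samples in this article.
SIGNATURES = (
    "DataCorruptionException",
    "Checksum mismatch detected while trying to read file",
    "Can't parse metadata",
    "Protocol message contained an invalid tag (zero)",
)


def corruption_hits(lines):
    """Return (line_number, [matched signatures]) for each suspicious line."""
    hits = []
    for number, line in enumerate(lines, start=1):
        matched = [sig for sig in SIGNATURES if sig in line]
        if matched:
            hits.append((number, matched))
    return hits
```

Running `corruption_hits` over the lines of each node's copy of /var/log/corfu/corfu.9000.log and comparing the results shows which nodes, if any, need the recovery procedure.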