NSX Manager "Cluster Degraded" alert with Manager service restarts.



Article ID: 422186


Updated On:

Products

VMware NSX

Issue/Introduction

  • Reports of services going down on the NSX Manager node are observed with entries similar to the below in /var/log/cbm/cbm.log:
    WARN EventReportProcessor-1-2 EventReportSyslogSender 75405 MONITORING [nsx@6876 comp="nsx-manager" entId="########-####-####-####-############" eventFeatureName="clustering" eventSev="warning" eventState="On" eventType="cluster_degraded" level="WARNING" subcomp="cbm"] Group member ########-####-####-####-############ of service MANAGER is down.

    WARN EventReportProcessor-1-1 EventReportSyslogSender 75405 MONITORING [nsx@6876 comp="nsx-manager" entId="########-####-####-####-############" eventFeatureName="clustering" eventSev="warning" eventState="On" eventType="cluster_degraded" level="WARNING" subcomp="cbm"] Group member ########-####-####-####-############ of service HTTP is down.
  • An NSX Manager Node reports "Member Down" in /var/log/proton/nsxapi.log:
    INFO NotificationThread CoordinationServiceImpl 77150 - [nsx@6876 comp="nsx-manager" level="INFO" subcomp="manager"] Member DOWN: ClusterNodeConfigModel/########-####-####-####-############, heartbeatCycleId: ########-####-####-####-############
  • Corfu disconnects can be observed in /var/log/corfu/corfu.#.#.log:
    | WARN  |              DetectionWorker-0 |         o.c.i.m.f.EpochHandler | updateTrailingLayoutServers: layout fetch from ##.##.##.##:9040 failed: org.corfudb.runtime.exceptions.NetworkException: Disconnected (LAYOUT_REQUEST) [endpoint=##.##.##.##:9040]
  • Reviewing /var/log/stats/ping.stats in the NSX Manager support bundle shows no ping loss.
  • A period of high CPU load (a value relative to the VM's vCPU count, not a percentage) is observed in /var/log/vmware/top-cpu.log, with the NSX Manager queuing tasks and no storage latency apparent:
    <Day> <Month> <Date> <Time> UTC 2025
    top - <time> up 4 days,  6:55,  0 users, load average: 68.71, 25.84, 10.80
    Tasks: 367 total,  10 running, 357 sleeping,   0 stopped,   0 zombie
    %Cpu(s): 73.3 us, 14.1 sy,  0.0 ni,  9.5 id,  0.0 wa,  0.0 hi,  3.1 si,  0.0 st
    KiB Mem : 49295268 total,  3117108 free, 29650160 used, 16528000 buff/cache
    KiB Swap:        0 total,        0 free,        0 used. 19004784 avail Mem

Note: The preceding log excerpts are only examples. Dates, times, and environment-specific values will vary depending on your environment.
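To confirm the pattern across a support bundle, the symptom lines above can be located with standard shell tools. A minimal sketch, using an abbreviated sample log in place of a real extracted bundle (paths match the article; the sample lines and the /tmp/bundle directory are illustrative only):

```shell
# Sketch only: create an abbreviated sample cbm.log matching the excerpts
# above. With a real support bundle, point the grep at its extracted root.
mkdir -p /tmp/bundle/var/log/cbm
cat > /tmp/bundle/var/log/cbm/cbm.log <<'EOF'
WARN ... eventType="cluster_degraded" ... Group member xxx of service MANAGER is down.
WARN ... eventType="cluster_degraded" ... Group member xxx of service HTTP is down.
EOF

# Count cluster_degraded events per affected service (MANAGER, HTTP, ...)
grep -h 'eventType="cluster_degraded"' /tmp/bundle/var/log/cbm/cbm.log* \
  | grep -o 'of service [A-Z]*' | sort | uniq -c
```

The same approach applies to the other paths above (e.g. grepping /var/log/corfu/ for "NetworkException: Disconnected") to correlate the events in time.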

Cause

This behavior may be encountered due to resource contention on the ESXi host on which the NSX Manager is deployed.
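Because Linux load average counts runnable tasks rather than a percentage, a quick way to interpret the numbers in top-cpu.log is to compare the 1-minute load average against the appliance's vCPU count. A minimal sketch (runs on any Linux guest; not an official NSX tool):

```shell
# Compare 1-minute load average to vCPU count. Load average is relative to
# the number of vCPUs, so a sustained ratio > 1.0 means tasks are queuing.
vcpus=$(nproc)
load1=$(cut -d' ' -f1 /proc/loadavg)
ratio=$(awk -v l="$load1" -v c="$vcpus" 'BEGIN { printf "%.2f", l / c }')
echo "load1=${load1} vcpus=${vcpus} ratio=${ratio}"
```

In the excerpt above, a 1-minute load average of 68.71 would far exceed the vCPU count of any NSX Manager form factor, consistent with CPU contention.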

Resolution

This is a condition that may occur in a VMware NSX environment when the host running the NSX Manager is under resource contention.

As outlined in the VMware NSX Cluster Requirements:

  • Place your NSX appliances on different hosts (using anti-affinity rules in VMware vCenter) so that a single host failure does not impact multiple Managers.
  • Place NSX appliances in different management subnets or in a shared management subnet. When using vSphere HA, a shared management subnet is recommended so that NSX appliances recovered by vSphere HA preserve their IP addresses.
  • Placing NSX appliances on shared storage is also recommended.
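The anti-affinity recommendation can be applied from the vSphere Client or scripted. A sketch using the open-source govc CLI (the cluster path and Manager VM names are hypothetical placeholders, and the command is guarded so it only runs when govc and vCenter credentials are actually configured):

```shell
# Hypothetical placeholders: adjust the cluster path and Manager VM names.
CLUSTER="/Datacenter/host/Compute-Cluster"

if command -v govc >/dev/null 2>&1 && [ -n "${GOVC_URL:-}" ]; then
  # Create a DRS anti-affinity rule keeping the Managers on separate hosts
  govc cluster.rule.create -name nsx-mgr-anti-affinity -enable -anti-affinity \
    -cluster "$CLUSTER" nsx-mgr-01 nsx-mgr-02 nsx-mgr-03
else
  echo "skipping: govc/GOVC_URL not configured (sketch only)"
fi
```

DRS must be enabled on the cluster for the rule to be enforced; verify placement afterwards in the vSphere Client or with `govc cluster.rule.ls`.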