"Cluster Degraded" and "Cluster Unavailable" alarms cannot be cleared in a healthy environment

Article ID: 373240

Products

VMware NSX

Issue/Introduction

The NSX UI shows alarms for Cluster Degraded and/or Cluster Unavailable.
The cluster may report as degraded because one or more group members report CLUSTER_MANAGER as down.
If the alarm is manually resolved, it reappears.
The NSX UI or CLI shows that the cluster status is STABLE and UP:
```
get cluster status
```
Log lines similar to the following appear on the NSX Manager in /var/log/phonehome-coordinator/phonehome-coordinator.log:

WARN pool-88-thread-3 MonitoringServiceImpl 74793 MONITORING [nsx@6876 alarmId="########-####-####-####-############" alarmState="OPEN" comp="nsx-manager" entId="########-####-####-####-############" eventFeatureName="clustering" eventSev="MEDIUM" eventState="On" eventType="cluster_degraded" level="WARNING" nodeId="########-####-####-####-############" subcomp="monitoring"] Group member ########-####-####-####-############ of service ######## is down.
INFO pool-88-thread-3 MonitoringEventInstanceProcessor 74793 MONITORING [nsx@6876 comp="nsx-manager" level="INFO" subcomp="monitoring"] Alarm for event clustering.cluster_degraded, node ########-####-####-####-############, entity id ########-####-####-####-############ does not exist , creating new alarm
INFO pool-88-thread-3 MonitoringEventInstanceProcessor 74793 MONITORING [nsx@6876 comp="nsx-manager" level="INFO" subcomp="monitoring"] Context for alarm with eventid clustering.cluster_degraded and entity id ########-####-####-####-############ is {"group_type":"########","manager_node_id":"########-####-####-####-############"}

Note: The preceding log excerpts are only examples. Dates, times, and environment-specific values will differ in your environment.
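To match a log entry against the alarm shown in the UI, it can help to pull the structured fields out of a log line. The following is a minimal Python sketch, assuming the `[nsx@6876 key="value" ...]` layout seen in the excerpts above. The function names `parse_nsx_fields` and `is_clustering_alarm` are hypothetical helpers, not part of any NSX tooling.

```python
import re

# Captures the contents of the first [nsx@6876 ...] structured-data block.
BLOCK_RE = re.compile(r'\[nsx@6876 ([^\]]*)\]')
# Captures key="value" pairs inside that block.
FIELD_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_nsx_fields(line):
    """Return the key/value pairs of the structured-data block, or {} if absent."""
    block = BLOCK_RE.search(line)
    if block is None:
        return {}
    return dict(FIELD_RE.findall(block.group(1)))


def is_clustering_alarm(line):
    """True if the line carries an open alarm for the 'clustering' feature."""
    fields = parse_nsx_fields(line)
    return (
        fields.get("eventFeatureName") == "clustering"
        and fields.get("alarmState") == "OPEN"
    )
```

Running `is_clustering_alarm` over each line of phonehome-coordinator.log gives a quick view of how often the alarm is being re-raised; `parse_nsx_fields(line).get("nodeId")` returns the reporting node ID for comparison with the UI.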

Environment

VMware NSX

Cause

This is a false-positive alarm. An earlier event affected the cluster and raised the alarm. The NSX Manager node that was the CBM service leader at the time of the alarm was later replaced.

Resolution

This is a known issue impacting VMware NSX.

Workaround:

Identify the node that reported the alarm
1. In the NSX UI, open Alarms.
2. Expand the alarm.
3. Find the "Reported by Node" field and note the node name.
Restart the CBM service
1. SSH as admin to the manager node identified in the previous step.
2. Confirm the CBM service is running:
```
get service cluster_manager
```
3. Restart the CBM (nsx-cluster-boot-manager) service:
```
restart service cluster_manager
```
4. Confirm the service has restarted and is running:
```
get service cluster_manager
```
Manually resolve the alarm via API
If the alarm reappears after you have restarted both the service reported as down and CBM, follow this KB to manually suppress and/or resolve the alarm via the API:
Manually resolve, acknowledge or suppress alarm on NSX Standby Global Manager
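
For orientation only, the sketch below shows roughly what such an API call could look like in Python. It builds the request without sending it. The endpoint path `/api/v1/alarms/<alarm-id>` and the `action=set_status` / `new_status` query parameters are assumptions based on the NSX Manager alarms API and must be verified against the API guide for your NSX version and against the linked KB (the path may differ on a Global Manager). The helper name `build_set_status_request` is hypothetical.

```python
import base64
import urllib.parse
import urllib.request


def build_set_status_request(manager, alarm_id, new_status, user, password):
    """Build (but do not send) a POST that changes an alarm's status.

    Assumed endpoint: POST /api/v1/alarms/<alarm-id>?action=set_status&new_status=...
    Verify the path and parameters for your NSX version before use.
    """
    query = urllib.parse.urlencode(
        {"action": "set_status", "new_status": new_status}
    )
    safe_id = urllib.parse.quote(alarm_id, safe="")
    url = f"https://{manager}/api/v1/alarms/{safe_id}?{query}"
    # HTTP basic authentication header built from the supplied credentials.
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return urllib.request.Request(
        url, method="POST", headers={"Authorization": f"Basic {token}"}
    )
```

The returned object can be inspected (`req.full_url`, `req.get_method()`) before it is passed to `urllib.request.urlopen`, which makes it easy to confirm the target alarm ID and status before changing anything.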
