"Cluster Degraded" and "Cluster Unavailable" alarms cannot be cleared in a healthy environment

Article ID: 373240

Updated On:

Products

VMware NSX

Issue/Introduction

  • The NSX UI shows alarms for Cluster Degraded and/or Cluster Unavailable.
  • If the alarm is manually resolved, it reappears.
  • The NSX UI or CLI shows the cluster status as STABLE and UP:
    get cluster status
  • Log lines similar to the following appear on the NSX Manager in /var/log/phonehome-coordinator/phonehome-coordinator.log:

    WARN pool-88-thread-3 MonitoringServiceImpl 74793 MONITORING [nsx@6876 alarmId="########-####-####-####-############" alarmState="OPEN" comp="nsx-manager" entId="########-####-####-####-############" eventFeatureName="clustering" eventSev="MEDIUM" eventState="On" eventType="cluster_degraded" level="WARNING" nodeId="########-####-####-####-############" subcomp="monitoring"] Group member ########-####-####-####-############ of service ######## is down.
    INFO pool-88-thread-3 MonitoringEventInstanceProcessor 74793 MONITORING [nsx@6876 comp="nsx-manager" level="INFO" subcomp="monitoring"] Alarm for event clustering.cluster_degraded, node ########-####-####-####-############, entity id ########-####-####-####-############ does not exist , creating new alarm
    INFO pool-88-thread-3 MonitoringEventInstanceProcessor 74793 MONITORING [nsx@6876 comp="nsx-manager" level="INFO" subcomp="monitoring"] Context for alarm with eventid clustering.cluster_degraded and entity id ########-####-####-####-############ is {"group_type":"########","manager_node_id":"########-####-####-####-############"}

Note: The preceding log excerpts are only examples. Date, time, and environmental variables may vary depending on your environment.
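
To check whether a log file contains these entries, the key="value" fields inside the [nsx@6876 ...] block can be extracted with a short script. The following is a minimal sketch, not a supported tool. It assumes the log format matches the excerpts above. The function names are hypothetical, and the event type cluster_unavailable is an assumption, since only cluster_degraded appears in the excerpts.

```python
import re

# Matches the structured-data block shown in the excerpts: [nsx@6876 key="value" ...]
SD_BLOCK = re.compile(r'\[nsx@6876 (?P<body>[^\]]*)\]')
# Matches each key="value" pair inside that block
KV_PAIR = re.compile(r'(\w+)="([^"]*)"')

# "cluster_degraded" appears in the excerpts; "cluster_unavailable" is an assumed name.
CLUSTER_EVENT_TYPES = {"cluster_degraded", "cluster_unavailable"}


def parse_structured_data(line):
    """Return the key/value pairs of the [nsx@6876 ...] block, or {} if there is none."""
    match = SD_BLOCK.search(line)
    if not match:
        return {}
    return dict(KV_PAIR.findall(match.group("body")))


def is_cluster_alarm(line):
    """True if the line reports a clustering alarm of one of the expected event types."""
    fields = parse_structured_data(line)
    return (
        fields.get("eventFeatureName") == "clustering"
        and fields.get("eventType") in CLUSTER_EVENT_TYPES
    )


def scan(lines):
    """Yield the parsed fields of every line that reports a cluster alarm."""
    for line in lines:
        if is_cluster_alarm(line):
            yield parse_structured_data(line)
```

For example, passing the open log file to scan() and printing the alarmState and nodeId of each result shows which node keeps raising the alarm.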

Environment

VMware NSX

Cause

This is a false-positive alarm. An earlier event impacted the cluster and raised the alarm, and the NSX Manager node that was the CBM service leader at that time was later replaced.

Resolution

This is a known issue impacting VMware NSX.

Workaround:

  1. Identify the node that reported the alarm
    1. In the NSX UI, open Alarms
    2. Expand the alarm
    3. Note the node name in the "Reported by Node" field
  2. Restart the CBM service
    1. SSH as admin to the manager node identified in the previous step
    2. Confirm the CBM service is running:
      get service cluster_manager
    3. Restart the CBM (nsx-cluster-boot-manager) service:
      restart service cluster_manager
    4. Confirm the service has restarted and is running:
      get service cluster_manager
  3. Manually resolve the alarm via API
    If the alarm reappears after restarting both the service reported as down and the CBM service, follow this KB to manually suppress and/or resolve the alarm via the API:
    Manually resolve, acknowledge or suppress alarm on NSX Standby Global Manager
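
As an illustration of step 3, the sketch below builds the request URL for changing an alarm's status. It is a minimal example under stated assumptions, and the linked KB remains the authoritative procedure. The endpoint form (POST /api/v1/alarms/<alarm-id>?action=set_status), the new_status values, and the suppress_duration parameter are assumptions based on the NSX alarms API and should be verified against the API guide for your version. The function name and the example host name are hypothetical.

```python
from urllib.parse import quote, urlencode

# Assumed set of alarm states accepted by the set_status action.
VALID_STATUSES = {"OPEN", "ACKNOWLEDGED", "SUPPRESSED", "RESOLVED"}


def build_set_status_url(manager, alarm_id, new_status, suppress_duration=None):
    """Build the URL for a POST request that sets the status of one alarm.

    manager           -- NSX Manager FQDN or IP address (hypothetical example value below)
    alarm_id          -- the alarm ID, for example the alarmId value from the log line
    new_status        -- one of VALID_STATUSES
    suppress_duration -- required only when new_status is SUPPRESSED (assumed parameter)
    """
    if new_status not in VALID_STATUSES:
        raise ValueError(f"unsupported status: {new_status}")
    params = {"action": "set_status", "new_status": new_status}
    if new_status == "SUPPRESSED":
        if suppress_duration is None:
            raise ValueError("suppress_duration is required when suppressing an alarm")
        params["suppress_duration"] = suppress_duration
    # Percent-encode the alarm ID so it is safe to place in the URL path.
    return f"https://{manager}/api/v1/alarms/{quote(alarm_id, safe='')}?{urlencode(params)}"


# The resulting URL would be sent as an authenticated POST request.
url = build_set_status_url("nsx-mgr.example.com", "abc-123", "RESOLVED")
```

The script only builds the URL and sends nothing, so the request can be reviewed before it is issued with a tool such as curl.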