NSX Manager cluster intermittently goes into degraded state and NSX UI becomes inaccessible with error code 101


Article ID: 405048


Updated On:

Products

VMware NSX

Issue/Introduction

  • The following error is shown when accessing the NSX Manager UI:
         Some appliances components are not functioning properly.
         Components health: MANAGER:DOWN, UI:DOWN
         Error code: 101

  • Running the 'get cluster status' command shows HTTPS and MANAGER as DOWN, intermittently flapping among the nodes in the cluster.
  • A rolling reboot of the NSX Managers will not resolve the issue.
  • CPU usage on the NSX Manager VMs is very high. Running the top command on a Manager node shows that the uproton service is consuming most of the CPU:



  • Instances of the Proton service running out of memory may be seen in /var/log/proton/proton-tomcat-wrapper.log:
    STATUS | wrapper  | <Timestamp> | The JVM has run out of memory.  Requesting thread dump.

  • A proton_oom.hprof heap dump file can be found in /image/core on each MP node.
  • Many dirty object markers created by the edge cluster on the MP can be seen in /var/log/proton/nsxapi.log:
    INFO workerTaskExecutor-1-34 DirtyObjectMarkerService 2796604 POLICY [nsx@6876 comp="nsx-manager" level="INFO" subcomp="manager"] Created dirty object marker per provider:DirtyObjectMarkerPerProvider/#######-####-####-####-796217de54f3;/infra/sites/default/enforcement-points/default/edge-clusters/########-####-####-####-678db13be4f9;
  • In /var/log/corfu/corfu-compactor-audit.log you can see that compacting the DOM (Dirty Object Marker) table takes a long time and that the number of entries may be in the millions. The UUID of the DOM table is 'a8763670-7e7d-3d68-9a9e-d2df7f778695'. Tracking these log messages over time shows the number of entries continually increasing, which means the DOM table keeps growing (a parsing sketch for tracking this is shown after this list):

    | INFO  |              Cmpt-chkpter-9000 |   o.c.runtime.CheckpointWriter | appendCheckpoint: Started checkpoint for a8763670-7e7d-3d68-9a9e-d2df7f778695 at snapshot Token(epoch=3225, sequence=11967155443)
    | INFO  |              Cmpt-chkpter-9000 |   o.c.runtime.CheckpointWriter | appendCheckpoint: completed checkpoint for a8763670-7e7d-3d68-9a9e-d2df7f778695, entries(3457925), cpSize(1805366234) bytes at snapshot Token(epoch=3225, sequence=11967155443) in 1892591 ms

  • From the Edge support bundle, in /edge/edge-client, we see that the Edge has not received ACKs from the MP for a large number of messages (a sketch for computing this gap is shown after this list):
     "system_info": {
            "tx_to_mp_error": 3843,
            "ack_from_mp_error": 0,
            "tx_to_mp_time": "<Timestamp>",
            "tx_to_mp": 172303,
            "ack_from_mp_time": "<Timestamp>",
            "ack_from_mp": 119488 ----> Edge has only received ACKs for 119488 of the 172303 messages it has sent.
        },

     "config_update": {
            "tx_to_mp_error": 3909,
            "ack_from_mp_error": 0,
            "tx_to_mp_time": "<Timestamp>",
            "tx_to_mp": 172412,
            "ack_from_mp_time": "<Timestamp>",
            "ack_from_mp": 119151, ----> Edge has only received ACKs for 119151 of the 172412 messages it has sent.
            "notification": 4
        },

  • This leads to the Edges continuously sending EdgeSystemInfoMsg and EdgeConfigUpdateMsg every 5 seconds:
      /var/log/syslog - NSX 2860 - [nsx@6876 comp="nsx-edge" subcomp="opsagent" s2comp="edge-service" tid="2978" level="INFO"] Successfully sent EdgeSystemInfoMsg
      /var/log/syslog - NSX 2860 - [nsx@6876 comp="nsx-edge" subcomp="opsagent" s2comp="edge-service" tid="2978" level="INFO"] EdgeConfigUpdateMsg sent successfully
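
To confirm that the DOM table is still growing, the 'completed checkpoint' lines for the DOM table UUID can be tracked over time. The following is a minimal parsing sketch, assuming the corfu-compactor-audit.log format shown above; the script itself is illustrative only and is not part of the product.

    #!/usr/bin/env python3
    """Track DOM table growth in corfu-compactor-audit.log (illustrative sketch)."""
    import re
    import sys

    # UUID of the Dirty Object Marker (DOM) table, as noted in this article.
    DOM_TABLE_UUID = "a8763670-7e7d-3d68-9a9e-d2df7f778695"

    # Matches the 'completed checkpoint' lines shown above, capturing the
    # entries(...) count, the cpSize(...) bytes and the checkpoint duration.
    PATTERN = re.compile(
        r"completed checkpoint for " + re.escape(DOM_TABLE_UUID)
        + r", entries\((\d+)\), cpSize\((\d+)\) bytes.*in (\d+) ms"
    )

    def main(path="/var/log/corfu/corfu-compactor-audit.log"):
        previous = None
        with open(path, errors="replace") as log:
            for line in log:
                match = PATTERN.search(line)
                if not match:
                    continue
                entries, cp_size, duration_ms = (int(g) for g in match.groups())
                delta = entries - previous if previous is not None else 0
                print(f"entries={entries:>10}  delta={delta:>9}  "
                      f"cpSize={cp_size} bytes  checkpoint_time={duration_ms} ms")
                previous = entries

    if __name__ == "__main__":
        main(*sys.argv[1:])

A steadily increasing entries() count across checkpoints, combined with checkpoint times of many minutes, matches the symptom described above.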

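The gap between messages the Edge has sent and messages the MP has acknowledged can be computed directly from the statistics above. This is a minimal sketch, assuming the excerpt has been saved as a standalone JSON file; the exact file name under /edge/edge-client is not specified in this article, so the path is passed in as an argument.

    #!/usr/bin/env python3
    """Report tx-vs-ACK gaps from the edge-client statistics (illustrative sketch)."""
    import json
    import sys

    def report_gaps(path):
        # 'path' points at a JSON file containing the "system_info" and
        # "config_update" objects shown in the excerpt above.
        with open(path) as fh:
            stats = json.load(fh)
        for section in ("system_info", "config_update"):
            data = stats.get(section, {})
            sent = data.get("tx_to_mp", 0)
            acked = data.get("ack_from_mp", 0)
            print(f"{section}: sent={sent} acked={acked} unacknowledged={sent - acked}")

    if __name__ == "__main__":
        if len(sys.argv) != 2:
            sys.exit("usage: ack_gap.py <edge-client-stats.json>")
        report_gaps(sys.argv[1])

A large and growing 'unacknowledged' value for both sections indicates the Edge keeps retrying, as described in the Cause section below.
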
Environment

VMware NSX 4.x

Cause

  • In large-scale Edge deployments, a known issue exists where Edge nodes send a high volume of configuration and system update messages to the NSX MP. This often consumes a significant amount of MP memory, leaving the MP unable to respond to these updates.
  • Consequently, each missed acknowledgment from the MP triggers a retry from the Edge nodes, further saturating the MP message queue responsible for handling these updates.
  • The retries create DirtyObjectMarker entries at a high rate; the NSX Manager becomes overwhelmed loading and processing the DOM table in Corfu, which leads to an OOM (Out Of Memory) condition on the CCP (a simplified illustration of this feedback loop is sketched after this list).
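
As an illustration of why this becomes self-reinforcing, the toy model below (not NSX code; all numbers are made up) shows how a fixed-capacity consumer falls further and further behind when every Edge re-sends its updates on a fixed interval until acknowledged:

    #!/usr/bin/env python3
    """Toy model of the retry feedback loop (illustrative only; not NSX internals)."""

    def simulate(edges=500, resend_interval_s=5, mp_capacity_per_s=150, duration_s=300):
        # Each Edge re-sends EdgeSystemInfoMsg and EdgeConfigUpdateMsg every
        # 'resend_interval_s' seconds until acknowledged; the MP acknowledges
        # at most 'mp_capacity_per_s' messages per second.
        backlog = 0   # messages waiting in the MP queue
        markers = 0   # dirty object marker entries created so far
        for t in range(duration_s):
            if t % resend_interval_s == 0:
                new_msgs = edges * 2          # one system-info + one config-update per Edge
                backlog += new_msgs
                markers += new_msgs           # each update creates dirty object markers
            backlog -= min(backlog, mp_capacity_per_s)
            if t % 60 == 0:
                print(f"t={t:>3}s  mp_backlog={backlog:>7}  dom_markers={markers:>8}")

    if __name__ == "__main__":
        simulate()

Because the arrival rate per resend interval exceeds what the MP can acknowledge, the backlog and the marker count never drain, which is the pattern seen in the DOM table growth above.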

Resolution

Workaround:

  • Increasing the form factor of the NSX Managers to Extra Large (XL) gives the NSX Manager enough resources to process the table. This operation will not fix the issue, but it buys time while the upgrade is being scheduled. The steps for resizing an NSX Manager can be found in the technical documentation. If resizing a Manager in a VCF environment, also consult KB 314670.
  • Another workaround involves stopping the OpsAgent (stop service nsx-opsagent) on Standby Edges, which reduces the number of messages being sent towards the NSX Manager. Once the NSX Manager cluster is stable, the OpsAgent can be started again on the Standby Edges (start service nsx-opsagent). An illustrative automation sketch for this step is shown after this list.
  • A script can be provided to identify the Edges which are the largest contributors to the issue, so it is easier to target which Edges to stop the OpsAgent on. If you believe you have encountered this issue and the script would be helpful, please open a support case with Broadcom Support and refer to this KB article.
    For more information, see Creating and managing Broadcom support cases.
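
If the OpsAgent needs to be stopped on many Standby Edges, the two commands above can be run over SSH against each Edge. The sketch below is illustrative only: the Edge list and credential handling are placeholders, SSH access to the Edge CLI as the admin user is assumed, and only the commands documented in this workaround are used.

    #!/usr/bin/env python3
    """Run the documented OpsAgent workaround command on a list of Standby Edges.

    Illustrative sketch only: the host list, credentials and SSH-based approach
    are assumptions, not part of this article. Verify each Edge is Standby
    before stopping the OpsAgent on it.
    """
    import getpass
    import paramiko

    # Hypothetical Standby Edge management IPs (replace with your own).
    STANDBY_EDGES = ["192.0.2.11", "192.0.2.12"]

    # Command taken from the workaround above; switch to
    # "start service nsx-opsagent" once the NSX Manager cluster is stable.
    COMMAND = "stop service nsx-opsagent"

    def run_on_edges(edges, command, username="admin"):
        password = getpass.getpass(f"Password for {username} on the Edges: ")
        for host in edges:
            client = paramiko.SSHClient()
            client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
            try:
                client.connect(host, username=username, password=password, timeout=30)
                _, stdout, stderr = client.exec_command(command)
                print(f"{host}: {stdout.read().decode().strip()} {stderr.read().decode().strip()}")
            finally:
                client.close()

    if __name__ == "__main__":
        run_on_edges(STANDBY_EDGES, COMMAND)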

Additional Information