NSX-T "Get Cluster Status" command fails

Products

VMware NSX Networking

Issue/Introduction

Symptoms:
This symptom can occur when querying the status of an NSX-T Management cluster that has recently recovered from a period in which the nodes could not communicate properly with one another. Example output is below.

nsxapp-01a> get cluster status

% The get cluster status operation cannot be processed currently, please try again later

While the cluster is in this state, detaching a node from the cluster is also likely to fail.

nsxapp-01a> detach node <UUID>

% An error occurred while detaching the specified node. Reason: [CBM110] Detach cluster node is not allowed because it will cause datastore to lose cluster quorum. Please use 'deactivate cluster' operation to get single node datastore cluster.

Activity framework failures can be seen in nsxapi.log on the NSX Manager.

2020-05-19T06:03:31.450Z ERROR L2TaskExecutor5 L2MessagingServiceImpl$LocalRpcRouter - SWITCHING [nsx@6876 comp="nsx-manager" errorCode="MP100" level="ERROR" subcomp="manager"] L2MsgService.handleMessage: Failed to handle message
com.vmware.nsx.management.container.activityframework.exceptions.ActivityFailedToCompleteException: Activity com.vmware.nsx.management.switching.sync.host.TransportNodeSyncNotifyActivity/4xxxxxx1-9xxx-xxxx-axx8-dxxxxxxxxxx4 failed to complete during the specified time duration 300 sec.
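To gauge how widespread these failures are, the log can be scanned for the exception shown above. The following is a minimal sketch, not a supported tool: the function names are hypothetical, and the pattern is derived only from the single sample line in this article, so other variants of the message may not match.

```python
import re
from collections import Counter

# Pattern built from the sample log line in this article (an assumption:
# other NSX-T versions may word the message differently).
TIMEOUT_PATTERN = re.compile(
    r"ActivityFailedToCompleteException: Activity "
    r"(?P<activity>[\w.$]+)/(?P<activity_id>\S+) "
    r"failed to complete during the specified time duration "
    r"(?P<seconds>\d+) sec"
)


def find_activity_timeouts(lines):
    """Return (activity class, activity id, timeout seconds) for each match."""
    matches = []
    for line in lines:
        found = TIMEOUT_PATTERN.search(line)
        if found:
            matches.append(
                (
                    found.group("activity"),
                    found.group("activity_id"),
                    int(found.group("seconds")),
                )
            )
    return matches


def count_by_activity(lines):
    """Count timed-out activities per activity class."""
    return Counter(activity for activity, _, _ in find_activity_timeouts(lines))
```

Passing an open file handle for a copy of nsxapi.log to `count_by_activity` gives a per-activity tally, which can help confirm that the timeouts match the pattern described in this article.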

Cause

After cluster recovery, the activity framework is burdened with a large number of tasks. The resulting queue prevents tasks from completing within their 5-minute (300-second) allocation. Each failure in turn creates additional tasks to clean up the failed ones, which perpetuates the backlog.

The number of tasks currently in the activity framework's queue can be observed with the following API call:

GET https://<NSX-Manager-IP>/api/v1/operational/activityframework/scheduler/statistics
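As an illustration, the call above could be scripted as follows. This is a sketch under stated assumptions: it uses HTTP basic authentication, the helper names are hypothetical, and it returns the parsed JSON as-is because the response fields are not described in this article.

```python
import base64
import json
import urllib.request

# Endpoint path taken from this article.
STATS_PATH = "/api/v1/operational/activityframework/scheduler/statistics"


def stats_url(manager):
    """Build the statistics URL for one NSX Manager (IP address or FQDN)."""
    return f"https://{manager}{STATS_PATH}"


def build_stats_request(manager, username, password):
    """Prepare a GET request with a basic-auth header (assumed auth method)."""
    token = base64.b64encode(f"{username}:{password}".encode()).decode()
    request = urllib.request.Request(stats_url(manager), method="GET")
    request.add_header("Authorization", f"Basic {token}")
    request.add_header("Accept", "application/json")
    return request


def fetch_stats(manager, username, password, context=None, timeout=30):
    """Send the request and return the decoded JSON body unchanged.

    Pass an ssl.SSLContext as `context` if the manager's certificate is
    not trusted by the default store.
    """
    request = build_stats_request(manager, username, password)
    with urllib.request.urlopen(request, context=context, timeout=timeout) as resp:
        return json.load(resp)
```

Calling `fetch_stats` once per NSX Manager and comparing the results before and after the workaround gives a simple view of whether the queue is draining.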

Resolution

This issue is fixed in NSX-T 3.0, and the fix will also be included in the upcoming 2.5.2 release.

Workaround:
To reduce load on the activity framework and expedite its self-recovery, perform the following steps:

1.) Check the activity framework load on each NSX Manager: GET https://<NSX-Manager-IP>/api/v1/operational/activityframework/scheduler/statistics

2.) Log in to all transport nodes and stop the ops-agent service (/etc/init.d/nsx-opsagent on ESXi hosts, /etc/init.d/nsx-opsagent-appliance on Edge Nodes). Ops-agent is a heartbeat service that is especially taxing on the activity framework prior to NSX-T 3.0.0 and NSX-T 2.5.2.

3.) Reboot the NSX-T Managers one by one.

4.) Check the activity framework load on each NSX Manager once more: GET https://<NSX-Manager-IP>/api/v1/operational/activityframework/scheduler/statistics

The load should now be much lower, allowing cluster activities to complete.

5.) Start ops-agent on all transport nodes once more.
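When many transport nodes are involved, it can help to generate the ops-agent command for each node type rather than typing it by hand. The sketch below uses only the init script paths given in the steps above. The `stop` and `start` arguments follow the usual init-script convention and are an assumption here, as are the helper name and the node-type labels. Verify the command on one node before using it widely.

```python
# Init script paths as listed in this article.
OPSAGENT_SCRIPTS = {
    "esxi": "/etc/init.d/nsx-opsagent",
    "edge": "/etc/init.d/nsx-opsagent-appliance",
}

# Assumed actions, based on common init-script conventions.
ALLOWED_ACTIONS = ("stop", "start")


def opsagent_command(node_type, action):
    """Return the shell command to stop or start ops-agent on a node."""
    if action not in ALLOWED_ACTIONS:
        raise ValueError(f"unsupported action: {action!r}")
    try:
        script = OPSAGENT_SCRIPTS[node_type]
    except KeyError:
        raise ValueError(f"unknown node type: {node_type!r}") from None
    return f"{script} {action}"


if __name__ == "__main__":
    # Print the commands for the stop step, then the start step.
    for action in ALLOWED_ACTIONS:
        for node_type in OPSAGENT_SCRIPTS:
            print(f"{node_type}: {opsagent_command(node_type, action)}")
```

The function only builds command strings. Running them on each node (for example over SSH) is left to whatever tooling is already in place.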