NSX-T "get cluster status" command fails

Products

VMware NSX

Issue/Introduction

This symptom can occur when querying the status of an NSX-T management cluster that has recently recovered from a period during which the nodes could not communicate properly with one another. Example output:

nsxapp-01a> get cluster status

% The get cluster status operation cannot be processed currently, please try again later

While the cluster is in this state, detaching a node from the cluster is also likely to fail:

nsxapp-01a> detach node <UUID>

% An error occurred while detaching the specified node. Reason: [CBM110] Detach cluster node is not allowed because it will cause datastore to lose cluster quorum. Please use 'deactivate cluster' operation to get single node datastore cluster.

Activity framework failures are logged in nsxapi.log on the NSX Manager:

2020-05-19T06:03:31.450Z ERROR L2TaskExecutor5 L2MessagingServiceImpl$LocalRpcRouter - SWITCHING [nsx@6876 comp="nsx-manager" errorCode="MP100" level="ERROR" subcomp="manager"] L2MsgService.handleMessage: Failed to handle message
com.vmware.nsx.management.container.activityframework.exceptions.ActivityFailedToCompleteException: Activity com.vmware.nsx.management.switching.sync.host.TransportNodeSyncNotifyActivity/4xxxxxx1-9xxx-xxxx-axx8-dxxxxxxxxxx4 failed to complete during the specified time duration 300 sec.
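To gauge how widespread the timeouts are, the exception lines can be counted per activity class. The following is a minimal sketch, assuming the log lines follow the format of the sample above; the function name and the regular expression are illustrative and are not part of NSX-T.

```python
import re
import sys
from collections import Counter

# Pattern derived from the sample ActivityFailedToCompleteException line above.
# Assumption: other timeout entries in nsxapi.log follow the same wording.
TIMEOUT_RE = re.compile(
    r"Activity (?P<activity>[\w.$]+)/(?P<id>\S+) failed to complete "
    r"during the specified time duration (?P<seconds>\d+) sec"
)


def count_activity_timeouts(lines):
    """Return a Counter of timed-out activities keyed by activity class name."""
    counts = Counter()
    for line in lines:
        match = TIMEOUT_RE.search(line)
        if match:
            counts[match.group("activity")] += 1
    return counts


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python count_timeouts.py <path-to-nsxapi.log>
    with open(sys.argv[1], errors="replace") as log:
        for activity, total in count_activity_timeouts(log).most_common():
            print(f"{total:6d}  {activity}")
```

A steadily growing count for the same activity class suggests that the framework is still backlogged.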

Environment

VMware NSX-T Data Center

Cause

After cluster recovery, the activity framework is burdened with a large number of tasks. The resulting backlog leaves queued tasks unable to complete within their 5-minute (300-second) allocation. Each failed task in turn creates additional cleanup tasks, which perpetuates the cycle.

The number of tasks currently in the activity framework's queue can be observed with the following API call:

GET https://<NSX-Manager-IP>/api/v1/operational/activityframework/scheduler/statistics

Alternatively, log in to the NSX Manager as the root user and run the equivalent curl command:

curl -k -u '<username>:<password>' -H 'Content-Type: application/json' -X GET "https://<NSX-Manager-IP>/api/v1/operational/activityframework/scheduler/statistics"
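The same query can be scripted from a workstation. This is a minimal sketch using only the Python standard library, assuming basic authentication as in the curl command; the function names are illustrative. The response schema is not described in this article, so the sketch prints the JSON as returned.

```python
import base64
import json
import ssl
import sys
import urllib.request

STATS_PATH = "/api/v1/operational/activityframework/scheduler/statistics"


def build_request(manager, username, password):
    """Build the GET request with a basic-auth header (like curl -u)."""
    token = base64.b64encode(f"{username}:{password}".encode()).decode()
    return urllib.request.Request(
        f"https://{manager}{STATS_PATH}",
        headers={"Authorization": f"Basic {token}", "Accept": "application/json"},
        method="GET",
    )


def fetch_statistics(manager, username, password, timeout=30):
    """Return the scheduler statistics as parsed JSON."""
    # Equivalent of curl -k: certificate verification is disabled.
    # Assumption: the manager uses a self-signed certificate; remove these
    # two lines if the certificate is trusted.
    context = ssl.create_default_context()
    context.check_hostname = False
    context.verify_mode = ssl.CERT_NONE
    request = build_request(manager, username, password)
    with urllib.request.urlopen(request, timeout=timeout, context=context) as response:
        return json.load(response)


if __name__ == "__main__" and len(sys.argv) == 4:
    # Usage: python get_stats.py <NSX-Manager-IP> <username> <password>
    print(json.dumps(fetch_statistics(*sys.argv[1:4]), indent=2))
```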

Resolution

This issue is fixed in NSX-T 3.0, and the fix will also be included in the upcoming 2.5.2 release.

Workaround:
To reduce the load on the activity framework and speed up its self-recovery, perform the following steps:

1.) Check the activity framework load on each NSX Manager:

GET https://<NSX-Manager-IP>/api/v1/operational/activityframework/scheduler/statistics

Alternatively, log in to the NSX Manager as the root user and run the equivalent curl command:

curl -k -u '<username>:<password>' -H 'Content-Type: application/json' -X GET "https://<NSX-Manager-IP>/api/v1/operational/activityframework/scheduler/statistics"

2.) Log in to all transport nodes as the root user and stop the ops-agent service (/etc/init.d/nsx-opsagent on ESXi hosts, /etc/init.d/nsx-opsagent-appliance on Edge Nodes). Ops-agent is a heartbeat service that is especially taxing on the activity framework in releases prior to NSX-T 3.0.0 and NSX-T 2.5.2.

3.) Reboot the NSX-T Managers one at a time.

4.) Check the activity framework load on each NSX Manager once more:

GET https://<NSX-Manager-IP>/api/v1/operational/activityframework/scheduler/statistics

Alternatively, log in to the NSX Manager as the root user and run the equivalent curl command:

curl -k -u '<username>:<password>' -H 'Content-Type: application/json' -X GET "https://<NSX-Manager-IP>/api/v1/operational/activityframework/scheduler/statistics"

The load should now be much lower, allowing cluster activities to complete.

5.) Start ops-agent on all Transport Nodes once more.
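To confirm that the load dropped between steps 1 and 4, the two statistics responses can be saved to files and compared. The sketch below is a hedged illustration: the response schema is not documented in this article, so it does not assume any field names and simply reports every numeric value that changed. The function names are hypothetical.

```python
import json
import sys


def numeric_leaves(data, prefix=""):
    """Flatten nested JSON into {dotted.path: number} for numeric values only."""
    leaves = {}
    if isinstance(data, dict):
        for key, value in data.items():
            leaves.update(numeric_leaves(value, f"{prefix}{key}."))
    elif isinstance(data, list):
        for index, value in enumerate(data):
            leaves.update(numeric_leaves(value, f"{prefix}{index}."))
    elif isinstance(data, (int, float)) and not isinstance(data, bool):
        leaves[prefix.rstrip(".")] = data
    return leaves


def compare_snapshots(before, after):
    """Return {path: (before, after)} for numeric values that differ."""
    old, new = numeric_leaves(before), numeric_leaves(after)
    return {
        path: (old[path], new[path])
        for path in sorted(old.keys() & new.keys())
        if old[path] != new[path]
    }


if __name__ == "__main__" and len(sys.argv) == 3:
    # Usage: python compare_stats.py <before.json> <after.json>
    with open(sys.argv[1]) as first, open(sys.argv[2]) as second:
        changes = compare_snapshots(json.load(first), json.load(second))
    for path, (was, now) in changes.items():
        print(f"{path}: {was} -> {now}")
```

Run it once per NSX Manager, since each manager reports its own statistics.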