ESXi host transport nodes randomly go into "UNKNOWN" node status in the NSX UI, with Controller Connectivity, PNIC/Bond Status, Tunnel Status, or all of them showing an unknown state.
When a larger number of Host Transport Nodes is impacted, the "Unknown" node status resolves automatically for some hosts and only after a reboot for others. The behavior is highly intermittent, so the time for which the issue persists cannot be predicted.
To root cause this behavior, debug logs must be collected by following the strict guidelines outlined below:
1. Keep debug logging enabled until the issue is hit, wait roughly 10 minutes, and then capture support bundles from all MP nodes and from the affected Transport Nodes.
2. Debug logging needs to be enabled on all 3 manager nodes.
Step 1: The default logging settings on the NSX MP look like this:
nsx-mngr> get service manager
Fri Jul 14 2023 UTC 08:57:04.389
Service name: manager
Service state: running
Logging level: info
Step 2: Enable debug logging for the proton (manager) service from the admin CLI on all 3 manager nodes.
NOTES:
1. Keep debug logging enabled until the issue is hit, then wait for approximately 10 minutes. In other words, the problem state must persist while package debug logging is enabled.
2. Because of debug logging, nsxapi.log is expected to be very chatty. Ensure that the nsxapi.log file on all manager nodes does not get rotated before debug logging is disabled at the end.
root@nsx-mngr:/var/log/proton# ls -lhrt | grep nsxapi.
-rw-r----- 1 root adm 8.8M Feb 22 2022 nsxapi.1.log.gz
-rw-r----- 1 uproton uproton 161M Jun 21 06:49 nsxapi.10.log
-rw-r----- 1 uproton uproton 161M Jun 21 09:12 nsxapi.9.log
-rw-r----- 1 uproton uproton 161M Jun 21 11:37 nsxapi.8.log
-rw-r----- 1 uproton uproton 161M Jun 21 13:54 nsxapi.7.log
-rw-r----- 1 uproton uproton 161M Jun 21 16:14 nsxapi.6.log
-rw-r----- 1 uproton uproton 161M Jun 21 18:43 nsxapi.5.log
-rw-r----- 1 uproton uproton 161M Jun 21 21:16 nsxapi.4.log
-rw-r----- 1 uproton uproton 161M Jun 21 23:52 nsxapi.3.log
-rw-r----- 1 uproton uproton 161M Jun 22 02:29 nsxapi.2.log
-rw-r----- 1 uproton uproton 161M Jun 22 05:07 nsxapi.1.log
-rw-r----- 1 uproton uproton 78M Jun 22 05:55 nsxapi.log <<< In this example, debug logging is written to this file.
Continuously monitor the growing file size to make sure the file does not get rotated (a scripted alternative is sketched after the sample output below).
root@nsx-mngr:/var/log/proton# ls -lhrt | grep nsxapi.log
-rw-r----- 1 uproton uproton 80M Jun 22 05:56 nsxapi.log
root@nsx-mngr:/var/log/proton# ls -lhrt | grep nsxapi.log
-rw-r----- 1 uproton uproton 89M Jun 22 05:56 nsxapi.log
root@nsx-mngr:/var/log/proton# ls -lhrt | grep nsxapi.log
-rw-r----- 1 uproton uproton 102M Jun 22 05:58 nsxapi.log
root@nsx-mngr:/var/log/proton# ls -lhrt | grep nsxapi.log
-rw-r----- 1 uproton uproton 139M Jun 22 06:03 nsxapi.log
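As a convenience, the manual checks above can be replaced with a periodic listing from the root shell. This is only a sketch, assuming the standard watch utility is present on the manager appliance; if it is not, the plain while loop below (or repeating the ls command by hand) achieves the same thing.
root@nsx-mngr:/var/log/proton# watch -n 60 'ls -lh nsxapi.log'
root@nsx-mngr:/var/log/proton# while true; do ls -lh nsxapi.log; sleep 60; done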
There are 4 package debug logging classes that need to be enabled on all manager nodes:
A. com.vmware.nsx.management.agg.messaging.MessageDispatcherImpl
B. com.vmware.nsx.management.agg.messaging.IndependentDataCollectionManagerImpl
C. com.vmware.nsx.management.agg.messaging.StatusDataRequestMessageGeneratorImpl
D. com.vmware.nsx.management.node.service.TransportNodeStatusServiceImpl
Package debug classes B, C and D are used for debugging on the MP side.
Command Syntax:
nsx-mngr> set service manager logging-level debug package <package_class>
Example:
nsx-mngr> set service manager logging-level debug package com.vmware.nsx.management.agg.messaging.MessageDispatcherImpl
Logging level for package com.vmware.nsx.management.agg.messaging.MessageDispatcherImpl: debug
Output/validation after enabling package debug class A:
nsx-mngr> get service manager
Wed Jun 21 2023 UTC 03:32:24.219
Service name: manager
Service state: running
Logging level: info
Modified package logging levels: debug (com.vmware.nsx.management.agg.messaging.messagedispatcherimpl)
Using the same command syntax as in the example above, enable package debug logging for the remaining classes B, C and D.
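If it is more convenient to script the enablement than to type the four commands interactively, the loop below is a minimal sketch run from the root shell of each manager node. It assumes the nsxcli binary on your NSX version accepts a single command via -c (verify on one node first); if it does not, simply run the four set service manager logging-level commands from the admin CLI as shown above.
for pkg in \
    com.vmware.nsx.management.agg.messaging.MessageDispatcherImpl \
    com.vmware.nsx.management.agg.messaging.IndependentDataCollectionManagerImpl \
    com.vmware.nsx.management.agg.messaging.StatusDataRequestMessageGeneratorImpl \
    com.vmware.nsx.management.node.service.TransportNodeStatusServiceImpl
do
    # enable debug logging for each of the 4 package classes (A-D)
    nsxcli -c "set service manager logging-level debug package $pkg"
done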
Output/validation after enabling all 4 package debug classes:
nsx-mngr> get service manager
Fri Jul 14 2023 UTC 08:46:09.180
Service name: manager
Service state: running
Logging level: info
Modified package logging levels: info (org.corfudb.metricsdata), debug (com.vmware.nsx.management.agg.messaging.messagedispatcherimpl), debug (com.vmware.nsx.management.agg.messaging.independentdatacollectionmanagerimpl), debug (com.vmware.nsx.management.agg.messaging.statusdatarequestmessagegeneratorimpl), debug (com.vmware.nsx.management.node.service.transportnodestatusserviceimpl)
Step 3: Capture support bundles from all manager nodes (include core files and audit logs) and from the affected Transport Nodes.
Step 4: Clear/disable the debug logging for the proton (manager) service on all manager nodes.
nsx-mngr> clear service manager logging-level
nsx-mngr> get service manager
Fri Jul 14 2023 UTC 08:57:04.389
Service name: manager
Service state: running
Logging level: info
3. Log analysis from the collected support bundles.
a) Check for the below logs in /var/log/proton/nsxapi.log.
Affected host transport node X > UUID aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee
TN(aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee) CCP status is marked as EXPIRED.
The timestamp of the below log indicates the exact time when the host TN node status was set to "UNKNOWN" (a grep filter for these entries is sketched after the excerpt).
nsxapi.log:2023-06-22T06:04:04.825Z INFO HeatMap-ConnCheck-Thread
HeatmapMsgServiceImpl 6286 MONITORING [nsx@6876 comp="nsx-manager" level="INFO"
subcomp="manager"] find node aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee , type
TransportNodeStatus , with queue index 1, channel AggServiceCheck, action
EXPIRED
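Based on the line format above, a quick filter to locate these EXPIRED transitions (and their timestamps) in the collected proton logs is shown below; append another grep for the affected TN UUID to narrow the output to a single node.
grep "HeatmapMsgServiceImpl" nsxapi*.log | grep "TransportNodeStatus" | grep "EXPIRED"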
b) For TN X (UUID: aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee), the last CCP status was updated in the DB at 2023-06-22T05:56:21.480Z, as per the below log (a filter to retrieve this entry is sketched after the excerpt).
2023-06-22T05:56:21.480Z DEBUG cvn-hv:TransportNodeStatusVertical-80780
TransportNodeStatusServiceImpl 6194 - [nsx@6876 comp="nsx-manager"
level="DEBUG" subcomp="manager"] Saved Control connection status for
TransportNode CCPNodeConnectionStatus/aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee
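To find when the CCP status of a particular TN was last written to the DB, filter on the same message and on the TN UUID (the UUID below is the placeholder used in this example):
grep "Saved Control connection status" nsxapi*.log | grep "aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee" | tail -5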
c) Apply the below filter in order to identify how many control connection statuses were reported in a particular minute.
grep "Dispatching payload" nsxapi.1.log | grep "ControlConnStatusesMsg" | grep "cvn-hv" | grep "05:59" | wc -l
2023-06-22T05:59:22.502Z DEBUG nsx-rpc:PROTON:user-executor-3 MessageDispatcherImpl 6194 MONITORING [nsx@6876 comp="nsx-manager" level="DEBUG" subcomp="manager"] Dispatching payload (size=22): com.vmware.nsx.management.agg_service.transport_node_status.ControlConnStatusesMsg for source host DeliveryContext{ isNsxRpc: true, amqp_receivedUserId: cvn-hv-aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee, application: TransportNodeStatusVertical, amqp_appId: TransportNodeStatusVertical, client-type: cvn-hv, id: nnnnnnnn-mmmm-oooo-kkkk-hhhhhhhhhhhh, client-id: aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee, timestamp: 1687413562502 }
d) Apply the below filter in order to identify the count of total processed control connection statuses in a particular minute.
grep "cvn-hv:TransportNodeStatusVertical" nsxapi.1.log | grep "Received message for" | grep "ControlConnStatuses" | grep "05:59" | wc -l
2023-06-21T09:05:59.035Z INFO cvn-hv:TransportNodeStatusVertical-77080 TransportNodeStatusMessageConsumer 6194 MONITORING [nsx@6876 comp="nsx-manager" level="INFO" subcomp="manager"] Received message for ControlConnStatuses : {"status_summary": {"rolled_up_status": "UP","num_up": 1,"num_down": 0,"num_degraded": 0},"controllers": [{"controller_ip": {"ipv4": 1684244527},"status": "UP"}]}
e) A difference between the count of control connection statuses reported and the count of control connection statuses actually processed within the same time frame indicates a problem. Ideally both counts are the same, meaning no statuses were dropped.
f) Let's say the difference between the counts from the filters in sub-steps 'c' and 'd' is 6. This means that for 6 TNs the CCP status was dropped silently from the thread pool (queue size 1000, 10 threads, RejectionPolicy.DiscardOldest) that holds the statuses coming from the TNs (ControlConnStatuses, PnicBonds, TunnelsSummary); because the queue was full, some of the requests were dropped. A scripted version of this minute-by-minute comparison is sketched below.
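To avoid running the filters from sub-steps 'c' and 'd' by hand for every minute, the following sketch compares the two counts minute by minute over one hour. The log file name and the hour (05) are illustrative and should be adjusted to the window under investigation; the grep patterns are taken directly from sub-steps 'c' and 'd'.
LOG=nsxapi.1.log
for m in $(seq -w 0 59); do
    # control connection statuses reported by TNs in minute 05:$m
    reported=$(grep "Dispatching payload" "$LOG" | grep "ControlConnStatusesMsg" | grep "cvn-hv" | grep -c "T05:${m}:")
    # control connection statuses actually processed by the MP in the same minute
    processed=$(grep "cvn-hv:TransportNodeStatusVertical" "$LOG" | grep "Received message for" | grep "ControlConnStatuses" | grep -c "T05:${m}:")
    echo "05:${m} reported=${reported} processed=${processed} dropped=$((reported - processed))"
done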
VMware NSX-T 3.x
This issue has been addressed via an architectural change in NSX 3.2.0.
Workaround:
There is no workaround for this issue. It is triggered intermittently in scale environments when Corfu DB updates are occasionally slow, which in turn slows down the worker threads.
Impact/Risks:
The number of TNs reporting this Unknown state can be high per SDDC, and the count adds up when multiple SDDCs are monitored. Monitoring becomes challenging with false-positive alerts (MP/CCP connectivity is actually up) across multiple SDDCs.
There may be many other genuine issues/alerts that need to be addressed in a timely manner, and among the TNs with an UNKNOWN node status there may be some real issues that also need attention. Some may impact the management plane and some the dataplane, for example tunnels down. Separating the false positives from the real issues becomes challenging during any dataplane issue.