Edge node is reporting as UNKNOWN state in the NSX Manager

Article ID: 369296

Updated On:

Products

VMware NSX

Issue/Introduction

  • The edge node reports as UNKNOWN state in the NSX Manager.
  • The issue is seen in NSX-T version 3.1.2.0.
  • Running the API call GET https://<NSX-T_manager-IP>/api/v1/transport-nodes/<tn-id>/status reports the edge node as UNKNOWN.
  • Alternatively, you can run the following command from the NSX Manager via root login (a sketch for filtering the response follows the log excerpt below):
    curl -X GET -k -u admin:'############'  "https://nsx-manager-ip/api/v1/transport-nodes/<transport-node-id>/status"
  • The upgrade pre-check stage fails because the edge node is in the UNKNOWN state.
  • To find the orchestrator node, log in to any NSX Manager as admin and run the command get service install-upgrade; note the IP address of the orchestrator node.
  • On the NSX Manager orchestrator node, /var/log/upgrade-coordinator/upgrade-coordinator.log shows the following error:
2024-05-28T03:37:55.878Z WARN http-nio-127.0.0.1-7442-exec-5 EdgeUuUtilsServiceImpl 18005 SYSTEM [nsx@6876 comp="nsx-manager" level="WARNING" subcomp="upgrade-coordinator"] Detect issues with Edge upgrade unit fabricId eba35663-####-#####-####-0ee94b87aeba TransportNodeId eba35663-####-#####-####-0ee94b87aeba: [Pnic status of the edge transport node eba35663-####-#####-####-0ee94b87aeba is UNKNOWN., Overall status of the edge transport node eba35663-####-#####-####-0ee94b87aeba is UNKNOWN., Tunnel status of the edge transport node eba35663-####-#####-####-0ee94b87aeba is UNKNOWN.]
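
To confirm what the API reports, the status response can be filtered on the command line. A minimal sketch, assuming jq is available where you run it and using placeholder values for the manager IP and transport node ID; the pnic_status and tunnel_status field names follow the TransportNodeStatus API schema and may differ between NSX-T versions:

NSX_MGR="nsx-manager-ip"            # placeholder: NSX Manager IP
TN_ID="transport-node-id"           # placeholder: edge transport node ID
curl -s -k -u admin:'############' "https://${NSX_MGR}/api/v1/transport-nodes/${TN_ID}/status" \
  | jq '{overall: .status, pnic: .pnic_status, tunnel: .tunnel_status.status}'
# On the affected edge, these fields report UNKNOWN instead of UP.

# On the orchestrator node, the upgrade-coordinator log shows the same UNKNOWN statuses:
grep "is UNKNOWN" /var/log/upgrade-coordinator/upgrade-coordinator.log | tail -n 5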
  • /var/log/proton/nsxapi.log on the NSX Manager node reports a ConcurrentUpdateException:
2024-05-28T05:59:08.095Z WARN gsr-summation-cache-committer-1 ObjectsView 32276 TXEnd[TX[7e74]] Aborted Exception org.corfudb.runtime.exceptions.TransactionAbortedException: TX ABORT | Snapshot Time = Token(epoch=16, sequence=1162255862) | Failed Transaction ID = c59dc1b6-####-#####-####-b10cb567e74 | Offending Address = 1162255877 | Conflict Key = A7F0C534358D5655 | Conflict Stream = 8d4e54c2-####-#####-####-c04be92c4c52 | Cause = CONFLICT | Time = 579 ms
2024-05-28T05:59:08.096Z WARN gsr-summation-cache-committer-1 CorfuDbTransactionManager 32276 - [nsx@6876 comp="nsx-manager" level="WARNING" subcomp="manager"] Received TransactionAbortedException from the Corfu client.
2024-05-28T05:59:08.107Z WARN gsr-summation-cache-committer-1 CorfuDbTransactionManager 32276 - [nsx@6876 comp="nsx-manager" level="WARNING" subcomp="manager"] com.vmware.nsx.management.container.exceptions.ConcurrentUpdateException: STREAM_ID = 8d4e54c2-####-#####-####-c04be92c4c52 | CONFLICT_VALUE = GenericStatsRecords [prefix=null, counterValues=[315042970, 108862342068, 8477336, 8406909], gcClassName=null, gcMethodName=null, createdTime=1704733060090, lastUpdateTime=1716875753032] | CONFLICT_KEY_HASH = -6345355046937602475 | CONFLICT_KEY = SummationGenericStatsRecords2/DFWRuleStats?0?D?1135 | MAP_NAME = nsx-manager SummationGenericStatsRecords2 8b7e | TRANSACTION_ID = c59dc1b6-####-#####-####-b10cb567e74 | OFFENDING_ADDRESS = 1162255877
2024-05-28T05:59:08.107Z ERROR gsr-summation-cache-committer-1 TransactionHelper 32276 - [nsx@6876 comp="nsx-manager" errorCode="MP6408" level="ERROR" subcomp="manager"] Commit failed
2024-05-28T05:59:08.224Z ERROR aggprocessor-wait-for-collection-timer AggregationProcessorImpl 32276 MONITORING [nsx@6876 comp="nsx-manager" errorCode="MP6405" level="ERROR" subcomp="manager"] gsrSummationCache gsr-summation-cache commit result is not normal, result = CommitResult [dataCommitErrors=123, workerCommitErrors=0]
2024-05-28T05:59:08.224Z ERROR aggprocessor-wait-for-collection-timer AggregationProcessorImpl 32276 MONITORING [nsx@6876 comp="nsx-manager" errorCode="MP6401" level="ERROR" subcomp="manager"] Summation statistics cache commit encountered error
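
A quick way to confirm this signature on a given manager node is to search the proton log directly (run as root; on a busy system older entries may have rotated into compressed nsxapi.log files):

grep -c "ConcurrentUpdateException" /var/log/proton/nsxapi.log              # number of aborted updates
grep "TransactionAbortedException" /var/log/proton/nsxapi.log | tail -n 5   # most recent aborts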
  • /var/log/proton/nsxapi.log reports that the transport node status was updated to UNKNOWN due to a timeout on the node:
2024-05-28T06:00:01.667Z INFO http-nio-127.0.0.1-7440-exec-46 TransportNodeLcmFacadeImpl 32276 FABRIC [nsx@6876 comp="nsx-manager" level="INFO" reqId="2920769e-####-#####-####-ac02d1086a22" subcomp="manager" username="nsx_policy"] TransportNodeFacade : getTransportNode(..) for id [eba35663-####-#####-####0ee94b87aeba]
2024-05-28T06:00:01.702Z INFO http-nio-127.0.0.1-7440-exec-33 HeatMapServiceImpl 32276 MONITORING [nsx@6876 comp="nsx-manager" level="INFO" reqId="180dc84b-####-#####-####-41e8d3fe5629" subcomp="manager" username="nsx_policy"] Updated Tunnel connection status for TransportNode eba35663-####-#####-####0ee94b87aeba
2024-05-28T06:00:01.719Z INFO http-nio-127.0.0.1-7440-exec-33 EdgeNodeInstallInfo 32276 FABRIC [nsx@6876 comp="nsx-manager" level="INFO" reqId="180dc84b-####-#####-####-41e8d3fe5629" subcomp="manager" username="nsx_policy"] Node EdgeNodeInstallInfo/eba35663-####-#####-####-0ee94b87aeba State: NODE_READY TN Config State: TRANSPORT_NODE_SYNC_PENDING
2024-05-28T06:00:01.719Z INFO http-nio-127.0.0.1-7440-exec-33 EdgeNodeInstallInfo 32276 FABRIC [nsx@6876 comp="nsx-manager" level="INFO" reqId="180dc84b-####-#####-####-41e8d3fe5629" subcomp="manager" username="nsx_policy"] Node EdgeNodeInstallInfo/eba35663-####-#####-####-0ee94b87aeba State: NODE_READY TN Config State: TRANSPORT_NODE_SYNC_PENDING
2024-05-28T06:01:04.734Z INFO HeatMap-ConnCheck-Thread HeatmapConnCheckService 32276 MONITORING [nsx@6876 comp="nsx-manager" level="INFO" subcomp="manager"] node eba35663-####-#####-####-0ee94b87aeba ccp update timeout, time stamp: current 1716876064733, ccp 1716874460829, interval 360000 in milliseconds
2024-05-28T06:01:04.763Z INFO HeatMap-ConnCheck-Thread HeatmapConnCheckService 32276 MONITORING [nsx@6876 comp="nsx-manager" level="INFO" subcomp="manager"] update node status to unknown due to timeout for node eba35663-####-#####-####0ee94b87aeba
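
The same log can be checked for the heatmap timeout that marks the node UNKNOWN; for example, as root on the manager (message text as it appears in the excerpt above):

grep "ccp update timeout" /var/log/proton/nsxapi.log
grep "update node status to unknown due to timeout" /var/log/proton/nsxapi.log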

Environment

NSX-T version 3.1.2.0

Cause

The update action was aborted due to a ConcurrentUpdateException, which was converted to a TransactionAbortedException.

Resolution

This is a known issue that the engineering team is aware of. It is resolved in NSX 3.2.x and later releases.

As a workaround, restart the proton service on the manager nodes, one node at a time.

/etc/init.d/proton status    # check the status of the proton service
/etc/init.d/proton restart   # restart the proton service

Note: Make sure the proton service is up and the cluster is stable before restarting the service on the second and third nodes.
To verify NSX cluster stability, log in to any NSX Manager as admin and run the command get cluster status | find Status, as shown below.
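
For example, a verification pass between restarts might look like the following; the exact output varies by version, but every status line should report STABLE before you restart proton on the next node:

nsx-manager> get cluster status | find Status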