Stale Entries in the Transport Node and Message Client Table

Article ID: 316655

Products

VMware NSX

Issue/Introduction

Symptoms:

Host and Edge Transport Nodes cannot connect to the NSX-T Managers on port 1234 (NSX Proxy).

In the Edge Nodes' syslog (/var/log/syslog), you can observe log lines such as:


 

2021-06-29T03:53:16.258Z nsxedg1.vmware.local NSX 1874 - [nsx@6876 comp="nsx-edge" subcomp="nsx-proxy" s2comp="nsx-net" tid="1875" level="WARNING"] StreamConnection[21154 Connecting to ssl://10.10.10.10:1234 sid:21154] Couldn't connect to 'ssl://10.10.10.10:1234' (error: 335544539-short read)
2021-06-29T03:53:16.258Z nsxedg1.vmware.local NSX 1874 - [nsx@6876 comp="nsx-edge" subcomp="nsx-proxy" s2comp="nsx-net" tid="1875" level="WARNING"] StreamConnection[21154 Error to ssl://10.65.10.132:1234 sid:-1] Error 335544539-short read


** "335544539-short read" means that the TCP connection fails after the SSL handshake. This points to a certificate problem, which is a generic symptom and not specific to this issue.

Additionally, in the Proton log (/var/log/proton/nsxapi.log), you also see the following log lines:

2021-06-29T01:06:32.472Z INFO nsx-rpc:PROTON_TNCONNCFGSVC_CLIENT:user-executor-0 NodeAphCallObserver - - [nsx@6876 comp="nsx-manager" level="INFO" subcomp="manager"] exception in next() com.vmware.nsx.management.common.exceptions.ObjectNotFoundException: The requested object : Node/ce2df50a-d042-11ea-8576-0025b5d4fc01 could not be found. Object identifiers are case sensitive.
 at com.vmware.nsx.management.container.dao.AbstractDao.findByIdentifier(AbstractDao.java:150)
 at com.vmware.nsx.management.container.dao.IdentifiableObjectDaoDelegate.findByIdentifier(IdentifiableObjectDaoDelegate.java:205)
 at com.vmware.nsx.management.fabricnode.service.NodeServiceImpl.findById(NodeServiceImpl.java:51)
 at com.vmware.nsx.management.fabricnode.service.NodeServiceImpl.getNode(NodeServiceImpl.java:46)
 at com.vmware.nsx.management.fabricnode.aph.NodeAphRealizer.getTnConnCfgFromDb(NodeAphRealizer.java:243)
 at com.vmware.nsx.management.fabricnode.aph.NodeAphCallObserver.next(NodeAphCallObserver.java:58)
 at com.vmware.nsx.management.fabricnode.aph.NodeAphCallObserver.next(NodeAphCallObserver.java:30)
 at com.vmware.nsx.rpc.call.NsxRpcCall$ActiveCallStateBase.invokeNext(NsxRpcCall.java:266)
 at com.vmware.nsx.rpc.call.NsxRpcCall$LocalDoneCallState.doReceive(NsxRpcCall.java:524)
 at com.vmware.nsx.rpc.call.NsxRpcCall.doReceive(NsxRpcCall.java:999)
 at com.vmware.nsx.rpc.channel.NsxRpcChannel.doReceiveEstablishedCall(NsxRpcChannel.java:687)
 at com.vmware.nsx.rpc.channel.NsxRpcChannel.doReceive(NsxRpcChannel.java:624)
 at com.vmware.nsx.rpc.channel.task.ChannelReceiveTask.doRun(ChannelReceiveTask.java:21)
 at com.vmware.nsx.rpc.channel.task.ChannelTask.run(ChannelTask.java:45)
 at com.vmware.nsx.rpc.channel.NsxRpcChannel.processOperations(NsxRpcChannel.java:835)
 at com.vmware.nsx.rpc.core.Scheduler.process(Scheduler.java:112)
 at com.vmware.nsx.rpc.core.Scheduler.run(Scheduler.java:90)
 at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
 at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
 at java.lang.Thread.run(Thread.java:748)

** ObjectNotFoundException means that there is a stale object. In this case, it is an Edge Node whose UUID is ce2df50a-d042-11ea-8576-0025b5d4fc01.
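When several such exceptions appear, it can help to collect the distinct node UUIDs they mention. The following is a minimal sketch, assuming the log text has already been read into a string and that the exception message has the same wording as the line shown above. The function name `find_stale_node_ids` is hypothetical and is not part of any NSX tooling.

```python
import re

# Matches the ObjectNotFoundException message shown above and captures the node UUID.
# Assumption: the message wording is identical to the sample log line.
STALE_NODE_RE = re.compile(
    r"ObjectNotFoundException: The requested object : Node/"
    r"([0-9a-fA-F]{8}(?:-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}) could not be found"
)


def find_stale_node_ids(log_text):
    """Return the distinct node UUIDs reported as not found, in first-seen order."""
    seen = []
    for match in STALE_NODE_RE.finditer(log_text):
        node_id = match.group(1)
        if node_id not in seen:
            seen.append(node_id)
    return seen


# Sample taken from the log line above (stack trace omitted).
sample = (
    "exception in next() "
    "com.vmware.nsx.management.common.exceptions.ObjectNotFoundException: "
    "The requested object : Node/ce2df50a-d042-11ea-8576-0025b5d4fc01 "
    "could not be found. Object identifiers are case sensitive."
)
print(find_stale_node_ids(sample))  # ['ce2df50a-d042-11ea-8576-0025b5d4fc01']
```

The resulting list is only a set of candidates to mention in the support request. It does not by itself prove that an entry is stale.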


Both log signatures must be present to confirm this issue. The first log line alone is not indicative of this specific issue.
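The two-signature rule can be sketched as a small check over the two log files. This is an illustrative sketch only: the regular expressions are assumptions derived from the sample lines above, the function name `matches_issue` is hypothetical, and the caller is expected to supply the contents of /var/log/syslog from the Edge Node and /var/log/proton/nsxapi.log from the Manager.

```python
import re

# Signature 1 (Edge syslog): nsx-proxy fails to reach a Manager on port 1234.
SHORT_READ_RE = re.compile(r'subcomp="nsx-proxy".*:1234.*335544539-short read')

# Signature 2 (nsxapi.log): the Manager cannot find the node object.
NOT_FOUND_RE = re.compile(
    r"NodeAphCallObserver.*ObjectNotFoundException.*"
    r"Node/[0-9a-f-]{36} could not be found"
)


def matches_issue(edge_syslog_text, nsxapi_log_text):
    """Return True only if both signatures are present, one in each log."""
    has_short_read = any(
        SHORT_READ_RE.search(line) for line in edge_syslog_text.splitlines()
    )
    has_not_found = any(
        NOT_FOUND_RE.search(line) for line in nsxapi_log_text.splitlines()
    )
    return has_short_read and has_not_found


# Sample lines copied from the log excerpts above.
edge_line = (
    '2021-06-29T03:53:16.258Z nsxedg1.vmware.local NSX 1874 - [nsx@6876 comp="nsx-edge" '
    'subcomp="nsx-proxy" s2comp="nsx-net" tid="1875" level="WARNING"] '
    "StreamConnection[21154 Connecting to ssl://10.10.10.10:1234 sid:21154] "
    "Couldn't connect to 'ssl://10.10.10.10:1234' (error: 335544539-short read)"
)
proton_line = (
    "2021-06-29T01:06:32.472Z INFO nsx-rpc:PROTON_TNCONNCFGSVC_CLIENT:user-executor-0 "
    'NodeAphCallObserver - - [nsx@6876 comp="nsx-manager" level="INFO" subcomp="manager"] '
    "exception in next() "
    "com.vmware.nsx.management.common.exceptions.ObjectNotFoundException: "
    "The requested object : Node/ce2df50a-d042-11ea-8576-0025b5d4fc01 could not be found."
)

print(matches_issue(edge_line, proton_line))  # True
print(matches_issue(edge_line, ""))           # False: short read alone is generic
```

The check is deliberately strict: a short-read error without the matching ObjectNotFoundException returns False, because the short read alone is a generic certificate symptom.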

 


Environment

VMware NSX-T Data Center

Cause

When an Edge Transport Node is deleted, its corresponding entries in all Management Plane database tables should be deleted as well. In rare situations, stale entries remain in the database tables for Transport Nodes, Messaging Clients (a table that Transport Nodes depend on), or both. These stale entries generate messages that the Management Plane cannot service. The unserviceable messages block the message queue and prevent the Management Plane from servicing the other messages in it. As a result, the Management Plane cannot send messages to any Transport Nodes, including the certificates used for authentication between Transport Nodes and the Management Plane. Without the updated certificates from the Management Plane, hosts cannot form a connection to it.
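The blocking behavior described above resembles head-of-line blocking in a FIFO queue. The following toy model is only an illustration of that general pattern and makes no claim about how the Management Plane is actually implemented. All names (`drain`, `known_nodes`, `skip_unserviceable`, the node labels) are hypothetical, and the skip option is included only for contrast, not as a description of the fix.

```python
from collections import deque


def drain(messages, known_nodes, skip_unserviceable=False):
    """Process (node_id, payload) messages in FIFO order.

    Returns the messages that were delivered. A message is serviceable
    only if its node_id is in known_nodes.
    """
    delivered = []
    pending = deque(messages)
    while pending:
        node_id, _payload = pending[0]
        if node_id not in known_nodes:
            if not skip_unserviceable:
                # The head of the queue can never be serviced,
                # so every message behind it waits indefinitely.
                break
            pending.popleft()  # drop the unserviceable message and move on
            continue
        delivered.append(pending.popleft())
    return delivered


# Toy data: the second message targets a node that no longer exists.
messages = [("edge-1", "cert"), ("stale-edge", "config"), ("host-1", "cert")]
known_nodes = {"edge-1", "host-1"}

print(drain(messages, known_nodes))
# [('edge-1', 'cert')]  host-1 never receives its certificate

print(drain(messages, known_nodes, skip_unserviceable=True))
# [('edge-1', 'cert'), ('host-1', 'cert')]
```

In the blocked case, a healthy node (`host-1`) is affected even though only one unrelated entry is stale, which mirrors why all Transport Nodes lose connectivity in this issue.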
 
The root cause of the stale entries cannot be determined because the relevant log has been overwritten.

Resolution

This issue no longer occurs in NSX-T 3.1.1 and later.
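To check a deployed version against the fixed release, a simple comparison of dotted version numbers is enough. This is a minimal sketch, assuming the version is available as a dotted string whose first three numeric components are major, minor, and patch. The helper names are hypothetical, and how the version string is obtained is outside the scope of this sketch.

```python
FIXED_IN = (3, 1, 1)  # first release that includes the fix, per this article


def parse_version(version):
    """Return the first three numeric components of a dotted version string.

    Missing components are treated as 0, so "3.2" becomes (3, 2, 0).
    Any further components (assumed to be build details) are ignored.
    """
    parts = [int(p) for p in version.strip().split(".")[:3]]
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def has_fix(version):
    """Return True if the version is 3.1.1 or later."""
    return parse_version(version) >= FIXED_IN


print(has_fix("3.1.1"))  # True
print(has_fix("3.0.2"))  # False
print(has_fix("3.2"))    # True
```

Tuples are compared element by element, which avoids the pitfall of comparing version strings lexically (for example, "3.10.0" sorting before "3.2.0").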

Workaround:
If you encounter this issue, please contact VMware GSS via an SR and mention this KB article.

Additional Information

Impact/Risks:
Customers are not able to make any configuration changes on Host Transport Nodes and Edge Transport Nodes if this issue is present.