Management connection status is down for transport node even if TCP connection state is established


Article ID: 372464


Products

VMware NSX Networking

Issue/Introduction

  • The transport node (TN) connection to the management plane is reported as down in the NSX-T Manager UI and via the API, even though the TCP connection from the transport node to the NSX Manager node on port 1234 is established.
  • To validate from the UI:

NSX-T UI -> System -> Nodes -> Edge transport node -> Overview -> Manager Connectivity shows "DOWN"

  • To validate using the API:
    GET https://<nsx-mgr>/api/v1/transport-zones/transport-node-status

The response contains "mgmt_connection_status": "DOWN".
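As a sketch of how the API check could be automated, the Python snippet below filters an already-fetched response for nodes that report a DOWN management connection. Only `mgmt_connection_status` comes from this article; the `results`, `node_uuid`, and `node_display_name` field names and the sample payload are assumptions for illustration and should be verified against the actual response.

```python
import json


def nodes_with_mgmt_down(payload):
    """Return (display name, uuid) pairs for nodes whose management connection is DOWN."""
    down = []
    # Assumption: node entries are listed under a top-level "results" key.
    for entry in payload.get("results", []):
        if entry.get("mgmt_connection_status") == "DOWN":
            # Assumption: "node_display_name" and "node_uuid" identify the node.
            down.append((entry.get("node_display_name"), entry.get("node_uuid")))
    return down


# Hypothetical payload, used only to exercise the filter.
sample = json.loads("""
{
  "results": [
    {"node_uuid": "uuid-1", "node_display_name": "edge-01", "mgmt_connection_status": "DOWN"},
    {"node_uuid": "uuid-2", "node_display_name": "esx-01", "mgmt_connection_status": "UP"}
  ]
}
""")

print(nodes_with_mgmt_down(sample))  # prints [('edge-01', 'uuid-1')]
```

The function works on a parsed JSON document, so it can be fed the saved output of whichever API client is already in use.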

  • In the syslog of the edge TN (/var/log/syslog) or the host TN (/var/log/nsx-syslog.log), the node repeatedly attempts Discovery with one manager and remains in an inconsistent state.

In the edge TN* example below, the connection to the APH* is reported as both connected and not connected at the same time.

Here the connection state is reported as NOT_CONNECTED:
2024-06-18T10:13:46.399Z  NSX 4945 - [nsx@6876 comp="nsx-edge" subcomp="nsx-proxy" s2comp="mpa-proxy-lib" tid="4965" level="INFO"] ForwardingEngine::ReconcileConnections adding ssl://10.4.xx.xx:1234 uuid <uuid> -- existing connection state is NOT_CONNECTED

Here AphConnectionManager reports the same endpoint as already connected:
2024-06-18T10:13:46.399Z  NSX 4945 - [nsx@6876 comp="nsx-edge" subcomp="nsx-proxy" s2comp="mpa-proxy-lib" tid="4965" level="ERROR" invalid="true"] AphConnectionManager: Already connected to endpoint ssl://10.4.xx.xx:1234 uuid <uuid>
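The contradictory pair above can be searched for in a collected syslog. The following is a minimal sketch, assuming the two message formats shown in this article are stable; the regular expressions and the function name are illustrative, and the check ignores timestamps, so it only flags endpoints that appear in both kinds of message somewhere in the input.

```python
import re

# Patterns derived from the two log lines quoted in this article (assumed stable).
NOT_CONNECTED = re.compile(
    r"ReconcileConnections adding (\S+) .*existing connection state is NOT_CONNECTED"
)
ALREADY_CONNECTED = re.compile(
    r"AphConnectionManager: Already connected to endpoint (\S+)"
)


def contradictory_endpoints(lines):
    """Return endpoints reported as both NOT_CONNECTED and already connected."""
    not_connected, already = set(), set()
    for line in lines:
        match = NOT_CONNECTED.search(line)
        if match:
            not_connected.add(match.group(1))
        match = ALREADY_CONNECTED.search(line)
        if match:
            already.add(match.group(1))
    return sorted(not_connected & already)


sample_lines = r"""
2024-06-18T10:13:46.399Z  NSX 4945 - [nsx@6876 comp="nsx-edge" subcomp="nsx-proxy" s2comp="mpa-proxy-lib" tid="4965" level="INFO"] ForwardingEngine::ReconcileConnections adding ssl://10.4.xx.xx:1234 uuid <uuid> -- existing connection state is NOT_CONNECTED
2024-06-18T10:13:46.399Z  NSX 4945 - [nsx@6876 comp="nsx-edge" subcomp="nsx-proxy" s2comp="mpa-proxy-lib" tid="4965" level="ERROR" invalid="true"] AphConnectionManager: Already connected to endpoint ssl://10.4.xx.xx:1234 uuid <uuid>
""".strip().splitlines()

print(contradictory_endpoints(sample_lines))  # prints ['ssl://10.4.xx.xx:1234']
```

To scan a real file, pass an open file handle for /var/log/syslog (edge) or /var/log/nsx-syslog.log (host) instead of `sample_lines`.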

  • A race condition can be observed in which threads 2855675 and 2855687 both run ProcessConfig: one thread is applying a config update while the other is performing discovery.

2024-06-18T15:07:11.371Z nsx-proxy[2855675]: NSX 2855675 - [nsx@6876 comp="nsx-esx" subcomp="nsx-proxy" s2comp="mpa-proxy-lib" tid="2855675" level="INFO"] DiscoveryManager: Received following call status from endpoint ssl://10.4.xx.xx:1234: SUCCESS
2024-06-18T15:07:11.374Z nsx-proxy[2855675]: NSX 2855675 - [nsx@6876 comp="nsx-esx" subcomp="nsx-proxy" s2comp="mpa-proxy-lib" tid="2855675" level="INFO"] Adding member aph ssl://10.4.xx.xx:1234 - 10.4.xx.xx:1234
2024-06-18T15:07:11.374Z nsx-proxy[2855675]: NSX 2855675 - [nsx@6876 comp="nsx-esx" subcomp="nsx-proxy" s2comp="mpa-proxy-lib" tid="2855687" level="INFO"] Entity added: MP, <uuid>, master = true
2024-06-18T15:07:11.374Z nsx-proxy[2855675]: NSX 2855675 - [nsx@6876 comp="nsx-esx" subcomp="nsx-proxy" s2comp="mpa-proxy-lib" tid="2855675" level="INFO"] Entity added: MP, <uuid>, master = true
2024-06-18T15:07:11.374Z nsx-proxy[2855675]: NSX 2855675 - [nsx@6876 comp="nsx-esx" subcomp="nsx-proxy" s2comp="mpa-proxy-lib" tid="2855675" level="INFO"] Writing APH info to file '/etc/vmware/nsx/appliance-info.xml'
2024-06-18T15:07:11.375Z nsx-proxy[2855675]: NSX 2855675 - [nsx@6876 comp="nsx-esx" subcomp="nsx-proxy" s2comp="mpa-proxy-lib" tid="2855687" level="INFO"] Writing APH info to file '/etc/vmware/nsx/appliance-info.xml'
2024-06-18T15:07:11.375Z nsx-proxy[2855675]: NSX 2855675 - [nsx@6876 comp="nsx-esx" subcomp="nsx-proxy" s2comp="mpa-proxy-lib" tid="2855675" level="INFO"] Successfully updated /etc/vmware/nsx/appliance-info.xml
2024-06-18T15:07:11.375Z nsx-proxy[2855675]: NSX 2855675 - [nsx@6876 comp="nsx-esx" subcomp="nsx-proxy" s2comp="mpa-proxy-lib" tid="2855687" level="INFO"] Successfully updated /etc/vmware/nsx/appliance-info.xml
2024-06-18T15:07:41.383Z nsx-proxy[2855675]: NSX 2855675 - [nsx@6876 comp="nsx-esx" subcomp="nsx-proxy" s2comp="mpa-proxy-lib" tid="2855687" level="INFO"] ActionDiscovery: Timed out waiting to get connected with Master APH.

  • Because the Discovery sequence and the configuration update run simultaneously on this host, Discovery does not succeed.
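One way to spot this interleaving in a log is to look for the same message being emitted by more than one thread ID. The sketch below is an illustration only: it assumes the `tid="..."` field format shown in the excerpt above, the function name is hypothetical, and a duplicate message from two threads is a hint worth investigating rather than proof of the race.

```python
import re
from collections import defaultdict

# Captures the thread ID and the free-text message that follows the
# structured-data block, e.g. ... tid="2855687" level="INFO"] <message>
LINE = re.compile(r'tid="(?P<tid>\d+)"[^\]]*\]\s*(?P<msg>.*)$')


def messages_from_multiple_threads(lines):
    """Map each message logged by more than one tid to the sorted list of those tids."""
    tids_by_msg = defaultdict(set)
    for line in lines:
        match = LINE.search(line)
        if match:
            tids_by_msg[match.group("msg").strip()].add(match.group("tid"))
    return {msg: sorted(tids) for msg, tids in tids_by_msg.items() if len(tids) > 1}


sample_log = r"""
2024-06-18T15:07:11.371Z nsx-proxy[2855675]: NSX 2855675 - [nsx@6876 comp="nsx-esx" subcomp="nsx-proxy" s2comp="mpa-proxy-lib" tid="2855675" level="INFO"] DiscoveryManager: Received following call status from endpoint ssl://10.4.xx.xx:1234: SUCCESS
2024-06-18T15:07:11.374Z nsx-proxy[2855675]: NSX 2855675 - [nsx@6876 comp="nsx-esx" subcomp="nsx-proxy" s2comp="mpa-proxy-lib" tid="2855675" level="INFO"] Writing APH info to file '/etc/vmware/nsx/appliance-info.xml'
2024-06-18T15:07:11.375Z nsx-proxy[2855675]: NSX 2855675 - [nsx@6876 comp="nsx-esx" subcomp="nsx-proxy" s2comp="mpa-proxy-lib" tid="2855687" level="INFO"] Writing APH info to file '/etc/vmware/nsx/appliance-info.xml'
""".strip().splitlines()

for message, tids in messages_from_multiple_threads(sample_log).items():
    print(tids, message)
```

On the sample lines, the "Writing APH info to file" message is attributed to both thread IDs, matching the overlap described above.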

Environment

NSX-T 3.1.x and NSX-T 3.2.0 - 3.2.3

Cause

Processing a ConfigUpdate from the master MP and a Discovery response at the same time causes a race condition. After this, the connection state never recovers and remains disconnected, which leaves the system stuck in an inconsistent state.

Resolution

This race condition has been resolved in NSX v3.2.4 and NSX v4.x.

Workaround:

Restart the 'nsx-proxy' service on the TN:

Log in as root on the Edge/Host TN and run '/etc/init.d/nsx-proxy restart', or reboot the affected edge/host.

Additional Information

* APH - Appliance Proxy Hub. It acts as a communication channel between NSX Manager and a transport node. It runs as a service on NSX Manager and provides a secure connection between a transport node and NSX Manager.

* TN - Transport node