Broken management channel on port 1234 resulting in datapath impacts

Products

VMware NSX
VMware NSX-T Data Center

Issue/Introduction

The NSX manager uses two RPC (Remote Procedure Call) channels to communicate with transport nodes:

1.) MP (Management Plane) Channel on port 1234

[root@esx-04:~] esxcli network ip connection list | grep 1234
tcp 0 0 #.#.#.#:12352 #.#.#.#:1234 ESTABLISHED 547428 newreno nsx-proxy

2.) CCP (Central Control Plane) channel on port 1235

[root@esx-04:~] esxcli network ip connection list | grep 1235
tcp 0 0 #.#.#.#:42204 #.#.#.#:1235 ESTABLISHED 547428 newreno nsx-proxy

The MP retrieves real-time logical router status over the first channel (the MP channel on port 1234).

If communication between the manager and the edge transport nodes breaks on port 1234 while the second channel on port 1235 stays up, the manager marks the SR (service router) status for the edge node as Down/Unknown, even though the edge node itself is not down.
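The condition described above (MP channel down while the CCP channel stays up) can be checked mechanically. The following Python sketch parses text in the same column layout as the esxcli output shown earlier. The function names channel_states and mp_only_outage are hypothetical, and the sketch assumes the command output has already been captured as a string. On an edge node the equivalent connection listing may use a different layout, so the parsing would need to be adjusted there.

```python
# Ports as stated in the article: 1234 = MP channel, 1235 = CCP channel.
CHANNEL_PORTS = {"MP": "1234", "CCP": "1235"}


def channel_states(connection_list_output):
    """Map each channel name to True if an ESTABLISHED nsx-proxy connection exists."""
    states = {name: False for name in CHANNEL_PORTS}
    for line in connection_list_output.splitlines():
        fields = line.split()
        # Assumed columns: proto, recv-q, send-q, local address, foreign address,
        # state, world id, congestion algorithm, world name.
        if len(fields) < 9 or fields[0] != "tcp":
            continue
        foreign, state, owner = fields[4], fields[5], fields[8]
        if state != "ESTABLISHED" or owner != "nsx-proxy":
            continue
        remote_port = foreign.rsplit(":", 1)[-1]
        for name, port in CHANNEL_PORTS.items():
            if remote_port == port:
                states[name] = True
    return states


def mp_only_outage(states):
    """True for the risky case: MP channel down while the CCP channel is up."""
    return not states["MP"] and states["CCP"]
```

If mp_only_outage returns True for a node, the manager may be unable to read the SR status from that node while still being able to push route changes to it, which is the combination this article describes.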

Because of the connectivity issue between the MP and the edges hosting the SRs, the SR status becomes UNKNOWN. When it stays in that state for 30 minutes, the MP removes the default route, which disrupts the datapath. The following entries appear in /var/log/proton/nsxapi.log:

INFO task-scheduler-1 T0AsymmetricLoadBalancerTask 9580 ROUTING [nsx@6876 comp="nsx-manager" level="INFO" subcomp="manager"] For SR 28####da-7##6-4##c-a##9-60d2####dc1a lr LogicalRouter/07####d6-1##b-4##b-9##1-c09e####364d updated srruntime status object with runtime status UNKNOWN

INFO task-scheduler-1 T0AsymmetricLoadBalancerTask 9580 ROUTING [nsx@6876 comp="nsx-manager" level="INFO" subcomp="manager"] For SR 28####da-7##6-4##c-a##9-60d2####dc1a LR 07####d6-1##b-4##b-9##1-c09e####364d changing status to DOWN and triggering the activity to disable Nexthop
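To find these transitions in a large log, the two messages above can be matched with regular expressions. This is a minimal sketch that assumes the message wording is exactly as quoted in this article; the function name scan_sr_events and the event labels STATUS_UNKNOWN and NEXTHOP_DISABLED are hypothetical and are not NSX terms.

```python
import re

# Patterns mirror the two nsxapi.log messages quoted in the article.
UNKNOWN_RE = re.compile(r"For SR (?P<sr>\S+) .*runtime status UNKNOWN")
DISABLE_RE = re.compile(
    r"For SR (?P<sr>\S+) .*changing status to DOWN "
    r"and triggering the activity to disable Nexthop"
)


def scan_sr_events(lines):
    """Return (sr_id, event) tuples in input order; event labels are hypothetical."""
    events = []
    for line in lines:
        # Only consider lines written by the task named in the article's logs.
        if "T0AsymmetricLoadBalancerTask" not in line:
            continue
        match = DISABLE_RE.search(line)
        if match:
            events.append((match.group("sr"), "NEXTHOP_DISABLED"))
            continue
        match = UNKNOWN_RE.search(line)
        if match:
            events.append((match.group("sr"), "STATUS_UNKNOWN"))
    return events
```

The function accepts any iterable of lines, so an open file handle for nsxapi.log can be passed in directly. A STATUS_UNKNOWN event followed later by NEXTHOP_DISABLED for the same SR matches the sequence described here.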

For an active-active (AA) T0, if an SR is not in the active state for 30 minutes, the NSX manager removes the default route to that SR (for load balancing) and pushes the route removal to the transport nodes over the second channel on port 1235.
The following logs show the route removal:

2024-09-12T11:02:03.802Z edge NSX 1 FABRIC [nsx@6876 comp="nsx-edge" subcomp="nsxa" s2comp="nestdb-lrouter" level="INFO"] FIB delete ::/0 for lrouter xxxxx-xxx-xxx-xxxx deleted
2024-09-12T11:02:03.802Z edge NSX 1 SYSTEM [nsx@6876 comp="nsx-edge" subcomp="nsxa" s2comp="nestdb" level="INFO"] Handle MONITOR message type 49: FIB, update: 20 bytes, delete: 116 bytes
2024-09-12T11:02:03.802Z edge NSX 1 FABRIC [nsx@6876 comp="nsx-edge" subcomp="nsxa" s2comp="nestdb-lrouter" level="INFO"] Processing FIB DELETE msg from nestdb
2024-09-12T11:02:03.802Z edge NSX 1 FABRIC [nsx@6876 comp="nsx-edge" subcomp="nsxa" s2comp="nestdb-lrouter" level="INFO"] FIB delete 0.0.0.0/0 for lrouter xxxxx-xxx-xxx-xxxx
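The default-route deletions in the log lines above can be picked out of the edge log with a short filter. This sketch assumes the "FIB delete <prefix> for lrouter <id>" wording shown above and assumes the timestamp is the first whitespace-separated token on each line; the function name default_route_deletes is hypothetical.

```python
import re

# Matches the "FIB delete <prefix> for lrouter <id>" wording quoted in the article.
FIB_DELETE_RE = re.compile(
    r"FIB delete (?P<prefix>\S+) for lrouter (?P<lrouter>\S+)"
)
# IPv4 and IPv6 default routes.
DEFAULT_PREFIXES = {"0.0.0.0/0", "::/0"}


def default_route_deletes(lines):
    """Return (timestamp, prefix, lrouter) for each default-route FIB delete."""
    hits = []
    for line in lines:
        match = FIB_DELETE_RE.search(line)
        if match and match.group("prefix") in DEFAULT_PREFIXES:
            # Assumes the timestamp is the first token, as in the quoted logs.
            timestamp = line.split(None, 1)[0]
            hits.append((timestamp, match.group("prefix"), match.group("lrouter")))
    return hits
```

Deletes of non-default prefixes are ignored, so the output lists only the events relevant to this issue.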

In summary, when the MP channel on port 1234 goes down, the manager can no longer retrieve the correct SR status. After 30 minutes it removes the default route to that SR and pushes the removal to the transport nodes over port 1235. Traffic is then no longer forwarded to the edge node, which results in datapath impact.
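The 30-minute window gives a rough way to estimate when datapath impact would begin after the MP channel breaks. The sketch below assumes the timer starts when the SR status first becomes non-active; the article does not state the exact starting point, and the function names are hypothetical.

```python
from datetime import datetime, timedelta

# The 30-minute window comes from the article.
# Assumption: it starts when the SR status first becomes non-active.
REMOVAL_DELAY = timedelta(minutes=30)


def expected_route_removal(non_active_since):
    """Earliest time the default route removal would be expected."""
    return non_active_since + REMOVAL_DELAY


def removal_due(non_active_since, now):
    """True once the SR has been non-active for the full window."""
    return now >= expected_route_removal(non_active_since)
```

Restoring the MP channel before the expected removal time should, under this assumption, avoid the route removal on affected versions.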

Environment

VMware NSX-T Data Center

Cause

This is expected behavior in NSX-T Data Center up to and including version 3.1.

Resolution

The following enhancements are available in VMware NSX-T Data Center 3.2 and later.

If the T0 has multiple SRs:

If the dataplane of one edge node goes down (regardless of whether its MPA connectivity is up or down), and at least one other edge node has both dataplane and MPA connectivity up, then and only then:
>> The backplane of that SR moves to the other SR mentioned above.
>> After 30 minutes, the default route hops of only that SR are removed.

If the T0 has only one SR, no default route hops are removed.
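The 3.2 and later rules above can be summarized as a small decision function. This is an illustrative model of the rules as stated in this article, not NSX code. The SrState type and the function name are hypothetical, and mpa_up is assumed to reflect the management connectivity discussed earlier.

```python
from dataclasses import dataclass


@dataclass
class SrState:
    """Hypothetical per-SR view used only for this illustration."""
    name: str
    dataplane_up: bool
    mpa_up: bool


def srs_eligible_for_hop_removal(srs):
    """Names of SRs whose default route hops would be removed after 30 minutes."""
    # A T0 with a single SR never has its default route hops removed.
    if len(srs) < 2:
        return []
    # At least one SR must have both dataplane and MPA connectivity up.
    if not any(sr.dataplane_up and sr.mpa_up for sr in srs):
        return []
    # Only SRs whose dataplane is down qualify, regardless of MPA state.
    return [sr.name for sr in srs if not sr.dataplane_up]
```

Under this model, an SR whose MPA connectivity is down but whose dataplane is still up is never selected, so a break on port 1234 alone no longer leads to route removal.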