Broken management channel on port 1234 resulting in datapath impacts

Article ID: 379698

Products

VMware NSX
VMware NSX-T Data Center

Issue/Introduction

  • The NSX Manager uses two RPC (Remote Procedure Call) channels to communicate with transport nodes:

    1.)  MP (Management Plane) Channel on port 1234

          [root@esx-04:~] esxcli network ip connection list | grep 1234
          tcp         0       0  #.#.#.#:12352              #.#.#.#:1234    ESTABLISHED    547428  newreno  nsx-proxy

    2.)  CCP (Central Control Plane) channel on port 1235

          [root@esx-04:~] esxcli network ip connection list | grep 1235
          tcp         0       0  #.#.#.#:42204              #.#.#.#:1235    ESTABLISHED    547428  newreno  nsx-proxy
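The same check can be scripted. The following is a minimal sketch, assuming the output of `esxcli network ip connection list` has been captured as text with the column layout shown above. The function name `channel_states` and the sample addresses are hypothetical and not part of NSX.

```python
# Ports taken from the article: 1234 = MP channel, 1235 = CCP channel.
CHANNEL_PORTS = {1234: "MP", 1235: "CCP"}


def channel_states(esxcli_output):
    """Map each channel name to the TCP state seen for nsx-proxy, or None if absent."""
    states = {name: None for name in CHANNEL_PORTS.values()}
    for line in esxcli_output.splitlines():
        fields = line.split()
        # Assumed columns: proto, recv-q, send-q, local address, foreign address,
        # state, world ID, CC algorithm, world name (as in the sample output).
        if len(fields) < 9 or fields[0] != "tcp" or fields[-1] != "nsx-proxy":
            continue
        # Compare the foreign (manager-side) port exactly rather than by substring.
        port_text = fields[4].rsplit(":", 1)[-1]
        if port_text.isdigit() and int(port_text) in CHANNEL_PORTS:
            states[CHANNEL_PORTS[int(port_text)]] = fields[5]
    return states


# Hypothetical sample: only the MP channel is present.
sample = "tcp 0 0 10.0.0.4:12352 10.0.0.10:1234 ESTABLISHED 547428 newreno nsx-proxy\n"
print(channel_states(sample))
```

Matching the foreign port exactly avoids a false match that a plain grep can produce: in the first sample above, the local port 12352 also contains the string 1235.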


  • The MP retrieves real-time logical router status over the first channel (the MP channel on port 1234).

  • If communication between the manager and the edge transport nodes breaks on port 1234 while the second channel on port 1235 stays up, the manager marks the SR status for the edge node as Down/Unknown even though the edge node itself is not down.

    Because of the connectivity issue between the MP and the edges hosting the SRs, the SR status becomes UNKNOWN. When it stays in that state for 30 minutes, the MP removes the default route, which disrupts the data path.

    /var/log/proton/nsxapi.log:

    INFO task-scheduler-1 T0AsymmetricLoadBalancerTask 9580 ROUTING [nsx@6876 comp="nsx-manager" level="INFO" subcomp="manager"] For SR 28####da-7##6-4##c-a##9-60d2####dc1a lr LogicalRouter/07####d6-1##b-4##b-9##1-c09e####364d updated srruntime status object with runtime status UNKNOWN
    
    INFO task-scheduler-1 T0AsymmetricLoadBalancerTask 9580 ROUTING [nsx@6876 comp="nsx-manager" level="INFO" subcomp="manager"] For SR 28####da-7##6-4##c-a##9-60d2####dc1a LR 07####d6-1##b-4##b-9##1-c09e####364d changing status to DOWN and triggering the activity to disable Nexthop
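To find these transitions in a large nsxapi.log, a small filter can help. This is a sketch only, assuming the message wording matches the two entries above; the function name `sr_status_events` and the shortened IDs in the example are hypothetical.

```python
import re

# Matches the two message shapes shown above. The lowercase word after
# "srruntime status" is skipped because the status group only accepts uppercase.
SR_STATUS = re.compile(
    r"For SR (?P<sr>\S+) .*?"
    r"(?:runtime status (?P<runtime>[A-Z]+)|changing status to (?P<changed>[A-Z]+))"
)


def sr_status_events(lines):
    """Yield (sr_id, status) for each SR status message from T0AsymmetricLoadBalancerTask."""
    for line in lines:
        if "T0AsymmetricLoadBalancerTask" not in line:
            continue
        match = SR_STATUS.search(line)
        if match:
            yield match.group("sr"), match.group("runtime") or match.group("changed")


# Hypothetical, shortened log lines in the same shape as the entries above.
example = [
    "INFO task-scheduler-1 T0AsymmetricLoadBalancerTask 9580 ROUTING For SR sr-1 "
    "lr LogicalRouter/lr-1 updated srruntime status object with runtime status UNKNOWN",
    "INFO task-scheduler-1 T0AsymmetricLoadBalancerTask 9580 ROUTING For SR sr-1 "
    "LR lr-1 changing status to DOWN and triggering the activity to disable Nexthop",
]
for sr_id, status in sr_status_events(example):
    print(sr_id, status)
```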

     

  • For a T0 in Active-Active (AA) mode, if an SR is not active for 30 minutes, the NSX manager removes the default route to that SR (for load balancing) and pushes the removal to the transport nodes over the second channel on port 1235.

  • The following log entries record the route removal:

    2024-09-12T11:02:03.802Z edge NSX 1 FABRIC [nsx@6876 comp="nsx-edge" subcomp="nsxa" s2comp="nestdb-lrouter" level="INFO"] FIB delete ::/0 for lrouter xxxxx-xxx-xxx-xxxx deleted
    2024-09-12T11:02:03.802Z edge NSX 1 SYSTEM [nsx@6876 comp="nsx-edge" subcomp="nsxa" s2comp="nestdb" level="INFO"] Handle MONITOR message type 49: FIB, update: 20 bytes, delete: 116 bytes
    2024-09-12T11:02:03.802Z edge NSX 1 FABRIC [nsx@6876 comp="nsx-edge" subcomp="nsxa" s2comp="nestdb-lrouter" level="INFO"] Processing FIB DELETE msg from nestdb
    2024-09-12T11:02:03.802Z edge NSX 1 FABRIC [nsx@6876 comp="nsx-edge" subcomp="nsxa" s2comp="nestdb-lrouter" level="INFO"] FIB delete 0.0.0.0/0 for lrouter xxxxx-xxx-xxx-xxxx
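Default route deletions can be picked out of a longer log with a filter like the one below. It is a sketch that assumes the message format shown above; the function name `default_route_deletes` and the shortened sample lines are hypothetical.

```python
import re

# Matches "FIB delete" entries for the IPv4 and IPv6 default routes only.
DEFAULT_ROUTE_DELETE = re.compile(
    r"^(?P<ts>\S+) .*FIB delete (?P<prefix>0\.0\.0\.0/0|::/0) for lrouter (?P<lrouter>\S+)"
)


def default_route_deletes(lines):
    """Return (timestamp, prefix, lrouter) for each default-route FIB delete entry."""
    found = []
    for line in lines:
        match = DEFAULT_ROUTE_DELETE.search(line)
        if match:
            found.append((match.group("ts"), match.group("prefix"), match.group("lrouter")))
    return found


# Hypothetical, shortened lines in the same shape as the entries above.
example = [
    "2024-09-12T11:02:03.802Z edge NSX 1 FABRIC FIB delete ::/0 for lrouter lr-1 deleted",
    "2024-09-12T11:02:03.802Z edge NSX 1 FABRIC Processing FIB DELETE msg from nestdb",
    "2024-09-12T11:02:03.802Z edge NSX 1 FABRIC FIB delete 0.0.0.0/0 for lrouter lr-1",
]
for entry in default_route_deletes(example):
    print(entry)
```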

 

  • When the MP channel on port 1234 goes down, the manager can no longer retrieve the correct SR status. After 30 minutes it removes the default route to that SR and pushes the removal to the transport nodes over port 1235. Traffic is then no longer forwarded to the edge node, which impacts the datapath.
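The timing described above can be modelled in a few lines. This is an illustrative sketch, not NSX code: the function name `removal_due` is hypothetical, and treating exactly 30 minutes as due is an assumption.

```python
from datetime import datetime, timedelta

REMOVAL_DELAY = timedelta(minutes=30)  # delay stated in the article


def removal_due(not_active_since, now):
    """Return True once the SR has been continuously not active for the full delay.

    not_active_since is the time the SR status left the active state,
    or None if the SR is currently active.
    """
    if not_active_since is None:
        return False
    # Assumption: the boundary (exactly 30 minutes) counts as due.
    return now - not_active_since >= REMOVAL_DELAY


# Hypothetical timestamps: status became UNKNOWN at 10:32:03.
went_unknown = datetime(2024, 9, 12, 10, 32, 3)
print(removal_due(went_unknown, went_unknown + timedelta(minutes=29)))
print(removal_due(went_unknown, went_unknown + timedelta(minutes=30)))
```

In this model the state of the MP channel is the only input, which is why a management-only outage is enough to trigger the route removal.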

Environment

VMware NSX-T Data Center

Cause

This is expected behavior in NSX-T versions up to and including 3.1.

Resolution

The following enhancements apply from VMware NSX-T Data Center 3.2 onwards.

  • If the T0 has multiple SRs:

    Only if the dataplane of one edge node goes down (regardless of whether its MPA connectivity is up or down) and at least one other edge node has both dataplane and MPA connectivity up:
     >> The backplane of that SR moves to the SR on the healthy edge node.
     >> After 30 minutes, only that SR's default route next hops are removed.

  • If the T0 has only one SR, no default route next hops are removed.
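The rules above can be read as a small decision function. The sketch below is one interpretation of those rules, not the NSX implementation; the `SrState` class, the function name, and the SR names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class SrState:
    name: str
    dataplane_up: bool
    mpa_up: bool


def srs_eligible_for_removal(srs):
    """Return names of SRs whose default route next hops may be removed after the 30-minute wait."""
    if len(srs) < 2:
        return []  # a T0 with a single SR: no removal

    def has_healthy_peer(sr):
        # A healthy peer has both dataplane and MPA connectivity up.
        return any(o is not sr and o.dataplane_up and o.mpa_up for o in srs)

    # Only a dataplane failure qualifies; MPA state of the failed SR is ignored.
    return [sr.name for sr in srs if not sr.dataplane_up and has_healthy_peer(sr)]


# The scenario from this article: management connectivity lost, dataplane still up.
print(srs_eligible_for_removal([SrState("sr-a", True, False), SrState("sr-b", True, True)]))
# A real dataplane failure with a healthy peer available.
print(srs_eligible_for_removal([SrState("sr-a", False, True), SrState("sr-b", True, True)]))
```

Under this reading, a broken management channel alone no longer leads to route removal, because the trigger is the dataplane state rather than the SR status seen over port 1234.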