North/South outage through NSX Edges after Managers are vMotioned to ESXi host without management connectivity

Products

VMware NSX

Issue/Introduction

Two or more NSX-T Manager nodes are vMotioned to an ESXi host that has no network connectivity on the VDS management port group, and North/South connectivity through the Edges is lost.

CCP logging examples:

/var/log/cloudnet/nsx-ccp.log on a Manager node that was vMotioned to a host without management connectivity shows CCP disconnected from Corfu:

2023-08-06T06:31:30.001Z  WARN CorfuRuntime-0 CorfuRuntime 1725 Tried to get layout from <Manager IP>:9000 but failed by timeout
2023-08-06T06:31:31.009Z  WARN CorfuRuntime-0 CorfuRuntime 1725 Tried to get layout from <Manager IP>:9000 but failed by timeout
2023-08-06T06:31:37.013Z  WARN CorfuRuntime-0 CorfuRuntime 1725 Tried to get layout from <Manager IP>:9000 but failed by timeout

CCP's connection to Corfu is restored at some later point when the Manager node is vMotioned back to a host with management connectivity:

2023-08-07T03:33:37.810Z  INFO netty-0 ClientHandshakeHandler 1725 channelRead: Handshake succeeded. Corfu Server Version: [XXXXXXXXXXX]
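The disconnect and reconnect messages above can be located with a short script. The following is a minimal sketch, not a supported tool: the function names `corfu_connectivity_events` and `first_outage_window` are hypothetical, and the patterns are based only on the log excerpts shown in this article, so other releases may word these messages differently.

```python
import re
from typing import Iterable, List, Optional, Tuple

# Patterns derived from the log excerpts in this article (assumption:
# the message wording is the same in the release being examined).
TIMESTAMP = re.compile(r"^\s*(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d+Z)")
TIMEOUT = re.compile(r"Tried to get layout from .+? but failed by timeout")
HANDSHAKE = re.compile(r"Handshake succeeded")


def corfu_connectivity_events(lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Return (timestamp, kind) pairs, where kind is 'timeout' or 'handshake'."""
    events = []
    for line in lines:
        stamp = TIMESTAMP.match(line)
        if not stamp:
            continue  # skip lines that do not start with a timestamp
        if TIMEOUT.search(line):
            events.append((stamp.group(1), "timeout"))
        elif HANDSHAKE.search(line):
            events.append((stamp.group(1), "handshake"))
    return events


def first_outage_window(events: List[Tuple[str, str]]) -> Optional[Tuple[str, str]]:
    """Return (first timeout, next handshake) timestamps, or None if not found."""
    start = None
    for stamp, kind in events:
        if kind == "timeout" and start is None:
            start = stamp
        elif kind == "handshake" and start is not None:
            return (start, stamp)
    return None
```

To run it against a Manager node, read the file line by line, for example `corfu_connectivity_events(open("/var/log/cloudnet/nsx-ccp.log"))`, and pass the result to `first_outage_window`.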

Upon regaining connectivity, CCP attempts to redo a full sync with Corfu:
- Full sync for monitoring namespace:

2023-08-07T03:34:12.462Z  INFO ForkJoinPool.commonPool-worker-0 UfoStoreManager 1725 - [nsx@6876 comp="nsx-manager" level="INFO" subcomp="ccp"] Successfully full synced table monitoring$event with 219 entries
…
2023-08-07T03:34:17.633Z  INFO pool-92-thread-1 TransactionConsumer 1725 - [nsx@6876 comp="nsx-manager" level="INFO" subcomp="ccp"] Transaction Consumer received transaction 45382 with size 219

- Full sync for nsx namespace:

2023-08-07T03:34:15.729Z  INFO ForkJoinPool.commonPool-worker-11 UfoStoreManager 1725 - [nsx@6876 comp="nsx-manager" level="INFO" subcomp="ccp"] Successfully full synced table nsx$VtepGroup with 0 entries
…
2023-08-07T03:34:20.878Z  INFO pool-92-thread-1 TransactionConsumer 1725 - [nsx@6876 comp="nsx-manager" level="INFO" subcomp="ccp"] Transaction Consumer received transaction 45383 with size 16446

- Another full sync for monitoring namespace:

2023-08-07T03:34:21.760Z  INFO ForkJoinPool.commonPool-worker-3 UfoStoreManager 1725 - [nsx@6876 comp="nsx-manager" level="INFO" subcomp="ccp"] Successfully full synced table monitoring$event with 219 entries
…
2023-08-07T03:34:23.265Z  INFO pool-92-thread-1 TransactionConsumer 1725 - [nsx@6876 comp="nsx-manager" level="INFO" subcomp="ccp"] Transaction Consumer received transaction 45384 with size 219

This series of full syncs eventually clears all configurations from the nsx namespace in the Central Control Plane (CCP), including Logical Routers.
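The order of full syncs shown above (monitoring, then nsx, then monitoring again) can be checked for in nsx-ccp.log. This is a minimal sketch under stated assumptions: the helper names are hypothetical, the pattern is taken from the "Successfully full synced table" lines in this article, and a match is only a hint that the sequence occurred, not proof that configuration was removed.

```python
import re
from typing import Iterable, List

# Matches e.g. "Successfully full synced table nsx$VtepGroup with 0 entries".
# The text before "$" is treated as the namespace (assumption based on the
# excerpts in this article).
FULL_SYNC = re.compile(r"Successfully full synced table (\w+)\$(\S+) with (\d+) entries")


def full_sync_namespace_order(lines: Iterable[str]) -> List[str]:
    """Return namespaces in order of appearance, collapsing consecutive repeats."""
    order: List[str] = []
    for line in lines:
        match = FULL_SYNC.search(line)
        if match and (not order or order[-1] != match.group(1)):
            order.append(match.group(1))
    return order


def matches_reported_sequence(order: List[str]) -> bool:
    """True if 'monitoring, nsx, monitoring' appears as a contiguous run."""
    pattern = ["monitoring", "nsx", "monitoring"]
    return any(order[i:i + 3] == pattern for i in range(len(order) - 2))
```

Consecutive tables from the same namespace are collapsed so that a namespace with many tables counts as one full sync in the ordering.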

Logical Routers can be deleted from CCP along with many other objects, and are not added back until CCP is restarted:

2023-08-07T03:34:30.367Z  INFO Owl-worker-15 RedistributionAppImpl 1725 - [nsx@6876 comp="nsx-controller" level="INFO" subcomp="Redistribution"] Deleted object for LogicalRouter(<UUID>)

Edge logging in /var/log/syslog when Logical Router is deleted:

 2023-08-07T13:19:11.815Z <Edge hostname> NSX 1 FABRIC [nsx@6876 comp="nsx-edge" subcomp="nsxa" s2comp="ha-app" level="INFO"] 00002000-0000-0000-0000-000000000002 cluster remove <UUID>@<UUID>

 2023-08-07T13:19:11.816Z <Edge hostname> NSX 1 FABRIC [nsx@6876 comp="nsx-edge" subcomp="nsxa" s2comp="nestdb-lrouter" level="INFO"] Deleting lrouter

When the Logical Router is deleted, FRR daemons are also killed and BGP goes down as seen in /var/log/frr/frr.log:

 2023-08-07T13:19:12.936Z <Edge hostname> zebra 7786 - -  [EC 4043309117] Client 'system' encountered an error and is shutting down.
 2023-08-07T13:19:12.046Z <Edge hostname> ospfd 8135 - -  Terminating on signal
 2023-08-07T13:19:12.235Z <Edge hostname> bgpd 8095 - -  %ADJCHANGE: neighbor <IP>(Unknown) in vrf default Down Neighbor deleted
 2023-08-07T13:19:12.235Z <Edge hostname> bgpd 8095 - -  %ADJCHANGE: neighbor <IP>(Unknown) in vrf default Down Neighbor deleted
 2023-08-07T13:19:12.892Z <Edge hostname> bgpd 8095 - -  Terminating on signal
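The FRR messages above can be summarized per Edge with a small parser. This is a minimal sketch: `summarize_frr_log` is a hypothetical helper, and both patterns are derived only from the excerpt in this article, so they may need adjusting for other FRR versions or message formats.

```python
import re
from typing import Dict, Iterable, List

# "%ADJCHANGE: neighbor <IP>(Unknown) in vrf default Down Neighbor deleted"
ADJ_DOWN = re.compile(r"%ADJCHANGE: neighbor (\S+?)\(.*?\) in vrf (\S+) Down (.+)$")
# "<daemon> <pid> - -  Terminating on signal"
TERMINATED = re.compile(r"\s(\w+)\s+\d+\s+-\s+-\s+Terminating on signal")


def summarize_frr_log(lines: Iterable[str]) -> Dict[str, List]:
    """Collect BGP neighbor-down events and FRR daemons that terminated."""
    summary: Dict[str, List] = {"neighbors_down": [], "daemons_terminated": []}
    for line in lines:
        down = ADJ_DOWN.search(line)
        if down:
            # (neighbor, vrf, reason) exactly as logged
            summary["neighbors_down"].append(down.groups())
            continue
        terminated = TERMINATED.search(line)
        if terminated:
            summary["daemons_terminated"].append(terminated.group(1))
    return summary
```

A result that lists both `bgpd` under terminated daemons and neighbors going down with the reason "Neighbor deleted" is consistent with the Logical Router deletion described above.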

Environment

VMware NSX-T Data Center 3.x
VMware NSX 4.x

Cause

This issue occurs because of the order in which full syncs are performed for the different namespaces. When the monitoring table's full sync is submitted, all configurations in the nsx namespace are treated as "not existing" and are eventually removed.

Resolution

This issue is resolved in VMware NSX 3.2.2, available at Broadcom downloads.
This issue is resolved in VMware NSX 4.1.1, available at Broadcom downloads.

If you are having difficulty finding and downloading software, please review the Download Broadcom products and software KB.

Note:

The issue with deleting configurations from the nsx namespace with the impact described above is resolved in NSX 3.2.2.

However, the monitoring namespace will still be empty if the issue is hit on 3.2.2+, which affects Alarms data. The CCP is responsible for syncing Alarm data stored in the monitoring namespace from the management plane to Transport Nodes, so an empty monitoring namespace prevents any administrative changes to Alarms from taking effect.

The resolution for all namespaces is in NSX 4.1.1.
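The version notes above can be expressed as a simple lookup. This is a minimal sketch with a hypothetical function name, `fix_status`. It encodes only what this article states: 3.2.2 resolves the nsx namespace deletion and 4.1.1 resolves all namespaces. The article does not describe 4.0.x, so the sketch reports that range as not stated rather than guessing.

```python
from typing import Tuple


def parse_version(text: str) -> Tuple[int, ...]:
    """Parse 'major.minor.patch[.build]' into a 3-part tuple, padding with zeros."""
    parts = tuple(int(p) for p in text.strip().split(".")[:3])
    return parts + (0,) * (3 - len(parts))


def fix_status(version: str) -> str:
    """Classify an NSX version against the fixes described in this article."""
    v = parse_version(version)
    if v >= (4, 1, 1):
        return "resolved for all namespaces"
    if v >= (4, 0, 0):
        # Assumption: 4.0.x is not covered by the article, so no claim is made.
        return "not stated in this article"
    if v >= (3, 2, 2):
        return "nsx namespace fixed; monitoring namespace still affected"
    return "affected"
```

For example, `fix_status("3.2.1")` returns "affected", while `fix_status("3.2.3")` reports that only the nsx namespace issue is fixed.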

Workaround:

Restart the central control plane on all Manager nodes one at a time.

As admin: restart service controller
As root: service nsx-ccp restart

Additional Information

Impact/Risks:
If BGP goes down, a North/South (N/S) outage can result.