Network interruption for NSX VMs and Tunnels down during high volume of vMotions

Article ID: 413006

Updated On:

Products

VMware NSX

Issue/Introduction

  • VMs experience intermittent network connectivity issues.
  • VM freezes after ESXi patching and cannot connect to the network.
  • After a VM moves from one ESX host to another, it may become temporarily unreachable on the NSX overlay network.
  • Newly deployed VMs or restarted pods may have no network connectivity.
  • BFD tunnels may be reported down or seen flapping in NSX UI.
  • At the time of the issue, there is a high rate of concurrent vMotions.
  • This may happen during an operation that involves multiple ESX hosts entering Maintenance Mode:
    • During an NSX upgrade, if many ESX hosts (or clusters, and therefore their hosts) are upgraded in parallel.
    • During an ESX upgrade in clusters prepared for NSX, if many ESX hosts are upgraded in parallel or at a fast rate.
  • Log lines similar to the below are encountered on the NSX Manager in /var/log/cloudnet/nsx-ccp.log
    [nsx@6876 comp="nsx-controller" level="INFO" subcomp="falcon"] Batching 1 falcon transactions. Remaining #### in the queue.
    Explanation:
    • This log line is present if:
        1) The remaining queue size exceeds 100 elements,
            or
        2) The batch contains more than 100 elements.
    • Many batches of '1' falcon transactions are being processed at a time (as opposed to batches of multiple transactions).
    • Observing the "Remaining" count over time shows whether the queue grows faster than the NSX Manager can process it.
    • A queue with more than 1000 transactions remaining may cause delays in the processing of inventory updates by the NSX Central Control Plane.
    • Note that all NSX Manager appliances in the cluster should be checked for this pattern.
  • Impact is no longer seen after, or shortly after, the falcon queue size is down to 0.

Note: The preceding log excerpt is only an example. Date, time, and environment-specific values will vary.
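To check how deep the falcon backlog currently is, the "Remaining" counter can be pulled out of the log line shown above. The following is a minimal sketch, assuming the message format from the excerpt and the default log path (the script itself is illustrative, not shipped with NSX):

```shell
#!/bin/sh
# Sketch: print the most recent "Remaining" falcon queue depth from
# nsx-ccp.log. Path and message format are taken from the excerpt above.
log="${1:-/var/log/cloudnet/nsx-ccp.log}"
grep -oE 'Remaining [0-9]+ in the queue' "$log" | awk '{print $2}' | tail -n 1
```

Run it periodically on each NSX Manager appliance in the cluster; a value trending down toward 0 indicates the backlog is draining.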

Environment

VMware NSX

Cause

This issue is seen in environments with inventory objects that exceed supported limits (e.g. Groups); see VMware Configuration Maximums.
A high volume of vMotion events may overwhelm the NSX Central Control Plane (CCP).
Tasks are queued for processing by the system component called falcon.
Some tasks, including vMotion, are processed one by one, which delays the other queued tasks.
If enough delay occurs between the reception of a TEP membership update, its processing, and the propagation of the resulting TEP update, impact may be observed on entities that rely on TEP memberships, such as BFD tunnels and NSX overlay connectivity.

For instance, when a VM is vMotioned from ESX host A to ESX host B, the rest of the NSX network must be updated so that the VM is known to be reachable via the TEP of ESX host B.
If the TEP update is delayed, the VM will be unreachable from the rest of the network until that update is processed and propagated, as GENEVE traffic will be incorrectly sent to the TEP of ESX host A.

BFD tunnels between ESX hosts are also established based on TEP memberships. If enough delay occurs during TEP updates, BFD tunnels may be reported down or seen flapping in the NSX UI.
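The backlog mechanics above can be illustrated with simple arithmetic: whenever transactions are enqueued faster than falcon drains them, the queue grows linearly until it crosses the 1000-transaction threshold noted earlier. The rates below are purely hypothetical, for illustration only:

```shell
# Illustrative arithmetic only (rates are hypothetical, not measured):
# if transactions arrive faster than falcon processes them, the backlog
# grows linearly and eventually crosses the 1000-transaction threshold.
awk 'BEGIN {
  arrival = 50;   # assumed transactions/sec enqueued during mass vMotion
  service = 30;   # assumed transactions/sec processed one by one
  for (t = 10; t <= 60; t += 10)
    printf "after %ds: backlog %d\n", t, (arrival - service) * t
}'
```

With these assumed rates, the backlog reaches 1200 transactions after one minute, at which point inventory updates (including TEP membership changes) start to lag.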

Resolution

This issue is resolved in VMware NSX 4.2.3.2, available at Broadcom downloads.

The recommended workaround is to reconfigure the environment to ensure inventory objects are within supported limits, see VMware Configuration Maximums.

The following are alternative workarounds; they reduce the risk of impact but will not necessarily eliminate it.

  • Reduce the rate of vMotion operations. 
  • If encountered during an NSX upgrade, reduce the number of ESX hosts (or clusters) upgraded in parallel.
  • If encountered during an ESX upgrade, reduce the rate of the ESX upgrade.
  • If large NSX inventory objects exist, increase the form factor of the NSX Manager appliances, potentially to Extra Large.
    Refer to the NSX Documentation: NSX Manager VM and Host Transport Node System Requirements

Additional Information

In a live NSX environment, you can monitor the falcon queue with the following command:

  • As admin from NSX CLI:
    get log-file controller follow | find "Batching 1 falcon transactions.*Remaining"
  • As root:
    tail -f /var/log/cloudnet/nsx-ccp.log | grep -E "Batching 1 falcon transactions.*Remaining"
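Building on the commands above, the backlog value can also be compared against the 1000-transaction threshold while following the log. The following is a sketch for the root shell; the warning text is illustrative, not an NSX message:

```shell
# Follow nsx-ccp.log and print a warning whenever the falcon backlog
# exceeds 1000 queued transactions (threshold taken from this article).
tail -f /var/log/cloudnet/nsx-ccp.log \
  | grep --line-buffered -oE 'Remaining [0-9]+ in the queue' \
  | awk '$2 > 1000 { print "WARNING: falcon backlog at " $2 }'
```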