VMs lose network connectivity after hitting the maximum connection limit on the ESXi host.

Products

VMware NSX

Issue/Introduction

This article helps identify a heap memory issue on the ESXi host related to the Distributed Firewall (DFW).

Symptoms:

Existing VMs randomly lose East-West (E-W) and North-South (N-S) network connectivity when they are subject to DFW.
VMs are disconnected from the network after vMotion to a host with vsip-related memory constraints.

Environment

VMware NSX-T Data Center 3.x
VMware NSX-T Data Center

Cause

During vMotion of the VM, the following lines are logged in /var/run/log/vmkernel.log:

2022-xx-xxT14:20:05.743Z cpu29:2097834)IMPORTTLVSTATES failed: 12 <<<<< VSIP running out of memory

2022-xx-xxT14:20:05.743Z cpu29:2097834)Failed to restore datapath state : Failure <<<<<

2022-xx-xxT14:20:05.744Z cpu29:2097834)DVFilter: 1309: Bringing down port due to failed DVFilter state restoration and failPolicy of FAIL_CLOSED. <<<
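
The log check above can be sketched as a small scan. This is a minimal illustration, not a supported tool: the function name find_vsip_restore_failures is hypothetical, and the signature strings are copied from the excerpt above, so other builds may word these messages differently.

```python
# Signature substrings copied from the vmkernel.log excerpt in this article.
# Assumption: other ESXi/NSX builds may phrase these messages differently.
SIGNATURES = (
    "IMPORTTLVSTATES failed: 12",
    "Failed to restore datapath state",
    "failPolicy of FAIL_CLOSED",
)


def find_vsip_restore_failures(lines):
    """Return (line_number, signature, text) for every line that matches."""
    hits = []
    for number, line in enumerate(lines, start=1):
        for signature in SIGNATURES:
            if signature in line:
                hits.append((number, signature, line.rstrip()))
                break  # report each log line once
    return hits
```

The function accepts any iterable of lines, so it can be fed an open file handle for a copy of vmkernel.log taken from the host or from a log bundle.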

- In the ESXi log bundle file commands/vsip_heap_stats.sh_-l.txt, at least one component shows 'leastPercentFree' as 0 (zero).

Here is an example for vsip-attr-0x431dfd000000:
{'name': 'vsip-attr', 'moduleID': 91, 'isDynamic': 1, 'physContigType': 1, 'lowerLimit': 0, 'upperLimit': -1, 'reserved': 0, 'memPool': 2989, 'ranges': 1, 'dlmallocOverhead': 1032, 'currentSize': 1342181800, 'initialSize': 8388608, 'currentAlloc': 10833608, 'currentAvail': 1331348192, 'currentReleasable': 3408, 'currentPercentFree': 99, 'currentPercentReleasable': 0, 'maximumSize': 1342181800, 'maximumAvail': 1331348192, 'maximumPercentFree': 99, 'leastPercentFree': 0, 'failedReqLogCount': 3119, 'numSucceededAllocations': 346527594, 'numFailedAllocations': 1558, 'numFreedAllocations': 346453602, 'avgAllocationSize': 134217728, 'numRequestsPerGrowth': 12, 'numGrowthOps': 3071, 'numShrinkOps': 8, 'pageSize': 4096}
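
A hedged sketch of the heap check, assuming each heap record in vsip_heap_stats.sh_-l.txt appears on one line in the dict-literal form shown in the example above. The function names parse_heap_stats and exhausted_heaps are hypothetical helpers, not part of any VMware tool.

```python
import ast


def parse_heap_stats(text):
    """Parse every line that looks like a dict literal into a Python dict.

    Assumption: one heap record per line, formatted as in the example above.
    Lines that are not valid literals are skipped.
    """
    heaps = []
    for line in text.splitlines():
        line = line.strip()
        if not (line.startswith("{") and line.endswith("}")):
            continue
        try:
            record = ast.literal_eval(line)
        except (ValueError, SyntaxError):
            continue
        if isinstance(record, dict):
            heaps.append(record)
    return heaps


def exhausted_heaps(heaps):
    """Return the heaps whose 'leastPercentFree' reached 0 at some point."""
    return [heap for heap in heaps if heap.get("leastPercentFree") == 0]
```

For the vsip-attr example above, the record is reported even though 'currentPercentFree' is 99, because 'leastPercentFree' records the lowest value the heap reached; 'numFailedAllocations' (1558 in the example) is a useful second field to inspect.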

- In the ESXi log bundle file commands/vsipioctl_info.sh.txt, the output of "/bin/vsipioctl getfilterstat -f <nic-XXXXX-ethX-vmware-sfw.2>" for one or more VMs shows a large number of packet drops with the drop reason "memory".

Here is an example:
/bin/vsipioctl getfilterstat -f nic-xxxxx-eth0-vmware-sfw.2
PACKETS IN OUT
------- -- ---
<snip>

BYTES IN OUT
----- -- ---
<snip>

DROP REASON
-----------
memory: 1851657 <<<<< Packet drops with the drop reason as memory.
<snip>
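
The drop counter can be pulled out of saved getfilterstat output with a short parser. This is a sketch under the assumption that the output follows the layout in the example above (a "DROP REASON" heading followed by "memory: <count>"); memory_drops is a hypothetical helper name.

```python
import re


def memory_drops(filterstat_output):
    """Return the 'memory' counter from the DROP REASON section, or None.

    Assumption: the text follows the getfilterstat layout shown above.
    """
    in_drop_section = False
    for line in filterstat_output.splitlines():
        stripped = line.strip()
        if stripped.startswith("DROP REASON"):
            in_drop_section = True
            continue
        if in_drop_section:
            match = re.match(r"memory:\s*(\d+)", stripped)
            if match:
                return int(match.group(1))
    return None
```

A return value of None means that no "memory" line was found in the drop section of the text that was passed in.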

- In the ESXi log bundle file commands/vsipioctl_info.sh.txt, the output of "/bin/vsipioctl getmeminfo" shows a significant number of allocation failures in the "numFail" counter. Note that "inUse" in the example below shows close to 2 million states, but this may not always be the case if the offending VMs have since migrated off the ESXi host.

/bin/vsipioctl getmeminfo
Heap: vsip-module, max 2560 MB
<snip>

Heap: vsip-state, max 512 MB
zone 2: pfstatepl maxObj = 2000000, objSize = 624, alloc = 238361714, free = 236362756, inUse = 1998958, numFail = 15681698, totalMem = 1247349792 <<<<<<<<<<<<<<<<<<<<<<<
<snip>
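
The zone line can be reduced to two numbers: how full the state pool is and how many allocations failed. The sketch below assumes the "key = value" layout shown in the pfstatepl line above; parse_zone_line and zone_pressure are hypothetical helper names.

```python
import re

# Matches "name = 123" pairs such as "maxObj = 2000000".
PAIR = re.compile(r"(\w+)\s*=\s*(\d+)")


def parse_zone_line(line):
    """Return the integer counters found on a getmeminfo zone line."""
    return {key: int(value) for key, value in PAIR.findall(line)}


def zone_pressure(line):
    """Return (utilization, numFail), where utilization is inUse / maxObj."""
    counters = parse_zone_line(line)
    max_obj = counters.get("maxObj", 0)
    in_use = counters.get("inUse", 0)
    utilization = in_use / max_obj if max_obj else 0.0
    return utilization, counters.get("numFail", 0)
```

For the example line above, inUse / maxObj is 1998958 / 2000000, or about 99.9 percent, and inUse equals alloc minus free (238361714 - 236362756). A high "numFail" with low utilization is consistent with the note that the offending VMs may already have left the host.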

Resolution

This issue is specific to the environment and has no resolution. Follow the best practices listed under Workaround.

Workaround:
Recommendations:
1. Enable Flood Protection on the DFW and/or Edge Firewall.
2. Adjust or create a session timer profile that ages out idle TCP sessions more aggressively.
3. Use "TCP strict" for specific rule sections. This protects the system from flows that do not follow the TCP state machine.
4. Configure DNS security to help guard against DNS-related attacks.
5. Check for offending IPs that are creating a large number of sessions, validate whether those sessions are legitimate, and block them if they are not.
6. Plan toward a default "any any any deny" rule at the bottom of the firewall rules by subjecting part of the workloads to DFW and blocking unnecessary flows.
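
Recommendation 5 amounts to ranking sources by session count. The sketch below assumes the sessions have already been exported and parsed into (source IP, destination IP) pairs; that input format and the top_talkers name are hypothetical and do not correspond to the output of any specific NSX command.

```python
from collections import Counter


def top_talkers(flows, limit=5):
    """Rank source IPs by number of sessions.

    flows: iterable of (src_ip, dst_ip) pairs. Assumption: the caller has
    already parsed these from whatever session export is available.
    """
    counts = Counter(src for src, _dst in flows)
    return counts.most_common(limit)
```

The highest-ranked sources are only candidates. Each one still needs to be validated as legitimate or not before any block rule is added.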

Additional Information

Impact/Risks:

Partial/intermittent network connectivity issue for the existing workload on the host
Complete connectivity loss for the migrated VMs