This article helps identify a heap memory issue on the ESXi host related to the Distributed Firewall (DFW).
Symptoms:
VMware NSX-T Data Center 3.x
VMware NSX-T Data Center
- During vMotion of a VM, the following lines appear in /var/run/log/vmkernel.log:
2022-xx-xxT14:20:05.743Z cpu29:2097834)IMPORTTLVSTATES failed: 12 <<<<< VSIP running out of memory
2022-xx-xxT14:20:05.743Z cpu29:2097834)Failed to restore datapath state : Failure <<<<<
2022-xx-xxT14:20:05.744Z cpu29:2097834)DVFilter: 1309: Bringing down port due to failed DVFilter state restoration and failPolicy of FAIL_CLOSED. <<<
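The log check above can be sketched as a small Python scan. This is a minimal sketch rather than a supported tool: the function name `find_restore_failures` is hypothetical, and the match strings are copied from the excerpt above on the assumption that the wording is the same on other builds.

```python
# Message fragments taken from the vmkernel.log excerpt above.
SIGNATURES = (
    "IMPORTTLVSTATES failed: 12",
    "Failed to restore datapath state",
    "failed DVFilter state restoration",
)

def find_restore_failures(lines):
    """Return (line_number, signature) pairs for lines that contain a signature."""
    hits = []
    for number, line in enumerate(lines, start=1):
        for signature in SIGNATURES:
            if signature in line:
                hits.append((number, signature))
                break  # one hit per log line is enough
    return hits

sample = [
    "2022-xx-xxT14:20:05.743Z cpu29:2097834)IMPORTTLVSTATES failed: 12",
    "2022-xx-xxT14:20:05.743Z cpu29:2097834)Failed to restore datapath state : Failure",
    "2022-xx-xxT14:20:05.744Z cpu29:2097834)DVFilter: 1309: Bringing down port due to "
    "failed DVFilter state restoration and failPolicy of FAIL_CLOSED.",
]
hits = find_restore_failures(sample)
```

To scan a real log, pass an open file object (for example, the vmkernel.log extracted from the log bundle) instead of the sample list.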
- In commands/vsip_heap_stats.sh_-l.txt from the ESXi log bundle, at least one component shows 'leastPercentFree' as 0 (zero).
Here is an example for vsip-attr-0x431dfd000000:
{'name': 'vsip-attr', 'moduleID': 91, 'isDynamic': 1, 'physContigType': 1, 'lowerLimit': 0, 'upperLimit': -1, 'reserved': 0, 'memPool': 2989, 'ranges': 1, 'dlmallocOverhead': 1032, 'currentSize': 1342181800, 'initialSize': 8388608, 'currentAlloc': 10833608, 'currentAvail': 1331348192, 'currentReleasable': 3408, 'currentPercentFree': 99, 'currentPercentReleasable': 0, 'maximumSize': 1342181800, 'maximumAvail': 1331348192, 'maximumPercentFree': 99, 'leastPercentFree': 0, 'failedReqLogCount': 3119, 'numSucceededAllocations': 346527594, 'numFailedAllocations': 1558, 'numFreedAllocations': 346453602, 'avgAllocationSize': 134217728, 'numRequestsPerGrowth': 12, 'numGrowthOps': 3071, 'numShrinkOps': 8, 'pageSize': 4096}
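Because each heap record in the example is printed as a Python-style dictionary literal, it can be checked with a few lines of Python. The sketch below assumes one record per line in exactly that form. The helper name `summarize_heap` is hypothetical, and reporting 'numFailedAllocations' alongside 'leastPercentFree' is an addition for context, not part of the original check.

```python
import ast

def summarize_heap(record):
    """Parse one heap record and return (name, hit_zero_free, failed_allocations)."""
    stats = ast.literal_eval(record)  # safe parse of a dict literal, no code execution
    return (
        stats["name"],
        stats["leastPercentFree"] == 0,
        stats["numFailedAllocations"],
    )

# Abbreviated copy of the vsip-attr record shown above.
sample = (
    "{'name': 'vsip-attr', 'currentPercentFree': 99, "
    "'leastPercentFree': 0, 'numFailedAllocations': 1558}"
)
name, hit_zero_free, failed = summarize_heap(sample)
```

Note that in the example 'currentPercentFree' is 99 while 'leastPercentFree' is 0, so a check on current free space alone would miss the record.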
- In commands/vsipioctl_info.sh.txt from the ESXi log bundle, the output of "/bin/vsipioctl getfilterstat -f <nic-XXXXX-ethX-vmware-sfw.2>" for one or more VMs shows a large number of packet drops with the drop reason "memory".
Here is an example:
/bin/vsipioctl getfilterstat -f nic-xxxxx-eth0-vmware-sfw.2
PACKETS IN OUT
------- -- ---
<snip>
BYTES IN OUT
----- -- ---
<snip>
DROP REASON
-----------
memory: 1851657 <<<<< Packet drops with the drop reason as memory.
<snip>
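The memory drop counter can be pulled out of saved getfilterstat output with a short Python sketch. It assumes the counter appears on its own line as "memory: <number>", as in the example above. The function name `memory_drops` is hypothetical.

```python
import re

# Matches a line such as "memory: 1851657" in the DROP REASON section.
MEMORY_DROP = re.compile(r"^\s*memory:\s*(\d+)", re.MULTILINE)

def memory_drops(output):
    """Return the memory drop count, or None if the counter is not present."""
    match = MEMORY_DROP.search(output)
    return int(match.group(1)) if match else None

sample = """DROP REASON
-----------
memory: 1851657
"""
drops = memory_drops(sample)
```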
- In commands/vsipioctl_info.sh.txt from the ESXi log bundle, the output of "/bin/vsipioctl getmeminfo" shows significant allocation failures in the "numFail" counter. Note that "inUse" in the example below shows close to 2 million states, but this may not always be the case if the offending VMs have already migrated off the ESXi host.
/bin/vsipioctl getmeminfo
Heap: vsip-module, max 2560 MB
<snip>
Heap: vsip-state, max 512 MB
zone 2: pfstatepl maxObj = 2000000, objSize = 624, alloc = 238361714, free = 236362756, inUse = 1998958, numFail = 15681698, totalMem = 1247349792 <<<<<<<<<<<<<<<<<<<<<<<
<snip>
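The zone counters in the getmeminfo output can be read the same way. The following Python sketch assumes the "zone N: name key = value, ..." layout shown in the example. The helper name `parse_zone` and the utilization ratio are illustrative additions, not part of the vsipioctl tooling.

```python
import re

ZONE = re.compile(r"zone\s+(\d+):\s+(\S+)\s+(.*)")
FIELD = re.compile(r"(\w+)\s*=\s*(\d+)")

def parse_zone(line):
    """Parse one zone line into a dict of integer counters, or None if no match."""
    match = ZONE.search(line)
    if not match:
        return None
    fields = {key: int(value) for key, value in FIELD.findall(match.group(3))}
    fields["zone"] = int(match.group(1))
    fields["name"] = match.group(2)
    return fields

sample = (
    "zone 2: pfstatepl maxObj = 2000000, objSize = 624, alloc = 238361714, "
    "free = 236362756, inUse = 1998958, numFail = 15681698, totalMem = 1247349792"
)
zone = parse_zone(sample)
# Fraction of the zone's object limit currently in use.
utilization = zone["inUse"] / zone["maxObj"]
```

In the example, "inUse" equals "alloc" minus "free" and sits just under "maxObj". A non-zero "numFail" is still worth flagging when "inUse" is low, since the offending VMs may already have moved off the host.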
This issue is specific to the environment and has no resolution. Follow the best practices below.
Workaround:
A few recommendations:
1. Enable flood protection on the DFW and/or Edge Firewall. https://techdocs.broadcom.com/us/en/vmware-cis/nsx/vmware-nsx/3-2/administration-guide.html
2. Tune or create a session timer to aggressively age out idle TCP sessions.
3. Use "TCP strict" for specific rule sections; this can protect the system from flows that don't follow the TCP state machine.
4. Additionally, DNS security can be configured to help guard against DNS-related attacks.
5. Check for offending IPs that are creating a huge number of sessions, validate whether those sessions are legitimate, and block them if they are not.
6. Plan toward the desired "any any any deny" rule at the bottom of the firewall rules by subjecting part of the workloads to DFW and blocking unnecessary flows.