In an NSX environment where VMs are protected by the Distributed Firewall (DFW), a vMotion of a VM may cause the destination host to encounter a PCPU lockup. This can lead to vmnic(s) flapping for a few seconds before auto-recovering.
Symptoms / Observations:
1. On the ESXi host, the following errors/issues are observed:
a. Long vMotion import times (>10 seconds) observed in vmkernel logs
Below is an example of the log entries seen when a problem is encountered during vMotion import:
2025-03-13T19:52:11.599Z In(182) vmkernel: cpu2:2098325)Importing nic-#####-eth0-vmware-sfw.2, Version 1100
2025-03-13T19:52:11.599Z In(182) vmkernel: cpu2:2098325)VSIP module ioctls: disabled
2025-03-13T19:52:11.599Z In(182) vmkernel: cpu2:2098325)ImportStateTLV entry type 12, len 52, cnt 1
2025-03-13T19:52:11.599Z In(182) vmkernel: cpu2:2098325)Importing from source version RELEASEbuild-24105819
2025-03-13T19:52:11.599Z In(182) vmkernel: cpu2:2098325)ImportStateTLV entry type 1, len 15275496, cnt 951 <<<<< Note the timestamp of ImportStateTLV entry type 1
2025-03-13T19:52:11.600Z In(182) vmkernel: cpu112:2097291)NetPort: 708: Failed to acquire port non-exclusive lock 0x600008c[Failure].
.... <16 second gap>
2025-03-13T19:52:27.599Z In(182) vmkernel: cpu2:2098325)ImportStateTLV entry type 2, len 682296, cnt 1786 <<<< Note the timestamp of ImportStateTLV entry type 2
2025-03-13T19:52:27.600Z In(182) vmkernel: cpu2:2098325)configured filter nic-#####-eth0-vmware-sfw.2
2025-03-13T19:52:27.600Z In(182) vmkernel: cpu2:2098325)filter nic-#####-eth0-vmware-sfw.2 flushing flow cache
As seen above, importing the tables of IP addresses (i.e. entry type 1) took ~16 seconds.
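The ~16 second gap can be measured directly from the two ImportStateTLV timestamps. A minimal sketch, with the relevant log lines inlined as sample data (on a live host you would grep them out of /var/log/vmkernel.log instead):

```shell
# Measure the gap between the "entry type 1" and "entry type 2" ImportStateTLV
# log lines. The two lines below are taken from the excerpt above; on a host,
# replace them with lines grepped from /var/log/vmkernel.log.
log='2025-03-13T19:52:11.599Z vmkernel: cpu2:2098325)ImportStateTLV entry type 1, len 15275496, cnt 951
2025-03-13T19:52:27.599Z vmkernel: cpu2:2098325)ImportStateTLV entry type 2, len 682296, cnt 1786'

# Extract the HH:MM:SS portion of each timestamp.
t1=$(printf '%s\n' "$log" | grep 'entry type 1,' | grep -oE '[0-9]{2}:[0-9]{2}:[0-9]{2}' | head -n1)
t2=$(printf '%s\n' "$log" | grep 'entry type 2,' | grep -oE '[0-9]{2}:[0-9]{2}:[0-9]{2}' | head -n1)

# Convert HH:MM:SS to seconds since midnight.
secs() { echo "$1" | awk -F: '{print $1*3600 + $2*60 + $3}'; }

gap=$(( $(secs "$t2") - $(secs "$t1") ))
echo "import gap: ${gap}s"
```

Anything over roughly 10 seconds between these two entries matches the symptom described in #a.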
b. PCPU lockup messages, along with a backtrace showing the pfp_add_table_one_addr VSIP function call
2025-03-13T11:25:10.576Z Wa(180) vmkwarning: cpu125:2106800)WARNING: Heartbeat: 961: PCPU 46 didn't have a heartbeat for 6 seconds, timeout is 10, 1 IPIs sent; *may* be locked up.
2025-03-13T11:25:10.576Z In(182) vmkernel: cpu125:2106800)Heartbeat: 1014: Sending timer IPI to PCPU 46
2025-03-13T11:25:10.576Z In(182) vmkernel: cpu46:2098327)0x453ae061b4a0:[0x42000cc1124e][email protected]#1.0.8.0.24105819+0x3b stack: 0x434fedae0e40
2025-03-13T11:25:10.576Z In(182) vmkernel: cpu46:2098327)0x453ae061b530:[0x42000cc5e4a7][email protected]#1.0.8.0.24105819+0xdc stack: 0x434ff72e2900
2025-03-13T11:25:19.576Z Wa(180) vmkwarning: cpu97:2099893)WARNING: Heartbeat: 961: PCPU 46 didn't have a heartbeat for 15 seconds, timeout is 10, 2 IPIs sent; *may* be locked up
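To quickly check whether a host is logging these lockup warnings, the heartbeat messages can be tallied per PCPU. A sketch using the sample lines above inlined as data (on a host, grep /var/log/vmkernel.log instead):

```shell
# Count "didn't have a heartbeat" warnings per PCPU from a log excerpt.
# Sample lines are inlined below; on a host, read /var/log/vmkernel.log instead.
sample="2025-03-13T11:25:10.576Z Wa(180) vmkwarning: cpu125:2106800)WARNING: Heartbeat: 961: PCPU 46 didn't have a heartbeat for 6 seconds, timeout is 10, 1 IPIs sent; *may* be locked up.
2025-03-13T11:25:19.576Z Wa(180) vmkwarning: cpu97:2099893)WARNING: Heartbeat: 961: PCPU 46 didn't have a heartbeat for 15 seconds, timeout is 10, 2 IPIs sent; *may* be locked up"

counts=$(printf '%s\n' "$sample" | grep "didn't have a heartbeat" | grep -oE 'PCPU [0-9]+' | sort | uniq -c | sort -rn)
echo "$counts"
```

Repeated warnings for the same PCPU (as with PCPU 46 here) indicate the lockup condition rather than a one-off scheduling hiccup.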
c. "packets completion seems stuck, issuing reset" messages observed in vmkernel logs
2025-03-14T04:01:18.510Z Wa(180) vmkwarning: cpu98:2390608)WARNING: netschedHClk: NetSchedHClkWatchdogSysWorld:6769: vmnic4: packets completion seems stuck, issuing reset
2025-03-14T04:01:26.259Z Wa(180) vmkwarning: cpu71:2390678)WARNING: netschedHClk: NetSchedHClkWatchdogSysWorld:6769: vmnic10: packets completion seems stuck, issuing reset
Note: NetIOC was enabled on the ESXi hosts. The NetIOC watchdog timeout is 5 seconds, so while the lock mentioned above is held, the watchdog triggers a NIC reset once the timeout is reached. If NetIOC had been disabled, the NIC resets would not have occurred.
d. Multiple vmnic flaps observed in vobd.log; as noted in #c above, these occur as a result of NetIOC being enabled
2025-03-14T04:01:28.143Z In(14) vobd[2098148]: [netCorrelator] 45093545317us: [vob.net.dvport.uplink.transition.down] Uplink: vmnic4 is down. Affected dvPort: 93ceb750-2cdc-4def-b578-45ccc97c9d3d/## ## ## ## ## ## ##-## ## ## ## ## ## ## ##. 1 uplinks up. Failed criteria: 128
2025-03-14T04:01:28.284Z In(14) vobd[2098148]: [netCorrelator] 45093777479us: [vob.net.dvport.uplink.transition.down] Uplink: vmnic10 is down. Affected dvPort: 39cdf7f4-9235-4e96-8729-6028d2eb0e82/## ## ## ## ## ## ##-## ## ## ## ## ## ## ##. 0 uplinks up. Failed criteria: 128
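To see how many times each uplink flapped, the down events in vobd.log can be tallied per vmnic. A sketch with shortened sample lines inlined (on a host, read /var/log/vobd.log instead):

```shell
# Tally uplink-down events per vmnic from a vobd.log excerpt.
# Sample lines (shortened) are inlined; on a host, read /var/log/vobd.log instead.
sample='2025-03-14T04:01:28.143Z In(14) vobd[2098148]: [netCorrelator] [vob.net.dvport.uplink.transition.down] Uplink: vmnic4 is down.
2025-03-14T04:01:28.284Z In(14) vobd[2098148]: [netCorrelator] [vob.net.dvport.uplink.transition.down] Uplink: vmnic10 is down.'

flaps=$(printf '%s\n' "$sample" | grep 'uplink\.transition\.down' | grep -oE 'Uplink: vmnic[0-9]+' | awk '{print $2}' | sort | uniq -c | sort -rn)
echo "$flaps"
```

A high count concentrated in a short time window, correlated with the netschedHClk resets in #c, points at this issue rather than a physical link problem.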
e. VMs with a high number of rules applied to their NIC dvfilter
Note: In this scenario, VMs with >1500 rules exhibited the issue, but this number can be lower or higher.
To calculate rule count per vnic:
vsipioctl getrules -f <filter_name> | grep -E "rule.*at" | wc -l
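The same count can be run against saved output when working from a log bundle. A sketch against an inlined sample (the rule line shape below is an assumption for illustration; capture real output with the command above first):

```shell
# Count DFW rules in saved `vsipioctl getrules -f <filter_name>` output.
# The two rule lines below are a hypothetical sample of the output shape;
# on a host, capture the real output first, e.g.:
#   vsipioctl getrules -f <filter_name> > /tmp/rules.txt
rules='rule 1001 at 1 inout protocol any from any to any accept;
rule 1002 at 2 inout protocol tcp from addrset src-1 to addrset dst-1 accept;'

count=$(printf '%s\n' "$rules" | grep -cE 'rule.*at')
echo "rule count: $count"
```

The `grep -cE "rule.*at"` expression is the same one used in the command above; `-c` just replaces the `wc -l` stage.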
f. A large number of duplicate IPs (>90%) in the realized addrsets on the ESXi host
Here is an example:
$ vsipioctl getaddrsets -f nic-#####-ethX-vmware-sfw.2 | grep '^ip ' | wc -l
904039  <=== Total IPs in address sets
$ vsipioctl getaddrsets -f nic-#####-ethX-vmware-sfw.2 | grep '^ip ' | sort | uniq -c | sort -nrk 1 | grep -v " 1 " | awk '{sum+=$1} END{ print sum}'
897310  <=== Repeated IPs in the address sets
More than 99% of the IPs are repeated.
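The two counts can be combined to get the duplicate percentage cited here. A minimal sketch using the numbers from the example output above (on a host, the two `vsipioctl getaddrsets` pipelines would feed these variables directly):

```shell
# Compute the percentage of repeated IPs from the two counts shown above.
total=904039      # total IPs across all address sets
repeated=897310   # IPs that appear more than once
pct=$(awk -v t="$total" -v r="$repeated" 'BEGIN { printf "%.1f", r / t * 100 }')
echo "duplicate IPs: ${pct}%"
```

Anything near the >90% mark mentioned in #f is consistent with the overlapping NSGroup definitions described in item 2 below.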
2. In NSX Manager, NSGroup definitions overlap with or are identical to other NSGroups. In this case, the same child groups were used in multiple parent NSGroups, and these parent NSGroups are referenced in several DFW rules.
Affected Versions:
VMware NSX 4.2.0.x, 4.2.1.x
Resolution:
Upgrade to NSX 4.2.2.1, NSX 4.2.3, or NSX 9.0.1.0.
Workaround: