Multiple production VMs report Network and Storage performance issues
High ping latency observed between the VTEPs of ESXi hosts in the affected clusters
BGP flaps reported intermittently between NSX edge and uplink switch
The root cause is rule stats collection, introduced in NSX 6.4.2 and enabled by default, which supplies the firewall "Rule hit Stats" information shown in the NSX DFW UI.
Any NSX event that results in a firewall publish eventually triggers stats collection and, with it, network latency.
The rule stats feature is triggered every 5 minutes. During collection, a PCPU lockup occurs on the datapath, which causes the latency.
This feature can be disabled at the Management Plane (MP), but in releases up to and including 6.4.4 it cannot be disabled at the data plane.
PCPU lockups are also triggered during a rule publish event, which causes similar latency.
Graphical representation of DFW Rule stats Feature on NSX manager:
Log messages relevant to the issue
The following logs can be checked on the impacted ESXi hosts.
The vsfwd logs show a firewall publish event, which triggers rule stats collection, followed by rule hit count entries; this collection eventually causes the latency.
File: vsfwd.log, Location: /var/run/log
2019-05-02T15:42:00Z vsfwd: [INFO] Applied RuleSet 1556795924921 for all vnics
2019-05-02T15:42:00Z vsfwd: [INFO] Compressed config data from 3130764 to 406478 bytes
2019-05-02T15:42:00Z vsfwd: [INFO] Successfully saved config to file /etc/vmware/vsfwd/vsipfw_ruleset.dat
2019-05-02T15:42:03Z vsfwd: [INFO] Filter nic-81423981-eth0-vmware-sfw.2 has 16662 rule hit counts of gennum 1556795924921
2019-05-02T15:42:03Z vsfwd: [INFO] created rule hit count of filter nic-81423981-eth0-vmware-sfw.2
2019-05-02T15:42:06Z vsfwd: [INFO] Filter nic-81421968-eth0-vmware-sfw.2 has 16662 rule hit counts of gennum 1556795924921
2019-05-02T15:42:06Z vsfwd: [INFO] created rule hit count of filter nic-81421968-eth0-vmware-sfw.2
2019-05-02T15:42:09Z vsfwd: [INFO] Filter nic-81422860-eth0-vmware-sfw.2 has 16662 rule hit counts of gennum 1556795924921
2019-05-02T15:42:09Z vsfwd: [INFO] created rule hit count of filter nic-81422860-eth0-vmware-sfw.2
Note: Rule stats collection is initiated whenever a rule publish happens at the data plane.
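To check a longer vsfwd.log for the same pattern, the publish and hit-count entries can be paired by generation number. The following is a minimal sketch, not a supported tool: the regular expressions are written only against the log lines quoted above, and the function name is hypothetical.

```python
import re
from collections import defaultdict

# Patterns derived from the vsfwd.log lines quoted in this article (assumption:
# other NSX builds may format these messages differently).
PUBLISH_RE = re.compile(r"^(\S+) vsfwd: \[INFO\] Applied RuleSet (\d+) for all vnics")
HITS_RE = re.compile(
    r"^(\S+) vsfwd: \[INFO\] Filter (\S+) has (\d+) rule hit counts of gennum (\d+)"
)

def summarize_vsfwd(lines):
    """Pair each published ruleset generation with the stats collections that follow it."""
    publishes = {}                    # gennum -> timestamp of the publish
    collections = defaultdict(list)   # gennum -> [(timestamp, filter name, hit count)]
    for line in lines:
        m = PUBLISH_RE.match(line)
        if m:
            publishes[m.group(2)] = m.group(1)
            continue
        m = HITS_RE.match(line)
        if m:
            ts, filter_name, count, gennum = m.groups()
            collections[gennum].append((ts, filter_name, int(count)))
    return publishes, dict(collections)

SAMPLE = """\
2019-05-02T15:42:00Z vsfwd: [INFO] Applied RuleSet 1556795924921 for all vnics
2019-05-02T15:42:03Z vsfwd: [INFO] Filter nic-81423981-eth0-vmware-sfw.2 has 16662 rule hit counts of gennum 1556795924921
2019-05-02T15:42:06Z vsfwd: [INFO] Filter nic-81421968-eth0-vmware-sfw.2 has 16662 rule hit counts of gennum 1556795924921
2019-05-02T15:42:09Z vsfwd: [INFO] Filter nic-81422860-eth0-vmware-sfw.2 has 16662 rule hit counts of gennum 1556795924921
"""

publishes, collections = summarize_vsfwd(SAMPLE.splitlines())
for gennum, published_at in publishes.items():
    hits = collections.get(gennum, [])
    print(f"RuleSet {gennum} published at {published_at}: {len(hits)} filter stats collections")
```

On a host, the same function could be fed the lines of /var/run/log/vsfwd.log copied off the host; a generation with many collections shortly after its publish matches the pattern described in the note.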
The hostd logs show the following PCPU lockup warnings:
File: hostd.log, Location: /var/run/log
2019-05-05T03:00:11.762Z cpu6:65575)WARNING: Heartbeat: 794: PCPU 5 didn't have a heartbeat for 7 seconds; *may* be locked up.
2019-05-05T03:00:32.763Z cpu7:856683)WARNING: Heartbeat: 794: PCPU 5 didn't have a heartbeat for 7 seconds; *may* be locked up.
2019-05-05T03:04:23.769Z cpu0:885462)WARNING: Heartbeat: 794: PCPU 7 didn't have a heartbeat for 8 seconds; *may* be locked up.
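To see which PCPUs are affected and for how long, the heartbeat warnings can be extracted and tallied. This is an illustrative sketch only: the pattern is based solely on the three warnings quoted above, and the function name is hypothetical.

```python
import re
from collections import Counter

# Pattern derived from the heartbeat warnings quoted in this article.
HEARTBEAT_RE = re.compile(
    r"^(\S+) cpu\d+:\d+\)WARNING: Heartbeat: \d+: PCPU (\d+) "
    r"didn't have a heartbeat for (\d+) seconds"
)

def pcpu_lockups(lines):
    """Return (timestamp, pcpu, stalled_seconds) for every heartbeat warning found."""
    events = []
    for line in lines:
        m = HEARTBEAT_RE.match(line)
        if m:
            events.append((m.group(1), int(m.group(2)), int(m.group(3))))
    return events

SAMPLE = """\
2019-05-05T03:00:11.762Z cpu6:65575)WARNING: Heartbeat: 794: PCPU 5 didn't have a heartbeat for 7 seconds; *may* be locked up.
2019-05-05T03:00:32.763Z cpu7:856683)WARNING: Heartbeat: 794: PCPU 5 didn't have a heartbeat for 7 seconds; *may* be locked up.
2019-05-05T03:04:23.769Z cpu0:885462)WARNING: Heartbeat: 794: PCPU 7 didn't have a heartbeat for 8 seconds; *may* be locked up.
"""

events = pcpu_lockups(SAMPLE.splitlines())
per_pcpu = Counter(pcpu for _, pcpu, _ in events)   # number of warnings per PCPU
longest_stall = max(seconds for _, _, seconds in events)
print(per_pcpu, longest_stall)
```

Comparing the timestamps returned here with the publish times in vsfwd.log is one way to confirm that the lockups line up with rule publish or stats collection events.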
The vmkwarning logs show Tx hangs:
File: vmkwarning.log, Location: /var/run/log
$ grep -i "Tx hang" vmkwarning.*
vmkwarning.log:2019-04-28T10:29:48.929Z cpu46:65757)WARNING: i40en: i40en_SetResetFlags:10801: Tx hang detected, device resetting
vmkwarning.log:2019-04-28T10:30:00.136Z cpu74:65757)WARNING: i40en: i40en_SetResetFlags:10801: Tx hang detected, device resetting
vmkwarning.log:2019-04-29T04:58:51.730Z cpu56:65757)WARNING: i40en: i40en_SetResetFlags:10801: Tx hang detected, device resetting
Rule stats collection has to be disabled end to end (on NSX Manager and on the ESXi hosts) to avoid PCPU lockups and the resulting latency.
• NSX 6.4.4 does not allow the feature to be disabled locally at the ESXi host level; stats collection can only be disabled at the NSX Manager level.
• The ability to disable the feature both locally and globally is available only from 6.4.5, so the best way to avoid this issue is to plan an upgrade to 6.4.5.
• A fix for the latency impact during rule stats collection is scheduled for the next release of NSX.
The following workaround is recommended for NSX Manager and hosts running 6.4.4:
Steps to disable the 5-minute rule stats collection from NSX Manager:
1. Take a configuration backup of NSX Manager using the steps in the following link: Back Up NSX Manager Data
2. Get the current "Firewall Stats Collection" setting from NSX Manager using the following API:
GET https://{{MGR_IP}}/api/4.0/firewall/config/globalconfiguration
<globalConfiguration>
<layer3RuleOptimize>false</layer3RuleOptimize>
<layer2RuleOptimize>true</layer2RuleOptimize>
<tcpStrictOption>false</tcpStrictOption>
<ruleStatsDisabled>false</ruleStatsDisabled>
</globalConfiguration>
3. Disable "Firewall Stats Collection" by changing the "ruleStatsDisabled" parameter to true using the following API request:
PUT https://{{MGR_IP}}/api/4.0/firewall/config/globalconfiguration
<ruleStatsDisabled>true</ruleStatsDisabled>
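Steps 2 and 3 can be scripted roughly as follows. This is a sketch under stated assumptions, not a verified procedure: it assumes the PUT accepts the full globalConfiguration document returned by the GET with only ruleStatsDisabled changed, that NSX Manager uses HTTP basic authentication, and that the body is sent as application/xml. The manager address, user name, password, and function names are placeholders. The request is only built here, not sent.

```python
import base64
import urllib.request
import xml.etree.ElementTree as ET

API_PATH = "/api/4.0/firewall/config/globalconfiguration"

def rule_stats_disabled(config_xml):
    """Return True if the globalConfiguration XML has ruleStatsDisabled set to true."""
    value = ET.fromstring(config_xml).findtext("ruleStatsDisabled")
    return value is not None and value.strip().lower() == "true"

def with_rule_stats_disabled(config_xml):
    """Return the same configuration with ruleStatsDisabled set to true."""
    root = ET.fromstring(config_xml)
    node = root.find("ruleStatsDisabled")
    if node is None:
        node = ET.SubElement(root, "ruleStatsDisabled")
    node.text = "true"
    return ET.tostring(root, encoding="unicode")

def build_put_request(mgr_ip, user, password, body_xml):
    """Build (but do not send) the PUT request. Basic auth and the XML content type are assumptions."""
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        f"https://{mgr_ip}{API_PATH}",
        data=body_xml.encode("utf-8"),
        method="PUT",
        headers={"Content-Type": "application/xml", "Authorization": f"Basic {token}"},
    )

# Response from step 2, as shown in this article.
CURRENT_CONFIG = """<globalConfiguration>
<layer3RuleOptimize>false</layer3RuleOptimize>
<layer2RuleOptimize>true</layer2RuleOptimize>
<tcpStrictOption>false</tcpStrictOption>
<ruleStatsDisabled>false</ruleStatsDisabled>
</globalConfiguration>"""

new_body = with_rule_stats_disabled(CURRENT_CONFIG)
# 192.0.2.10, "admin" and the password are placeholder values.
request = build_put_request("192.0.2.10", "admin", "example-password", new_body)
# To issue the call, an operator would pass `request` to urllib.request.urlopen().
```

Carrying over the other fields from the GET response unchanged avoids altering layer3RuleOptimize, layer2RuleOptimize, or tcpStrictOption by accident. After the PUT, repeating the GET from step 2 and checking rule_stats_disabled() on the response confirms the change.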