Intermittent latency when using NSX 6.4.2 or above and a large amount of Distributed Firewall rules

Article ID: 327299

Products

VMware NSX

Issue/Introduction

Symptoms:
  • Intermittent latency is observed on (some or all):
    • ESXi host vmk interfaces (Management, vMotion, storage, vSAN vmk interfaces etc.).
    • Virtual Machines.
  • NSX Distributed Firewall rule statistics collection is running when the latency occurs, as shown in the ESXi host logs (vsfwd.log):
2019-05-02T15:42:03Z vsfwd: [INFO] Filter nic-81423981-eth0-vmware-sfw.2 has 16662 rule hit counts of gennum 1556795924921
2019-05-02T15:42:03Z vsfwd: [INFO] created rule hit count of filter nic-81423981-eth0-vmware-sfw.2
2019-05-02T15:42:06Z vsfwd: [INFO] Filter nic-81421968-eth0-vmware-sfw.2 has 16662 rule hit counts of gennum 1556795924921
  • ESXi host logs (hostd.log) display message(s) similar to:
2019-05-02T15:42:39Z cpu73:82996231)WARNING: Heartbeat: 794: PCPU 44 didn't have a heartbeat for 7 seconds; *may* be locked up.
  • ESXi host logs (vmkwarning.log) may display messages similar to the following (the exact message depends on the NIC driver):
vmkwarning.log:2019-05-02T15:42:00.929Z cpu46:65757)WARNING: i40en: i40en_SetResetFlags:10801: Tx hang detected, device resetting
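
To check whether latency spikes line up with rule statistics collection, the vsfwd.log entries can be extracted and compared against the time of the spikes. The following is a minimal sketch in Python. It assumes the log lines follow the format shown in the excerpt above; the function name and the sample text are illustrative only.

```python
import re
from datetime import datetime, timezone

# Matches the "rule hit counts" message format shown in the vsfwd.log excerpt.
# Assumption: other NSX builds may word this message differently.
HIT_COUNT_RE = re.compile(
    r"(?P<ts>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})Z vsfwd: \[INFO\] "
    r"Filter (?P<filter>\S+) has (?P<count>\d+) rule hit counts "
    r"of gennum (?P<gennum>\d+)"
)


def find_stats_collections(log_text):
    """Return (timestamp, filter name, rule hit count) per collection event."""
    events = []
    for match in HIT_COUNT_RE.finditer(log_text):
        ts = datetime.strptime(match.group("ts"), "%Y-%m-%dT%H:%M:%S")
        ts = ts.replace(tzinfo=timezone.utc)
        events.append((ts, match.group("filter"), int(match.group("count"))))
    return events


# Sample text copied from the log excerpt in this article.
sample = (
    "2019-05-02T15:42:03Z vsfwd: [INFO] Filter nic-81423981-eth0-vmware-sfw.2 "
    "has 16662 rule hit counts of gennum 1556795924921\n"
    "2019-05-02T15:42:06Z vsfwd: [INFO] Filter nic-81421968-eth0-vmware-sfw.2 "
    "has 16662 rule hit counts of gennum 1556795924921\n"
)

events = find_stats_collections(sample)
for ts, name, count in events:
    print(ts.isoformat(), name, count)
```

Timestamps that fall within a few seconds of the heartbeat or Tx hang warnings suggest that the statistics collection is involved.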

Environment

VMware NSX Data Center for vSphere 6.4.x

Cause

The latency is caused by the NSX Distributed Firewall rule statistics collection, which was introduced in NSX 6.4.2.
This feature tracks the number of hits per NSX Distributed Firewall rule.
It runs every 5 minutes and whenever a DFW publish operation is triggered. As an unexpected side effect, the rule statistics collection causes network RX and TX operations on the ESXi host to hang briefly, which impacts all vmk interfaces and Virtual Machines on that host.
The latency depends on the number of Virtual Machines and the number of DFW rules per Virtual Machine.

Resolution

This issue is resolved in NSX 6.4.6.

Issue 2337437 is documented in the NSX for vSphere 6.4.6 release notes: https://docs.vmware.com/en/VMware-NSX-Data-Center-for-vSphere/6.4/rn/releasenotes_nsx_vsphere_646.html

Workaround:
NSX Distributed Firewall rule statistics collection can be disabled through the REST API.
In NSX 6.4.2, 6.4.3 and 6.4.4, disabling it stops only the periodic (every 5 minutes) rule statistics collection. The rule statistics collection triggered by NSX Distributed Firewall publish operations cannot be disabled in those versions.
In NSX 6.4.5, disabling it stops both the periodic (every 5 minutes) rule statistics collection and the collection triggered by NSX Distributed Firewall publish operations.
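
The version-dependent behavior can be summarized in a small helper. This is only an illustrative sketch: the function name and the returned strings are made up for this example, and the version boundaries are taken from this article.

```python
def disable_scope(version):
    """Describe what setting ruleStatsDisabled to true covers on an NSX release.

    Hypothetical helper; expects a version string such as "6.4.3".
    """
    major, minor, patch = (int(part) for part in version.split(".")[:3])
    release = (major, minor, patch)
    if release < (6, 4, 2):
        # Rule statistics collection was introduced in 6.4.2.
        return "rule statistics collection not present"
    if release <= (6, 4, 4):
        return "periodic collection only"
    if release == (6, 4, 5):
        return "periodic and publish-triggered collection"
    # 6.4.6 and later contain the fix, so the workaround is not needed.
    return "issue resolved, workaround not needed"


for v in ("6.4.3", "6.4.5", "6.4.6"):
    print(v, "->", disable_scope(v))
```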

To disable NSX Distributed Firewall rule statistics collection, follow the steps below:

1. Retrieve the current DFW global configuration:
GET /api/4.0/firewall/config/globalconfiguration

Example of expected output:
<globalConfiguration>
  <layer3RuleOptimize>false</layer3RuleOptimize>
  <layer2RuleOptimize>true</layer2RuleOptimize>
  <tcpStrictOption>false</tcpStrictOption>
  <ruleStatsDisabled>false</ruleStatsDisabled>
</globalConfiguration>
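
Step 1 can be scripted as follows. This is a minimal sketch using only the Python standard library. The NSX Manager address and credentials are placeholders, HTTP basic authentication is assumed, and the NSX Manager certificate is assumed to be trusted by the client. The function names are illustrative.

```python
import base64
import urllib.request
import xml.etree.ElementTree as ET

CONFIG_PATH = "/api/4.0/firewall/config/globalconfiguration"


def fetch_global_config(manager, user, password):
    """GET the DFW global configuration XML (placeholder host and credentials)."""
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    request = urllib.request.Request(
        f"https://{manager}{CONFIG_PATH}",
        headers={"Authorization": f"Basic {token}", "Accept": "application/xml"},
    )
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")


def rule_stats_disabled(config_xml):
    """Return True if <ruleStatsDisabled> is present and set to true."""
    root = ET.fromstring(config_xml)
    value = root.findtext("ruleStatsDisabled")
    return value is not None and value.strip().lower() == "true"


# Offline check against the example output shown above.
sample_config = """<globalConfiguration>
  <layer3RuleOptimize>false</layer3RuleOptimize>
  <layer2RuleOptimize>true</layer2RuleOptimize>
  <tcpStrictOption>false</tcpStrictOption>
  <ruleStatsDisabled>false</ruleStatsDisabled>
</globalConfiguration>"""

print("ruleStatsDisabled:", rule_stats_disabled(sample_config))
```

Against a live NSX Manager, the call would look like `rule_stats_disabled(fetch_global_config("nsxmgr.example.com", "admin", "..."))`, where the hostname and credentials are hypothetical.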


2. Push the DFW global configuration with "<ruleStatsDisabled>true</ruleStatsDisabled>", leaving the other settings as retrieved in step 1:

PUT /api/4.0/firewall/config/globalconfiguration

Example of expected input:
<globalConfiguration>
  <layer3RuleOptimize>false</layer3RuleOptimize>
  <layer2RuleOptimize>true</layer2RuleOptimize>
  <tcpStrictOption>false</tcpStrictOption>
  <ruleStatsDisabled>true</ruleStatsDisabled>
</globalConfiguration>
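
Step 2 can be scripted in the same way. The sketch below takes the XML returned by the GET request, changes only the ruleStatsDisabled element, and sends the result back with PUT, so that the other global settings keep their current values. The host, credentials and function names are placeholders; HTTP basic authentication and a trusted NSX Manager certificate are assumed.

```python
import base64
import urllib.request
import xml.etree.ElementTree as ET

CONFIG_PATH = "/api/4.0/firewall/config/globalconfiguration"


def with_rule_stats_disabled(config_xml):
    """Return the configuration XML with <ruleStatsDisabled> set to true."""
    root = ET.fromstring(config_xml)
    node = root.find("ruleStatsDisabled")
    if node is None:
        # Assumption: add the element if the GET response did not include it.
        node = ET.SubElement(root, "ruleStatsDisabled")
    node.text = "true"
    return ET.tostring(root, encoding="unicode")


def put_global_config(manager, user, password, config_xml):
    """PUT the configuration XML to NSX Manager and return the HTTP status."""
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    request = urllib.request.Request(
        f"https://{manager}{CONFIG_PATH}",
        data=config_xml.encode("utf-8"),
        method="PUT",
        headers={
            "Authorization": f"Basic {token}",
            "Content-Type": "application/xml",
        },
    )
    with urllib.request.urlopen(request) as response:
        return response.status


# Offline check using the example output from step 1.
current = """<globalConfiguration>
  <layer3RuleOptimize>false</layer3RuleOptimize>
  <layer2RuleOptimize>true</layer2RuleOptimize>
  <tcpStrictOption>false</tcpStrictOption>
  <ruleStatsDisabled>false</ruleStatsDisabled>
</globalConfiguration>"""

updated = with_rule_stats_disabled(current)
print(updated)
```

After the PUT request, repeating the GET request from step 1 is a simple way to confirm that the change was applied.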