After NSX-v upgrade from 6.3.4 to 6.4.4 VMs running on vSAN clusters start reporting performance issues



Article ID: 326458


Products

VMware NSX, VMware vSAN

Issue/Introduction

Symptoms:
  • Multiple production VMs report network and storage performance issues
  • High write latency observed in the vSAN cluster performance graphs
  • High ping latency observed between ESXi hosts in the affected clusters for the VTEPs
  • Intermittent latency from VMkernel (vmk) interfaces to the immediate Layer-3 physical gateway
  • BGP flaps reported intermittently between the NSX Edge and the uplink switch



Cause

  • The root cause of this issue is the rule stats collection feature introduced in NSX 6.4.2, which is enabled by default and populates the "Rule hit Stats" information in the NSX DFW UI.

  • Any NSX event that results in a firewall publish also triggers stats collection, which in turn can lead to network latency.

  • Rule stats collection is triggered every 5 minutes. During collection, a PCPU lockup occurs on the datapath, which causes the latency.
    This feature can be disabled at the Management Plane (MP) but cannot be disabled at the data plane in releases up to and including 6.4.4.

  • PCPU lockups are also triggered during rule publish events, which causes similar latency.

    Graphical representation of the DFW rule stats feature on NSX Manager:

Log messages relevant to the issue

The following logs can be checked on the impacted ESXi hosts.

In the vsfwd logs we see a firewall publish event, which acts as the trigger for rule stats collection; the rule hit stats collection then leads to the latency:

File: vsfwd.log, Location: /var/run/log

2019-05-02T15:42:00Z vsfwd: [INFO] Applied RuleSet 1556795924921 for all vnics
2019-05-02T15:42:00Z vsfwd: [INFO] Compressed config data from 3130764 to 406478 bytes
2019-05-02T15:42:00Z vsfwd: [INFO] Successfully saved config to file /etc/vmware/vsfwd/vsipfw_ruleset.dat

2019-05-02T15:42:03Z vsfwd: [INFO] Filter nic-81423981-eth0-vmware-sfw.2 has 16662 rule hit counts of gennum 1556795924921
2019-05-02T15:42:03Z vsfwd: [INFO] created rule hit count of filter nic-81423981-eth0-vmware-sfw.2
2019-05-02T15:42:06Z vsfwd: [INFO] Filter nic-81421968-eth0-vmware-sfw.2 has 16662 rule hit counts of gennum 1556795924921
2019-05-02T15:42:06Z vsfwd: [INFO] created rule hit count of filter nic-81421968-eth0-vmware-sfw.2
2019-05-02T15:42:09Z vsfwd: [INFO] Filter nic-81422860-eth0-vmware-sfw.2 has 16662 rule hit counts of gennum 1556795924921
2019-05-02T15:42:09Z vsfwd: [INFO] created rule hit count of filter nic-81422860-eth0-vmware-sfw.2

Note: Rule stats collection is initiated whenever a rule publish happens at the data plane.
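As a quick check, the rule-hit log lines above can be summarised per vNIC filter. The helper below is a sketch that assumes the vsfwd.log line format shown above (the filter name is the fifth whitespace-separated field); run it against /var/run/log/vsfwd.log on an impacted host.

```shell
# Sketch: count rule-hit-count collection events per vNIC filter.
# Assumes the vsfwd.log line format shown above.
count_rule_hits() {
  grep "rule hit counts" | awk '{print $5}' | sort | uniq -c | sort -rn
}

# Usage on an impacted host (not run here):
#   count_rule_hits < /var/run/log/vsfwd.log
```

A high, steadily repeating count per filter every 5 minutes is consistent with the collection cycle described in the Cause section.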

From the hostd logs we see the following PCPU lockup warnings:

File: hostd.log, Location: /var/run/log

2019-05-05T03:00:11.762Z cpu6:65575)WARNING: Heartbeat: 794: PCPU 5 didn't have a heartbeat for 7 seconds; *may* be locked up.
2019-05-05T03:00:32.763Z cpu7:856683)WARNING: Heartbeat: 794: PCPU 5 didn't have a heartbeat for 7 seconds; *may* be locked up.
2019-05-05T03:04:23.769Z cpu0:885462)WARNING: Heartbeat: 794: PCPU 7 didn't have a heartbeat for 8 seconds; *may* be locked up.
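To see which PCPUs are affected and how often, the warnings above can be summarised with a small helper. This is a sketch that assumes the heartbeat warning format shown above:

```shell
# Sketch: count missed-heartbeat warnings per PCPU.
# Assumes the "PCPU N didn't have a heartbeat" format shown above.
pcpu_lockups() {
  grep "have a heartbeat" | sed -n "s/.*PCPU \([0-9]*\) didn.*/\1/p" | sort | uniq -c
}

# Usage on an impacted host (not run here):
#   pcpu_lockups < /var/run/log/hostd.log
```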

From the vmkwarning logs we observe TX hangs:

File: vmkwarning.log, Location: /var/run/log

$ grep -i "Tx hang" vmkwarning.*
vmkwarning.log:2019-04-28T10:29:48.929Z cpu46:65757)WARNING: i40en: i40en_SetResetFlags:10801: Tx hang detected, device resetting
vmkwarning.log:2019-04-28T10:30:00.136Z cpu74:65757)WARNING: i40en: i40en_SetResetFlags:10801: Tx hang detected, device resetting
vmkwarning.log:2019-04-29T04:58:51.730Z cpu56:65757)WARNING: i40en: i40en_SetResetFlags:10801: Tx hang detected, device resetting

Resolution

Rule stats collection has to be disabled end to end (NSX Manager and ESXi hosts) to avoid PCPU lockups and the subsequent latency.

• NSX 6.4.4 does not allow the feature to be disabled locally at the ESXi host level; stats collection can only be disabled at the NSX Manager level.

• The fix that allows disabling the feature both locally and globally is only available in 6.4.5, so the best way forward to avoid this issue is to plan an upgrade to 6.4.5.

• A fix for the latency impact during rule stats collection is scheduled for the next release of NSX.


Workaround:

The following workaround is recommended for NSX Manager and hosts running 6.4.4.

Steps to disable the 5-minute rule stats collection from NSX Manager:

1. Take a configuration backup of NSX Manager using the steps in the following link: Back Up NSX Manager Data

2. Get the current value of the "Firewall Stats Collection" parameter from NSX Manager using the following API call:

GET https://{{MGR_IP}}/api/4.0/firewall/config/globalconfiguration

<globalConfiguration>
<layer3RuleOptimize>false</layer3RuleOptimize>
<layer2RuleOptimize>true</layer2RuleOptimize>
<tcpStrictOption>false</tcpStrictOption>
<ruleStatsDisabled>false</ruleStatsDisabled>
</globalConfiguration>
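As a sketch of how to read that response from the command line (credentials, curl flags, and MGR_IP are placeholders for your environment):

```shell
# Extract the ruleStatsDisabled value from the globalconfiguration XML.
# Sketch assuming the response format shown above.
get_rule_stats_flag() {
  sed -n 's|.*<ruleStatsDisabled>\(.*\)</ruleStatsDisabled>.*|\1|p'
}

# Usage against a live NSX Manager (not run here; -k skips certificate
# verification and is for lab use only):
#   curl -k -u admin "https://MGR_IP/api/4.0/firewall/config/globalconfiguration" \
#     | get_rule_stats_flag
```

A value of `false` means rule stats collection is still enabled.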

3. Disable Firewall Stats Collection by changing the "ruleStatsDisabled" parameter to true using the following API request:

PUT https://{{MGR_IP}}/api/4.0/firewall/config/globalconfiguration

<ruleStatsDisabled>true</ruleStatsDisabled>
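Whether the API accepts a partial body or requires the full globalConfiguration element may depend on the NSX version; a conservative approach is to save the GET response from step 2, flip the flag, and PUT the whole document back. The sketch below follows that pattern (file names, credentials, and MGR_IP are illustrative):

```shell
# Flip ruleStatsDisabled from false to true in a saved copy of the
# configuration XML. Sketch; assumes the element appears exactly as
# shown in the GET response above.
disable_rule_stats() {
  sed 's|<ruleStatsDisabled>false</ruleStatsDisabled>|<ruleStatsDisabled>true</ruleStatsDisabled>|'
}

# Usage (not run here): save the GET response, rewrite it, PUT it back.
#   disable_rule_stats < globalconfig.xml > newconfig.xml
#   curl -k -u admin -X PUT -H "Content-Type: application/xml" \
#     --data @newconfig.xml \
#     "https://MGR_IP/api/4.0/firewall/config/globalconfiguration"
```

After the PUT, repeating the GET from step 2 should show `<ruleStatsDisabled>true</ruleStatsDisabled>`.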