DFW memory very high critical alarm generated for host TNs: vsip-attr usage has reached 75% or more.

Article ID: 367114

Products

VMware vDefend Firewall

VMware vDefend Firewall with Advanced Threat Prevention

Issue/Introduction

Symptoms:

  • Alarms are generated stating "The DFW Memory usage vsip-attr on Transport Node <UUID-of-TN> has reached X% which is at or above the very high threshold value of 75%." 

Environment

NSX-T version 3.2.X and above

NSX-T version 4.x and above

Cause

This occurs when Layer 7 (L7) Distributed Firewall (DFW) rules process a high volume of traffic, causing the vsip-attr heap usage to exceed thresholds.

Example:

An environment with L7 DNS rules applied to heavily trafficked Domain Controller VMs may experience this issue. The large number of DNS queries inspected by the L7 rule can cause vsip-attr heap usage to exceed the 75% threshold.

Resolution

Workaround

  • Add affected VMs to the DFW exclusion list to immediately alleviate the issue.

Permanent Fix

  • Reduce the number of L7 DFW rules.
  • Reduce the scope of the L7 DFW rules to limit the volume of traffic inspected.

Additional Information

To find the affected VM, identify its slot 2 NIC filter and check that filter's "L7 attr error" counter. Use the steps below to find the affected VM(s); you may have to run them against multiple VMs for a full comparison.

SSH into the ESXi host that has the alarm.

Run the following command to check the heap usage and match it to the alarm in the UI.

[root@esxi-customer:~] nsxcli -c get firewall thresholds
                                Firewall Threshold Monitors
-------------------------------------------------------------------------------------------

 #      Name      Raised  Threshold  CurrValue  CurrSize   MaxSize  PeakEver  EverTime(ago)
 1    cfgagent    False      100         5       139 MB    2500 MB     5      11d 12:41:18
 2    dfw-cpu     False       90         0         --        --        52       13:48:26
 3  dfw-session   False       80         5         --        --        23      4d 01:27:27
 4  nsx-exporter  False      100         19      149 MB    768 MB      22      2d 05:40:06
 5    nsx-idps    False      100         14      302 MB    2048 MB     15     11d 21:39:52
 6      vdpi      False      100         47      485 MB    1024 MB     47      4d 13:37:44
 7   vsip-attr    False       90         89      1144 MB   1280 MB     93      6d 08:59:47  <-------HERE
 8   vsip-flow    False       90         0        2 MB     768 MB      0        --:--:--
 9  vsip-fprules  False       90         0        6 MB     2560 MB     0        --:--:--
10   vsip-fqdn    False       90         2        12 MB    512 MB      3       2d 23:12:18
11  vsip-ipreputation  False       90         0        0 MB     256 MB      0        --:--:--
12  vsip-module   False       90         21      539 MB    2560 MB     24     11d 13:26:58
13   vsip-rules   False       90         0        5 MB     3070 MB     0        --:--:--
14    vsip-si     False       90         0        0 MB     128 MB      0        --:--:--
15   vsip-state   False       90         0        0 MB     512 MB      0        --:--:--

This ESXi host currently has the alarm raised in the NSX-T UI. The vsip-attr heap usage (CurrValue) is currently at 89%, which is above the 75% very high threshold named in the alarm.
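When several hosts need checking, the table above can be parsed instead of read by eye. The following is a minimal Python sketch, assuming the output of nsxcli -c get firewall thresholds has been saved as text in the column layout shown above. The function name find_high_monitors and the default limit of 75 are illustrative choices, not part of any NSX tooling.

```python
import re

# Matches the leading columns of a data row: "#  Name  Raised  Threshold  CurrValue".
# Assumption: rows follow the column layout of the sample output in this article.
ROW = re.compile(r"^\s*(\d+)\s+(\S+)\s+(True|False)\s+(\d+)\s+(\d+)\b")


def find_high_monitors(output, limit=75):
    """Return (name, curr_value) for each monitor whose CurrValue is >= limit."""
    hits = []
    for line in output.splitlines():
        m = ROW.match(line)
        if m and int(m.group(5)) >= limit:
            hits.append((m.group(2), int(m.group(5))))
    return hits


# Rows copied from the sample output in this article.
SAMPLE = """\
 #      Name      Raised  Threshold  CurrValue  CurrSize   MaxSize  PeakEver  EverTime(ago)
 6      vdpi      False      100         47      485 MB    1024 MB     47      4d 13:37:44
 7   vsip-attr    False       90         89      1144 MB   1280 MB     93      6d 08:59:47
 8   vsip-flow    False       90         0        2 MB     768 MB      0        --:--:--
"""

if __name__ == "__main__":
    for name, value in find_high_monitors(SAMPLE):
        print(f"{name}: {value}%")
```

With the sample rows, only vsip-attr is reported, because it is the only monitor at or above 75%.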

--------------------------------------------------------

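Next, run the following command, replacing "VMname" with the name of the VM, to find the VM's filter name.

run = summarize-dvfilter | grep -i "VMname" -A9

Use Output = name: nic-2580247-eth0-vmware-sfw.2

Use the nic-<number>-eth0-vmware-sfw.2 filter name in the commands that follow to get the rest of the criteria.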
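If the summarize-dvfilter output has been saved to a file, the filter name can be extracted with a short script. This is a sketch only, assuming the filter is reported on a "name:" line ending in vmware-sfw.2 as in the example above. The function name sfw_filter_names and the placeholder second line of the sample are hypothetical.

```python
import re

# Assumption: the slot 2 DFW filter appears on a "name:" line ending in
# "vmware-sfw.2", as in the example output in this article.
FILTER = re.compile(r"name:\s*(nic-\d+-eth\d+-vmware-sfw\.2)\b")


def sfw_filter_names(output):
    """Return every slot 2 vmware-sfw filter name found in the text."""
    return FILTER.findall(output)


# First line is from the article; the second is a placeholder for the
# remaining lines that grep -A9 would print.
SAMPLE = """\
   name: nic-2580247-eth0-vmware-sfw.2
   (remaining lines of the grep -A9 output)
"""

if __name__ == "__main__":
    for name in sfw_filter_names(SAMPLE):
        print(name)
```

Each returned name can be passed to the -f option of the vsipioctl commands used in this article.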

run = vsipioctl getfilterstat -f nic-6090470-eth0-vmware-sfw.2 | grep -E "DROP REASON | L7 attr error"

DROP REASON
L7 attr error:        423066 <---------HERE

This counter keeps incrementing while the issue is occurring. A vMotion of the VM might reset some of these values. Run the command again later to see how much the L7 attr error count has increased.

run = vsipioctl getfilterstat -f nic-6090470-eth0-vmware-sfw.2 | grep -E "DROP REASON | L7 attr error"

DROP REASON
L7 attr error:        425066 <--------HERE

We can see that the L7 attr error drop counter has increased, from 423066 to 425066 in this example.
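The comparison between the two runs can be sketched in Python as follows, assuming each run's output was captured as text. The helper name l7_attr_errors is illustrative, and the two samples are the values from the outputs above.

```python
import re

# Matches the counter line, e.g. "L7 attr error:        423066".
COUNTER = re.compile(r"L7 attr error:\s*(\d+)")


def l7_attr_errors(output):
    """Return the 'L7 attr error' counter from the text, or None if it is absent."""
    m = COUNTER.search(output)
    return int(m.group(1)) if m else None


# The two samples taken in this article.
first = l7_attr_errors("DROP REASON\nL7 attr error:        423066\n")
second = l7_attr_errors("DROP REASON\nL7 attr error:        425066\n")

delta = second - first
# A negative delta would suggest the counter was reset between samples,
# for example by a vMotion of the VM.
print(f"L7 attr error grew by {delta}")  # prints: L7 attr error grew by 2000
```

A counter that keeps growing between samples points to the VM behind that filter as a likely contributor.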

-----------------------------------------------------
Next, run the following command to get the active flow counts for the filter. In this example there is a large number of active UDP flows. The other flow types are useful to understand, but this example focuses on UDP.

run = vsipioctl getfloodstat -f nic-2109705-eth0-vmware-sfw.2

UDP:

    Flood Protection        : disabled

    Active UDP Flows        = 26063 <------HERE

ICMP:

    Flood Protection        : disabled

    Active ICMP Flows       = 20

OTHER:

    Flood Protection        : disabled

    Active OTHER Flows      = 0

TCP Half-open:

    Flood Protection        : disabled

    Active TCP Flows        = 51


This Domain Controller (DC) VM was the cause of the vsip-attr heap usage exceeding the threshold.
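To compare flow counts across several VMs, the getfloodstat output can be reduced to one number per protocol. The sketch below assumes the output format shown above and uses the values from this example. The function name active_flows is illustrative.

```python
import re

# Matches lines such as "Active UDP Flows        = 26063".
FLOWS = re.compile(r"Active\s+(\S+)\s+Flows\s*=\s*(\d+)")


def active_flows(output):
    """Map each protocol label to its active flow count."""
    return {proto: int(count) for proto, count in FLOWS.findall(output)}


# Values copied from the example output in this article.
SAMPLE = """\
UDP:
    Flood Protection        : disabled
    Active UDP Flows        = 26063
ICMP:
    Flood Protection        : disabled
    Active ICMP Flows       = 20
OTHER:
    Flood Protection        : disabled
    Active OTHER Flows      = 0
TCP Half-open:
    Flood Protection        : disabled
    Active TCP Flows        = 51
"""

flows = active_flows(SAMPLE)
busiest = max(flows, key=flows.get)
print(busiest, flows[busiest])  # prints: UDP 26063
```

Running the same parse against each candidate VM's filter makes it easier to see which VM carries the bulk of the inspected traffic.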


At this point, you can proceed with the Workaround or Permanent Fix options listed in the Resolution section.