High DFW Session Count on ESXi Transport Nodes

Article ID: 372896

Products

VMware vDefend Firewall, VMware vDefend Firewall with Advanced Threat Prevention, VMware NSX Firewall

Issue/Introduction

The DFW session count is high on Transport node XXXX. It has reached xx%, which is at or above the threshold value of 80%.

  • Maximum connection limit: Each ESXi host supports a maximum of 2 million active connections.
  • Impact:
    • Exceeding the limit causes new connections to be dropped, which can affect critical workloads.
    • Exceeding the limit can cause vMotion to the host to fail.
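The threshold and the limit above reduce to simple arithmetic: 80% of 2,000,000 is 1,600,000 sessions. The minimal POSIX shell sketch below shows that calculation for a session count you supply. The helper names `dfw_session_pct` and `dfw_session_status` are hypothetical, not NSX commands, and the sketch neither queries the host nor reproduces how NSX itself evaluates the alarm.

```shell
#!/bin/sh
# Hypothetical helpers: the limit and threshold are the values stated in this article.
DFW_SESSION_LIMIT=2000000   # per-host DFW session limit
DFW_ALARM_PCT=80            # alarm threshold, in percent

# dfw_session_pct COUNT -> prints COUNT as an integer percentage of the limit (rounded down)
dfw_session_pct() {
    echo $(( $1 * 100 / DFW_SESSION_LIMIT ))
}

# dfw_session_status COUNT -> prints "ALARM" at or above the threshold, otherwise "OK"
dfw_session_status() {
    if [ "$(dfw_session_pct "$1")" -ge "$DFW_ALARM_PCT" ]; then
        echo "ALARM"
    else
        echo "OK"
    fi
}
```

For example, `dfw_session_status 1650000` prints `ALARM`, because 1,650,000 sessions is 82% of the limit.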

 

Environment

VMware ESXi hosts with distributed firewall (DFW) enabled.

 

Cause

Each ESXi host has a hard limit of 2 million DFW sessions. When this limit is exceeded, new connections are dropped. Workloads that need to initiate outbound traffic can be affected, which may look like an outage. In addition, a vMotion to the host fails if the sessions carried by the incoming VM would push the host's total above 2 million.
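The vMotion condition above is a simple sum. The sketch below assumes you already know the destination host's current session count and the number of sessions the migrating VM carries (for example, from the "Active Conn" output of the diagnostic script in the resolution steps) and checks whether the move stays within the limit. `dfw_headroom` and `vmotion_fits` are hypothetical helper names.

```shell
#!/bin/sh
# Hypothetical helpers: 2,000,000 is the per-host limit stated in this article.
DFW_SESSION_LIMIT=2000000

# dfw_headroom CURRENT -> prints how many more sessions the host can hold
# (negative if the host is already above the limit)
dfw_headroom() {
    echo $(( DFW_SESSION_LIMIT - $1 ))
}

# vmotion_fits CURRENT INCOMING -> exit status 0 if CURRENT + INCOMING does not
# exceed the limit, non-zero otherwise
vmotion_fits() {
    [ $(( $1 + $2 )) -le "$DFW_SESSION_LIMIT" ]
}
```

For example, `vmotion_fits 1850000 200000 || echo "vMotion would exceed the DFW session limit"` prints the message, because the combined total of 2,050,000 is above the limit.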

Resolution

Follow these steps to diagnose and resolve the issue.

1. Identify the VM causing the high session count.

  1. Run the diagnostic script, which is available on all NSX-prepared ESXi hosts, and save its output to a file:
    /usr/lib/vmware/vm-support/bin/vsipioctl_info.sh > /var/run/log/vsipioctl_info.sh_support_1.txt
    Note: The command may print "ERROR: could not read port number PortNum". This error can be ignored.

  2. Identify the active connections for all VMs on the ESXi host:
    grep "Active Conn" /var/run/log/vsipioctl_info.sh_support_1.txt

  3. Get the total number of active connections on the host:
    grep "Active Conn" /var/run/log/vsipioctl_info.sh_support_1.txt | awk '{total += $4} END {print total}'

  4. Identify VMs that had high connection counts in the past (high water mark):
    grep "High Water Mark" /var/run/log/vsipioctl_info.sh_support_1.txt

  5. Find the NIC/slot with the high connection count. The -B 45 option prints the 45 lines before each match so that the NIC name is included:
    grep -B 45 "<High Water Mark value from sub-step 4>" /var/run/log/vsipioctl_info.sh_support_1.txt | grep nic

  6. Identify the VM that owns that NIC:
    summarize-dvfilter | grep -B 4 "<nic name from sub-step 5>"
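Sub-steps 2 through 5 can also be run as a single pass over the saved report. The minimal sketch below relies on assumptions about the report layout that this article does not document: the count on "Active Conn" lines is field 4 (the same assumption the awk command in sub-step 3 makes), the value on "High Water Mark" lines is the last field, and each filter's lines follow a line containing its `nic-...` name. The function names are hypothetical; compare the output with the manual grep commands before relying on it.

```shell
#!/bin/sh
# Hypothetical helpers for a saved vsipioctl_info.sh report.
# ASSUMED layout: "Active Conn" count in field 4, "High Water Mark" value in
# the last field, each filter's lines preceded by a line with its "nic-..." name.

# dfw_total_active FILE -> prints the sum of all "Active Conn" counts (sub-step 3)
dfw_total_active() {
    grep "Active Conn" "$1" | awk '{ total += $4 } END { print total + 0 }'
}

# dfw_top_filters FILE [N] -> prints "<filter> <active> <high water mark>" for the
# N filters (default 10) with the most active connections, highest first
dfw_top_filters() {
    awk '
        {
            # remember the most recent token that looks like a filter name
            for (i = 1; i <= NF; i++)
                if ($i ~ /^nic-/) filter = $i
        }
        /Active Conn/ && filter != ""     { active[filter] = $4 }
        /High Water Mark/ && filter != "" { hwm[filter] = $NF }
        END {
            for (f in active)
                print f, active[f], ((f in hwm) ? hwm[f] : 0)
        }
    ' "$1" | sort -k2,2nr | head -n "${2:-10}"
}
```

On the host, after sub-step 1, for example:
    dfw_total_active /var/run/log/vsipioctl_info.sh_support_1.txt
    dfw_top_filters /var/run/log/vsipioctl_info.sh_support_1.txt 5
The first field of each `dfw_top_filters` line is the filter name to pass to `summarize-dvfilter` in sub-step 6.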

2. Capture packets on the identified filter to determine the traffic pattern:

    pktcap-uw --capture PreDVFilter,PostDVFilter --dvfilter <nic name from sub-step 5 of step 1> --ng -o <location for the capture file>.pcap

Note: Replace the placeholders with your own values before running this command; do not copy and paste it as is.

3. Based on the identified traffic pattern, remediate the issue with one or more of the following steps:

  1. Review the network traffic load of the workloads on the host. Consider rebalancing workloads from this host to other hosts.
  2. Enable flood protection on the DFW and/or the Edge firewall.
  3. Tune existing session timers, or create new ones, to age out idle TCP sessions more aggressively.
  4. Enable TCP Strict on specific rule sections to protect the system from flows that do not follow the TCP state machine.
  5. Configure DNS Security to help guard against DNS-related attacks.
  6. Identify the offending IPs that create a large number of sessions, validate whether those sessions are legitimate, and block them if they are not.
  7. If the VMs are system VMs and the traffic is legitimate, add them to the DFW exclusion list.
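For item 6, one way to find the sources opening the most connections is to count TCP SYN packets per source address in the capture from step 2. The sketch below assumes the capture has been copied to a workstation and decoded with `tcpdump -nn -r` (which needs a libpcap version able to read the pcapng format written by `--ng`), and that decoded lines follow tcpdump's default IPv4 layout. IPv6 and other output formats are not handled. `top_syn_sources` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: ranks source IPv4 addresses by the number of pure TCP SYNs
# (new connection attempts) in `tcpdump -nn` text output read from stdin.
# ASSUMED line layout:
#   12:00:00.000001 IP 10.0.0.5.51234 > 10.0.0.9.443: Flags [S], seq 1, ...
# SYN-ACKs ("Flags [S.]") and other flag combinations are not counted.

# top_syn_sources [N] -> prints "<count> <source IP>" for the top N sources (default 10)
top_syn_sources() {
    awk '
        / IP / && /Flags \[S\],/ {
            src = $3
            sub(/\.[0-9]+$/, "", src)   # strip the ".port" suffix
            count[src]++
        }
        END { for (ip in count) print count[ip], ip }
    ' | sort -k1,1nr | head -n "${1:-10}"
}
```

For example:
    tcpdump -nn -r <capture file>.pcap tcp | top_syn_sources 20
Because the capture in step 2 records both the PreDVFilter and PostDVFilter points, the same packet may be counted twice, so treat the numbers as a relative ranking rather than exact connection counts.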