NSX DFW drops a VM's traffic on one host, restored by vMotion

Article ID: 441153

Updated On:

Products

VMware NSX

Issue/Introduction

A virtual machine loses network connectivity to expected destinations while running on a particular ESXi host, and migrating the VM to a different host restores connectivity. The condition is intermittent, often appearing after the VM is relocated to a given host by DRS or by a manual migration, and is difficult to reproduce on demand.

Common observations reported with this issue:

  • A VM loses connectivity to some or all destinations only while on a specific ESXi host.
  • Other VMs on the same host are unaffected.
  • Migrating the affected VM to another host restores connectivity, and the restoration persists.
  • In the NSX Manager UI, the Distributed Firewall (DFW) rules show a published status of Success, yet traffic is still dropped.
  • The affected VM may be the vCenter Server appliance or another management VM.

This article is the starting point for isolating where the traffic is dropped and which known cause applies. The Resolution section walks through the dataplane checks that confirm whether the DFW is the drop point, then points to the article that matches the confirmed cause. Because migrating the VM clears the runtime state on the originating host, collect the verification output and any packet captures while the VM is still on the affected host, before applying the migration workaround.

Environment

  • VMware NSX 4.x
  • VMware vSphere ESXi 8.x
  • NSX Distributed Firewall (DFW) enabled

Cause

The DFW filter applied to the VM's vNIC is not enforcing the expected ruleset on the affected host, so traffic falls through to a default deny or reject action and is dropped. Several distinct root causes produce this same symptom, and they are distinguished by the dataplane state observed during the failure: whether the filter has rules at all, whether the VM's current IP is present in the relevant address set, and whether the rule was applied and then removed. The verification steps in the Resolution identify which root cause applies, and each is addressed in its own article.

Resolution

Run the following while the VM is still on the affected host and the condition is active. Do not migrate the VM until verification and any captures are complete, because migration clears the failure state.

Step 1: Locate the VM's slot-2 DFW filter

Open an SSH session to the affected ESXi host as root and identify the slot-2 filter:

summarize-dvfilter | grep -i <VM-Name> -A 16

Note the slot-2 filter name in the form nic-XXXXXXXX-eth0-vmware-sfw.2, the world ID, the decimal port ID, and the failurePolicy value. A failurePolicy of failClosed means traffic is dropped whenever the filter has no valid ruleset. For the command and example output, see NSX-T DFW rules are not applied to VMs in security only environments.
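
As a minimal sketch of the note-taking in this step, the function below pulls slot-2 filter names out of summarize-dvfilter output. It assumes filter names follow the nic-<digits>-eth<N>-vmware-sfw.2 form described above. The function name slot2_filters is hypothetical and is not part of any NSX or ESXi tooling.

```shell
# Extract slot-2 DFW filter names from text on stdin.
# Assumption: names look like nic-<digits>-eth<N>-vmware-sfw.2.
slot2_filters() {
    grep -oE 'nic-[0-9]+-eth[0-9]+-vmware-sfw\.2' | sort -u
}

# Usage on the host (the VM name is a placeholder):
#   summarize-dvfilter | grep -i <VM-Name> -A 16 | slot2_filters
```

A VM with several vNICs yields one name per vNIC, so each name printed is a candidate for the checks in Step 2.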

Step 2: Check the dataplane ruleset and address sets

vsipioctl getrules -f nic-XXXXXXXX-eth0-vmware-sfw.2
vsipioctl getaddrsets -f nic-XXXXXXXX-eth0-vmware-sfw.2
vsipioctl getfwconfig -f nic-XXXXXXXX-eth0-vmware-sfw.2

These commands and their use against a VM's slot-2 filter are documented in NSX-T DFW rules not getting applied to virtual machines in NSX-T Security Only prepared cluster. Interpret the output as follows, then go to the matching article in Step 4.

Observation | What it indicates
getrules returns No rules or No root rule set | The filter has no ruleset on this host. Traffic is dropped by failClosed. Continue to the rule-realization causes in Step 4.
Rules are present, but the VM's current IP is missing from the expected address set | An address set / IP discovery binding problem. Continue to the IP discovery cause in Step 4.
Rules and address sets look correct, but a default reject or drop rule is taking the hits | Traffic is not matching an allow rule. Continue to the default-rule cause in Step 4.
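
The interpretation above can be sketched as a small triage function run against saved command output. This is a first-pass illustration only, under these assumptions: the getrules and getaddrsets output has been redirected to files, the empty-ruleset messages read exactly "No rules" or "No root rule set", and the VM's IP appears literally in the address-set output. Membership through a CIDR block or range is not detected, and the search covers every address set in the file rather than only the expected one. The function name and result labels are hypothetical.

```shell
# Classify the Step 2 evidence. Arguments:
#   $1 = file holding saved "vsipioctl getrules" output
#   $2 = file holding saved "vsipioctl getaddrsets" output
#   $3 = the VM's current IP address
classify_dfw_state() {
    if grep -qE 'No rules|No root rule set' "$1"; then
        # The filter has no ruleset on this host.
        echo "no-ruleset"
    elif ! grep -qwF -- "$3" "$2"; then
        # Literal whole-word match only; CIDR or range membership is not checked.
        echo "ip-missing-from-addrsets"
    else
        # Rules and address sets look populated; review default-rule hits next.
        echo "check-default-rule-hits"
    fi
}
```

A result of check-default-rule-hits does not prove the rules are correct. It only means neither of the first two conditions was detected, so the rule hit counts still need to be reviewed by hand.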

Step 3: Confirm the DFW is the drop point with packet captures

To confirm that the DFW, rather than a forwarding or overlay problem, is dropping the traffic, capture before and after the filter while a continuous ping runs from the affected VM to the unreachable destination. Write the output to a case-specific subfolder on a non-vSAN datastore, not to /tmp.

pktcap-uw --dvFilter nic-XXXXXXXX-eth0-vmware-sfw.2 --capture PreDVFilter --ng --count 200 -o /vmfs/volumes/<datastore>/<case>/vm_pre_dfw.pcapng
pktcap-uw --dvFilter nic-XXXXXXXX-eth0-vmware-sfw.2 --capture PostDVFilter --ng --count 200 -o /vmfs/volumes/<datastore>/<case>/vm_post_dfw.pcapng
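
To reduce typing errors, the two capture commands above can be generated from the filter name and the output folder. The sketch below only prints the commands and does not run them. It uses only the options shown above and rejects any output folder that is not under /vmfs/volumes. The function name is hypothetical, and the path check cannot tell whether the datastore is vSAN, so that still has to be verified separately.

```shell
# Print (do not run) the pre- and post-filter capture commands.
#   $1 = slot-2 filter name
#   $2 = case-specific output folder on a datastore
print_dfw_capture_cmds() {
    case "$2" in
        /vmfs/volumes/*) ;;
        *) echo "output folder must be under /vmfs/volumes" >&2; return 1 ;;
    esac
    for point in PreDVFilter PostDVFilter; do
        case "$point" in
            PreDVFilter) tag=pre ;;
            *) tag=post ;;
        esac
        echo "pktcap-uw --dvFilter $1 --capture $point --ng --count 200 -o $2/vm_${tag}_dfw.pcapng"
    done
}

# Usage (placeholders as in the commands above):
#   print_dfw_capture_cmds nic-XXXXXXXX-eth0-vmware-sfw.2 "/vmfs/volumes/<datastore>/<case>"
```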

Packets present before the filter but absent after it confirm the DFW is dropping the traffic. The --capture PreDVFilter and --capture PostDVFilter syntax is documented in How to Capture Packets at DVFilter Level, and the requirement to write captures to a datastore rather than /tmp is noted in Packet capture on ESXi using the pktcap-uw tool. For the full data-path capture procedure (vNIC, switchport, uplink, and kernel capture points), see Datapath capture to Diagnose Datapath Connectivity Issues in NSX Environments.
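
The comparison described above reduces to a simple rule on packet counts. The following sketch assumes the counts have already been obtained for the ping flow only, for example by filtering each capture file in a packet analyzer. How the counts are produced is outside the sketch, and the function name and result labels are hypothetical.

```shell
# Compare packet counts for the same flow before and after the filter.
#   $1 = packets seen at PreDVFilter
#   $2 = packets seen at PostDVFilter
dfw_drop_verdict() {
    if [ "$1" -eq 0 ]; then
        # Nothing reached the filter, so the captures say nothing about the DFW.
        echo "inconclusive"
    elif [ "$2" -eq 0 ]; then
        # Packets reach the filter but none leave it.
        echo "dfw-drop"
    else
        # Packets pass the filter in the captured direction.
        echo "passes-filter"
    fi
}
```

A passes-filter result applies only to the direction and flow that were counted, so the return traffic may need the same comparison before the DFW is ruled out.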

Step 4: Match the confirmed cause to its article

Use the dataplane evidence from Step 2 to select the matching article. Each contains its own verification detail, affected versions, and resolution.

If getrules returns "No rules" or "No root rule set"

If rules are present but the VM's IP is missing from the address set

If rules and address sets are correct but a default reject or drop rule is taking hits

Step 5: Collect a full support bundle during the active condition, then restore service

Collect an NSX-generated support bundle while the condition is still present, including all NSX Manager nodes, the NSX Edge nodes, the affected ESXi host, and the destination ESXi host, plus a vCenter Server support bundle. A bundle collected during the active condition captures the configuration agent state and local control-plane state from the affected host, which is not present in a bundle collected after migration. After collection, migrate the VM to restore service and record the restoration time.
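
Because the start and restoration times are requested later, it can help to log them as they happen. The helper below is a minimal sketch that appends a UTC timestamp and a free-text label to a notes file. The function name, the labels, and the notes file location are hypothetical choices, not part of any NSX tooling.

```shell
# Append a UTC timestamp and a label to a notes file.
#   $1 = notes file
#   $2 = label, such as "condition-start" or "service-restored"
log_event() {
    printf '%s %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$2" >> "$1"
}

# Usage (path is a placeholder):
#   log_event "/vmfs/volumes/<datastore>/<case>/timeline.txt" service-restored
```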

If none of the matching articles resolve the issue, contact Broadcom Support and provide the dataplane command output, the packet captures, the support bundle, and the start and restoration timestamps.

Additional Information

Cause-specific articles:

Capture and data-collection procedures: