ESXi host PSODs due to DFW rule memory exhaustion after NSX upgrade.

Article ID: 409770

Updated On:

Products

VMware vDefend Firewall

Issue/Introduction

In NSX-T environments, ESXi hosts may fail with a purple screen of death (PSOD) during or after an upgrade to NSX 4.2.x when the number of DFW rules realized on the host exceeds the supported configuration maximums.

Environment

  • VMware NSX-T Data Center 3.2.x and 4.2.x

  • VMware vSphere ESXi 7.x and later

  • vDefend Firewall (DFW) enabled.

Cause

  • In NSX 3.2.x, DFW rules and address sets/groups were stored in a single shared 3 GB heap. Because rules could draw on the whole shared heap, rule counts above the supported maximum could still be realized, although such configurations were unsupported.

  • Starting with NSX 4.2.1, heap memory allocation was redesigned:

    • 3 GB heap dedicated for address sets/groups (vsip-kentries)

    • 1 GB heap allocated exclusively for DFW rules (vsip-rules)

  • When the DFW rule count on an ESXi host exceeds the supported maximum (~120K rules per host, ≤4K rules per vNIC), the vsip-rules heap is exhausted and memory allocations for new rules fail.

    [Figure: vsip-rules heap and its utilization, example from a support bundle]

    Note: The following commands can be used to review firewall thresholds on a live host:

      ESXi:~] nsxcli
      ESXi> get firewall thresholds

  • This condition can leave firewall rules in an inconsistent state and may cause an ESXi host PSOD.
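The limits described above can be turned into a quick sanity check. The following is a minimal sketch, not a VMware tool: it assumes the per-host rule count is the sum of the rules realized on each vNIC of that host, and the function name and input dictionary are hypothetical. Only the two maximums (120,000 per host, 4,000 per vNIC) come from this article.

```python
# Supported maximums quoted in this article.
MAX_RULES_PER_HOST = 120_000
MAX_RULES_PER_VNIC = 4_000


def check_rule_limits(rules_per_vnic):
    """Return (host_total, vnics_over_limit, host_over_limit).

    rules_per_vnic is a hypothetical input: a dict mapping a vNIC name to
    the number of DFW rules realized on that vNIC. How the counts are
    collected is up to the reader's own inventory or tooling.
    Assumption: the host total is the sum of the per-vNIC counts.
    """
    host_total = sum(rules_per_vnic.values())
    vnics_over_limit = sorted(
        name for name, count in rules_per_vnic.items()
        if count > MAX_RULES_PER_VNIC
    )
    return host_total, vnics_over_limit, host_total > MAX_RULES_PER_HOST


if __name__ == "__main__":
    sample = {"vm-a.eth0": 3_500, "vm-b.eth0": 4_200, "vm-c.eth0": 1_000}
    total, over_vnics, host_over = check_rule_limits(sample)
    print(f"host total: {total}")
    print(f"vNICs over limit: {over_vnics}")
    print(f"host over limit: {host_over}")
```

Running the check before an upgrade, with counts gathered per host, gives an early indication of which hosts are likely to exhaust the vsip-rules heap.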

Symptoms:

  • ESXi host crashes (PSOD) during or after the NSX upgrade. Example backtrace:

    (gdb) bt
    #0  fp2_rulematch_set (rs=0x************, rs_num=<optimized out>, rule=<optimized out>, flags=3,
        fprl=0x************, kif=0x************, str="wildcard") at …/pf_policy_lookup.c:****
    #1  pf_sort_wildcard_rules (pd=0x************, rs_num=1, max_nr=0x************, rlist=0x************,
        rs=0x************, kif=0x************) at …/pf_policy_lookup.c:****
    #2  pfp_policy_lookup (kif=0x************, policy_lookup_ctrl=0x************, ruleset=0x************,
        pd=0x************, sport=<optimized out>, dport=<optimized out>, direction=2, ac=0x0,
        curr_attr_state=0x************, tm=0x************) at …/pf_policy_lookup.c:****
    #3  0x**************** in pf_test_tcp (rm=0x************, jump_rm=0x************, ids_rm=0x************,
        sm=0x************, prlists=<optimized out>, direction=2, kif=0x************, m=0x************,
        off=20, h=0x************, rlookup=1, rule_type=0, curr_attr_state=0x************,
        next_attr_state=0x************, ac=0x0, sip_persist=0x0, lb_ctx=0x0, reason=0x************,
        pd=0x************, ethtype=8, am=0x************, rsm=0x************, ifq=0x0, inp=0x0)
        at …/pf.c:****
    #4  0x**************** in pf_validate_state_v2 (kif=0x************, state=0x************, rule=0x************,
        jump_rule=0x************, ids_rule=0x************, anchor_rule=0x************, orig_pd=0x************,
        ethtype=8, paction=0x************, rule_type=0, next_attr_state=0x************, waslocked=0)
        at …/pf.c:****
    #5  0x**************** in pf_validate_session_v2 (kif=0x************, m=<optimized out>, state=0x************,
        pd=0x************, ethtype=<optimized out>, direction=<optimized out>, waslocked=0) at …/pf.c:****
    #6  0x**************** in pf_validate_session (direction=2, ethtype=8, pd=0x************, state=<optimized out>,
        m=0x************, kif=0x************) at …/pf.c:****
    #7  pf_test_state_tcp (state=0x************, direction=2, kif=0x************, m=0x************, off=20,
        h=0x************, pd=0x************, ethtype=8, reason=0x************, check_only=0,
        check_dnat_out=0, drop_rst=0x************) at …/pf.c:****
    #8  0x**************** in pf_test (dir=2, ifp=0x************, m0=0x************, eh=0x************,
        ethHdrLen=14, ethtype=8, inp=0x0, metadata=0x************, check_only=0, flow_entry=0x************)
        at …/pf.c:****
    #9  0x**************** in PFFilterPacket (cookie=0x************, fragsList=0x************,
        dvDir=VMK_DVFILTER_TO_SWITCH, source=<optimized out>, verdict=0x************,
        checkStateOnly=<optimized out>, flowMetaData=0x************) at …/glue.c:****
    #10 0x**************** in VSIPFWProcessPackets (solution=0x************, filter=0x************,
        pktList=0x************, direction=VMK_DVFILTER_TO_SWITCH, source=VSIP_DVFILTER_SOURCE_REGULAR,
        action=0x************, checkStateOnly=0, flowMetaData=0x************) at …/vsip_fw.c:****
    #11 0x**************** in VSIPDVFProcessPacketsInt (filterImpl=0x************, pktList=<optimized out>,
        direction=<optimized out>, ensData=<optimized out>) at …/vsip_dvfilter.c:****
    #12 0x**************** in ?? ()
    #13 0x0000000000000000 in ?? ()

  • High DFW rule counts per host, far above the supported maximum (400K–700K rules observed).

  • vmkernel logs report memory allocation failures during rule commit, for example:

    2025-08-22T15:10:47.844Z cpu66:259195232)pfioctl: DIOCADDRULE No memory to create rule structure
    2025-08-22T15:10:47.844Z cpu66:259195232)VSIPConversionCreateRuleSet: Cannot insert #122 rule 154073: 12
    2025-08-22T15:10:47.844Z cpu66:259195232)pf_rollback_rules: nic-259448837-eth157914-vmware-sfw.1, rs_num: 1, anchor: mainrs...
    2025-08-22T15:09:15.249Z cpu42:259195232)pfioctl: DIOCXCOMMIT rules ina_commit failed
    2025-08-22T15:09:15.249Z cpu42:259195232)VSIPCommitTransaction: failed to commit transaction: 12
    2025-08-21T05:53:29.308Z cpu23:2106293)WARNING: Heap: 3648: Heap vsip-rules already at its maximum size. Cannot expand.
    2025-08-21T05:58:03.661Z cpu75:2106293)WARNING: Heap: 3648: Heap vsip-rules already at its maximum size. Cannot expand.
    2025-08-21T06:00:29.258Z cpu52:2106293)WARNING: Heap: 3648: Heap vsip-rules already at its maximum size. Cannot expand.
    2025-08-21T06:02:40.428Z cpu59:2106293)WARNING: Heap: 3648: Heap vsip-rules already at its maximum size. Cannot expand.

  • Rule realization failures when publishing firewall policies.
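To check a host's vmkernel log for the messages listed above, a small scanner can help. This is an illustrative sketch under stated assumptions: the message fragments are copied from the log excerpt in this article, while the function name, the pattern labels, and the idea of passing the log path on the command line are hypothetical.

```python
import re
import sys
from collections import Counter

# Message fragments taken from the vmkernel log excerpt in this article.
# The dictionary keys are hypothetical labels chosen for this sketch.
PATTERNS = {
    "heap_at_max": re.compile(r"Heap vsip-rules already at its maximum size"),
    "rule_no_memory": re.compile(r"DIOCADDRULE No memory to create rule structure"),
    "commit_failed": re.compile(r"VSIPCommitTransaction: failed to commit transaction"),
}


def count_heap_symptoms(lines):
    """Count how many lines match each known vsip-rules exhaustion message."""
    counts = Counter()
    for line in lines:
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
    return counts


if __name__ == "__main__":
    # Usage: python scan_vmkernel.py <path-to-vmkernel-log>
    with open(sys.argv[1], errors="replace") as log_file:
        for name, count in sorted(count_heap_symptoms(log_file).items()):
            print(f"{name}: {count}")
```

Any non-zero count for these messages suggests the vsip-rules heap has hit its ceiling on that host and that the rule count should be reviewed.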

Resolution

Workaround:

  • Reduce the effective rule count per ESXi host to within supported configuration maximums:

    • ≤120,000 rules per host

    • ≤4,000 rules per vNIC

  • Recommended approaches:

    1. Add non-essential or container-based VMs to the DFW exclusion list.

    2. Where appropriate, move “ANY ANY ALLOW” type rules to the top of the rule table to avoid unnecessary rule expansion.

    3. Audit and remove redundant, duplicate, or irrelevant firewall rules.

Resolution:

  • This is not a software defect. The behavior is by design starting with NSX 4.2.1, because DFW rules now have their own dedicated 1 GB heap (vsip-rules).

  • Ensure that firewall rule design follows supported configuration maximums.

  • Best practices:

    • Scope rules with the “Applied To” field and security groups instead of applying them globally at the “DFW” level. This prevents rules from being replicated unnecessarily to every vNIC.

    • Periodically audit firewall rules to eliminate redundancy.

    • For environments with large-scale container workloads, create dedicated groups for container VMs and apply rules selectively.
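The effect of “Applied To” scoping on rule replication can be illustrated with simple arithmetic. The sketch below uses a deliberately simplified model, which is an assumption and not a description of NSX internals: a rule applied to “DFW” is counted once for every vNIC on the host, while a scoped rule is counted only for the vNICs in its “Applied To” groups. The function name and inputs are hypothetical.

```python
def realized_rule_instances(rule_scopes, vnics_on_host):
    """Estimate rule instances realized on one host (simplified model).

    rule_scopes: one entry per rule. None means the rule is applied to
    "DFW" (every vNIC); an integer is the number of vNICs on this host
    that belong to the rule's "Applied To" groups.
    vnics_on_host: number of DFW-protected vNICs on the host.
    """
    total = 0
    for scope in rule_scopes:
        if scope is None:
            total += vnics_on_host              # replicated to every vNIC
        else:
            total += min(scope, vnics_on_host)  # only the scoped vNICs
    return total


if __name__ == "__main__":
    vnics = 60
    global_rules = [None] * 2_000   # 2,000 rules applied to "DFW"
    scoped_rules = [5] * 2_000      # same rules, each scoped to 5 vNICs
    print(realized_rule_instances(global_rules, vnics))
    print(realized_rule_instances(scoped_rules, vnics))
```

Under this model, 2,000 globally applied rules on a host with 60 vNICs already reach the 120,000 per-host maximum, while the same rules scoped to 5 vNICs each stay far below it.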

Additional Information