Multiple core dumps (vmkernel-zdump & opsAgent-zdump) observed on ESX host running 4.0.1.1 build, during upgrade from 4.0.1.1 to 4.2.2.0

Article ID: 400837


Updated On:

Products

VMware vDefend Firewall

Issue/Introduction

Upgrade ran from 3.1.0.0 -> 4.0.1.1 -> 4.2.2.0.
A vmkernel core dump is observed when an out-of-memory condition is hit on a host running the 4.0.1.1 build.

Environment

Upgrade from 4.0.1.1 -> 4.2.2.0 with a large enough number of VMs and IPs within the groups associated with the DFW rulesets that vMotion replication can cause out-of-memory conditions during config import. The crash is observed while the host is running the 4.0.1.1 build with such a scale configuration.

Cause

The ESX host is running the 4.0.1.1 build with a large enough number of VMs and IPs within the groups associated with the DFW rulesets that vMotion replication causes out-of-memory conditions during config import.

Impact to customer: the host crashes and reboots.

 

Logs:
2025-03-22T07:58:35.407Z nsx-exporter[2510861]: NSX 2510861 - [nsx@6876 comp="nsx-esx" subcomp="agg-service" tid="2510881" level="ERROR" errorCode="MPA11015"] vsip-fprules threshold event is raised. Threshold value is 90; current value is 99.

^^ nsx-syslog entry showing the configuration exceeded the threshold value
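
To confirm whether an affected host raised this event, the nsx-syslog on the ESXi host can be searched for the vsip-fprules / MPA11015 signature. A minimal sketch, assuming the default log location of /var/run/log:

# Search the NSX syslog on the ESXi host for the vsip-fprules threshold event
grep "vsip-fprules threshold" /var/run/log/nsx-syslog*
# Alternatively, match on the error code
grep "MPA11015" /var/run/log/nsx-syslog*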


..
33229 2025-03-22T08:20:01.361Z cpu8:2510530)pfp_create_addr_entry_with_table: failed to allocate pf_fp_entry form pfr_fastpath_pl
33230 2025-03-22T08:20:01.366Z cpu8:2510530)pfp_create_addr_entry_with_table: failed to allocate pf_fp_entry form pfr_fastpath_pl
33231 2025-03-22T08:20:01.460Z cpu8:2510530)pfp_create_addr_entry_with_table: failed to allocate pf_fp_entry form pfr_fastpath_pl
33232 2025-03-22T08:20:01.516Z cpu8:2510530)pfp_create_addr_entry_with_table: failed to allocate pf_fp_entry form pfr_fastpath_pl
33233 2025-03-22T08:20:01.516Z cpu8:2510530)pfp_create_addr_entry_with_table: failed to allocate pf_fp_entry form pfr_fastpath_pl
33234 2025-03-22T08:20:01.579Z cpu8:2510530)pfp_create_addr_entry_with_table: failed to allocate pf_fp_entry form pfr_fastpath_pl
33235 2025-03-22T08:20:01.616Z cpu8:2510530)pfp_create_addr_entry_with_table: failed to allocate pf_fp_entry form pfr_fastpath_pl
33236 2025-03-22T08:20:01.616Z cpu8:2510530)pfp_create_addr_entry_with_table: failed to allocate pf_fp_entry form pfr_fastpath_pl
..
33241 2025-03-22T08:20:01.651Z cpu8:2510530)pfp_insert_ruleid: failed to allocate pf_rule_list form pfr_rulelist_pl (rule_id 2019)
33242 2025-03-22T08:20:01.651Z cpu8:2510530)pfp_add: failed for rule
33243 2025-03-22T08:20:01.651Z cpu8:2510530)pfp_del_ruleid: rule not found 2019 rs 5
33244 2025-03-22T08:20:01.651Z cpu8:2510530)pfioctl: failed to add rules (0)
33245 2025-03-22T08:20:01.651Z cpu8:2510530)VSIPConversionCreateRuleSet: Cannot insert #24 rule 2019: 22

^^ vmkernel log entries indicating the out-of-memory condition was hit, causing the rule insert to fail
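
To check whether a host has hit the same allocation failures, the vmkernel log can be searched for the signatures above. A minimal sketch, assuming the default log location of /var/run/log:

# Allocation failures from the pf fastpath and rule-list pools
grep -E "failed to allocate pf_fp_entry|failed to allocate pf_rule_list" /var/run/log/vmkernel*
# Subsequent rule-insert failures
grep "failed to add rules" /var/run/log/vmkernel*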


33698 2025-03-22T08:26:18.538Z cpu5:3696612)Backtrace for current CPU #5, worldID=3696612, fp=0x1

^^ timestamp of backtrace generation for PSOD


At 2025-03-22T07:58:35, the configuration exceeded the threshold value; an alarm is raised for this condition.

Thereafter, out-of-memory (OOM) error messages are seen in the vmkernel logs, eventually leading to a crash (PSOD).
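
Once the host comes back up, the core dumps referenced in the title (vmkernel-zdump and opsAgent-zdump) can be located for confirmation. A minimal sketch, assuming the default core dump location of /var/core on the ESXi host:

# List core dumps generated on the host; vmkernel-zdump.* and opsAgent-zdump.* files
# with timestamps around the PSOD are consistent with this issue
ls -lh /var/core/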

 

Backtrace:
(gdb) bt
#0 pfr_kif_ktbp_compare (n2=0xa5a00, n1=0x453896818e08) at datapath/esx/modules/vsip/vsip_pf/pf/pf_table.c:256
#1 pfr_kif_ktbp_head_RB_FIND (head=head@entry=0x43297b372060, elm=elm@entry=0x453896818e08) at datapath/esx/modules/vsip/vsip_pf/pf/pf_table.c:250
#2 0x00004200246989f3 in pf_sort_addr_rules (kif=kif@entry=0x4328d5990e78, rlist=rlist@entry=0x453896818fc0, max_nr=max_nr@entry=0x453896818f88,
    snodes=snodes@entry=0x453896819200, scount=scount@entry=1, dnodes=0x453896819300, dcount=0, rs_num=1, pd=0x45389681ad20, rs=<optimized out>)
    at datapath/esx/modules/vsip/vsip_pf/pf/pf_policy_lookup.c:252
#3 0x000042002469afed in pfp_policy_lookup (kif=kif@entry=0x4328d5990e78, policy_lookup_ctrl=policy_lookup_ctrl@entry=0x453896819570, ruleset=0x4328d5a315c8,
    pd=pd@entry=0x45389681ad20, sport=<optimized out>, dport=<optimized out>, direction=1, ac=0x0, curr_attr_state=0x0, tm=0x453896819548)
    at datapath/esx/modules/vsip/vsip_pf/pf/pf_policy_lookup.c:967
#4 0x000042002463634d in pf_test_udp (rm=rm@entry=0x45389681ac80, jump_rm=jump_rm@entry=0x45389681ac90, ids_rm=ids_rm@entry=0x45389681ac88,
    sm=sm@entry=0x45389681ac98, prlist=prlist@entry=0x45795fc80800, direction=direction@entry=1, kif=0x4328d5990e78, m=0x45389681aec8, off=20, h=0x45790229f7ce,
    rlookup=1 '\001', rule_type=0, curr_attr_state=0x0, next_attr_state=0x45389681ac74, ac=0x0, sip_persist=0x45389681aca8, lb_ctx=0x0, reason=0x45389681ac68,
    pd=0x45389681ad20, ethtype=8, am=0x45389681ac78, rsm=0x45389681aca0, ifq=0x0, inp=0x0) at datapath/esx/modules/vsip/vsip_pf/pf/pf.c:6046
#5 0x00004200246460d9 in pf_test (dir=dir@entry=1, ifp=ifp@entry=0x4328d5990e08, m0=m0@entry=0x45389681ae90, eh=eh@entry=0x45790229f7c0, ethHdrLen=14,
    ethtype=ethtype@entry=8, inp=0x0, metadata=0x45389681aea0, check_only=0, flow_entry=0x45389681af80) at datapath/esx/modules/vsip/vsip_pf/pf/pf.c:13861
#6 0x00004200246cc870 in PFFilterPacket (cookie=0x4328d5990e08, fragsList=0x45389681b250, dvDir=VMK_DVFILTER_FROM_SWITCH, source=<optimized out>,
    verdict=0x45389681b4f8, checkStateOnly=<optimized out>, flowMetaData=0x45389681b648) at datapath/esx/modules/vsip/vsip_pf/pf_vmk/glue.c:3513
#7 0x00004200245ecd04 in VSIPDVFProcessPacketsInt (filterImpl=<optimized out>, pktList=<optimized out>, direction=<optimized out>, ensData=<optimized out>)
    at datapath/esx/modules/vsip/vsip_dvfilter.c:4067
#8 0x000042002323b495 in ?? ()
#9 0x000045389681b870 in ?? ()
#10 0x0000000000000001 in ?? ()
#11 0x0000000000000001 in ?? ()
#12 0x00004302ea0530f0 in ?? ()
#13 0x00004302ea052e00 in ?? ()
#14 0x000043016e02a390 in ?? ()
#15 0x000045389681b930 in ?? ()
#16 0x0000420022829042 in IOChain_Resume (port=0x45389681b930, port@entry=0x4302ea052e00, chain=0x453891c9f000, chain@entry=0x4302ea0530f0,
    prevLink=prevLink@entry=0x0, pktList=0x45389681b840, pktList@entry=0x45389681b930, remainingPktList=remainingPktList@entry=0x45389681b910)
    at bora/vmkernel/net/iochain.c:881
#17 0x000042002286d5b3 in PortOutput (port=port@entry=0x4302ea052e00, prev=prev@entry=0x0, pktList=pktList@entry=0x45795ebfe1a8) at bora/vmkernel/net/port.c:4049
#18 0x00004200228a84c6 in Port_Output (pktList=0x45795ebfe1a8, port=0x4302ea052e00) at bora/vmkernel/net/port.h:976
#19 vmk_PortOutput (portID=portID@entry=100663338, pktList=pktList@entry=0x45795ebfe1a8, mayModify=mayModify@entry=0 '\000') at bora/vmkernel/net/vmkapi_net_port.c:375
#20 0x00004200243044a0 in VSwitchPortOutput (mayModify=0 '\000', pktList=0x45795ebfe1a8, dstPortId=100663338, srcPortId=100663355)
    at datapath/esx/modules/vswitch/vswitch.c:5754
#21 VSwitchForwardLeafPorts (ps=ps@entry=0x4302ea003138, srcPortId=srcPortId@entry=100663355, dispatchData=dispatchData@entry=0x45795ebfe078,
    completionList=completionList@entry=0x45389681bab0) at datapath/esx/modules/vswitch/vswitch.c:6089
#22 0x000042002430b949 in VSwitchPortDispatch (ps=<optimized out>, pktList=<optimized out>, srcPortID=<optimized out>) at datapath/esx/modules/vswitch/vswitch.c:7607
#23 0x000042002286d7e6 in Portset_Input (pktList=0x45389681bdd0, port=0x4302ea07edc0) at bora/vmkernel/net/portset.h:1720
#24 Port_InputResume (port=port@entry=0x4302ea07edc0, prev=prev@entry=0x0, pktList=pktList@entry=0x45389681bdd0) at bora/vmkernel/net/port.c:4134
#25 0x000042002286d8ce in Port_Input (port=port@entry=0x4302ea07edc0, pktList=pktList@entry=0x45389681bdd0) at bora/vmkernel/net/port.c:2769
#26 0x00004200228dcd69 in Vmxnet3VMKDevTQDoTx (port=port@entry=0x4302ea07edc0, tqIdx=tqIdx@entry=0, callSite=<optimized out>)
    at bora/vmkernel/net/vmxnet3_vmkdev.c:4737
#27 0x00004200228e62ec in Vmxnet3VMKDev_AsyncTx (port=0x4302ea07edc0, qidMap=0) at bora/vmkernel/net/vmxnet3_vmkdev.c:5314
#28 0x0000420022858122 in NetWorldPerVMCBInt (vmInfo=0x430051a68540) at bora/vmkernel/net/net_world.c:56
#29 NetWorldPerVMCB (data=0x430051a68540) at bora/vmkernel/net/net_world.c:139
#30 0x00004200229b492a in CpuSched_StartWorld (destWorld=<optimized out>, previous=<optimized out>) at bora/vmkernel/sched/cpusched.c:12073
#31 0x00004200226c4cc0 in ?? () at bora/vmkernel/main/debug.c:4019
#32 0x0000000000000000 in ?? ()

Resolution

The issue is fixed in 4.2 and later releases. Any release prior to 4.2 is exposed to this issue.

Additional Information

Workaround:

1. Ensure the overall number of rules on the ESXi host does not exceed the limit in the Broadcom configuration maximums guide (see the command sketch after this list):

https://configmax.broadcom.com/guest?vmwareproduct=VMware%20NSX&release=NSX-T%20Data%20Center%203.2.1&categories=19-34

2. Move the VMs to the DFW exclusion list on the cluster that is being upgraded.
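
For workaround 1, the per-filter rule and address-set counts on the host can be inspected before the upgrade. A minimal sketch, assuming vsipioctl is available on the NSX-prepared host and <filter-name> is a placeholder for a filter name taken from the summarize-dvfilter output; the grep patterns give approximate counts only:

# List the DFW filters attached to the VMs on this host
summarize-dvfilter
# Approximate number of rules programmed in a given filter
vsipioctl getrules -f <filter-name> | grep -c " rule "
# Approximate number of address sets (container IPs) used by the filter
vsipioctl getaddrsets -f <filter-name> | grep -c "addrset"

Compare these numbers against the distributed firewall limits in the configuration maximums link above.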