The upgrade path is 3.1.0.0 -> 4.0.1.1 -> 4.2.2.0.
A vmkernel core dump (PSOD) is observed when the host running the 4.0.1.1 build hits an out-of-memory condition.
Upgrade from 4.0.1.1 -> 4.2.2.0 with a large enough number of VMs, and enough IPs in the groups associated with the DFW rulesets, that the vMotion replication can cause out-of-memory issues during config import. The crash is observed while the ESX host is still running the 4.0.1.1 build with such a scale configuration.
Impact to customer: host crashes and reboots.
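To gauge whether a host is exposed to this scale condition before upgrading, the effective IP membership of the groups consumed by the DFW rulesets can be reviewed from the NSX Manager. The sketch below is a minimal, unofficial example, assuming the NSX Policy API endpoints /policy/api/v1/infra/domains/default/groups and .../groups/<group-id>/members/ip-addresses, basic authentication, and a hypothetical manager FQDN; verify the endpoints and credential handling against the API guide for the NSX version in use.

# Minimal sketch (not a supported tool): list groups in the default domain and
# print the effective IP member count of each, to gauge DFW address-set scale.
# Endpoint paths, manager FQDN, and credentials below are assumptions to adapt.
import requests

NSX_MGR = "https://nsx-manager.example.com"   # hypothetical manager FQDN
AUTH = ("admin", "password")                  # replace with real credentials

def get_all(url, params=None):
    # Follow the Policy API cursor-based pagination and return all results.
    results, cursor = [], None
    while True:
        p = dict(params or {})
        if cursor:
            p["cursor"] = cursor
        resp = requests.get(url, auth=AUTH, params=p, verify=False)
        resp.raise_for_status()
        body = resp.json()
        results.extend(body.get("results", []))
        cursor = body.get("cursor")
        if not cursor:
            return results

groups = get_all(f"{NSX_MGR}/policy/api/v1/infra/domains/default/groups")
for g in groups:
    ips = get_all(f"{NSX_MGR}/policy/api/v1/infra/domains/default/groups/"
                  f"{g['id']}/members/ip-addresses")
    print(f"{g['id']}: {len(ips)} effective IPs")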
Logs:
2025-03-22T07:58:35.407Z nsx-exporter[2510861]: NSX 2510861 - [nsx@6876 comp="nsx-esx" subcomp="agg-service" tid="2510881" level="ERROR" errorCode="MPA11015"] vsip-fprules threshold event is raised. Threshold value is 90; current value is 99.
^^ timestamp for nsx-syslog showing config reached beyond the threshold value.
...
2025-03-22T08:20:01.361Z cpu8:2510530)pfp_create_addr_entry_with_table: failed to allocate pf_fp_entry form pfr_fastpath_pl
2025-03-22T08:20:01.366Z cpu8:2510530)pfp_create_addr_entry_with_table: failed to allocate pf_fp_entry form pfr_fastpath_pl
2025-03-22T08:20:01.460Z cpu8:2510530)pfp_create_addr_entry_with_table: failed to allocate pf_fp_entry form pfr_fastpath_pl
2025-03-22T08:20:01.516Z cpu8:2510530)pfp_create_addr_entry_with_table: failed to allocate pf_fp_entry form pfr_fastpath_pl
2025-03-22T08:20:01.516Z cpu8:2510530)pfp_create_addr_entry_with_table: failed to allocate pf_fp_entry form pfr_fastpath_pl
2025-03-22T08:20:01.579Z cpu8:2510530)pfp_create_addr_entry_with_table: failed to allocate pf_fp_entry form pfr_fastpath_pl
2025-03-22T08:20:01.616Z cpu8:2510530)pfp_create_addr_entry_with_table: failed to allocate pf_fp_entry form pfr_fastpath_pl
2025-03-22T08:20:01.616Z cpu8:2510530)pfp_create_addr_entry_with_table: failed to allocate pf_fp_entry form pfr_fastpath_pl
...
2025-03-22T08:20:01.651Z cpu8:2510530)pfp_insert_ruleid: failed to allocate pf_rule_list form pfr_rulelist_pl (rule_id 2019)
2025-03-22T08:20:01.651Z cpu8:2510530)pfp_add: failed for rule
2025-03-22T08:20:01.651Z cpu8:2510530)pfp_del_ruleid: rule not found 2019 rs 5
2025-03-22T08:20:01.651Z cpu8:2510530)pfioctl: failed to add rules (0)
2025-03-22T08:20:01.651Z cpu8:2510530)VSIPConversionCreateRuleSet: Cannot insert #24 rule 2019: 22
^^ timestamp indicating OOM was hit, causing rule inserts to fail in the vmkernel logs.
2025-03-22T08:26:18.538Z cpu5:3696612)Backtrace for current CPU #5, worldID=3696612, fp=0x1
^^ timestamp of backtrace generation for PSOD
At 2025-03-22T07:58:35, the configuration exceeded the threshold value; an alarm is raised for this.
Thereafter, out-of-memory (OOM) error messages are seen in the vmkernel logs, eventually leading to a crash (PSOD).
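When triaging a support bundle, the sequence above (threshold event followed by pool allocation failures) can be pulled out of the logs mechanically. Below is a minimal sketch, assuming the vmkernel.log and NSX syslog files have been extracted locally; the match strings are taken verbatim from the excerpts above, and the file path is a placeholder.

# Minimal sketch: scan an extracted log file for the threshold and out-of-memory
# signatures shown above. The threshold event appears in the NSX syslog and the
# allocation failures in vmkernel.log, so the scan can be pointed at either file.
import re

SIGNATURES = [
    "vsip-fprules threshold event is raised",              # nsx-syslog threshold alarm
    "failed to allocate pf_fp_entry",                      # address-entry allocation failure
    "pfp_insert_ruleid: failed to allocate pf_rule_list",  # rule-list allocation failure
    "VSIPConversionCreateRuleSet: Cannot insert",          # rule insert failure
]

def scan(path):
    hits = {sig: 0 for sig in SIGNATURES}
    first_seen = {}
    with open(path, errors="replace") as f:
        for line in f:
            for sig in SIGNATURES:
                if sig in line:
                    hits[sig] += 1
                    # Record the first timestamp at which each signature appears.
                    ts = re.search(r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d+Z", line)
                    first_seen.setdefault(sig, ts.group(0) if ts else "unknown")
    for sig in SIGNATURES:
        print(f"{hits[sig]:6d}  first seen: {first_seen.get(sig, '-'):24s}  {sig}")

scan("vmkernel.log")   # placeholder path: point at the file from the support bundle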
Backtrace:
(gdb) bt
#0  pfr_kif_ktbp_compare (n2=0xa5a00, n1=0x453896818e08) at datapath/esx/modules/vsip/vsip_pf/pf/pf_table.c:256
#1  pfr_kif_ktbp_head_RB_FIND (head=head@entry=0x43297b372060, elm=elm@entry=0x453896818e08) at datapath/esx/modules/vsip/vsip_pf/pf/pf_table.c:250
#2  0x00004200246989f3 in pf_sort_addr_rules (kif=kif@entry=0x4328d5990e78, rlist=rlist@entry=0x453896818fc0, max_nr=max_nr@entry=0x453896818f88, snodes=snodes@entry=0x453896819200, scount=scount@entry=1, dnodes=0x453896819300, dcount=0, rs_num=1, pd=0x45389681ad20, rs=<optimized out>) at datapath/esx/modules/vsip/vsip_pf/pf/pf_policy_lookup.c:252
#3  0x000042002469afed in pfp_policy_lookup (kif=kif@entry=0x4328d5990e78, policy_lookup_ctrl=policy_lookup_ctrl@entry=0x453896819570, ruleset=0x4328d5a315c8, pd=pd@entry=0x45389681ad20, sport=<optimized out>, dport=<optimized out>, direction=1, ac=0x0, curr_attr_state=0x0, tm=0x453896819548) at datapath/esx/modules/vsip/vsip_pf/pf/pf_policy_lookup.c:967
#4  0x000042002463634d in pf_test_udp (rm=rm@entry=0x45389681ac80, jump_rm=jump_rm@entry=0x45389681ac90, ids_rm=ids_rm@entry=0x45389681ac88, sm=sm@entry=0x45389681ac98, prlist=prlist@entry=0x45795fc80800, direction=direction@entry=1, kif=0x4328d5990e78, m=0x45389681aec8, off=20, h=0x45790229f7ce, rlookup=1 '\001', rule_type=0, curr_attr_state=0x0, next_attr_state=0x45389681ac74, ac=0x0, sip_persist=0x45389681aca8, lb_ctx=0x0, reason=0x45389681ac68, pd=0x45389681ad20, ethtype=8, am=0x45389681ac78, rsm=0x45389681aca0, ifq=0x0, inp=0x0) at datapath/esx/modules/vsip/vsip_pf/pf/pf.c:6046
#5  0x00004200246460d9 in pf_test (dir=dir@entry=1, ifp=ifp@entry=0x4328d5990e08, m0=m0@entry=0x45389681ae90, eh=eh@entry=0x45790229f7c0, ethHdrLen=14, ethtype=ethtype@entry=8, inp=0x0, metadata=0x45389681aea0, check_only=0, flow_entry=0x45389681af80) at datapath/esx/modules/vsip/vsip_pf/pf/pf.c:13861
#6  0x00004200246cc870 in PFFilterPacket (cookie=0x4328d5990e08, fragsList=0x45389681b250, dvDir=VMK_DVFILTER_FROM_SWITCH, source=<optimized out>, verdict=0x45389681b4f8, checkStateOnly=<optimized out>, flowMetaData=0x45389681b648) at datapath/esx/modules/vsip/vsip_pf/pf_vmk/glue.c:3513
#7  0x00004200245ecd04 in VSIPDVFProcessPacketsInt (filterImpl=<optimized out>, pktList=<optimized out>, direction=<optimized out>, ensData=<optimized out>) at datapath/esx/modules/vsip/vsip_dvfilter.c:4067
#8  0x000042002323b495 in ?? ()
#9  0x000045389681b870 in ?? ()
#10 0x0000000000000001 in ?? ()
#11 0x0000000000000001 in ?? ()
#12 0x00004302ea0530f0 in ?? ()
#13 0x00004302ea052e00 in ?? ()
#14 0x000043016e02a390 in ?? ()
#15 0x000045389681b930 in ?? ()
#16 0x0000420022829042 in IOChain_Resume (port=0x45389681b930, port@entry=0x4302ea052e00, chain=0x453891c9f000, chain@entry=0x4302ea0530f0, prevLink=prevLink@entry=0x0, pktList=0x45389681b840, pktList@entry=0x45389681b930, remainingPktList=remainingPktList@entry=0x45389681b910) at bora/vmkernel/net/iochain.c:881
#17 0x000042002286d5b3 in PortOutput (port=port@entry=0x4302ea052e00, prev=prev@entry=0x0, pktList=pktList@entry=0x45795ebfe1a8) at bora/vmkernel/net/port.c:4049
#18 0x00004200228a84c6 in Port_Output (pktList=0x45795ebfe1a8, port=0x4302ea052e00) at bora/vmkernel/net/port.h:976
#19 vmk_PortOutput (portID=portID@entry=100663338, pktList=pktList@entry=0x45795ebfe1a8, mayModify=mayModify@entry=0 '\000') at bora/vmkernel/net/vmkapi_net_port.c:375
#20 0x00004200243044a0 in VSwitchPortOutput (mayModify=0 '\000', pktList=0x45795ebfe1a8, dstPortId=100663338, srcPortId=100663355) at datapath/esx/modules/vswitch/vswitch.c:5754
#21 VSwitchForwardLeafPorts (ps=ps@entry=0x4302ea003138, srcPortId=srcPortId@entry=100663355, dispatchData=dispatchData@entry=0x45795ebfe078, completionList=completionList@entry=0x45389681bab0) at datapath/esx/modules/vswitch/vswitch.c:6089
#22 0x000042002430b949 in VSwitchPortDispatch (ps=<optimized out>, pktList=<optimized out>, srcPortID=<optimized out>) at datapath/esx/modules/vswitch/vswitch.c:7607
#23 0x000042002286d7e6 in Portset_Input (pktList=0x45389681bdd0, port=0x4302ea07edc0) at bora/vmkernel/net/portset.h:1720
#24 Port_InputResume (port=port@entry=0x4302ea07edc0, prev=prev@entry=0x0, pktList=pktList@entry=0x45389681bdd0) at bora/vmkernel/net/port.c:4134
#25 0x000042002286d8ce in Port_Input (port=port@entry=0x4302ea07edc0, pktList=pktList@entry=0x45389681bdd0) at bora/vmkernel/net/port.c:2769
#26 0x00004200228dcd69 in Vmxnet3VMKDevTQDoTx (port=port@entry=0x4302ea07edc0, tqIdx=tqIdx@entry=0, callSite=<optimized out>) at bora/vmkernel/net/vmxnet3_vmkdev.c:4737
#27 0x00004200228e62ec in Vmxnet3VMKDev_AsyncTx (port=0x4302ea07edc0, qidMap=0) at bora/vmkernel/net/vmxnet3_vmkdev.c:5314
#28 0x0000420022858122 in NetWorldPerVMCBInt (vmInfo=0x430051a68540) at bora/vmkernel/net/net_world.c:56
#29 NetWorldPerVMCB (data=0x430051a68540) at bora/vmkernel/net/net_world.c:139
#30 0x00004200229b492a in CpuSched_StartWorld (destWorld=<optimized out>, previous=<optimized out>) at bora/vmkernel/sched/cpusched.c:12073
#31 0x00004200226c4cc0 in ?? () at bora/vmkernel/main/debug.c:4019
#32 0x0000000000000000 in ?? ()
The issue is fixed in 4.2 and later releases; any release prior to 4.2 is exposed to this issue.
Workaround:
1. Ensure the overall number of rules on the ESXi host does not exceed the limit documented in the Broadcom configuration maximums guide.
2. Move the VMs on the cluster that is being upgraded to the DFW exclusion list (see the sketch after this list).
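For workaround 2, the exclusion-list change can also be made through the NSX Policy API. The sketch below is a minimal, unofficial example, assuming the exclusion-list endpoint /policy/api/v1/infra/settings/firewall/security/exclude-list, basic authentication, and a pre-created group (here hypothetically named upgrade-cluster-vms) containing the VMs of the cluster being upgraded; verify the endpoint, member-path format, and payload against the API guide for the NSX version in use, and remove the group from the list again once the upgrade completes.

# Minimal sketch (not an official procedure): add a group containing the VMs of the
# cluster being upgraded to the DFW exclusion list via the Policy API, preserving any
# existing members. Manager FQDN, credentials, and group name are assumptions.
import requests

NSX_MGR = "https://nsx-manager.example.com"   # hypothetical manager FQDN
AUTH = ("admin", "password")                  # replace with real credentials
GROUP_PATH = "/infra/domains/default/groups/upgrade-cluster-vms"  # hypothetical group

url = f"{NSX_MGR}/policy/api/v1/infra/settings/firewall/security/exclude-list"

# Read the current exclusion list so existing members are preserved.
resp = requests.get(url, auth=AUTH, verify=False)
resp.raise_for_status()
members = resp.json().get("members", [])

if GROUP_PATH not in members:
    members.append(GROUP_PATH)
    # PATCH the updated member list back; remove the group again after the upgrade.
    patch = requests.patch(url, auth=AUTH, verify=False, json={"members": members})
    patch.raise_for_status()
    print("Group added to the DFW exclusion list.")
else:
    print("Group is already on the DFW exclusion list.")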