The NSX Edge datapath dp-fw-purge threads crash frequently with DPDK PANIC/assertion failures due to firewall state corruption during the purge of expired connection states
Symptoms
<Timestamp> <NSX-Edge> NSX 1534696 FIREWALL [nsx@6876 comp="nsx-edge" subcomp="datapathd" s2comp="firewalldp" tname="dp-fw-purge10" level="INFO"] pf_state_list_op_error:error in op:REMOVE for state:0x******** id:<id> <int id> state_list:0x******** target_list:(nil) core_id:2 local_flags:1 state_flags:65671 timeout:22 ref_flags:0 kif:0x******** refcount:1 lookup_cnt:0 rule_id:0 sync_flags: **** lb_flags: 0
Example snippet of the backtrace
(gdb) bt
#0 0x************ in pthread_kill () from /lib/x86_64-linux-gnu/libc.so.6
#1 0x************ in raise () from /lib/x86_64-linux-gnu/libc.so.6
#2 0x************ in abort () from /lib/x86_64-linux-gnu/libc.so.6
#3 0x************ in __rte_panic (funcname=funcname@entry=0x************ <__func__.******> "dpdk_panic", format=format@entry=0x************ "assert failed%.0s") at ../lib/eal/common/eal_common_debug.c:20
#4 0x************ in dpdk_panic () at datapath/pf/pf_glue/glue_dpdk.c:492
#5 0x************ in panic (fmt=fmt@entry=0x************ "line %d\tassert \"cur->nat_rule.ptr->states >= 1\" failed\n") at datapath/pf/pf_glue/glue.c:200
#6 0x************ in pf_free_state (kif=kif@entry=0x************, cur=0x************) at datapath/pf/pf/pf.c:4267
#7 0x************ in pf_purge_expired_states (kif=kif@entry=0x************, maxcheck=****, maxcheck@entry=****, all=all@entry=1, coreid=1) at datapath/pf/pf/pf.c:4549
#8 0x************ in dpdk_purge_state (cookie=cookie@entry=0x************, coreId=<optimized out>) at datapath/pf/pf_glue/glue.c:2204
#9 0x************ in firewall_sp_purge_timer_callback (timer=0x*, cb=<optimized out>) at datapath/firewall.c:8152
#10 firewall_purge_thread (args=<optimized out>) at datapath/firewall.c:8214
#11 0x************ in ovsthread_wrapper (aux_=0x************) at edge/openvswitch/lib/ovs-thread.c:296
#12 0x************ in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#13 0x************ in ?? () from /lib/x86_64-linux-gnu/libc.so.6
Environment
NSX-T 4.2.1.4
Cause
A race condition occurs between the fast-path packet processing thread and the purge thread while state entries are moved between lists. Specifically, when the packet processing thread moves an entry from the "state list" to the "unlink state list," it does not adjust the cursor. When the purge thread later moves expired states, the stale cursor causes it to traverse the wrong list. As a result, a rule's connection count drops to zero or a negative value, the assertion shown in the backtrace fails, and the datapath core dumps.
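The stale-cursor problem can be illustrated with a small single-threaded simulation. This is only a sketch under stated assumptions, not NSX or pf source code: every name here (Rule, State, StateList, simulate, RulePanic) is hypothetical, the interleaving of the two threads is replayed deterministically, and the double release of one state is an assumed mechanism for how a rule's count could fall too low.

```python
class RulePanic(Exception):
    """Hypothetical stand-in for the datapath assertion failure."""


class Rule:
    def __init__(self):
        self.states = 0  # number of live states that reference this rule


class State:
    def __init__(self, name, rule):
        self.name, self.rule = name, rule
        self.nxt = None    # intrusive "next" pointer
        self.owner = None  # list that currently holds this state


class StateList:
    def __init__(self):
        self.head = None

    def append(self, st):
        st.nxt, st.owner = None, self
        if self.head is None:
            self.head = st
            return
        node = self.head
        while node.nxt is not None:
            node = node.nxt
        node.nxt = st

    def remove(self, st):
        if self.head is st:
            self.head = st.nxt
        else:
            node = self.head
            while node.nxt is not st:
                node = node.nxt
            node.nxt = st.nxt
        st.nxt, st.owner = None, None


def free_state(st):
    # Same kind of invariant as the assertion in the backtrace.
    if st.rule.states < 1:
        raise RulePanic(f"rule.states >= 1 failed while freeing {st.name}")
    st.rule.states -= 1


def purge_from(node, state_list, errors):
    # Purge every state reachable from the cursor.
    while node is not None:
        nxt = node.nxt
        if node.owner is state_list:
            state_list.remove(node)
        else:
            errors.append(f"REMOVE failed for {node.name}")
        free_state(node)
        node = nxt


def simulate(adjust_cursor):
    rule, errors = Rule(), []
    state_list, unlink_list = StateList(), StateList()
    s0, s1, s2 = (State(n, rule) for n in ("s0", "s1", "s2"))
    for st in (s0, s1, s2):
        state_list.append(st)
        rule.states += 1

    cursor = s1  # the purge pass is paused here
    # Packet-processing side: move s1 to the unlink list.
    if adjust_cursor:
        cursor = cursor.nxt  # step the cursor past the entry being moved
    state_list.remove(s1)
    unlink_list.append(s1)

    purge_from(cursor, state_list, errors)  # purge pass resumes
    while unlink_list.head is not None:     # unlink list is drained
        st = unlink_list.head
        unlink_list.remove(st)
        free_state(st)
    purge_from(state_list.head, state_list, errors)  # later full pass
    return rule.states, errors


if __name__ == "__main__":
    print("cursor adjusted:", simulate(adjust_cursor=True))
    try:
        simulate(adjust_cursor=False)
    except RulePanic as exc:
        print("stale cursor:", exc)
```

In this sketch, the run with the adjusted cursor releases each state exactly once and the rule count ends at zero. With the stale cursor, the moved entry is released twice, so the count reaches zero while a state is still live and the invariant check fails on the next release.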
Resolution
This issue is scheduled to be resolved in a future version of VMware NSX. There is currently no known workaround.