NSX Edge Datapath Process (datapathd) Restarts with DPDK Panic in dp-fw-purge Threads

Article ID: 436126

Updated On:

Products

VMware vDefend Firewall

Issue/Introduction

The dp-fw-purge threads of the NSX Edge datapath process (datapathd) crash frequently with DPDK panic/assertion failures. The crashes are caused by firewall state corruption that occurs while expired connection states are being purged.

Symptoms

  • Frequent restarts of the datapathd process on VMware NSX Edge nodes.
  • Momentary disruption of network traffic passing through the affected Edge node during the service restart.
  • Presence of core dump files in the /var/log/core directory of the Edge node, typically following the naming convention core.dp-fw-purge.
  • Edge syslogs contain error messages matching the following pattern:

<Timestamp> <NSX-Edge> NSX 1534696 FIREWALL [nsx@6876 comp="nsx-edge" subcomp="datapathd" s2comp="firewalldp" tname="dp-fw-purge10" level="INFO"] pf_state_list_op_error:error in op:REMOVE for state:0x******** id:<id> <int id> state_list:0x******** target_list:(nil) core_id:2 local_flags:1 state_flags:65671 timeout:22 ref_flags:0 kif:0x******** refcount:1 lookup_cnt:0 rule_id:0 sync_flags: **** lb_flags: 0

  • The core file backtrace contains the functions pthread_kill (), dpdk_panic (), pf_purge_expired_states, pf_free_state, dpdk_purge_state, and firewall_sp_purge_timer_callback.
Example snippet of the backtrace:

(gdb) bt
#0  0x************ in pthread_kill () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x************ in raise () from /lib/x86_64-linux-gnu/libc.so.6
#2  0x************ in abort () from /lib/x86_64-linux-gnu/libc.so.6
#3  0x************ in __rte_panic (funcname=funcname@entry=0x************ <__func__.******> "dpdk_panic", format=format@entry=0x************ "assert failed%.0s") at ../lib/eal/common/eal_common_debug.c:20
#4  0x************ in dpdk_panic () at datapath/pf/pf_glue/glue_dpdk.c:492
#5  0x************ in panic (fmt=fmt@entry=0x************ "line %d\tassert \"cur->nat_rule.ptr->states >= 1\" failed\n") at datapath/pf/pf_glue/glue.c:200
#6  0x************ in pf_free_state (kif=kif@entry=0x************, cur=0x************) at datapath/pf/pf/pf.c:4267
#7  0x************ in pf_purge_expired_states (kif=kif@entry=0x************, maxcheck=****, maxcheck@entry=****, all=all@entry=1, coreid=1) at datapath/pf/pf/pf.c:4549
#8  0x************ in dpdk_purge_state (cookie=cookie@entry=0x************, coreId=<optimized out>) at datapath/pf/pf_glue/glue.c:2204
#9  0x************ in firewall_sp_purge_timer_callback (timer=0x*, cb=<optimized out>) at datapath/firewall.c:8152
#10 firewall_purge_thread (args=<optimized out>) at datapath/firewall.c:8214
#11 0x************ in ovsthread_wrapper (aux_=0x************) at edge/openvswitch/lib/ovs-thread.c:296
#12 0x************ in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#13 0x************ in ?? () from /lib/x86_64-linux-gnu/libc.so.6
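
The syslog pattern listed in the symptoms can be checked for with a short script. The following is a minimal sketch, not a supported tool: the regular expression and the function name find_purge_errors are illustrative assumptions derived only from the sample log line above, and the sample string in the usage lines is an abbreviated copy of that line.

```python
import re

# Illustrative pattern (an assumption based on the sample log line):
# captures the purge thread name, the failed list operation and the core id.
PATTERN = re.compile(
    r'tname="(?P<thread>dp-fw-purge\d*)"'
    r'.*?pf_state_list_op_error:error in op:(?P<op>\w+)'
    r'.*?core_id:(?P<core_id>\d+)'
)


def find_purge_errors(lines):
    """Yield (thread, op, core_id) for every line that matches the pattern."""
    for line in lines:
        match = PATTERN.search(line)
        if match:
            yield match.group("thread"), match.group("op"), int(match.group("core_id"))


# Abbreviated copy of the sample message, used only to exercise the function.
sample = (
    'subcomp="datapathd" s2comp="firewalldp" tname="dp-fw-purge10" level="INFO"] '
    'pf_state_list_op_error:error in op:REMOVE for state:0x0 id:0 '
    'target_list:(nil) core_id:2 local_flags:1'
)
print(list(find_purge_errors([sample, "unrelated line"])))
```

To scan a saved copy of the Edge syslog, pass an open file object instead of the list, for example find_purge_errors(open(path)) where path is wherever the log was exported to.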

Environment

NSX-T 4.2.1.4

Cause

A race condition occurs between the fast path packet processing thread and the purge thread while state entries are being moved between lists. Specifically, when the packet processing thread moves an entry from the "state list" to the "unlink state list," it fails to adjust the cursor. When the purge thread later tries to move expired states, the stale cursor causes it to traverse the wrong list. As a result, a rule's connection count drops to 0 or to a negative value, which triggers the assertion failure and the core dump.
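
The effect of a stale cursor can be illustrated with a small single-threaded model that replays one fixed interleaving. This is a toy sketch and not the NSX implementation: the classes Rule and State, the functions release and simulate, and in particular the assumption that a state's rule reference is already dropped when it is moved to the unlink list, are all hypothetical and introduced only for illustration.

```python
class Rule:
    """Hypothetical stand-in for a firewall rule with a state counter."""

    def __init__(self):
        self.states = 0


class State:
    """Hypothetical connection state linked into a singly linked list."""

    def __init__(self, name, rule):
        self.name = name
        self.rule = rule
        self.next = None
        rule.states += 1


def release(state):
    # Same shape as the failed check in the backtrace (states >= 1).
    if state.rule.states < 1:
        raise AssertionError(f"states >= 1 failed while freeing {state.name}")
    state.rule.states -= 1


def simulate(adjust_cursor):
    rule = Rule()
    a, b, c, u = (State(name, rule) for name in "abcu")
    a.next, b.next = b, c      # state list:  a -> b -> c
    unlink_head = u            # unlink list: u
    release(u)                 # ASSUMPTION: unlinking already dropped u's reference

    # Purge thread frees a, then pauses with its cursor on b.
    release(a)
    cursor = b

    # Fast path moves b to the head of the unlink list.
    if adjust_cursor:
        cursor = b.next        # the adjustment that the faulty path skips
    b.next = unlink_head
    unlink_head = b
    release(b)                 # ASSUMPTION: reference dropped on unlink

    # Purge thread resumes from its cursor and frees everything it reaches.
    node = cursor
    while node is not None:
        following = node.next
        release(node)
        node = following
    return rule.states


print("with cursor adjustment:", simulate(adjust_cursor=True))
try:
    simulate(adjust_cursor=False)
except AssertionError as exc:
    print("without cursor adjustment:", exc)
```

With the cursor adjusted, the purge pass continues on the state list and the counter ends at a consistent value. Without the adjustment, the pass follows b into the unlink list and releases entries a second time, so the counter reaches 0 while states are still being freed and the check fails.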

Resolution

This issue is scheduled to be resolved in a future version of VMware NSX. Currently, there is no known workaround.