Relevant log location
Log indicating that an nginx coredump is generated
/var/log/syslog
2022-08-13T11:04:24.986Z edge-########.local NSX 22492 - [nsx@6876 comp="nsx-edge" subcomp="node-mgmt" username="root" level="WARNING"] Core file generated: /var/log/core/core.nginx.1660388664.16937.134.11.gz
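To confirm that a core file has been generated, the same message can be searched for in syslog and the core directory listed, for example:
grep "Core file generated" /var/log/syslog
ls -l /var/log/core/ | grep core.nginx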
Log indicating the load-balancer CPU usage is very high just after the generation of the coredump
/var/log/syslog
2022-08-13T11:09:12.986Z edge##.####.local NSX 2781 - [nsx@6876 comp="nsx-edge" s2comp="nsx-monitoring" entId="#############################################" tid="3145" level="WARNING" eventState="On" eventFeatureName="load_balancer" eventSev="warning" eventType="lb_cpu_very_high"] The CPU usage of load balancer ############################################# is very high. The threshold is 95%.
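The corresponding load balancer CPU alarm events can be located in the same log, for example:
grep "lb_cpu_very_high" /var/log/syslog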
edge> get processes
top - #####x up xx days, 17:09, 0 users, load average: 2.86, 2.16, 1.73
Tasks: 268 total, 4 running, 166 sleeping, 0 stopped, 14 zombie
%Cpu(s): 2.6 us, 2.6 sy, 0.0 ni, 94.6 id, 0.1 wa, 0.0 hi, 0.1 si, 0.0 st
KiB Mem : 32734844 total, 6939616 free, 18403544 used, 7391684 buff/cache
KiB Swap: 0 total, 0 free, 0 used. 13488368 avail Mem
/opt/vmware/nsx-netopa/bin/agent.py
14111 lb 20 0 598552 84140 3672 R 100.0 0.3 3:56.79 14111 nginx: worker process
15422 lb 20 0 624816 83956 3404 R 100.0 0.3 0:45.09 15422 nginx: worker process
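The same view can also be captured from the Edge root shell with a one-shot top sorted by CPU (this mirrors the top invocation visible in the top-mem.log output below; the nginx filter is only illustrative):
top -n 1 -b -c -o %CPU -w180 | grep -i nginx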
From the core file name and /var/log/vmware/top-mem.log, you can tell that the crashed process is the L4 CP process (an nginx child process):
top - xxxx up xx days, 8:11, 0 users, load average: 2.68, 1.29, 0.87
Tasks: 342 total, 4 running, 338 sleeping, 0 stopped, 0 zombie
%Cpu(s): 20.6 us, 3.8 sy, 0.0 ni, 75.6 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem : 65360864 total, 7890508 free, 37557284 used, 19913072 buff/cache
KiB Swap: 0 total, 0 free, 0 used. 16626536 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ TGID COMMAND
3443132 lb 20 0 8768776 111728 91212 R 100.0 0.2 0:50.86 3443132 nginx: L4LB process
3443133 lb 20 0 8768776 112264 91748 R 100.0 0.2 0:50.65 3443133 nginx: L4LB CP process
3444375 root 20 0 3264 1716 1168 R 100.0 0.0 0:50.90 3444375 /bin/gzip
9470 root 20 0 129.5g 297900 107316 S 58.8 0.5 33352:46 9470 /opt/vmware/nsx-edge/sbin/datapathd --no-chdir --unixctl=/var/run/vmware/edge/dpd.ctl --pidfile=/va+
3444825 root 20 0 9700 3800 3092 R 11.8 0.0 0:00.04 3444825 top -n 1 -b -c -o %CPU -w180
1571 nsx-sha 20 0 4733460 88660 22360 S 5.9 0.1 1634:08 1571 /opt/vmware/nsx-netopa/libexec/python-3/bin/python3 /opt/vmware/nsx-netopa/bin/agent.py
8535 root 20 0 126996 18352 16348 S 5.9 0.0 755:09.40 8535 /opt/vmware/nsx-edge/bin/nsd
3443134 lb 20 0 8768776 3.5g 3.5g S 5.9 5.7 0:03.98 3443134 nginx: L4LB CP process
1 root 20 0 171560 13368 8488 S 0.0 0.0 17:20.62 1 /sbin/init splash
2 root 20 0 0 0 0 S 0.0 0.0 0:03.61 2 [kthreadd]
3 root 0 -20 0 0 0 I 0.0 0.0 0:00.00 3 [rcu_gp]
4 root 0 -20 0 0 0 I 0.0 0.0 0:00.00 4 [rcu_par_gp]
5 root 0 -20 0 0 0 I 0.0 0.0 0:00.00 5 [slub_flushwq]
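The historical samples in /var/log/vmware/top-mem.log can be filtered for the load balancer processes, for example:
grep "nginx: L4LB" /var/log/vmware/top-mem.log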
When the persistence profile is removed from a VIP, the persistence rbtree and hash list of that VIP are released. However, existing L4 connections on the VIP still hold persistence entries that were inserted in the hash list.
In the fix for that bug, when the persistence profile is removed from the VIP, all L4 connections of the VIP are traversed so that the persistence entries they use can be unreferenced; the entries are then deleted from the hash list, so the hash list is not accessed again after it has been released.
However, there is a race condition: while the master process is performing this traversal, a new L4 connection creation or deletion request can arrive on the L4 CP process, which then modifies the same L4 connection data structure that the master process is updating to unreference the persistence entry. This concurrent modification leaves the connection in an invalid state, which causes the coredump.
Permanent fix:
NSX-T 3.2.4.3
NSX-T 4.2.2.2
NSX-T 4.2.3.1
NSX-T 9.0.1
Proactive Approach:
No log line is seen before the crash; hence, no proactive action can be taken for this condition.
Reactive Approach:
Once a crash has occurred and a coredump is generated, perform a manual failover to reduce potential downtime, because after the CP (child process) crash the CPU usage remains high until the MP (master process) also crashes.
To resolve the issue temporarily on the affected Edge nodes, the nginx process needs to be restarted completely. There are multiple options to perform this; any one of them can be used:
1. Put the active Edge node into maintenance mode and then exit maintenance mode
2. Restart the Edge node
3. Restart the load balancer Docker container: docker ps | grep edge-lb | awk '{print $1}' | xargs docker restart
Long-term Workaround:
Change the pool selection algorithm from round robin to IP hash and disable persistence on all virtual servers.
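As an illustrative sketch only (the manager address, credentials, and pool ID are placeholders, and the payload should be validated against the NSX Policy API reference for the deployed version), the pool algorithm can be switched to IP hash with a Policy API PATCH similar to:
curl -k -u 'admin:<password>' -X PATCH \
  'https://<nsx-manager>/policy/api/v1/infra/lb-pools/<pool-id>' \
  -H 'Content-Type: application/json' \
  -d '{"algorithm": "IP_HASH"}'
Persistence can then be disabled by removing the persistence profile from each affected virtual server (via the NSX UI or by updating the virtual server configuration).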