Relevant log location
Log indicating that an nginx coredump is generated
/var/log/syslog
2022-08-13T11:04:24.986Z edge-########.local NSX 22492 - [nsx@6876 comp="nsx-edge" subcomp="node-mgmt" username="root" level="WARNING"] Core file generated: /var/log/core/core.nginx.1660388664.16937.134.11.gz
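To confirm that a core file has been generated, the same message can be searched for in syslog and the core directory listed, for example:
grep "Core file generated" /var/log/syslog
ls -l /var/log/core/ | grep core.nginx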
Log indicating the load-balancer CPU usage is very high just after the generation of the coredump
/var/log/syslog
2022-08-13T11:09:12.986Z edge##.####.local NSX 2781 - [nsx@6876 comp="nsx-edge" s2comp="nsx-monitoring" entId="#############################################" tid="3145" level="WARNING" eventState="On" eventFeatureName="load_balancer" eventSev="warning" eventType="lb_cpu_very_high"] The CPU usage of load balancer ############################################# is very high. The threshold is 95%.
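The corresponding load balancer CPU alarm events can be located in the same log, for example:
grep "lb_cpu_very_high" /var/log/syslog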
edge> get processes
top - #####x up xx days, 17:09, 0 users, load average: 2.86, 2.16, 1.73
Tasks: 268 total, 4 running, 166 sleeping, 0 stopped, 14 zombie
%Cpu(s): 2.6 us, 2.6 sy, 0.0 ni, 94.6 id, 0.1 wa, 0.0 hi, 0.1 si, 0.0 st
KiB Mem : 32734844 total, 6939616 free, 18403544 used, 7391684 buff/cache
KiB Swap: 0 total, 0 free, 0 used. 13488368 avail Mem
/opt/vmware/nsx-netopa/bin/agent.py
14111 lb 20 0 598552 84140 3672 R 100.0 0.3 3:56.79 14111 nginx: worker process
15422 lb 20 0 624816 83956 3404 R 100.0 0.3 0:45.09 15422 nginx: worker process
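The same view can also be captured from the Edge root shell with a one-shot top sorted by CPU (this mirrors the top invocation visible in the top-mem.log output below; the nginx filter is only illustrative):
top -n 1 -b -c -o %CPU -w180 | grep -i nginx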
From the core file name and /var/log/vmware/top-mem.log, you can tell that the crashed process is the L4 CP process (an nginx child process):
top - xxxx up xx days, 8:11, 0 users, load average: 2.68, 1.29, 0.87
Tasks: 342 total, 4 running, 338 sleeping, 0 stopped, 0 zombie
%Cpu(s): 20.6 us, 3.8 sy, 0.0 ni, 75.6 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem : 65360864 total, 7890508 free, 37557284 used, 19913072 buff/cache
KiB Swap: 0 total, 0 free, 0 used. 16626536 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ TGID COMMAND
3443132 lb 20 0 8768776 111728 91212 R 100.0 0.2 0:50.86 3443132 nginx: L4LB process
3443133 lb 20 0 8768776 112264 91748 R 100.0 0.2 0:50.65 3443133 nginx: L4LB CP process
3444375 root 20 0 3264 1716 1168 R 100.0 0.0 0:50.90 3444375 /bin/gzip
9470 root 20 0 129.5g 297900 107316 S 58.8 0.5 33352:46 9470 /opt/vmware/nsx-edge/sbin/datapathd --no-chdir --unixctl=/var/run/vmware/edge/dpd.ctl --pidfile=/va+
3444825 root 20 0 9700 3800 3092 R 11.8 0.0 0:00.04 3444825 top -n 1 -b -c -o %CPU -w180
1571 nsx-sha 20 0 4733460 88660 22360 S 5.9 0.1 1634:08 1571 /opt/vmware/nsx-netopa/libexec/python-3/bin/python3 /opt/vmware/nsx-netopa/bin/agent.py
8535 root 20 0 126996 18352 16348 S 5.9 0.0 755:09.40 8535 /opt/vmware/nsx-edge/bin/nsd
3443134 lb 20 0 8768776 3.5g 3.5g S 5.9 5.7 0:03.98 3443134 nginx: L4LB CP process
1 root 20 0 171560 13368 8488 S 0.0 0.0 17:20.62 1 /sbin/init splash
2 root 20 0 0 0 0 S 0.0 0.0 0:03.61 2 [kthreadd]
3 root 0 -20 0 0 0 I 0.0 0.0 0:00.00 3 [rcu_gp]
4 root 0 -20 0 0 0 I 0.0 0.0 0:00.00 4 [rcu_par_gp]
5 root 0 -20 0 0 0 I 0.0 0.0 0:00.00 5 [slub_flushwq]
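The historical samples in /var/log/vmware/top-mem.log can be filtered for the load balancer processes, for example:
grep "nginx: L4LB" /var/log/vmware/top-mem.log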
When the persistence profile is removed from a VIP, the persistence rbtree and hash list of that VIP are released. However, existing L4 connections on the VIP still hold persistence entries that were inserted in the hash list.
In the fix for that bug, when the persistence profile is removed from the VIP, all L4 connections of the VIP are traversed so that the persistence entries they use can be unreferenced; the entries are then deleted from the hash list, so the hash list is not accessed again after it has been released.
However, there is a race condition: while the master process is performing this traversal, a new L4 connection creation or deletion request can arrive on the L4 CP process, which then modifies the same L4 connection data structure that the master process is updating to unreference the persistence entry. This concurrent modification leaves the connection in an invalid state, which causes the coredump.
Permanent fix:
NSX-T 3.2.4.3
NSX-T 4.2.2.2
NSX-T 4.2.3.1
NSX-T 9.0.1
Proactive Approach:
No log line is seen before the crash; hence, no proactive action can be taken for this condition.
Reactive Approach:
Once a crash has occurred and a coredump is generated, perform a manual failover to reduce potential downtime, because after the CP (child process) crash the CPU usage remains high until the MP (master process) also crashes.
To resolve the issue temporarily on the affected Edge nodes, the nginx process needs to be restarted completely. There are multiple options to perform this; any one of them can be used:
1. Put the active Edge node into maintenance mode and then exit maintenance mode
2. Restart the Edge node
3. Restart the load balancer Docker container: docker ps | grep edge-lb | awk '{print $1}' | xargs docker restart
Long-term Workaround:
Change the pool selection algorithm from round robin to IP hash and disable persistence on all virtual servers.
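As an illustrative sketch only (the manager address, credentials, and pool ID are placeholders, and the payload should be validated against the NSX Policy API reference for the deployed version), the pool algorithm can be switched to IP hash with a Policy API PATCH similar to:
curl -k -u 'admin:<password>' -X PATCH \
  'https://<nsx-manager>/policy/api/v1/infra/lb-pools/<pool-id>' \
  -H 'Content-Type: application/json' \
  -d '{"algorithm": "IP_HASH"}'
Persistence can then be disabled by removing the persistence profile from each affected virtual server (via the NSX UI or by updating the virtual server configuration).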