Symptoms:
Relevant log location
Log indicating that an nginx core dump was generated
/var/log/syslog
2022-08-13T11:04:24.986Z edge-02.corp.local NSX 22492 - [nsx@6876 comp="nsx-edge" subcomp="node-mgmt" username="root" level="WARNING"] Core file generated: /var/log/core/core.nginx.1660388664.16937.134.11.gz
Log indicating that load balancer CPU usage is very high shortly after the core dump is generated
/var/log/syslog
2022-08-13T11:09:12.986Z edge-02.corp.local NSX 2781 - [nsx@6876 comp="nsx-edge" s2comp="nsx-monitoring" entId="a8xxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxx71" tid="3145" level="WARNING" eventState="On" eventFeatureName="load_balancer" eventSev="warning" eventType="lb_cpu_very_high"] The CPU usage of load balancer a8xxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxx71 is very high. The threshold is 95%.
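To check an Edge node for both log signatures at once, something like the following sketch may help. It only counts lines matching the two messages shown above. The function name check_lb_crash_signs is hypothetical (it is not an NSX command), and the match patterns are assumptions derived from the sample log lines, so they may need adjusting for other releases.

```shell
#!/bin/sh
# Sketch only: count the two syslog signatures described in the Symptoms.
# check_lb_crash_signs is a hypothetical helper name, not part of NSX.
check_lb_crash_signs() {
    log_file="${1:-/var/log/syslog}"

    if [ ! -r "$log_file" ]; then
        echo "cannot read $log_file" >&2
        return 2
    fi

    # Lines reporting that an nginx core file was generated.
    # grep -c exits non-zero when the count is 0, hence "|| true".
    core_count=$(grep -c 'Core file generated: .*core\.nginx\.' "$log_file" || true)

    # Lines reporting the lb_cpu_very_high event.
    cpu_count=$(grep -c 'eventType="lb_cpu_very_high"' "$log_file" || true)

    echo "nginx core dumps: $core_count"
    echo "lb_cpu_very_high events: $cpu_count"

    # Succeed only when both signatures are present.
    [ "$core_count" -gt 0 ] && [ "$cpu_count" -gt 0 ]
}
```

Calling check_lb_crash_signs with no argument reads /var/log/syslog; a different file (for example a rotated log) can be passed as the first argument. A zero exit status means both signatures were found.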
edge-02> get processes
top - 09:56:10 up 34 days, 17:09, 0 users, load average: 2.86, 2.16, 1.73
Tasks: 268 total, 4 running, 166 sleeping, 0 stopped, 14 zombie
%Cpu(s): 2.6 us, 2.6 sy, 0.0 ni, 94.6 id, 0.1 wa, 0.0 hi, 0.1 si, 0.0 st
KiB Mem : 32734844 total, 6939616 free, 18403544 used, 7391684 buff/cache
KiB Swap: 0 total, 0 free, 0 used. 13488368 avail Mem
/opt/vmware/nsx-netopa/bin/agent.py
14111 lb 20 0 598552 84140 3672 R 100.0 0.3 3:56.79 14111 nginx: worker process
15422 lb 20 0 624816 83956 3404 R 100.0 0.3 0:45.09 15422 nginx: worker process
Environment:
VMware NSX-T Data Center
Cause:
A shared memory region has been removed from the L4LB CP nginx, but the queue node pointer of a persistence session still points to an address inside that removed region.
In other words, the persistence session is still linked into a queue that belonged to the removed shared memory. When the persistence session is freed, nginx tries to remove it from that queue, touches the stale address, and the load balancer crashes.
Resolution:
Upgrade to NSX-T 3.2.1.2 or 4.1.0 or later.
Workaround:
Temporary Workarounds
The nginx process must be fully restarted to resolve the issue temporarily on the affected Edge nodes. Any one of the following options can be used:
1. Place the active Edge node into maintenance mode, then exit maintenance mode
2. Restart the Edge node
3. Restart the load balancer Docker container - docker ps | grep edge-lb | awk '{print $1}' | xargs docker restart
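A slightly more defensive form of option 3 is sketched below. It assumes the container is listed by docker ps and that its row contains the string edge-lb, as the one-line pipeline implies. The helper names edge_lb_container_ids and restart_edge_lb are hypothetical. The only difference from the one-liner is that nothing is restarted when no matching container is found.

```shell
#!/bin/sh
# Sketch only: restart the load balancer container if one is found.
# Both function names are hypothetical helpers, not NSX commands.

# Reads `docker ps` output on stdin and prints the first column
# (the container ID) of every row that mentions edge-lb.
edge_lb_container_ids() {
    awk '/edge-lb/ {print $1}'
}

restart_edge_lb() {
    ids=$(docker ps | edge_lb_container_ids)
    if [ -z "$ids" ]; then
        echo "no edge-lb container found; nothing restarted" >&2
        return 1
    fi
    # $ids is intentionally unquoted so each ID becomes its own argument.
    docker restart $ids
}
```

Running restart_edge_lb on the affected Edge node performs the restart; edge_lb_container_ids can be used on its own (docker ps | edge_lb_container_ids) to preview which container would be restarted.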
Long-term Workaround
Change the pool selection algorithm from round robin to IP hash, and disable persistence on all virtual servers.