NSX-T Edge Load Balancer crashes with high CPU producing multiple nginx core dumps on Edge - LB applications impacted

Article ID: 322040

Products

VMware NSX

Issue/Introduction

Symptoms:

NSX-T Data Center version < 3.2.1.2
NSX-T Load Balancer configured with source IP persistence profile
UI reports high CPU usage on the Edge
nginx core dumps are generated, followed by high (up to 100%) CPU usage on the nginx worker process
Clients cannot connect to the backend servers through the load balancer
CPU usage of the load balancer process spikes to 100%

Relevant log location
Log indicating that an nginx core dump was generated
/var/log/syslog
2022-08-13T11:04:24.986Z edge-02.corp.local NSX 22492 - [nsx@6876 comp="nsx-edge" subcomp="node-mgmt" username="root" level="WARNING"] Core file generated: /var/log/core/core.nginx.1660388664.16937.134.11.gz

Log indicating that the load balancer CPU usage is very high shortly after the core dump is generated
/var/log/syslog
2022-08-13T11:09:12.986Z edge-02.corp.local NSX 2781 - [nsx@6876 comp="nsx-edge" s2comp="nsx-monitoring" entId="a8xxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxx71" tid="3145" level="WARNING" eventState="On" eventFeatureName="load_balancer" eventSev="warning" eventType="lb_cpu_very_high"] The CPU usage of load balancer a8xxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxx71 is very high. The threshold is 95%.
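To check an exported copy of /var/log/syslog for both signatures at once, a small script along the following lines may help. This is a minimal sketch: the function name `scan_syslog` and the two patterns are illustrative assumptions derived only from the sample lines above, not an official VMware tool, and other builds may format these messages differently.

```python
import re

# Patterns are assumptions based on the two sample syslog lines in this article.
CORE_RE = re.compile(r"Core file generated: (\S*core\.nginx\.\S+)")
CPU_RE = re.compile(r'eventType="lb_cpu_very_high"')
TS_RE = re.compile(r"^(\S+)")


def scan_syslog(lines):
    """Return (timestamp, kind, detail) tuples for nginx core dumps and LB CPU alarms."""
    events = []
    for line in lines:
        ts_match = TS_RE.match(line)
        # The first whitespace-delimited token is treated as the timestamp.
        ts = ts_match.group(1) if ts_match else ""
        core = CORE_RE.search(line)
        if core:
            events.append((ts, "nginx_core", core.group(1)))
        elif CPU_RE.search(line):
            events.append((ts, "lb_cpu_very_high", ""))
    return events
```

Passing an open file object (for example `scan_syslog(open("syslog"))`) yields the events in file order, so a core dump entry followed within minutes by an `lb_cpu_very_high` entry matches the pattern described in this article.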

edge-02> get processes
top - 09:56:10 up 34 days, 17:09, 0 users, load average: 2.86, 2.16, 1.73
Tasks: 268 total, 4 running, 166 sleeping, 0 stopped, 14 zombie
%Cpu(s): 2.6 us, 2.6 sy, 0.0 ni, 94.6 id, 0.1 wa, 0.0 hi, 0.1 si, 0.0 st
KiB Mem : 32734844 total, 6939616 free, 18403544 used, 7391684 buff/cache
KiB Swap: 0 total, 0 free, 0 used. 13488368 avail Mem

/opt/vmware/nsx-netopa/bin/agent.py
14111 lb 20 0 598552 84140 3672 R 100.0 0.3 3:56.79 14111 nginx: worker process
15422 lb 20 0 624816 83956 3404 R 100.0 0.3 0:45.09 15422 nginx: worker process
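The process listing above can also be filtered programmatically when it has been saved to a file. The sketch below assumes the column layout shown in the sample (PID first, %CPU in the ninth column); `busy_nginx_workers` is a hypothetical helper name, and the default threshold of 95 simply reuses the alarm threshold quoted in the log message.

```python
def busy_nginx_workers(top_output, threshold=95.0):
    """Return (pid, cpu_percent) for nginx worker rows at or above the threshold."""
    busy = []
    for line in top_output.splitlines():
        if "nginx: worker process" not in line:
            continue
        fields = line.split()
        # Assumed layout: PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ ...
        try:
            pid, cpu = int(fields[0]), float(fields[8])
        except (IndexError, ValueError):
            continue  # skip rows that do not match the expected layout
        if cpu >= threshold:
            busy.append((pid, cpu))
    return busy
```

Rows that do not parse are skipped rather than raising, since captured output may contain wrapped or truncated lines.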

Environment

VMware NSX-T Data Center
VMware NSX-T Data Center 3.x

Cause

The shared memory has been removed from the L4LB CP nginx process, but the queue node pointer of a persistence session still points to an address inside the removed shared memory.

In other words, the persistence session is still linked into a queue that lived in the removed shared memory. When the persistence session is later freed, nginx tries to remove it from that queue, and this access to the removed memory crashes the load balancer.

Resolution

Upgrade to NSX-T 3.2.1.2, or to 4.1.0 or later.
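When auditing several Edge nodes, the version rule can be expressed as a small check. The sketch below is an interpretation, not an official compatibility matrix: `fix_confirmed` is a hypothetical helper, it treats 3.x releases at or above 3.2.1.2 and releases at or above 4.1.0 as containing the fix, and it reports anything this article does not list (for example 4.0.x) as not confirmed.

```python
def parse_version(text):
    """Turn a dotted version string such as '3.2.1.2' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def fix_confirmed(version):
    """Return True only for versions this article lists as containing the fix.

    Assumption: 3.x releases >= 3.2.1.2 and any release >= 4.1.0 are fixed;
    everything else, including 4.0.x, is reported as not confirmed.
    """
    v = parse_version(version)
    v = v + (0,) * (4 - len(v))  # pad short versions, e.g. '4.1' -> (4, 1, 0, 0)
    if v[:2] >= (4, 1):
        return True
    if v[0] == 3:
        return v >= (3, 2, 1, 2)
    return False
```

For example, `fix_confirmed("3.2.1.1")` is false and `fix_confirmed("4.1.0")` is true under these assumptions.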

Workaround:
Temporary Workarounds

The nginx process must be restarted completely to resolve the issue temporarily on the affected Edge nodes. Any one of the following options can be used:

1. Place the active Edge node into maintenance mode, then exit maintenance mode
2. Restart the Edge node
3. Restart the load balancer Docker container - docker ps | grep edge-lb | awk '{print $1}' | xargs docker restart

Long-term Workaround
Change the pool selection algorithm from round robin to IP hash and disable persistence on all virtual servers.
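IP hash can stand in for source IP persistence because the backend is computed from the client address on every connection, so no persistence table (the structure involved in this crash) is needed. The following sketch illustrates the general idea only; the hash function NSX actually uses is not described in this article, and `pick_backend` with SHA-256 is a stand-in chosen for illustration.

```python
import hashlib


def pick_backend(client_ip, backends):
    """Map a client IP to a backend deterministically, with no stored session state.

    Illustrative only: SHA-256 is an assumed stand-in for the real hash.
    """
    digest = hashlib.sha256(client_ip.encode("ascii")).digest()
    index = int.from_bytes(digest[:8], "big") % len(backends)
    return backends[index]
```

The same client IP always maps to the same pool member while the pool is unchanged. With this simple modulo scheme, adding or removing a member can remap existing clients, which is one trade-off compared with a persistence table.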

Additional Information

Impact/Risks:

Load Balancer crashes

Because of the crash, the persistence table is left locked, which drives CPU usage very high in the new L4LB process.

Some persistence session nodes may be lost. They are no longer in any queue and are no longer managed, so the total number of persistence entries cannot reach the capacity defined for this load balancer size.

Some new connections may be sent to different backend servers.
