An alarm with "The memory usage on Edge node <UUID> has reached <Current Memory Usage>% which is at or above the high threshold value of 80%." is triggered.
The load balancer Virtual server status may be down.
The load balancer configuration changes frequently.
On the edge node where the load balancer is located, running 'get process monitor' as the admin user shows a number of nginx worker processes in a shutting down state:
Edge>get process monitor
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ TGID COMMAND
X lb ############################################## nginx: worker process is shutting down
X lb ############################################## nginx: worker process is shutting down
X lb ############################################## nginx: worker process is shutting down
X lb ############################################## nginx: worker process is shutting down
X lb ############################################## nginx: worker process is shutting down
X lb ############################################## nginx: worker process is shutting down
X lb ############################################## nginx: worker process is shutting down
X lb ############################################## nginx: worker process is shutting down
In /var/log/syslog, the nginx process is killed because the Edge is out of memory:
kernel - - - [] Out of memory: Killed process x (nginx) total-vm:, anon-rss:, file-rss:4kB, shmem-rss:kB, UID: pgtables:kB oom_score_adj:0
kernel - - - [] Out of memory: Killed process x (nginx) total-vm:, anon-rss:, file-rss:0kB, shmem-rss:kB, UID: pgtables:kB oom_score_adj:0
In /var/log/syslog, you observe errors similar to:
- - - cfg: not enough mem available for new config processing, min needed XXXXXXXXX KB, current free XXXXXXXX KB
Note: The preceding log excerpts are only examples. Date, time, and environmental variables may vary depending on your environment.
VMware NSX-T Data Center
VMware NSX
When a Load Balancer is reconfigured, the nginx worker process quits and a new process is forked. However, the old worker process only quits after all live connections handled by that process are closed.
This behavior ensures live connections are not broken. If the old connections persist (for example, due to TCP keepalive), then the process that handles those connections also persists with the status "worker process is shutting down".
The combination of frequent LB configuration updates and long-lived keepalive connections means these worker processes never finish shutting down. More and more nginx worker processes accumulate and consume the edge node's memory, which leads to the high memory alarm.
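For illustration only, a small helper along the following lines (hypothetical, not shipped with NSX) could be used to count the lingering workers and the resident memory they hold, assuming a standard Linux ps is available on the edge node:

#!/usr/bin/env python3
# Hypothetical helper (not part of NSX): count nginx workers stuck in the
# "shutting down" state and sum their resident memory using standard ps.
import subprocess

def lingering_nginx_workers():
    out = subprocess.check_output(["ps", "-eo", "pid,rss,args"], text=True)
    workers = []
    for line in out.splitlines()[1:]:          # skip the ps header line
        pid, rss_kb, args = line.strip().split(None, 2)
        if "nginx: worker process is shutting down" in args:
            workers.append((int(pid), int(rss_kb)))
    return workers

if __name__ == "__main__":
    workers = lingering_nginx_workers()
    total_mb = sum(rss for _, rss in workers) / 1024
    print(f"Lingering nginx workers: {len(workers)}, resident memory: {total_mb:.1f} MB")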
This is a known issue impacting VMware NSX.
There are three options available to work around this issue. Take a backup before proceeding:
Workaround 1:
The script 'LB_connection_dump.py' can be used when the Edge high memory alarm is raised and nginx processes are in a shutting down state. Run the script from the /image directory as the root user (a rough approximation of what the script collects is sketched after this workaround):
# python LB_connection_dump.py
Sample output:
Collecting the PIDs in shutting down state
Total PIDs identified: 1
Collecting the active sessions on the respective PIDs
Connections saved to connections_2025-02-13_09-43-58.txt
# cat connections_2025-02-13_09-43-58.txt
Example output:
Timestamp: 2025-02-13_09-43-58
Total PIDs identified: 1
PID Detail: lb 3049407 0.0 0.6 281516 53600 ? S Feb12 0:25 nginx: worker process is shutting down
PID: 3049407
tcp 0 0 <LB VIP>:80 <Source-IP>:40352 ESTABLISHED 3049407/nginx: work >>>>> This shows that traffic sourced from <Source-IP>:40352 towards <LB VIP>:80 is still associated with the worker process; killing the process will kill this connection as well.
tcp 0 0 <LB_downlink_internal_ip>:4111 <Backend pool member>:80 ESTABLISHED 3049407/nginx: work >>>>> This is the second leg of the connection towards the backend server.
# kill -9 <PID>
NOTE: Killing the process will cause all associated connections to be lost. Take this action cautiously.
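For reference, the collection flow described in this workaround can be approximated with the sketch below. This is a hypothetical reconstruction, not the actual LB_connection_dump.py script, and it assumes the lingering workers' connections are visible to netstat in the namespace where it is run:

#!/usr/bin/env python3
# Hypothetical sketch of the collection flow (not the actual LB_connection_dump.py):
# find nginx workers in "shutting down" state, then save their ESTABLISHED
# connections from netstat to a timestamped file. Run as root.
import datetime
import subprocess

ps_out = subprocess.check_output(["ps", "aux"], text=True)
pids = [line.split()[1] for line in ps_out.splitlines()
        if "nginx: worker process is shutting down" in line]
print(f"Total PIDs identified: {len(pids)}")

netstat_out = subprocess.check_output(["netstat", "-tanp"], text=True)
stamp = datetime.datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
fname = f"connections_{stamp}.txt"
with open(fname, "w") as f:
    f.write(f"Timestamp: {stamp}\nTotal PIDs identified: {len(pids)}\n")
    for pid in pids:
        f.write(f"PID: {pid}\n")
        for line in netstat_out.splitlines():
            if f"{pid}/nginx" in line and "ESTABLISHED" in line:
                f.write(line + "\n")
print(f"Connections saved to {fname}")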
Workaround 2:
Workaround 3:
If you want to confirm that there are still connections held by the old nginx worker processes, you can use the following method to check the connections and capture packets towards the LB on an edge node (a scripted version of the per-namespace check is sketched after these commands).
# ip netns list
eg. root@edge-node-01:/# ip netns list
<LOGICAL_ROUTER_ID> (id: 2)
<LOGICAL_ROUTER_ID> (id: 8)
underlay (id: 6)
plr_sr (id: 1)
<LOGICAL_ROUTER_ID> (id: 0)
# ip netns exec <LOGICAL_ROUTER_ID> netstat -tan -p
You can obtain <LOGICAL_ROUTER_ID> from the output of "get logical-routers" and specify the logical router you want to inspect.
eg. edge-node-01> get logical-routers
Thu Jul 11 2024 UTC 06:37:29.652
Logical Router
UUID VRF LR-ID Name Type Ports Neighbors
<LOGICAL_ROUTER_ID> 0 0 TUNNEL 4 6/5000
<LOGICAL_ROUTER_ID> 1 8 <LOGICAL_ROUTER_NAME> DISTRIBUTED_ROUTER_TIER1 5 0/50000
<LOGICAL_ROUTER_ID> 2 9 <LOGICAL_ROUTER_NAME> SERVICE_ROUTER_TIER1 6 2/50000
<LOGICAL_ROUTER_ID> 3 1 <LOGICAL_ROUTER_NAME> DISTRIBUTED_ROUTER_TIER0 5 0/50000
<LOGICAL_ROUTER_ID> 4 11 <LOGICAL_ROUTER_NAME> SERVICE_ROUTER_TIER1 5 1/50000
<LOGICAL_ROUTER_ID> 5 2 <LOGICAL_ROUTER_NAME> SERVICE_ROUTER_TIER0 6 2/50000
<LOGICAL_ROUTER_ID> 7 2049 <LOGICAL_ROUTER_NAME> SERVICE_ROUTER_TIER1 5 0/50000
# tcpdump -i kni-lrport-0 (in case of HTTP or HTTPS)
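As a convenience, the same per-namespace check can be scripted roughly as follows. This is only a sketch; namespace names and the visibility of PID information depend on your environment, and it must be run as root:

#!/usr/bin/env python3
# Sketch only: run netstat in every network namespace on the edge node and
# print connections still owned by nginx processes. Run as root.
import subprocess

ns_out = subprocess.check_output(["ip", "netns", "list"], text=True)
namespaces = [line.split()[0] for line in ns_out.splitlines() if line.strip()]

for ns in namespaces:
    try:
        out = subprocess.check_output(
            ["ip", "netns", "exec", ns, "netstat", "-tanp"], text=True)
    except subprocess.CalledProcessError:
        continue  # skip namespaces where netstat fails
    nginx_lines = [l for l in out.splitlines() if "/nginx" in l]
    if nginx_lines:
        print(f"--- namespace {ns} ---")
        print("\n".join(nginx_lines))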