An alarm with "The memory usage on Edge node <UUID> has reached <Current Memory Usage>% which is at or above the high threshold value of 80%." is triggered.
Running the top command, or checking /var/log/vmware/top-mem.log, on the edge node shows many nginx processes in the "worker process is shutting down" state.
VMware NSX-T Data Center 3.2.3.1
1. In /var/log/vmware/top-mem.log, many processes may show "nginx: worker process is shutting down" in the COMMAND column.
2. Whenever the LB configuration changes, a new set of nginx worker processes is launched, and the old nginx worker processes continue to handle the existing connections.
3. The old nginx worker processes exit once all of their connections complete. They cannot terminate while any of those connections remain alive.
4. The old nginx worker processes consume a portion of resident memory, which can result in high memory usage.
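The check described in steps 1 and 4 can be sketched as two small shell helpers. This is a minimal illustration, not a supported tool: the function names are hypothetical, the first helper assumes the log contains plain top-style lines, and the second assumes a procps-style `ps -eo rss,args`, which reports resident memory in KiB. Because top-mem.log may hold several snapshots, the first count is the number of matching log lines, not necessarily the number of distinct processes.

```shell
#!/bin/sh
# Hypothetical helper: count log lines showing an nginx worker that is
# shutting down. $1 is the log path (defaults to the file named above).
count_shutting_down_workers() {
    log_file="${1:-/var/log/vmware/top-mem.log}"
    # grep -c exits with status 1 when the count is 0; treat that as success.
    grep -c 'nginx: worker process is shutting down' "$log_file" || [ "$?" -eq 1 ]
}

# Hypothetical helper: read "RSS COMMAND" lines on stdin and print the total
# RSS (KiB) of workers that are shutting down.
# The [n] bracket keeps this awk process from matching its own command line
# when the input comes straight from ps.
sum_shutting_down_rss_kib() {
    awk '/[n]ginx: worker process is shutting down/ { total += $1 }
         END { print total + 0 }'
}

# Example (run on the edge node as root):
#   count_shutting_down_workers
#   ps -eo rss,args | sum_shutting_down_rss_kib
```

A large total from the second helper, compared with the node's overall memory, suggests that lingering workers are contributing to the alarm.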
To confirm that the old nginx worker processes still hold connections, you can use the following method to list the connections and capture packets sent to the LB on an edge node.
# ip netns list

e.g.
root@edge-node-01:/# ip netns list
<LOGICAL_ROUTER_ID> (id: 2)
<LOGICAL_ROUTER_ID> (id: 8)
underlay (id: 6)
plr_sr (id: 1)
<LOGICAL_ROUTER_ID> (id: 0)
# ip netns exec <LOGICAL_ROUTER_ID> netstat -tan -p

You can get <LOGICAL_ROUTER_ID> from the output of "get logical-routers". Specify the ID of the logical router you want to inspect.

e.g.
edge-node-01> get logical-routers
Thu Jul 11 2024 UTC 06:37:29.652
Logical Router
UUID                  VRF  LR-ID  Name                   Type                      Ports  Neighbors
<LOGICAL_ROUTER_ID>   0    0                             TUNNEL                    4      6/5000
<LOGICAL_ROUTER_ID>   1    8      <LOGICAL_ROUTER_NAME>  DISTRIBUTED_ROUTER_TIER1  5      0/50000
<LOGICAL_ROUTER_ID>   2    9      <LOGICAL_ROUTER_NAME>  SERVICE_ROUTER_TIER1      6      2/50000
<LOGICAL_ROUTER_ID>   3    1      <LOGICAL_ROUTER_NAME>  DISTRIBUTED_ROUTER_TIER0  5      0/50000
<LOGICAL_ROUTER_ID>   4    11     <LOGICAL_ROUTER_NAME>  SERVICE_ROUTER_TIER1      5      1/50000
<LOGICAL_ROUTER_ID>   5    2      <LOGICAL_ROUTER_NAME>  SERVICE_ROUTER_TIER0      6      2/50000
<LOGICAL_ROUTER_ID>   7    2049   <LOGICAL_ROUTER_NAME>  SERVICE_ROUTER_TIER1      5      0/50000
# tcpdump -i kni-lrport-0 (in case of HTTP or HTTPS)
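The output of the netstat command above can be summarized per process to see which nginx PIDs still own established connections. The following is a rough sketch under stated assumptions: the function name is hypothetical, and it assumes the usual net-tools column layout in which the sixth field is the TCP state and the seventh begins with "PID/Program name" (for example "1201/nginx: worker"). Verify the layout on your edge node before relying on the result.

```shell
#!/bin/sh
# Hypothetical helper: read `netstat -tan -p` output on stdin and print
# "<pid> <count>" for each nginx process that owns ESTABLISHED connections.
established_per_nginx_pid() {
    awk '$1 ~ /^tcp/ && $6 == "ESTABLISHED" && $7 ~ /^[0-9]+\/nginx/ {
             split($7, owner, "/")   # owner[1] is the PID
             count[owner[1]]++
         }
         END { for (pid in count) print pid, count[pid] }' | sort -n
}

# Example (run on the edge node as root, with the ID taken from
# "get logical-routers"):
#   ip netns exec LOGICAL_ROUTER_ID netstat -tan -p | established_per_nginx_pid
```

PIDs listed here that also appear as "nginx: worker process is shutting down" in top are the old workers being kept alive by long-lived connections.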
As a workaround:
Putting the edge node into maintenance mode can serve as a temporary workaround.