
Edge Memory Usage High alarm might appear due to many old nginx worker processes waiting to shut down.


Article ID: 371702


Updated On:

Products

VMware NSX

Issue/Introduction

An alarm is triggered with the message: "The memory usage on Edge node <UUID> has reached <Current Memory Usage>% which is at or above the high threshold value of 80%."

The top command, or /var/log/vmware/top-mem.log on the Edge node, shows many nginx processes in the "worker process is shutting down" state.

Environment

VMware NSX-T Data Center

Cause

1. In /var/log/vmware/top-mem.log, there may be many processes with "nginx: worker process is shutting down" in the COMMAND column.

2. Whenever a load balancer (LB) reconfiguration occurs, a new set of nginx worker processes is launched, while the old nginx worker processes continue to handle the existing connections.

3. The old nginx worker processes exit once all of their old connections complete, but they cannot be terminated while those connections remain alive.

4. Each old nginx worker process continues to hold a portion of resident memory, which can result in high memory usage on the Edge node (see the example after this list).
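To see how much resident memory the lingering workers hold, the following ps one-liner is a minimal sketch; the process title string is the same one reported in top-mem.log and may differ slightly between nginx builds:

# ps -eo pid,rss,cmd | grep "worker process is shutting down" | grep -v grep

The RSS column shows the resident memory (in kilobytes) consumed by each old worker process.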

 

To confirm whether the old nginx worker processes still hold live connections, you can use the following method to capture packets towards the LB on an Edge node.

# ip netns list

e.g. root@edge-node-01:/# ip netns list

<LOGICAL_ROUTER_ID> (id: 2)

<LOGICAL_ROUTER_ID> (id: 8)

underlay (id: 6)

plr_sr (id: 1)

<LOGICAL_ROUTER_ID> (id: 0)

 

# ip netns exec <LOGICAL_ROUTER_ID> netstat -tan -p

You can obtain <LOGICAL_ROUTER_ID> from the output of "get logical-routers" and specify the logical router whose namespace you want to inspect.

e.g. edge-node-01> get logical-routers

Thu Jul 11 2024 UTC 06:37:29.652

Logical Router

UUID                                   VRF    LR-ID  Name                              Type                        Ports   Neighbors

<LOGICAL_ROUTER_ID>   0      0                                        TUNNEL                      4       6/5000

<LOGICAL_ROUTER_ID>   1      8      <LOGICAL_ROUTER_NAME>                       DISTRIBUTED_ROUTER_TIER1    5       0/50000

<LOGICAL_ROUTER_ID>   2      9      <LOGICAL_ROUTER_NAME>                       SERVICE_ROUTER_TIER1        6       2/50000

<LOGICAL_ROUTER_ID>   3      1      <LOGICAL_ROUTER_NAME>                      DISTRIBUTED_ROUTER_TIER0    5       0/50000

<LOGICAL_ROUTER_ID>   4      11      <LOGICAL_ROUTER_NAME>                        SERVICE_ROUTER_TIER1        5       1/50000

<LOGICAL_ROUTER_ID>   5      2       <LOGICAL_ROUTER_NAME>                       SERVICE_ROUTER_TIER0        6       2/50000

<LOGICAL_ROUTER_ID>   7      2049    <LOGICAL_ROUTER_NAME>             SERVICE_ROUTER_TIER1        5       0/50000
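To narrow the netstat output down to the connections that keep the old workers alive, you can filter for nginx-owned sockets. This is a hedged example; the namespace name on the Edge node is the logical router UUID reported by "ip netns list", and the PID/Program name column requires running the command as root:

# ip netns exec <LOGICAL_ROUTER_ID> netstat -tan -p | grep nginx

ESTABLISHED sockets owned by an old worker PID indicate that the connections keeping that worker alive are still open.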

 

# tcpdump -i kni-lrport-0 (in case of HTTP or HTTPS)
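If the virtual server port is known, a port filter can reduce the capture noise. This is a hedged example that assumes the VIP listens on port 443 (adjust to your configuration) and that kni-lrport-0 resides inside the logical router namespace identified above:

# ip netns exec <LOGICAL_ROUTER_ID> tcpdump -i kni-lrport-0 -nn port 443

Traffic seen in the capture confirms that connections towards the LB are still active.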

Resolution

As a temporary workaround, put the Edge node into maintenance mode.
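One way to do this, assuming your NSX version supports toggling maintenance mode from the Edge node admin CLI (verify against the documentation for your release), is:

e.g. edge-node-01> set maintenance-mode enabled

and, once you are ready to return the node to service:

e.g. edge-node-01> set maintenance-mode disabled

Maintenance mode can also be entered from the NSX Manager UI for the Edge transport node.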