NSX-T Edge Node CPU utilization spikes after upgrade to 3.1.3

Article ID: 322565

Products

VMware NSX Networking

Issue/Introduction

Symptoms:
  • An increase in Edge node CPU utilization is noticed after upgrading the NSX-T Edge node to 3.1.3 and before upgrading the NSX-T Manager nodes.
  • The syslog (/var/log/syslog) on the NSX-T Edge nodes is flooded with messages like the following:

[nsx@6876 comp="nsx-edge" subcomp="nsx-nestdb" s2comp="nsx-net" tid="1835" level="ERROR" errorCode="NET4"] NetTransport[0] Accept on endpoint 'unix:///var/run/vmware/nestdb/nestdb-server.sock' failed with error 24-Too many open files

  • The top command, run as root on the NSX-T Edge node, shows a spike in CPU consumption for nestdb-server:

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
1634 nestdb 20 0 584372 523472 14388 S 96.4 1.6 339:59.99 nestdb-server

  • A large number of files are held open by nestdb-server and nvpapi.py. To check this, run the following command as root on the NSX-T Edge node:

root@edge-node:/tmp# lsof +c 0 | awk '{ print $2 " " $1; }' | sort -rn | uniq -c | sort -rn | head -20
150063 1634 nestdb-server
100128 3942 nvpapi.py
5888 2365 python3
5024 7288 datapathd
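
The symptoms above can also be checked per process. The following is a minimal sketch, assuming a Linux /proc filesystem and root access on the Edge node; the function names count_fd_errors and fd_usage are hypothetical helpers, not part of NSX-T.

```shell
#!/bin/sh
# Hypothetical helpers for illustration; not shipped with NSX-T.

# Count log lines reporting descriptor exhaustion on the nestdb socket.
# $1: log file to scan (the article's path is /var/log/syslog).
count_fd_errors() {
    grep -c 'failed with error 24-Too many open files' "$1"
}

# Print the open descriptor count and the soft limit for one process,
# both read from /proc.
# $1: process ID, for example the nestdb-server PID shown by top.
fd_usage() {
    pid="$1"
    open=$(ls "/proc/$pid/fd" 2>/dev/null | wc -l)
    limit=$(awk '/^Max open files/ { print $4 }' "/proc/$pid/limits")
    echo "pid=$pid open=$open soft_limit=$limit"
}
```

For example, `count_fd_errors /var/log/syslog` followed by `fd_usage 1634` (the nestdb-server PID from the top output above) shows how often the error was logged and how close the process is to its descriptor limit.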


Environment

VMware NSX-T Data Center 3.x
VMware NSX-T Data Center

Cause

The NSX Manager pushes the collector configuration, for example from vRNI, to the Edge nodes.
After upgrading the NSX-T Edge node to 3.1.3, the NSX-T Edge node expects three pieces of information about the collector: IP, Port Number and Type.
However, a Manager node that has not yet been upgraded sends only two pieces of information: IP and Port Number.
Because of this missing piece of information, the NSX-T Edge node continuously retries RPC connections. Each failure leaves a file open, which eventually exhausts the available open files.
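
Because each failed retry leaves a descriptor open, the count for the affected process should keep climbing. The sketch below samples it twice to confirm the growth. It assumes a Linux /proc filesystem and root access; fd_growth is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper for illustration; not shipped with NSX-T.

# Report how many descriptors a process gained over an interval.
# $1: process ID, $2: seconds to wait between the two samples.
fd_growth() {
    pid="$1"
    interval="$2"
    before=$(ls "/proc/$pid/fd" | wc -l)
    sleep "$interval"
    after=$(ls "/proc/$pid/fd" | wc -l)
    # A steadily positive value suggests descriptors are leaking.
    echo "$((after - before))"
}
```

Running `fd_growth <pid> 60` against nestdb-server or nvpapi.py prints the change over one minute.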

The below API call can be used to verify the Collector information:

curl -i -k -u 'admin:<PW>' -H "Content-Type:application/json" -X GET https://<nsx-mgr-ip>/api/v1/global-configs/OperationCollectorGlobalConfig
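
To make the response easier to read, the same GET request can be piped through a JSON formatter. This is a sketch only: it assumes python3 is available on the machine issuing the request, drops the -i flag so that only the JSON body is printed, and uses a hypothetical function name, show_collector_config.

```shell
#!/bin/sh
# Hypothetical helper for illustration; not shipped with NSX-T.

# Fetch the collector configuration and pretty-print the JSON body.
# $1: NSX Manager address, $2: credentials in user:password form.
show_collector_config() {
    curl -s -k -u "$2" -H "Content-Type:application/json" -X GET \
        "https://$1/api/v1/global-configs/OperationCollectorGlobalConfig" |
        python3 -m json.tool
}
```

Usage: `show_collector_config <nsx-mgr-ip> 'admin:<PW>'`.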

Resolution

This issue is resolved in NSX-T Data Center 3.1.3. Once the NSX-T management plane is completely upgraded, restart the following service on the affected NSX-T Edge node:
service nsx-edge-api-server restart
The restart clears any open file descriptors that may have accumulated.

Workaround:

If there is a long gap between the NSX-T Edge node upgrade and the NSX-T Manager node upgrade and you encounter this issue, disable the collector configuration (for example, through the log collection utility in vRNI). Then run the command below on the impacted NSX-T Edge node while logged in as root:

service nsx-edge-api-server restart

Alternatively, before you start the NSX-T upgrade, you can use the API above to remove the collector information. After NSX-T is completely upgraded, re-apply the collector configuration.
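
Before removing the collector information, it may help to save the current configuration so it can be re-applied later. The sketch below only covers the backup step, using the GET endpoint shown earlier; the function name backup_collector_config and the output file are hypothetical, and the request used to re-apply the configuration is not shown because this article does not describe it.

```shell
#!/bin/sh
# Hypothetical helper for illustration; not shipped with NSX-T.

# Save the current collector configuration to a file before changing it.
# $1: NSX Manager address, $2: credentials in user:password form,
# $3: output file for the JSON response.
backup_collector_config() {
    curl -s -k -f -u "$2" -X GET \
        "https://$1/api/v1/global-configs/OperationCollectorGlobalConfig" \
        > "$3" || return 1
    # Refuse to treat an empty response as a valid backup.
    [ -s "$3" ]
}
```

Usage: `backup_collector_config <nsx-mgr-ip> 'admin:<PW>' collector-config.json`. The function returns a non-zero status if the request fails or the saved file is empty.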