open-vm-tools memory leak causing very high NSX Edge memory usage

Article ID: 324402

Updated On:

Products

VMware NSX

Issue/Introduction

This article describes a known issue in which open-vm-tools leaks memory on the NSX Edge. The issue is addressed in NSX 6.4.8, which ships open-vm-tools 11.0.5.

Symptoms:
1. High memory usage (~90%) is reported on the Edge, followed by "out of memory" errors.
2. The Edge becomes unresponsive and unmanageable.
3. The Edge reboots automatically.
4. Critical warnings are reported in the UI: "NSX Edge is out of memory. The Edge is rebooting in 3 seconds. Top 5 processes are: {#}."

Environment

VMware NSX Data Center for vSphere 6.4.x

Cause

This issue is caused by a memory leak in open-vm-tools running on the NSX Edge, which leads to high memory usage.
Eventually the NSX Edge becomes unmanageable and reboots automatically as part of auto-recovery.
If an automatic reboot does not occur, the Edge sometimes needs to be rebooted manually to clear the high memory usage.

To verify this issue (log snippets vary between Edges and between versions):
1. Check the UI for the critical alarm: "NSX Edge is out of memory. The Edge is rebooting in 3 seconds. Top 5 processes are: {#}."
2. Check the NSX Edge logs for memory usage warnings:

2020-06-06T05:22:55+00:00 NSX Edge MsgMgr[1778]: [default]: [daemon.info] payload len:368 data:{"systemEvents":[{"moduleName":"vShield Edge Appliance","severity":"Critical","eventCode":"30149","message":"vShield Edge memory over used","timestamp":1591420975,"metaData":{"message":"Memory usage: 90.05%","details":" 1772 390456 987680 vmtoolsd 801 4156 200708 syslog-ng 7652 3716 67060 sync_path.pl 7406 2808 14144 sh 7382 2752 14140 runevery.sh "}}]
2020-06-06T05:26:18+00:00 NSX Edge kernel[]: [default]: [kern.warning] dcsms invoked oom-killer: gfp_mask=0x24201ca, order=0, oom_score_adj=0

2020-06-06T05:26:18+00:00 NSX Edge kernel[]: [default]: [kern.err] Out of memory: Kill process 1772 (vmtoolsd) score 817 or sacrifice child
2020-06-06T05:26:18+00:00 NSX Edge kernel[]: [default]: [kern.err] Killed process 1772 (vmtoolsd) total-vm:987756kB, anon-rss:384548kB, file-rss:244kB
2020-06-06T05:26:18+00:00 NSX Edge OOMChecker[1780]: [default]: [daemon.warning] OOM, top 5 memory used processes: 8347 54180 116972 VseEventProcess 801 3824 200708 syslog-ng 8349 2780 14144 sh 1780 2628 39876 VseOOMChecker.p 7652 2548 67060 sync_path.pl
2020-06-06T05:26:18+00:00 NSX Edge config[]: [default]: [daemon.info] INFO :: Utils :: ha: UpdateHaResourceFlags:
2020-06-06T05:26:18+00:00 NSX Edge MsgMgr[1778]: [default]: [daemon.info] Building event message
2020-06-06T05:26:18+00:00 NSX Edge MsgMgr[1778]: [default]: [daemon.info] correlation id:Event_502a1cba-7f73-2be8-49a6-5b96ce953aaf1591421178
2020-06-06T05:26:18+00:00 NSX Edge MsgMgr[1778]: [default]: [daemon.info] payload len:360 data:{"systemEvents":[{"severity":"Critical","message":"OOM happened, system rebooting in 3 seconds...","metaData":{"message":" 8347 54180 116972 VseEventProcess 801 3824 200708 syslog-ng 8349 2780 14144 sh 1780 2628 39876 VseOOMChecker.p 7652 2548 67060 sync_path.pl "},"timestamp":1591421178,"eventCode":30180,"moduleName":"vShield Edge Appliance"}]}
2020-06-06T05:26:21+00:00 NSX Edge shutdown[8425]: [default]: [user.notice] shutting down for system reboot
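The check against the Edge log can be sketched in Python as below. This is a minimal illustration, not a supported tool: the function name `scan_edge_log` and the three regular expressions are assumptions derived only from the sample lines above, and other Edge versions may word these messages differently.

```python
import re

# Patterns derived from the sample log lines in this article (assumption:
# other Edge versions may use different wording).
MEM_USAGE_RE = re.compile(r"Memory usage: (\d+(?:\.\d+)?)%")
OOM_KILL_RE = re.compile(r"Killed process (\d+) \(([^)]+)\)")
OOM_REBOOT_RE = re.compile(r"OOM happened, system rebooting")


def scan_edge_log(lines):
    """Summarize memory-related events found in NSX Edge log lines."""
    summary = {"max_memory_pct": None, "oom_killed": [], "oom_reboot": False}
    for line in lines:
        usage = MEM_USAGE_RE.search(line)
        if usage:
            pct = float(usage.group(1))
            # Track the highest memory usage percentage reported.
            if summary["max_memory_pct"] is None or pct > summary["max_memory_pct"]:
                summary["max_memory_pct"] = pct
        killed = OOM_KILL_RE.search(line)
        if killed:
            # Record (pid, process name) for each process the OOM killer ended.
            summary["oom_killed"].append((int(killed.group(1)), killed.group(2)))
        if OOM_REBOOT_RE.search(line):
            summary["oom_reboot"] = True
    return summary


# Shortened excerpts of the log lines shown above.
sample = [
    '"message":"vShield Edge memory over used","metaData":{"message":"Memory usage: 90.05%"',
    "[kern.err] Killed process 1772 (vmtoolsd) total-vm:987756kB, anon-rss:384548kB, file-rss:244kB",
    '"severity":"Critical","message":"OOM happened, system rebooting in 3 seconds..."',
]
print(scan_edge_log(sample))
```

A result that shows high memory usage, `vmtoolsd` among the killed processes, and an OOM reboot matches the pattern described in this article.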

vsm.log 

2020-06-06 13:28:18.992 XXX INFO SimpleAsyncTaskExecutor-1 EventServiceImpl:119 - - [nsxv@6876 comp="nsx-manager" subcomp="manager"] [SystemEvent] Time:'Sat Jun 06 13:27:21.000 xxx 2020', Severity:'Informational', Event Source:'edge-xxxxxxxx-xxxxxxxxx-xxxx-xxxxxxxxxxxx', Code:'30101', Event Message:'NSX Edge was booted', Module:'vShield Edge Appliance', Universal Object:'false


Edge system processes via the top command

USER  PID   PPID  %CPU  %MEM  VSZ     RSS     NI  TTY  STAT  STIME  TIME      COMMAND
root  1175  1147  0.0   67.3  943376  335868  0   ?    Sl    2018   07:05:47  /usr/local/bin/vmtoolsd --plugin-path=/usr/local/lib/open-vm-tools/plugins/vmsvc/
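A process listing like the one above can be checked for a large `vmtoolsd` footprint with a short Python sketch. The helper names and the 50% threshold are hypothetical choices for illustration; the article itself only shows `vmtoolsd` at 67.3% of memory and does not define a cutoff.

```python
def parse_process_row(header, row):
    """Map a whitespace-separated process row onto its header columns."""
    columns = header.split()
    # COMMAND may contain spaces, so split at most len(columns) - 1 times.
    values = row.split(None, len(columns) - 1)
    return dict(zip(columns, values))


def find_vmtoolsd_usage(header, rows, threshold_pct=50.0):
    """Return (pid, %MEM, RSS) for vmtoolsd rows at or above threshold_pct.

    threshold_pct is an arbitrary illustrative value, not a documented limit.
    """
    hits = []
    for row in rows:
        proc = parse_process_row(header, row)
        if "vmtoolsd" in proc.get("COMMAND", "") and float(proc["%MEM"]) >= threshold_pct:
            hits.append((int(proc["PID"]), float(proc["%MEM"]), int(proc["RSS"])))
    return hits


header = "USER PID PPID %CPU %MEM VSZ RSS NI TTY STAT STIME TIME COMMAND"
rows = [
    "root 1175 1147 0.0 67.3 943376 335868 0 ? Sl 2018 07:05:47 "
    "/usr/local/bin/vmtoolsd --plugin-path=/usr/local/lib/open-vm-tools/plugins/vmsvc/",
]
print(find_vmtoolsd_usage(header, rows))
```

Splitting with a maximum split count keeps the full `vmtoolsd` command line, including its `--plugin-path` argument, in the COMMAND field.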

Resolution

This issue is resolved in open-vm-tools 11.0.5, which ships with NSX 6.4.8.
To resolve the issue, upgrade NSX Manager and the other NSX components to version 6.4.8.
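Whether a given NSX version already includes the fix can be checked with a small comparison such as the following sketch. The function names are hypothetical, and the sketch assumes plain dotted numeric version strings and that releases after 6.4.8 also carry the fix, which this article does not state explicitly.

```python
# First NSX release that ships open-vm-tools 11.0.5, per this article.
FIXED_NSX_VERSION = (6, 4, 8)


def parse_version(text):
    """Convert a dotted version string such as '6.4.8' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


def has_open_vm_tools_fix(nsx_version):
    """Return True if nsx_version is 6.4.8 or later (assumed to include the fix)."""
    return parse_version(nsx_version) >= FIXED_NSX_VERSION


for version in ("6.4.6", "6.4.8", "6.4.10"):
    print(version, has_open_vm_tools_fix(version))
```

Comparing integer tuples rather than raw strings avoids ordering mistakes such as treating "6.4.10" as older than "6.4.8".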

Workaround:
The only workaround is to reboot the Edge when the UI shows a memory usage warning.
The situation is critical when the Edge constantly reports high memory usage.
If a critical alarm such as Alarm 30180 (OOM) appears in the UI, an out-of-memory condition has already occurred and the system will attempt to recover through an automatic reboot.

For the list of critical alarms and system events, see: https://docs.vmware.com/en/VMware-NSX-Data-Center-for-vSphere/6.4/com.vmware.nsx.logging.doc/GUID-4CAA25F7-1EE7-4B8A-957E-52865F723C10.html

Additional Information

Impact/Risks:
The NSX Edge becomes unmanageable and the services it provides are impacted.