Memory leaks in NSX-T Edge

Article ID: 311847

Products

VMware NSX

Issue/Introduction

Symptoms:
The Edge node runs out of memory. `top` shows kauditd using high CPU, /proc/slabinfo shows high kmalloc-2048 usage, and syslog repeatedly logs "audit: audit_backlog=xxxx > audit_backlog_limit=8192".

proc/slabinfo:
...
kmalloc-2048 659406 659406 2048 16 8 : tunables 0 0 0 : slabdata 41268 41268 0
...
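
To gauge how much memory the kmalloc-2048 cache is holding, multiply the number of active objects by the object size. The following is a minimal sketch that is not part of the original article: the function name `slab_mib` is hypothetical, and it assumes the usual slabinfo column order (name, active_objs, num_objs, objsize, ...) seen in the excerpt above.

```shell
# Hypothetical helper (not from the article): print the approximate number of
# MiB held by the active objects of one slab cache.
# Usage: slab_mib <cache-name> [slabinfo-file]
slab_mib() {
    cache="$1"
    file="${2:-/proc/slabinfo}"
    # Assumed columns: $1=name $2=active_objs $3=num_objs $4=objsize(bytes)
    awk -v c="$cache" '$1 == c { printf "%d\n", int(($2 * $4) / 1048576) }' "$file"
}
```

For the sample line above, 659406 active objects multiplied by 2048 bytes comes to roughly 1287 MiB held by this one cache. Reading /proc/slabinfo on a live system typically requires root privileges.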

var/log/vmware/top-cpu.log:

  PID USER  PR  NI  VIRT  RES  SHR S  %CPU  %MEM     TIME+ TGID COMMAND
   34 root  20   0     0    0    0 R  87.5   0.0  11650:31   34 [kauditd]

var/log/syslog:
...
<yyyy-mm-dd>T<hr:min:sec.xxx>Z <hostname> kernel - - - [xxxxxxxxx.xxxxxx] audit: audit_backlog=xxxxxx > audit_backlog_limit=8192
<yyyy-mm-dd>T<hr:min:sec.xxx>Z <hostname> kernel - - - [xxxxxxxxx.xxxxxx] audit: audit_lost=xxxxxx audit_rate_limit=0 audit_backlog_limit=8192
<yyyy-mm-dd>T<hr:min:sec.xxx>Z <hostname> kernel - - - [xxxxxxxxx.xxxxxx] audit: backlog limit exceeded
...
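
A quick way to confirm the syslog symptom is to count the audit backlog messages in the log. This is a sketch under assumptions, not a documented VMware procedure: the function name `audit_backlog_hits` is hypothetical, and the patterns are taken from the sample lines above.

```shell
# Hypothetical helper (not from the article): count kernel audit backlog
# warnings in a syslog file.
# Usage: audit_backlog_hits [syslog-file]
audit_backlog_hits() {
    file="${1:-/var/log/syslog}"
    # -c prints the number of matching lines; -E enables the alternation.
    grep -cE 'audit: (audit_backlog=[0-9]+ > audit_backlog_limit=|backlog limit exceeded)' "$file"
}
```

A count that keeps growing between runs, together with high kauditd CPU usage, is consistent with the symptoms described in this article.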


Environment

VMware NSX-T Data Center
VMware NSX-T Data Center 3.x

Cause

This is a Linux kernel bug. The exact trigger is unclear, but it is likely related to high system load combined with a high rate of audit events. The system runs out of memory and starts killing processes. The only way to recover is to reboot the Edge node.

Resolution

This issue is resolved in NSX-T 3.2.3.

Workaround:
1. Open the file /etc/default/grub.
2. Find the line GRUB_CMDLINE_LINUX="audit=1". Remove or comment out this line, then save the file.
3. Run `update-grub2`.
4. Reboot the Edge node.
5. After the reboot, confirm that the string audit=1 no longer appears in the output of `cat /proc/cmdline`.
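
The final verification step can be scripted. The following is a minimal sketch rather than an official procedure: the function name `audit_param_present` is hypothetical, and it only checks whether `audit=1` appears as a standalone parameter in the given file.

```shell
# Hypothetical helper (not from the article): exit 0 if audit=1 is present
# as a standalone kernel parameter, non-zero otherwise.
# Usage: audit_param_present [cmdline-file]
audit_param_present() {
    file="${1:-/proc/cmdline}"
    # Put each parameter on its own line, then require an exact whole-line
    # match so that values such as audit=10 are not counted.
    tr -s '[:space:]' '\n' < "$file" | grep -qx 'audit=1'
}
```

After the reboot, `audit_param_present` should return a non-zero exit status on the Edge node. A zero exit status suggests the grub change did not take effect.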