NSX Manager disk usage high alarm and partition /tmp shows above 90%

Products

VMware NSX

Issue/Introduction

You may see an NSX alarm similar to the following:

Manager Health         Manager Disk Usage High        Medium        <TimeStamp>
-------------------------------------------------------------------------------
Description            The disk usage for the Manager node disk partition /tmp has reached 90% which is at or above the high threshold value of 90%
Recommended Action     Examine the partition with high usage and see if there are any unexpected large files that can be removed.

The df -h output shows /tmp partition usage above 90%, but du does not find any files under /tmp that consume a large amount of space.

df -h

Filesystem                    Size    Used    Avail    Use%    Mounted on
tmpfs                         2.4G    1.4M    2.4G     1%    /run
/dev/sda3                      11G    4.8G    4.9G    50%    /
tmpfs                          12G    3.8M     12G     1%    /dev/shm
tmpfs                         5.0M       0    5.0M     0%    /run/lock
/dev/sda1                     942M    7.1M    870M     1%    /boot
/dev/mapper/nsx-config__bak    29G     54M     28G     1%    /config_bak
/dev/mapper/nsx-config         29G     51M     28G     1%    /config
/dev/mapper/nsx-secondary      98G    614M     93G     1%    /nonconfig
/dev/mapper/nsx-image          42G    590M     40G     2%    /image
/dev/mapper/nsx-repository     31G    8.9G     21G    31%    /repository
/dev/mapper/nsxvar+dump       9.3G     24K    8.8G     1%    /var/dump
/dev/mapper/nsx-tmp           3.7G    3.3G    225M    94%    /tmp
/dev/mapper/nsx-var+log        27G    9.2G     17G    37%    /var/log
tmpfs                         2.4G    4.0K    2.4G     1%    /run/user/1007
tmpfs                         2.4G    4.0K    2.4G     1%    /run/user/0

du -hsx /tmp/* | sort -rh | head -15 

68K /tmp/hsperfdata_nsx-replicator 
68K /tmp/hsperfdata_corfu 
36K /tmp/hsperfdata_uuc 
36K /tmp/hsperfdata_uproxy 
36K /tmp/hsperfdata_uproton 
36K /tmp/hsperfdata_uphc 
36K /tmp/hsperfdata_ucminv 
36K /tmp/hsperfdata_nsx-search 
36K /tmp/hsperfdata_nsx-messaging 
36K /tmp/hsperfdata_nsx-idps 
36K /tmp/hsperfdata_nsx-cbm 
36K /tmp/hsperfdata_nsx 
8.0K /tmp/systemd-private-29c4de################ac9ab222b1-systemd-timedated.service-Bquhkg 
8.0K /tmp/systemd-private-29c4de################ac9ab222b1-systemd-resolved.service-bQlOqM 
8.0K /tmp/systemd-private-29c4de################ac9ab222b1-systemd-logind.service-krDktF
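The mismatch between the two outputs above can be expressed as a single number. The following is a minimal sketch, assuming a POSIX shell with df, du and awk available on the node; the function name tmp_usage_gap_kb is illustrative and not part of NSX. A large positive result suggests space that is used on the partition but not attributable to any visible file.

```shell
# Hypothetical helper: print how many KB df reports as used on the
# filesystem holding $1 that du cannot attribute to visible files under $1.
# The result is only meaningful when $1 is the mount point itself (e.g. /tmp).
tmp_usage_gap_kb() {
    dir="${1:-/tmp}"
    # df -P gives stable POSIX columns; with -k, column 3 is "Used" in 1K blocks.
    df_used=$(df -Pk "$dir" | awk 'NR==2 {print $3}')
    # du -x stays on one filesystem; errors for unreadable paths are discarded.
    du_used=$(du -skx "$dir" 2>/dev/null | awk '{print $1}')
    echo $(( ${df_used:-0} - ${du_used:-0} ))
}

tmp_usage_gap_kb /tmp
```

Run it as root so that du can read every directory; otherwise unreadable paths inflate the gap.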

Running lsof with +L1, which lists open files whose link count is less than 1, shows large files that have been deleted but are still held open by running processes:

lsof +L1 /tmp

COMMAND       PID           USER   FD   TYPE DEVICE SIZE/OFF NLINK    NODE NAME
java         3568            uuc   59w   REG  252,6  8388665     0 1253385 /var/log/upgrade-coordinator/upgrade-coordinator.1.log (deleted)
java         3568            uuc   60w   REG  252,6 10485836     0 1253397 /var/log/upgrade-coordinator/corfu-metrics.1.log (deleted)
java         4095       nsx-idps  mem-W  REG  252,2    32768     1      43 /tmp/hsperfdata_nsx-idps/4095
java         4105           uphc  mem-W  REG  252,2    32768     1      44 /tmp/hsperfdata_uphc/4105
java         4105           uphc   59w   REG  252,6 11534563     0 1114175 /var/log/phonehome-coordinator/phonehome-coordinator.1.log (deleted)
java         4105           uphc   61w   REG  252,6 10485903     0 1114184 /var/log/phonehome-coordinator/corfu-metrics.1.log (deleted)
java         4105           uphc   62u   REG  252,6  1048642     0 1114127 /var/log/phonehome-coordinator/spring.1.log (deleted)
java         4174          corfu  mem-W  REG  252,2    32768     1      41 /tmp/hsperfdata_corfu/4174
java         4944         ucminv  mem-W  REG  252,2    32768     1      49 /tmp/hsperfdata_ucminv/4944
java         4944         ucminv   59u   REG  252,6 10485938     0  835659 /var/log/search/search-inventory.1.log (deleted)
java         4944         ucminv   61w   REG  252,6 31457318     0 1531988 /var/log/cm-inventory/corfu-metrics.1.log (deleted)
java         4944         ucminv   62w   REG  252,6  2097207     0 1532120 /var/log/cm-inventory/nsx-mp-metrics.1.log (deleted)
java         4944         ucminv   63w   REG  252,6 92280059     0 1532001 /var/log/cm-inventory/cm-inventory.1.log (deleted)
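To estimate how much space such entries hold in total, the SIZE/OFF column can be summed for rows whose NLINK value is 0. This is a rough sketch, assuming the ten-column lsof layout shown above and command names without spaces; the function name sum_deleted_open_bytes is hypothetical.

```shell
# Hypothetical helper: read `lsof +L1` output on stdin and print the total
# size in bytes of entries whose link count (NLINK, column 8) is 0.
# Caveat: a file held open on several descriptors is counted once per row.
sum_deleted_open_bytes() {
    awk 'NR > 1 && $8 == 0 { total += $7 } END { printf "%.0f\n", total + 0 }'
}

# Usage on the node (run as root so all processes are visible):
#   lsof +L1 /tmp | sum_deleted_open_bytes
```

Treat the result as an approximation: SIZE/OFF can be a file offset rather than a size for some descriptor types.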

Environment

VMware NSX 4.x
VMware NSX-T Data Center 3.x

Cause

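A file was rotated, cleaned up, or manually deleted while a process still had it open. Because the process never closed the file, its space is not released and it continues to count against the partition. Such files no longer appear in directory listings, so they are typically identifiable only with the lsof command.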
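The behavior can be reproduced safely outside NSX. The sketch below assumes a Linux system with /proc mounted; the temporary directory and the file name held.log are throwaway examples. It opens a file, writes 1 MiB to it, deletes it, and shows that the data is still reachable through the open descriptor even though the directory is empty.

```shell
# Sketch (Linux only): a deleted file keeps its space while a process
# still has it open. All paths here are throwaway examples.
workdir=$(mktemp -d)
exec 3> "$workdir/held.log"            # open the file on descriptor 3
head -c 1048576 /dev/zero >&3          # write 1 MiB through the descriptor
rm "$workdir/held.log"                 # unlink it, as rotation or cleanup would

visible=$(ls -A "$workdir" | wc -l)    # directory entries left after the delete
held_bytes=$(wc -c < "/proc/$$/fd/3")  # data still reachable via the descriptor
echo "visible entries: $visible, bytes still held: $held_bytes"

exec 3>&-                              # closing the descriptor releases the space
rmdir "$workdir"
```

The space is returned only when the last descriptor is closed, which is why stopping or restarting the holding process frees it.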

Resolution

This is a condition that may occur in a VMware NSX environment.

Workaround:

Reboot the affected NSX Manager node. The reboot closes the open file handles, which releases the space held by the deleted files.
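After the node is back up, usage of /tmp can be compared against the 90% threshold quoted in the alarm. This is a minimal sketch, assuming a POSIX shell with df and awk; the function name usage_below_threshold is illustrative only.

```shell
# Hypothetical helper: succeed if the filesystem holding $1 is below $2 percent used.
usage_below_threshold() {
    dir="${1:-/tmp}"
    limit="${2:-90}"   # 90 matches the high threshold quoted in the alarm
    # df -P gives stable POSIX columns; column 5 is the capacity, e.g. "94%".
    pct=$(df -P "$dir" | awk 'NR==2 { sub(/%/, "", $5); print $5 }')
    echo "$dir is at ${pct}% (threshold ${limit}%)"
    [ "$pct" -lt "$limit" ]
}

# Usage on the node:
#   usage_below_threshold /tmp 90 && echo "OK" || echo "still high"
```

Running lsof +L1 /tmp again at the same time confirms whether any deleted files are still held open.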

Additional Information

If you are contacting Broadcom support about this issue, please provide the following:
- NSX Manager support bundles.
- ESXi host support bundles for hosts that are failing to configure as transport nodes.
- Text of any error messages seen in NSX GUI or command lines pertinent to the investigation.
Handling Log Bundles for offline review with Broadcom support: