NSX Manager /var/log partition reaches 100% disk usage due to uncompressed rolled logs

Products

VMware NSX

Issue/Introduction

Disk usage of the /var/log partition on one or more NSX Manager nodes becomes very high, for example:

Filesystem Size Used Avail Use% Mounted on
tmpfs 9.5G 1.3M 9.5G 1% /run
/dev/sda3 11G 4.2G 5.6G 43% /
tmpfs 48G 4.5M 48G 1% /dev/shm
tmpfs 5.0M 0 5.0M 0% /run/lock
/dev/mapper/nsx-repository 31G 12G 19G 39% /repository
/dev/mapper/nsx-tmp 9.6G 169M 9.0G 2% /tmp
/dev/mapper/nsx-secondary 98G 5.8G 88G 7% /nonconfig
/dev/mapper/nsx-var+dump 20G 24K 19G 1% /var/dump
/dev/mapper/nsx-var+log 37G 32G 2.9G 92% /var/log
/dev/sda1 942M 7.2M 870M 1% /boot
/dev/mapper/nsx-config__bak 29G 3.0G 25G 11% /config_bak
/dev/mapper/nsx-config 29G 1.4G 27G 6% /config
/dev/mapper/nsx-image 62G 20G 40G 33% /image
tmpfs 9.5G 8.0K 9.5G 1% /run/user/1007
tmpfs 9.5G 8.0K 9.5G 1% /run/user/0

Inspecting the /var/log partition reveals multiple rolled log files in an uncompressed state.

Example: /var/log/proton# ls -lrt nsxapi*

-rw-r----- 1 uproton uproton 262145339 Feb 8 06:49 nsxapi.60.log
-rw-r----- 1 uproton uproton 262151464 Feb 8 07:04 nsxapi.59.log
-rw-r----- 1 uproton uproton 262144771 Feb 8 07:24 nsxapi.58.log
-rw-r----- 1 uproton uproton 262144117 Feb 8 07:38 nsxapi.57.log
-rw-r----- 1 uproton uproton 262144176 Feb 8 08:21 nsxapi.56.log
-rw-r----- 1 uproton uproton 262144614 Feb 8 08:38 nsxapi.55.log
-rw-r----- 1 uproton uproton 262144068 Feb 8 09:32 nsxapi.54.log
-rw-r----- 1 uproton uproton 262144355 Feb 8 09:48 nsxapi.53.log
-rw-r----- 1 uproton uproton 262144273 Feb 8 10:09 nsxapi.52.log
-rw-r----- 1 uproton uproton 262144149 Feb 8 10:18 nsxapi.51.log
-rw-r----- 1 uproton uproton 262144435 Feb 8 10:45 nsxapi.50.log
-rw-r----- 1 uproton uproton 262145156 Feb 8 11:32 nsxapi.49.log
-rw-r----- 1 uproton uproton 262144092 Feb 8 11:46 nsxapi.48.log
-rw-r----- 1 uproton uproton 262144137 Feb 8 11:57 nsxapi.47.log
-rw-r----- 1 uproton uproton 262144072 Feb 8 12:04 nsxapi.46.log
-rw-r----- 1 uproton uproton 262144142 Feb 8 12:09 nsxapi.45.log
-rw-r----- 1 uproton uproton 262144835 Feb 8 12:18 nsxapi.44.log
-rw-r----- 1 uproton uproton 262144298 Feb 8 12:33 nsxapi.43.log
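
The check above can be repeated across all of /var/log rather than one directory at a time. The following is a minimal sketch, not a supported NSX tool: the function names are hypothetical, and it assumes rolled logs follow the <log_name>.<number>.log pattern shown in the listing.

```shell
#!/bin/sh
# Hypothetical helpers, not part of NSX.

# Print every uncompressed rolled log (<log_name>.<number>.log) below a directory.
# The active <log_name>.log and compressed .gz archives do not match the pattern.
list_uncompressed_rolled_logs() {
    dir="${1:-/var/log}"
    find "$dir" -type f -name '*.log' 2>/dev/null | grep -E '\.[0-9]+\.log$' | sort
}

# Print the combined size, in bytes, of the files found above.
total_uncompressed_bytes() {
    list_uncompressed_rolled_logs "${1:-/var/log}" | while IFS= read -r f; do
        wc -c < "$f"
    done | awk '{ sum += $1 } END { print sum + 0 }'
}
```

For example, running list_uncompressed_rolled_logs /var/log/proton followed by total_uncompressed_bytes /var/log/proton would show which files match and how much space they occupy. Both functions only read the file system.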

NSX Manager raises the alarm manager_health.manager_disk_usage_high or manager_health.manager_disk_usage_very_high, indicating that the log partition has exceeded its capacity thresholds.

CRITICAL NSX 3133 [nsx@4413 comp="nsx-manager" subcomp="node-mgmt" username="root" level="CRITICAL" eventFeatureName="manager_health" eventType="manager_disk_usage_very_high" eventSev="critical" eventState="On" entId="########" logger="nsx_monitoring.clientlibrary.event_source"] At the time this alarm was raised, the disk usage for the Manager node disk partition /var/log reached 90% which is at or above the very high threshold value of 90%.

Environment

VMware NSX

Cause

This issue is caused by manually decompressing rolled NSX log files (for example, .gz archives) in place within the /var/log directory of the NSX Manager.

When these files are manually unzipped, the resulting '.log' files are no longer recognized by the automated rotation and re-compression routines. Consequently, the uncompressed files remain on the file system indefinitely and keep consuming space until the /var/log partition reaches capacity, which may lead to management plane instability. This manual modification of the log structure is considered an unsupported administrative action.

Resolution

To resolve this issue on a live system, follow these steps:

Locate the log repository containing the uncompressed files. For example, if the Proton service logs were uncompressed, navigate to /var/log/proton/.
Identify files following the pattern <log_name>.<number>.log. Example: nsxapi.1.log, nsxapi.2.log, through nsxapi.20.log.
Perform a backup of these files to an external location if needed for audit or troubleshooting. Once backed up, delete the uncompressed numbered logs.
Confirm the directory is clean. Using the Proton example, only the active nsxapi.log and any valid .gz archives should remain in /var/log/proton/. Ensure no other nsxapi.*.log files exist.

Caution: Do not delete the active log file currently being written to (e.g., nsxapi.log). Only remove the uncompressed historical logs.
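
The backup and delete steps above can be sketched as a single helper. This is an illustration under stated assumptions, not an official procedure: cleanup_rolled_logs is a hypothetical name, and the backup directory is assumed to be on a file system other than /var/log (for example a mounted external location) with enough free space. The helper defaults to a dry run and removes a file only after its compressed backup copy passes an integrity test.

```shell
#!/bin/sh
# Hypothetical helper, not part of NSX.
# Usage: cleanup_rolled_logs <log_dir> <backup_dir> [--delete]
# Without --delete, it only reports what it would back up and remove.
cleanup_rolled_logs() {
    log_dir="$1"
    backup_dir="$2"
    mode="${3:-}"

    for f in "$log_dir"/*.log; do
        [ -f "$f" ] || continue
        name="${f##*/}"

        # Keep only <log_name>.<number>.log; this skips the active <log_name>.log.
        printf '%s\n' "$name" | grep -Eq '\.[0-9]+\.log$' || continue

        if [ "$mode" = "--delete" ]; then
            # Write a compressed copy, verify it, and only then remove the original.
            gzip -c "$f" > "$backup_dir/$name.gz" \
                && gzip -t "$backup_dir/$name.gz" \
                && rm -f -- "$f" \
                || echo "backup failed, kept: $f" >&2
        else
            echo "would back up and delete: $f"
        fi
    done
}
```

Running the helper first without --delete and reviewing the reported files is the safer order. Valid .gz archives in the log directory are never touched because the loop only considers names ending in .log.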

Prevention: Do not unzip any log files on the manager under /var/log/. If analysis is needed, copy the log files off the node and unzip them elsewhere.
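
As a sketch of the off-node workflow described above: once the .gz archives have been copied to a separate machine, they can be decompressed into a working directory while the copied archives stay intact. The function name and directory arguments are hypothetical, and the helper is meant to run on the analysis machine, not on the NSX Manager.

```shell
#!/bin/sh
# Hypothetical helper for a workstation that holds copies of the log archives.
# Usage: extract_logs_elsewhere <src_dir> <work_dir>
extract_logs_elsewhere() {
    src_dir="$1"
    work_dir="$2"
    mkdir -p "$work_dir"

    for gz in "$src_dir"/*.gz; do
        [ -f "$gz" ] || continue
        name="${gz##*/}"
        # gzip -dc writes the decompressed data to stdout and leaves the archive unchanged.
        gzip -dc "$gz" > "$work_dir/${name%.gz}"
    done
}
```

Because the output goes to a different directory, the source directory keeps only its original .gz files.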
Monitoring: Use existing alarms manager_health.manager_disk_usage_high and manager_health.manager_disk_usage_very_high for /var/log.
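
For an ad hoc check from a root shell, in addition to the alarms, the usage of the partition can be read from df. The sketch below is an assumption-laden illustration: the function names are hypothetical, and the default threshold of 90 is taken from the very high alarm message shown earlier.

```shell
#!/bin/sh
# Hypothetical helpers, not part of NSX.

# Print the usage percentage (digits only) of the file system holding a path.
# df -P produces the POSIX layout, where field 5 is the capacity column.
disk_usage_pct() {
    df -P "${1:-/var/log}" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Report whether usage is at or above a threshold; returns 1 when it is.
check_log_partition() {
    path="${1:-/var/log}"
    threshold="${2:-90}"
    pct=$(disk_usage_pct "$path")
    if [ "$pct" -ge "$threshold" ]; then
        echo "WARNING: $path is at ${pct}% (threshold ${threshold}%)"
        return 1
    fi
    echo "OK: $path is at ${pct}% (threshold ${threshold}%)"
}
```

Calling check_log_partition with no arguments checks /var/log against the 90% value; a different path or threshold can be passed as the first and second arguments.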