All 3 NSX Manager nodes are seeing /var/log usage increase up to 100%

Products

VMware NSX

Issue/Introduction

Files located within /var/log or subdirectories of /var/log are increasing in size without the compression and rotation that is normally seen.
[root@nsxmgr-01a -]# df -h
Filesystem Size Used Avail Use% Mounted on
tmpfs 4.8G 1.4M 4.8G 1% /run
/dev/sda2 11G 4.9G 4.9G 50% /
tmpfs 24G 4.0M 24G 1% /dev/shm
tmpfs 5.0M 0 5.0M 0% /run/lock
/dev/sda1 943M 7.1M 871M 1% /boot
/dev/mapper/nsx-image 43G 1.5G 39G 4% /image
/dev/mapper/nsx-config__bak 29G 65M 28G 1% /config
/dev/mapper/nsx-config 29G 55M 28G 1% /config_bak
/dev/mapper/nsx-secondary 99G 1.7G 92G 2% /nonconfig
/dev/mapper/nsx-repository 31G 9.3G 20G 32% /repository
/dev/mapper/nsx-tmp 3.7G 3.5M 3.5G 1% /tmp
/dev/mapper/nsx-var+dump 9.3G 24K 8.8G 1% /var/dump
/dev/mapper/nsx-var+log 27G 26G 257M 100% /var/log
tmpfs 4.8G 4.0K 4.8G 1% /run/user/1007

Example:
Within /etc/logrotate.d/nsx-audit and /etc/logrotate.d/rsyslog, you will find the log rotation configuration for both the nsx-audit.log file as well as the syslog file. Relevant output below:
/var/log/nsx-audit.log
{
rotate 10
size 1M
}

/var/log/syslog
{
rotate 56
size 32M
compress
}
- Below you can see what the nsx-audit.log files should look like in a normal situation:
  -rw-rw-r-- 1 m##07##0 om_vmware_support 416K Nov 24 12:29 nsx-audit.log
  -rw-rw-r-- 1 m##07##0 om_vmware_support 1.1M Nov 24 12:29 nsx-audit.log.1
  -rw-rw-r-- 1 m##07##0 om_vmware_support 1.2M Nov 24 12:29 nsx-audit.log.10
  -rw-rw-r-- 1 m##07##0 om_vmware_support 1.1M Nov 24 12:29 nsx-audit.log.2
  -rw-rw-r-- 1 m##07##0 om_vmware_support 1.1M Nov 24 12:29 nsx-audit.log.3
  -rw-rw-r-- 1 m##07##0 om_vmware_support 1.2M Nov 24 12:28 nsx-audit.log.4
  -rw-rw-r-- 1 m##07##0 om_vmware_support 1.2M Nov 24 12:29 nsx-audit.log.5
  -rw-rw-r-- 1 m##07##0 om_vmware_support 1.1M Nov 24 12:29 nsx-audit.log.6
  -rw-rw-r-- 1 m##07##0 om_vmware_support 1.1M Nov 24 12:29 nsx-audit.log.7
  -rw-rw-r-- 1 m##07##0 om_vmware_support 1.2M Nov 24 12:29 nsx-audit.log.8
  -rw-rw-r-- 1 m##07##0 om_vmware_support 1.2M Nov 24 12:28 nsx-audit.log.9
  There are 10 files, roughly 1M in size or a little over. These are numbered 1 through 10. The current log file is observed to be smaller, as it is still filling up. This matches what the files should look like when compared to the nsx-audit.log logrotate.d configuration seen above.
- The following is similar to what is seen instead, using syslog as an example. When comparing to the syslog logrotate.d configuration seen above, we see some discrepancies. The main observation here is that instead of a maximum file size of roughly 32M, the log file being filled currently is 1.7G's in size, substantially larger than its log rotation configuration should allow. You may also note that there are less numbered files than expected:
  -rw-rw-r-- 1 m##07##0 om_vmware_support 1.7G Nov 24 12:29 syslog
  -rw-rw-r-- 1 m##07##0 om_vmware_support 33M Nov 24 12:29 syslog.1
You had an issue with your NTP infrastructure recently, which caused time synchronization between network devices to break. You may or may not be aware of this NTP event.
You may see log messages similar to the below (note that the year 2142 seen below could be a different number):
- rsyslog.log:
  1970-01-01T00:00:00.814Z nsxmgr1.abc-corp.org rsyslogd - - - syslogTime2time_t: invalid year 2142 in timestamp - returning 1970-01-01 instead [v8.2308.0]
  1970-01-01T00:00:00.290Z nsxmgr1.abc-corp.org rsyslogd - - - message repeated 498 times: [syslogTime2time_t: invalid year 2142 in timestamp - returning 1970-01-01 instead [v8.2308.0]]
  1970-01-01T00:00:00.437Z nsxmgr1.abc-corp.org rsyslogd - - - rsyslogd[internal_messages]: 3556 messages lost due to rate-limiting (500 allowed within 5 seconds)
- Multiple different log files displaying messages stating: "The timer fell behind the system clock by <number>ms."
  These may be found within /search/search-wrapper.log, /idps-reporting/tanuki.log, upgrade-coordinator-tomcat-wrapper.log, phonehome-coordinator-tomcat-wrapper.log, and potentially others.
  proxy-tomcat-wrapper.log:
  INFO | wrapper | 2025/11/07 21:06:42 | The timer fell behind the system clock by 839909076ms.
  INFO | jvm 1 | 2025/11/07 21:06:42 | WrapperManager: The timer fell behind the system clock by 839,909,176 ms
The invalid date observed within rsyslog may be seen as the time stamp for some of the messages found within the log files I mentioned above.
phonehome-coordinator-tomcat-wrapper.log:
STATUS | wrapper | 2142/05/01 00:57:00 | Launching a JVM...
STATUS | wrapper | 2142/05/01 01:41:17 | JVM received a signal SIGKILL (9).
STATUS | wrapper | 2142/05/01 01:41:17 | JVM process is gone.
STATUS | wrapper | 2142/05/01 01:41:17 | JVM process exited with a code of 1, setting the Wrapper exit code to 1.
ERROR | wrapper | 2142/05/01 01:41:17 | JVM exited unexpectedly.
STATUS | wrapper | 2142/05/01 01:41:17 | TERM trapped. Shutting down.
STATUS | wrapper | 2142/05/01 01:41:17 | <-- Wrapper Stopped
STATUS | wrapper | 2142/05/01 01:41:48 | --> Wrapper Started as Daemon
STATUS | wrapper | 2142/05/01 01:41:48 |
STATUS | wrapper | 2142/05/01 01:41:55 | Launching a JVM...
STATUS | wrapper | 2142/05/01 02:31:26 | TERM trapped. Shutting down.
ERROR | wrapper | 2142/05/01 02:32:01 | Shutdown failed: Timed out waiting for signal from JVM.
STATUS | wrapper | 2142/05/01 02:32:01 | Dumping JVM state.
Within the syslog and/or kern.log files, you may see messages similar to below, referring to processes running out of memory:

2025-11-22T19:49:26.849Z nsxmgr1.abc-corp.org kernel - - - [1125633.422729] cat invoked oom-killer: gfp_mask=0x100dca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), order=0, oom_score_adj=0
2025-11-22T19:49:26.860Z nsxmgr1.abc-corp.org kernel - - - [1125633.422849] [<ffffffff811dce2a>] oom_kill_process+0x12a/0x130 ffffc90006ce7bf8
2025-11-22T19:49:26.864Z nsxmgr1.abc-corp.org kernel - - - [1125633.422860] [<ffffffff81235be7>] __alloc_pages_may_oom+0x147/0x260 ffffc90006ce7c58
2025-11-22T19:49:26.950Z nsxmgr1.abc-corp.org - - - [1125633.423129] [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name
2025-11-22T19:49:26.127Z nsxmgr1.abc-corp.org - - - [1125633.423384] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(nil),cpuset=/,mems_allowed=0-1,global_oom,task_memcg=/system.slice/upgrade-coordinator.service,task=java,pid=3870,uid=112
2025-11-22T19:49:26.127Z nsxmgr1.abc-corp.org - - - [1125633.423606] Out of memory: Killed process 3870 (java) total-vm:2643348kB, anon-rss:1113764kB, file-rss:0kB, shmem-rss:0kB, UID:112 pgtables:2668kB oom_score_adj:500
2025-11-22T19:49:26.174Z nsxmgr1.abc-corp.org systemd 1 - - upgrade-coordinator.service: A process of this unit has been killed by the OOM killer.
2025-11-22T19:49:27.893Z nsxmgr1.abc-corp.org systemd 1 - - upgrade-coordinator.service: Failed with result 'oom-kill'.

Environment

VMware NSX

Cause

The initial trigger was the time synchronization failure within the NTP infrastructure. This caused the year to be randomly modified to 2142 in some the NSX Manager log files (this could potentially affect Edge VMs as well).

The reason that the year 2142 is a problem (beyond the obvious issue of being incorrect) is that it is after the year 2038. On January 19 2038, gzip will encounter a problem related to the counter it uses to measure time. Although it will be fixed well in advance of this date, the systems that use this method to measure time can run into the problem prematurely if there is a time synchronization issue that results in files being modified to one of these invalid dates in the future.

These systems use a 32-bit signed integer to represent the number of seconds since the Unix epoch (January 1, 1970), and will encounter an overflow when that count reaches 2,147,483,647, which occurs on January 19, 2038. The end result is that log files marked with these invalid time stamps will fail during compression and log rotation. These log files will continue to grow larger and larger, until /var/log reaches 100% usage and Management functionality fails.
The out of memory (OOM) messages observed within the syslog and kern.log files are another result of the log rotation issue. Once the fix described in the Resolution section is implemented, you will observe no further OOM messages of this type.

Resolution

Steps to fix this issue. Run on each Manager or Edge VM displaying the symptoms:

Find and notate the names of every file that was modified with the invalid date by examining the /system/ls_-althR_var file in the Manager logs:
$ grep " <date_found_in_logs> " nsx_manager_########-9bab-####-b674-############_20251126_######/system/ls_-althR_var | less
-rw-r----- 1 syslog adm 1.5M May 1 2142 mail.err
-rw-r----- 1 syslog adm 1.5M May 1 2142 mail.log
-rw-r----- 1 syslog adm 8.1M May 1 2142 kern.log.1
-rw-r----- 1 syslog adm 2.2M Apr 30 2142 auth.log.1.gz-2142043000.backup
-rw-r----- 1 syslog adm 3.4M Apr 30 2142 syslog.1.gz-2142043000.backup
-rw-r----- 1 syslog adm 49M Apr 28 2142 auth.log.1
-rw-r----- 1 syslog adm 33M Apr 28 2142 syslog.1
-rw-r----- 1 syslog adm 1.1M Apr 28 2142 rsyslog.log.1
-rw-r----- 1 root adm 297K May 1 2142 spring.4.log.gz
-rw-r----- 1 root adm 2.6M Apr 29 2142 message_trace.4.log.gz If desired, you can determine the total number of files with the modified invalid date by using this command in the Manager log bundle: $ grep " <date_found_in_logs> " nsx_manager_########-9bab-####-b674-############_20251126_133110/system/ls_-althR_var | wc -l

Alternatively, the files with the invalid dates can be found by invoking the following command in the Manager root shell:
# find /var/log -type f | sort | xargs ls -logh {} | grep " <date_found_in_logs> "
Delete all files with the invalid date.
Reboot the Manager

All 3 NSX Manager nodes are seeing /var/log usage increase up to 100%

Article ID: 420572

Updated On:

Products

Issue/Introduction

Environment

Cause

Resolution

Feedback