All 3 NSX Manager nodes are seeing /var/log usage increase up to 100%
search cancel

All 3 NSX Manager nodes are seeing /var/log usage increase up to 100%

book

Article ID: 420572

calendar_today

Updated On:

Products

VMware NSX

Issue/Introduction

  • Files located within /var/log or subdirectories of /var/log are increasing in size without the compression and rotation that is normally seen.
    [root@nsxmgr-01a -]# df -h
    Filesystem                   Size    Used   Avail  Use%  Mounted on
    tmpfs                        4.8G    1.4M   4.8G   1%    /run
    /dev/sda2                    11G     4.9G   4.9G   50%   /
    tmpfs                        24G     4.0M   24G    1%    /dev/shm
    tmpfs                        5.0M    0      5.0M   0%    /run/lock
    /dev/sda1                    943M    7.1M   871M   1%    /boot
    /dev/mapper/nsx-image        43G     1.5G   39G    4%    /image
    /dev/mapper/nsx-config__bak  29G     65M    28G    1%    /config
    /dev/mapper/nsx-config       29G     55M    28G    1%    /config_bak
    /dev/mapper/nsx-secondary    99G     1.7G   92G    2%    /nonconfig
    /dev/mapper/nsx-repository   31G     9.3G   20G    32%   /repository
    /dev/mapper/nsx-tmp          3.7G    3.5M   3.5G   1%    /tmp
    /dev/mapper/nsx-var+dump     9.3G    24K    8.8G   1%    /var/dump
    /dev/mapper/nsx-var+log      27G     26G    257M   100%  /var/log  
    tmpfs                        4.8G    4.0K   4.8G   1%    /run/user/1007

    Example:
    Within /etc/logrotate.d/nsx-audit and /etc/logrotate.d/rsyslog, you will find the log rotation configuration for both the nsx-audit.log file as well as the syslog file.  Relevant output below: 
    /var/log/nsx-audit.log
    {
            rotate 10
            size 1M
    }

    /var/log/syslog
    {
            rotate 56
            size 32M
            compress
    }


    • Below you can see what the nsx-audit.log files should look like in a normal situation:  
      -rw-rw-r--  1 m##07##0 om_vmware_support 416K Nov 24 12:29 nsx-audit.log
      -rw-rw-r--  1 m##07##0 om_vmware_support 1.1M Nov 24 12:29 nsx-audit.log.1
      -rw-rw-r--  1 m##07##0 om_vmware_support 1.2M Nov 24 12:29 nsx-audit.log.10
      -rw-rw-r--  1 m##07##0 om_vmware_support 1.1M Nov 24 12:29 nsx-audit.log.2
      -rw-rw-r--  1 m##07##0 om_vmware_support 1.1M Nov 24 12:29 nsx-audit.log.3
      -rw-rw-r--  1 m##07##0 om_vmware_support 1.2M Nov 24 12:28 nsx-audit.log.4
      -rw-rw-r--  1 m##07##0 om_vmware_support 1.2M Nov 24 12:29 nsx-audit.log.5
      -rw-rw-r--  1 m##07##0 om_vmware_support 1.1M Nov 24 12:29 nsx-audit.log.6
      -rw-rw-r--  1 m##07##0 om_vmware_support 1.1M Nov 24 12:29 nsx-audit.log.7
      -rw-rw-r--  1 m##07##0 om_vmware_support 1.2M Nov 24 12:29 nsx-audit.log.8
      -rw-rw-r--  1 m##07##0 om_vmware_support 1.2M Nov 24 12:28 nsx-audit.log.9

      There are 10 files, roughly 1M in size or a little over.  These are numbered 1 through 10.  The current log file is observed to be smaller, as it is still filling up.  This matches what the files should look like when compared to the nsx-audit.log logrotate.d configuration seen above.

    • The following is similar to what is seen instead, using syslog as an example.  When comparing to the syslog logrotate.d configuration seen above, we see some discrepancies.  The main observation here is that instead of a maximum file size of roughly 32M, the log file being filled currently is 1.7G's in size, substantially larger than its log rotation configuration should allow.  You may also note that there are less numbered files than expected: 
      -rw-rw-r--  1 m##07##0 om_vmware_support 1.7G Nov 24 12:29 syslog
      -rw-rw-r--  1 m##07##0 om_vmware_support  33M Nov 24 12:29 syslog.1
                                                                      
  • You had an issue with your NTP infrastructure recently, which caused time synchronization between network devices to break.  You may or may not be aware of this NTP event.  

  • You may see log messages similar to the below (note that the year 2142 seen below could be a different number):
    • rsyslog.log:
      1970-01-01T00:00:00.814Z nsxmgr1.abc-corp.org rsyslogd - - -  syslogTime2time_t: invalid year 2142 in timestamp - returning 1970-01-01 instead [v8.2308.0]
      1970-01-01T00:00:00.290Z nsxmgr1.abc-corp.org rsyslogd - - -   message repeated 498 times: [syslogTime2time_t: invalid year 2142 in timestamp - returning 1970-01-01 instead [v8.2308.0]]
      1970-01-01T00:00:00.437Z nsxmgr1.abc-corp.org rsyslogd - - -  rsyslogd[internal_messages]: 3556 messages lost due to rate-limiting (500 allowed within 5 seconds)

    • Multiple different log files displaying messages stating: "The timer fell behind the system clock by <number>ms." 
      These may be found within /search/search-wrapper.log, /idps-reporting/tanuki.log, upgrade-coordinator-tomcat-wrapper.log, phonehome-coordinator-tomcat-wrapper.log, and potentially others. 
      proxy-tomcat-wrapper.log:
      INFO   | wrapper  | 2025/11/07 21:06:42 | The timer fell behind the system clock by 839909076ms.
      INFO   | jvm 1    | 2025/11/07 21:06:42 | WrapperManager: The timer fell behind the system clock by 839,909,176 ms

  • The invalid date observed within rsyslog may be seen as the time stamp for some of the messages found within the log files I mentioned above. 
    phonehome-coordinator-tomcat-wrapper.log:
    STATUS | wrapper  | 2142/05/01 00:57:00 | Launching a JVM...
    STATUS | wrapper  | 2142/05/01 01:41:17 | JVM received a signal SIGKILL (9).
    STATUS | wrapper  | 2142/05/01 01:41:17 | JVM process is gone.
    STATUS | wrapper  | 2142/05/01 01:41:17 | JVM process exited with a code of 1, setting the Wrapper exit code to 1.
    ERROR  | wrapper  | 2142/05/01 01:41:17 | JVM exited unexpectedly.
    STATUS | wrapper  | 2142/05/01 01:41:17 | TERM trapped.  Shutting down.
    STATUS | wrapper  | 2142/05/01 01:41:17 | <-- Wrapper Stopped
    STATUS | wrapper  | 2142/05/01 01:41:48 | --> Wrapper Started as Daemon
    STATUS | wrapper  | 2142/05/01 01:41:48 |  
    STATUS | wrapper  | 2142/05/01 01:41:55 | Launching a JVM...
    STATUS | wrapper  | 2142/05/01 02:31:26 | TERM trapped.  Shutting down.
    ERROR  | wrapper  | 2142/05/01 02:32:01 | Shutdown failed: Timed out waiting for signal from JVM.
    STATUS | wrapper  | 2142/05/01 02:32:01 | Dumping JVM state.

  • Within the syslog and/or kern.log files, you may see messages similar to below, referring to processes running out of memory:

    2025-11-22T19:49:26.849Z nsxmgr1.abc-corp.org kernel - - - [1125633.422729] cat invoked oom-killer: gfp_mask=0x100dca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), order=0, oom_score_adj=0
    2025-11-22T19:49:26.860Z nsxmgr1.abc-corp.org kernel - - - [1125633.422849]  [<ffffffff811dce2a>] oom_kill_process+0x12a/0x130 ffffc90006ce7bf8
    2025-11-22T19:49:26.864Z nsxmgr1.abc-corp.org kernel - - - [1125633.422860]  [<ffffffff81235be7>] __alloc_pages_may_oom+0x147/0x260 ffffc90006ce7c58
    2025-11-22T19:49:26.950Z nsxmgr1.abc-corp.org - - - [1125633.423129] [  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
    2025-11-22T19:49:26.127Z nsxmgr1.abc-corp.org - - - [1125633.423384] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(nil),cpuset=/,mems_allowed=0-1,global_oom,task_memcg=/system.slice/upgrade-coordinator.service,task=java,pid=3870,uid=112
    2025-11-22T19:49:26.127Z nsxmgr1.abc-corp.org - - - [1125633.423606] Out of memory: Killed process 3870 (java) total-vm:2643348kB, anon-rss:1113764kB, file-rss:0kB, shmem-rss:0kB, UID:112 pgtables:2668kB oom_score_adj:500
    2025-11-22T19:49:26.174Z nsxmgr1.abc-corp.org systemd 1 - -  upgrade-coordinator.service: A process of this unit has been killed by the OOM killer.
    2025-11-22T19:49:27.893Z nsxmgr1.abc-corp.org systemd 1 - -  upgrade-coordinator.service: Failed with result 'oom-kill'.

     

Environment

VMware NSX

Cause

  • The initial trigger was the time synchronization failure within the NTP infrastructure.  This caused the year to be randomly modified to 2142 in some the NSX Manager log files (this could potentially affect Edge VMs as well). 

    The reason that the year 2142 is a problem (beyond the obvious issue of being incorrect) is that it is after the year 2038.  On January 19 2038, gzip will encounter a problem related to the counter it uses to measure time.  Although it will be fixed well in advance of this date, the systems that use this method to measure time can run into the problem prematurely if there is a time synchronization issue that results in files being modified to one of these invalid dates in the future. 

    These systems use a 32-bit signed integer to represent the number of seconds since the Unix epoch (January 1, 1970), and will encounter an overflow when that count reaches 2,147,483,647, which occurs on January 19, 2038.  The end result is that log files marked with these invalid time stamps will fail during compression and log rotation.  These log files will continue to grow larger and larger, until /var/log reaches 100% usage and Management functionality fails.
     
  • The out of memory (OOM) messages observed within the syslog and kern.log files are another result of the log rotation issue.  Once the fix described in the Resolution section is implemented, you will observe no further OOM messages of this type.
     

Resolution

Steps to fix this issue.  Run on each Manager or Edge VM displaying the symptoms:

  1. Find and notate the names of every file that was modified with the invalid date by examining the /system/ls_-althR_var file in the Manager logs:
    $ grep " <date_found_in_logs> " nsx_manager_########-9bab-####-b674-############_20251126_######/system/ls_-althR_var | less
    -rw-r-----   1 syslog         adm          1.5M May   1  2142   mail.err
    -rw-r-----   1 syslog         adm          1.5M May   1  2142   mail.log
    -rw-r-----   1 syslog         adm          8.1M May   1  2142   kern.log.1
    -rw-r-----   1 syslog         adm          2.2M Apr  30  2142   auth.log.1.gz-2142043000.backup
    -rw-r-----   1 syslog         adm          3.4M Apr  30  2142   syslog.1.gz-2142043000.backup
    -rw-r-----   1 syslog         adm           49M Apr  28  2142   auth.log.1
    -rw-r-----   1 syslog         adm           33M Apr  28  2142   syslog.1
    -rw-r-----   1 syslog         adm          1.1M Apr  28  2142   rsyslog.log.1
    -rw-r-----   1 root           adm          297K May   1  2142   spring.4.log.gz
    -rw-r-----   1 root           adm          2.6M Apr  29  2142   message_trace.4.log.gz

    If desired, you can determine the total number of files with the modified invalid date by using this command in the Manager log bundle:
    $ grep " <date_found_in_logs> " nsx_manager_########-9bab-####-b674-############_20251126_133110/system/ls_-althR_var | wc -l


    Alternatively, the files with the invalid dates can be found by invoking the following command in the Manager root shell:
    # find /var/log -type f | sort | xargs ls -logh {} | grep " <date_found_in_logs> "

  2. Delete all files with the invalid date.

  3. Reboot the Manager