Disk write of NSX Edge VMs periodically spikes on the hour

Article ID: 322871


Updated On:

Products

VMware NSX

Issue/Introduction

Symptoms:
  • The vCenter performance chart for Edge VMs shows disk writes spiking periodically, on the hour.
Periodic_Disk_Spike.png
  • Other VMs might suffer from degraded storage performance if many Edge VMs reside on the same physical storage.
  • Many 8 MB files are generated on the hour in /var/log/journal/<machine-id>.
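
To confirm the last symptom, the journal files can be grouped by the hour in which they were last modified. The following is a minimal sketch, not an official diagnostic: the function name count_journals_per_hour is hypothetical, and it assumes GNU find is available on the appliance.

```shell
#!/bin/sh
# Count *.journal files per modification hour in one journal directory.
# Assumption: GNU find (needed for -printf). The directory is supplied by the caller.
count_journals_per_hour() {
    # %TY-%Tm-%Td %TH renders each file's mtime as "YYYY-MM-DD HH"
    find "$1" -maxdepth 1 -type f -name '*.journal' \
        -printf '%TY-%Tm-%Td %TH:00\n' | sort | uniq -c
}

# Example (run as root on the Edge node; substitute the real machine-id directory):
#   count_journals_per_hour "/var/log/journal/<machine-id>"
#   journalctl --disk-usage    # total disk space used by journal files
```

A large count clustered in a single hour is consistent with the rotation burst described in this article.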


Environment

VMware NSX-T Data Center
VMware NSX-T Data Center 3.x

Cause

Edge appliances run the integrity checker on the hour.
It executes find / -print0 | xargs -0 to check the integrity of many files in the appliance.
Since 3.1.0, auditd logs execve system calls, and those logs are stored in the journal.

The integrity checker passes a very large number of arguments through xargs, and journald treats every argument in an execve audit record as a separate field name.
As a result, the field hash table of a journal file quickly grows beyond its threshold, and the file is rotated immediately.
Each journal file is at least 8 MB.
The 8 MB journal files therefore rotate in rapid succession, and writing so many of them triggers a large amount of disk write on the hour.
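
The argument fan-out can be seen with a small experiment. The sketch below shows how find ... -print0 | xargs -0 packs many paths into a single execve call. The command that the integrity checker runs behind xargs is not stated here, so a stand-in (sh -c 'echo "$#"') that only prints its argument count is used, and the function name args_per_exec is hypothetical.

```shell
#!/bin/sh
# Print how many arguments xargs hands to each command it executes.
# Assumption: the stand-in command below replaces the integrity checker's
# real command, which this article does not specify.
args_per_exec() {
    # Each output line is the argument count of one execve issued by xargs.
    find "$1" -print0 | xargs -0 sh -c 'echo "$#"' sh
}

# Example: a directory containing three files yields one invocation with
# four arguments (the directory itself plus the three files):
#   args_per_exec /path/to/small/dir
```

Run against /, each invocation carries a very large argument list, and with execve auditing enabled each of those arguments ends up in the audit record that journald stores.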

Manager VMs are not affected because auditd on Manager VMs does not log execve system calls.
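
Whether a node audits execve can be checked from its loaded audit rules. The following is a minimal sketch that filters the output of auditctl -l. The function name has_execve_rule is hypothetical, and the exact rule text on an NSX appliance is an assumption (the pattern only looks for a -S list that contains execve).

```shell
#!/bin/sh
# Succeeds if the audit rules read from stdin include a syscall rule for execve.
# Assumption: rules appear in the usual auditctl -l form, e.g. "-S execve".
has_execve_rule() {
    grep -Eq -- '-S [^ ]*execve'
}

# Example (run as root):
#   if auditctl -l | has_execve_rule; then
#       echo "execve is audited on this node"
#   else
#       echo "no execve audit rule found"
#   fi
```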

Resolution

This is a known issue affecting NSX-T 3.1.0 - 3.1.2.1.

The issue is resolved in NSX-T 3.1.3 and 3.2.


Workaround:
There are 2 workarounds.
  1. Mask systemd-journald-audit.socket.

    /bin/systemctl stop systemd-journald-audit.socket
    /bin/systemctl disable systemd-journald-audit.socket
    /bin/systemctl mask systemd-journald-audit.socket


    This workaround is already implemented in 3.1.3 and in 3.2 or later.

    Then restart journald.

    systemctl restart systemd-journald
     
  2. Disable integrity checker.

    /opt/vmware/integrity-checker/bin/integrity_checker.py -f disable
Either workaround eliminates the periodic storage spike.
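
After applying the first workaround, the state of the socket unit can be verified. The following is a minimal sketch. The helper name is_masked_state is hypothetical, and it relies on systemctl is-enabled printing masked (or masked-runtime) for a masked unit.

```shell
#!/bin/sh
# Succeeds if the given unit-file state string means the unit is masked.
is_masked_state() {
    case "$1" in
        masked|masked-runtime) return 0 ;;
        *) return 1 ;;
    esac
}

# Example (run as root on the Edge node):
#   state=$(systemctl is-enabled systemd-journald-audit.socket 2>/dev/null)
#   if is_masked_state "$state"; then
#       echo "audit socket is masked"
#   else
#       echo "audit socket state: $state"
#   fi
```

Note that systemctl is-enabled exits non-zero for a masked unit, so the sketch compares the printed state instead of the exit code.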

Additional Information

Impact/Risks:
Edge VMs trigger a large amount of disk write on the hour, all at the same time.
This might degrade datastore performance if many Edge VMs reside on the same physical storage.
Other VMs might suffer from the degraded storage performance.