NSX Intelligence appliance running out of disk space

Article ID: 315189

Products

VMware NSX

Issue/Introduction

Symptoms:
  • The NSX Intelligence appliance is running out of disk space.
  • The NSX Intelligence appliance /data partition becomes full (a quick disk-space check, shown below, can confirm this).
  • The NSX Intelligence feature is not functioning and the User Interface (UI) is not responding.

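To confirm these symptoms, you can check the usage of the /data partition from a shell on the appliance. This is a minimal check using the partition path from this article; the exact output depends on the appliance build:

    df -h /data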

Environment

VMware NSX-T Data Center 2.5.x
VMware NSX-T Data Center

Cause

This issue occurs because the historical/inactive Spark worker directories under the /data/spark/worker parent directory are not removed, and because the /data/spark/worker/driver-* directories contain Spark's internal stdout log file, which does not rotate and keeps growing in size even when the appliance is not processing any new data.

As a result, the appliance runs out of disk space within several weeks to several months, and NSX Intelligence eventually becomes non-functional.
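
To see where the space is going, you can inspect the Spark worker directory on the appliance. A minimal sketch using the paths from this article (output will vary by deployment):

    du -sh /data/spark/worker/driver-* | sort -h | tail -5
    ls -lh /data/spark/worker/driver-*/stdout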

Resolution

This is a known issue affecting VMware NSX-T Data Center 2.5.x.

Workaround:

If the issue has already occurred

  1. Delete all the data from the /data/spark/worker/ directory by running this command:

    rm -rf /data/spark/worker/*
     
  2. Delete all the data from the /data/spark/flowCorrelator/checkpoints directory by running this command:

    rm -rf /data/spark/flowCorrelator/checkpoints/*
     
  3. Delete all the data from the /data/spark/local/ directory by running this command:

    rm -rf /data/spark/local/*
     
  4. Reboot the NSX Intelligence appliance, then confirm that the space was reclaimed (see the check after this list).
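
After the reboot, a quick way to confirm that the cleanup freed space on the /data partition (paths from this article; output will vary):

    df -h /data
    du -sh /data/spark/worker/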

To prevent this issue from occurring

  1. Edit this file on the NSX Intelligence appliance:

    /opt/apache-spark_2.x.x/conf/spark-defaults.conf
     
  2. Append the following three lines to the end of the file and save it. These settings enable size-based rolling of the executor logs, retaining at most 10 rolled files of 209715200 bytes (200 MiB) each:

    spark.executor.logs.rolling.strategy size
    spark.executor.logs.rolling.maxRetainedFiles 10
    spark.executor.logs.rolling.maxSize 209715200

     
  3. Edit this file on the NSX Intelligence appliance:

    /opt/apache-spark_2.x.x/conf/spark-env.sh
     
  4. Append the following options to the end of the value of the SPARK_WORKER_OPTS environment variable and save the file. (Don't forget the hyphen before the D at the beginning of each option; an illustrative example of the resulting line appears after this list.)

    -Dspark.executor.logs.rolling.strategy=size -Dspark.executor.logs.rolling.maxRetainedFiles=10 -Dspark.executor.logs.rolling.maxSize=209715200 -Dspark.worker.cleanup.enabled=true -Dspark.worker.cleanup.interval=300 -Dspark.worker.cleanup.appDataTtl=10800 
     
  5. Create this new file on the NSX Intelligence appliance:

    /etc/cron.d/spark_driver_log_cleanup_task
     
  6. Add this line to the file and save it. This defines an hourly cron job, run as the spark user, that truncates any non-empty driver stdout files:

    0 * * * * spark find /data/spark/worker/driver-*/stdout -not -empty -print0 | xargs -r -0 truncate -s 0
     
  7. Restart the spark service on the appliance by running these commands:

    systemctl stop spark
    systemctl start spark
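
For reference, after step 4 the SPARK_WORKER_OPTS line in spark-env.sh could look similar to the line below. This is only an illustrative sketch: the pre-existing option (-Dspark.worker.timeout=60) is hypothetical, and the existing value in your file may differ.

    SPARK_WORKER_OPTS="-Dspark.worker.timeout=60 -Dspark.executor.logs.rolling.strategy=size -Dspark.executor.logs.rolling.maxRetainedFiles=10 -Dspark.executor.logs.rolling.maxSize=209715200 -Dspark.worker.cleanup.enabled=true -Dspark.worker.cleanup.interval=300 -Dspark.worker.cleanup.appDataTtl=10800"

With these options, the Spark worker checks for old application data every 300 seconds and removes application directories older than 10800 seconds (3 hours), while executor logs are rolled by size instead of growing unbounded.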
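
After restarting, you can verify that the service came back up and, over time, that the worker directories stay bounded in size:

    systemctl status spark
    du -sh /data/spark/worker/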