VCD Service Crash in VMware Cloud Director Due to "log4j:ERROR Failed to flush writer"

Article ID: 399867

Updated On:

Products

VMware Cloud Director

Issue/Introduction

The VMware Cloud Director (VCD) service may fail to start or crash shortly after initialization.

  • A review of the cell.log file in /opt/vmware/vcloud-director/logs/ shows that the service encounters a java.lang.OutOfMemoryError, followed by a failed heap dump due to insufficient disk space. The cell-runtime.log file in the same directory also contains the following error:

    log4j:ERROR Failed to flush writer, java.io.IOException: No space left on device

    This issue results in the VCD service becoming non-operational and may impact tenant access or administrative functionality within Cloud Director.
  • In some cases, the VCD service does not crash, but the partition is close to full and the database size grows very quickly.
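The symptoms above can be checked from a shell on each cell. The following Bash sketch reports how full the filesystems holding the log directory and the database are, and counts occurrences of the two errors mentioned above. It is illustrative only: the 90% threshold is an arbitrary example, /var/vmware/vpostgres is assumed (not confirmed by this article) to be where the appliance database lives, and the function names are hypothetical. The flush error is matched on its message prefix only, so it is found regardless of how the exception text is laid out in the log.

```shell
#!/usr/bin/env bash
# Sketch: check a VCD cell for the disk-space symptoms described above.
# Assumptions (adjust as needed): THRESHOLD=90 is an illustrative value, and
# DB_DIR=/var/vmware/vpostgres is assumed to hold the appliance database.

LOG_DIR="${LOG_DIR:-/opt/vmware/vcloud-director/logs}"
DB_DIR="${DB_DIR:-/var/vmware/vpostgres}"
THRESHOLD="${THRESHOLD:-90}"
FLUSH_ERROR='log4j:ERROR Failed to flush writer'

# Print the used-space percentage (digits only) of the filesystem holding $1.
disk_used_pct() {
  df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Print how many lines of file $2 contain the literal string $1 (0 if no file).
count_matches() {
  if [ -f "$2" ]; then
    grep -cF -- "$1" "$2" || true   # grep -c prints 0 and exits 1 on no match
  else
    echo 0
  fi
}

for path in "$LOG_DIR" "$DB_DIR"; do
  [ -e "$path" ] || continue   # skip paths that do not exist on this host
  pct=$(disk_used_pct "$path")
  if [ "$pct" -ge "$THRESHOLD" ]; then
    echo "WARNING: filesystem holding $path is ${pct}% full"
  else
    echo "OK: filesystem holding $path is ${pct}% full"
  fi
done

echo "OutOfMemoryError lines in cell.log: $(count_matches 'java.lang.OutOfMemoryError' "$LOG_DIR/cell.log")"
echo "Flush errors in cell-runtime.log: $(count_matches "$FLUSH_ERROR" "$LOG_DIR/cell-runtime.log")"
```

Non-zero counts together with a nearly full filesystem match the pattern described in this article; a full filesystem alone may have other causes.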

Environment

VMware Cloud Director 10.6.0.1

Cause

The root cause was identified as storage exhaustion due to uncontrolled growth of the audit_trail table in the Cloud Director database. This table records all user and system activity and can grow significantly, particularly in environments with integrations such as Container Service Extension (CSE) or Aria Operations.
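To confirm that audit_trail is the table consuming the space before changing anything, you can list the largest tables in the vcloud database. The Bash sketch below assumes the embedded appliance database and reuses the postgres login from the resolution steps; the function name is hypothetical, and the query relies only on standard PostgreSQL catalog views and size functions. It only reads data.

```shell
#!/usr/bin/env bash
# Sketch: list the largest tables in the vcloud database to check whether
# audit_trail is the one consuming space. Assumes the embedded appliance
# database and the same postgres login used in the resolution steps.

# Build a one-line query for the N largest user tables (default 10).
# pg_total_relation_size() includes indexes and TOAST data.
# Kept on a single line to avoid newline-quoting issues through sudo -i.
largest_tables_sql() {
  printf '%s' "SELECT relname, pg_size_pretty(pg_total_relation_size(relid)) AS total_size FROM pg_catalog.pg_statio_user_tables ORDER BY pg_total_relation_size(relid) DESC LIMIT ${1:-10};"
}

if id postgres >/dev/null 2>&1; then
  sudo -i -u postgres psql vcloud -c "$(largest_tables_sql 10)"
else
  echo "No local postgres account found; run this on the cell hosting the database." >&2
fi
```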

Resolution

It is mandatory to take snapshots of all VCD cells and perform a database backup before proceeding.

Reduce the current size of the audit_trail table:

*************************************************************

Note: This action requires downtime on all Cloud Director cells. 

Warning: A backup of the Cloud Director database must be taken before attempting to make any changes to it directly.

Back up the Cloud Director database using either of the following methods:

  1. To take the backup from the VCD VAMI, see Backup and Restore of VMware Cloud Director Appliance

  2. To take the backup from the primary cell of Cloud Director, navigate to /opt/vmware/appliance/bin and run the create-backup.sh script.

Then perform the workaround using the following steps:

  1. Stop the VCD service on all cells:

    /opt/vmware/vcloud-director/bin/cell-management-tool cell -i $(service vmware-vcd pid cell) -s

  2. Connect to the database:

    sudo -i -u postgres psql vcloud

  3. Clear the audit_trail table:

    truncate table audit_trail;

  4. Reclaim space and optimize the database:

    vacuum full;
    vacuum analyze;

  5. Start the VCD service on the first cell:

    systemctl start vmware-vcd

  6. Start the VCD service on the remaining cells once the first cell is online.
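Steps 2 to 4 can also be run non-interactively. The Bash sketch below is illustrative and not a VMware-supplied script: the function and variable names are hypothetical, it must only be run on the cell hosting the database after all cells are stopped and a backup exists, and it does nothing destructive unless passed --yes. Each statement runs in its own psql call because VACUUM cannot run inside a transaction block, and psql runs several statements passed in one -c string as a single transaction.

```shell
#!/usr/bin/env bash
# Sketch: non-interactive version of steps 2-4 above. Run it on the cell
# hosting the database only after every cell is stopped (step 1) and a
# database backup exists. Pass --yes to actually execute.

# Run one SQL statement against the vcloud database (same login as step 2).
# One statement per psql call, so VACUUM is never inside a transaction block.
run_sql() {
  sudo -i -u postgres psql vcloud -At -c "$1"
}

# Step 3: empty the table. Step 4: VACUUM FULL rewrites tables and returns
# freed space to the OS; VACUUM ANALYZE refreshes planner statistics.
CLEANUP_STATEMENTS=(
  "TRUNCATE TABLE audit_trail;"
  "VACUUM FULL;"
  "VACUUM ANALYZE;"
)
SIZE_SQL="SELECT pg_size_pretty(pg_total_relation_size('audit_trail'));"

# Report audit_trail size, run the cleanup statements in order, stop on error.
cleanup_audit_trail() {
  echo "audit_trail size before: $(run_sql "$SIZE_SQL")"
  local stmt
  for stmt in "${CLEANUP_STATEMENTS[@]}"; do
    echo "Running: $stmt"
    run_sql "$stmt" || { echo "Failed: $stmt" >&2; return 1; }
  done
  echo "audit_trail size after: $(run_sql "$SIZE_SQL")"
}

if [ "${1:-}" = "--yes" ]; then
  cleanup_audit_trail
else
  echo "Dry run: re-run with --yes to truncate audit_trail and vacuum the database." >&2
fi
```

For example, running the script (file name up to you) with --yes prints the table size before and after the cleanup, which gives a rough check that the space was reclaimed. Start the cells afterwards as described in steps 5 and 6.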

*************************************************************

Configure automatic audit event cleanup: to prevent the table from growing unchecked again, configure Cloud Director to retain audit trail entries only for a limited number of days (for example, 10):

/opt/vmware/vcloud-director/bin/cell-management-tool manage-config -n com.vmware.vcloud.audittrail.history.days -v 10
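If you want a guard against typing a bad value, the command above can be wrapped as in the following Bash sketch. The wrapper function is hypothetical, and it assumes that manage-config accepts -l to look up the stored value; confirm that option against the tool's help output on your VCD version before relying on it.

```shell
#!/usr/bin/env bash
# Sketch: set the audit trail retention (in days) with basic input checking,
# then read the value back. CMT may be overridden. The -l (lookup) option is
# an assumption here; verify it with the tool's help on your version.

CMT="${CMT:-/opt/vmware/vcloud-director/bin/cell-management-tool}"
PROP="com.vmware.vcloud.audittrail.history.days"

# Apply a retention of $1 days; reject anything that is not a positive integer.
set_audit_retention() {
  if ! [[ "${1:-}" =~ ^[1-9][0-9]*$ ]]; then
    echo "Retention must be a positive number of days, got: '${1:-}'" >&2
    return 2
  fi
  "$CMT" manage-config -n "$PROP" -v "$1"
}

# Usage: pass the number of days as the first argument, for example 10.
if [ $# -gt 0 ]; then
  set_audit_retention "$1" && "$CMT" manage-config -n "$PROP" -l
fi
```

Rejecting zero, negative, and non-numeric input keeps an obvious typo from reaching the configuration; choosing the actual retention period remains a policy decision for your environment.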