NSX Manager /nonconfig Partition Full Due to Unpurged IDS Event Data

Products

VMware vDefend Firewall

Issue/Introduction

In NSX version 3.2.x, the /nonconfig partition on NSX Manager may become full or nearly full due to the ids_event_data table not being purged.

This occurs when the purge job fails to fetch Corfu records, leading to a continuous accumulation of IDS/IPS event data. As a result, the Search Indexing process fails with Java OutOfMemory errors, and system performance can degrade significantly.

Environment

VMware NSX

vDefend Firewall

Cause

In NSX 3.2.x releases, IDS/IPS event retention is configured for 14 days or a maximum of 1.5 million records, whichever is reached first.

There are two purge jobs responsible for event cleanup:

- One monitors the event count and purges older records when the count exceeds the threshold.
- The other deletes events older than 14 days.
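As a rough illustration of the retention rule described above, the purge decision can be sketched as a small shell function. This is not NSX's purge implementation; the function name `purge_due` and its arguments are hypothetical, and only the two thresholds come from the text.

```shell
# Illustrative only: mirrors the documented 3.2.x retention rule.
MAX_EVENTS=1500000   # count threshold from the text
MAX_AGE_DAYS=14      # age threshold from the text

# purge_due EVENT_COUNT OLDEST_EVENT_AGE_DAYS   (hypothetical helper)
# Prints which limit was exceeded ("count", "age" or "none").
# Returns 0 if a purge would be due, 1 otherwise.
purge_due() {
    count=$1
    age_days=$2
    if [ "$count" -gt "$MAX_EVENTS" ]; then
        echo "count"
        return 0
    fi
    if [ "$age_days" -gt "$MAX_AGE_DAYS" ]; then
        echo "age"
        return 0
    fi
    echo "none"
    return 1
}
```

Whichever limit is crossed first makes the function report a purge as due, which matches the "whichever is reached first" wording.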

In affected versions (including 3.2.2.1):

- The 1.5 million event threshold is a soft limit that only raises an alarm.
- The purge job can hit an OutOfMemory (OOM) exception while fetching the keyset for large datasets (roughly 1.5 million records or more).
- After the first failure, the purge job keeps failing on every run, so events accumulate without bound (more than 40 million events have been observed).

Logs show failures to acquire the distributed lock, which prevent the purge job from proceeding:

log/idps-reporting/idps.log

INFO DistributedLockThread DistributedLockImpl 7418 - [nsx@6876 comp="distributed-lock" level="INFO" subcomp="DistributedLockImpl"] Unable to acquire distributed lock ids_events_purge_distributed_lock due to com.vmware.nsx.platform.clustering.persistence.exceptions.DuplicateObjectException

Additionally, Search Indexing failures are observed due to memory exhaustion:

log/idps-reporting/idps.log

ERROR pool-103-thread-1 UfoIndexingServiceImpl 6367 - [nsx@6876 comp="nsx-manager" errorCode="MP60503" level="ERROR" subcomp="idps-reporting"] [Indexing:ProcessTable] Exception during indexing table ids_event_data
java.lang.OutOfMemoryError: Java heap space
    at java.util.HashMap.resize(HashMap.java:705)
    at java.util.HashMap.putVal(HashMap.java:664)
    at java.util.HashMap.put(HashMap.java:613)
    at java.util.HashSet.add(HashSet.java:220)
    at org.corfudb.runtime.collections.PersistedStreamingMap.keySet(PersistedStreamingMap.java:249)
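To check whether a given idps.log contains the two signatures quoted above, a minimal sketch along the following lines may help. The function name `count_purge_failures` is hypothetical, and the log file path is passed in explicitly because its location differs between a live Manager and an extracted support bundle.

```shell
# count_purge_failures LOGFILE   (hypothetical helper)
# Counts lines matching the two signatures quoted in this article and
# prints "lock=<n> oom=<n>".
count_purge_failures() {
    log_file=$1
    if [ ! -r "$log_file" ]; then
        echo "cannot read $log_file" >&2
        return 2
    fi
    # -F treats the pattern as a fixed string; -c prints the match count.
    # grep exits non-zero when the count is 0, hence the "|| true".
    lock=$(grep -cF 'Unable to acquire distributed lock ids_events_purge_distributed_lock' "$log_file" || true)
    oom=$(grep -cF 'java.lang.OutOfMemoryError: Java heap space' "$log_file" || true)
    echo "lock=$lock oom=$oom"
}
```

Non-zero counts for both signatures are consistent with the failure mode described here, but they are not proof of it on their own.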

Resolution

For systems that are already affected and have a full /nonconfig partition, follow the steps below carefully to clean up manually and restore functionality.

Sample df output from the support bundles of affected Managers, showing /nonconfig at or near capacity:

nsx_manager_********_20251001_093540/system/df_-alT:/dev/mapper/nsx-secondary   ext4   >102G   100% /nonconfig
nsx_manager_********_20251001_093543/system/df_-alT:/dev/mapper/nsx-secondary   ext4   >102G    78% /nonconfig
nsx_manager_********_20251001_094622/system/df_-alT:/dev/mapper/nsx-secondary   ext4   >102G    88% /nonconfig
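When comparing several support bundles, df lines like the ones above can be filtered by usage. The following is a minimal sketch; the function name `flag_full_nonconfig` and the default threshold of 80 percent are assumptions chosen for illustration, not values from NSX.

```shell
# flag_full_nonconfig [THRESHOLD_PERCENT]   (hypothetical helper)
# Reads df-style lines on stdin and prints those whose last field is
# /nonconfig and whose usage field (e.g. "88%") is at or above the threshold.
flag_full_nonconfig() {
    awk -v limit="${1:-80}" '
        $NF == "/nonconfig" {
            for (i = 1; i <= NF; i++) {
                if ($i ~ /^[0-9]+%$/) {
                    pct = $i
                    sub(/%/, "", pct)
                    if (pct + 0 >= limit + 0) print
                }
            }
        }'
}
```

For example, `grep -H nonconfig */system/df_-alT | flag_full_nonconfig 85` would list only the bundles whose /nonconfig usage is 85% or higher, assuming the bundles are extracted side by side in the current directory.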

Workaround:

Step-by-Step Procedure:

Take a Manager backup before proceeding.

Monitor /nonconfig partition usage on all NSX Managers:
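The exact command for this step is not shown here. One way to read the usage on a live Manager is the standard `df` utility, as in the sketch below; the function name `nonconfig_usage` is hypothetical, and root shell access on the Manager is assumed.

```shell
# nonconfig_usage [MOUNT_POINT]   (hypothetical helper)
# Prints the used percentage (digits only) of the given mount point,
# defaulting to /nonconfig. -P selects the POSIX one-line-per-filesystem format.
nonconfig_usage() {
    df -P "${1:-/nonconfig}" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}
```

Running `nonconfig_usage` on each of the three Managers before and after the cleanup gives a simple before/after comparison.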

Example output:

Stop IDPS Reporting Service on all 3 Managers:
```
/etc/init.d/idps-reporting-service stop
```
Stop Corfu Nonconfig Server on all 3 Managers:
```
/etc/init.d/corfu-nonconfig-server stop
```
Manually clear nonconfig data (run on all 3 Managers):

Before deleting any files, verify the Corfu layout file consistency:
- Open /nonconfig/corfu/corfu/LAYOUT_CURRENT.ds and confirm that the "segments" array contains exactly one entry, with:
  "start": 0,
  "end": -1
- This check ensures that deleting the data does not cause unnecessary hole fills and state transfers when the Corfu nonconfig servers restart.
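The layout check above can also be scripted. The sketch below assumes the layout file is JSON in which the keys "start" and "end" appear only inside the "segments" entries; that assumption, and the function name `check_layout`, are not taken from NSX documentation, so inspect the file by hand if the result is unexpected.

```shell
# check_layout [LAYOUT_FILE]   (hypothetical helper)
# Returns 0 only if the file contains exactly one "start" value equal to 0
# and exactly one "end" value equal to -1.
check_layout() {
    layout_file="${1:-/nonconfig/corfu/corfu/LAYOUT_CURRENT.ds}"
    if [ ! -r "$layout_file" ]; then
        echo "cannot read $layout_file" >&2
        return 2
    fi
    # Extract every numeric value that follows a "start" or "end" key.
    starts=$(grep -oE '"start"[[:space:]]*:[[:space:]]*-?[0-9]+' "$layout_file" | grep -oE -- '-?[0-9]+$' || true)
    ends=$(grep -oE '"end"[[:space:]]*:[[:space:]]*-?[0-9]+' "$layout_file" | grep -oE -- '-?[0-9]+$' || true)
    # More than one segment yields multi-line values, which fail the comparison.
    if [ "$starts" = "0" ] && [ "$ends" = "-1" ]; then
        echo "OK: single segment with start 0 and end -1"
        return 0
    fi
    echo "STOP: layout does not show a single 0..-1 segment; do not delete data" >&2
    return 1
}
```

Run `check_layout` on every Manager and continue with the deletion only where it returns success.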
Once confirmed, proceed to clear accumulated data:
```
rm -rf /nonconfig/corfu/corfu/*SEGMENT*.ds
rm -rf /nonconfig/corfu/corfu/log/*
rm -rf /nonconfig/browser/*
rm -rf /nonconfig/diskonlycorfutable/idps/*
```

Start Corfu Nonconfig Server:

Start IDPS Reporting Service:
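The two start steps above list no commands. A minimal sketch, assuming the same init scripts used for the stop steps also accept `start`, is shown below; the wrapper function `start_nonconfig_services` and its `DRY_RUN` switch are hypothetical conveniences.

```shell
# Assumption: both init scripts accept "start", mirroring the "stop"
# commands used earlier in this procedure.
# Set DRY_RUN=1 to print the commands instead of running them.
start_nonconfig_services() {
    # Corfu nonconfig first, then IDPS reporting, matching the step order.
    ${DRY_RUN:+echo} /etc/init.d/corfu-nonconfig-server start &&
    ${DRY_RUN:+echo} /etc/init.d/idps-reporting-service start
}
```

Run the function as root on all three Managers. The IDPS reporting service is started only if the Corfu nonconfig server start command succeeds.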

Verify service status:
```
su admin -c "get cluster status"
```

Resync IDPS Reporting Search Index:

Re-check /nonconfig partition usage to ensure cleanup succeeded:

Example post-cleanup output:

Additional Information

For environments where disk usage is moderate and database cleanup is possible, refer to the official KB for clearing older IDPS data:

Broadcom KB 385626 – Clearing Old IDPS Events from Database