Edge Health Alarm: Edge Disk Usage Very High
The disk usage for the Edge node disk partition / has reached 84% which is at or above the very high threshold value of 80%
An NSX Edge Node reports a critical health alarm indicating high disk utilization on the root (/) partition (/dev/sda3). In this scenario, utilization exceeding the 80% threshold resulted in the following symptoms:
Failure of VMs to obtain DHCP IP addresses.
Disruption of Central Control Plane (CCP) connectivity.
Tunnels remaining in a "Down" state even after attempting Maintenance Mode cycles.
Coincident issues with VCD–NSX communication due to certificate mismatches.
VMware NSX
The issue was driven by three primary factors:
Disk Space Exhaustion: The root (/) partition reached 84% utilization due to large, unrotated files in the /journal and syslog directories. An attempt to move these files was unsuccessful, leaving the partition in a "filled bin" state.
Known Version Bug: NSX version 4.2.1 is impacted by a known JDK-related issue that can prevent services from recovering gracefully after disk space is reclaimed or management communication is interrupted.
VCD Communication Break: A renewed VCD certificate failed to apply to one of the four VCD cells, breaking the underlying management sync between VCD and NSX.
Ensure the renewed VCD certificate is applied consistently across all VCD cells.
Reconnect VCD to NSX via the VCD Service Provider Admin portal to validate credential/certificate handshake.
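One way to confirm that every VCD cell serves the same renewed certificate is to compare SHA-256 fingerprints across the cells. The following is a minimal sketch, not a VCD or NSX tool: the function names are illustrative, and it assumes each cell is reachable over HTTPS on port 443 from the machine running the script.

```python
import hashlib
import ssl
from collections import defaultdict


def fetch_fingerprint(host: str, port: int = 443) -> str:
    """Return the SHA-256 fingerprint of the certificate served at host:port."""
    # get_server_certificate only retrieves the PEM; it does not validate the chain.
    pem = ssl.get_server_certificate((host, port))
    der = ssl.PEM_cert_to_DER_cert(pem)
    return hashlib.sha256(der).hexdigest()


def group_by_fingerprint(fingerprints: dict[str, str]) -> dict[str, list[str]]:
    """Group cell names by fingerprint; more than one group means a mismatch."""
    groups: dict[str, list[str]] = defaultdict(list)
    for cell, fingerprint in fingerprints.items():
        groups[fingerprint].append(cell)
    return dict(groups)


def find_mismatched_cells(cells: list[str]) -> dict[str, list[str]]:
    """Fetch every cell's fingerprint and return the groups (assumes port 443)."""
    return group_by_fingerprint({cell: fetch_fingerprint(cell) for cell in cells})
```

Calling `find_mismatched_cells` with the four cell hostnames (for example the hypothetical `vcd-cell-1.example.com` through `vcd-cell-4.example.com`) should return a single group when the certificate has been applied consistently. A second group identifies the cell that still serves the old certificate.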
Log in to the affected NSX Edge Node CLI as root.
Identify large files in the /journal and /var/log directories.
Action: Move unrotated journal and syslog files to off-box storage, or to a temporary directory on a different partition, to reduce /dev/sda3 utilization below the 80% threshold. A directory on the same partition does not free any space.
Verify cleanup with df -h.
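The disk cleanup steps above can be sketched in Python as a read-only check: list the largest regular files under a directory and compare partition usage against the 80% threshold from the alarm. This is an illustrative sketch that assumes a Python 3 interpreter is available where it runs; the function names are hypothetical, and nothing here moves or deletes files.

```python
import os
import shutil
import stat

VERY_HIGH_THRESHOLD = 80.0  # percent, taken from the alarm text


def usage_percent(path: str = "/") -> float:
    """Approximate the Use% column of df for the filesystem containing path."""
    usage = shutil.disk_usage(path)
    # df divides by used + available, which excludes root-reserved blocks.
    denominator = usage.used + usage.free
    if denominator == 0:
        return 0.0
    return 100.0 * usage.used / denominator


def largest_files(root: str, limit: int = 10) -> list[tuple[int, str]]:
    """Return up to `limit` (size_in_bytes, path) pairs, largest first."""
    sizes = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            try:
                info = os.lstat(full)  # lstat so symlinks are not followed
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if stat.S_ISREG(info.st_mode):
                sizes.append((info.st_size, full))
    sizes.sort(reverse=True)
    return sizes[:limit]


def is_above_threshold(path: str = "/") -> bool:
    """True while the partition is still at or above the very high threshold."""
    return usage_percent(path) >= VERY_HIGH_THRESHOLD
```

For example, `largest_files("/var/log")` shows which log files to move first, and `is_above_threshold("/")` can be re-run after the move, alongside `df -h`, to confirm the partition has dropped below the threshold.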
If disk cleanup does not immediately restore tunnel or DHCP status:
Perform a rolling reboot of the NSX Manager cluster (one manager at a time).
This clears the JDK-related hang-up and forces a fresh reconciliation of the Edge nodes.
Edge Sharding Behavior: During the rolling reboot, an Edge node may briefly report a "Failed" state. This is expected behavior as the Edge attempts to "shard" (re-establish a heartbeat) to a different available Manager in the cluster. Communication restores automatically once the Managers are stable.
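The "one manager at a time" rule can be expressed as a small loop that reboots a node and then waits for the cluster to report stable before moving on. This sketch is deliberately generic: `reboot` and `is_stable` are hypothetical callables that the operator supplies (for instance wrappers around whatever reboot and cluster-status mechanism is in use), and no NSX API is assumed.

```python
import time
from typing import Callable, Iterable


def rolling_restart(
    nodes: Iterable[str],
    reboot: Callable[[str], None],
    is_stable: Callable[[], bool],
    poll_seconds: float = 30.0,
    timeout_seconds: float = 1800.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> list[str]:
    """Reboot nodes one at a time, waiting for stability between reboots."""
    completed = []
    for node in nodes:
        reboot(node)
        deadline = clock() + timeout_seconds
        while True:
            # Sleep before the first check so the node has time to go down;
            # a real check should also confirm the node actually restarted.
            sleep(poll_seconds)
            if is_stable():
                break
            if clock() >= deadline:
                raise TimeoutError(f"cluster not stable after rebooting {node}")
        completed.append(node)
    return completed
```

Raising on timeout stops the sequence, so a second manager is never rebooted while the cluster is still degraded. A transient "Failed" state on an Edge node during the wait is expected, as described above, and is not treated as an error here.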
Verify that VMs are successfully receiving DHCP IP addresses and that ICMP connectivity (Ping) is restored.
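The ICMP part of this verification can be scripted from any Linux host on the affected network. The sketch below wraps the system `ping` command; the function name is illustrative, the `-c` count flag assumes a Linux or macOS `ping`, and the address in the usage note is a placeholder from the documentation range. It does not verify DHCP lease assignment, which still needs to be checked on the VM itself.

```python
import subprocess
from typing import Callable


def can_ping(
    host: str,
    count: int = 3,
    timeout_seconds: float = 10.0,
    run: Callable = subprocess.run,
) -> bool:
    """Return True if the system ping command reports a reply from host."""
    try:
        result = run(
            ["ping", "-c", str(count), host],  # "-c" is the Linux/macOS count flag
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_seconds,
            check=False,
        )
    except (subprocess.TimeoutExpired, FileNotFoundError):
        # Treat a hung ping or a missing ping binary as "not reachable".
        return False
    return result.returncode == 0
```

For example, `can_ping("192.0.2.10")` returns `True` only when `ping` exits with status 0 within the timeout.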