NSX Edge disk usage high when there are high number of Load Balancers

Article ID: 404700

Products

VMware NSX

Issue/Introduction

  • In VMware NSX 4.1.2.3, 4.1.2.5, or 4.2.1.4, when more than 400 native Load Balancers are configured in NSX, the NSX Edge node may report high disk usage.
  • As more load balancers are created, more disk space is consumed on the NSX Edge node. 

  • Other NSX alarms, such as "Management Channel To Transport Node Down Long", may also be raised for the affected Edge node.
  • Output similar to the following is seen when checking disk usage on the Edge node:

root@<edge-node-1>:/var/lib/docker/overlay2# df -h | head -10
Filesystem                   Size  Used Avail Use% Mounted on
udev                         124G     0  124G   0% /dev
tmpfs                         38G   96M   38G   1% /run
/dev/sda2                     19G   18G     0 100% / <------------ 100% full root partition
tmpfs                        188G  2.8G  186G   2% /dev/shm
tmpfs                        5.0M     0  5.0M   0% /run/lock
tmpfs                        188G     0  188G   0% /sys/fs/cgroup
tmpfs                        2.0G     0  2.0G   0% /mnt/ids
/dev/mapper/nsx-config        19G  145M   18G   1% /config
/dev/sda1                    943M  7.1M  871M   1% /boot

  • A per-directory view shows which directory is consuming the space:

root@<edge-node-1>:/# du -xah --time --max-depth=3 /var/lib/docker/ | sort | grep G
14G     2025-06-18 11:41        /var/lib/docker/
14G     2025-06-18 11:41        /var/lib/docker/overlay2 <----------This is the directory causing / to be 100% full
4.0K    2025-05-04 02:54        /var/lib/docker/overlay2/l/<UUID>
4.0K    2025-05-04 02:54        /var/lib/docker/overlay2/l/<UUID>
4.0K    2025-05-04 02:54        /var/lib/docker/overlay2/l/<UUID>
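
The two checks above can be combined into one quick test. The following is a minimal sketch, not part of the official workaround: the helper name use_pct and the 90% threshold are illustrative assumptions, and it assumes awk is available in the Edge node's root shell.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the Use% value (without the % sign) for one
# mount point, reading `df -P` style output on stdin.
use_pct() {
  awk -v m="$1" 'NR > 1 && $6 == m { gsub("%", "", $5); print $5 }'
}

# Assumed alert threshold; adjust as needed.
THRESHOLD=90

pct=$(df -P / | use_pct /)
if [ "${pct:-0}" -ge "$THRESHOLD" ]; then
  echo "Root partition is at ${pct}% usage"
  # Show how much of that space sits under the Docker overlay directory.
  du -xsh /var/lib/docker/overlay2 2>/dev/null
fi
```

If the root partition is above the threshold and most of the space is reported under /var/lib/docker/overlay2, the node matches the symptom described in this article.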

Environment

VMware NSX 4.1.x

VMware NSX 4.2.0 through 4.2.3

Cause

This is caused by an issue in the NSX Load Balancer setup script.

Resolution

  • This issue will be fixed in a future version of NSX. 

  • For NSX versions other than 4.1.2.3, 4.1.2.5, or 4.2.1.4, open a Broadcom Support Request referencing this KB.

  • For NSX versions 4.1.2.3, 4.1.2.5, and 4.2.1.4, use the workaround below.

    Note: The workaround below causes downtime. Complete the following steps in a maintenance window.

     

Workaround

 

Prerequisites:

  • Download the apply_LB_fix.sh script, which is available in this KB's Attachments section.
  • Obtain SSH access to both the Active and Standby NSX Edge nodes.
  • Understand that applying this patch will cause a brief interruption to active Load Balancer services during the failover process.
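
Once SSH access is in place, the script has to be copied to /tmp on each node. The following is a minimal sketch of that copy, run from an admin workstation. The host names, the root account, and the DRY_RUN switch are illustrative assumptions, and it assumes scp is permitted for that account on the Edge nodes. By default it only prints the commands.

```shell
#!/usr/bin/env bash
# Hypothetical values: replace with the real Edge node addresses and account.
EDGE_NODES="${EDGE_NODES:-edge-standby.example.com edge-active.example.com}"
EDGE_USER="${EDGE_USER:-root}"
SCRIPT="${SCRIPT:-apply_LB_fix.sh}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, 0 = run them

# Print the scp command that would place the script in /tmp on one node.
upload_cmd() {
  printf 'scp %s %s@%s:/tmp/\n' "$SCRIPT" "$EDGE_USER" "$1"
}

for node in $EDGE_NODES; do
  if [ "$DRY_RUN" = "1" ]; then
    upload_cmd "$node"
  else
    scp "$SCRIPT" "$EDGE_USER@$node:/tmp/"
  fi
done
```

Setting DRY_RUN=0 performs the copy. The destination is /tmp rather than the root directory because the root partition may already be full.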

 

Steps:

  • Upload the Script:

      1. Upload the attached apply_LB_fix.sh script to both the Active and Standby Edge nodes.
      2. Do not save the script under the root directory, as there may be no free space left.
      3. Save the script to /tmp instead, since /tmp uses a different storage mapping.

  • Apply the patch on the Standby Edge first:

    1. Connect to the Standby Edge node via SSH.
    2. Execute the apply_LB_fix.sh script:

            i. Make the file executable first:
          chmod +x /tmp/apply_LB_fix.sh

           ii. Run the script:
          bash /tmp/apply_LB_fix.sh

    3. This script will:
      1. Build a new, patched version of the LB container image.
      2. Trigger an HA failover (only when the script is applied on the Active Edge, at which point the Standby becomes Active).
      3. Stop and delete all running LB service containers using the old image.

 

  • After applying the script:

    1. Refresh LB Containers on the Edge on which the script was run:
      1. In the NSX Manager UI, navigate to the now Active Edge (the same one on which the script was just run) under Fabric > Nodes.
      2. Enter and then exit NSX Maintenance Mode on the Edge.
        • This will ensure all LB service containers are recreated using the patched container image.

    2. Verify LB Service Status:
      1. Connect to the Edge node via SSH.
      2. Run the command:
        • get load-balancers status
      3. Verify that all Load Balancers are in the ready state with the standby HA state. It may take a few minutes for the standby state to be reached. Example output:
        LB-State        : ready
        LR-HA-State     : standby

     

  • Apply the patch on the other Edge nodes in the cluster:
    Repeat the steps above for all other Edge nodes in the cluster, one at a time.
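
With several hundred Load Balancers, reading the status output by eye is error-prone. The following is a minimal sketch for summarizing it, assuming the output of get load-balancers status has been saved to a text file and uses the "LB-State : ready" layout shown in the example above. The function name lb_summary and the file name lb_status.txt are illustrative.

```shell
#!/usr/bin/env bash
# Hypothetical helper: summarize saved "get load-balancers status" output.
# Reads the text on stdin and prints how many LB-State lines were seen
# and how many of them are not in the "ready" state.
lb_summary() {
  awk -F':' '
    /LB-State/ {
      total++
      state = $2
      gsub(/[ \t\r]/, "", state)      # strip padding around the value
      if (state != "ready") not_ready++
    }
    END { printf "total=%d not_ready=%d\n", total, not_ready }
  '
}

# Usage (assumption: the CLI output was saved to lb_status.txt):
#   lb_summary < lb_status.txt
```

A result with not_ready=0 and a total that matches the expected number of Load Balancers suggests the node is ready before moving on to the next Edge node. The LR-HA-State value still needs to be checked separately.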

 

Note:

  • This script should not be applied to an Edge node unless its disk usage is high with matching log entries.

  • If the issue is seen only on the Standby Edge, the script can be run on the Standby Edge alone without triggering any automatic failover.

Attachments

apply_LB_fix.sh