HCX Network tasks failing due to high memory usage by the "ndd" process

Products

VMware HCX

Issue/Introduction

While performing an HCX network task, such as extending or unextending a network or enabling MON (Mobility Optimized Networking), the following errors are logged in /common/logs/admin/app.log:

<timestamps> UTC [NetworkStretchService_SvcThread-154, j: ########, s: ########, , TxId: ########-####-####-####-############] ERROR c.v.v.h.n.i.AbstractJobInt- InterconnectServiceJobs workflow InterconnectServiceConfigJob failed. Error: Interconnect Service Workflow GenerateAndPostConfig failed. Error: Operation timedout in state POST_CONFIG_VIX

<timestamps> UTC [NetworkStretchService_SvcThread-154, j: ########, s: ########, , TxId: ########-####-####-####-############] ERROR c.v.v.h.n.i.UnstretchNetworkJobInt- Error encountered in Unstretch network job
java.lang.RuntimeException: Interconnect Service Workflow GenerateAndPostConfig failed. Error: Operation timedout in state POST_CONFIG_VIX
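
To check a saved copy of app.log for this failure signature, a minimal Python sketch is shown here. It matches only the literal error text quoted above. The function name, the regular expression, and the shortened sample lines are illustrative assumptions, not part of HCX.

```python
import re

# Literal failure signature quoted from app.log in this article.
SIGNATURE = "Operation timedout in state POST_CONFIG_VIX"
# Abbreviated logger name as it appears after "ERROR", e.g.
# "c.v.v.h.n.i.AbstractJobInt-" (the trailing hyphen is dropped).
LOGGER_RE = re.compile(r"ERROR\s+(\S+?)-\s")

def find_post_config_timeouts(lines):
    """Return (line_number, logger_name) for each line carrying the signature.

    logger_name is None for continuation lines such as the Java exception
    line, which has no "ERROR <logger>-" prefix.
    """
    hits = []
    for number, line in enumerate(lines, start=1):
        if SIGNATURE not in line:
            continue
        match = LOGGER_RE.search(line)
        hits.append((number, match.group(1) if match else None))
    return hits

# Sample lines shortened from the excerpt above (thread and IDs trimmed).
SAMPLE = [
    "<timestamps> UTC [NetworkStretchService_SvcThread-154] ERROR "
    "c.v.v.h.n.i.AbstractJobInt- InterconnectServiceJobs workflow "
    "InterconnectServiceConfigJob failed. Error: Interconnect Service "
    "Workflow GenerateAndPostConfig failed. Error: "
    "Operation timedout in state POST_CONFIG_VIX",
    "<timestamps> UTC [NetworkStretchService_SvcThread-154] ERROR "
    "c.v.v.h.n.i.UnstretchNetworkJobInt- Error encountered in Unstretch network job",
    "java.lang.RuntimeException: Interconnect Service Workflow "
    "GenerateAndPostConfig failed. Error: "
    "Operation timedout in state POST_CONFIG_VIX",
]
```

Calling find_post_config_timeouts(SAMPLE) reports the first and third sample lines. The second line is an ERROR entry but does not carry the timeout signature, so it is skipped.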

In the HCX Manager UI, under Interconnect -> Service Mesh, viewing the appliances and clicking the "i - info" icon shows the following alarm:
System state is critical
Config engine is in systemdBad state
Memory usage is high

To confirm which process is using the most memory, follow these steps:

SSH into the HCX Manager as the admin user.

Once logged in, type:

ccli
list

go # (where # is the NE appliance ID)

Run the command 'show system memory' to check memory usage.

[admin@HCX-NE-R#] show system memory
MemTotal:        3075532 kB
MemFree:           75913 kB
MemAvailable:          15120 kB   <-- very low available memory

Then open a shell on the appliance and start top:

ssh
top

In top, press 'Shift + M' to sort processes by memory usage and identify the top consumer.
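
As a rough aid for reading the 'show system memory' output, the following Python sketch computes MemAvailable as a percentage of MemTotal from text in the format shown above. The function names are hypothetical, and the sketch assumes the output has been copied into a string. This article does not define a numeric threshold, so none is applied here.

```python
def parse_meminfo(text):
    """Parse 'Name:   <value> kB' lines into a dict of integers (kB).

    Lines without a 'Name: <number>' shape, such as the CLI prompt line,
    are ignored.
    """
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        fields = rest.split()
        if sep and fields and fields[0].isdigit():
            values[name.strip()] = int(fields[0])
    return values

def available_percent(text):
    """Return MemAvailable as a percentage of MemTotal."""
    mem = parse_meminfo(text)
    return 100.0 * mem["MemAvailable"] / mem["MemTotal"]

# Values copied from the example output in this article.
SAMPLE = """\
[admin@HCX-NE-R#] show system memory
MemTotal:        3075532 kB
MemFree:           75913 kB
MemAvailable:          15120 kB
"""
```

For the sample values, available_percent(SAMPLE) evaluates to just under 0.5, meaning less than half a percent of the appliance memory is available.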

The following log entries are seen in /var/log/messages on the NE appliance:

<timestamp> <Fleet-Appliance> cgw 1098 - - [Info-Tasker] : Timeout vmware-toolbox-cmd stat balloon
<timestamp> <Fleet-Appliance> cgw 1098 - - [Err-Tasker] : cmd (/usr/bin/vmware-toolbox-cmd stat balloon) done, error: Timeout
<timestamp> <Fleet-Appliance> cgw 1098 - - [Err-ops] : getBalloonStat() failed, /usr/bin/vmware-toolbox-cmd stat balloon: Timeout
<timestamp> <Fleet-Appliance> cgw 1098 - - [Warning-ops] : Memory usage is probably high (free: %3)
<timestamp> <Fleet-Appliance> cgw 1098 - - [Info-opsEvent] : new system event: SystemEvent[<timestamp>, <timestamp>, 60002, critical, Memory usage is high, map[balloon:0 MB cache:32772096 free:102031360 total:3149344768 used:3047313408]]
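
The byte counters in the SystemEvent entry can be turned into a usage percentage. The Python sketch below is a minimal example that assumes the map[...] layout shown in the log line above. The function names and the regular expression are illustrative, not part of HCX.

```python
import re

# Captures everything between "map[" and the first closing bracket.
MAP_RE = re.compile(r"map\[([^\]]*)\]")

def parse_event_map(line):
    """Extract the numeric key:value pairs inside map[...] of a SystemEvent line."""
    match = MAP_RE.search(line)
    if match is None:
        return {}
    pairs = {}
    for token in match.group(1).split():
        key, _, value = token.partition(":")
        # Skip tokens that are not key:number, such as the "MB" unit.
        if value.isdigit():
            pairs[key] = int(value)
    return pairs

def used_percent(line):
    """Return used memory as a percentage of total memory."""
    pairs = parse_event_map(line)
    return 100.0 * pairs["used"] / pairs["total"]

# Log line copied from the excerpt in this article.
SAMPLE = (
    "<timestamp> <Fleet-Appliance> cgw 1098 - - [Info-opsEvent] : "
    "new system event: SystemEvent[<timestamp>, <timestamp>, 60002, critical, "
    "Memory usage is high, map[balloon:0 MB cache:32772096 free:102031360 "
    "total:3149344768 used:3047313408]]"
)
```

For the sample line, used_percent(SAMPLE) is roughly 96.8, which leaves about 3% free and is consistent with the "free: %3" warning logged just before the event.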

Environment

VMware HCX

Cause

The ndd (Network Detection Daemon) process on the NE appliance has a memory leak.
As the leak exhausts memory, the NE appliance can no longer allocate resources, and network tasks fail.

Resolution

This issue is resolved in VMware HCX 4.11.1, available at Broadcom downloads.
If you are having difficulty finding and downloading software, please review the Download Broadcom products and software KB.

Workaround:

For appliances showing "Config engine is in systemdBad state":
- Redeploy the affected appliance using the Force option. For more information, see: Manage Service Mesh Appliances
- Once the redeploy is complete, stop and disable the ndd process to prevent the issue from recurring.

  Note: A Force redeploy of the NE appliance requires downtime and should be performed only during a maintenance window.

  If the redeploy fails, open a support case with Broadcom Support and refer to this KB article.
  For more information, see Creating and managing Broadcom support cases.

For appliances showing "Memory usage is high" but not "Config engine is in systemdBad state", proceed with the following workaround:
1. SSH into the HCX Manager as the admin user.
2. Once logged in, type:
  - ccli
  - list
  - go # (where # is the NE appliance ID)
  - ssh
  - systemctl stop ndd
  - systemctl disable ndd

Note: Disabling the ndd service on the NE appliance VM has no impact on traffic forwarding or system stability. However, the Transport Analytics feature will not function for those NE appliances. On-demand bandwidth testing can be used as an alternative.

Note: If you are running HCX 4.11.0 or earlier, we recommend proactively applying the second workaround above (stopping and disabling ndd) to prevent this issue until you can upgrade to a release that contains the fix.
Apply it on both the HCX NE-I (source/initiator) and NE-R (target/receiver) appliances.
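
To confirm that the workaround took effect, the state of the ndd unit can be queried with the standard systemctl subcommands is-active and is-enabled. The Python sketch below wraps those two queries. It is an illustration only: the helper names are hypothetical, and this article does not confirm that a Python interpreter is available on the NE appliance, so running the two systemctl queries by hand in the appliance shell is an equivalent check.

```python
import subprocess

def unit_state(unit, query, run=subprocess.run):
    """Return systemctl's one-word answer for 'is-active' or 'is-enabled'.

    The run parameter exists so the command runner can be replaced in tests.
    """
    result = run(["systemctl", query, unit], capture_output=True, text=True)
    return result.stdout.strip()

def ndd_workaround_applied(run=subprocess.run):
    """True when ndd is stopped and will not start again at boot."""
    active = unit_state("ndd", "is-active", run)
    enabled = unit_state("ndd", "is-enabled", run)
    # A stopped unit normally reports "inactive"; "failed" also means not running.
    return active in {"inactive", "failed"} and enabled == "disabled"
```

The check should return True on both the NE-I and NE-R appliances after the ndd service has been stopped and disabled.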

Additional Information

See the VMware HCX 4.11.1 Release Notes:
Fixed Issue 3528977: Long running Network Detection Daemon (ndd) process can cause the system to run out of memory on Network Extension (NE) and Interconnect (IX) appliances.

KB: HCX NE tunnels down due to high memory usage