HCX - High Memory observed on NE Appliance with Error "Cannot allocate memory"

Article ID: 321570

Products

VMware HCX

Issue/Introduction

This article describes unexpected memory consumption on the HCX Network Extension (NE) appliance and how to recover from it.

Symptoms:
The HCX Network Extension (NE) appliance VM may run into a high-memory condition during normal operation.
Entries similar to the following appear in the appliance logs:

2023-07-15T09:47:18+00:00 NE-R1 GatewayLogs[1061]: [Warning-ops] : Memory usage is probably high (free: %4)
2023-07-15T23:53:14+00:00 NE-R1 ndd[1067]: fatal error: runtime: out of memory
2023-07-15T23:53:14+00:00 NE-R1 ndd[1067]: runtime stack:
2023-07-15T23:53:14+00:00 NE-R1 ndd[1067]: runtime.throw(0xa466d5, 0x16)
2023-07-15T23:53:14+00:00 NE-R1 ndd[1067]:   /usr/local/go/src/runtime/panic.go:774 +0x72
2023-07-15T23:53:14+00:00 NE-R1 ndd[1067]: runtime.sysMap(0xc07c000000, 0x4000000, 0xf11cd8)

Location of Appliance log:
HCX Manager : /tmp/Fleet-Appliances/Service Mesh/NE-Appliance/var/log/messages
NE appliance : /var/log/messages
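
To check whether an appliance log contains the signatures shown above, a search along the following lines may help. This is a minimal sketch, not an official HCX tool: the function name `count_ne_oom_lines` is hypothetical, and the two search strings are taken verbatim from the sample log entries in this article.

```shell
# Hypothetical helper (not part of HCX): count log lines that match the
# two signatures shown in the sample log entries above.
count_ne_oom_lines() {
    # Default to the log path on the NE appliance; pass another path
    # (for example, a copy of the log taken from elsewhere) as the first argument.
    log_file="${1:-/var/log/messages}"
    # -c prints the number of matching lines; -E allows alternation with |.
    grep -cE 'Memory usage is probably high|fatal error: runtime: out of memory' "$log_file"
}
```

Running `count_ne_oom_lines /var/log/messages` on the NE appliance prints the number of matching lines. A non-zero count suggests that the appliance has logged the high-memory warning or the ndd out-of-memory error.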

Cause

This is a corner case that can appear in environments where the network connection between the NE appliance and HCX Manager is unstable. The instability causes the NDD (Network Detection Daemon Service) gRPC stream to end unexpectedly and reconnect repeatedly.

Resolution

This issue is fixed in HCX 4.8.0.

Workaround:
As a workaround, follow the steps below:

Restore network connectivity between HCX Manager and each affected NE appliance to avoid this condition.
If the environment is highly unstable, disable the NDD service running on the HCX NE appliance:
- Log in to the HCX Manager admin console >> ccli >> list >> go [NE_Appliance] >> ssh

# systemctl stop ndd
# systemctl disable ndd

Note: Apply the above step to every NE appliance where the OOM condition has been observed.
IMPORTANT: Disabling the NDD service on the NE appliance VM has no impact on traffic forwarding or system stability. However, the Transport Analytics feature will not function for those NE appliances while NDD is disabled. On-demand bandwidth testing can be used as an alternative to Transport Analytics.
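
To confirm that the workaround took effect on an appliance, the service state can be queried with the standard `systemctl is-active` and `systemctl is-enabled` subcommands. The helper below is a sketch under stated assumptions: the function name `ndd_is_disabled` is hypothetical, and it assumes the systemd unit is named `ndd`, as in the stop and disable commands above.

```shell
# Hypothetical helper (not part of HCX); assumes the systemd unit is
# named "ndd", as in the stop/disable commands above.
ndd_is_disabled() {
    # is-active prints a state such as "active", "inactive", or "failed".
    # systemctl exits non-zero for non-active states, so "|| true" keeps
    # the helper usable in scripts that run with "set -e".
    active="$(systemctl is-active ndd 2>/dev/null || true)"
    # is-enabled prints a state such as "enabled" or "disabled".
    enabled="$(systemctl is-enabled ndd 2>/dev/null || true)"
    echo "ndd: active=${active:-unknown} enabled=${enabled:-unknown}"
    # Succeed only if the service is not running and will not start at boot.
    case "$active" in
        inactive|failed) [ "$enabled" = "disabled" ] ;;
        *) return 1 ;;
    esac
}
```

Run `ndd_is_disabled` in the SSH session on the NE appliance. It prints the two states and returns success only when the service is both stopped and disabled.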

Additional Information

Impact/Risks:

This issue may affect any HCX NE appliance, with or without HA, whose connection to HCX Manager is unstable.
There is no impact on migration services.
