- This may impact any HCX NE appliance, with or without HA, whose connection to HCX Manager is unstable.
- There will be NO impact to migration services.
This document describes unexpected memory consumption on the HCX Network Extension (NE) appliance and how to recover from it.
The HCX Network Extension (NE) appliance VM may run into a high-memory condition during normal operation.
Entries like the following may appear in the logs:
2023-07-15T09:47:18+00:00 NE-R1 GatewayLogs[1061]: [Warning-ops] : Memory usage is probably high (free: %4)
2023-07-15T23:53:14+00:00 NE-R1 ndd[1067]: fatal error: runtime: out of memory
2023-07-15T23:53:14+00:00 NE-R1 ndd[1067]: runtime stack:
2023-07-15T23:53:14+00:00 NE-R1 ndd[1067]: runtime.throw(0xa466d5, 0x16)
2023-07-15T23:53:14+00:00 NE-R1 ndd[1067]: /usr/local/go/src/runtime/panic.go:774 +0x72
2023-07-15T23:53:14+00:00 NE-R1 ndd[1067]: runtime.sysMap(0xc07c000000, 0x4000000, 0xf11cd8)
Location of Appliance log:
HCX Manager : /tmp/Fleet-Appliances/Service Mesh/NE-Appliance/var/log/messages
NE appliance : /var/log/messages
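As a rough illustration, the out-of-memory signature shown in the sample log entries can be searched for with a small shell helper. This is only a sketch: the function name count_ndd_oom is hypothetical and not part of HCX, the pattern is copied from the sample entries in this article, and the default path assumes the script runs on the NE appliance itself.

```shell
#!/bin/sh
# Hypothetical helper (not part of HCX): count ndd out-of-memory crashes in a log file.
# The default path assumes the NE appliance; pass another path when reading
# a copy of the appliance log extracted from a log bundle.
count_ndd_oom() {
    log_file="${1:-/var/log/messages}"
    # Pattern taken from the sample entry
    # "ndd[1067]: fatal error: runtime: out of memory".
    # grep -c prints the number of matching lines (0 when there are none).
    grep -c -E 'ndd\[[0-9]+\]: fatal error: runtime: out of memory' "$log_file"
}
```

Calling count_ndd_oom with no argument reads /var/log/messages. A non-zero count suggests the appliance has hit the condition described here at least once.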
VMware HCX
This is a corner case that may appear in environments where the network connection between the NE appliance and HCX Manager is unstable. The instability causes the NDD (Network Detection Daemon Service) gRPC stream to end unexpectedly and reconnect repeatedly.
This issue is fixed in HCX 4.8.0.
Workaround:
As a workaround, stop and disable the NDD service on the affected NE appliance:
# systemctl stop ndd
# systemctl disable ndd
Note: Apply the above steps to every NE appliance where the OOM condition has been observed.
IMPORTANT: Disabling the NDD service on the NE appliance VM does not affect traffic forwarding or system stability. However, the Transport Analytics feature will not function for those NE appliances while NDD is disabled. On-demand bandwidth testing can be used as an alternative to Transport Analytics.
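After applying the workaround, it may be useful to confirm that the service is both stopped and disabled. The following is a minimal sketch, assuming the unit is named ndd as in the commands above. The function name ndd_workaround_applied is hypothetical and not part of HCX.

```shell
#!/bin/sh
# Hypothetical helper (not part of HCX): report whether the ndd unit
# is stopped and disabled on this NE appliance.
ndd_workaround_applied() {
    # systemctl is-active / is-enabled print the unit state and exit non-zero
    # when the unit is not active / not enabled; "|| true" ignores that status.
    active_state=$(systemctl is-active ndd 2>/dev/null || true)
    enabled_state=$(systemctl is-enabled ndd 2>/dev/null || true)
    echo "ndd active state: ${active_state:-unknown}, enabled state: ${enabled_state:-unknown}"
    # Treat "inactive" or "failed" as not running; anything else counts as running.
    case "$active_state" in
        inactive|failed) ;;
        *) return 1 ;;
    esac
    # Succeed only if the unit is also disabled at boot.
    [ "$enabled_state" = "disabled" ]
}
```

Run as root on the NE appliance, ndd_workaround_applied prints the two states and returns 0 only when the service is neither running nor enabled at boot.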