HCX - NE appliance state becomes critical due to Memory component

Article ID: 323278


Products

VMware HCX
VMware Cloud on AWS

Issue/Introduction

This document describes unexpected memory consumption on the HCX Network Extension (NE) appliance and how to recover from it.

Symptoms:
The memory allocated to a given Network Extension (NE) appliance may become exhausted at runtime, and errors such as the following may be seen in the appliance log:
2023-02-21T12:11:23+00:00 HCX-NE-I1 GatewayLogs[1057]: [Warning-ops] : Memory usage is probably high (free: %4)
2023-02-22T12:25:04+00:00 HCX-NE-I1 kernel: ip: page allocation failure: order:4, mode:0x6000c0(GFP_KERNEL), nodemask=(null)
2023-02-22T12:25:04+00:00 HCX-NE-I1 kernel: ip cpuset=/ mems_allowed=0
Access to the NE appliance via CCLI/SSH may or may not work, depending on the memory condition.
To verify the current memory consumption of a given NE appliance:
Log in to the HCX Manager admin console >> ccli >> list >> go [NE_Appliance] >> ssh
root@HCX-NE-I1 [ ~ ]# cat /proc/meminfo 
MemTotal:        3075532 kB
MemFree:           75913 kB
MemAvailable:          0 kB  >>>>>>>
Note: If SSH is inaccessible via CCLI, execute "show system memory" directly from CCLI:
admin@hcx [ ~ ]$ ccli
Welcome to HCX Central CLI

[admin@hcx] list
[admin@hcx] go 0
Switched to node 0.

[admin@HCX-NE-I1] show system memory
MemTotal:        3075532 kB
MemFree:           75913 kB
MemAvailable:          0 kB  >>>>>>>
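
If SSH access is working, memory can also be watched over time rather than sampled once. A minimal sketch, using only grep and shell built-ins present on the appliance; it samples every 5 seconds until interrupted:
root@HCX-NE-I1 [ ~ ]# while true; do grep -E 'MemFree|MemAvailable' /proc/meminfo; sleep 5; done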

Location of Appliance log:
HCX Manager  : /tmp/Fleet-Appliances/Service Mesh/NE-Appliance/var/log/messages
NE appliance : /var/log/messages
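
To check whether an appliance has already hit this condition, the log can be searched for the signatures shown under Symptoms, for example (run from the NE appliance; adjust the path when checking from HCX Manager):
root@HCX-NE-I1 [ ~ ]# grep -E 'page allocation failure|Memory usage is probably high' /var/log/messages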


Cause

For each extended network, a separate vNIC is allocated on the corresponding NE appliance. Each vNIC consumes a specific amount of kernel memory for the jumbo packet ring in the VMXNET3 device driver.
In Photon OS, the maximum ring size for the vNIC device driver was changed from 2048 to 4096, which doubles the kernel memory consumption per vNIC.

As a result, when a user configures up to 8 extensions per appliance, the combined memory consumption may exceed the memory available to the NE appliance.
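
To see the ring sizes actually in use on a vNIC, ethtool can be queried (a sketch, assuming ethtool is present on the appliance and an interface named eth0; actual interface names vary). The "RX Jumbo" line corresponds to the jumbo packet ring described above:
root@HCX-NE-I1 [ ~ ]# ethtool -g eth0
As a rough illustration only: assuming ~9 KB buffers for jumbo frames, a 4096-entry jumbo ring on a single vNIC can account for roughly 36 MB of kernel memory, so eight extension vNICs multiply this consumption accordingly.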

Resolution

This issue is fixed in the HCX 4.6.1 release.

Workaround:
As a workaround, follow the steps below:
  1. Reduce the number of existing extensions to a maximum of 6 on each NE appliance, whether standalone or in an HA pair.
  2. To recover from high memory consumption, perform a "redeploy" of the impacted NE appliance.
Note: To re-extend the removed networks while staying within the limit of 6 per NE appliance, deploy an additional NE appliance in the existing Service Mesh, or create a new Service Mesh and deploy an NE appliance there (a rough way to check an appliance's current count is shown below).
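
Because each extended network maps to a separate vNIC (see Cause above), a rough proxy for the number of extensions an appliance carries is its interface count, keeping in mind that the loopback, management, and uplink interfaces are included in the total:
root@HCX-NE-I1 [ ~ ]# ip -o link show | wc -l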

Additional Information

Impact/Risks:
  • This issue affects HCX NE appliances running version 4.4 or later.
  • Additional Network Extension configurations will not be serviced.
  • Existing network extensions should continue to operate at the data path layer.
  • There is NO impact to migration services.