HCX Fleet Appliances (IX/NE) are not logging information to /var/log/messages due to the error "syslog-ng invoked oom-killer"

Article ID: 414351


Updated On:

Products

VMware HCX

Issue/Introduction

  • You are running a version prior to HCX 4.11.1 on the HCX Managers and Fleet Appliances (HCX-IX and HCX-NE).
  • When troubleshooting an issue that affected the appliances and reviewing the relevant log, /var/log/messages (which contains information about the HCX tunnel, HA failover, BFD events, and other relevant services), you may find that no relevant log information is present.
  • In /var/log/messages, output similar to the following is displayed:
    <4>1 <timestamps> <hostname> kernel  - - syslog-ng invoked oom-killer: gfp_mask=0x6200ca(GFP_HIGHUSER_MOVABLE), nodemask=(null), order=0, oom_score_adj=0
    <6>1 <timestamps> <hostname> kernel  - - syslog-ng cpuset=/ mems_allowed=0
    <4>1 <timestamps> <hostname> kernel  - - CPU: 0 PID: 20039 Comm: syslog-ng Tainted: G           OE     4.19.245-1.ph3-esx #1-photon
    <4>1 <timestamps> <hostname> kernel  - - Call Trace:
    <4>1 <timestamps> <hostname> kernel  - -  dump_stack+0x6d/0x8b
    <4>1 <timestamps> <hostname> kernel  - -  dump_header+0x65/0x275
    <4>1 <timestamps> <hostname> kernel  - -  ? __delayacct_freepages_end+0x25/0x30
    <4>1 <timestamps> <hostname> kernel  - -  oom_kill_process+0x26b/0x2a0
    <4>1 <timestamps> <hostname> kernel  - -  ? oom_badness.part.6+0xd/0x110
    <4>1 <timestamps> <hostname> kernel  - -  out_of_memory+0xf3/0x2b0
    <4>1 <timestamps> <hostname> kernel  - -  __alloc_pages_nodemask+0x87e/0xd40
    <4>1 <timestamps> <hostname> kernel  - -  filemap_fault+0x342/0x660
    <4>1 <timestamps> <hostname> kernel  - -  ext4_filemap_fault+0x2c/0x40
    <4>1 <timestamps> <hostname> kernel  - -  __do_fault+0x32/0xa0
    <4>1 <timestamps> <hostname> kernel  - -  do_fault+0x121/0x6b0
    <4>1 <timestamps> <hostname> kernel  - -  ? ep_read_events_proc+0xb0/0xb0
    <4>1 <timestamps> <hostname> kernel  - -  __handle_mm_fault+0x5de/0x680
    <4>1 <timestamps> <hostname> kernel  - -  handle_mm_fault+0x10a/0x200
    <4>1 <timestamps> <hostname> kernel  - -  __do_page_fault+0x1fa/0x3f0
    <4>1 <timestamps> <hostname> kernel  - -  do_page_fault+0x22/0x30
    <4>1 <timestamps> <hostname> kernel  - -  ? page_fault+0x8/0x30
    <4>1 <timestamps> <hostname> kernel  - -  page_fault+0x1e/0x30
    <4>1 <timestamps> <hostname> kernel  - - RIP: 0033:0x7f686411e4d0
  • The following event is observed in /var/log/messages before the syslog-ng oom-killer event:
    <132>1 <timestamps> <hostname> cgw 1104 - - [Warning-ops] : Memory usage is probably high (free: %3)
  • From the HCX Manager UI, under Interconnect -> Service Mesh, when viewing the appliances and clicking the "i" (info) icon, the "Memory usage is high" alarm is displayed. A hedged sketch of how to confirm these symptoms from the appliance shell follows this list.
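The checks below are a minimal sketch, assuming shell access to the affected Fleet Appliance; they use only standard Linux commands and the log strings quoted above, and are not an official diagnostic procedure:

    # Confirm that syslog-ng was terminated by the kernel OOM killer
    grep -i "oom-killer" /var/log/messages

    # Look for the preceding high-memory warning from the cgw service
    grep "Memory usage is probably high" /var/log/messages

    # Check overall memory availability on the appliance
    free -m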

Environment

VMware HCX

Cause

A memory leak affecting the ndd process has been found in the HCX Fleet Appliances.
This causes high memory usage, leaving the Fleet Appliance unable to allocate resources.
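As an illustration only, the memory consumption of the ndd process can be observed over time with standard Linux tools from the appliance shell; this is a hedged sketch, not an official diagnostic step:

    # Show the resident memory (RSS, in KB) and uptime of the ndd process;
    # repeat periodically to see whether RSS keeps growing
    ps -o pid,rss,etime,comm -C ndd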

Resolution

This issue is resolved in VMware HCX 4.11.1, available at Broadcom downloads.
If you are having difficulty finding and downloading software, please review the Download Broadcom products and software KB.

Note: If you are experiencing this issue on HCX Fleet Appliances running 4.11.1 or higher, please open a support case with Broadcom Support and refer to this KB article.
For more information, see Creating and managing Broadcom support cases.

Workaround:

  1. If the appliance is already affected by the ndd memory leak and syslog is also impacted, it is recommended to reboot the appliance during a maintenance window.
    Note: Rebooting the NE requires downtime and should be performed only during a maintenance window.
  2. Once the appliance is rebooted, you can implement the following workaround for HCX 4.11.0 or below:
    1. SSH into the HCX Manager as the admin user.
    2. Once logged in, run the following commands (a verification sketch follows this list):
      ccli
      list
      go # (where # is the NE appliance ID)
      ssh
      systemctl stop ndd
      systemctl disable ndd
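
To confirm that the service is stopped and will not start again at boot, the following standard systemctl checks can be run from the same appliance shell; this is a hedged addition, not part of the official procedure:

      systemctl status ndd      # confirm the ndd service is no longer running
      systemctl is-enabled ndd  # confirm the ndd service is disabled

(systemctl is-enabled prints "disabled" and returns a non-zero exit code for a disabled unit; the textual output is what matters here.)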

Note: After disabling the ndd service on the NE Appliance VM, there is no impact on the system from a traffic forwarding and stability perspective. However, the Transport Analytics feature will be non-functional for those NE Appliances. On-demand bandwidth testing can be used as an alternative to the Transport Analytics feature.

Note: If you are running HCX 4.11.0 or below, we recommend proactively applying Workaround step 2 to all appliances to prevent this issue in the future. This needs to be implemented on both the HCX NE-I (source/initiator) and NE-R (target/receiver) appliances.

Additional Information

The /var/log/messages output is fundamental for troubleshooting complex issues. The absence of information logged to /var/log/messages due to a syslog issue will significantly affect the ability to determine a root cause.

VMware HCX 4.11.1 Release Notes
Fixed Issue 3528977: Long running Network Detection Daemon (ndd) process can cause the system to run out of memory on Network Extension (NE) and Interconnect (IX) appliances.
When the system is kept running for a long time, the ndd process will continue consuming memory and can eventually consume all available memory leading to system kernel errors.