HCX NE appliances fail unexpectedly due to a kernel-level panic

Products

VMware HCX

Issue/Introduction

The HCX Network Extension (NE) appliance may crash and reboot unexpectedly due to a kernel panic, and it loses its functionality until the reboot completes.
The appliance reboots within 60 seconds of the crash and recovers automatically.

The vmware.log file of the affected appliance VM (stored on the ESXi host) shows a "get_rps_cpu+0x679/0x860" entry in the guest call trace. To locate it:

Log in to the ESXi host as the root user via SSH.
Run the command: esxcli vm process list | grep -i <HCX-NE-VM-Name>
Note the VM folder location and change to it. Example: cd /vmfs/volumes/datastore-volume-uuid/HCX-NE-VM-folder/
Open the vmware.log file of the corresponding HCX NE VM and look for a call trace similar to the following:

<timestamp>.143Z In(05) vcpu-1 - Guest: <4>[172###.9###64] Call Trace:
<timestamp>.143Z In(05) vcpu-1 - Guest: <4>[172###.9###64]  <IRQ>
<timestamp>.143Z In(05) vcpu-1 - Guest: <4>[172###.9###64]  ? __die_body.cold+0x1a/0x1f
<timestamp>.143Z In(05) vcpu-1 - Guest: <4>[172###.9###64]  ? die_addr+0x3d/0x67
<timestamp>.143Z In(05) vcpu-1 - Guest: <4>[172###.9###64]  ? exc_general_protection+0x150/0x327
<timestamp>.143Z In(05) vcpu-1 - Guest: <4>[172###.9###64]  ? asm_exc_general_protection+0x27/0x30
<timestamp>.143Z In(05) vcpu-1 - Guest: <4>[172###.9###64]  ? get_rps_cpu+0x679/0x860
<timestamp>.144Z In(05) vcpu-1 - Guest: <4>[172###.9###64]  netif_receive_skb_list_internal+0x261/0x2d7
<timestamp>.144Z In(05) vcpu-1 - Guest: <4>[172###.9###64]  ? vmxnet3_rq_rx_complete+0x62c/0xff0
<timestamp>.144Z In(05) vcpu-1 - Guest: <4>[172###.9###64]  napi_complete_done+0x74/0x197
<timestamp>.144Z In(05) vcpu-1 - Guest: <4>[172###.9###64]  vmxnet3_poll_rx_only+0x89/0xb7
<timestamp>.144Z In(05) vcpu-1 - Guest: <4>[172###.9###64]  __napi_poll+0x41/0x180
<timestamp>.144Z In(05) vcpu-1 - Guest: <4>[172###.9###64]  net_rx_action+0x3f9/0x4b7
<timestamp>.144Z In(05) vcpu-1 - Guest: <4>[172###.9###64]  handle_softirqs+0x9b/0x250
<timestamp>.144Z In(05) vcpu-1 - Guest: <4>[172###.9###64]  __irq_exit_rcu+0x90/0xc0
<timestamp>.144Z In(05) vcpu-1 - Guest: <4>[172###.9###64]  irq_exit_rcu+0xe/0x17
<timestamp>.144Z In(05) vcpu-1 - Guest: <4>[172###.9###64]  common_interrupt+0x8e/0xa7
<timestamp>.144Z In(05) vcpu-1 - Guest: <4>[172###.9###64]  </IRQ>
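
When the vmware.log file is large, searching for the signature by hand can be slow. The following is a minimal sketch, not an official tool, that scans a copy of vmware.log for guest trace lines mentioning get_rps_cpu. The function name find_panic_lines and the script name are hypothetical, and matching on the symbol name alone (rather than the exact offsets shown above) is an assumption made so that slightly different kernel builds still match.

```python
import re
import sys

# Matches guest kernel trace lines that mention get_rps_cpu.
# Assumption: the offsets after "+" may vary between kernel builds,
# so any hexadecimal offset pair is accepted.
PANIC_SYMBOL = re.compile(r"Guest:.*\bget_rps_cpu\+0x[0-9a-f]+/0x[0-9a-f]+")


def find_panic_lines(lines):
    """Return (line_number, text) pairs for lines carrying the signature."""
    hits = []
    for number, line in enumerate(lines, start=1):
        if PANIC_SYMBOL.search(line):
            hits.append((number, line.rstrip("\n")))
    return hits


if __name__ == "__main__":
    # Hypothetical usage: python3 find_panic.py /path/to/copy/of/vmware.log
    if len(sys.argv) == 2:
        with open(sys.argv[1], errors="replace") as handle:
            for number, text in find_panic_lines(handle):
                print(f"{number}: {text}")
```

An empty result does not rule out a crash; it only means this particular signature was not found in the file that was scanned.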

All network traffic that uses the affected NE appliance is interrupted between the on-premises and cloud sites, and is restored after the appliance reboots.
The following entry is observed in the NE appliance log file /var/log/system_events:
{"id":0,"level":2,"timestamp":########,"UTC":"<timestamp>","message":"System initialized.","metadata":{}}
A boot workflow is observed in the NE appliance log file /var/log/messages:
<time_stamp> <Appliance-Name> syslog-ng 864 - - syslog-ng starting up; version='4.3.1'
<time_stamp> <Appliance-Name> kernel - - Linux version ...
<time_stamp> <Appliance-Name> kernel - - Kernel not locked down
<time_stamp> <Appliance-Name> kernel - - Command line: BOOT_IMAGE=/vmlinuz-6.1.141-5.ph5 root=PARTUUID=########-####-####-####-############ init=/lib/systemd/systemd ro loglevel=3 quiet
<time_stamp> page_poison=off slab_nomerge cgroup.memory=nokmem pti=off l1tf=off mds=off irqaffinity=0-2 isolcpus=3-7 rcutree.rcu_resched_ns=########## rcutree.rcu_max_blimit=###############
<time_stamp> =0 plymouth.enable=0 systemd.legacy_systemd_cgroup_controller=yes audit=1 fips=1 fips=1 audit=1
<time_stamp> <Appliance-Name> kernel - - Disabled fast string operations
<time_stamp> <Appliance-Name> kernel - - BIOS-provided physical RAM map:

Environment

HCX version 4.4 and above

Cause

Data plane appliances can experience kernel crashes when the hash of a new flow collides with the hash of an existing flow.

Resolution

This issue is resolved in VMware HCX 4.11.3, available at Broadcom downloads.

If you are having difficulty finding and downloading software, please review the Download Broadcom products and software KB.
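
When auditing several sites, it can help to check each HCX version string against the range stated in this article (4.4 and above, fixed in 4.11.3). The following is a minimal sketch under the assumption that versions are plain dotted numbers such as "4.10.2"; build suffixes are not handled, and the function names are hypothetical.

```python
FIRST_AFFECTED = (4, 4, 0)   # "HCX version 4.4 and above" per the Environment section
FIRST_FIXED = (4, 11, 3)     # fixed release per the Resolution section


def parse_version(text):
    """Turn a dotted version such as '4.10' into a 3-part tuple (4, 10, 0)."""
    parts = [int(piece) for piece in text.strip().split(".")]
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts[:3])


def is_affected(text):
    """True if the version falls in the affected range described in this article."""
    version = parse_version(text)
    return FIRST_AFFECTED <= version < FIRST_FIXED
```

For example, is_affected("4.10.2") evaluates to True, while is_affected("4.11.3") evaluates to False.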

Article ID: 406405
