HCX - NE appliance VM may experience system/kernel crash

Article ID: 321639

Products

VMware HCX
VMware Cloud on AWS

Issue/Introduction

This article describes a known issue in which the HCX Network Extension appliance VM experiences a system/kernel crash, and provides a procedure to mitigate it.

Symptoms:
The HCX Network Extension (NE) appliance VM may experience a system/kernel crash during normal operation.
A dump similar to the following appears in the logs:

2022-10-15T08:13:15.106Z| vcpu-2| I125: Guest: <1>[   92.248389] BUG: unable to handle kernel paging request at 0000000000025280
2022-10-15T08:13:15.107Z| vcpu-2| I125: Guest: <6>[   92.248654] PGD 0 P4D 0
2022-10-15T08:13:15.107Z| vcpu-2| I125: Guest: <4>[   92.248881] Oops: 0000 [#1] SMP PTI
2022-10-15T08:13:15.107Z| vcpu-2| I125: Guest: <4>[   92.248953] CPU: 2 PID: 0 Comm: swapper/2 Tainted: G           OE     4.19.245-1.ph3-esx #1-photon
2022-10-15T08:13:15.107Z| vcpu-2| I125: Guest: <4>[   92.249028] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 12/12/2018
2022-10-15T08:13:15.107Z| vcpu-2| I125: Guest: <4>[   92.249120] RIP: 0010:get_rps_cpu+0x89e/0x920
2022-10-15T08:13:15.107Z| vcpu-2| I125: Guest: <4>[   92.249162] Code: ff ff ff e9 aa f8 ff ff 8b 76 1c 89 b5 70 ff ff ff 49 63 ca 48 c7 c2 00 52 02 00 44 8b 85 70 ff ff ff 48 8b 34 cd c0 b4 9b bc <8b> 8c 32 80 00 00 00 8b 94 16 ec 00 00 00 41 29 c8 29 ca 44 39 c2

IMPORTANT: The Network Extension appliance VM reboots during the crash event as part of its self-recovery process.

Location of crash dump:

On the ESXi host, go to the Network Extension VM directory and open vmware.log.
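
To check a copy of vmware.log for this crash signature, a short script can look for the two marker lines shown in the dump above. The following is a minimal sketch, not an official tool: the function names find_get_rps_cpu_crashes and scan_file are hypothetical, and the sketch assumes the log lines follow the same layout as the sample in this article (timestamp, then "|", then a "Guest:" message).

```python
import re

# Signature strings taken from the sample dump in this article.
PAGING_BUG = "BUG: unable to handle kernel paging request"
RIP_PATTERN = re.compile(r"RIP: \S+:get_rps_cpu\+")


def find_get_rps_cpu_crashes(lines):
    """Return timestamps of guest kernel crashes whose RIP is in get_rps_cpu.

    A crash is counted when a paging-request BUG line is later followed by
    a RIP line that points into get_rps_cpu.
    """
    crashes = []
    pending_ts = None  # timestamp of the most recent unmatched BUG line
    for line in lines:
        if "Guest:" not in line:
            continue  # only guest kernel messages are of interest
        if PAGING_BUG in line:
            # Assumed layout: "<timestamp>| vcpu-N| ...", as in the sample.
            pending_ts = line.split("|", 1)[0].strip()
        elif pending_ts is not None and RIP_PATTERN.search(line):
            crashes.append(pending_ts)
            pending_ts = None
    return crashes


def scan_file(path="vmware.log"):
    """Scan a vmware.log file (path is an assumption) and return crash timestamps."""
    with open(path, errors="replace") as fh:
        return find_get_rps_cpu_crashes(fh)
```

Calling scan_file() on a copy of the log from the NE VM directory returns one timestamp per matching crash; an empty list means this particular signature was not found, which does not rule out other crash types.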

Cause

This is a bug identified in the get_rps_cpu() kernel function running on the Network Extension appliance.
It usually occurs when kernel networking initialization is slow, or in a few rare, unidentified abnormal system conditions.

Note: This is purely a datapath symptom. It may be triggered when certain workload VMs with specific traffic types, for example a Splunk VM, are connected to the NE appliance over an extended segment.

Resolution

This issue is fixed in the HCX 4.5.2 release.
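
When auditing several sites, the installed HCX version can be compared against the fixed release programmatically. The sketch below assumes the version is available as a dotted numeric string such as "4.5.2" or "4.5.2.0"; how that string is obtained is outside its scope, and the names FIXED_IN, parse_version and has_fix are hypothetical.

```python
# Release that contains the fix, per this article.
FIXED_IN = (4, 5, 2)


def parse_version(text):
    """Turn a dotted version string into a tuple of ints.

    Parsing stops at the first non-numeric component, so a trailing
    build suffix (an assumed possibility) is ignored.
    """
    parts = []
    for piece in text.strip().split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def has_fix(version_text):
    """Return True if the given HCX version is 4.5.2 or later."""
    version = parse_version(version_text)
    # Pad short versions such as "4.6" to three components before comparing.
    padded = (version + (0, 0, 0))[:3]
    return padded >= FIXED_IN
```

Comparing integer tuples rather than raw strings avoids the ordering mistake where "4.10.0" would sort before "4.5.2".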

Workaround:
As soon as a crash is observed on an NE appliance, follow the steps below:

Try to identify the workload VMs that were recently migrated and reside on the extended network served by the NE appliance that crashed.
Once such a VM is identified, collect its traffic profile. If it appears to be a busy VM for certain traffic types, take the following steps:
- Disconnect that specific workload VM from the L2E extended segment, or reverse migrate the VM back to the on-premises/source side so that its traffic is no longer bridged over the extended datapath.
- There is NO need to redeploy the NE appliance; the appliance VM reboots itself and should recover.
There is NO need to disconnect or reverse migrate all workload VMs connected to that Network Extension appliance VM. Once the specific workload VM is disconnected, they should continue operating normally on the same NE appliance.

Additional Information

Impact/Risks:

All HCX versions prior to 4.5.2 are affected.
The Network Extension service is affected during a system/kernel crash.
There is NO impact on HCX migration services.
