HCX - NE VM kernel crash data collection and recovery

Article ID: 323357

Products

VMware HCX

Issue/Introduction

This article provides the data collection steps required to investigate the root cause of an NE VM kernel crash.

An HCX Network Extension (NE) appliance VM may experience a kernel crash, after which the NE VM remains offline until it is manually recovered in vCenter by a power off/on or reset.

In the vCenter UI, a message reports that the CPU of the NE VM was disabled and that the VM requires a power off/on or reset.

On the source host where the NE VM resides, the hostd.log shows the same message for the NE VM:
"The CPU has been disabled by the guest operating system. Power off or reset the virtual machine."

Log location (on the ESXi host where the NE VM resides): /var/run/log/hostd.log

2023-08-18T02:36:50.226Z verbose hostd[2099988] [Originator@6876 sub=Vmsvc.vm:/vmfs/volumes/########-########-####-e4434b77f338/###-ServiceMesh-NE-I2-Nqs-Redeploying/###-ServiceMesh-NE-I2-Nqs-Redeploying.vmx opID=lro-#########-########-01-01-3e-a860] Handling vmx message 9423161: The CPU has been disabled by the guest operating system. Power off or reset the virtual machine.

2023-08-18T02:36:50.226Z warning hostd[2099988] [Originator@6876 sub=Vmsvc.vm:/vmfs/volumes/########-########-####-e4434b77f338/XXX-ServiceMesh-NE-I2-Nqs-Redeploying/XXX-ServiceMesh-NE-I2-Nqs-Redeploying.vmx opID=lro-#########-########-01-01-3e-a860] Failed to find activation record, event user unknown.

2023-08-18T02:36:50.227Z info hostd[2099988] [Originator@6876 sub=Vimsvc.ha-eventmgr opID=lro-#########-########-01-01-3e-a860] Event 3429 : Message on XXX-ServiceMesh-NE-I2 on sv136284.XXX.com in ha-datacenter: The CPU has been disabled by the guest operating system. Power off or reset the virtual machine.
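The hostd.log check above can be sketched as a small Python filter. This is only an illustration under stated assumptions: the function names `find_cpu_disabled_events` and `scan_file` are hypothetical, and the line prefix pattern is inferred from the sample lines in this article rather than from a documented log format.

```python
import re

# Message text quoted in this article.
MARKER = "The CPU has been disabled by the guest operating system"

# Assumed prefix shape, inferred from the samples above:
# "2023-08-18T02:36:50.226Z verbose hostd[2099988] ..."
PREFIX = re.compile(r"^(?P<ts>\S+)\s+(?P<level>\w+)\s+hostd\[\d+\]")


def find_cpu_disabled_events(lines):
    """Return (timestamp, level, line) for each line that carries the marker."""
    hits = []
    for line in lines:
        if MARKER not in line:
            continue
        m = PREFIX.match(line)
        ts = m.group("ts") if m else None
        level = m.group("level") if m else None
        hits.append((ts, level, line.rstrip("\n")))
    return hits


def scan_file(path):
    """Apply the filter to a log file, tolerating undecodable bytes."""
    with open(path, errors="replace") as fh:
        return find_cpu_disabled_events(fh)
```

Calling `scan_file("/var/run/log/hostd.log")` on a copy of the log would list the matching entries with their timestamps, which can then be compared against the crash time in vmware.log.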


On the source host where the NE VM resides, the vmware.log associated with the NE VM shows a guest kernel panic / crash relating to 'skbuff: skb_under_panic' and 'kernel BUG at net/core/skbuff.c:104!'

Log location (on the ESXi host where the NE VM resides): /vmfs/volumes/<Datastore_name>/###-ServiceMesh-NE-I2-Nqs-Redeploying/vmware.log

2023-08-18T02:36:50.218Z In(05) vcpu-4 - Guest: <0>[1453356.452382] skbuff: skb_under_panic: text:00000000717a3dbe len:1434 put:8 head:00000000536df6e8 data:0000000080e65197 tail:0x594 end:0x6c0 dev:ipip_te_0
2023-08-18T02:36:50.218Z In(05) vcpu-4 - Guest: <4>[1453356.452593] ------------[ cut here ]------------

2023-08-18T02:36:50.218Z In(05) vcpu-4 - Guest: <2>[1453356.452594] kernel BUG at net/core/skbuff.c:104!
2023-08-18T02:36:50.218Z In(05) vcpu-4 - Guest: <4>[1453356.452688] invalid opcode: 0000 [#1] SMP NOPTI
2023-08-18T02:36:50.218Z In(05) vcpu-4 - Guest: <4>[1453356.452748] CPU: 4 PID: 0 Comm: swapper/4 Tainted: G           OE     4.19.245-1.ph3-esx #1-photon
2023-08-18T02:36:50.218Z In(05) vcpu-4 - Guest: <4>[1453356.452816] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 11/12/2020
2023-08-18T02:36:50.218Z In(05) vcpu-4 - Guest: <4>[1453356.452901] RIP: 0010:skb_panic+0x4a/0x50

2023-08-18T02:36:50.218Z In(05) vcpu-4 - Guest: <4>[1453356.452938] Code: 00 00 50 8b 87 d0 00 00 00 50 8b 87 cc 00 00 00 50 ff b7 e0 00 00 00 4c 8b 8f d8 00 00 00 48 c7 c7 e8 e4 9a a3 e8 72 4b 11 00 <0f> 0b 0f 1f 40 00 48 8b 97 e0 00 00 00 89 f0 01 b7 80 00 00 00 48
2023-08-18T02:36:50.218Z In(05) vcpu-4 - Guest: <4>[1453356.453068] RSP: 0000:ffffbe9f0017c410 EFLAGS: 00010282
2023-08-18T02:36:50.219Z In(05) vcpu-4 - Guest: <4>[1453356.453112] RAX: 000000000000008c RBX: ffff9e5629003300 RCX: 0000000000000000
2023-08-18T02:36:50.219Z In(05) vcpu-4 - Guest: <4>[1453356.453167] RDX: ffff9e567cb21c60 RSI: ffff9e567cb1b088 RDI: ffff9e567cb1b088
2023-08-18T02:36:50.219Z In(05) vcpu-4 - Guest: <4>[1453356.453222] RBP: ffffbe9f0017c430 R08: 0000000000000000 R09: 000000000000059d
2023-08-18T02:36:50.219Z In(05) vcpu-4 - Guest: <4>[1453356.453280] R10: 000000000f4a0b8c R11: 642030633678303a R12: ffffbe9f0017c4fc
2023-08-18T02:36:50.219Z In(05) vcpu-4 - Guest: <4>[1453356.453337] R13: ffff9e55eb14181c R14: ffffbe9f0017c500 R15: 0000000000009411
2023-08-18T02:36:50.219Z In(05) vcpu-4 - Guest: <4>[1453356.453414] FS:  0000000000000000(0000) GS:ffff9e567cb00000(0000) knlGS:0000000000000000
2023-08-18T02:36:50.219Z In(05) vcpu-4 - Guest: <4>[1453356.453491] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
2023-08-18T02:36:50.219Z In(05) vcpu-4 - Guest: <4>[1453356.453541] CR2: 0000000000000000 CR3: 0000000097a0a001 CR4: 00000000001606a0

2023-08-18T02:36:50.219Z In(05) vcpu-4 - Guest: <4>[1453356.453619] Call Trace:
2023-08-18T02:36:50.219Z In(05) vcpu-4 - Guest: <4>[1453356.453649]  <IRQ> </IRQ>
2023-08-18T02:36:50.223Z In(05) vcpu-4 - Guest: <4>[1453356.456140] RIP: 0010:native_safe_halt+0x17/0x20
2023-08-18T02:36:50.223Z In(05) vcpu-4 - Guest: <4>[1453356.456183] Code: 48 8b 00 a8 08 0f 84 76 ff ff ff eb bd 90 90 90 90 90 90 8b 05 3a 08 57 00 55 48 89 e5 85 c0 7e 07 0f 00 2d db a5 1c 00 fb f4 <5d> c3 0f 1f 80 00 00 00 00 8b 05 1a 08 57 00 55 48 89 e5 85 c0 7e
2023-08-18T02:36:50.223Z In(05) vcpu-4 - Guest: <4>[1453356.456323] RSP: 0000:ffffbe9f000a7ea0 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff04
2023-08-18T02:36:50.223Z In(05) vcpu-4 - Guest: <4>[1453356.456388] RAX: 0000000000000000 RBX: 0000000000000004 RCX: ffff9e567cb1f100
2023-08-18T02:36:50.223Z In(05) vcpu-4 - Guest: <4>[1453356.456450] RDX: ffffffffa3a2eff8 RSI: ffff9e567cb1f100 RDI: 000529d1f020e627
2023-08-18T02:36:50.223Z In(05) vcpu-4 - Guest: <4>[1453356.456511] RBP: ffffbe9f000a7ea0 R08: 0000000000000000 R09: ffff9e567cb24200
2023-08-18T02:36:50.223Z In(05) vcpu-4 - Guest: <4>[1453356.456574] R10: ffffbe9f000a7e88 R11: 0000000000000000 R12: ffffffffa3a877c0
2023-08-18T02:36:50.224Z In(05) vcpu-4 - Guest: <4>[1453356.456636] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
2023-08-18T02:36:50.224Z In(05) vcpu-4 - Guest: <4>[1453356.456699]  default_idle+0x10/0x20
2023-08-18T02:36:50.224Z In(05) vcpu-4 - Guest: <4>[1453356.456738]  arch_cpu_idle+0x10/0x20
2023-08-18T02:36:50.224Z In(05) vcpu-4 - Guest: <4>[1453356.456774]  default_idle_call+0x1e/0x30
2023-08-18T02:36:50.224Z In(05) vcpu-4 - Guest: <4>[1453356.456814]  do_idle+0x1c9/0x1f0
2023-08-18T02:36:50.224Z In(05) vcpu-4 - Guest: <4>[1453356.456849]  cpu_startup_entry+0x5f/0x70
2023-08-18T02:36:50.224Z In(05) vcpu-4 - Guest: <4>[1453356.456888]  start_secondary+0x19d/0x1e0
2023-08-18T02:36:50.224Z In(05) vcpu-4 - Guest: <4>[1453356.456927]  secondary_startup_64_no_verify+0xca/0xcb
2023-08-18T02:36:50.224Z In(05) vcpu-4 - Guest: <4>[1453356.456972] Modules linked in: drbg ansi_cprng seqiv esp4(E) xfrm6_mode_tunnel(E) xfrm4_mode_tunnel(E) xt_u32(E) xt_nat(E) xt_cpu(E) xt_multiport(E) xt_connmark(E)xt_mark(E) ebt_arp(E) ebt_dnat(E) ebtable_nat(E) ebtable_filter(E) ebtables(E) nf_log_ipv4(E) nf_log_common(E) xt_limit(E) iptable_raw(E) arptable_filter(E) ip6table_mangle(E) ip6table_nat(E) iptable_mangle(E) iptable_nat(E) nf_conntrack_netlink(E) nfnetlink(E) xt_LOG(E) dummy(E) openvswitch(E) nsh(E) nf_nat_ipv6(E) nf_nat_ipv4(E) nf_conncount(E) nf_nat(E) xt_policy(E) xt_state(E) xt_conntrack(E) nf_conntrack(E) nf_defrag_ipv6(E) nf_defrag_ipv4(E) mousedev(E) psmouse(E) evdev(E) ip6table_filter(E) ip6_tables(E) iptable_filter(E) br_netfilter(E) ip_gre(E) fou(E) ip6_udp_tunnel(E) udp_tunnel(E) vxlan_trunk(E) bridge(E) stp(E) arp_tables(E)
2023-08-18T02:36:50.224Z In(05) vcpu-4 - Guest: <4>[1453356.457529]  llc(E) ipip(E) tunnel4(E) ip_tunnel(E) rdrand_rng(E) rng_core(E) aesni_intel aes_x86_64 crypto_simd cryptd glue_helper sr_mod(E) cdrom(E) floppy(E) dm_mirror(E) dm_region_hash(E) dm_log(E) dm_mod(E) ipv6(E)
2023-08-18T02:36:50.224Z In(05) vcpu-4 - Guest: <4>[1453356.457715] ---[ end trace 0119df309df630a4 ]---
2023-08-18T02:36:50.225Z In(05) vcpu-0 - Vix: [vmxCommands.c:7182]: VMAutomation_HandleCLIHLTEvent. Do nothing.
2023-08-18T02:36:50.225Z In(05) vcpu-0 - MsgHint: msg.monitorevent.halt

2023-08-18T02:36:50.225Z In(05)+ vcpu-0 - The CPU has been disabled by the guest operating system. Power off or reset the virtual machine.
2023-08-18T02:36:50.225Z In(05)+ vcpu-0 - ---------------------------------------
2023-08-18T02:36:50.227Z In(05) vcpu-0 - VigorTransportProcessClientPayload: opID=lro-#########-########-01-01-3e-a860 seq=1087320: Receiving Bootstrap.MessageReply request.
2023-08-18T02:36:50.227Z In(05) vcpu-0 - VigorTransport_ServerSendResponse opID=lro-#########-########-01-01-3e-a860 seq=1087320: Completed Bootstrap request.
2023-08-18T02:36:50.227Z In(05) vcpu-4 - Guest: <5>[    0.000000] Linux version 4.19.245-1.ph3-esx (root@photon) (gcc version 7.3.0 (GCC)) #1-photon SMP Thu Nov 10 19:21:49 UTC 2022
2023-08-18T02:36:50.227Z In(05) vcpu-4 - Guest: <6>[    0.000000] Command line: BOOT_IMAGE=/vmlinuz-4.19.245-1.ph3-esx root=/dev/sda2 init=/lib/systemd/systemd ro loglevel=3 quiet no-vmw-sta loadpin.enabled=0 slub_debug=- page_poison=off slab_nomerge cgroup.memory=nokmem pti=off l1tf=off mds=off isolcpus=1,2,3,4,5,6,7 net.ifnames=0 plymouth.enable=0 systemd.legacy_systemd_cgroup_controller=yes fips=0 audit=0
2023-08-18T02:36:50.227Z In(05) vcpu-4 - Guest: <6>[    0.000000] Disabled fast string operations 
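To confirm that a vmware.log contains the same signature, the two key lines (skb_under_panic and kernel BUG) can be extracted with a short parser. The following is a minimal sketch, assuming the guest lines have the layout seen in the excerpt above; the function name `summarize_guest_crash` and the dictionary keys are hypothetical.

```python
import re

# Assumed layout of guest kernel lines relayed into vmware.log:
# "<timestamp> In(05) vcpu-4 - Guest: <0>[1453356.452382] message"
GUEST_LINE = re.compile(
    r"^(?P<ts>\S+)\s+\S+\s+(?P<vcpu>vcpu-\d+)\s+-\s+Guest:\s+"
    r"<\d+>\[\s*(?P<uptime>\d+\.\d+)\]\s*(?P<msg>.*)$"
)
SKB_PANIC = re.compile(
    r"skbuff: skb_under_panic:.*?len:(?P<len>\d+) put:(?P<put>\d+).*?dev:(?P<dev>\S+)"
)
KERNEL_BUG = re.compile(r"kernel BUG at (?P<where>\S+?)!")


def summarize_guest_crash(lines):
    """Collect skb_under_panic details and the kernel BUG location, if present."""
    summary = {}
    for line in lines:
        m = GUEST_LINE.match(line)
        if not m:
            continue  # not a guest kernel message
        msg = m.group("msg")
        panic = SKB_PANIC.search(msg)
        if panic:
            summary.update(
                timestamp=m.group("ts"),
                vcpu=m.group("vcpu"),
                guest_seconds_since_boot=float(m.group("uptime")),
                length=int(panic.group("len")),
                put=int(panic.group("put")),
                device=panic.group("dev"),
            )
        bug = KERNEL_BUG.search(msg)
        if bug:
            summary["kernel_bug_at"] = bug.group("where")
    return summary
```

An empty result means neither line was found, in which case the crash may have a different cause than the one described here.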



Environment

VMware HCX

Cause

NE VM guest kernel panic / crash (skb_under_panic, kernel BUG at net/core/skbuff.c:104)

Resolution

To determine the root cause of the NE VM kernel panic / crash, the NE VM must be suspended in vCenter while it is in the problem state so that a memory dump can be collected for analysis by the engineering team.

NOTE: The ability to suspend an NE VM in vCenter is disabled by default. The configuration change in vCenter that allows the NE VM to be suspended requires a power off/on of the NE VM, which discards the current memory state. This means that once the procedure to allow NE VM suspension is done, a new occurrence of the kernel panic / crash on that NE VM is needed before the VM can be suspended and the memory dump file collected.

 

If you believe you have experienced this issue, provide the information below and reference this KB article when opening a support case with Broadcom.

  • HCX Build being used in both Source/Destination sites.
  • Approx. uptime of NE VM. 
  • When was the last upgrade to HCX performed?
  • Name of affected Service-Mesh.
  • Name of affected NE appliance. 
  • Screenshot of the error message from the vCenter UI, as described above.
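For the approximate uptime, one option is to read it from the guest kernel timestamp on the panic line in vmware.log, since the bracketed value counts seconds since the guest kernel booted. The sketch below uses the value 1453356.452382 from the sample log in this article; the helper name `guest_uptime` is hypothetical, and the result is only an approximation of the NE VM uptime.

```python
from datetime import timedelta


def guest_uptime(seconds_since_boot):
    """Convert a guest kernel log timestamp (seconds since boot) to a timedelta."""
    return timedelta(seconds=float(seconds_since_boot))


# Value taken from the skb_under_panic line in the sample vmware.log.
uptime = guest_uptime("1453356.452382")
print(f"Approximate NE VM uptime at crash: {uptime.days} days, "
      f"{uptime.seconds // 3600} hours")
```

For the sample log this works out to a little under 17 days of uptime before the crash.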

 

 

Additional Information

The NE VM remains offline until it is manually recovered in vCenter by a power off/on or reset.