HCX appliances fail unexpectedly due to a kernel-level panic

Article ID: 406405

Updated On:

Products

VMware HCX

Issue/Introduction

  • The HCX Network Extension (NE) appliance may crash and reboot unexpectedly because of a kernel panic, and the NE appliance loses its functionality until it recovers.
  • The appliance reboots within 60 seconds of the crash and recovers automatically.
  • The vmware.log file of the affected appliance VM (stored on the ESXi host) contains a "get_rps_cpu+0x679/0x860" entry in the guest kernel call trace. To locate it:
    • Log in to the ESXi host over SSH as the root user.
    • Run the command: esxcli vm process list | grep -i <HCX-NE-VM-Name>
    • Note the VM location in the output and change to that directory. Example: cd /vmfs/volumes/datastore-volume-uuid/HCX-NE-VM-folder/
    • Open the vmware.log file of the corresponding HCX NE VM and look for a trace similar to the following:
      <timestamp>.143Z In(05) vcpu-1 - Guest: <4>[172###.9###64] Call Trace:
      <timestamp>.143Z In(05) vcpu-1 - Guest: <4>[172###.9###64]  <IRQ>
      <timestamp>.143Z In(05) vcpu-1 - Guest: <4>[172###.9###64]  ? __die_body.cold+0x1a/0x1f
      <timestamp>.143Z In(05) vcpu-1 - Guest: <4>[172###.9###64]  ? die_addr+0x3d/0x67
      <timestamp>.143Z In(05) vcpu-1 - Guest: <4>[172###.9###64]  ? exc_general_protection+0x150/0x327
      <timestamp>.143Z In(05) vcpu-1 - Guest: <4>[172###.9###64]  ? asm_exc_general_protection+0x27/0x30
      <timestamp>.143Z In(05) vcpu-1 - Guest: <4>[172###.9###64]  ? get_rps_cpu+0x679/0x860
      <timestamp>.144Z In(05) vcpu-1 - Guest: <4>[172###.9###64]  netif_receive_skb_list_internal+0x261/0x2d7
      <timestamp>.144Z In(05) vcpu-1 - Guest: <4>[172###.9###64]  ? vmxnet3_rq_rx_complete+0x62c/0xff0
      <timestamp>.144Z In(05) vcpu-1 - Guest: <4>[172###.9###64]  napi_complete_done+0x74/0x197
      <timestamp>.144Z In(05) vcpu-1 - Guest: <4>[172###.9###64]  vmxnet3_poll_rx_only+0x89/0xb7
      <timestamp>.144Z In(05) vcpu-1 - Guest: <4>[172###.9###64]  __napi_poll+0x41/0x180
      <timestamp>.144Z In(05) vcpu-1 - Guest: <4>[172###.9###64]  net_rx_action+0x3f9/0x4b7
      <timestamp>.144Z In(05) vcpu-1 - Guest: <4>[172###.9###64]  handle_softirqs+0x9b/0x250
      <timestamp>.144Z In(05) vcpu-1 - Guest: <4>[172###.9###64]  __irq_exit_rcu+0x90/0xc0
      <timestamp>.144Z In(05) vcpu-1 - Guest: <4>[172###.9###64]  irq_exit_rcu+0xe/0x17
      <timestamp>.144Z In(05) vcpu-1 - Guest: <4>[172###.9###64]  common_interrupt+0x8e/0xa7
      <timestamp>.144Z In(05) vcpu-1 - Guest: <4>[172###.9###64]  </IRQ>
  • All network traffic that uses the affected NE appliance is interrupted between the on-premises and cloud sites, and is restored after the appliance reboots.
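
The log check described in the steps above can also be done with a search instead of reading the file by hand. The following is a minimal sketch, not an official Broadcom tool: the function name find_rps_panic is hypothetical, and it assumes only that the trace line contains the text "get_rps_cpu+" as in the excerpt above.

```shell
# Hypothetical helper (not part of ESXi or HCX): print every line of a
# vmware.log file that contains the get_rps_cpu frame shown in this article.
# Usage: find_rps_panic /path/to/vmware.log
find_rps_panic() {
    log_file="$1"
    if [ ! -r "$log_file" ]; then
        echo "cannot read: $log_file" >&2
        return 2
    fi
    # -F treats the pattern as a fixed string, so "+" is matched literally.
    # -n prefixes each match with its line number in the log.
    # grep returns 1 when nothing matches, which the function passes through.
    grep -nF 'get_rps_cpu+' "$log_file"
}
```

For example, after changing to the VM directory, running find_rps_panic vmware.log prints the matching trace lines if the panic was logged, and prints nothing (exit status 1) if it was not. A match indicates only that the log contains this frame; confirm against the full call trace before concluding that the appliance hit this issue.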

Environment

HCX version 4.4 and above

Resolution

A kernel fix has been identified and will be included in an upcoming HCX maintenance build.

If you believe you have encountered this issue, please open a support case with Broadcom Support and refer to this KB article.
For more information, see Creating and managing Broadcom support cases.