Hostd crashes on HPE ProLiant DL360 Gen11 servers

Products

VMware vSphere ESXi

Issue/Introduction

Symptoms:

Hostd may crash due to heap corruption, and the stack backtrace of the crash thread may vary. This issue has been observed only on HPE ProLiant DL360 Gen11 servers, but may not be limited to them.
vCenter server triggering alarms for multiple ESXi host with hostd crash
Check the logs on ESXi host in these locations /var/run/log/vobd.log and /var/run/log/hostd.log

/var/run/log/vobd.log
2023-09-11T15:12:57.913Z test vobd:  An event (esx.problem.hostd.core.dumped) could not be sent immediately to hostd; queueing for retry.
source: testesx.com event_type: v4_506f43c3 vmw_cluster: PMDC_General vmw_datacenter: testdc vmw_object_id: host-580771 vmw_vcenter: testvc.com vmw_vcenter_id: 8cf73c94-b6c6-4539-bbec-522df07af5ea vmw_vr_ops_id: f45405a6-9145-443e-8a28-b3494b1310be hostname: esxtest appname: vobd vmw_esxi_problem: hostd.core.dumped
Sep 11, 202311:12:57.930
2023-09-11T15:12:57.913Z test vobd:  [UserWorldCorrelator] 532749575830us: [esx.problem.hostd.core.dumped] /bin/hostd crashed (42 time(s) so far) and a core file may have been created at /var/core/hostd-zdump.000. This may have caused connections to the host to be dropped.
source: testesx event_type: v4_9dab9697 vmw_cluster: PMDC_General vmw_datacenter: testdc vmw_object_id: host-580771 vmw_vcenter: testvc.com vmw_vcenter_id: 8cf73c94-b6c6-4539-bbec-522df07af5ea vmw_vr_ops_id: f45405a6-9145-443e-8a28-b3494b1310be hostname: testesx appname: vobd vmw_vob_component: UserWorldCorrelator vmw_esxi_problem: hostd.core.dumped vmw_vob_event_type

/var/run/log/hostd.log
2023-09-07T07:44:47.355Z info hostd[2591565] [Originator@6876 sub=Default] hostd-2575365-21424296.txt time=2023-09-07 07:44:40.000
--> Crash Report build=21424296
--> Signal 11 received, si_code 1, si_errno 0
-->    Bad access at 19A27866DF
--> eip 0x19a27866df esp 0x19aacc9fc0 ebp 0xf00000000000082
--> eax 0xf00000000000020 ebx 0xf100001955646840 ecx 0x19aaccb6c8 edx 0x61 esi 0x800 edi 0x19a2ab8660
--> r8 0x19a2ab86b8 r9 0x19a2ab8688 r10 0x19a2ab86b0 r11 0x0 r12 0x19556468c2 r13 0x19a2ab8660 r14 0x195565b3f0 r15 0x2448000000000000
--> stack 19AACC9FC0 : 0xaacca070 0x00000019 0x94d6df21 0x00000019
--> stack 19AACC9FD0 : 0xaacca0f0 0x00000019 0x00000050 0x00000000
--> stack 19AACC9FE0 : 0xaacca100 0x00000019 0xa2ab8660 0x00000019
--> stack 19AACC9FF0 : 0xa2ab8660 0x00000019 0x00000000 0x00000000
--> stack 19AACCA000 : 0x00000810 0x00000000 0xaacca520 0x00000019
--> stack 19AACCA010 : 0x5561df60 0x00000019 0xa2788a4e 0x00000019
--> stack 19AACCA020 : 0x0000000c 0x0000000f 0x00000800 0x00000000
--> stack 19AACCA030 : 0x0000000c 0x0000000f 0x00000050 0x00000000
--> Backtrace:
--> Backtrace[0] 00000019aacc7fa0 rip=0000001998ccf719 rbx=00000019aacca040 rbp=00000019aacc9000 r12=00000019aacc7fb0 r13=0000000000000000 r14=0000001998dbf600 r15=00000019aacca040
--> Backtrace[1] 00000019aacc9010 rip=0000001998ccfb1f rbx=f100001955646840 rbp=00000019aacc9020 r12=00000019aacc90b0 r13=00000019a2ab8660 r14=000000195565b3f0 r15=2448000000000000
--> Backtrace[2] 00000019aacc9030 rip=000000000038a00f rbx=f100001955646840 rbp=00000019aacc9f30 r12=00000019aacc90b0 r13=00000019a2ab8660 r14=000000195565b3f0 r15=2448000000000000
--> SymBacktrace[0] 00000019aacc7fa0 rip=0000001998ccf719 in function (null) in object /lib64/libvmacore.so loaded at 00000019988e6000
--> SymBacktrace[1] 00000019aacc9010 rip=0000001998ccfb1f in function (null) in object /lib64/libvmacore.so loaded at 00000019988e6000
--> SymBacktrace[2] 00000019aacc9030 rip=000000000038a00f
--> Crash Report complete
--> hostd-map-25xxx5.txt time=2023-09-07 07:44:47.000

Environment

VMware vSphere 8.02
VMware vSphere 7.0.x

Cause

The BMC on HPE ProLiant DL360 Gen11 servers may respond with unsupported data, causing hostd to handle it improperly. This leads to heap chunk corruption. When a thread accesses this corrupted chunk, it may access invalid memory addresses, resulting in a hostd crash.

Resolution

This fix for this issue will be included in the next available ESXi 7.x and 8.x patch release.

Workaround:
The workaround is to disable the hostd cimsvc plugin and avoid running the "esxcli hardware ipmi" commands. Please note that disabling the cimsvc plugin will turn-off the hardware health monitoring.

Please find the steps to disable the hostd cimsvc plugin below:

1. Run the command to create a temporary JSON file with hostd config:

  /bin/configstorecli config current get -c esx -g services -k hostd -outfile tmp.json

2. Run the command to edit the file:

/bin/vi tmp.json

Edit the following line in this file:

   "cimsvc": {
      "enabled": true
   },
   to
   "cimsvc": {
      "enabled": false
   },

3. Run the command to apply the file to the configstore database:

  /bin/configstorecli config current set -c esx -g services -k hostd -infile tmp.json

4. Run the command to restart hostd service:

  /etc/init.d/hostd restart