Excessive hardware alarms may be triggered when sensors are reset to an unknown state

Article ID: 338058

Updated On:

Products

VMware vSphere ESXi

Issue/Introduction

Symptoms:
In ESXi 6.5 and 6.7, hostd may log events like the following when the wbem service is unable to get the hardware state.

[YYYY-MM-DDTHH:MM:SS]Z error hostd[2099863] [Originator@6876 sub=Default opID=4d3fd9c8-85-2e67 user=vpxuser:management] IpmiIfcSdrReadRecordId:record id: 3D, error 192. Try again... 
[YYYY-MM-DDTHH:MM:SS]Z error hostd[2099863] [Originator@6876 sub=Default opID=4d3fd9c8-85-2e67 user=vpxuser:management] IpmiIfcSdrReadRecordId: retry expired. 
[YYYY-MM-DDTHH:MM:SS]Z warning hostd[2099863] [Originator@6876 sub=Default opID=4d3fd9c8-85-2e67 user=vpxuser:management] IpmiIfcSensorGetReading: Sensor Number 0x87, failed send cc = 0xc0   
[YYYY-MM-DDTHH:MM:SS]Z warning hostd[2099863] [Originator@6876 sub=Cimsvc opID=4d3fd9c8-85-2e67 user=vpxuser:management] Retrieve Health status failed, sensors reset to unknown state ==> All sensors get reset to Unknown. 
[YYYY-MM-DDTHH:MM:SS]Z verbose hostd[2099863] [Originator@6876 sub=Default opID=4d3fd9c8-85-2e67 user=vpxuser:management] count_events: starting communication with bmc over ipmi driver ==> Loading SEL data from IPMI.
[YYYY-MM-DDTHH:MM:SS]Z error hostd[2099863] [Originator@6876 sub=Default opID=4d3fd9c8-85-2e67 user=vpxuser:management] count_events: ipmi returned invalid data block: data_len: 1 ccode 192  ==> Failure due to IPMI node busy. 
[YYYY-MM-DDTHH:MM:SS]Z error hostd[2099863] [Originator@6876 sub=Default opID=4d3fd9c8-85-2e67 user=vpxuser:management] sync_device_eventlog: communicate with bmc failed, no hardware sel data.
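
To check how often a host logs this pattern, the signature messages above can be counted in the hostd log. The following is a minimal sketch, not a VMware-provided tool: the function name count_ipmi_busy_events is hypothetical, and the default path /var/log/hostd.log is an assumption about where hostd writes its log on the host.

```shell
#!/bin/sh
# Hypothetical helper: count hostd log lines that match the signatures
# shown in the Symptoms section. Pass the log file path as $1.
count_ipmi_busy_events() {
    log_file="${1:-/var/log/hostd.log}"   # default path is an assumption
    # -c prints the number of matching lines; -E enables alternation with |
    grep -c -E 'sensors reset to unknown state|ipmi returned invalid data block|IpmiIfcSdrReadRecordId: retry expired' "$log_file"
}
```

Calling count_ipmi_busy_events with no argument scans the assumed default path; passing a path scans a saved copy of the log instead. A count that keeps growing between checks suggests the BMC is repeatedly answering as busy.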

Environment

VMware vSphere ESXi 6.7
VMware vSphere ESXi 6.5

Cause

These events are generated when IPMI queries fail with completion code 192 (0xc0), which indicates that the IPMI node was busy. When the cimsvc service fails to fetch the data, it resets all sensors to an Unknown state.
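
The log excerpts report the same completion code in two notations: "error 192" and "ccode 192" in decimal, and "cc = 0xc0" in hexadecimal. As a small illustration of that equivalence (the helper name ccode_to_hex is hypothetical), the conversion can be done in the shell:

```shell
#!/bin/sh
# Hypothetical helper: print a decimal IPMI completion code in hexadecimal.
ccode_to_hex() {
    printf '0x%02x\n' "$1"
}

ccode_to_hex 192   # prints 0xc0
```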

Resolution

This issue is resolved in VMware ESXi 6.7, Patch Release ESXi670-202008001.

Workaround:
To work around this issue, apply one of the following options.

Option 1:
Disable wbem using the following command:

$ esxcli system wbem set -e 0
 
If wbem is disabled, numeric sensor data is refreshed via LoadStatusFromIPMI(), which does not reset the sensor states to unknown and therefore does not cause excessive false alarms for numeric sensors. However, with wbem disabled, queries to sfcbd such as getClass, getInstance, and enumInstances no longer work.

VMware vSphere 6.5 can report IPMI data with or without wbem services running. On a new install, wbem services are off by default.
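
Option 1 can be wrapped in a check so that wbem is disabled only when it is currently enabled. This is a minimal sketch under stated assumptions: the function names and the ESXCLI override variable are hypothetical, and the script assumes that the output of "esxcli system wbem get" contains a line of the form "Enabled: true" or "Enabled: false".

```shell
#!/bin/sh
# ESXCLI can be overridden for testing; it defaults to the real esxcli.
ESXCLI="${ESXCLI:-esxcli}"

# Returns 0 if 'esxcli system wbem get' reports the service as enabled.
# Assumption: the output contains a line of the form "Enabled: true".
wbem_is_enabled() {
    "$ESXCLI" system wbem get | grep -q -i -E '^[[:space:]]*Enabled:[[:space:]]*true'
}

# Disable wbem only if it is currently enabled, then report the final state.
disable_wbem_if_enabled() {
    if wbem_is_enabled; then
        "$ESXCLI" system wbem set -e 0
    fi
    if wbem_is_enabled; then
        echo "wbem is still enabled"
        return 1
    fi
    echo "wbem is disabled"
}
```

Running disable_wbem_if_enabled on the host performs the same change as the command above and confirms the result afterwards.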

Option 2:
Disable the CIMSVC plug-in in hostd.

Disabling CIMSVC prevents the plug-in from polling IPMI for hardware health information, so health status data is not monitored or reported for any of the sensors. However, because wbem (sfcb) is still running, queries such as getClass, getInstance, and enumInstances continue to work.

Place the host in maintenance mode before proceeding. 
  • SSH to the ESXi host. 
  • Run the following command to stop the hostd process:
    $ /etc/init.d/hostd stop

  • Take a backup of the /etc/vmware/hostd/config.xml file. 
  • Edit the file and, in the following cimsvc section, change the enabled value from true to false:
    <cimsvc>
       <path>libcimsvc.so</path>
       <enabled>true</enabled>
    </cimsvc>
  • Start the hostd process again:
    $ /etc/init.d/hostd start
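
The backup and edit steps above can be sketched as a single shell function. This is an illustrative sketch rather than a VMware-provided procedure: the function name disable_cimsvc_in_config and the .bak backup suffix are hypothetical, and the script assumes each of the cimsvc tags sits on its own line, as in the excerpt above.

```shell
#!/bin/sh
# Hypothetical helper: back up a hostd config file, then change
# <enabled>true</enabled> to false inside the <cimsvc> element only.
disable_cimsvc_in_config() {
    cfg="$1"
    [ -f "$cfg" ] || { echo "config file not found: $cfg" >&2; return 1; }
    # Keep a copy of the original next to the file (suffix is an assumption).
    cp -p "$cfg" "$cfg.bak" || return 1
    # Limit the substitution to the lines from <cimsvc> to </cimsvc> so that
    # <enabled> values of other plug-ins are left untouched.
    sed '/<cimsvc>/,/<\/cimsvc>/ s|<enabled>true</enabled>|<enabled>false</enabled>|' \
        "$cfg.bak" > "$cfg"
}
```

On a host, the call disable_cimsvc_in_config /etc/vmware/hostd/config.xml would go between the hostd stop and hostd start commands listed in the steps. Copying $cfg.bak back over the file restores the original setting.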

Additional Information

Impact/Risks:
Turning CIMSVC off will stop the plugin from polling for IPMI data and prevent reporting of hardware health information.