vSphere 9.0 adds monitoring for critical failures of a remote boot device that contains ESX-OSData partitions. vSphere expects such a remote boot device to be highly available, but the device can still fail for various reasons, such as 'All Paths Down' (APD) or 'Permanent Device Loss' (PDL). ESX continuously monitors a remote boot device for these conditions. If such a condition occurs and the device fails to recover within a certain interval, it is treated as a critical error, and the ESX host is halted to avoid running in a failed state and to avoid corruption. In addition, the vCenter Server may receive VMkernel Observation (VOB) system events for these failures. vSphere monitors the following failure scenarios.
**CRITICAL**: Lost access to boot device 'eui.xxxxxxx' - All paths are down.
Module(s) involved in panic: [bootdevmon Built on:...]
| Event Message | Event Type | Event ID | Note |
| --- | --- | --- | --- |
| Boot Device with identifier 'eui.xxxxxxxxxxxx' has entered the state: All Paths Down. Host will be halted if the boot device fails to recover in 160 seconds. | warning | esx.problem.bootdevice.apd.start | |
| Boot Device with identifier 'eui.xxxxxxxxxxxx' has entered the state: All Paths Down Timeout. Host will be halted if the boot device fails to recover in 20 seconds. | error | esx.problem.bootdevice.apd.timeout | |
| Boot Device with identifier 'eui.xxxxxxxxxxxx' has exited from the state: All Paths Down. | info | esx.clear.bootdevice.apd.exit | If the boot device is recovered from APD state in 160 seconds. |
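The countdowns in the two APD messages can be read as one timeline. The following is a minimal sketch, not a description of the ESX implementation. It assumes that both countdowns end at the same halt deadline, which would place the timeout event about 140 seconds after the APD start event. The source states only the 160-second and 20-second values. All constant and function names are hypothetical.

```python
# Countdown values taken from the APD event messages.
APD_HALT_AFTER_S = 160        # announced by esx.problem.bootdevice.apd.start
APD_TIMEOUT_REMAINING_S = 20  # announced by esx.problem.bootdevice.apd.timeout

# Assumption: both countdowns end at the same halt deadline, so the timeout
# event is expected this many seconds after the start event.
APD_TIMEOUT_OFFSET_S = APD_HALT_AFTER_S - APD_TIMEOUT_REMAINING_S


def seconds_until_halt(elapsed_since_apd_start: float) -> float:
    """Return the seconds left before the host would halt, never below zero."""
    return max(0.0, APD_HALT_AFTER_S - elapsed_since_apd_start)


def expected_event(elapsed_since_apd_start: float) -> str:
    """Return the event ID expected to be the most recent one at this time."""
    if elapsed_since_apd_start < APD_TIMEOUT_OFFSET_S:
        return "esx.problem.bootdevice.apd.start"
    return "esx.problem.bootdevice.apd.timeout"
```

Under this assumption, a device that has been in the APD state for 150 seconds has 10 seconds left to recover before the host is halted.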
A storage device is considered to be in the permanent device loss (PDL) state when it becomes permanently unavailable to your ESX host. Typically, the PDL condition occurs when a device is unintentionally removed, when its unique ID changes, or when it experiences an unrecoverable hardware error. When a PDL occurs and the boot device fails to recover, the ESX host is halted with the following purple diagnostic screen.
**CRITICAL**: Lost access to boot device 'eui.xxxxxxxxxxxx' - Permanent device loss.
Module(s) involved in panic: [bootdevmon Built on:...]
The vCenter Server may receive the following VOB events from the ESX host:
| Event Message | Event Type | Event ID | Note |
| --- | --- | --- | --- |
| Boot Device with identifier 'eui.xxxxxxxxxxxx' has entered the state: Permanent Device Loss. Host will be halted if the boot device fails to recover in 20 seconds. | error | esx.problem.bootdevice.pdl | |
| Boot Device with identifier 'eui.xxxxxxxxxxxx' is accessible again. | info | esx.clear.bootdevice.pdl.restored | If the boot device is recovered from PDL state in 20 seconds. |
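When processing these events in a script, it can help to keep the event IDs, event types, and countdowns from the APD and PDL tables in one lookup. The following is a minimal sketch. The event IDs, types, and countdown values are copied from the tables, while the names `BootDeviceEvent`, `BOOT_DEVICE_EVENTS`, and `needs_attention` are hypothetical and not part of any VMware API.

```python
from typing import NamedTuple, Optional


class BootDeviceEvent(NamedTuple):
    severity: str                    # value of the "Event Type" column
    halt_in_seconds: Optional[int]   # countdown stated in the message, if any


# Event IDs, types, and countdowns copied from the APD and PDL tables.
BOOT_DEVICE_EVENTS = {
    "esx.problem.bootdevice.apd.start": BootDeviceEvent("warning", 160),
    "esx.problem.bootdevice.apd.timeout": BootDeviceEvent("error", 20),
    "esx.clear.bootdevice.apd.exit": BootDeviceEvent("info", None),
    "esx.problem.bootdevice.pdl": BootDeviceEvent("error", 20),
    "esx.clear.bootdevice.pdl.restored": BootDeviceEvent("info", None),
}


def needs_attention(event_id: str) -> bool:
    """True for known boot device events that announce a pending host halt."""
    event = BOOT_DEVICE_EVENTS.get(event_id)
    return event is not None and event.halt_in_seconds is not None
```

For example, `needs_attention("esx.problem.bootdevice.pdl")` is true, while the two `esx.clear.*` events and any unknown event ID return false.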
During boot, ESX checks the availability of the system storage on the boot device (local or remote). If the boot device is not found or is inaccessible for any reason, and the host was not in maintenance mode before the reboot, ESX stops the boot and displays the following purple diagnostic screen:
The system has found a problem on your machine and cannot continue.
Unable to find boot device: '<Device ID>'.
If ESX was in maintenance mode before the reboot, the host stays in maintenance mode. On the ESX host, there is a SysAlert "Failed to find boot device after 120 seconds.". In the vSphere Client, the ESX host is kept in maintenance mode, and the "Exit maintenance mode" operation fails with the error “A general system error occurred: Cannot exit maintenance mode due to failure during boot. A critical failure was detected during system boot. The host is currently not able to exit maintenance mode and run workloads. Refer to VMware KB93107 for details”.
ESX 9.0
This problem occurs when the boot device is inaccessible during ESX boot. This can occur for various reasons such as permanent device loss, misconfiguration of ESX network/storage settings, connectivity issues with the fabric, or problems with the Storage Array.
To resolve this issue, identify and fix the cause of the storage connectivity failure, such as a problem with the storage array, a SAN switch, or the device itself.
To investigate host-side issues, it may be necessary to temporarily disable boot device monitoring by following the steps below.
Note: Disable boot device monitoring only for debugging boot device issues. Once the issue is resolved, enable boot device monitoring again.
Temporary disabling:
Add the boot option 'haltOnBootDeviceLoss=FALSE'. While monitoring is disabled, the following event is reported:
| Event Message | Event Type | Event ID |
| --- | --- | --- |
| Host is not in compliance, remote boot device monitoring is disabled. | warning | esx.problem.bootdevice.monitor.disabled |
Persistent disabling (Persistent when boot device is accessible):
# esxcli system settings kernel set -s haltOnBootDeviceLoss -v FALSE
| Event Message | Event Type | Event ID |
| --- | --- | --- |
| Host is not in compliance, remote boot device monitoring is disabled. | warning | esx.problem.bootdevice.monitor.disabled |
Re-enabling boot device monitoring:
# esxcli system settings kernel set -s haltOnBootDeviceLoss -v TRUE
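Because monitoring should be enabled again after debugging, it can be useful to check which hosts have reported the `esx.problem.bootdevice.monitor.disabled` event. The following is a minimal sketch that assumes the event IDs per host have already been collected into a dictionary. The input shape, the function name, and the host names are hypothetical. A reported event only shows that monitoring was disabled at some point, so confirm the current setting on the host itself.

```python
MONITOR_DISABLED_EVENT = "esx.problem.bootdevice.monitor.disabled"


def hosts_reporting_disabled_monitoring(events_by_host):
    """Return the sorted names of hosts that reported the monitor-disabled event."""
    return sorted(
        host
        for host, event_ids in events_by_host.items()
        if MONITOR_DISABLED_EVENT in event_ids
    )


# Hypothetical sample data; the host names are made up for illustration.
sample = {
    "esx-01.example.com": ["esx.problem.bootdevice.monitor.disabled"],
    "esx-02.example.com": ["esx.clear.bootdevice.pdl.restored"],
}
print(hosts_reporting_disabled_monitoring(sample))
```

With the sample data, only `esx-01.example.com` is listed as a candidate for re-enabling boot device monitoring.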