Remote Boot Device Failure Monitoring

Article ID: 415836

Updated On:

Products

VMware Cloud Foundation

Issue/Introduction

  • A remote boot device is monitored for critical failures.
  • ESX expects a remote boot device to be highly available, but the device can still fail for reasons such as 'All Paths Down' or 'Permanent Device Loss'.
  • For a remote boot device, these situations are continuously monitored in ESX.
  • If such a situation occurs and the device fails to recover within a certain interval, ESX treats it as a critical error.
  • ESX sends out system alerts and VMkernel Observations (VOBs).
  • The vCenter Server may receive VOB events for these failures.

 

  • The following failure scenarios are monitored by ESX:

All Paths Down (APD)

  • An All-Paths-Down (APD) situation occurs when all paths to the boot device are down.
  • This situation begins with the start of the 'All Paths Down' event.
  • The ESX host enters a timeout period (140 seconds by default) and keeps trying to re-establish connectivity to the boot device.
  • If the boot device has not recovered when this timeout period ends, an 'All Paths Down' timeout event occurs.
  • The vCenter Server may receive the following VOB events from the ESX host:

Boot device 'eui.xxxxxxxxxxxx' is not accessible. Current state: All Paths Down.
Boot device 'eui.xxxxxxxxxxxx' is not accessible. Current state: All Paths Down Timeout.
Boot device 'eui.xxxxxxxxxxxx' is accessible now. Recovered from All Paths Down state.
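The relationship between the timeout period and the two "not accessible" states can be sketched as a small shell function. This is only an illustration of the timeline described above, not how ESX implements it: the function name apd_phase, the keywords it prints, and the optional second argument are hypothetical.

```shell
# Report which APD phase a boot device would be in, given the number of
# seconds elapsed since the 'All Paths Down' event started.
# 140 is the default timeout quoted in this article; the optional second
# argument is a hypothetical override used only for illustration.
apd_phase() {
    elapsed=$1
    timeout=${2:-140}
    if [ "$elapsed" -lt "$timeout" ]; then
        # Still inside the timeout period: the host keeps retrying.
        echo "retrying"
    else
        # Timeout period has ended without recovery.
        echo "timeout"
    fi
}
```

For example, `apd_phase 30` falls inside the default timeout period, while `apd_phase 140` corresponds to the 'All Paths Down Timeout' state.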


For more information, refer to "All Paths Down for a storage device".

Permanent Device Loss (PDL)

  • A storage device is considered to be in the permanent device loss (PDL) state when it becomes permanently unavailable to your ESX host.
  • Typically, the PDL condition occurs when a device is unintentionally removed, or its unique ID changes, or when the device experiences an unrecoverable hardware error.
  • When a PDL occurs, ESX sends out periodic system alerts, including VOB messages, indicating this error.
  • The vCenter Server may receive the following VOB events from the ESX host:

Boot device 'eui.xxxxxxxxxxxx' is not accessible. Current state: Permanent Device Loss
Boot device 'eui.xxxxxxxxxxxx' is accessible now. Recovered from Permanent Device Loss state
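When these events are collected as plain text (for example, exported from an event view), the APD and PDL messages can be told apart by their wording. The following is a minimal sketch that matches only the message texts quoted in this article. The function names and the returned keywords are hypothetical, and the exact wording on a given build may differ.

```shell
# Classify a boot device VOB message into a short state keyword.
# Patterns come from the message texts quoted in this article; the keywords
# (apd, apd_timeout, ...) are illustrative and are not ESX identifiers.
classify_boot_vob() {
    case "$1" in
        # The timeout pattern must come first: it also contains "All Paths Down".
        *"Current state: All Paths Down Timeout"*)      echo apd_timeout ;;
        *"Current state: All Paths Down"*)              echo apd ;;
        *"Current state: Permanent Device Loss"*)       echo pdl ;;
        *"Recovered from All Paths Down state"*)        echo apd_recovered ;;
        *"Recovered from Permanent Device Loss state"*) echo pdl_recovered ;;
        *)                                              echo unknown ;;
    esac
}

# Print the device identifier found between the first pair of single quotes.
# Prints nothing if the message does not start with "Boot device '...'".
boot_vob_device() {
    printf '%s\n' "$1" | sed -n "s/^Boot device '\([^']*\)'.*/\1/p"
}
```

A script could call `classify_boot_vob "$line"` for each event line and act only on the `apd_timeout` and `pdl` results, which correspond to the critical states described above.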

 

Remote boot device loss on ESX boot

Host booting with maintenance mode disabled:

  • During boot, ESX checks the availability of the system storage in the boot device (local or remote).
  • If the boot device is not found or is inaccessible for any reason, ESX stops the boot and displays a purple diagnostic screen with output similar to the following:

The system has found a problem on your machine and cannot continue.

Boot device containing volume '<uuid>' is not accessible.

Host booting with maintenance mode enabled:

  • ESX remains in maintenance mode if it was in maintenance mode before the reboot.
  • The ESX host raises the following SysAlert:

Failed to find boot device after 120 seconds

  • In the vSphere Client, the ESX host remains in maintenance mode, and the "Exit maintenance mode" operation fails with the following error:

A general system error occurred: Cannot exit maintenance mode due to failure during boot. A critical failure was detected during system boot. The host is currently not able to exit maintenance mode and run workloads

Environment

ESX 9.1

Cause

  • This problem occurs when the boot device is inaccessible during ESX boot or during ESX runtime.
  • This can occur for various reasons, such as permanent device loss, misconfiguration of ESX network or storage settings, connectivity issues with the fabric, or problems with the storage array.

Resolution

  • To resolve this issue, identify and fix the cause of the storage connectivity failure to the boot device, such as a problem with the storage array, the SAN switch, or the device itself.
  • The ESX host may require a reboot to remove any residual references to the affected boot device.
  • If necessary, boot device failure monitoring can be disabled, either temporarily or persistently, as follows:

Disabling Temporarily:

1. Add the boot option 'systemStorageFailureMonitoringEnabled=FALSE':

    1. Power on the ESX host.
    2. When the ESX boot loader window appears, press Shift+O to edit boot options.
    3. Append the text systemStorageFailureMonitoringEnabled=FALSE to the boot options.
    4. Press Enter to proceed with the boot.

2. Once boot device monitoring is disabled, vCenter Server may receive the following VOB event from the ESX host while the host is booting:

  Host is not in compliance, remote boot device monitoring is disabled.

3. Debug the boot device failure issue.
4. Reboot the host to enable boot device monitoring again.

 

Disabling Persistently (persistent only when the boot device is accessible):

1. Put the host in maintenance mode.
2. SSH to the host and run the following esxcli command:

  # esxcli system settings kernel set -s systemStorageFailureMonitoringEnabled -v FALSE

3. Once boot device monitoring is disabled, vCenter Server may receive the following VOB event from the ESX host while the host is booting:

  Host is not in compliance, remote boot device monitoring is disabled

4. Debug the boot device failure issue.
5. Re-enable boot device failure monitoring (persistent only when the boot device is accessible):

  # esxcli system settings kernel set -s systemStorageFailureMonitoringEnabled -v TRUE

6. Reboot the host.
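
To confirm the value before and after changing it, the setting can be listed from the same SSH session. The sketch below assumes that `esxcli system settings kernel list -o` accepts this setting name in the same way it does for other kernel settings. The function name and the ESXCLI variable are hypothetical and only allow the command line to be tried outside an ESX host.

```shell
# Show the kernel setting that controls boot device failure monitoring.
# ESXCLI is a hypothetical override hook for running this off-host;
# on an ESX host it is left unset and the real esxcli is used.
show_boot_monitoring_setting() {
    "${ESXCLI:-esxcli}" system settings kernel list -o systemStorageFailureMonitoringEnabled
}
```

On the host, running `show_boot_monitoring_setting` after the `set` command and again after the reboot shows whether the change has been recorded and whether it has taken effect.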