HP Automatic Server Recovery (ASR) in an ESX environment

Article ID: 311454

Products

VMware vSphere ESXi

Issue/Introduction

This article provides information about HP Automatic Server Recovery (ASR) and ASR events.

Symptoms:
If an ASR event occurs, the server hardware is restarted and the Integrated Management Log (IML) shows the message:
An ASR has occurred


Environment

VMware ESX Server 2.5.x
VMware ESX 4.0.x
VMware ESX Server 1.5.x
VMware ESX Server 3.0.x
VMware ESX Server 1.x
VMware ESX Server 2.0.x
VMware ESX Server 3.5.x
VMware ESX Server 2.1.x
VMware ESX 4.1.x

Resolution

What is ASR's function?

ASR is an HP-provided capability that reboots the system automatically when the hardware determines that the operating system has become unresponsive. This can reduce downtime for business-facing applications.

How does it work?

ASR consists of two components on an ESX host: a hardware heartbeat timer and a Health Monitor running in the operating system. By default, the heartbeat timer starts at 10 minutes and counts down toward zero. The Health Monitor is responsible for reloading the timer frequently so that it does not expire. On ESX hosts, this agent runs in the Service Console and is usually installed as part of HP Insight Manager.

If the timer reaches zero, ASR assumes that the operating system has become unresponsive and reboots the server.
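The countdown-and-reload behavior described above can be sketched as a small simulation. This is only an illustrative model, not HP's implementation: the names `HeartbeatTimer` and `simulate` are hypothetical, and the only value taken from this article is the 10-minute (600-second) default.

```python
from __future__ import annotations


class HeartbeatTimer:
    """Toy model of a hardware heartbeat (watchdog) timer, in seconds."""

    def __init__(self, timeout_s: int = 600) -> None:
        # 600 s is the 10-minute default mentioned in this article.
        self.timeout_s = timeout_s
        self.remaining_s = timeout_s
        self.rebooted = False

    def reload(self) -> None:
        """What the Health Monitor does: restart the countdown."""
        self.remaining_s = self.timeout_s

    def tick(self, elapsed_s: int) -> None:
        """Advance time; flag a reboot if the countdown reaches zero."""
        if self.rebooted:
            return
        self.remaining_s = max(0, self.remaining_s - elapsed_s)
        if self.remaining_s == 0:
            self.rebooted = True


def simulate(reload_gaps_s: list[int], timeout_s: int = 600) -> bool:
    """Return True if any gap between reloads lets the timer expire."""
    timer = HeartbeatTimer(timeout_s)
    for gap in reload_gaps_s:
        timer.tick(gap)
        if timer.rebooted:
            return True
        timer.reload()
    return False


# Healthy monitor: reloads every minute, so the timer never expires.
print(simulate([60, 60, 60]))   # False
# Monitor stalls for 700 s (longer than the 600 s timeout): ASR fires.
print(simulate([60, 700]))      # True
```

In this model the timer cannot tell why the reload was missed. A crashed operating system and a Health Monitor that was merely delayed look identical to the hardware.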

ASR in an ESX environment

ESX hosts consist of two primary components: the VMkernel and the Service Console. At a high level, the VMkernel is responsible for running the virtual machines, while the Service Console provides a management interface to the ESX host. Agents can be installed in the Service Console to provide management instrumentation and other functions, including the Health Monitor.

The Health Monitor running in the Service Console may occasionally fail to reset the heartbeat timer. Reasons include:
  • A system failure
  • Excessive load on the Service Console, preventing the Health Monitor from getting CPU time
  • A purple diagnostic screen error

When to disable ASR on ESX hosts

There are three primary arguments for disabling ASR on ESX hosts:
  1. Unintended virtual machine outages: If the heartbeat timer reaches zero because of a problem within the Service Console (for example, high CPU or memory utilization, or an agent failure), ASR may determine that the server has failed even though the virtual machines are still functioning. With ASR disabled, an administrator can attempt to migrate the virtual machines off the host before restarting it. With ASR enabled, the host is rebooted and the virtual machines on it fail, causing an outage to business-facing applications that might have been avoided (by migrating or working with support) or minimized (by scheduling a maintenance window if migration fails).

  2. Loss of diagnostic data: If the heartbeat timer reaches zero because of a purple diagnostic screen error and ASR reboots the system, it may become impossible to determine the root cause, since diagnostic data related to the crash is lost upon restart. In addition, if there is a delay between the Service Console becoming unresponsive and the resulting purple diagnostic screen error, ASR could reboot the system before the error is generated, so neither the purple diagnostic screen nor its diagnostic data is ever produced. The purple diagnostic screen error contains a wealth of information that can help pinpoint a root cause.

  3. Increasing the ASR timer may not help: The ASR timer can be increased from 10 minutes to as high as 30 or 60 minutes. However, doing so may reduce ASR's effectiveness. Its intent is to minimize downtime, and 30 or 60 minutes is a long time for a system to be unresponsive without operator intervention. Further, even with the timer set that high, ASR reboots can still occur, which hampers the administrator's ability to troubleshoot the issue.

A reboot caused by ASR is always a symptom of the root cause, not the root cause itself. Incidentally, both HP and VMware engineers indicate that they troubleshoot ASR reboots by disabling ASR, which allows them to gather diagnostic data the next time the crash occurs.
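The timer trade-off in the third argument can be illustrated with simple arithmetic. The sketch below is a simplified assumption, not HP tooling: the function names are hypothetical, the 15-minute stall is an invented example value, and the stall or hang is assumed to begin immediately after the last timer reload.

```python
def time_to_reboot_s(timeout_s: int, since_last_reload_s: int = 0) -> int:
    """Seconds a genuinely hung host stays down before ASR reboots it."""
    return max(0, timeout_s - since_last_reload_s)


def stall_triggers_asr(stall_s: int, timeout_s: int) -> bool:
    """True if a temporary Health Monitor stall outlasts the timer."""
    return stall_s >= timeout_s


# Hypothetical example: the Service Console is starved for 15 minutes.
EXAMPLE_STALL_S = 15 * 60

for minutes in (10, 30, 60):
    timeout = minutes * 60
    print(
        f"timer={minutes} min: "
        f"hung host waits {time_to_reboot_s(timeout) // 60} min, "
        f"15-min stall reboots host: {stall_triggers_asr(EXAMPLE_STALL_S, timeout)}"
    )
```

Under these assumptions, raising the timer from 10 to 30 minutes would ride out the 15-minute stall, but a host that really has hung then stays unresponsive for up to 30 minutes. A stall longer than the new timeout still ends in a reboot.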

For more information on disabling the ASR feature, see the HP documentation or contact HP support.

Additional Information

This article is also available in Chinese: ESX 环境中的 HP Automatic Server Recovery (ASR)