The ESX VMkernel and the Service Console Linux kernel run simultaneously on an ESX host. The Service Console Linux kernel runs a process called vmnixhbd, which sends a heartbeat to the VMkernel for as long as it can successfully allocate and free a page of memory. If no heartbeat is received within the 30-minute timeout period, the VMkernel triggers a COS Oops and a purple diagnostic screen that mentions a Lost Heartbeat. This timeout period is not configurable.
There are numerous possible reasons why the VMkernel does not receive a heartbeat. The most common cause is an unresponsive Service Console, which is usually the result of memory or CPU contention within the Service Console. This contention can be caused by a memory leak in a process, third-party software with a large memory footprint, or an excessive number of processes running in the Service Console. To prevent this, be aware of anything that can contribute to memory exhaustion. Other possible causes include a failed heartbeat daemon, a broken heartbeat mechanism, an RPC hang, or a VMkernel process thread scheduling issue.
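As a quick check on a running host, you can get a rough picture of memory pressure and the largest processes from the Service Console shell. This is a minimal sketch and assumes the standard procps tools shipped with the Service Console; the options available can vary by ESX release:
free -m                                            # overall RAM and swap usage, in MB
ps -eo pid,vsz,rss,pcpu,comm --sort=-vsz | head    # largest processes by virtual memory size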
Historical information from /var/log/vmksummary
The /var/log/vmksummary* files are written to once an hour. They record the amount of swap consumed and the three highest memory consumers in the Service Console at that time. The memory footprint of these top three processes is measured in kilobytes (KB). This provides both the consumption in the hour before the purple diagnostic screen and a historical trend for these metrics. This information is always included in a vm-support log bundle, regardless of whether a Service Console coredump is available. Memory leaks can be slow and are reflected in this file. However, if memory usage spikes rapidly within an hour, this log may not have a record of it.
Here is an example of a /var/log/vmksummary file:
Nov 13 03:01:35 esxhostname logger: (1258099291) hb: vmk loaded, 11795301.25, 11795288.970, 73, 153875, 153875, 567468032, VMap-583524, vmware-h-82812, webAcces-27332
Nov 13 04:01:45 esxhostname logger: (1258102905) hb: vmk loaded, 11798915.64, 11798902.253, 73, 153875, 153875, 567468032, VMap-586196, vmware-h-83180, webAcces-27292
Nov 13 07:13:58 esxhostname vmkhalt: (1258114439) Starting system…
Where:
- 11798915.64 is the uptime of the host in seconds. In this example, the host was up for 136 days.
- 73 is the number of virtual machines running. In this example, 73 virtual machines were affected by the failure.
- 567468032 is the amount of swap consumed, in bytes. In this example, 541MB of swap was consumed.
- VMap-586196 is the amount of virtual memory (measured in KB) that the VMap service (which is part of VMware HA) was consuming. In this example, the VMap service was consuming 572MB of virtual memory.
- vmware-h-83180 is the amount of virtual memory (measured in KB) that vmware-hostd (the core management agent) was consuming. In this example, vmware-hostd was consuming 81MB of virtual memory.
- webAcces-27292 is the amount of virtual memory (measured in KB) that webAccess (which provides browser-based management) was consuming. In this example, webAccess was using 26MB of virtual memory.
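These conversions can be verified with simple arithmetic on the Service Console. The sketch below plugs in the raw values from the example entry above:
awk 'BEGIN {
    printf "uptime: %d days\n", 11798915.64 / 86400    # seconds to days
    printf "swap:   %d MB\n",   567468032 / 1048576    # bytes to MB
    printf "VMap:   %d MB\n",   586196 / 1024           # KB to MB
}'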
In this example, in the last hour before the failure, the VMap component of VMware HA was consuming an excessive amount of memory and a large amount of swap was in use. Looking backward through the vmksummary log file shows a gradual increase in the memory footprint of the VMap process. This is suggestive of a memory leak.
If the vmksummary log does not show an obvious cause, the memory consumption may have spiked from normal usage to very high within the hour, or it may not have been a single process that caused the problem.
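Whether or not a single process stands out, it is worth pulling every hourly entry that mentions a suspect process out of the vmksummary files and checking how its footprint changes over time. A minimal sketch, using VMap only because it is the suspect in the example above (rotated copies may be compressed, in which case zgrep is needed instead):
grep -h "VMap-" /var/log/vmksummary*           # full hourly entries that mention the process
grep -ho "VMap-[0-9]*" /var/log/vmksummary*    # only its footprint values, in KB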
The COS coredump
When a Lost Heartbeat purple diagnostic screen occurs, two coredumps are written to disk:
- The file vmkernel-zdump-111309.07.13.1 contains both a VMkernel coredump and a log file, and is named after the date and time of the next successful startup. It is placed in /root/ or /var/core/, and rotated to /root/old_cores/ after collection.
- The file cos-core-esxhostname.123.core.gz contains the Service Console coredump and is named after the ESX host. It is placed in /root/old_cores/ or /var/core/.
During startup, the ESX host moves the coredumps to their final location. This move operation can fail, so check both locations; analysis cannot proceed without these files.
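For example, all of the candidate locations can be checked in one pass from the Service Console. This is a sketch only; adjust the paths if your environment stores coredumps elsewhere:
ls -lh /root/ /root/old_cores/ /var/core/ 2>/dev/null | grep -Ei 'zdump|cos-core'    # list any coredumps found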
Through analysis of the cos-coredump file, VMware Technical Support may be able to determine which processes were running at the time of the crash and the memory footprint of each. This information is not historical, but reflects the state of the system at the time the COS world was stunned to write the coredump.
Memory Allocation to the Service Console
Contrast the information from /var/log/vmksummary with the actual memory allocations (RAM and swap) made for the Service Console. On a working system, look at /proc/meminfo; this file is also included in the vm-support log bundle. Here is an example of actual memory allocations in /proc/meminfo:
MemTotal:     799732 kB
HighTotal:         0 kB
LowTotal:     799732 kB
SwapTotal:    554168 kB
In this example, there is 554168KB (541MB) of swap and 799732KB (780MB) of RAM assigned to the Service Console. These values reflect configured maximums. The consumption information here (free/used) is post-reboot and is not reflective of the state of the system at the time of the outage.
Comparing these values with the information known from /var/log/vmksummary, you can determine that 100% of the swap was consumed (567468032 bytes, or 541MB, of the 554168KB, or 541MB, configured) and that roughly 73% of the RAM was consumed by a single process (586196KB, or 572MB, of 799732KB, or 780MB).
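As a sanity check, these percentages can be reproduced from the raw values (vmksummary reports swap in bytes, while /proc/meminfo reports it in KB). The sketch below uses the figures from this example; substitute your own:
awk 'BEGIN {
    printf "swap consumed: %d%%\n", 100 * 567468032 / (554168 * 1024)    # vmksummary bytes vs SwapTotal
    printf "RAM consumed:  %d%%\n", 100 * 586196 / 799732                # VMap KB vs MemTotal
}'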
Action Plan
If you experience a Lost Heartbeat purple diagnostic screen, consider the following actions:
- In a given incident, the offending process may turn out to be a component shipped with ESX/VirtualCenter (cimserver, snmpd, vmware-hostd, vmap, and so on) or a third-party agent. If there is a memory leak, increasing the amount of memory assigned to the Service Console does not typically prevent the memory exhaustion and the purple diagnostic screen; it only delays them. If memory usage is generally static but under stress, VMware recommends increasing the amount of RAM or swap assigned to the Service Console. For more information, see Increasing the amount of RAM assigned to the ESX Server service console (1003501).
- Once the cause of the memory exhaustion is determined, the next steps to be taken depend on the process at fault. If third-party software is the cause, determine where it came from and consider removing it as per the Third Party Hardware and Software Support Policy. If a VMware component is responsible for leaking memory, engage VMware Technical Support for further investigation. For more information, see How to Submit a Support Request.
Note: Workarounds typically involve disabling the component that is causing the memory leak.
- In anticipation of a recurrence, run top in batch mode to gather historical process CPU and memory consumption statistics within the Service Console (a sample invocation is sketched after this list). For more information, see Using performance collection tools to gather data for fault analysis (1006797).
- If you do not have a cos-core file, ensure that the Misc.CosCoreFile advanced configuration option is valid (see the second sketch after this list). For more information, see Configuring an ESX host to capture a Service Console coredump (1032962).
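For the top-in-batch-mode suggestion above, a sample invocation might look like the following. The interval and iteration count are illustrative only, and the output file should live on a volume with enough free space; KB 1006797 describes the collection procedure VMware recommends:
nohup top -b -d 300 -n 288 >> /var/log/top-batch.log 2>&1 &    # one snapshot every 5 minutes for 24 hours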
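To review the Misc.CosCoreFile option from the Service Console, the esxcfg-advcfg utility can display its current value. This is a sketch and assumes classic ESX; KB 1032962 documents the supported way to configure the coredump location:
esxcfg-advcfg -g /Misc/CosCoreFile    # shows the current Service Console coredump target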