The ESX VMkernel and the Service Console Linux kernel run simultaneously on an ESX host. The Service Console Linux kernel runs a process called vmnixhbd, which sends a heartbeat to the VMkernel for as long as it can successfully allocate and free a page of memory. If no heartbeat is received within the 30-minute timeout period, the VMkernel triggers a COS Oops and a purple diagnostic screen that mentions a Lost Heartbeat. This timeout period is not configurable.
There are numerous possible reasons why the VMkernel does not receive a heartbeat. The most common cause is an unresponsive Service Console, which is usually the result of memory or CPU contention within the Service Console. This contention can be caused by a memory leak in a process, third-party software with a large memory footprint, or an excessive number of processes running in the Service Console. To prevent this, be aware of anything that can contribute to memory exhaustion. Other possible causes include a failed heartbeat daemon, a broken heartbeat mechanism, an RPC hang, or a VMkernel process thread scheduling issue.
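As a quick check on a running host, you can get a rough picture of memory pressure and the largest processes from the Service Console shell. This is a minimal sketch and assumes the standard procps tools shipped with the Service Console; the options available can vary by ESX release:
free -m                                            # overall RAM and swap usage, in MB
ps -eo pid,vsz,rss,pcpu,comm --sort=-vsz | head    # largest processes by virtual memory size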
Historical information from /var/log/vmksummary
The /var/log/vmksummary* files are written to once an hour. They record the amount of swap consumed and the three highest memory consumers in the Service Console at that time. The memory footprint of these top three processes is measured in kilobytes (KB). This provides both the consumption in the hour before the purple diagnostic screen and a historical trend for these metrics. This information is always included in a vm-support log bundle, regardless of whether a Service Console coredump is available. Memory leaks can be slow and are reflected in this file. However, if memory usage spikes rapidly within an hour, this log may not have a record of it.
Here is an example of a /var/log/vmksummary file:
Nov 13 03:01:35 esxhostname logger: (1258099291) hb: vmk loaded, 11795301.25, 11795288.970, 73, 153875, 153875, 567468032, VMap-583524, vmware-h-82812, webAcces-27332
Nov 13 04:01:45 esxhostname logger: (1258102905) hb: vmk loaded, 11798915.64, 11798902.253, 73, 153875, 153875, 567468032, VMap-586196, vmware-h-83180, webAcces-27292
Nov 13 07:13:58 esxhostname vmkhalt: (1258114439) Starting system…
Where:
- 11798915.64 is the uptime of the host in seconds. In this example, the host was up for 136 days.
- 73 is the number of virtual machines running. In this example, 73 virtual machines were affected by the failure.
- 567468032 is the amount of swap consumed, in bytes. In this example, 541MB of swap was consumed.
- VMap-586196 is the amount of virtual memory (measured in KB) that the VMap service (which is part of VMware HA) was consuming. In this example, the VMap service was consuming 572MB of virtual memory.
- vmware-h-83180 is the amount of virtual memory (measured in KB) that vmware-hostd (the core management agent) was consuming. In this example, vmware-hostd was consuming 81MB of virtual memory.
- webAcces-27292 is the amount of virtual memory (measured in KB) that webAccess (which provides browser-based management) was consuming. In this example, webAccess was using 26MB of virtual memory.
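These conversions can be verified with simple arithmetic on the Service Console. The sketch below plugs in the raw values from the example entry above:
awk 'BEGIN {
    printf "uptime: %d days\n", 11798915.64 / 86400    # seconds to days
    printf "swap:   %d MB\n",   567468032 / 1048576    # bytes to MB
    printf "VMap:   %d MB\n",   586196 / 1024           # KB to MB
}'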
In this example, in the last hour before the failure, the VMap component of VMware HA was consuming an excessive amount of memory and a large amount of swap was in use. Looking backward through the vmksummary log file shows a gradual increase in the memory footprint of the VMap process. This is suggestive of a memory leak.
If the vmksummary log does not show an obvious cause, the memory consumption may have spiked from normal usage to very high within the hour, or it may not have been a single process that caused the problem.
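Whether or not a single process stands out, it is worth pulling every hourly entry that mentions a suspect process out of the vmksummary files and checking how its footprint changes over time. A minimal sketch, using VMap only because it is the suspect in the example above (rotated copies may be compressed, in which case zgrep is needed instead):
grep -h "VMap-" /var/log/vmksummary*           # full hourly entries that mention the process
grep -ho "VMap-[0-9]*" /var/log/vmksummary*    # only its footprint values, in KB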
The COS coredump
When a Lost Heartbeat purple diagnostic screen occurs, two coredumps are written to disk:
- The file vmkernel-zdump-111309.07.13.1 contains both a VMkernel coredump and a log file, and is named after the date and time of the next successful startup. It is placed in /root/ or /var/core/, and rotated to /root/old_cores/ after collection.
- The file cos-core-esxhostname.123.core.gz contains the Service Console coredump and is named after the ESX host. It is placed in /root/old_cores/ or /var/core/.
During startup, the ESX host moves the coredumps to their final location. This move operation can fail, so check both locations; analysis cannot proceed without these files.
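For example, all of the candidate locations can be checked in one pass from the Service Console. This is a sketch only; adjust the paths if your environment stores coredumps elsewhere:
ls -lh /root/ /root/old_cores/ /var/core/ 2>/dev/null | grep -Ei 'zdump|cos-core'    # list any coredumps found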
Through analysis of the cos-coredump file, VMware Technical Support may be able to determine which processes were running at the time of the crash and the memory footprint of each. This information is not historical, but reflects the state of the system at the time the COS world was stunned to write the coredump.
Memory Allocation to the Service Console
Contrast the information from /var/log/vmksummary with the actual memory allocations (RAM and swap) made for the Service Console. On a working system, look at /proc/meminfo; this file is also included in the vm-support log bundle. Here is an example of actual memory allocations in /proc/meminfo:
MemTotal:     799732 kB
HighTotal:         0 kB
LowTotal:     799732 kB
SwapTotal:    554168 kB
In this example, there is 554168KB (541MB) of swap and 799732KB (780MB) of RAM assigned to the Service Console. These values reflect configured maximums. The consumption information here (free/used) is post-reboot and is not reflective of the state of the system at the time of the outage.
Comparing these values with the information known from /var/log/vmksummary, you can determine that 100% of the swap was consumed (567468032 bytes, or 541MB, of the 554168KB, or 541MB, configured) and that roughly 73% of the RAM was consumed by a single process (586196KB, or 572MB, of 799732KB, or 780MB).
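As a sanity check, these percentages can be reproduced from the raw values (vmksummary reports swap in bytes, while /proc/meminfo reports it in KB). The sketch below uses the figures from this example; substitute your own:
awk 'BEGIN {
    printf "swap consumed: %d%%\n", 100 * 567468032 / (554168 * 1024)    # vmksummary bytes vs SwapTotal
    printf "RAM consumed:  %d%%\n", 100 * 586196 / 799732                # VMap KB vs MemTotal
}'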
Action Plan
If you experience a Lost Heartbeat purple diagnostic screen, consider the following actions:
- In a given incident, the offending process may turn out to be a component shipped with ESX/VirtualCenter (cimserver, snmpd, vmware-hostd, vmap, and so on) or a third-party agent. If there is a memory leak, increasing the amount of memory assigned to the Service Console does not typically prevent the memory exhaustion and the purple diagnostic screen; it only delays them. If memory usage is generally static but under stress, VMware recommends increasing the amount of RAM or swap assigned to the Service Console. For more information, see Increasing the amount of RAM assigned to the ESX Server service console (1003501).
- Once the cause of the memory exhaustion is determined, the next steps to be taken depend on the process at fault. If third-party software is the cause, determine where it came from and consider removing it as per the Third Party Hardware and Software Support Policy. If a VMware component is responsible for leaking memory, engage VMware Technical Support for further investigation. For more information, see How to Submit a Support Request.
Note: Workarounds typically involve disabling the component that is causing the memory leak.
- In anticipation of a recurrence, run top in batch mode to gather historical process CPU and memory consumption statistics within the Service Console (a sample invocation is sketched after this list). For more information, see Using performance collection tools to gather data for fault analysis (1006797).
- If you do not have a cos-core file, ensure that the Misc.CosCoreFile advanced configuration option is valid (see the second sketch after this list). For more information, see Configuring an ESX host to capture a Service Console coredump (1032962).
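For the top-in-batch-mode suggestion above, a sample invocation might look like the following. The interval and iteration count are illustrative only, and the output file should live on a volume with enough free space; KB 1006797 describes the collection procedure VMware recommends:
nohup top -b -d 300 -n 288 >> /var/log/top-batch.log 2>&1 &    # one snapshot every 5 minutes for 24 hours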
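To review the Misc.CosCoreFile option from the Service Console, the esxcfg-advcfg utility can display its current value. This is a sketch and assumes classic ESX; KB 1032962 documents the supported way to configure the coredump location:
esxcfg-advcfg -g /Misc/CosCoreFile    # shows the current Service Console coredump target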