Interpreting an ESXi host purple diagnostic screen (PSOD)

Article ID: 343033

Updated On:

Products

VMware vSphere ESXi

Issue/Introduction

This article provides information to decode ESXi host purple screen (PSOD) errors.

An ESXi purple screen error appears similar to:

Environment

VMware vSphere ESXi 7.x
VMware vSphere ESXi 8.x

Resolution

What is the VMkernel?

The VMkernel is the operating system core of ESXi. The kernel handles resource scheduling and device I/O. Device I/O is handled by the VMware network and storage stacks, which serve as a layer between the virtual file system and network devices on one side and the device drivers that control physical devices on the other.

Interpreting the purple diagnostic screen

If the VMkernel experiences an error, the error appears in a purple diagnostic screen. The purple diagnostic screen looks similar to:

PCPU 1 locked up. Failed to ack TLB invalidate.
frame=0x3a37d98 ip=0x625e94 cr2=0x0 cr3=0x40c66000 cr4=0x16c
es=0xffffffff ds=0xffffffff fs=0xffffffff gs=0xffffffff
eax=0xffffffff ebx=0xffffffff ecx=0xffffffff edx=0xffffffff
ebp=0x3a37ef4 esi=0xffffffff edi=0xffffffff err=-1 eflags=0xffffffff
*0:1037/helper1-4 1:1107/vmm0:Fagi 2:1121/vmware-vm 3:1122/mks:Franc
0x3a37ef4:[0x625e94]Panic+0x17 stack: 0x833ab4, 0x3a37f10, 0x3a37f48
0x3a37f04:[0x625e94]Panic+0x17 stack: 0x833ab4, 0x1, 0x14a03a0
0x3a37f48:[0x64bfa4]TLBDoInvalidate+0x38f stack: 0x3a37f54, 0x40, 0x2
0x3a37f70:[0x66da4d]XMapForceFlush+0x64 stack: 0x0, 0x4d3a, 0x0
0x3a37fac:[0x652b8b]helpFunc+0x2d2 stack: 0x1, 0x14a4580, 0x0
0x3a37ffc:[0x750902]CpuSched_StartWorld+0x109 stack: 0x0, 0x0, 0x0
0x3a38000:[0x0]blk_dev+0xfd76461f stack: 0x0, 0x0, 0x0
VMK uptime: hh:mm:ss:ms TSC: 1751259712918392
Starting coredump to disk Starting coredump to disk Dumping using slot 1 of 1...using slot 1 of 1... log

Here is a breakdown of each section of the above purple diagnostic screen:

The Error Message:

PCPU 1 locked up. Failed to ack TLB invalidate
This section of the purple diagnostic screen shows the reported error message. Only a finite number of error messages can be reported; the known messages are listed later in this article.
The CPU Registers:

frame=0x3a37d98 ip=0x625e94 cr2=0x0 cr3=0x40c66000 cr4=0x16c
es=0xffffffff ds=0xffffffff fs=0xffffffff gs=0xffffffff
eax=0xffffffff ebx=0xffffffff ecx=0xffffffff edx=0xffffffff
ebp=0x3a37ef4 esi=0xffffffff edi=0xffffffff err=-1 eflags=0xffffffff
These are the values that were in the physical CPU registers at the time of the error. The information in these registers may vary greatly between VMkernel errors. These registers can only be used internally when debugging a core dump of the VMkernel error. For more information about these registers, see the Intel® 64 and IA-32 Architectures Software Developer's Manuals for Intel processors and the AMD64 Architecture Programmer's Manual, Volumes 1-5 for AMD processors.
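As a rough illustration of how the register section is laid out, the sketch below turns the name=value pairs from the example screen into a dictionary. The helper name parse_registers is hypothetical and not part of any VMware tool; the sketch only assumes that values are either hexadecimal (0x...) or a signed decimal such as -1.

```python
import re

# Register dump copied from the example screen in this article.
REGISTER_DUMP = """\
frame=0x3a37d98 ip=0x625e94 cr2=0x0 cr3=0x40c66000 cr4=0x16c
es=0xffffffff ds=0xffffffff fs=0xffffffff gs=0xffffffff
eax=0xffffffff ebx=0xffffffff ecx=0xffffffff edx=0xffffffff
ebp=0x3a37ef4 esi=0xffffffff edi=0xffffffff err=-1 eflags=0xffffffff
"""

# A pair is name=value, where value is hex (0x...) or a signed decimal.
PAIR = re.compile(r"(\w+)=(-?(?:0x[0-9a-fA-F]+|\d+))")


def parse_registers(text):
    """Hypothetical helper: map each register name to its integer value."""
    # int(value, 0) accepts both "0x625e94" and "-1".
    return {name: int(value, 0) for name, value in PAIR.findall(text)}


registers = parse_registers(REGISTER_DUMP)
print(hex(registers["ip"]))   # instruction pointer at the time of the error
print(registers["err"])       # error code field
```

In the example screen, the ip value is the same address that appears in square brackets on the first line of the stack trace.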
The Physical CPU:

*0:1037/helper1-4 1:1107/vmm0:Fagi 2:1121/vmware-vm 3:1122/mks:Franc

This section of the purple diagnostic screen identifies the physical CPU that was running instructions during the VMkernel error. In the example, the * beside the 0 indicates that physical CPU 0 was running an operation at the time of the failure. Newer versions of ESXi do not include an *; instead, the letters CPU precede the CPU number. For example, if the same error were to occur on a newer version of VMware ESXi, the same line would appear as:

CPU0:1037/helper1-4 cpu1:1107/vmm0:Fagi cpu2:1121/vmware-vm cpu3:1122/mks:Franc.
This section of the purple diagnostic screen also describes the world (process) that was running on the CPU at the time of the error. In the above example, the world running was helper1-4.

Note: The name of the process may be truncated.
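To make the layout of this line concrete, here is a minimal sketch that splits the older-style line (the one using *) into its entries and picks out the marked CPU. The function name parse_cpu_line and the dictionary keys are hypothetical, and the sketch does not attempt to handle the newer prefix style.

```python
import re

# Older-style physical CPU line copied from the example screen.
CPU_LINE = "*0:1037/helper1-4 1:1107/vmm0:Fagi 2:1121/vmware-vm 3:1122/mks:Franc"

# Each entry is [*]<pcpu>:<world id>/<world name>; the name may be truncated.
ENTRY = re.compile(r"(\*?)(\d+):(\d+)/(\S+)")


def parse_cpu_line(line):
    """Hypothetical helper: return one dict per physical CPU entry."""
    return [
        {
            "pcpu": int(pcpu),
            "world_id": int(world_id),
            "world_name": name,
            "active": star == "*",  # the * marks the CPU running at failure time
        }
        for star, pcpu, world_id, name in ENTRY.findall(line)
    ]


entries = parse_cpu_line(CPU_LINE)
active = [e for e in entries if e["active"]]
print(active[0]["pcpu"], active[0]["world_name"])
```

For the example line this reports physical CPU 0 and the world helper1-4, matching the description above.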
The Stack Trace:

0x3a37ef4:[0x625e94]Panic+0x17 stack: 0x833ab4, 0x3a37f10, 0x3a37f48
0x3a37f04:[0x625e94]Panic+0x17 stack: 0x833ab4, 0x1, 0x14a03a0
0x3a37f48:[0x64bfa4]TLBDoInvalidate+0x38f stack: 0x3a37f54, 0x40, 0x2
0x3a37f70:[0x66da4d]XMapForceFlush+0x64 stack: 0x0, 0x4d3a, 0x0
0x3a37fac:[0x652b8b]helpFunc+0x2d2 stack: 0x1, 0x14a4580, 0x0
0x3a37ffc:[0x750902]CpuSched_StartWorld+0x109 stack: 0x0, 0x0, 0x0
0x3a38000:[0x0]blk_dev+0xfd76461f stack: 0x0, 0x0, 0x0

The stack trace shows what the VMkernel was doing at the time of the error, with the most recent call listed first. In this example, it was trying to invalidate Translation Lookaside Buffer (TLB) entries, the CPU's cache of page table translations. Because it shows the actions of the kernel at the time of the error, the stack trace is a vital tool in the diagnosis of purple screen errors.
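The sketch below shows one way to pull the function names out of the stack trace so that traces can be compared or searched for. The helper name stack_functions is hypothetical, and the pattern is inferred only from the example frames in this article, so other screen layouts may need a different pattern.

```python
import re

# Stack trace copied from the example screen, one frame per line.
STACK_TRACE = """\
0x3a37ef4:[0x625e94]Panic+0x17 stack: 0x833ab4, 0x3a37f10, 0x3a37f48
0x3a37f04:[0x625e94]Panic+0x17 stack: 0x833ab4, 0x1, 0x14a03a0
0x3a37f48:[0x64bfa4]TLBDoInvalidate+0x38f stack: 0x3a37f54, 0x40, 0x2
0x3a37f70:[0x66da4d]XMapForceFlush+0x64 stack: 0x0, 0x4d3a, 0x0
0x3a37fac:[0x652b8b]helpFunc+0x2d2 stack: 0x1, 0x14a4580, 0x0
0x3a37ffc:[0x750902]CpuSched_StartWorld+0x109 stack: 0x0, 0x0, 0x0
0x3a38000:[0x0]blk_dev+0xfd76461f stack: 0x0, 0x0, 0x0
"""

# <frame address>:[<instruction address>]<function>+<offset> stack: <values>
FRAME = re.compile(r"^0x[0-9a-f]+:\[0x[0-9a-f]+\](\w+)\+0x[0-9a-f]+ stack: ")


def stack_functions(trace):
    """Hypothetical helper: function names in screen order, repeats collapsed."""
    names = []
    for line in trace.splitlines():
        match = FRAME.match(line.strip())
        if match and (not names or names[-1] != match.group(1)):
            names.append(match.group(1))
    return names


print(" <- ".join(stack_functions(STACK_TRACE)))
```

A compact list of function names such as this is convenient to use as a search string alongside the error message.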
The Uptime:

VMK uptime: hh:mm:ss:ms TSC: 1751259712918392

This section indicates how long the server has been running since the last boot. The TSC value is the number of CPU clock cycles that have elapsed since the server was started.
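Because the TSC counts clock cycles, dividing it by the counter frequency gives an approximate uptime. The sketch below works through that arithmetic for the TSC value in the example. The 2.4 GHz frequency is purely an assumption for illustration (the article does not state the CPU speed of the example host), and the calculation assumes the counter ran at a constant rate since boot.

```python
# Assumed TSC frequency in cycles per second; NOT taken from the article.
ASSUMED_TSC_HZ = 2_400_000_000

# TSC value copied from the example screen.
EXAMPLE_TSC = 1751259712918392


def tsc_to_uptime(tsc_cycles, tsc_hz):
    """Hypothetical helper: convert a cycle count to (days, hours, minutes, seconds)."""
    total_seconds = tsc_cycles // tsc_hz      # whole seconds since boot
    days, rest = divmod(total_seconds, 86400)
    hours, rest = divmod(rest, 3600)
    minutes, seconds = divmod(rest, 60)
    return days, hours, minutes, seconds


days, hours, minutes, seconds = tsc_to_uptime(EXAMPLE_TSC, ASSUMED_TSC_HZ)
print(f"{days} days {hours:02d}:{minutes:02d}:{seconds:02d}")
```

Under the assumed frequency, the example value corresponds to a little over eight days of uptime. On a real host, the VMK uptime field on the same line gives the figure directly.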
The Core Dump:

Starting coredump to disk Starting coredump to disk Dumping using slot 1 of 1...using slot 1 of 1... log
This section of the purple diagnostic screen indicates that the contents of the VMkernel memory are being copied to the vmkcore partition.
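The breakdown above can be sketched as a small classifier that sorts each line of a captured screen into the six sections. The names RULES and split_sections are hypothetical, and the patterns are inferred only from the example screen in this article; any line that matches no pattern is treated as part of the error message.

```python
import re

# Example screen copied from this article.
SCREEN = """\
PCPU 1 locked up. Failed to ack TLB invalidate.
frame=0x3a37d98 ip=0x625e94 cr2=0x0 cr3=0x40c66000 cr4=0x16c
es=0xffffffff ds=0xffffffff fs=0xffffffff gs=0xffffffff
eax=0xffffffff ebx=0xffffffff ecx=0xffffffff edx=0xffffffff
ebp=0x3a37ef4 esi=0xffffffff edi=0xffffffff err=-1 eflags=0xffffffff
*0:1037/helper1-4 1:1107/vmm0:Fagi 2:1121/vmware-vm 3:1122/mks:Franc
0x3a37ef4:[0x625e94]Panic+0x17 stack: 0x833ab4, 0x3a37f10, 0x3a37f48
0x3a37f04:[0x625e94]Panic+0x17 stack: 0x833ab4, 0x1, 0x14a03a0
0x3a37f48:[0x64bfa4]TLBDoInvalidate+0x38f stack: 0x3a37f54, 0x40, 0x2
0x3a37f70:[0x66da4d]XMapForceFlush+0x64 stack: 0x0, 0x4d3a, 0x0
0x3a37fac:[0x652b8b]helpFunc+0x2d2 stack: 0x1, 0x14a4580, 0x0
0x3a37ffc:[0x750902]CpuSched_StartWorld+0x109 stack: 0x0, 0x0, 0x0
0x3a38000:[0x0]blk_dev+0xfd76461f stack: 0x0, 0x0, 0x0
VMK uptime: hh:mm:ss:ms TSC: 1751259712918392
Starting coredump to disk Starting coredump to disk Dumping using slot 1 of 1...using slot 1 of 1... log
"""

# (section name, pattern matching the start of a line in that section)
RULES = [
    ("registers", re.compile(r"^\w+=")),
    ("stack", re.compile(r"^0x[0-9a-f]+:\[")),
    ("cpu", re.compile(r"^\*?\d+:\d+/")),
    ("uptime", re.compile(r"^VMK uptime:")),
    ("coredump", re.compile(r"^Starting coredump")),
]


def split_sections(screen):
    """Hypothetical helper: group screen lines by section name."""
    sections = {"error": []}
    for line in screen.splitlines():
        line = line.strip()
        if not line:
            continue
        for name, pattern in RULES:
            if pattern.match(line):
                sections.setdefault(name, []).append(line)
                break
        else:
            # No rule matched: treat the line as the error message.
            sections["error"].append(line)
    return sections


sections = split_sections(SCREEN)
for name, lines in sections.items():
    print(f"{name}: {len(lines)} line(s)")
```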

Using the error message of the purple diagnostic screen to troubleshoot a VMkernel error

The VMkernel error message shown on the purple screen can be used to identify the cause of the issue. The number of error messages that can be produced is finite. This is a list of known VMkernel error messages.

Type: Spin count exceeded / Possible deadlock
Example Error: Spin count exceeded (iplLock) - possible deadlock
Description: A VMware ESXi host may report Spin count exceeded and a possible deadlock in a purple diagnostic screen when a thread is unable to enter a critical section of code. Before executing the critical section, the thread must obtain a lock, which it does by repeatedly polling a mutex in a spinlock operation. The thread continues to poll the mutex, but the number of polls is limited. When that limit is exceeded without the lock being obtained, the error is reported as a possible deadlock.
Type: Failed to ack TLB invalidate
Example Error: PCPU 1 locked up. Failed to ack TLB invalidate.
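Description: A physical CPU failed to acknowledge a request to invalidate its Translation Lookaside Buffer (TLB) entries, the cached copies of memory page table translations, and was reported as locked up.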
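The spin count behavior described in the first entry above can be illustrated with a bounded spinlock. This is only a conceptual sketch in user-level Python, not the VMkernel implementation: the limit of 1000 polls, the class SpinCountExceeded, and the function spin_acquire are all hypothetical, and the lock name iplLock is borrowed from the example error message.

```python
import threading

SPIN_LIMIT = 1000  # hypothetical limit; the real VMkernel limit is not stated here


class SpinCountExceeded(RuntimeError):
    """Raised when the lock could not be obtained within the poll limit."""


def spin_acquire(lock, name, limit=SPIN_LIMIT):
    """Poll the lock without blocking, giving up after `limit` attempts."""
    for _ in range(limit):
        if lock.acquire(blocking=False):  # one poll of the lock
            return
    raise SpinCountExceeded(f"Spin count exceeded ({name}) - possible deadlock")


ipl_lock = threading.Lock()
ipl_lock.acquire()  # simulate a holder that never releases the lock

try:
    spin_acquire(ipl_lock, "iplLock")
except SpinCountExceeded as exc:
    message = str(exc)
    print(message)
```

The point of the limit is that a lock held indefinitely is detected and reported rather than leaving the waiting thread spinning forever.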

A purple diagnostic screen can also come in the form of an exception. An exception is a hardware mechanism that transfers control to an exception handler when some condition changes the normal flow of execution (division by zero, page fault, and so on). Handlers leave no trace of their own, so logging or single-step debugging is needed to determine whether a handler itself faulted. This is a list of common exceptions:

Type: Exception 13 (General Protection Fault)
Example Error: #GP Exception(13) in world 4130:helper13-0 @ 0x41803399e303
Description: A general protection fault (Exception 13) occurs under one of the following circumstances: the page being requested does not belong to the program requesting it (and is not mapped in the program's memory), or the program does not have rights to perform a read or write operation on the page.
Type: Exception 14 (Page Fault)
Example Error: #PF Exception type 14 in world 136:helper0-0 @ 0x4a8e6e
Description: A page fault (Exception 14) occurs when the page being requested has not been successfully loaded into memory.

For more information on Exception 13, Exception 14 or Page Fault, see Understanding Exception 13 and Exception 14 purple diagnostic screen events (303440).
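The two example exception messages above use slightly different wording ("Exception(13)" and "Exception type 14"). The sketch below shows one pattern that extracts the same fields from both. The function name parse_exception and the field names are hypothetical, and the pattern is based only on these two examples, so other message variants may not match.

```python
import re

# Example error lines copied from this article.
EXAMPLES = [
    "#GP Exception(13) in world 4130:helper13-0 @ 0x41803399e303",
    "#PF Exception type 14 in world 136:helper0-0 @ 0x4a8e6e",
]

EXCEPTION = re.compile(
    r"#(?P<mnemonic>\w+) Exception\s*(?:\(|type )(?P<number>\d+)\)?"
    r" in world (?P<world_id>\d+):(?P<world_name>\S+)"
    r" @ (?P<address>0x[0-9a-fA-F]+)"
)


def parse_exception(line):
    """Hypothetical helper: return the message fields, or None if no match."""
    match = EXCEPTION.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["number"] = int(fields["number"])
    fields["world_id"] = int(fields["world_id"])
    fields["address"] = int(fields["address"], 16)
    return fields


for example in EXAMPLES:
    print(parse_exception(example))
```

The address after the @ sign is the location of the faulting instruction as given in the message; the world fields identify what was running, in the same form as on the physical CPU line.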

If your VMware ESXi host experiences an error similar to one of these that does not point you to a general article, search for the error message and stack trace information within the Broadcom Support Portal. If the error has not been documented within the Broadcom Support Portal, collect the diagnostic information from the VMware ESXi host and submit a support request.

For more information, see:

Collecting diagnostic information for VMware products (367431)

Using pattern analysis to troubleshoot multiple VMkernel errors on the same ESXi host

If you experience multiple purple diagnostic screens on the same VMware ESXi host, you can use them as a sample to estimate the likelihood that the issue is related to hardware or to software. To do so, look for patterns in these sections of the purple diagnostic screen:

The error message and the stack trace:
- If the error message and stack vary greatly between VMkernel errors, this indicates that software is not always hitting the same error. Although inconclusive, this may indicate a hardware issue.
- If the error message and the stack are always identical between VMkernel errors, this indicates that software is always hitting the same error. Although inconclusive, this may indicate a software issue.
- For more information about the error message you are experiencing, refer to the above section about the specific error message.
The physical CPU:
- If the physical CPU value remains the same across multiple VMkernel errors, this indicates that the software is always failing on the same physical CPU. Although inconclusive, this may indicate a CPU issue.
The world:
- If the world value remains the same across multiple VMkernel errors, this indicates that the VMkernel is failing when receiving instructions from the same world. Although inconclusive, this may indicate that the world is sending instructions that trigger the VMkernel error.
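The comparisons in the list above can be sketched as a short check over summaries of several purple screens. Everything in this example is hypothetical: the function find_patterns, the dictionary keys, and the three sample records, which are invented for illustration and do not come from a real host. As the list stresses, each result is only a hint, not a conclusion.

```python
# Invented summaries of three purple screens from one host (illustration only).
panics = [
    {"error": "Failed to ack TLB invalidate",
     "stack": ["Panic", "TLBDoInvalidate"], "pcpu": 1, "world": "helper1-4"},
    {"error": "#PF Exception type 14",
     "stack": ["Panic", "helpFunc"], "pcpu": 1, "world": "vmm0:Fagi"},
    {"error": "Spin count exceeded",
     "stack": ["Panic", "XMapForceFlush"], "pcpu": 1, "world": "mks:Franc"},
]


def find_patterns(panics):
    """Hypothetical helper: return inconclusive hints based on repeated values."""
    if len(panics) < 2:
        return []  # a pattern needs at least two screens
    hints = []
    signatures = {(p["error"], tuple(p["stack"])) for p in panics}
    if len(signatures) == 1:
        hints.append("identical error and stack: may indicate a software issue")
    elif len(signatures) == len(panics):
        hints.append("error and stack always differ: may indicate a hardware issue")
    if len({p["pcpu"] for p in panics}) == 1:
        hints.append("same physical CPU: may indicate a CPU issue")
    if len({p["world"] for p in panics}) == 1:
        hints.append("same world: that world may be triggering the error")
    return hints


for hint in find_patterns(panics):
    print(hint)
```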

Additional Information

This is a list of exceptions by type number. Types 9 and 15 do not appear because they are reserved on current processors:

Exception Type 0 #DE: Divide Error
Exception Type 1 #DB: Debug Exception
Exception Type 2 NMI: Non-Maskable Interrupt
Exception Type 3 #BP: Breakpoint Exception
Exception Type 4 #OF: Overflow (INTO instruction)
Exception Type 5 #BR: Bounds check (BOUND instruction)
Exception Type 6 #UD: Invalid Opcode
Exception Type 7 #NM: Coprocessor not available
Exception Type 8 #DF: Double Fault
Exception Type 10 #TS: Invalid TSS
Exception Type 11 #NP: Segment Not Present
Exception Type 12 #SS: Stack Segment Fault
Exception Type 13 #GP: General Protection Fault
Exception Type 14 #PF: Page Fault
Exception Type 16 #MF: Coprocessor error
Exception Type 17 #AC: Alignment Check
Exception Type 18 #MC: Machine Check Exception
Exception Type 19 #XF: SIMD Floating-Point Exception
Exception Type 20-31: Reserved
Exception Type 32-255: User-defined (clock scheduler)
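For quick reference when reading an exception number from a purple screen, the list above can be expressed as a lookup table. The sketch below is a direct transcription of that list; the function name describe_exception is hypothetical, and numbers missing from the list are reported as not listed rather than guessed.

```python
# Transcribed from the exception list in this article.
EXCEPTIONS = {
    0: "#DE: Divide Error",
    1: "#DB: Debug Exception",
    2: "NMI: Non-Maskable Interrupt",
    3: "#BP: Breakpoint Exception",
    4: "#OF: Overflow (INTO instruction)",
    5: "#BR: Bounds check (BOUND instruction)",
    6: "#UD: Invalid Opcode",
    7: "#NM: Coprocessor not available",
    8: "#DF: Double Fault",
    10: "#TS: Invalid TSS",
    11: "#NP: Segment Not Present",
    12: "#SS: Stack Segment Fault",
    13: "#GP: General Protection Fault",
    14: "#PF: Page Fault",
    16: "#MF: Coprocessor error",
    17: "#AC: Alignment Check",
    18: "#MC: Machine Check Exception",
    19: "#XF: SIMD Floating-Point Exception",
}


def describe_exception(number):
    """Hypothetical helper: return the list entry for an exception type."""
    if number in EXCEPTIONS:
        return EXCEPTIONS[number]
    if 20 <= number <= 31:
        return "Reserved"
    if 32 <= number <= 255:
        return "User-defined (clock scheduler)"
    return "Not listed"


print(describe_exception(13))
print(describe_exception(14))
```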

For more information about these Exceptions, see the Call and Return Operation for Interrupt or Exception Handling Procedures section in Intel 64 and IA-32 Architectures Software Developer’s Manual, Volume 1: Basic Architecture and Chapter 6: Interrupts and Exception Handling in Intel 64 and IA-32 Architectures Software Developer’s Manual, Volume 3A.

Collecting diagnostic information for VMware products
