Understanding an "Oops" purple diagnostic screen

Products

VMware vSphere ESXi

Issue/Introduction

This article describes the information displayed on a purple diagnostic screen that is generated when the service console faults (panic or Oops). The purple diagnostic screen can include one or more of these messages:

Oops
CosPanic
COS Error

The purple diagnostic screen appears similar to the example shown in the Resolution section.

If you encounter a purple diagnostic screen that does not match the symptoms above, see Interpreting an ESX host purple diagnostic screen (1004250).

Environment

VMware ESXi 3.5.x Embedded
VMware ESX Server 3.5.x
VMware ESXi 4.0.x Embedded
VMware ESXi 4.0.x Installable
VMware ESXi 3.5.x Installable
VMware ESX Server 2.5.x
VMware ESX 4.0.x
VMware ESX Server 3.0.x

Resolution

When an Oops or panic occurs in the service console of a VMware ESX host, a purple screen fault is generated.

Note: If the advanced setting Misc.PsodOnCosPanic is set to zero (0), a service console panic does not generate a purple screen fault. Ensure that this is not the case, because the purple screen information is necessary to diagnose the issue the host is experiencing. Also ensure that Misc.CosCoreFile is set correctly so that a core dump for the service console is generated as well.

A purple screen caused by a service console fault contains two main components. The first component is the VMkernel purple screen output and the second is the service console Linux kernel output. For more information about decoding a VMkernel purple screen, see Interpreting an ESX host purple diagnostic screen (1004250).

The contents from this example are:

VMware ESX Server [Releasebuild-64607]
Oops
frame=0x1f16d34 ip=0xc022e995 cr2=0x100 cr3=0x13401000 cr4=0x6f0
es=0x68 ds=0xc02a0068 fs=0x0 gs=0x0
eax=0x0 ebx=0x0 ecx=0x1 edx=0x800
ebp=0x0 esi=0x0 edi=0xc03a7b20 err=0 eflags=0x0
*0:1024/console 1:1025/idle1 2:1026/idle2 3:1027/idle3
4:1028/idle4 5:1029/idle5 6:1030/idle6 7:1031/idle7
0x0:[0xc022e995]blk_dev+0xbd98d934 stack: 0x0, 0x0, 0x0
VMK uptime: 0:00:02:17.807 TSC: 343459198808
0:00:02:11.319 cpu0:1024)Host: 4781: COS Error: Oops

Starting coredump to disk Starting coredump to disk Dumping using slot 1 of 1... using slot 1 of 1... log

Stack trace from cos log:
<4>EIP: 0060:[<c022e995>] Tainted: P
<4>EFLAGS: 00010246
<4>
<4>EIP is at sr_finish [kernel] 0xa5 (2.4.21-47.0.1.ELvmnix/i686)
<4>eax: 00000000 ebx: 00000000 ecx: 00000001 edx: 00000800
<4>esi: 00000000 edi: c03a7b20 ebp: 00000000 esp: c204fd70
<4>ds: 0068 cs: 0060 es: 0068 ss: 0068
<4>Process esxcfg-rescan (pid: 2997, stackpage=c204f000)
<4>Stack: c045ca80 00000400 00000000 00000000 80000000 00000282 00000001 c03a7ac0
<4> c9fc6e00 c022ca10 00000003 c021cb94 c02d6fa8 00000000 00000000 00000004
<4> c204fdd0 c204fdd4 c204fdd8 c8c4ae00 c204fddc 00000000 00000004 c204fddc
<4>Call Trace: [<c022ca10>] sd_attach [kernel] 0x0 (0xc204fd94)
<4>[<c021cb94>] scan_scsis [kernel] 0x3d4 (0xc204fd9c)
<4>[<c0123b79>] printk [kernel] 0x149 (0xc204feb8)
<4>[<c0212b44>] proc_scsi_gen_write [kernel] 0x624 (0xc204feec)
<4>[<c0168ffe>] locate_fd [kernel] 0xae (0xc204ff40)
<4>[<c0180130>] proc_file_write [kernel] 0x40 (0xc204ff80)
<4>[<c0158a73>] sys_write [kernel] 0xa3 (0xc204ff94)
<4>[<c02a406f>] no_timing [kernel] 0x7 (0xc204ffc0)
<4>[<c02a002b>] zlib_tr_flush_block [kernel] 0x3b (0xc204ffe0)
<4>
<4>Code: 89 90 00 01 00 00 a1 80 9f 4b c0 80 4c 18 12 01 a1 80 9f 4b
<4>
<4>
<4>dell_rbu 0xd2188060 -s .data 0xd2189dcc -s .bss 0xd2189e00
<4>ppdev 0xd2185060 -s .data 0xd2186b80 -s .bss 0xd2186c00
<4>parport 0xd217a060 -s .data 0xd2183540 -s .bss 0xd2183880
<4>ipmi_devintf0xd2160060 -s .data 0xd21614e0 -s .bss 0xd2161580
<4>ipmi_si_drv0xd2171060 -s .data 0xd2177f00 -s .bss 0xd21780c0
<4>ipmi_msghandler0xd2168060 -s .data 0xd216f170 -s .bss 0xd216f1e0
<4>ipt_REJECT0xd2165060 -s .data 0xd21662c0 -s .bss 0xd2166320

The service console panic output starts from:

Stack trace from cos log:
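The split between the two components can be sketched in a few lines of Python. This is only an illustration: the function name split_purple_screen and the shortened sample text are assumptions made for the example, and the only fact taken from the screen is the "Stack trace from cos log:" marker.

```python
COS_MARKER = "Stack trace from cos log:"

def split_purple_screen(text):
    """Return (vmkernel_part, cos_part), split at the service console marker."""
    head, found, tail = text.partition(COS_MARKER)
    if not found:
        # No service console section present: everything is VMkernel output.
        return text.strip(), ""
    return head.strip(), tail.strip()

# Shortened copy of the example screen, for illustration only.
sample = """VMware ESX Server [Releasebuild-64607]
Oops
0:00:02:11.319 cpu0:1024)Host: 4781: COS Error: Oops
Stack trace from cos log:
<4>EIP: 0060:[<c022e995>] Tainted: P
<4>EIP is at sr_finish [kernel] 0xa5 (2.4.21-47.0.1.ELvmnix/i686)"""

vmkernel_part, cos_part = split_purple_screen(sample)
print(cos_part.splitlines()[0])
```

Everything before the marker belongs to the VMkernel output, and everything after it is the Linux kernel output discussed in the rest of this section.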

The first important piece of information is the EIP, which shows where in the Linux kernel the fault was caught (or triggered). In this example, the function running in the Linux kernel at the time of the fault was sr_finish. This function is used in the processing of storage information.

<4>EIP: 0060:[<c022e995>] Tainted: P
<4>EFLAGS: 00010246
<4>
<4>EIP is at sr_finish [kernel] 0xa5 (2.4.21-47.0.1.ELvmnix/i686)
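As a rough sketch, the two EIP lines can be picked apart with regular expressions. The helper name parse_eip and the dictionary keys are hypothetical, and the patterns are written only against the line layout shown in this example, so other kernel builds may need adjustments.

```python
import re

# Patterns match the two EIP lines as they appear in this example only.
EIP_ADDR = re.compile(r"EIP:\s+([0-9a-f]{4}):\[<([0-9a-f]+)>\]\s*(.*)")
EIP_AT = re.compile(r"EIP is at (\S+) \[([^\]]+)\] (0x[0-9a-f]+) \((.+)\)")

def parse_eip(lines):
    """Collect segment, address, taint flag, function, offset and kernel version."""
    info = {}
    for line in lines:
        m = EIP_ADDR.search(line)
        if m:
            info["segment"] = m.group(1)
            info["address"] = int(m.group(2), 16)
            info["taint"] = m.group(3).strip()
        m = EIP_AT.search(line)
        if m:
            info["function"] = m.group(1)
            info["module"] = m.group(2)
            info["offset"] = int(m.group(3), 16)
            info["kernel"] = m.group(4)
    return info

info = parse_eip([
    "<4>EIP: 0060:[<c022e995>] Tainted: P",
    "<4>EIP is at sr_finish [kernel] 0xa5 (2.4.21-47.0.1.ELvmnix/i686)",
])
print(info["function"], hex(info["offset"]), hex(info["address"]))
```

The parsed address (0xc022e995) is the same value that the VMkernel part of the screen reports as ip=0xc022e995, which is a quick way to confirm that both halves describe the same fault.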

The next lines are the register dump. This section shows each register and its contents at the time of the fault:

<4>eax: 00000000 ebx: 00000000 ecx: 00000001 edx: 00000800
<4>esi: 00000000 edi: c03a7b20 ebp: 00000000 esp: c204fd70
<4>ds: 0068 cs: 0060 es: 0068 ss: 0068
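To make the register dump easier to compare against other data, it can be loaded into a dictionary. The sketch below assumes the "name: hexvalue" layout shown above; the helper name parse_registers is hypothetical.

```python
import re

# A register name (two or three lowercase letters), a colon, then a hex value.
REG = re.compile(r"\b([a-z]{2,3}):\s+([0-9a-f]+)")

def parse_registers(lines):
    """Map each register name to its integer value."""
    regs = {}
    for line in lines:
        for name, value in REG.findall(line):
            regs[name] = int(value, 16)
    return regs

regs = parse_registers([
    "<4>eax: 00000000 ebx: 00000000 ecx: 00000001 edx: 00000800",
    "<4>esi: 00000000 edi: c03a7b20 ebp: 00000000 esp: c204fd70",
    "<4>ds: 0068 cs: 0060 es: 0068 ss: 0068",
])
zero_regs = sorted(name for name, value in regs.items() if value == 0)
print(zero_regs)
```

Listing the registers that hold zero is a small convenience: a zero value in a register that is later used as a memory address is a common sign of a NULL pointer dereference, although the dump alone does not prove it.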

The next line is very important because it shows the process that was running at the time of the fault. In this case, a storage rescan was being performed:

<4>Process esxcfg-rescan (pid: 2997, stackpage=c204f000)
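When several purple screens are being compared, extracting the process name and PID from this line helps show whether the same operation was running each time. A minimal sketch, assuming the line layout shown above (the helper name parse_process is hypothetical):

```python
import re

PROCESS = re.compile(r"Process (\S+) \(pid: (\d+), stackpage=([0-9a-f]+)\)")

def parse_process(line):
    """Return (name, pid, stackpage) from the Process line, or None if absent."""
    m = PROCESS.search(line)
    if not m:
        return None
    return m.group(1), int(m.group(2)), int(m.group(3), 16)

proc = parse_process("<4>Process esxcfg-rescan (pid: 2997, stackpage=c204f000)")
print(proc)
```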

These lines contain the content of the stack:

<4>Stack: c045ca80 00000400 00000000 00000000 80000000 00000282 00000001 c03a7ac0
<4> c9fc6e00 c022ca10 00000003 c021cb94 c02d6fa8 00000000 00000000 00000004
<4> c204fdd0 c204fdd4 c204fdd8 c8c4ae00 c204fddc 00000000 00000004 c204fddc

These lines are the call trace, which shows what the Linux kernel was doing before the failure. Use this information to help diagnose the issue. In this example, SCSI scanning was in progress:

<4>Call Trace: [<c022ca10>] sd_attach [kernel] 0x0 (0xc204fd94)
<4>[<c021cb94>] scan_scsis [kernel] 0x3d4 (0xc204fd9c)
<4>[<c0123b79>] printk [kernel] 0x149 (0xc204feb8)
<4>[<c0212b44>] proc_scsi_gen_write [kernel] 0x624 (0xc204feec)
<4>[<c0168ffe>] locate_fd [kernel] 0xae (0xc204ff40)
<4>[<c0180130>] proc_file_write [kernel] 0x40 (0xc204ff80)
<4>[<c0158a73>] sys_write [kernel] 0xa3 (0xc204ff94)
<4>[<c02a406f>] no_timing [kernel] 0x7 (0xc204ffc0)
<4>[<c02a002b>] zlib_tr_flush_block [kernel] 0x3b (0xc204ffe0)
<4>
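The call trace can be turned into structured records so that the function names are easy to list or search. The sketch below is written against the entry layout shown above. The Frame type and its field names are hypothetical, and the value in parentheses is assumed here to be the stack location where the entry was found. On kernels of this generation, some entries may also be stale values left on the stack rather than real callers, so treat the list as a guide.

```python
import re
from typing import NamedTuple

class Frame(NamedTuple):
    address: int      # value inside [<...>]
    function: str     # symbol name
    module: str       # text inside [...], "kernel" in this example
    offset: int       # offset into the function
    stack_slot: int   # value in parentheses (assumed: stack location)

FRAME = re.compile(
    r"\[<([0-9a-f]+)>\]\s+(\S+)\s+\[([^\]]+)\]\s+(0x[0-9a-f]+)\s+\((0x[0-9a-f]+)\)"
)

def parse_call_trace(text):
    """Return one Frame per call trace entry, in the order printed."""
    return [
        Frame(int(addr, 16), func, module, int(off, 16), int(slot, 16))
        for addr, func, module, off, slot in FRAME.findall(text)
    ]

trace = """<4>Call Trace: [<c022ca10>] sd_attach [kernel] 0x0 (0xc204fd94)
<4>[<c021cb94>] scan_scsis [kernel] 0x3d4 (0xc204fd9c)
<4>[<c0212b44>] proc_scsi_gen_write [kernel] 0x624 (0xc204feec)
<4>[<c0158a73>] sys_write [kernel] 0xa3 (0xc204ff94)"""

frames = parse_call_trace(trace)
print(" <- ".join(frame.function for frame in frames))
```

Reading the function names in order (scan_scsis, proc_scsi_gen_write, sys_write) supports the conclusion that a write to the SCSI proc interface triggered a SCSI scan.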

This line shows the machine code bytes that were running on the CPU at the time of the fault:

<4>Code: 89 90 00 01 00 00 a1 80 9f 4b c0 80 4c 18 12 01 a1 80 9f 4b
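As a worked illustration, the first instruction in the Code line can be decoded by hand. The sketch below assumes that the bytes begin at the faulting instruction, which is not stated on the screen. It also handles only the single encoding that appears here (opcode 0x89 with a 32-bit displacement and no SIB byte), so it is not a general disassembler. The function name decode_mov_store is hypothetical.

```python
REG32 = ["eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"]

def decode_mov_store(code):
    """Decode 'mov [base + disp32], src' (opcode 0x89, ModRM mod=10, no SIB)."""
    opcode, modrm = code[0], code[1]
    mod, reg, rm = modrm >> 6, (modrm >> 3) & 7, modrm & 7
    if opcode != 0x89 or mod != 0b10 or rm == 0b100:
        raise ValueError("only 0x89 with disp32 and no SIB byte is handled")
    disp = int.from_bytes(code[2:6], "little", signed=True)
    return REG32[reg], REG32[rm], disp

code = bytes.fromhex("89 90 00 01 00 00 a1 80 9f 4b")
src, base, disp = decode_mov_store(code)

# eax is taken from the register dump in this example.
registers = {"eax": 0x00000000}
fault_address = registers[base] + disp
print(f"mov [{base}+{disp:#x}], {src} writes to {fault_address:#x}")
```

Under these assumptions the instruction stores edx at the address eax + 0x100. With eax equal to zero, the write goes to address 0x100, which matches the cr2=0x100 value in the VMkernel part of the screen. This is consistent with a NULL pointer being used as the base of a structure.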

This is a list of the kernel modules loaded:

<4>dell_rbu 0xd2188060 -s .data 0xd2189dcc -s .bss 0xd2189e00
<4>ppdev 0xd2185060 -s .data 0xd2186b80 -s .bss 0xd2186c00
<4>parport 0xd217a060 -s .data 0xd2183540 -s .bss 0xd2183880
<4>ipmi_devintf0xd2160060 -s .data 0xd21614e0 -s .bss 0xd2161580
<4>ipmi_si_drv0xd2171060 -s .data 0xd2177f00 -s .bss 0xd21780c0
<4>ipmi_msghandler0xd2168060 -s .data 0xd216f170 -s .bss 0xd216f1e0
<4>ipt_REJECT0xd2165060 -s .data 0xd21662c0 -s .bss 0xd2166320
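The module list can also be parsed, for example to check whether a particular driver was loaded on the host. Longer module names run directly into the load address in this output, so the sketch below does not rely on a separating space. The helper name parse_modules and the dictionary keys are hypothetical.

```python
import re

# The name is matched lazily so that "ipmi_devintf0xd2160060" still splits correctly.
MODULE = re.compile(
    r"^(?:<\d>)?(\w+?)\s*(0x[0-9a-f]+) -s \.data (0x[0-9a-f]+) -s \.bss (0x[0-9a-f]+)$"
)

def parse_modules(lines):
    """Map module name to its load, .data and .bss addresses."""
    modules = {}
    for line in lines:
        m = MODULE.match(line.strip())
        if m:
            name, load, data, bss = m.groups()
            modules[name] = {
                "load": int(load, 16),
                "data": int(data, 16),
                "bss": int(bss, 16),
            }
    return modules

modules = parse_modules([
    "<4>dell_rbu 0xd2188060 -s .data 0xd2189dcc -s .bss 0xd2189e00",
    "<4>ipmi_devintf0xd2160060 -s .data 0xd21614e0 -s .bss 0xd2161580",
    "<4>ipt_REJECT0xd2165060 -s .data 0xd21662c0 -s .bss 0xd2166320",
])
print(sorted(modules))
```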

Note: If you need more assistance diagnosing your purple screen error:

Review the main article in this series: Interpreting an ESX host purple diagnostic screen (1004250)
File a support request with VMware Support and note this KB article ID in the problem description. For more information, go to www.vmware.com/support/policies/howto.html.
Gather VMware Support Script Data. For information, see Collecting Diagnostic Information for VMware Products (1008524).

Additional Information

Known Issues

If you have an "Oops" purple diagnostic screen that exactly matches the error message outlined in one of these articles, follow the applicable directions:

Other Considerations

An Oops in the ESX service console may be triggered by a hardware issue, by a software issue with the ESX VMkernel or the Linux service console kernel, or by a driver or privileged third-party process running in the service console. If multiple failures have occurred, consider the pattern of failures before taking action.

If the error has not been documented within the knowledge base, collect diagnostic information from the ESX host and submit a support request. For more information, see Collecting Diagnostic Information for VMware Products (1008524) and How to Submit a Support Request.

Interpreting an ESX/ESXi host purple diagnostic screen
ESX host stops responding and displays a purple screen error after a storage rescan
VMware ESX 3.5, Patch ESX350-200808402-BG: Updates the Service Console Kernel
ESX 3.0.1 Service Console stops responding with a COS Oops Error
Collecting diagnostic information for VMware products
ESX Server 2.5.1 Fails to Boot on Systems with Adaptec RAID Controller