Understanding a "Failed to ack TLB invalidate" purple diagnostic screen

Article ID: 324947


Products

VMware vSphere ESXi

Issue/Introduction

An ESXi host fails with a purple diagnostic screen that reports information similar to:
  • PCPU 3 locked up. Failed to ack TLB invalidate.
    @BlueScreen: PCPU 3 locked up. Failed to ack TLB invalidate.

  • cpu34:9213)VMware ESXi #.#.# [Releasebuild-########] PCPU 18 locked up. Failed to ack TLB invalidate (total of 5 locked up, PCPU(s): 0,10,11,16,18).
    cpu34:9213)cr0=0x######## cr2=0x######### cr3=######### cr4=######

Resolution

Overview

  • Context – A context is a collection of CPU-specific information that pertains to a specific process. It includes the values of the CPU registers and memory management information.
  • Context switch – A context switch occurs when the CPU stops running one process and starts running another, for example in response to an interrupt. The system saves the current context and restores the context of the other process.
  • Translation Look-aside Buffer (TLB) – The TLB is a cache of virtual-to-physical address translations that improves the performance of virtual memory addressing. It is part of the memory management information included in the context.
When a context switch is performed, or when memory mappings change, stale TLB entries must be flushed or invalidated before the new translations are used. This type of purple diagnostic screen occurs when a physical CPU does not acknowledge a TLB invalidation request for a prolonged period of time.
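
The "request, then wait for an acknowledgement" pattern implied by the message "Failed to ack TLB invalidate" can be illustrated with a toy model. This is only a sketch for intuition and is not VMkernel code: the ToyPcpu class, the request_tlb_invalidate function, and the max_polls limit are hypothetical names invented for this example, and the real timeout and signalling mechanism are not described in this article.

```python
from dataclasses import dataclass


@dataclass
class ToyPcpu:
    """Hypothetical model of a physical CPU; not a VMkernel structure."""
    pcpu_id: int
    responsive: bool = True   # a locked-up PCPU never services requests
    acked: bool = False

    def service_invalidate(self) -> None:
        # A healthy PCPU flushes its stale TLB entries and acknowledges.
        if self.responsive:
            self.acked = True


def request_tlb_invalidate(pcpus, max_polls=3):
    """Ask every PCPU to invalidate, then poll for acknowledgements.

    Returns the IDs of the PCPUs that never acknowledged. max_polls is an
    invented stand-in for the real timeout, which is an assumption here.
    """
    for p in pcpus:
        p.acked = False
    for _ in range(max_polls):
        for p in pcpus:
            p.service_invalidate()
        if all(p.acked for p in pcpus):
            return []
    return [p.pcpu_id for p in pcpus if not p.acked]


pcpus = [ToyPcpu(0), ToyPcpu(1), ToyPcpu(2), ToyPcpu(3, responsive=False)]
stuck = request_tlb_invalidate(pcpus)
for pcpu_id in stuck:
    print(f"PCPU {pcpu_id} locked up. Failed to ack TLB invalidate.")
```

In this model, a single unresponsive PCPU is enough to make the whole invalidation request fail, which is why the diagnostic screen names the specific PCPU that did not acknowledge.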


Diagnostic Information

Extract the ESXi host logs that led up to the purple diagnostic screen and examine them for a potential cause. To extract the logs, see Extracting the log file after an ESX or ESXi host fails with a purple screen error (1006796).

This is an example of the diagnostic information that is included in the purple diagnostic screen:
VMware ESX Server [Releasebuild-########]
PCPU 3 locked up. Failed to ack TLB invalidate.
gate=0x0 frame=0x######## eip=0x###### cr2=0x0 cr3=0x######## cr4=#####
eax=0x0 ebx=0x0 ecx=0x0 edx=0x0 es=0x0 ds=0x0
fs=0x0 gs=0x0 ebp=0x####### esi=0x0 edi=0x0 err=0 ef=0x0
cpu # #### vmm0:keys: cpu # #### mks:dc02: CPU # #### helper1-3: cpu 3 3012 vmm0:erpt:
cpu # #### vmm0:keys: cpu # #### vmm0:erpt: cpu # #### vmm0:time: cpu 7 2394 vmm0:addc:
@BlueScreen: PCPU 3 locked up. Failed to ack TLB invalidate.
0x343bed4:[0x61fafc]_vLog+0x0(0x78cb60, 0x343bef0, 0x343bf10)
0x343bee4:[0x61fafc]_vLog+0x0(0x78cb60, 0x3, 0x1)
0x343bf10:[0x63fd00]TLBInvalidateFailed+0x90(0x1, 0xffffffff, 0x0)
0x343bf38:[0x640012]TLBDoInvalidate+0x27a(0xffffffff, 0xffffffff, 0x343bf74)
0x343bf48:[0x63fbb5]TLB_Flush+0x35(0x0, 0x0, 0x400)
0x343bf74:[0x65d878]XMapFlushDelayedUnmaps+0x70(0x0, 0x12130b4, 0x0)
0x343bfac:[0x6463e3]helpFunc+0x1ff(0x1, 0xc9256c, 0x0)
0x343bffc:[0x702bb8]CpuSched_StartWorld+0x11c(0x0, 0x0, 0x0)
0x343c000:[0x0](0x0, 0x0, 0x0)
VMK uptime: 210:15:14:32.718 TSC: 47315535316217757
cpu#:####)Heartbeat: 469: PCPU 3 didn't have a heartbeat for 3781 seconds. *may* be locked up
cpu#:####)Heartbeat: 469: PCPU 3 didn't have a heartbeat for 7621 seconds. *may* be locked up
cpu#:####)Heartbeat: 469: PCPU 3 didn't have a heartbeat for 15301 seconds. *may* be locked up
Starting coredump to disk Starting coredump to disk Dumping using slot 1 of 1... using slot 1 of 1... log
 
From the preceding example:
  • Identify the physical CPU that is misbehaving. In this example, it is physical CPU 3:

    PCPU 3 locked up.

  • Note the length of time that the system waited for the PCPU to invalidate the TLB:

    cpu#:####)Heartbeat: 469: PCPU 3 didn't have a heartbeat for 3781 seconds. *may* be locked up
    cpu#:####)Heartbeat: 469: PCPU 3 didn't have a heartbeat for 7621 seconds. *may* be locked up
    cpu#:####)Heartbeat: 469: PCPU 3 didn't have a heartbeat for 15301 seconds. *may* be locked up
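
When the extracted log is long, the two items above can be pulled out with a short script. The following is a minimal sketch that assumes the heartbeat lines have the same wording as in this example; the function name longest_heartbeat_gaps is hypothetical, and the pattern may need adjusting for other builds.

```python
import re

# Matches the heartbeat warning text as it appears in the example above.
HEARTBEAT_RE = re.compile(
    r"PCPU (\d+) didn't have a heartbeat for (\d+) seconds"
)


def longest_heartbeat_gaps(log_text):
    """Map each PCPU number to the longest heartbeat gap (seconds) reported."""
    gaps = {}
    for match in HEARTBEAT_RE.finditer(log_text):
        pcpu, seconds = int(match.group(1)), int(match.group(2))
        gaps[pcpu] = max(gaps.get(pcpu, 0), seconds)
    return gaps


sample = """\
cpu#:####)Heartbeat: 469: PCPU 3 didn't have a heartbeat for 3781 seconds. *may* be locked up
cpu#:####)Heartbeat: 469: PCPU 3 didn't have a heartbeat for 7621 seconds. *may* be locked up
cpu#:####)Heartbeat: 469: PCPU 3 didn't have a heartbeat for 15301 seconds. *may* be locked up
"""

for pcpu, seconds in sorted(longest_heartbeat_gaps(sample).items()):
    print(f"PCPU {pcpu}: no heartbeat for {seconds} s")
```

For the sample lines, the script reports PCPU 3 with a longest gap of 15301 seconds.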

Newer releases report the same information in a different format:

  • Example of an ESXi 7.# purple diagnostic screen dump with the same information:
    World: ####: PRDA 0x############ ss 0x0 ds 0x### es 0x### fs 0x### gs 0x###
    World: ####: TR 0xf58 GDT 0x############ (0x###) IDT 0x############ (0x###)
    World: ####: CR0 0x######## CR3 0x###### CR4 0x######
    Backtrace for current CPU #53, worldID=2098409, fp=0x0
    0x453ea749bab0:[0x420023cfef1f]PanicvPanicInt@vmkernel#nover+0x327 stack: 0x453ea749bb88, 0x0, 0x420023cfef1f, 0x0, 0x453ea749bab0
    0x453ea749bb80:[0x420023cff478]Panic_NoSave@vmkernel#nover+0x4d stack: 0x453ea749bbe0, 0x453ea749bba0, 0xcb, 0x64, 0x4
    0x453ea749bbe0:[0x420023d124b8]TLBGetLockedCPUBacktraces@vmkernel#nover+0x269 stack: 0xffffffff00000064, 0x453ea749bf40, 0x1405600, 0x453ea749bec0, 0x42004d405600
    0x453ea749be70:[0x420023d127a8]TLBDoInvalidate@vmkernel#nover+0x231 stack: 0x8d921f, 0x432a55c02f08, 0x432a55c02010, 0x432a55c02f08, 0x1
    0x453ea749bec0:[0x4200240c31a8]UserMem_CartelFlush@vmkernel#nover+0xc1 stack: 0x200, 0x0, 0x0, 0x0, 0x0
    0x453ea749bf70:[0x4200240cef9a]UserMemTouchedEstimationLoop@vmkernel#nover+0xa3 stack: 0x4316d2022000, 0x4316d2001220, 0x0, 0x23d5b234, 0x453ea749f000
    0x453ea749bfe0:[0x420023fb3e29]CpuSched_StartWorld@vmkernel#nover+0x86 stack: 0x0, 0x420023cc4c20, 0x0, 0x0, 0x0
    0x453ea749c000:[0x420023cc4c1f]Debug_IsInitialized@vmkernel#nover+0xc stack: 0x0, 0x0, 0x0, 0x0, 0x0
    VMware ESXi 7.#.# [Releasebuild-########]
    PCPU 100 locked up. Failed to ack TLB invalidate (at least 4 locked up, PCPU(s): 63,76,93,100).
    PCPU(s) did not respond to NMI. Possible hardware problem; contact hardware vendor.


  • Type: Failed to ack TLB invalidate
    Example Error: PCPU 1 locked up. Failed to ack TLB invalidate.
    Description: Physical CPUs fail to acknowledge a request to invalidate their cached page table translations (TLB entries).
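
Both the older and the newer forms of the lock-up message shown in this article can be parsed to list the affected PCPUs. The sketch below assumes the message wording matches the examples above exactly; parse_lockup and nmi_unanswered are hypothetical helper names, not part of any VMware tool.

```python
import re

# Handles "PCPU 3 locked up. Failed to ack TLB invalidate." as well as the
# longer form with "(total of N locked up, PCPU(s): ...)" or
# "(at least N locked up, PCPU(s): ...)".
LOCKUP_RE = re.compile(
    r"PCPU (\d+) locked up\. Failed to ack TLB invalidate"
    r"(?: \((?:total of|at least) (\d+) locked up, "
    r"PCPU\(s\): (\d+(?:,\d+)*)\))?"
)


def parse_lockup(line):
    """Return the reporting PCPU and all listed locked-up PCPUs, or None."""
    match = LOCKUP_RE.search(line)
    if match is None:
        return None
    reported = int(match.group(1))
    if match.group(3):
        locked = [int(n) for n in match.group(3).split(",")]
    else:
        # The short form names only one PCPU.
        locked = [reported]
    return {"reported_pcpu": reported, "locked_pcpus": locked}


def nmi_unanswered(screen_text):
    """True if the screen says the PCPU(s) did not respond to NMI."""
    return "did not respond to NMI" in screen_text


new_style = ("PCPU 100 locked up. Failed to ack TLB invalidate "
             "(at least 4 locked up, PCPU(s): 63,76,93,100).")
old_style = "PCPU 3 locked up. Failed to ack TLB invalidate."

print(parse_lockup(new_style))
print(parse_lockup(old_style))
```

If nmi_unanswered returns True for the screen text, the screen itself points to a possible hardware problem and advises contacting the hardware vendor.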

Additional Information

For additional information about investigating purple diagnostic screen issues, see Interpreting a host purple diagnostic screen.