ESXi host fails with a diagnostic screen due to an Intel Virtualization Technology Erratum

Article ID: 313365

Updated On:

Products

VMware vSphere ESXi

Issue/Introduction

This article describes the cause of the problem and the required fix: a BIOS update or replacement of the faulty VT-d unit.


Symptoms:
  • The ESXi host fails with a purple diagnostic screen due to an Intel Virtualization Technology for Directed I/O erratum in systems with Intel Xeon Processors
  • The backtrace contains entries similar to:

    cpu2:32999)0x4390c119b660:[0x4180163128c3]VTDQISync@vmkernel#nover+0xf7 stack: 0x1
    cpu2:32999)0x4390c119b6a0:[0x4180163137b2]VTDIRWriteIRTE@vmkernel#nover+0x8e stack: 0x2e
    cpu2:32999)0x4390c119b6d0:[0x418016313895]VTDIRSteerVector@vmkernel#nover+0x61 stack: 0x43004d129f10
    cpu2:32999)0x4390c119b700:[0x4180162e96c9]IOAPICSteerVector@vmkernel#nover+0x59 stack: 0x1c00
    cpu2:32999)0x4390c119b740:[0x418016057514]IntrCookie_SetDestination@vmkernel#nover+0x174 stack: 0x4
 
  • The following message appears on the diagnostic screen and in vmkernel.log:

    @BlueScreen: IOMMU unit 1 did not complete processing of a queued invalidation wait descriptor after 8 secs. This is likely caused by a known VT-d hardware erratum with a BIOS work-around. Please update the machine's $
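
To check a saved copy of vmkernel.log for this signature, something like the following sketch may help. It is not a VMware tool: the function name `find_iommu_psod` is hypothetical, and the pattern is built only from the message text quoted above, so it assumes other builds word the message the same way.

```python
import re

# Pattern derived from the message quoted in this article (assumption:
# the wording is the same on the ESXi build that produced the log).
PSOD_RE = re.compile(
    r"IOMMU unit (\d+) did not complete processing of a queued "
    r"invalidation wait descriptor after (\d+) secs"
)


def find_iommu_psod(log_text):
    """Return a list of (unit_number, timeout_seconds) for each matching line."""
    hits = []
    for line in log_text.splitlines():
        match = PSOD_RE.search(line)
        if match:
            hits.append((int(match.group(1)), int(match.group(2))))
    return hits


# Example using the message text from this article.
sample = (
    "@BlueScreen: IOMMU unit 1 did not complete processing of a queued "
    "invalidation wait descriptor after 8 secs."
)
for unit, secs in find_iommu_psod(sample):
    print(f"IOMMU unit {unit}: invalidation wait timed out after {secs} s")
```

The unit number reported here is the one to look up in the boot-time VT-d unit listing when identifying the affected hardware.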


Environment

VMware vSphere ESXi 5.0
VMware ESXi 3.5.x Installable
VMware ESXi 4.0.x Embedded
VMware vSphere ESXi 5.5
VMware ESXi 4.1.x Embedded
VMware vSphere ESXi 6.5
VMware vSphere ESXi 6.0
VMware ESXi 3.5.x Embedded
VMware vSphere ESXi 5.1
VMware ESXi 4.1.x Installable
VMware ESXi 4.0.x Installable

Cause

The purple diagnostic screens are rare but can occur in platforms based on these processors:
  • Intel Xeon Gold Processor 61xx Series
  • Intel Xeon Processor 55xx Series
  • Intel Xeon Processor 56xx Series
  • Intel Xeon Processor 65xx Series
  • Intel Xeon Processor 75xx Series
  • Intel Xeon Processor E5-1400 v2 Product Family
  • Intel Xeon Processor E5-1600 v2 Product Family
  • Intel Xeon Processor E5-1600 v3 Product Family
  • Intel Xeon Processor E5-2400 Product Family
  • Intel Xeon Processor E5-2400 v2 Product Family
  • Intel Xeon Processor E5-2600 Product Family
  • Intel Xeon Processor E5-2600 v2 Product Family
  • Intel Xeon Processor E5-2600 v3 Product Family
  • Intel Xeon Processor E5-2600 v4 Product Family
  • Intel Xeon Processor E5-4600 Product Family
  • Intel Xeon Processor E5-4600 v2 Product Family
  • Intel Xeon Processor E5-4600 v3 Product Family
  • Intel Xeon Processor E5-4600 v4 Product Family
  • Intel Xeon Processor E7-2800 Product Family
  • Intel Xeon Processor E7-4800 Product Family
  • Intel Xeon Processor E7-8800 Product Family
  • Intel Xeon Processor E7-8800/4800/2800 v2 Product Families
  • Intel Xeon Processor E7-8800/4800 v3 Product Families
  • Intel Xeon Processor E7-8800/4800 v4 Product Families
The cause of this problem is an Intel IOMMU (VT-d) erratum that causes the IOMMU to stop processing IOTLB invalidation requests submitted by the ESXi host to the IOMMU. This is a non-recoverable condition that causes the ESXi host to fail.
 
Most Intel Xeon processor-based platforms incorporate a workaround for the erratum in the machine's BIOS. Platforms that do not incorporate it require a BIOS update, as described in the Resolution section.

Public information about the erratum (BT98) for the Intel Xeon E5 processor family (Sandy Bridge-EN/EP) can be found at Intel Xeon Processor E5 Family: Spec Update.

Information about this erratum in other processor families is not public. Please contact the machine's manufacturer or Intel for support.

Resolution

VMware recommends contacting the hardware manufacturer for an updated BIOS or possible workarounds.

A possible workaround is to replace the CPU in the socket that owns the faulty VT-d unit. To help identify that socket, refer to the boot or vmkernel logs, which report the base address of each VT-d unit along with its unit number. For example:

    VT-d unit 0: segment 0000 base 0xdc7fc000 ver 1:0 cap 0x8d2078c106f0466 ecap 0xf020df
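
When a host reports many VT-d units, the unit numbers and base addresses can be pulled out of a saved log with a short script. The sketch below is illustrative only: `parse_vtd_units` is a hypothetical helper, and it assumes the log lines follow the format of the example above. Mapping a base address to a physical CPU socket is platform-specific and still requires the hardware manufacturer's documentation.

```python
import re

# Matches the "VT-d unit N: segment SSSS base 0x..." prefix shown in this
# article (assumption: other builds log the same layout).
VTD_UNIT_RE = re.compile(
    r"VT-d unit (?P<unit>\d+): segment (?P<segment>[0-9a-fA-F]+) "
    r"base (?P<base>0x[0-9a-fA-F]+)"
)


def parse_vtd_units(log_text):
    """Map each VT-d unit number to (PCI segment, register base address)."""
    units = {}
    for match in VTD_UNIT_RE.finditer(log_text):
        segment = int(match.group("segment"), 16)
        base = int(match.group("base"), 16)
        units[int(match.group("unit"))] = (segment, base)
    return units


# Example using the log line from this article.
sample = (
    "VT-d unit 0: segment 0000 base 0xdc7fc000 ver 1:0 "
    "cap 0x8d2078c106f0466 ecap 0xf020df"
)
for unit, (segment, base) in sorted(parse_vtd_units(sample).items()):
    print(f"unit {unit}: segment {segment:04x}, base {base:#x}")
```

The unit number in the diagnostic-screen message ("IOMMU unit 1", for instance) can then be matched against this table to find the base address to quote to the hardware manufacturer.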
 
Note: A prior version of this KB article recommended that customers experiencing the problem described above work around it by configuring ESXi to disable the Intel VT-d interrupt remapper (setting boot option iovDisableIR=TRUE and rebooting). VMware ESXi 5.5 p10, 6.0 p04, 6.0 U3 and 6.5 by default disable the Intel VT-d interrupt remapper for this purpose.
 
VMware has recently received several reports indicating that disabling the Intel VT-d interrupt remapper causes ESXi host failures on HPE Gen8 platforms; see ESXi host fails with intermittent NMI purple diagnostic screen on HP ProLiant Gen8 servers (2149043). VMware no longer recommends disabling the Intel VT-d interrupt remapper to work around the Intel VT-d erratum described in this article. Instead, VMware recommends applying the fix for the erratum in the BIOS, as described in the Intel specification updates for the affected processors.


Additional Information

ESXi host fails with a diagnostic screen due to an Intel Virtualization Technology erratum (Chinese-language version of this article)
ESXi host fails with intermittent NMI PSOD on HP ProLiant Gen8 servers