vHBAs and other PCI devices may stop responding in ESXi 6.0.x, ESXi 5.x and ESXi/ESX 4.1 when using Interrupt Remapping

Article ID: 338436

Updated On: 11-12-2024

Products

VMware vSphere ESXi

Issue/Introduction

Symptoms:
When using Interrupt Remapping on some servers, you may experience these symptoms on ESXi 6.0.x, ESXi 5.x and ESXi/ESX 4.1 hosts:
  • ESXi hosts are non-responsive
  • Virtual machines are non-responsive
  • HBAs stop responding
  • Other PCI devices stop responding
  • You may receive "Degraded path for an Unknown Device" alerts in vCenter Server
  • You may see an illegal vector error in the VMkernel or messages logs shortly before an HBA stops responding to the driver. The error is similar to:

    ALERT: APIC: 1823: APICID 0x00000000 - ESR = 0x40

    Note:
    This issue only applies if you see this specific alert in the vmkernel or messages log files. If you do not see this message, you are not experiencing this issue.

  • For systems with QLogic HBA cards, the VMkernel or messages logs show that a card has stopped responding to commands:

    vmkernel: 6:01:42:36.189 cpu15:4274)<6>qla2xxx 0000:1a:00.0: qla2x00_abort_isp: **** FAILED ****
    vmkernel: 6:01:47:36.383 cpu14:4274)<4>qla2xxx 0000:1a:00.0: Failed mailbox send register test


  • The VMkernel or messages logs show the QLogic HBA card is offline:

    vmkernel: 6:01:47:36.383 cpu14:4274)<4>qla2xxx 0000:1a:00.0: ISP error recovery failed - board disabled

  • For systems with Emulex HBA cards, the VMkernel or messages logs show a card has stopped responding to commands:

    vmkernel: 6:22:52:00.983 cpu0:4684)<3>lpfc820 0000:15:00.0: 0:(0):2530 Mailbox command x23 cannot issue Data: xd00 x2
    vmkernel: 6:22:52:32.408 cpu0:4684)<3>lpfc820 0000:15:00.0: 0:0310 Mailbox command x5 timeout Data: x0 x700 x0x4100a2811820
    vmkernel: 6:22:52:32.408 cpu0:4684)<3>lpfc820 0000:15:00.0: 0:0345 Resetting board due to mailbox timeout
    vmkernel: 6:22:53:02.416 cpu2:4684)<3>lpfc820 0000:15:00.0: 0:2813 Mgmt IO is Blocked d00 - mbox cmd 5 still active
    vmkernel: 6:22:53:02.416 cpu2:4684)<3>lpfc820 0000:15:00.0: 0:(0):2530 Mailbox command x23 cannot issue Data: xd00 x2
    vmkernel: 6:22:53:33.833 cpu0:4684)<3>lpfc820 0000:15:00.0: 0:0310 Mailbox command x5 timeout Data: x0 x700


  • For systems with LSI1064 series (LSI1064, LSI1064E) or LSI1068E series SCSI controllers, if the ESXi host is connected to internal disks, the /var/log/vmkernel.log file shows errors similar to:

    ScsiDeviceIO: 2316: Cmd(0x41240074e3c0) 0x1a, CmdSN 0x12ee to dev "mpx.vmhba0:C0:T0:L0" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x20 0x0.
    ScsiDeviceIO: 2316: Cmd(0x41240074e3c0) 0x4d, CmdSN 0x12f1 to dev "mpx.vmhba1:C0:T8:L0" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x35 0x1.


  • For systems with Megaraid 8480 SAS SCSI controllers, the VMkernel or messages logs show the controller has stopped responding to commands:

    vmkernel: 12:14:17:35.206 cpu15:4247)megasas: ABORT sn 94489613 cmd=0x2a retries=0 tmo=0
    vmkernel: 12:14:17:35.206 cpu15:4247)<5>0 :: megasas: RESET sn 94489613 cmd=2a retries=0
    vmkernel: 12:14:17:35.206 cpu4:4435)WARNING: LinScsi: SCSILinuxQueueCommand: queuecommand failed with status = 0x1055 Host Busy vmhba0:2:0:0 (driver name: LSI Logic SAS based Mega RAID driver)


Note: These log excerpts are examples. Dates, times, and environment-specific values may vary depending on your environment.
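
Because the issue applies only when the APIC alert is present, it can help to search the log for that line before changing any settings. The following is a minimal sketch, not a VMware tool: find_apic_alert is a hypothetical helper name, and the pattern assumes the number after "APIC:" (1823 in the example above) can differ between builds.

```shell
#!/bin/sh
# Sketch: print any illegal-vector APIC alert lines found in a log file.
# find_apic_alert is a hypothetical helper name, not part of ESXi.
# Exit status follows grep: 0 if at least one alert line matched, 1 if none.
find_apic_alert() {
    log_file="$1"
    # Matches lines such as:
    #   ALERT: APIC: 1823: APICID 0x00000000 - ESR = 0x40
    # Assumption: the number after "APIC:" and the APIC ID may vary.
    grep -E 'ALERT: APIC: [0-9]+: APICID 0x[0-9A-Fa-f]+ - ESR = 0x40' "$log_file"
}
```

For example, running find_apic_alert /var/log/vmkernel.log prints the matching lines if the alert was logged. On hosts that write to the messages log instead, pass that file's path.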






Environment

VMware vSphere ESXi 5.1
VMware vSphere ESXi 6.0
VMware vSphere ESXi 5.5

Cause

ESXi/ESX 4.1 introduced interrupt remapping code, which is enabled by default in that release and in later versions. Interrupt remapping is a technology introduced by the vendor for more efficient IRQ routing, which should improve performance. However, this code is incompatible with some servers.

Note: If this issue occurs in the PCI device from which the ESXi/ESX host boots (either locally or using SCSI/RAID), or when the host boots from SAN using an iSCSI/FC HBA, the APIC errors are not logged. To troubleshoot the issue in this case, enable and configure remote syslog logging. For more information, see Configuring syslog on ESXi 5.0 (2003322). Alternatively, you can test for this issue by disabling IRQ remapping.

Resolution

Several server vendors have released fixes in the form of Server BIOS updates. Contact your server vendor to see if they have a fix available. For IBM models, including but not limited to the IBM BladeCenter HS22 series and System x3400/x3500 and x3600 series systems, see the IBM Knowledge Base article MIGR-5086606 for a firmware update and additional information.

Note: The preceding link was correct as of March 10, 2015. If you find the link is broken, provide feedback and a VMware employee will update the link.

If a firmware fix is not available, work around this issue by disabling interrupt remapping on your ESXi/ESX 4.1, ESXi 5.x, or ESXi 6.0.x host, then reboot the host to apply the setting.

ESXi 4.1

To disable interrupt remapping on ESXi 4.1, use one of these options:
  • Run this command from a console or SSH session to disable interrupt remapping:

    # esxcfg-advcfg -k TRUE iovDisableIR

    To back up the current configuration, run this command twice:

    # auto-backup.sh

    Note: It must be run twice to save the change.

    Reboot the ESXi/ESX host:

    # reboot

    To verify that interrupt remapping is disabled after the reboot, run the command:

    # esxcfg-advcfg -j iovDisableIR

    iovDisableIR=TRUE


  • In the vSphere Client:

    1. Click Configuration > (Software) Advanced Settings > VMkernel.
    2. Select VMkernel.Boot.iovDisableIR, then click OK.
    3. Reboot the ESXi/ESX host.

ESXi 5.x and ESXi 6.0.x

ESXi 5.x and ESXi 6.0.x do not provide this parameter as a configurable option in the GUI client. It can only be changed using the esxcli command or through PowerCLI.

  • To disable interrupt remapping using the esxcli command:

    List the current setting by running the command:

    # esxcli system settings kernel list -o iovDisableIR

    You see output similar to:

    Name          Type  Description                              Configured  Runtime  Default
    ------------  ----  ---------------------------------------  ----------  -------  -------
    iovDisableIR  Bool  Disable Interrupt Routing in the IOMMU   FALSE       FALSE    FALSE


    Disable interrupt remapping on the host using this command:

    # esxcli system settings kernel set --setting=iovDisableIR -v TRUE

    Reboot the host after running the command.

    Note: If the hostd service has failed or is not running, the esxcli command does not work. In that case, you may have to use localcli instead. However, changes made with localcli do not persist across reboots. To make the change persistent, repeat it with the esxcli command after the host reboots and the hostd service starts responding.

  • To disable interrupt remapping through PowerCLI:

    Note: The PowerCLI commands do not work with ESXi 5.1. You must use the esxcli commands as detailed above.

    PowerCLI> Connect-VIServer -Server xx.xx.xx.xx -User Administrator -Password passwd
    PowerCLI> $myesxcli = Get-EsxCli -VMHost xx.xx.xx.xx
    PowerCLI> $myesxcli.system.settings.kernel.list($false, 'iovDisableIR')

    Configured : FALSE
    Default : FALSE
    Description : Disable Interrrupt Routing in the IOMMU
    Name : iovDisableIR
    Runtime : FALSE
    Type : Bool

    PowerCLI> $myesxcli.system.settings.kernel.set("iovDisableIR","TRUE")
    true

    PowerCLI> $myesxcli.system.settings.kernel.list($true, 'iovDisableIR')

    Configured : TRUE
    Default : FALSE
    Description : Disable Interrrupt Routing in the IOMMU
    Name : iovDisableIR
    Runtime : FALSE
    Type : Bool


  • After the host has finished booting, you see this entry in the /var/log/boot.gz log file confirming that interrupt remapping has been disabled:

    TSC: 543432 cpu0:0)BootConfig: 419: iovDisableIR = TRUE
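
The Configured and Runtime columns shown earlier can also be used to tell whether the change has taken effect or is still waiting for a reboot. The following is a minimal sketch under stated assumptions: iov_reboot_state is a hypothetical helper name, it expects the table layout shown above on standard input, and it assumes the last three fields of the iovDisableIR row are Configured, Runtime, and Default, in that order.

```shell
#!/bin/sh
# Sketch: interpret "esxcli system settings kernel list -o iovDisableIR" output.
# iov_reboot_state is a hypothetical helper name, not part of ESXi.
# It reads the table on stdin and prints one of:
#   reboot-pending      Configured and Runtime differ (host not yet rebooted)
#   remapping-disabled  iovDisableIR is TRUE at runtime
#   remapping-enabled   iovDisableIR is FALSE at runtime
# It exits with status 1 if no iovDisableIR row is found.
iov_reboot_state() {
    awk '
        $1 == "iovDisableIR" {
            # The Description column contains spaces, so count fields from the end.
            configured = $(NF - 2)
            runtime = $(NF - 1)
            found = 1
            if (configured != runtime)  print "reboot-pending"
            else if (runtime == "TRUE") print "remapping-disabled"
            else                        print "remapping-enabled"
        }
        END { exit(found ? 0 : 1) }
    '
}
```

For example, piping the list command into the helper (esxcli system settings kernel list -o iovDisableIR | iov_reboot_state) reports reboot-pending after the setting has been changed but before the host has been rebooted.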


Additional Information