Storage path failover failure during TUR command retries on ESXi

Article ID: 345231

Updated On:

Products

VMware vCenter Server
VMware vSphere ESXi

Issue/Introduction

An ESXi host does not initiate a storage path failover despite path instability when the Test Unit Ready (TUR) command repeatedly returns retry requests. This behavior results in:

  • Storage paths appearing intermittently as "dead" or "offline" without automated recovery.
  • High latency or performance degradation for latency-sensitive virtual machines, such as database servers.
  • Virtual machine tasks failing with errors such as "Unable to communicate with the remote host."
  • Log entries in /var/log/vmkernel.log or hostd.log showing:
    • SCSI_HOST_BUS_BUSY 0x02
    • SCSI_HOST_SOFT_ERROR 0x0b
    • SCSI_HOST_RETRY 0x0c
    • VMK_STORAGE_RETRY_OPERATION
    • NMP device "naa.XXX" state in doubt; requested fast path state update
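To check a host for the signatures listed above, a small helper along these lines may be useful. This is a minimal sketch: scan_retry_signatures is a hypothetical function name, and it assumes the strings appear in the log literally as written in this article. It only reads the log and counts matching lines.

```shell
#!/bin/sh
# Hypothetical helper (not an ESXi tool): count log lines that contain
# each retry-related signature listed in this article.
#   $1 = log file to scan (defaults to /var/log/vmkernel.log)
scan_retry_signatures() {
    logfile="${1:-/var/log/vmkernel.log}"
    if [ ! -r "$logfile" ]; then
        echo "cannot read $logfile" >&2
        return 1
    fi
    for pattern in \
        'SCSI_HOST_BUS_BUSY' \
        'SCSI_HOST_SOFT_ERROR' \
        'SCSI_HOST_RETRY' \
        'VMK_STORAGE_RETRY_OPERATION' \
        'state in doubt; requested fast path state update'
    do
        # grep -c prints the number of matching lines; -F treats the
        # pattern as a fixed string. grep exits non-zero on zero matches,
        # so "|| true" keeps the loop going.
        count=$(grep -c -F -- "$pattern" "$logfile") || true
        printf '%s\t%s\n' "$count" "$pattern"
    done
}
```

Calling scan_retry_signatures with no argument scans /var/log/vmkernel.log; pass another path (for example a copy of hostd.log) to scan that file instead. Non-zero counts only indicate that the signatures are present, not that this issue is the cause.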

Environment

  • VMware ESXi 5.5, 6.x, 7.x, and 8.x
  • Fibre Channel (FC) or iSCSI storage environments
  • Affected storage arrays, including EMC Symmetrix (SYMM), XtremIO, and other ALUA-compliant arrays.

Cause

When a storage path experiences an issue, the ESXi host sends a TUR command to confirm the path status. If the storage array or controller returns a retry operation request (e.g., due to a busy host bus), the ESXi host continues to retry the command indefinitely instead of marking the path as dead and triggering a failover to a healthy path.
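The behavior described above can be pictured as a small decision table. The sketch below is an illustrative model only, not ESXi internals: tur_outcome, its arguments, and its output labels are hypothetical names chosen for this example, and the "fail" case is inferred from the description rather than stated in it.

```shell
#!/bin/sh
# Illustrative model only (hypothetical names, not ESXi code).
#   $1 = TUR result:                  ok | retry | fail
#   $2 = action_OnRetryErrors state:  on | off
tur_outcome() {
    case "$1:$2" in
        ok:on|ok:off)     echo "path-active" ;;
        # Without the option, a retry response leads to another TUR
        # rather than a path state change.
        retry:off)        echo "retry-tur" ;;
        # With the option, a retry response is treated as a path failure.
        retry:on)         echo "mark-path-dead-and-failover" ;;
        # Inferred: a definitive failure marks the path dead either way.
        fail:on|fail:off) echo "mark-path-dead-and-failover" ;;
        *)                echo "unknown"; return 1 ;;
    esac
}
```

In this model, the only row that changes with the setting is the retry response, which is the case this article addresses.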

Resolution

To resolve this issue, enable the action_OnRetryErrors option. This configuration allows the ESXi host to mark a problematic path as dead when retry errors occur, which triggers an immediate failover to an alternative working path.

IMPORTANT: Always test configuration changes in a non-production environment first. Verify with your storage vendor that action_OnRetryErrors is recommended for your specific array firmware.

To configure the option, perform these steps:

  1. Enable the option for a specific device: Run the following command, replacing naa.### with your specific device identifier:

     # esxcli storage nmp satp generic deviceconfig set -c enable_action_OnRetryErrors -d naa.###

  2. Verify the configuration: Check the status of the device configuration by running:

     # esxcli storage nmp device list -d naa.###

     The output should show the option in the Config line:

     Config: {navireg ipfilter action_OnRetryErrors}

  3. Disable the option (if needed): To revert the change, run:

     # esxcli storage nmp satp generic deviceconfig set -c disable_action_OnRetryErrors -d naa.###
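When several devices need the same change, it can help to generate the commands first and review them before running anything. The following is a minimal sketch under that assumption: retry_action_cmds is a hypothetical helper, the device identifiers in the usage line are placeholders, and the esxcli syntax is copied from the enable and disable steps in this procedure. The function only prints commands and does not execute them.

```shell
#!/bin/sh
# Hypothetical helper: print (do not run) the enable or disable command
# for each device identifier given.
#   $1    = enable | disable
#   $2... = device identifiers (for example naa.###)
retry_action_cmds() {
    action="$1"
    case "$action" in
        enable|disable) shift ;;
        *)
            echo "usage: retry_action_cmds enable|disable DEVICE..." >&2
            return 1
            ;;
    esac
    for dev in "$@"; do
        echo "esxcli storage nmp satp generic deviceconfig set -c ${action}_action_OnRetryErrors -d $dev"
    done
}
```

For example, retry_action_cmds enable naa.AAA naa.BBB prints one command per device. After review, the printed lines can be run on the host one at a time.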

Note: The enable_action_OnRetryErrors configuration is persistent across host reboots.
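Because the setting persists, it can be worth confirming its state after a change or a reboot. The sketch below reads the output of the device list command on standard input and reports whether the option appears to be enabled. retry_action_state is a hypothetical helper. It assumes the option is shown either as a bare token on a Config line, as in the verification output in this article, or in an action_OnRetryErrors=on / action_OnRetryErrors=off form. The exact format may differ by SATP and ESXi version, so check it against real output from your host.

```shell
#!/bin/sh
# Hypothetical helper: read "esxcli storage nmp device list -d <device>"
# output on stdin and print "enabled" or "disabled".
# Assumption: the option appears on a line containing "Config:" either as
# a bare token or as action_OnRetryErrors=on / action_OnRetryErrors=off.
retry_action_state() {
    # Keep only Config lines that mention the option; empty if none do.
    config=$(grep 'Config:' | grep 'action_OnRetryErrors') || true
    case "$config" in
        '')                         echo "disabled" ;;
        *action_OnRetryErrors=off*) echo "disabled" ;;
        *)                          echo "enabled" ;;
    esac
}
```

A typical invocation would be: esxcli storage nmp device list -d naa.### | retry_action_state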

Additional Information

Non Disruptive Upgrade (NDU) of EMC XtremeIO Arrays results in LUN connectivity lost

Understanding the storage path failover sequence in VMware ESXi native multipathing