Understanding the storage path failover sequence in VMware ESXi native multipathing

Article ID: 321364

Products

VMware vSphere ESXi

Issue/Introduction

This article describes the VMware ESXi native multipathing storage failover sequence as it is logged in /var/log/vmkernel.log and /var/log/messages on the ESXi host.

Note: This document pertains specifically to storage path failover as implemented in the VMware multipathing module, the Native Multipathing Plug-in (NMP). For information about third-party multipathing modules, refer to your vendor's documentation.

Environment


VMware vSphere ESXi 7.0

VMware vSphere ESXi 8.0

 

Resolution

Note: The example scenario in this article uses a software iSCSI initiator and a LUN with the identifier naa.################################.
 
The VMware ESXi storage multipathing failover sequence is:
  1. The connection along a given path is detected as down or offline. For example:

    vmkernel: 188:04:24:16.970 cpu8:4288)WARNING: iscsi_vmk: iscsivmk_StopConnection: vmhba33:CH:0 T:1 CN:0: iSCSI connection is being marked "OFFLINE"
     
  2. The ESXi host stops its iSCSI session. For example:

    vmkernel: 188:04:24:16.970 cpu8:4288)WARNING: iscsi_vmk: iscsivmk_StopConnection: Sess [ISID: 00023d000001 TARGET: iqn.1992-04.com.emc:cx.############.b1 TPGT: 2 TSIH: 0]
    vmkernel: 188:04:24:16.970 cpu8:4288)WARNING: iscsi_vmk: iscsivmk_StopConnection: Conn [CID: 0 L: #.#.#.#:50439 R: #.#.#.#:3260]
     
  3. As a result of stopping that session, the iSCSI task is aborted. For example:

    vmkernel: 188:04:24:16.970 cpu11:4288)WARNING: iscsi_vmk: iscsivmk_TaskMgmtIssue: vmhba33:CH:0 T:1 L:14 : Task mgmt "Abort Task" with itt=0x5155cba9 (refITT=0x5155cb93) timed out.
     
  4. The Native Multipathing Plug-in (NMP) detects a host status of 0x1 because the in-flight command failed. A host status of 0x1 translates to NO_CONNECT. For more information, see SCSI events that can trigger ESX server to fail a LUN over to another path. The failed command is logged as follows, for example:

    vmkernel: 188:04:24:16.970 cpu1:4286)NMP: nmp_CompleteCommandForPath: Command 0x28 (0x41000716a200) to NMP device "naa.################################" failed on physical path "vmhba33:C0:T1:L7" H:0x1 D:0x0 P:0x0 Possible sense data: 0x2 0x3a 0x1.
     
  5. Once the NMP receives this host status, it sends a TEST_UNIT_READY (TUR) command down that path to confirm that the path is down before initiating a failover. For example:

    vmkernel: 188:04:24:16.970 cpu1:4286)WARNING: NMP: nmp_DeviceRetryCommand: Device "naa.################################": awaiting fast path state update for failover with I/O blocked. No prior reservation exists on the device.
     
  6. If the TUR command also fails, the ESXi host's Path Selection Policy (PSP) activates the next path for the device (LUN). For example:

    vmkernel: 188:04:24:16.989 cpu1:4131)vmw_psp_mru: psp_mruSelectPathToActivateInt: Changing active path from vmhba33:C0:T1:L7 to vmhba33:C0:T0:L7 for device "naa.################################".
     
  7. The log line in step 6 indicates that the path change was successful. The NMP then retries the queued commands down the new path so that they complete successfully despite the failover. For example:

    vmkernel: 188:04:24:17.974 cpu8:4247)WARNING: NMP: nmp_DeviceAttemptFailover: Retry world failover device "naa.################################" - issuing command 0x41000716a200.
     
  8. The initial commands may not complete immediately after failover (for example, if the LUN still has pending reservations). The ESXi host sends a LUN reset if there is a pending SCSI reservation against the device or LUN. This breaks the SCSI-2 based reservation held by the previous initiator so that the ESXi host can resume I/O after failover. For example:

    vmkernel: 188:04:24:17.974 cpu12:4108)WARNING: NMP: nmp_CompleteRetryForPath: Retry command 0x28 (0x41000716a200) to NMP device "naa.################################" failed on physical path "vmhba33:C0:T0:L7" H:0x0 D:0x2 P:0x0 Valid sense data: 0x6 0x29 0x0

    This translates to:

    Host Status = 0x0 = OK
    Device Status = 0x2 = Check Condition
    Plugin Status = 0x0 = OK
    Sense Key = 0x6 = UNIT ATTENTION
    Additional Sense Code/ASC Qualifier = 0x29/0x0 = POWER ON OR RESET OCCURRED

     
  9. At this stage, the ESXi host can retry the next command in the queue:

    Sep 10 13:11:18 laesx01 vmkernel: 188:04:24:17.974 cpu12:4108)WARNING: NMP: nmp_CompleteRetryForPath: Retry world on with device "naa.################################" - retry the next command in retry queue
    Sep 10 13:11:18 laesx01 vmkernel: 188:04:24:17.974 cpu12:4108)ScsiDeviceIO: 747: Command 0x28 to device "naa.################################" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x6 0x29 0x0.
    Sep 10 13:11:18 laesx01 vmkernel: 188:04:24:17.974 cpu11:4247)WARNING: NMP: nmp_DeviceAttemptFailover: Retry world failover device "naa.################################" - issuing command 0x41000706fa00
     
  10. A message similar to the following indicates that the path failover was successful and that commands can complete over the new path:

    vmkernel: 188:04:24:17.975 cpu12:4108)NMP: nmp_CompleteRetryForPath: Retry world recovered device "naa.################################"
     
  11. Finally, because this is a software iSCSI-based example, you also see the session marked "ONLINE" again:

    vmkernel: 188:04:24:20.405 cpu9:4288)WARNING: iscsi_vmk: iscsivmk_StartConnection: vmhba33:CH:0 T:1 CN:0: iSCSI connection is being marked "ONLINE"
    vmkernel: 188:04:24:20.405 cpu9:4288)WARNING: iscsi_vmk: iscsivmk_StartConnection: Sess [ISID: 00023d000001 TARGET: iqn.1992-04.com.emc:cx.############.b1 TPGT: 2 TSIH: 0]
    vmkernel: 188:04:24:20.405 cpu9:4288)WARNING: iscsi_vmk: iscsivmk_StartConnection: Conn [CID: 0 L: #.#.#.#:52160 R: #.#.#.#:3260]
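
When reviewing a long vmkernel log, it can help to tag each line with the step of the sequence above that it corresponds to. The following Python sketch does this by matching substrings taken from the example log lines in this article. The names STAGE_MARKERS, classify, and path_change are hypothetical helpers, not part of any ESXi tooling, and other ESXi builds may word these messages differently.

```python
import re

# Substrings copied from the example log lines in this article.
# Assumption: the message wording is the same on the host being examined.
STAGE_MARKERS = [
    ('marked "OFFLINE"', "step 1: connection marked offline"),
    ("nmp_CompleteCommandForPath", "step 4: command failed on path"),
    ("awaiting fast path state update", "step 5: I/O blocked pending failover"),
    ("Changing active path", "step 6: active path changed"),
    ("nmp_DeviceAttemptFailover", "step 7: retrying queued command"),
    ("Retry world recovered device", "step 10: failover recovered"),
    ('marked "ONLINE"', "step 11: connection marked online"),
]

# Matches the PSP message shown in step 6.
PATH_CHANGE = re.compile(
    r'Changing active path from (?P<old>\S+) to (?P<new>\S+) '
    r'for device "(?P<device>[^"]+)"'
)


def classify(line):
    """Return the failover step label for a log line, or None if unrecognized."""
    for marker, stage in STAGE_MARKERS:
        if marker in line:
            return stage
    return None


def path_change(line):
    """Return old path, new path, and device from a step 6 line, else None."""
    match = PATH_CHANGE.search(line)
    return match.groupdict() if match else None
```

To use it, read the log line by line (for example, from a copy of /var/log/vmkernel.log) and print only the lines for which classify returns a label. For Fibre Channel, the step 1 and step 11 markers are not expected to match.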

Note: The storage stack handles failover identically for Fibre Channel (FC), so this sequence also applies to FC, with the exception of the iSCSI-specific steps 1, 2, 3, and 11.
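
The H:/D:/P: status fields and sense data shown in steps 4 and 8 can be decoded programmatically. The following Python sketch is a minimal example that translates only the codes that appear in this article, plus SCSI device status 0x0 (GOOD), and reports anything else as unknown. The function name decode_status and the lookup tables are hypothetical. The sketch also assumes that sense bytes should be interpreted only when the log line says "Valid sense data" and not when it says "Possible sense data".

```python
import re

# Only codes discussed in this article (plus SCSI status 0x0 = GOOD).
HOST_STATUS = {0x0: "OK", 0x1: "NO_CONNECT"}
DEVICE_STATUS = {0x0: "GOOD", 0x2: "CHECK CONDITION"}
PLUGIN_STATUS = {0x0: "OK"}
SENSE_KEY = {0x6: "UNIT ATTENTION"}
ASC_ASCQ = {(0x29, 0x0): "POWER ON OR RESET OCCURRED"}

HEX = r"(0x[0-9a-fA-F]+)"
STATUS_RE = re.compile(
    rf"H:{HEX} D:{HEX} P:{HEX} (Valid|Possible) sense data: {HEX} {HEX} {HEX}"
)


def _name(table, code):
    """Look up a code, falling back to an explicit 'unknown' label."""
    return table.get(code, f"UNKNOWN ({code:#x})")


def decode_status(line):
    """Decode the status fields of an NMP or ScsiDeviceIO log line.

    Returns None if the line carries no H:/D:/P: status fields.
    """
    match = STATUS_RE.search(line)
    if match is None:
        return None
    host, device, plugin = (int(match.group(i), 16) for i in (1, 2, 3))
    key, asc, ascq = (int(match.group(i), 16) for i in (5, 6, 7))
    result = {
        "host": _name(HOST_STATUS, host),
        "device": _name(DEVICE_STATUS, device),
        "plugin": _name(PLUGIN_STATUS, plugin),
        "sense_key": None,
        "asc_ascq": None,
    }
    # Assumption: interpret sense bytes only when the log marks them "Valid".
    if match.group(4) == "Valid":
        result["sense_key"] = _name(SENSE_KEY, key)
        result["asc_ascq"] = ASC_ASCQ.get(
            (asc, ascq), f"UNKNOWN ({asc:#x}/{ascq:#x})"
        )
    return result
```

Applied to the step 8 log line, this yields the same translation listed in that step. Applied to the step 4 line, it reports a host status of NO_CONNECT and leaves the sense fields empty.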