After a SAN switch failure, some VMs report I/O failures, although active paths remain available

Article ID: 433891

Updated On:

Products

VMware vSphere ESXi

Issue/Introduction

  • A SAN switch suffers a kernel panic or other failure, which causes the loss of half of the paths to the storage array

  • Active paths remain available to all devices, and failover to these paths completes successfully

  • Some VM applications are impacted due to I/O failures 

Environment

VMware vSphere ESXi (all versions)

Cause

  • This occurs when I/O through the affected SAN switch begins to fail some time before the switch fails completely.


    /var/log/vmkernel.log may report:

1) Communication failures:

vmkernel: cpu5:2097670)ScsiDeviceIO: 4670: Cmd(0x45db26c52400) 0x89, cmdId.initiator=0x430b88fdce40 CmdSN 0xb70dbd from world 2097224 to dev "naa.######################" failed H:0x5 D:0x0 P:0x0 Cancelled from NMP layer.


"H:0x0 D:0x2 P:0x0 Valid sense data: 0x2 0x8 0x0" (device status 0x2 is CHECK CONDITION; sense key 0x2 with ASC/ASCQ 0x8/0x0 indicates a logical unit communication failure)

2) I/O aborts, e.g.:
vmkernel: cpu5:2097670)ScsiDeviceIO: 4670: Cmd(0x45db26c52400) 0x89, cmdId.initiator=0x430b88fdce40 CmdSN 0xb70dbd from world 2097224 to dev "naa.######################" failed H:0x5 D:0x0 P:0x0 Cancelled from NMP layer.

3) In addition, a datastore heartbeat timeout may be reported in /var/log/vobd.log:
vobd[2097763]  [vmfsCorrelator] 13674480723148us: [vob.vmfs.heartbeat.recovered] Reclaimed heartbeat for volume <vmfs-UUID> (<datastoreName>): [Timeout] [HB state abcdef02 offset 3145728 gen 29 stampUS 13674480722461 uuid <uuid> jrnl <FB 13> drv 24.82]

4) Subsequently, a Registered State Change Notification (RSCN) may confirm the switch failure and loss of paths:
vmkernel: cpu32:2098470)lpfc: lpfc_els_rcv_rscn:7907: vmhba2 0214 RSCN received Data: x800220 x0 x4 x1 


Note: This is sample logging. The logging pattern will vary in specific instances.  
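
To judge how long I/O was failing before the switch went down, it can help to tally the failed commands per device and host status. The following Python is a minimal sketch that assumes log lines in the same shape as the ScsiDeviceIO samples above. The function names, the regular expression, and the status-name table are illustrative assumptions, not part of any VMware tool. The table is a subset of commonly documented host status values and should be checked against current documentation.

```python
import re
from collections import Counter

# Assumption: subset of commonly documented SCSI host status values ("H:" field).
HOST_STATUS = {
    0x0: "OK",
    0x1: "NO_CONNECT",
    0x2: "BUS_BUSY",
    0x3: "TIMEOUT",
    0x5: "ABORT",
    0x7: "ERROR",
    0x8: "RESET",
}

# Matches the device name and the H:/D:/P: status triple of a failed command.
FAILURE_RE = re.compile(
    r'to dev "(?P<dev>[^"]+)" failed '
    r'H:(?P<host>0x[0-9a-fA-F]+) '
    r'D:(?P<device>0x[0-9a-fA-F]+) '
    r'P:(?P<plugin>0x[0-9a-fA-F]+)'
)


def parse_failure(line):
    """Return the device and status codes of a failed command, or None."""
    match = FAILURE_RE.search(line)
    if match is None:
        return None
    host = int(match.group("host"), 16)
    return {
        "dev": match.group("dev"),
        "host": host,
        "device": int(match.group("device"), 16),
        "plugin": int(match.group("plugin"), 16),
        "host_name": HOST_STATUS.get(host, "UNKNOWN"),
    }


def count_failures(lines):
    """Count failed commands per (device, host status name)."""
    counts = Counter()
    for line in lines:
        record = parse_failure(line)
        if record is not None:
            counts[(record["dev"], record["host_name"])] += 1
    return counts
```

Passing the lines of a copy of vmkernel.log to count_failures gives a per-device view of how many commands failed. Lines that do not match the pattern, such as the RSCN message, are skipped.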

Resolution

If I/O begins failing some time before the SAN switch fails completely, the impact might be mitigated by disabling the failing paths that pass through that switch, so that I/O is issued only on the remaining healthy paths.

See Disabling Storage Paths
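
As a hedged illustration of disabling paths from the command line, the Python sketch below only builds the `esxcli storage core path set --state=off` command for each path behind one adapter. It does not run anything. The function name and the example runtime names are hypothetical. Which adapter is connected to the failing switch must be confirmed from the fabric layout before any path is disabled.

```python
def build_disable_commands(path_names, failing_adapter):
    """Build esxcli argument lists that set matching paths to the 'off' state.

    path_names: path runtime names, e.g. "vmhba2:C0:T0:L1" (hypothetical values).
    failing_adapter: the HBA connected to the failing switch, e.g. "vmhba2".
    """
    # The trailing colon prevents "vmhba2" from also matching "vmhba20".
    prefix = failing_adapter + ":"
    return [
        ["esxcli", "storage", "core", "path", "set",
         "--state=off", "--path=" + name]
        for name in path_names
        if name.startswith(prefix)
    ]


if __name__ == "__main__":
    # Hypothetical runtime names, for illustration only.
    paths = ["vmhba2:C0:T0:L1", "vmhba3:C0:T0:L1", "vmhba20:C0:T1:L1"]
    for command in build_disable_commands(paths, "vmhba2"):
        print(" ".join(command))
```

Printing the commands rather than executing them leaves room to review the list first. A path can be re-enabled later with the same command and `--state=active`.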