LUNs attached to VMware vSphere 6.0 hosts may remain in APD Timeout state after paths have recovered

Article ID: 317972

Products

VMware vSphere ESXi

Issue/Introduction

Symptoms:
  • When an APD event occurs, LUNs connected to ESXi may remain inaccessible after paths to the LUNs recover.
  • The 140-second APD timeout expires even when paths to storage are recovered.
  • In the /var/log/vmkernel.log file, you experience these events in sequence:
     
    1. Device enters APD.
    2. Device exits APD.
    3. Heartbeat recovery and filesystem operations on the device fail with timeout, not found, or busy errors.
    4. The APD timeout expires even though the device previously exited APD.
       
  • This condition is associated with one or more of these behaviors:
     
    • Virtual machines become inaccessible.
    • Hosts become unresponsive.
    • Storage is not online, even though paths are up and available.
    • Datastore disappears from the vSphere Client, even when virtual machines on that datastore remain.
       
  • An APD event can be triggered by one or more of these events. This list is not exhaustive:
     
    • Failures of upstream Fibre Channel or Ethernet switches that affect all paths to the storage array
    • Storage array failure or reboot
    • Storage array firmware updates (some vendors)
Important: Not all APD events exhibit this behavior. In most cases, LUNs and datastores exit the APD condition normally and as expected.
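
The log sequence described above (device enters APD, device exits APD, APD timeout expires anyway) can be checked for mechanically. The following is a minimal Python sketch, not a VMware tool. The three message patterns are assumptions about the wording in /var/log/vmkernel.log and should be adjusted to match the lines actually logged on the affected host; the function name find_suspect_devices is hypothetical.

```python
import re

# ASSUMPTION: these fragments approximate the vmkernel.log wording.
# Verify them against real log lines before relying on the result.
PATTERNS = [
    ("enter", re.compile(r"\[(?P<dev>[^\]]+)\] has entered the All Paths Down state")),
    ("exit", re.compile(r"\[(?P<dev>[^\]]+)\] has exited the All Paths Down state")),
    ("timeout", re.compile(r"APD Timeout for ident \[(?P<dev>[^\]]+)\]")),
]


def find_suspect_devices(lines):
    """Return devices whose APD timeout was logged after an APD exit.

    A timeout that directly follows an APD entry is a normal APD timeout
    and is not reported. Only the enter, exit, timeout order is flagged.
    """
    last_event = {}  # device identifier -> last APD event seen
    suspects = []
    for line in lines:
        for name, pattern in PATTERNS:
            match = pattern.search(line)
            if match is None:
                continue
            dev = match.group("dev")
            if name == "timeout" and last_event.get(dev) == "exit":
                if dev not in suspects:
                    suspects.append(dev)
            last_event[dev] = name
            break  # at most one APD event per log line
    return suspects


if __name__ == "__main__":
    # Read a copy of the log taken from the host; the path is an example.
    with open("vmkernel.log", encoding="utf-8", errors="replace") as handle:
        for device in find_suspect_devices(handle):
            print(device)
```

A device printed by this sketch only matches the event order listed in the symptoms. It does not by itself confirm that the host is affected by this issue.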

Environment

VMware vSphere ESXi 6.0

Cause

This issue occurs due to a fault in APD handling. When this issue occurs, LUN paths are available and online during an APD event, but the APD timer continues counting up until the LUN enters the APD Timeout state. After the initial APD event, the datastore remains inaccessible as long as active workloads are associated with the datastore.
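
The difference between expected and faulty behavior can be illustrated with a toy model. This is a sketch for reasoning about the symptom only, not ESXi code. The function name, its parameters, and the state labels are hypothetical; the 140-second value is the timeout stated in the symptoms.

```python
APD_TIMEOUT_SECONDS = 140  # APD timeout value stated in the symptoms


def device_state(now, apd_start, recovered_at=None, cancel_on_recovery=True):
    """Toy model of the APD timer (all times in seconds).

    cancel_on_recovery=True models the expected behavior: once paths
    recover, the timer stops and the device is online again.
    cancel_on_recovery=False models the fault: the timer ignores the
    recovery and keeps counting until the timeout expires.
    """
    recovered = recovered_at is not None and recovered_at <= now
    if recovered and cancel_on_recovery:
        return "ONLINE"
    if now - apd_start >= APD_TIMEOUT_SECONDS:
        return "APD_TIMEOUT"
    return "APD"


# Paths drop at t=0 and recover at t=30; state is sampled at t=200.
print(device_state(200, 0, recovered_at=30, cancel_on_recovery=True))
print(device_state(200, 0, recovered_at=30, cancel_on_recovery=False))
```

In this model the first call reports the device as online, while the second reports the APD Timeout state even though the paths recovered well before 140 seconds had elapsed, which is the behavior this article describes.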

Resolution

This issue is resolved in ESXi 6.0 Update 1, available at VMware Downloads. For more information, see the VMware ESXi 6.0 Update 1 Release Notes.
 
If you are unable to upgrade, there are no workarounds that can guarantee that this issue is not encountered during an APD event. However, there are two workarounds to restore production should this issue occur.
 
To work around the issue, use one of these options:


Additional Information

For more information regarding APD events, see:
    Storage device has entered the All Paths Down state
    All Paths Down timeout for a storage device has expired
    Storage device has recovered from the APD state
    This article is also available in Simplified Chinese and Japanese.

Impact/Risks

  • When this issue is encountered, virtual machines must be terminated to recover the datastore.
    • HA, if enabled, should recover these virtual machines on other hosts.
  • If management agents must be restarted, the host temporarily loses manageability through vCenter Server.