VM cluster failover and unresponsiveness due to brief NFS storage inaccessibility

Article ID: 438865

Products

VMware vSphere ESXi

Issue/Introduction

  • A VM cluster fails over from one node (e.g., node 001) to another (e.g., node 002).
  • During this event, the guest OS becomes unresponsive, with VMware Tools indicating dropped heartbeats and VM logs confirming the failover event.
  • There are no All Paths Down (APD) events on the ESXi host.
  • There are no link state events on the ESXi host.
  • The vmware.log for each affected VM shows that the guest OS was unresponsive for a period (e.g., 33 seconds), as reported by the VMware Tools heartbeat status:

    In(05) vcpu-1 - Tools: [AppStatus] Last heartbeat value 143091 (last received 33s ago)

  • The vmkernel.log shows the NFS storage becoming inaccessible, which stuns the VMs because they have no disk access, followed by disk access being re-established:

    In(182) vmkernel: cpu30:2104329)NFS: 7038: Status:No connection. Retrying synchronous write I/O 1 of 25 times

    In(182) vmkernel: cpu42:2098487)NFSLock: 1525: Stop accessing fd 0x430c88810d70(vmx-####.vswp)  3
    In(182) vmkernel: cpu42:2098487)NFSLock: 1525: Stop accessing fd 0x430c889d0300(####.vswp)  3
    In(182) vmkernel: cpu42:2098487)NFSLock: 1525: Stop accessing fd 0x430c889d04d0(####.vmx.lck)  3
    In(182) vmkernel: cpu42:2098487)NFSLock: 1525: Stop accessing fd 0x430c88a11360(####-flat.vmdk)  3
    In(182) vmkernel: cpu42:2098487)NFSLock: 1489: Start accessing fd 0x430c889d04d0(####.vmx.lck) again
    In(182) vmkernel: cpu42:2098487)NFSLock: 1489: Start accessing fd 0x430c889d0300(####.vswp) again
    In(182) vmkernel: cpu42:2098487)NFSLock: 1489: Start accessing fd 0x430c88a11360(####-flat.vmdk) again
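To check a log bundle for the two signatures above, the lines can be scanned with a short script. The following is a minimal sketch, not an official tool: the regular expressions are modeled only on the sample lines in this article, and the function names `heartbeat_gaps` and `unrecovered_files` are hypothetical. Real log lines may carry additional prefixes or differ between ESXi builds.

```python
import re

# Patterns modeled on the sample lines in this article (assumption:
# real logs use the same wording; adjust if your build differs).
HEARTBEAT_RE = re.compile(
    r"Tools: \[AppStatus\] Last heartbeat value \d+ \(last received (\d+)s ago\)"
)
NFSLOCK_RE = re.compile(
    r"NFSLock: \d+: (Stop|Start) accessing fd (0x[0-9a-f]+)\((.+?)\)"
)


def heartbeat_gaps(lines):
    """Return each 'last received Ns ago' value, in seconds, in log order."""
    gaps = []
    for line in lines:
        m = HEARTBEAT_RE.search(line)
        if m:
            gaps.append(int(m.group(1)))
    return gaps


def unrecovered_files(lines):
    """Return file names that have a 'Stop accessing' line with no later
    'Start accessing' line for the same file descriptor."""
    stopped = {}
    for line in lines:
        m = NFSLOCK_RE.search(line)
        if not m:
            continue
        action, fd, name = m.groups()
        if action == "Stop":
            stopped[fd] = name
        else:
            stopped.pop(fd, None)
    return sorted(stopped.values())
```

Run against the excerpts above, `heartbeat_gaps` would return the 33-second gap, and `unrecovered_files` would list `vmx-####.vswp`, because the excerpt contains no matching "Start accessing" line for that descriptor. An empty result from `unrecovered_files` on a complete log suggests every file regained access.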

Environment

VMware vSphere ESXi

Cause

The underlying NFS storage backing the VMs becomes inaccessible for a brief period (e.g., approximately 20 seconds). This loss of storage connectivity effectively stuns the VMs, pausing all processing until the storage is accessible again. A delay between the NFS volumes becoming accessible again and the VMs fully recovering from the stunned state is expected, so the guest can appear unresponsive for longer than the storage outage itself.

Resolution

  1. Work with your local storage and network teams to identify why the NFS volume was briefly inaccessible.
  2. Review storage array and network switch logs corresponding to the timestamps of the NFS disconnects, as ESXi logs in this scenario do not show issues with the host NIC links or drivers.
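To give the storage and network teams a precise time range to search, the NFS disconnect messages can be reduced to a single window. The sketch below is illustrative only. It assumes each vmkernel.log line begins with an ISO 8601 UTC timestamp (the excerpts in this article omit it), and the function name `nfs_disconnect_window` and the 60-second padding are hypothetical choices.

```python
import re
from datetime import datetime, timedelta, timezone

# Assumption: lines start with an ISO 8601 UTC timestamp such as
# "2024-05-01T10:15:30.123Z". Only whole seconds are parsed.
LINE_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})\S*\s.*NFS: \d+: Status:No connection"
)


def nfs_disconnect_window(lines, pad_seconds=60):
    """Return (start, end) covering all 'No connection' messages,
    widened by pad_seconds on each side, or None if none are found."""
    times = []
    for line in lines:
        m = LINE_RE.match(line)
        if m:
            stamp = datetime.strptime(m.group(1), "%Y-%m-%dT%H:%M:%S")
            times.append(stamp.replace(tzinfo=timezone.utc))
    if not times:
        return None
    pad = timedelta(seconds=pad_seconds)
    return min(times) - pad, max(times) + pad
```

The padding allows for clock skew between the ESXi host, the storage array, and the switches. Confirm the time zone each device logs in before comparing timestamps.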