NFS connectivity issues on NetApp NFS filers on ESXi 7.x 8.x

Products

VMware vSphere ESXi

Issue/Introduction

This article provides steps to workaround the issue if unable to unmount NFS datastore in ESXi 7.x 8.x.

When using NFS datastores on some NetApp NFS filer models on an ESXi host, the following symptoms may appear.

The NFS datastores appear to be unavailable (grayed out) in vCenter Server, or when accessed through the vSphere Client
The NFS shares reappear after few minutes
Virtual machines located on the NFS datastore are in a hung/paused state when the NFS datastore is unavailable
This issue is most often seen after a host upgrade to ESXi 5.x or the addition of an ESXi 5.x host to the environment
ESXi - /var/log/vmkernel.log

NFSLock: 515: Stop accessing ## ######### #
NFS: 283: Lost connection to the server ###.###.###.# mount point /vol/datastore##, mounted as ########-########-####-############ ("datastore##")
NFSLock: 477: Start accessing ## ######### again
NFS: 292: Restored connection to the server ###.###.###.# mount point /vol/datastore##, mounted as ########-########-####-############ ("datastore##")

<YYYY-MM-DD>T<time> Z cpu2:8194)StorageApdHandler: 277: APD Timer killed for ident [########-########]
</time><YYYY-MM-DD>T<time></time> 607Z cpu2:8194)StorageApdHandler: 402: Device or filesystem with identifier [########-########] has exited the All Paths Down state.
<YYYY-MM-DD>T<time></time> Z cpu2:8194)StorageApdHandler: 902: APD Exit for ident [########-########]!
<YYYY-MM-DD>T<time></time> Z cpu16:8208)NFSLock: 570: Start accessing ## ############## again
<YYYY-MM-DD>T<time></time> Z cpu2:8194)WARNING: NFS: 322: Lost connection to the server ##.##.##.# mount point /vol/nfsexamplevolume, mounted as ########-########-####-############ ("NFS_EXAMPLE_VOLUME")
<YYYY-MM-DD>T<time></time> Z cpu2:8194)WARNING: NFS: 322: Lost connection to the server ##.##.##.# mount point /vol/nfsexamplevolume2, mounted as ########-########-####-############ ("NFS_EXAMPLE_VOLUME2")

ESXi - /var/log/vobd.log

<YYYY-MM-DD>T<time></time> Z: [vmfsCorrelator] ###############: [esx.problem.vmfs.nfs.server.disconnect] ###.###.###.# /vol/datastore## ########-########-####-############ volume-name:datastore##
<YYYY-MM-DD>T<time></time> Z: [vmfsCorrelator] ##############: [esx.problem.vmfs.nfs.server.restored] ###.###.###.# /vol/datastore## ########-########-####-############ volume-name:datastore##
When examining a packet trace from the VMkernel port used for NFS, zero window TCP segments may be seen originating from the NFS filer in Wireshark:

No Time Source Destination Protocol Length Info
###### ###.###### ##.#.#.## ##.#.#.## RPC 574 [TCP ZeroWindow] Continuation
###### ###.###### ##.#.#.## ##.#.#.## TCP 1514 [TCP ZeroWindow] [TCP segment of a reassembled PDU]
Host may disconnect in the environment.

NFS Queue depth settings :

|----Option Name..................................MaxQueueDepth
|----Current Value................................4294967295
|----Default Value................................4294967295
|----Min Value....................................1
|----Max Value....................................4294967295
|----Hidden.......................................false
|----Parent......................................./NFS/
|----Path........................................./NFS/MaxQueueDepth

Environment

VMware vSphere ESXi 7.x

VMware vSphere ESXi 8.x

Resolution

Workaround 1

To workaround this issue and prevent it from occurring, reduce the NFS.MaxQueueDepth advanced parameter to a much lower value. This reduces or eliminates the disconnections.

When sufficiently licensed, utilize the Storage I/O Control feature to work around the issue. An Enterprise Plus license for all ESXi hosts is required to use this feature.

When Storage I/O Control is enabled, it dynamically sets the value of MaxQueueDepth, circumventing the issue.

Workaround 2

To set the NFS.MaxQueueDepth advanced parameter using the vSphere Client:

Click the host in the Hosts and Clusters view.
Click the Configuration tab. Under Software, click Advanced Settings.
Click NFS, then scroll down to NFS.MaxQueueDepth.
Change the value to 64.
Click OK.
Reboot the host for the change to take effect.

To set the NFS.MaxQueueDepth advanced parameter via the command line:

SSH to the ESXi host via root.
Run the command:

esxcfg-advcfg -s 64 /NFS/MaxQueueDepth
Reboot the host for the change to take effect.
After the host reboots, run this command to confirm the change:

esxcfg-advcfg -g /NFS/MaxQueueDepth

Value of MaxQueueDepth is 64

Note: VMware suggests a value of 64. If this is not sufficient to stop the disconnects, further reduce the value by half. For example, change the value to 32 or 16 accordingly until the disconnects cease.

Additional Information

Troubleshooting NFS datastore connectivity issues
Configuring Flow Control on VMware ESXi and VMware ESX