File open operations on the NFSv4.1 mount hang indefinitely.
No error message is displayed on the terminal.
VMware vSAN 7.x
The issue occurs because, in rare cases, the Linux NFSv4.1 client does not send RECLAIM_COMPLETE to the NFS server after a server failover.
RFC 5661 (NFSv4.1) requires the client to send this operation. Without it, the server may remain in the grace period for that particular client, so all I/O and locking operations from that client hang indefinitely.
This is not a vSAN File Service issue: the Linux NFS client fails to send RECLAIM_COMPLETE to the NFS server running in the vSAN File Service container.
To confirm that the file open is hung, run a packet capture on the NFS client OS as follows:
(1) Install tcpdump if it is not installed.
(2) Find the network interface to use with tcpdump.
tcpdump -D
The command output lists all network interfaces that tcpdump can capture packets from. Pick the first interface.
(3) Enable rpcdebug
sysctl -w sunrpc.nfs_debug=1023
rpcdebug -s all -m nfs
rpcdebug -s all -m rpc
(4) Run the tcpdump command to capture packets to and from each File Service container IP.
tcpdump -C 1024 -W 5 -G 60000 -s 350 -i <network interface> host <one container IP> -w <packet capture file>
For example: tcpdump -C 1024 -W 5 -G 60000 -s 350 -i ens32 host 10.x.x.53 -w /sdb/tcpdump_container_10.x.x.53.pcap
(5) Stop the packet capture and disable rpcdebug.
killall tcpdump
rpcdebug -c all -m nfs
rpcdebug -c all -m rpc
Note: Ensure there is enough disk space for the packet capture files. Without enough space, the packet capture files will roll over. If File Service is configured with several container IPs, run one tcpdump command per container IP.
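When several container IPs are configured, the per-IP commands from step (4) can be generated rather than typed by hand. The following is a minimal sketch, not part of the original procedure. The function name print_capture_cmds is hypothetical, and the interface, output directory and IP addresses are placeholders to be replaced with real values. It only prints the commands so they can be reviewed before being run.

```shell
#!/bin/sh
# Hypothetical helper: print (do not run) one tcpdump command per container IP.
# Arguments: <network interface> <output directory> <container IP>...
print_capture_cmds() {
    iface=$1
    outdir=$2
    shift 2
    for ip in "$@"; do
        # Same options as step (4); only the host filter and output file vary.
        printf 'tcpdump -C 1024 -W 5 -G 60000 -s 350 -i %s host %s -w %s/tcpdump_container_%s.pcap\n' \
            "$iface" "$ip" "$outdir" "$ip"
    done
}
```

For example, print_capture_cmds ens32 /sdb <container IP 1> <container IP 2> prints two tcpdump command lines, one per IP. Each printed command can then be started in its own terminal or in the background.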
Things to notice in the packet capture.
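Given the cause described above, one thing worth checking is whether any RECLAIM_COMPLETE operation appears in the capture after the failover. The sketch below assumes that tshark (the Wireshark command-line tool) is installed on the machine where the capture is analyzed, and that the Wireshark NFS dissector exposes the operation number as the nfs.opcode field. The value 58 is the operation number RFC 5661 assigns to RECLAIM_COMPLETE. The function name count_reclaim_complete is hypothetical.

```shell
#!/bin/sh
# Display filter for RECLAIM_COMPLETE (operation 58 in RFC 5661).
# Assumption: the Wireshark NFS dissector field is named nfs.opcode.
RECLAIM_COMPLETE_FILTER='nfs.opcode == 58'

# Hypothetical helper: count frames in a capture file that match the filter.
count_reclaim_complete() {
    pcap=$1
    # -r reads a saved capture, -Y applies a display filter;
    # tshark prints one line per matching frame, which wc -l counts.
    tshark -r "$pcap" -Y "$RECLAIM_COMPLETE_FILTER" 2>/dev/null | wc -l
}
```

A count of zero for a capture that spans the failover would be consistent with the cause described above, but it is not proof on its own: the capture may have started too late or rolled over.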
The workaround is to unmount and then remount the file share on the client.
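The unmount and mount sequence can be sketched as follows. This is an illustration only: the function name remount_share is hypothetical, the export path and mount point are placeholders, and the mount options should match those used for the original mount (vers=4.1 is assumed here). Setting DRY_RUN to any non-empty value prints the commands instead of running them.

```shell
#!/bin/sh
# Hypothetical helper: unmount and remount an NFSv4.1 share.
# Arguments: <server>:/<share path> <local mount point>
remount_share() {
    export_path=$1
    mount_point=$2
    # When DRY_RUN is set and non-empty, prefix each command with echo.
    run=${DRY_RUN:+echo}
    $run umount "$mount_point" || return 1
    # Assumption: the share was originally mounted with vers=4.1.
    $run mount -t nfs -o vers=4.1 "$export_path" "$mount_point"
}
```

If a plain umount itself hangs because of the stuck operations, umount also has -f (force) and -l (lazy) options. Whether they are appropriate depends on the workload, so treat them as options to evaluate rather than as part of the documented workaround.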