Linux virtual machines with NFSv3 mounts experience an operating system hang after more than 15 minutes outage on the upstream datapath

Article ID: 341224


Products

VMware NSX Networking

Issue/Introduction





Symptoms:
  • After an upstream data path failure lasting more than 15 minutes, applications that access an NFSv3 mount hang (hard-mounted NFSv3) or receive an error (soft-mounted NFSv3).
  • Virtual machine performance degrades progressively the longer the virtual machine remains in the NFS hung state.
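
Because the symptom differs between hard and soft mounts, it can help to confirm how each NFSv3 mount on the virtual machine is configured. The following is a minimal sketch, not part of the original article: the function name nfs3_mount_modes is hypothetical, and it assumes the usual /proc/mounts layout (device, mount point, file system type, options) in which NFSv3 mounts appear with type nfs and a vers=3 option. A mount without the soft option is reported as hard, which is the NFS client default.

```shell
# Sketch only: list NFSv3 mounts and whether each is hard or soft.
# Reads /proc/mounts-style lines on stdin; the function name is hypothetical.
nfs3_mount_modes() {
    awk '$3 == "nfs" && $4 ~ /(^|,)(nfs)?vers=3(,|$)/ {
        # "hard" is the default when "soft" is not listed in the options.
        mode = ($4 ~ /(^|,)soft(,|$)/) ? "soft" : "hard"
        print $2, mode
    }'
}
```

On the affected virtual machine, the function could be run as: nfs3_mount_modes < /proc/mounts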


Environment

VMware NSX for vSphere 6.2.x
VMware NSX for vSphere 6.1.x
VMware NSX for vSphere 6.0.x

Cause

This issue occurs when an upstream connectivity outage lasts longer than 15 minutes and the sunrpc/NFSv3 package attempts to reuse the source and destination ports of the previous connection without first sending an RST or FIN packet.

Because of this, the TCP flow is not expired in the NSX Distributed Firewall (DFW), causing it to reject any new SYNs sent on the same source/destination port.

It is recommended to open a support request with your Linux vendor pertaining to the TCP behavior in the sunrpc/NFSv3 packages. For reference, see [PATCH 00/11] Fix TCP connection port number reuse in NFSv3.


Resolution

This is a known issue affecting VMware NSX for vSphere 6.x.

Currently, there is no resolution.

To verify that you are experiencing this issue:
  1. On the virtual machine that is experiencing the issue, run the netstat -n | grep 2049 command and look for connections in the SYN_SENT state.

    For example:

    netstat -n |grep 2049

    tcp 0 0 172.16.12.3:936 172.16.12.2:2049 SYN_SENT


  2. On the ESXi host where the virtual machine resides, run this command to find the associated DFW filter name for the virtual machine/virtual NIC:

    # summarize-dvfilter | less

    You see output similar to:

    world 59441 vmm0:test1 vcUuid:'50 38 e0 a0 97 e2 9f 2f-b7 78 34 92 c3 74 61 36'
    port 67108909 test1.eth0
    vNic slot 2
    name: nic-59441-eth0-vmware-sfw.2
    agentName: vmware-sfw
    state: IOChain Attached
    vmState: Detached
    failurePolicy: failClosed
    slowPathID: none
    filter source: Dynamic Filter Creation


  3. Dump the current flows for the filter. If an existing flow is in place for the same source port/destination port pair seen in step 1, the issue is present:

    [root@host:~] vsipioctl vsipfwcli -f nic-59441-eth0-vmware-sfw.2 -c 'describe connection stat;'
    0x5608efe80000e0c3 af 2 ethertype 0x0800 proto tcp 172.16.10.1:41510 -> 172.16.12.3:80 e6 24a 5 3
    0x5608efe80000e0c4 af 2 ethertype 0x0800 proto tcp 172.16.10.1:41512 -> 172.16.12.3:80 e6 24a 5 3
    0x5608efe80000e0c5 af 2 ethertype 0x0800 proto tcp 172.16.10.1:41518 -> 172.16.12.3:80 e6 29a 5 5
    0x5608efe80000e0c6 af 2 ethertype 0x0800 proto tcp 172.16.12.3:936 -> 172.16.12.2:2049 55e8 44288 92 d4
    0x5608efe80000e0c7 af 2 ethertype 0x0800 proto tcp 172.16.10.1:41520 -> 172.16.12.3:80 10e 272 6 4

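The comparison between steps 1 and 3 can be sketched as a pair of shell helpers. This is an illustration only, not a VMware-supplied tool: the function names are hypothetical, and it assumes the netstat output from the virtual machine and the flow dump from the ESXi host have each been saved to a text file in the formats shown above and copied to one machine that has awk and grep.

```shell
# Sketch only: cross-check saved output from step 1 against step 3.
# Function names and file names are hypothetical.

# Print "local:port remote:port" for every TCP connection to port 2049
# that is in the SYN_SENT state. Reads `netstat -n` output on stdin.
nfs_syn_sent_pairs() {
    awk '$1 ~ /^tcp/ && $NF == "SYN_SENT" && $5 ~ /:2049$/ { print $4, $5 }'
}

# Print DFW flow lines that match a "src dst" pair read on stdin.
# $1 is a file holding the flow dump captured in step 3.
stale_dfw_flows() {
    flows_file=$1
    while read -r src dst; do
        [ -n "$src" ] && [ -n "$dst" ] || continue
        # Surrounding spaces keep 936 from matching 9360, and so on.
        grep -F -- " $src -> $dst " "$flows_file" || true
    done
}
```

For example, with the step 1 output saved as vm-netstat.txt and the step 3 output saved as host-flows.txt (both names are placeholders), the check would be: nfs_syn_sent_pairs < vm-netstat.txt | stale_dfw_flows host-flows.txt. Any line printed is a tracked flow that still occupies the port pair the NFS client is trying to reuse.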
To work around this issue, use one of these options:
  • Add the affected virtual machine to the NSX Manager exclusion list, and then remove it from the exclusion list. This removes all currently tracked flows and re-applies the firewall rules. For more information, see the Exclude Virtual Machines from Firewall Protection section in the NSX Administration Guide.
  • Add all Linux virtual machines that use NFS to the exclusion list, or disable the NSX Distributed Firewall on the cluster where these virtual machines reside.
  • Implement NFSv4, which properly sends an RST when tearing down the established connection before moving to the SYN_SENT state.
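
For the NFSv4 option, the client-side change is usually a matter of the mount options. The following is a minimal sketch under stated assumptions: the function name nfs4_fstab_line is hypothetical, the server, export path, and mount point are placeholders supplied by the caller, and the NFS server must already export the file system over NFSv4. Some older distributions expect the file system type nfs4 instead of the vers=4 option, so check the nfs man page for your release.

```shell
# Sketch only: print an /etc/fstab line that mounts an export over NFSv4.
# Arguments (server, export path, mount point) are placeholders.
nfs4_fstab_line() {
    server=$1
    export_path=$2
    mount_point=$3
    printf '%s:%s %s nfs vers=4,hard 0 0\n' "$server" "$export_path" "$mount_point"
}
```

For example, nfs4_fstab_line nfs01.example.com /export/data /mnt/data prints a line that can be reviewed and then added to /etc/fstab in place of the existing NFSv3 entry.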

Note: The preceding link was correct as of October 27, 2015. If you find the link is broken, please provide feedback and a VMware employee will update the link.


Additional Information

This is an issue that pertains to the TCP behaviour of the Linux sunrpc / NFSv3 packages.
