Extremely slow NFS performance

Article ID: 398697


Updated On:

Products

VMware NSX

Issue/Introduction

NFS is a network storage method that moves data over the network.  NFS clients need to be able to read and write data over the network to the NFS filer, so the performance of the network has a direct impact on the NFS storage array's performance.  Ideally this traffic should be entirely layer 2 (all IPs in the same subnet) to avoid the added latency of routing (layer 3).
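A quick way to gauge the network latency an NFS client experiences is to time a TCP connect to the filer's NFS port (2049). The sketch below is illustrative; the hostname is hypothetical and should be run from the NFS client itself.

```python
import socket
import time

def connect_latency_ms(host: str, port: int = 2049, timeout: float = 2.0) -> float:
    """Time a TCP handshake to the NFS port as a rough round-trip estimate."""
    start = time.perf_counter()
    with socket.create_connection((host, port), timeout=timeout):
        pass  # connection established; we only care about the elapsed time
    return (time.perf_counter() - start) * 1000.0

# Hypothetical filer name; substitute the real NFS server address.
# print(f"{connect_latency_ms('nfs-filer.example.com'):.2f} ms")
```

A layer-2-adjacent filer should show sub-millisecond values; anything consistently higher points at routing or physical-network delay.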

This article assumes that the Edge is healthy and not experiencing performance issues.  It focuses on how to find where the latency is being introduced.

Environment

VMware NSX

Cause

NSX leverages the physical network infrastructure to forward packets, the same as any other network-connected component.
NSX uses a virtual networking topology described as an overlay.
The overlay is supported by a network path consisting of NSX Edge devices and ESXi hosts (Host Transport Nodes).
Collectively these nodes are known as Transport Nodes.
The overlay relies on connectivity between the nodes through their tunnel end points (TEPs).  The Geneve protocol encapsulates packets sent from virtual machines communicating over the overlay networks.  These tunnels exist only between NSX Transport Node TEP interfaces.  The job of the TEP is Geneve encapsulation and decapsulation.
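Because every overlay packet carries an extra Geneve wrapper, the underlay MTU must be larger than the VM-facing MTU. The following sketch adds up the per-layer header sizes (fixed sizes per the Geneve specification, RFC 8926; the variable-length options field is shown as an assumption) to compute the minimum underlay MTU:

```python
# Header sizes in bytes for a Geneve-encapsulated frame.
OUTER_IPV4 = 20   # outer IPv4 header
OUTER_UDP = 8     # outer UDP header (Geneve uses destination port 6081)
GENEVE_BASE = 8   # fixed Geneve header (options are variable-length)
INNER_ETH = 14    # inner Ethernet header carried as Geneve payload

def required_underlay_mtu(inner_mtu: int, geneve_opt_len: int = 0) -> int:
    """Minimum underlay IP MTU for a given inner (VM-facing) MTU."""
    return inner_mtu + INNER_ETH + GENEVE_BASE + geneve_opt_len + OUTER_UDP + OUTER_IPV4

# A standard 1500-byte inner MTU needs at least 1550 bytes on the underlay,
# which is why NSX requires a minimum MTU of 1600 on TEP uplinks.
print(required_underlay_mtu(1500))  # 1550
```

An undersized underlay MTU causes fragmentation or drops, which shows up as exactly the kind of storage slowness this article describes.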


The overlay data path has inherent latency that should be minimal (< 2 ms); every component that processes the packet introduces some delay.
Geneve packets exist only between the Transport Nodes.  Once a TEP receives a packet destined for a workload behind it, the outer Geneve wrapper is removed (decapsulation) and the packet is again the original source packet.

Packets must leave the ESXi hosts in order to reach other TEP interfaces.  At that point they are exposed to delays introduced by the physical infrastructure: physical switches, routers, and the geographical distance between sites.


The source is sending packets to an outside destination; this is called North/South traffic.  The packet traverses the Transport Node TEPs and then finally exits the Edge to the physical network.

  • Steps 1-3: the packet is a normal TCP packet.
  • Step 4: the packet is encapsulated per the Geneve protocol for transport within the Transport Zone (TEP-to-TEP communication).
  • Step 7: the Edge receives the packet and decapsulates it.
  • Step 8: the packet leaves the Edge on its way to the physical network.  The next hop is a physical network component.
  • Step 9: the packet is routed to the second hop in the physical network.
  • Step 10: the packet reaches the physical switch connected to the NFS storage array.

Each hop adds network delay, and the delay accumulates the further the packet travels from the source.
NSX is a virtualization of the network that still has to use the physical network to move between ESXi hosts.
Traffic egressing the virtual network leaves through the ESXi host running the Edge device, then traverses the physical network to the destination.
A traceroute from a virtual machine to an external destination shows the hops of the data path along with the latency of each hop in ms.
Use this to determine where bottlenecks are occurring.
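To pinpoint the slow link, look for the largest *jump* in round-trip time between consecutive hops rather than the largest absolute value. The sketch below parses saved traceroute output; the sample text and addresses are hypothetical:

```python
import re

# Hypothetical traceroute output saved from a VM on the overlay.
sample = """\
 1  10.10.10.1      0.412 ms  0.398 ms  0.401 ms
 2  169.254.0.2     0.655 ms  0.640 ms  0.702 ms
 3  169.254.0.3     0.811 ms  0.790 ms  0.805 ms
 4  192.168.50.1   48.210 ms 47.995 ms 49.102 ms
 5  192.168.60.10  48.900 ms 48.750 ms 49.001 ms
"""

def worst_hop(output: str):
    """Return (hop number, IP, latency increase) for the biggest RTT jump."""
    hops = []
    for line in output.splitlines():
        m = re.match(r"\s*(\d+)\s+(\S+)\s+(.*)", line)
        if not m:
            continue
        rtts = [float(x) for x in re.findall(r"([\d.]+)\s*ms", m.group(3))]
        if rtts:
            hops.append((int(m.group(1)), m.group(2), sum(rtts) / len(rtts)))
    # The largest increase over the previous hop points at the slow link.
    deltas = [(h, ip, rtt - (hops[i - 1][2] if i else 0.0))
              for i, (h, ip, rtt) in enumerate(hops)]
    return max(deltas, key=lambda t: t[2])

hop, ip, delta = worst_hop(sample)
print(f"largest latency jump at hop {hop} ({ip}): +{delta:.1f} ms")
```

In this hypothetical output, hop 4 adds roughly 47 ms over hop 3, so the link between those two devices is where to investigate.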




Resolution


The task is to locate the latency that is causing slow NFS storage performance.  High latency has been identified as the issue.
The traceroute above exposed hop 4 as having the highest latency.
Hop 3 has an IP in the 169.254.0.0/16 range, indicating that this is an intra-tier Edge hop (Edge device).
The customer has identified hop 4 as a physical router; this is the next hop north of the NSX Edge device.
All other hop IPs after that are between the physical router and the destination NFS storage array.

The component inducing the high latency is therefore outside of NSX; the NSX latency values are normal for this system.
The latency between the physical router and the subsequent hops on the way to the NFS storage array is normal.
The latency between the Edge and the physical router is the issue.

The resolution is to use IP addresses for the client and NFS exporter interfaces that are in the same subnet.  Keeping this communication in layer 2 switching removes the routing latency.  In general, it is best practice to configure NFS storage so that routing is not needed and the least amount of network latency is experienced.  Once the latency has been identified as originating from physical network components, VMware is no longer responsible for performance; careful design of NFS storage traffic is the responsibility of the enterprise.
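Checking whether the client and the filer actually share a subnet is straightforward with Python's standard `ipaddress` module. The addresses and prefix length below are hypothetical:

```python
import ipaddress

def same_subnet(client_ip: str, filer_ip: str, prefix: int) -> bool:
    """True if both addresses fall in the same network for the given prefix."""
    net = ipaddress.ip_network(f"{client_ip}/{prefix}", strict=False)
    return ipaddress.ip_address(filer_ip) in net

# Same /24: traffic stays in layer 2 switching.
print(same_subnet("192.168.50.25", "192.168.50.200", 24))  # True
# Different /24s: every NFS packet crosses a router.
print(same_subnet("192.168.50.25", "192.168.60.200", 24))  # False
```

If the check returns False, re-addressing the client or the filer interface so both sit in one subnet eliminates the routed hop and its latency.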