Poor Network Throughput Between Virtual Machines on the Same ESX Server Machine

Article ID: 339778

Products

VMware vSphere ESXi

Issue/Introduction

I've observed poor network throughput between virtual machines on the same ESX Server machine.

Throughput seems worse than between the virtual machines and the physical machines to which they connect. How should I troubleshoot this?


Environment

VMware ESX Server 2.1.x
VMware ESX Server 2.0.x
VMware ESX Server 2.5.x

Resolution

Network throughput depends on a number of factors that vary across applications and hardware platforms.

Note: If a networking-intensive application does not require multiple processors, a single-processor virtual machine is recommended, as this allows ESX Server to make more efficient use of the underlying physical machine.
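
For example, the number of virtual processors is normally set in the virtual machine's .vmx (configuration) file. The numvcpus parameter shown below is an assumption for this ESX release; confirm the parameter name against the documentation for your version:

numvcpus=1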

Throughput between a virtual machine and a physical machine can be higher than the throughput between virtual machines, especially when the virtual machine is the sender.

In some cases, low throughput between virtual machines on the same ESX Server machine may be caused by TCP flow control misfiring. This situation can be identified by observing virtual network card activity as follows:

  1. Use the following command at the ESX Server service console to monitor the proc node periodically:
    watch cat /proc/vmware/net/vmnic0/stats
    Note: Change the 0 in the example above to match the vmnic you are monitoring. (A filtered variant of this command appears after this list.)
  2. While the watch is running, start the network application(s) in the affected guest operating systems.
  3. Examine the value of the RxQOv counter in the line labeled Remote.
  4. To stop the watch, press Ctrl-C.
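
As a convenience, step 1 can be narrowed down so that only the header line and the Remote line (where RxQOv appears) are displayed. The command below is a variation on the procedure above; the 5-second refresh interval and the grep filter are additions here, not part of the original steps:

    watch -n 5 "grep -E 'pkts|Remote' /proc/vmware/net/vmnic0/stats"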

Another way to collect this information is to create a small script (note the backticks "`" in the echo statement):

while true; do
  echo ------------ `/bin/date --rfc-822` >> /tmp/vmnic-log.txt
  # display the header line
  cat /proc/vmware/net/vmnic0/stats | grep pkts >> /tmp/vmnic-log.txt
  # display the data line
  cat /proc/vmware/net/vmnic0/stats | grep Remote >> /tmp/vmnic-log.txt
  sleep 5
done

Open a second console window and type tail -f /tmp/vmnic-log.txt to view the file as it's written. Cancel the script by pressing Ctrl-C in the original console window.
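
Once the script has been running for a while, one simple way to review the log is to filter it down to the timestamp separators and the Remote lines, which makes it easier to see whether RxQOv climbs between samples. This is a general shell filter, not part of the original procedure:

    grep -E '^------------|Remote' /tmp/vmnic-log.txt | less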

If the value of the RxQOv counter rises during this test, the receiver is running out of buffers to hold the data transmitted by the sender.

Buffer overflows result in data having to be retransmitted by the sender, which effectively limits the bandwidth. Possible workarounds are to increase the number of receive buffers, reduce the number of transmit buffers, or both. These workarounds may increase workload on the physical CPUs.

The default number of receive and transmit buffers is 100 each. The maximum possible for ESX Server 2.1.x is 128. You can alter the default settings by placing one or both of the following parameters in the .vmx (configuration) files for the affected virtual machines.

The examples below set the number of receive buffers to 128 (the maximum) and the number of transmit buffers to 64. These values are not universally applicable; the right settings depend on the application and may require experimentation.

Ethernet0.numRecvBuffers=128
Ethernet0.numXmitBuffers=64

Note: Change the 0 value in the examples above to match whichever virtual NIC you wish to update.
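
A minimal sketch of applying these settings from the service console is shown below. The .vmx path is hypothetical; substitute the configuration file of the affected virtual machine, and make the change while the virtual machine is powered off so that it takes effect at the next power-on:

    VMX=/home/vmware/myvm/myvm.vmx   # hypothetical path to the virtual machine's .vmx file
    grep -i 'Ethernet0.num' "$VMX"   # check whether the parameters are already present
    echo 'Ethernet0.numRecvBuffers=128' >> "$VMX"
    echo 'Ethernet0.numXmitBuffers=64' >> "$VMX"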

In a limited number of cases, an alternative to the above method is to use the vlance network adapter driver. The vmxnet and vlance network adapter drivers have different buffering schemes. When using vmxnet, if the RxQOv counter increases at a rapid rate, it may be possible to avoid buffer overflows and achieve higher throughput by switching to the vlance driver. Using vlance is not preferred, however, because any throughput increase achieved is typically lower than can be obtained with the vmxnet driver. It is also possible to see lower throughput with the vlance driver than was originally observed with the vmxnet driver, even with the buffer overflow condition present.
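
If you do decide to try the vlance adapter, the virtual device type is normally selected in the virtual machine's configuration file. The parameter below is shown only as an illustration and is an assumption for this ESX release; confirm the exact parameter name and accepted values in the documentation for your version before using it:

Ethernet0.virtualDev=vlance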