Some iSCSI storage arrays from various vendors do not behave appropriately during periods of network congestion. This problem is related to the TCP/IP implementation of these arrays and can severely impact the read performance of storage attached to the ESXi/ESX host through the software iSCSI initiator. The problem has also been reported in native (non-virtualized) environments that use the Microsoft iSCSI initiator.
Background Concepts
To understand this problem, you should be familiar with several TCP concepts:
- Delayed ACK
- Slow start
- Congestion avoidance
Delayed ACK
A central precept of the TCP network protocol is that data sent through TCP be acknowledged by the recipient. According to RFC 813, "Very simply, when data arrives at the recipient, the protocol requires that it send back an acknowledgement of this data. The protocol specifies that the bytes of data are sequentially numbered, so that the recipient can acknowledge data by naming the highest numbered byte of data it has received, which also acknowledges the previous bytes." The TCP packet that carries the acknowledgement is known as an ACK.
A host receiving a stream of TCP data segments can increase efficiency in both the network and the hosts by sending fewer than one acknowledgement (ACK) segment per data segment received. This is known as a delayed ACK. The common practice is to send an ACK for every other full-sized data segment and not to delay the ACK for a segment by more than a specified threshold, typically between 100ms and 500ms. ESXi/ESX uses delayed ACK because of these benefits, as do most other servers.
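As a concrete illustration of the delayed ACK concept at the operating-system level (not of how ESXi/ESX or an iSCSI array tunes its TCP stack), the following Python sketch shows how a Linux application can ask the kernel to acknowledge received data immediately instead of delaying, using the TCP_QUICKACK socket option. The function name and buffer size are illustrative assumptions.

```python
import socket

# Minimal sketch (Linux only): TCP_QUICKACK asks the kernel to send ACKs
# immediately rather than delaying them. The flag is advisory and the kernel
# resets it after a while, so it is typically re-armed before each recv().
# This only illustrates the delayed-ACK concept; it is not how ESXi/ESX
# configures its software iSCSI initiator.
def recv_with_quick_ack(sock: socket.socket, nbytes: int) -> bytes:
    data = bytearray()
    while len(data) < nbytes:
        # Re-arm quick ACK so the ACK for the next segment is not delayed.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_QUICKACK, 1)
        chunk = sock.recv(min(65536, nbytes - len(data)))
        if not chunk:
            break
        data.extend(chunk)
    return bytes(data)
```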
Slow Start and Congestion Avoidance
Congestion can occur when there is a mismatch of data processing capabilities between two elements of the network leading from the source to the destination. Congestion manifests itself as a delay, timeout, or packet loss. To avoid and recover from congestion, TCP uses two algorithms: slow start and congestion avoidance. Although the mechanisms of the two algorithms differ, the underlying concept is the same: when congestion occurs, the TCP sender must slow its transmission rate and then increase it again as retransmitted data segments are acknowledged.
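The following minimal Python sketch illustrates the interplay of the two algorithms: after a timeout the congestion window restarts at one segment, doubles each round trip during slow start, and grows by one segment per round trip once it reaches an assumed slow-start threshold. The constants are illustrative and are not taken from any particular TCP implementation.

```python
# Sketch of congestion-window (cwnd) growth after a loss: exponential
# doubling during slow start, then additive increase during congestion
# avoidance. The slow-start threshold value is an illustrative assumption.
def cwnd_growth(rounds: int, ssthresh: int = 16) -> list[int]:
    cwnd = 1                      # restart from one segment after a timeout
    history = []
    for _ in range(rounds):
        history.append(cwnd)
        if cwnd < ssthresh:
            cwnd *= 2             # slow start: double each round trip
        else:
            cwnd += 1             # congestion avoidance: add one per round trip
    return history

print(cwnd_growth(10))            # [1, 2, 4, 8, 16, 17, 18, 19, 20, 21]
```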
When congestion occurs, the typical recovery sequence for TCP/IP networks that use delayed ACK and slow start is:
1. The sender detects congestion because it did not receive an ACK within the retransmission timeout period.
2. The sender retransmits the first data segment and waits for the ACK before sequencing the remaining segments for retransmission.
3. The receiver receives the retransmitted data segment and starts the delayed ACK timer.
4. The receiver transmits the ACK when the delayed ACK timer expires. During this waiting period, there are no other transmissions between the sender and receiver.
5. Upon receiving the ACK, the sender retransmits the next two data segments back to back.
6. Upon receiving the second data segment, the receiver promptly transmits an ACK.
7. Upon receiving that ACK, the sender retransmits the next four data segments back to back.
8. This sequence continues until the congestion period passes and the network returns to its normal traffic rate.
In this recovery sequence, the longest lag time comes from the initial delayed ACK timer in step 3.
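To put rough numbers on this sequence, the sketch below models the recovery time for an assumed number of lost segments: one delayed ACK wait for the first retransmission, then roughly one round trip for each subsequent doubling round. The round-trip time and delayed ACK values are assumptions chosen only to show the order of magnitude.

```python
# Rough model of the recovery sequence described above: round 1 resends one
# segment and waits for the delayed ACK timer; later rounds (2, 4, 8, ...
# segments) are acknowledged promptly and cost about one round trip each.
# All timing constants are illustrative assumptions.
def slow_start_recovery_time(lost_segments: int,
                             rtt_s: float = 0.001,
                             delayed_ack_s: float = 0.2) -> float:
    time_s = rtt_s + delayed_ack_s      # round 1: one segment + delayed ACK
    sent = 1
    batch = 2
    while sent < lost_segments:
        time_s += rtt_s                 # later rounds are ACKed promptly
        sent += batch
        batch *= 2
    return time_s

# Example: with the assumed values, 64 lost segments recover in a handful of
# round trips plus one delayed-ACK wait -- roughly 0.2 s in total.
print(round(slow_start_recovery_time(64), 3))
```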
Problem Description
The affected iSCSI arrays take a different approach to handling congestion. Instead of implementing the slow start algorithm, the congestion avoidance algorithm, or both, these arrays take the very conservative approach of retransmitting only one lost data segment at a time and waiting for the host's ACK before retransmitting the next one. This process continues until all lost data segments have been recovered.
Coupled with the delayed ACK implemented on the ESXi/ESX host, this approach slows read performance to a crawl in a congested network. Consequently, frequent timeouts are reported in the kernel log on hosts that use this type of array. Most notably, the VMFS heartbeat experiences a large number of timeouts because VMFS uses a short timeout value. This configuration also produces excessively large maximum read-response times (on the order of tens of seconds) as reported by the guest. The problem is exacerbated when reading data in large block sizes: the higher bandwidth contributes to network congestion, and each I/O consists of many more data segments, requiring longer recovery times.
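The sketch below contrasts this behavior with the slow start recovery modeled earlier: when every lost segment is retransmitted individually and each retransmission waits out the host's delayed ACK timer, total recovery time grows linearly with the number of lost segments. The timing constants are the same illustrative assumptions as before, not measured values.

```python
# Sketch of the recovery behavior described for the affected arrays: one
# lost segment is resent at a time, and each resend waits for the host's
# delayed ACK before the next segment goes out.
def one_segment_at_a_time_recovery(lost_segments: int,
                                   rtt_s: float = 0.001,
                                   delayed_ack_s: float = 0.2) -> float:
    return lost_segments * (rtt_s + delayed_ack_s)

# With the same assumed values as the earlier sketch, 64 lost segments take
# about 12.9 s instead of about 0.2 s -- on the order of the multi-second
# read-response times mentioned above.
print(round(one_segment_at_a_time_recovery(64), 1))
```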
In the vmkernel.log file, you may see entries similar to:
-09-17T15:07:19Z iscsid: discovery_sendtargets::Running discovery on IFACE iscsi_vmk@vmk2(iscsi_vmk) (drec.transport=iscsi_vmk)
-09-17T15:07:19Z iscsid: cannot make connection to 10.0.68.2:3260 (111)
-09-17T15:07:19Z iscsid: connection to discovery address 10.0.68.2 failed
-09-17T15:07:19Z iscsid: connection login retries (reopen_max) 5 exceeded
-09-17T15:07:19Z iscsid: Login Target Skipped: iqn.2003-10.com.temp:hvtiscsi:1808:vmwds01 if=iscsi_vmk@vmk1 addr=10.0.68.2:3260 (TPGT:1 ISID:0x2) (Already Running)
-09-17T15:07:19Z iscsid: Login Target Skipped: iqn.2003-10.com.temp:hvtiscsi:1808:vmwds01 if=iscsi_vmk@vmk2 addr=10.0.68.2:3260 (TPGT:1 ISID:0x3) (Already Running)