iSCSI connection flapping between 'offline' and 'online'

Products

VMware vSphere ESXi

Issue/Introduction

You are using the software iSCSI initiator in VMware ESXi and experience iSCSI LUN connectivity issues.
You have multiple VMkernel portgroups in the same subnet, accessing the same iSCSI target.
iSCSI connections are frequently being marked as offline, but not all connections come back online again.
Multiple dead paths accumulate over time.
No actual network traffic loss is experienced.
The iSCSI initiator reconnects at random intervals after a reboot.
High latency is observed, with elevated KAVG/cmd and QAVG/cmd values.
The ESXi log /var/log/vmkernel.log contains entries similar to:
vmkernel: 57:14:42:01.498 cpu5:4321)WARNING: iscsi_vmk: iscsivmk_ConnReceiveAtomic: vmhba34:CH:0 T:6 CN:0: Failed to receive data: Connection closed by peer
vmkernel: 57:14:42:01.498 cpu5:4321)iscsi_vmk: iscsivmk_ConnRxNotifyFailure: vmhba34:CH:0 T:6 CN:0: Connection rx notifying failure: Failed to Receive. State=Online
vmkernel: 57:14:42:01.498 cpu5:4321)WARNING: iscsi_vmk: iscsivmk_StopConnection: vmhba34:CH:0 T:6 CN:0: Processing CLEANUP event
vmkernel: 57:14:42:01.748 cpu4:4321)WARNING: iscsi_vmk: iscsivmk_StopConnection: vmhba34:CH:0 T:6 CN:0: iSCSI connection is being marked "OFFLINE"
[...]
vmkernel: 57:14:42:07.835 cpu1:4321)WARNING: iscsi_vmk: iscsivmk_StartConnection: vmhba34:CH:0 T:6 CN:0: iSCSI connection is being marked "ONLINE"
The volume of iSCSI flapping messages being logged can tie up the host's resources. This can cause the ESXi host, along with the VMs running on it, to become unresponsive or hang.
In some cases, the CPU utilization of the ESXi host is very high.
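
To gauge how often each connection flaps, and which connections never come back online, the "OFFLINE"/"ONLINE" messages shown above can be tallied per connection. The following is a minimal Python sketch, assuming the log lines follow the exact message format of the excerpt above. The function names summarize_flaps and stuck_offline are illustrative and not part of any VMware tool.

```python
import re
from collections import defaultdict

# Matches the connection identifier and the new state in lines such as:
#   ... vmhba34:CH:0 T:6 CN:0: iSCSI connection is being marked "OFFLINE"
# Assumption: the message text matches the excerpt in this article.
STATE_RE = re.compile(
    r'(?P<conn>vmhba\d+:CH:\d+ T:\d+ CN:\d+): '
    r'iSCSI connection is being marked "(?P<state>OFFLINE|ONLINE)"'
)


def summarize_flaps(lines):
    """Count OFFLINE/ONLINE transitions per connection.

    Returns {connection: {"offline": n, "online": n, "last": state}}.
    """
    summary = defaultdict(lambda: {"offline": 0, "online": 0, "last": None})
    for line in lines:
        match = STATE_RE.search(line)
        if match is None:
            continue  # not a state-change message
        entry = summary[match.group("conn")]
        state = match.group("state")
        entry[state.lower()] += 1
        entry["last"] = state
    return dict(summary)


def stuck_offline(summary):
    """Return the connections whose most recent logged state is OFFLINE."""
    return sorted(conn for conn, e in summary.items() if e["last"] == "OFFLINE")
```

Any iterable of lines works as input, for example an open file handle on a copy of vmkernel.log. Connections reported by stuck_offline are candidates for the dead paths described above.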

Environment

VMware vSphere ESXi 6.x
VMware vSphere ESXi 7.x
VMware vSphere ESXi 8.x

Cause

The iSCSI connection is closed by the iSCSI target. The message "Connection closed by peer" refers to a TCP session reset or closure that is sent from the target storage to the ESXi host.
It indicates that a network error occurred while the client (the ESXi host) was receiving data from the server (the storage target).
This issue can be caused by improper storage array configuration, improper host networking configuration (including the MTU size set across the environment), or the VMware ESXi product itself.
In the typical sequence, the server accepts the connection, processes the request, and sends a reply to the client. When the server then closes the socket, the client treats the connection as having terminated abnormally, because the socket implementation sends a TCP reset segment that tells the client to discard the data and report an error.

Other possible causes include:

- Over-saturation of the SAN or SAN array, resulting in loss of communication, or in the storage completing a task after the ESXi host has already aborted it due to timeout (5000 ms).
- Duplicate SAN target IP addresses, resulting in intermittent connection loss and other anomalous behavior.
- SAN target connection load balancing. Disable connection load balancing when using VMware ESXi software iSCSI initiators. You can use the Round-Robin multipathing policy to configure load balancing instead.

Resolution

To resolve this issue, verify the storage array configuration as well as the host networking configuration.

Fix the VMkernel networking misconfiguration:

- When using multiple VMkernel ports for software iSCSI, ensure that the number of VMkernel ports is less than or equal to the number of physical network interfaces.
- Check the MTU size across your environment and make it uniform end to end (either standard or jumbo frames).
- Follow the Best Practices for Configuring Networking with Software iSCSI.
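
The first two checks above can be sketched in Python once the values have been collected by hand from the host, the physical switches, and the array. This is an illustration under stated assumptions, not a VMware utility: the function names and the input layout are hypothetical, and any MTU difference is treated as a finding even though some switches are deliberately configured with a larger MTU. The payload calculation relies on the standard 20-byte IPv4 header and 8-byte ICMP header, and the generated command uses the vmkping options -I (interface), -d (do not fragment), and -s (payload size).

```python
IP_HEADER_BYTES = 20    # IPv4 header without options
ICMP_HEADER_BYTES = 8   # ICMP echo header


def max_ping_payload(mtu):
    """Largest ICMP payload that fits in one unfragmented IPv4 packet."""
    return mtu - IP_HEADER_BYTES - ICMP_HEADER_BYTES


def check_iscsi_network(vmk_mtus, uplinks, switch_mtu, array_mtu):
    """Return a list of findings; an empty list means nothing was flagged.

    vmk_mtus   -- dict of VMkernel port name -> MTU (hypothetical input)
    uplinks    -- list of physical NIC names used for iSCSI
    switch_mtu -- MTU configured on the physical switch ports
    array_mtu  -- MTU configured on the array's iSCSI ports
    """
    findings = []
    if len(vmk_mtus) > len(uplinks):
        findings.append(
            f"{len(vmk_mtus)} VMkernel ports but only {len(uplinks)} physical NICs"
        )
    mtus = set(vmk_mtus.values()) | {switch_mtu, array_mtu}
    if len(mtus) > 1:
        findings.append(f"MTU is not uniform: {sorted(mtus)}")
    return findings


def vmkping_command(vmk, target_ip, mtu):
    """Build a vmkping command that tests the full MTU without fragmentation."""
    return f"vmkping -I {vmk} -d -s {max_ping_payload(mtu)} {target_ip}"
```

For example, with jumbo frames (MTU 9000) the payload size works out to 8972 bytes, and with a standard MTU of 1500 it is 1472 bytes. If the generated vmkping command fails while a smaller payload succeeds, some device in the path is likely using a smaller MTU.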

If the issue persists, collect a packet capture (TCP dump) while these messages are being logged, and engage the storage vendor (OEM) to identify why the target is closing the connection.
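
When lining up a packet capture with the log messages, it helps to know when each connection went offline and for how long. The following Python sketch extracts those windows from vmkernel.log lines. It assumes the timestamp format of the excerpt in this article, read here as days:hours:minutes:seconds.milliseconds; that interpretation and the function names are assumptions. Newer log formats would need a different pattern.

```python
import re

# Assumption: timestamps look like "57:14:42:01.748" and are read as
# days:hours:minutes:seconds.milliseconds, as in this article's excerpt.
EVENT_RE = re.compile(
    r'vmkernel: (?P<d>\d+):(?P<h>\d+):(?P<m>\d+):(?P<s>\d+)\.(?P<ms>\d{3}) .*?'
    r'(?P<conn>vmhba\d+:CH:\d+ T:\d+ CN:\d+): '
    r'iSCSI connection is being marked "(?P<state>OFFLINE|ONLINE)"'
)


def to_millis(match):
    """Convert the matched timestamp fields to integer milliseconds."""
    d, h, m, s, ms = (int(match.group(k)) for k in ("d", "h", "m", "s", "ms"))
    return (((d * 24 + h) * 60 + m) * 60 + s) * 1000 + ms


def offline_windows(lines):
    """Return (connection, offline_ms, online_ms) for each completed outage."""
    pending = {}   # connection -> time it was marked OFFLINE
    windows = []
    for line in lines:
        match = EVENT_RE.search(line)
        if match is None:
            continue
        conn, now = match.group("conn"), to_millis(match)
        if match.group("state") == "OFFLINE":
            pending[conn] = now
        elif conn in pending:
            windows.append((conn, pending.pop(conn), now))
    return windows
```

Applied to the excerpt in this article, the single outage on vmhba34:CH:0 T:6 CN:0 lasts 6087 ms (from 57:14:42:01.748 to 57:14:42:07.835). Connections that are marked offline and never return do not appear in the result.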

Additional Information

For additional information and troubleshooting, see: