ESXi hosts become "Not Responding" and iSCSI datastores inaccessible due to MTU mismatch

Products

VMware vSphere ESXi

Issue/Introduction

Symptom:

Datastores either show no files in vCenter or are in an inaccessible state.
Raw Device Mappings (RDMs) stored on the SAN are inaccessible.
The attached storage devices are shown under devices in the vSphere Client.
In the Devices section, all LUNs appear as Attached, but the datastore column shows "Not Consumed".
The attached storage devices are shown when you run the "esxcfg-scsidevs -c" command.
When you attempt to display the hexadecimal view of an affected storage device using the "hexdump" command, you see a blank screen because the LUN cannot be read.

Validation:

In "/var/run/log/vmkernel.log", you will see errors similar to:

<Date>T09:16:34.475Z cpu4:33603)WARNING: NMP: nmp_DeviceRequestFastDeviceProbe:237: NMP device "naa.################################4de" state in doubt; requested fast path state update...
<Date>T09:16:34.576Z cpu4:41898)NMP: nmp_ThrottleLogForDevice:2321: Cmd 0x28 (0x413685059c40, 34048) to dev "naa.################################4de" on path "vmhba37:<UUID>" Failed: H:0x2 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0. Act:EVAL
<Date>T09:16:35.403Z cpu4:32809)ScsiDeviceIO
<Date>T09:17:56.853Z cpu14:33601)WARNING: iscsi_vmk: iscsivmk_StopConnection: Conn [CID: 0 L: 192.###.###.118:62638 R: 192.###.###.104:3260]
<Date>T09:17:56.953Z cpu2:32800)NMP: nmp_ThrottleLogForDevice:2321: Cmd 0x28 (0x41368507dc00, 34012) to dev "naa.################################4de" on path "vmhba37:<UUID>" Failed: H:0x2 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0. Act:EVAL
<Date>T09:17:56.957Z cpu9:69616)VMW_SATP_LOCAL: satp_local_updatePathStates:458: Failed to update path "vmhba37:<UUID>" state. Status=Transient storage condition, suggest retry
ESXi hosts may transition to a "Disconnected" or "Not Responding" state in vCenter Server due to a buildup of queued I/O and failed VMFS heartbeats.
Command-line utilities, such as esxcli, may hang or become completely unresponsive when executed on the affected ESXi hosts.
In /var/run/log/vmkernel.log, you will see continuous Power-on Reset events for the storage target:

YYYY-MM-DDTHH:MM:SS.ZZ Wa(180) vmkwarning: cpu74:2098409)WARNING: HPP: HppScsiThrottleLogForDevice:593: Error status H:0xc D:0x0 P:0x0 Invalid sense data: 0x0 0x0 0x0. hppAction = 1
YYYY-MM-DDTHH:MM:SS.ZZ In(14) vobd[2098052]: [scsiCorrelator] 5994440993us: [vob.scsi.scsipath.por] Power-on Reset occurred on #################
YYYY-MM-DDTHH:MM:SS.ZZ Wa(180) vmkwarning: cpu70:2098419)WARNING: HPP: HppScsiThrottleLogForDevice:593: Error status H:0x0 D:0x2 P:0x0 Valid sense data: 0x6 0x29 0x2. hppAction = 1

The iSCSI connection flaps repeatedly between "OFFLINE" and "ONLINE", and the volume of these events ties up the host's resources. This can cause the ESXi host, along with the VMs running on it, to become unresponsive:

YYYY-MM-DDTHH:MM:SS.ZZ Wa(180) vmkwarning: cpu42:2098333)WARNING: iscsi_vmk: iscsivmk_StopConnection:736: vmhba64:CH:1 T:0 CN:0: iSCSI connection is being marked "OFFLINE" (Event:4)
YYYY-MM-DDTHH:MM:SS.ZZ Wa(180) vmkwarning: cpu42:2098333)WARNING: iscsi_vmk: iscsivmk_StopConnection:740: Sess [ISID: 00023d000002 TARGET: iqn.###### TPGT: 409 TSIH: 0]
YYYY-MM-DDTHH:MM:SS.ZZ Wa(180) vmkwarning: cpu42:2098333)WARNING: iscsi_vmk: iscsivmk_StopConnection:741: Conn [CID: 0 L: ##.##.##.##:31745 R: ##.##.##.##:3260]
YYYY-MM-DDTHH:MM:SS.ZZ Wa(180) vmkwarning: cpu2:2130077)WARNING: iscsi_vmk: iscsivmk_TaskMgmtIssue:637: vmhba64:CH:1 T:0 L:4 : Task mgmt "Abort Task" with itt=0x6db4 (refITT=0x6dac) timed out.
[...]
YYYY-MM-DDTHH:MM:SS.ZZ Wa(180) vmkwarning: cpu42:2098000)WARNING: NMP: nmp_DeviceRequestFastDeviceProbe:235: NMP device "naa.####" state in doubt; requested fast path state update...
YYYY-MM-DDTHH:MM:SS.ZZ Wa(180) vmkwarning: cpu10:2107029)WARNING: VMW_SATP_ALUA: satp_alua_getTargetPortInfo:190: Could not get page 83 INQUIRY data for path "vmhba64:C1:T0:L3" - Timeout (195887137)
[...]
YYYY-MM-DDTHH:MM:SS.ZZ Wa(180) vmkwarning: cpu42:2098333)WARNING: iscsi_vmk: iscsivmk_StartConnection:918: vmhba64:CH:1 T:0 CN:0: iSCSI connection is being marked "ONLINE"
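
To gauge how often the iSCSI connection is flapping, the messages shown above can be counted directly in the log. The following is a minimal sketch, not an official tool: the function name count_iscsi_flaps is hypothetical, and the match strings are copied from the sample log lines in this article, so they may need adjusting if your ESXi build words these messages differently.

```shell
# Count iSCSI OFFLINE/ONLINE transitions and Power-on Reset events in a
# vmkernel log. Hypothetical helper; patterns come from the sample log lines.
count_iscsi_flaps() {
    log="${1:-/var/run/log/vmkernel.log}"
    if [ ! -r "$log" ]; then
        echo "cannot read $log" >&2
        return 1
    fi
    # grep -c prints 0 but exits non-zero when nothing matches, hence "|| true"
    offline=$(grep -c -F 'iSCSI connection is being marked "OFFLINE"' "$log" || true)
    online=$(grep -c -F 'iSCSI connection is being marked "ONLINE"' "$log" || true)
    por=$(grep -c -F 'vob.scsi.scsipath.por' "$log" || true)
    echo "offline=$offline online=$online por=$por"
}
```

Called with no argument, count_iscsi_flaps reads /var/run/log/vmkernel.log. Counts that keep climbing between two runs suggest the connection is still flapping.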

Environment

VMware vSphere ESXi

Cause

This issue is caused by an MTU mismatch along the iSCSI path. Small frames, such as those used for discovery and login, still pass, so the LUNs remain visible, but full-size frames carrying I/O are dropped. The resulting dropped packets and stuck I/O requests overwhelm the storage queues, which deadlocks the ESXi management service (hostd). As a result, the host drops offline ("Not Responding" in vCenter) and CLI tools such as esxcli freeze.

Resolution

To resolve the issue, correct the MTU mismatch.

Determine the MTU that your environment was designed to use by referring to the original iSCSI configuration documentation. Once you have that value, apply it consistently across the environment.
Ensure the allowed MTU is consistent on:

- The SAN array
- All physical network switches
- All virtual network switches
- All portgroups
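
On the ESXi side, the MTU currently configured on VMkernel adapters, virtual switches, and physical uplinks can be listed from the host shell. The sketch below only reads configuration. The function name show_host_mtu is hypothetical, and it assumes the esxcli namespaces shown are available on your ESXi version. The names vSwitch1 and vmk1 in the comments are placeholders. The MTU on the SAN array and on the physical switches has to be checked with the respective vendor's tools.

```shell
# List MTU-related settings on an ESXi host. Read-only; changes nothing.
# Hypothetical wrapper around existing esxcli commands.
show_host_mtu() {
    if ! command -v esxcli >/dev/null 2>&1; then
        echo "esxcli not found; run this in the ESXi shell" >&2
        return 1
    fi
    # VMkernel adapters (vmk0, vmk1, ...): check the MTU field of each entry
    esxcli network ip interface list
    # Standard vSwitches: each entry reports its MTU
    esxcli network vswitch standard list
    # Distributed switches known to this host, if any
    esxcli network vswitch dvs vmware list
    # Physical uplinks (vmnicN)
    esxcli network nic list
}

# After review, a standard vSwitch and a VMkernel adapter can be changed with,
# for example (placeholder names, MTU value per your design):
#   esxcli network vswitch standard set -v vSwitch1 -m 9000
#   esxcli network ip interface set -i vmk1 -m 9000
# The MTU of a distributed switch is changed in vCenter, not on the host.
```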

To check connectivity from the CLI:

A vmkping without a packet-size parameter succeeds:

[root@esxi01 :~ ] vmkping -I vmk1 <Target storage IP>
PING <Target storage IP> : 56 data bytes
64 bytes from <Target storage IP>: icmp_seq=0 ttl=64 time=0.260 ms
64 bytes from <Target storage IP>: icmp_seq=1 ttl=64 time=0.222 ms
64 bytes from <Target storage IP>: icmp_seq=2 ttl=64 time=0.285 ms

<Target storage IP> ping statistics
3 packets transmitted, 3 packets received, 0% packet loss
round-trip min/avg/max = 0.222/0.256/0.285 ms

A vmkping with the don't-fragment option (-d) and a payload sized for MTU 9000 (-s 8972) fails:

[root@esxi01 :~ ] vmkping -I vmk1 <Target storage IP> -d -s 8972
PING <Target storage IP>: 8972 data bytes

<Target storage IP> ping statistics
3 packets transmitted, 0 packets received, 100% packet loss
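
The value passed to -s follows from the MTU. Assuming IPv4, the 20-byte IP header and the 8-byte ICMP header must fit inside the MTU, so the largest payload that can be sent without fragmentation is the MTU minus 28 bytes. A minimal sketch of that arithmetic follows. The function names are hypothetical, and the vmk1 default simply mirrors the example in this article.

```shell
# Largest ICMP payload that fits in one unfragmented IPv4 packet:
# MTU - 20 (IPv4 header) - 8 (ICMP header).
icmp_payload_for_mtu() {
    echo $(( $1 - 28 ))
}

# Print (do not run) the vmkping command that tests a given MTU end to end.
# Arguments: MTU, target IP, VMkernel interface (defaults to vmk1).
mtu_test_command() {
    echo "vmkping -I ${3:-vmk1} $2 -d -s $(icmp_payload_for_mtu "$1")"
}

mtu_test_command 9000 "<Target storage IP>"
mtu_test_command 1500 "<Target storage IP>"
```

For an MTU of 9000 this yields a payload of 8972 bytes, and for the default MTU of 1500 it yields 1472 bytes. If the 1472-byte test passes while the 8972-byte test fails, some device in the path is not passing jumbo frames.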

Note: Once the MTU mismatch has been corrected, the storage should be accessible. ESXi/ESX hosts connected to the inaccessible storage may require a reboot to recover from the storage loss.