At specific times daily, ESXi hosts suffer storage connectivity and performance issues

Article ID: 394232

Products

VMware vSphere ESXi

Issue/Introduction

ESXi hosts suffer iSCSI storage interruption, connectivity, and performance issues at a specific time daily, or at other regular intervals.

At the time of the issue, the following symptoms are observed:

There is a significant increase in latency:

e.g. /var/log/vmkernel.log reports:
WARNING: ScsiDeviceIO: 1513: Device naa.############################### performance has deteriorated. I/O latency increased from average value of 5601 microseconds to 115207 microseconds.
WARNING: ScsiDeviceIO: 1513: Device naa.############################### performance has deteriorated. I/O latency increased from average value of 5601 microseconds to 249197 microseconds.
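To gauge how severe each latency spike is, the warning text can be parsed into numbers. The following is a minimal sketch, assuming the message wording matches the examples above; the function name `parse_latency_warning` and the returned field names are hypothetical and chosen only for illustration.

```python
import re

# Matches the "performance has deteriorated" warning as worded in the examples above.
LATENCY_RE = re.compile(
    r"Device (?P<device>\S+) performance has deteriorated\. "
    r"I/O latency increased from average value of (?P<avg>\d+) microseconds "
    r"to (?P<cur>\d+) microseconds"
)


def parse_latency_warning(line):
    """Return device, average and current latency (microseconds), or None if no match."""
    m = LATENCY_RE.search(line)
    if m is None:
        return None
    avg_us = int(m.group("avg"))
    cur_us = int(m.group("cur"))
    # Ratio of current to average latency; None if the average is reported as 0.
    ratio = cur_us / avg_us if avg_us else None
    return {
        "device": m.group("device"),
        "avg_us": avg_us,
        "cur_us": cur_us,
        "ratio": ratio,
    }
```

For the first example line above, the ratio is roughly 20, i.e. latency rose from about 5.6 ms to about 115 ms.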

I/O fails on paths, with path "state in doubt" messages in /var/log/vmkernel.log:

WARNING: NMP: nmp_DeviceRequestFastDeviceProbe: NMP device "naa.###############################" state in doubt; requested fast path state update...

Loss of connection to storage targets is reported:

e.g. for iSCSI storage, /var/log/vmkernel.log reports:
WARNING: iscsi_vmk: iscsivmk_ConnReceiveAtomic:481: vmhba##:CH:# T:# CN:#: Failed to receive data: Connection reset by peer
WARNING: iscsi_vmk: iscsivmk_ConnReceiveAtomic:484: Sess [ISID: 00023d000001 TARGET: iqn.####-##.###.######:###################### TPGT: 3 TSIH: 0]
WARNING: iscsi_vmk: iscsivmk_ConnReceiveAtomic:485: Conn [CID: 0 L: <vmk IP>:######4 R: <target IP>:3260]
iscsi_vmk: iscsivmk_ConnRxNotifyFailure:1236: vmhba##:CH:# T:# CN:#: Connection rx notifying failure: Failed to Receive. State=Online
iscsi_vmk: iscsivmk_ConnRxNotifyFailure:1237: Sess [ISID: 00023d000001 TARGET: iqn.####-##.###.######:###################### TPGT: 3 TSIH: 0]
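Counting receive failures per iSCSI target can help show whether one target or all targets are affected. The sketch below assumes the `Sess [...]` line format shown in the examples above; the function name `count_failures_by_target` is hypothetical.

```python
import re
from collections import Counter

# Matches the session detail line logged by iscsivmk_ConnReceiveAtomic,
# as formatted in the examples above.
SESS_RE = re.compile(
    r"iscsivmk_ConnReceiveAtomic:\d+: Sess \[ISID: \S+ TARGET: (?P<target>\S+) TPGT:"
)


def count_failures_by_target(lines):
    """Count receive-failure session lines per iSCSI target name."""
    counts = Counter()
    for line in lines:
        m = SESS_RE.search(line)
        if m:
            counts[m.group("target")] += 1
    return counts
```

If failures are spread evenly across every target, that may point toward a shared component such as the network/fabric rather than a single target port, though this is only a hint and not a diagnosis.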

Environment

VMware vSphere ESXi 7.0
VMware vSphere ESXi 8.0

Cause

A workload scheduled at a specific time (e.g. backups) significantly increases the load on the storage array and/or the storage network/fabric, degrading performance and causing targets to become unresponsive.

Resolution

Identify tasks/workloads that are scheduled at the time the issue occurs.
Check whether the storage array's own performance metrics show the degradation at those times. If storage array performance is normal, this indicates a network/fabric level issue.
Work with your storage and network teams or vendor support to investigate the cause of any performance bottlenecks identified.
Consider rescheduling tasks/workloads that are candidate triggers of the issue, both to confirm whether the task/workload is the trigger and as a possible remediation.
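To narrow down the time window to compare against scheduled tasks, the symptom messages can be bucketed by hour of day. This is a minimal sketch, assuming each vmkernel.log line begins with an ISO-8601 timestamp (for example `2024-05-01T02:00:13.456Z`); the function names and the `min_days` threshold are hypothetical. ESXi log timestamps are typically in UTC, so convert before comparing with schedules kept in local time.

```python
import re
from collections import Counter

# Assumption: log lines start with an ISO-8601 timestamp, e.g. 2024-05-01T02:00:13.456Z
TS_RE = re.compile(r"^(\d{4}-\d{2}-\d{2})T(\d{2}):\d{2}:\d{2}")

# Substrings taken from the symptom messages listed in this article.
SYMPTOMS = (
    "performance has deteriorated",
    "state in doubt",
    "Failed to receive data",
)


def symptom_counts_by_hour(lines):
    """Count symptom lines per (date, hour) bucket."""
    counts = Counter()
    for line in lines:
        if not any(s in line for s in SYMPTOMS):
            continue
        m = TS_RE.match(line)
        if m:
            counts[(m.group(1), int(m.group(2)))] += 1
    return counts


def recurring_hours(counts, min_days=2):
    """Return hours of day in which symptoms appear on at least min_days distinct dates."""
    days_per_hour = {}
    for day, hour in counts:
        days_per_hour.setdefault(hour, set()).add(day)
    return sorted(h for h, days in days_per_hour.items() if len(days) >= min_days)
```

Hours returned by `recurring_hours` are candidates to check against backup, replication, or other job schedules.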
