At specific times daily, ESXi hosts suffer storage connectivity and performance issues

Article ID: 394232

Updated On:

Products

VMware vSphere ESXi

Issue/Introduction

ESXi hosts experience iSCSI storage interruptions, connectivity loss, and performance issues at the same time each day, or at other regular intervals.

At the time of the issue:

  • There is a significant increase in latency:

e.g. /var/log/vmkernel.log reports:
WARNING: ScsiDeviceIO: 1513: Device naa.############################### performance has deteriorated. I/O latency increased from average value of 5601 microseconds to 115207 microseconds.
WARNING: ScsiDeviceIO: 1513: Device naa.############################### performance has deteriorated. I/O latency increased from average value of 5601 microseconds to 249197 microseconds.

  • I/O fails on paths, with path "state in doubt" messages in /var/log/vmkernel.log:

WARNING: NMP: nmp_DeviceRequestFastDeviceProbe: NMP device "naa.###############################" state in doubt; requested fast path state update...

  • Loss of connection to storage targets is reported:

e.g. for iSCSI storage, /var/log/vmkernel.log reports:
WARNING: iscsi_vmk: iscsivmk_ConnReceiveAtomic:481: vmhba##:CH:# T:# CN:#: Failed to receive data: Connection reset by peer
WARNING: iscsi_vmk: iscsivmk_ConnReceiveAtomic:484: Sess [ISID: 00023d000001 TARGET: iqn.####-##.###.######:###################### TPGT: 3 TSIH: 0]
WARNING: iscsi_vmk: iscsivmk_ConnReceiveAtomic:485: Conn [CID: 0 L: <vmk IP>:######4 R: <target IP>:3260]
iscsi_vmk: iscsivmk_ConnRxNotifyFailure:1236: vmhba##:CH:# T:# CN:#: Connection rx notifying failure: Failed to Receive. State=Online
iscsi_vmk: iscsivmk_ConnRxNotifyFailure:1237: Sess [ISID: 00023d000001 TARGET: iqn.####-##.###.######:###################### TPGT: 3 TSIH: 0]
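
The three symptoms above can be picked out of a copy of vmkernel.log with a short script. The following is a minimal Python sketch, not a VMware tool: the regular expressions are derived only from the sample messages quoted in this article, and the function name `classify_line` and the labels it returns are hypothetical. Message wording may differ between ESXi builds, so treat the patterns as assumptions to verify against your own logs.

```python
import re

# Patterns derived from the sample vmkernel.log messages quoted above.
# Assumption: the message wording in your ESXi build matches these samples.
LATENCY_RE = re.compile(
    r"ScsiDeviceIO: \d+: Device (?P<device>\S+) performance has deteriorated\. "
    r"I/O latency increased from average value of (?P<avg>\d+) microseconds "
    r"to (?P<now>\d+) microseconds"
)
PATH_DOUBT_RE = re.compile(r'NMP device "(?P<device>[^"]+)" state in doubt')
ISCSI_RESET_RE = re.compile(
    r"iscsi_vmk: .*Failed to receive data: Connection reset by peer"
)


def classify_line(line):
    """Return (kind, details) for a recognised symptom line, else None.

    The kind labels are hypothetical names chosen for this sketch.
    """
    m = LATENCY_RE.search(line)
    if m:
        avg, now = int(m.group("avg")), int(m.group("now"))
        return "latency", {
            "device": m.group("device"),
            "avg_us": avg,
            "now_us": now,
            # How many times higher than the average the new latency is.
            "factor": round(now / avg, 1) if avg else None,
        }
    m = PATH_DOUBT_RE.search(line)
    if m:
        return "path_in_doubt", {"device": m.group("device")}
    if ISCSI_RESET_RE.search(line):
        return "iscsi_conn_reset", {}
    return None
```

Run `classify_line` over each line of the log and tally the results per kind. In the first latency sample above, 5601 microseconds rising to 115207 microseconds is an increase of roughly twentyfold.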

Environment

VMware vSphere ESXi 7.0
VMware vSphere ESXi 8.0

Cause

A workload scheduled at a specific time (e.g. a backup job) significantly increases the load on the storage array and/or the storage network/fabric, degrading performance and causing targets to become unresponsive.


Resolution

  • Identify tasks/workloads that are scheduled at the time the issue occurs.

  • Confirm whether the storage array's own performance metrics show the degradation at those times. If the storage array performance is normal, this points to a network/fabric level issue.
  • Work with your storage and network teams or vendor support to investigate the cause of any performance bottlenecks identified.

  • Consider changing the scheduling of tasks/workloads that are candidate triggers of the issue (to confirm if the task/workload is the trigger and as a possible remediation).
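
To help with the first step, the hours at which the symptoms recur can be estimated from the log itself. The following is a minimal Python sketch under stated assumptions: each vmkernel.log line is assumed to begin with an ISO 8601 timestamp (for example `2024-01-15T02:00:13.123Z`), the symptom strings are taken from the messages quoted in this article, and the function names `symptom_hours` and `recurring_hours` are hypothetical.

```python
import re
from collections import Counter
from datetime import datetime

# Assumption: every log line starts with an ISO 8601 timestamp such as
# "2024-01-15T02:00:13.123Z"; fractional seconds and the zone suffix are ignored.
TS_RE = re.compile(r"^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})")

# Substrings taken from the symptom messages quoted in this article.
SYMPTOMS = (
    "performance has deteriorated",
    "state in doubt",
    "Connection reset by peer",
)


def symptom_hours(lines):
    """Count symptom lines per (date, hour-of-day) bucket."""
    buckets = Counter()
    for line in lines:
        if not any(s in line for s in SYMPTOMS):
            continue
        m = TS_RE.match(line)
        if not m:
            continue  # no leading timestamp, skip the line
        ts = datetime.strptime(m.group(1), "%Y-%m-%dT%H:%M:%S")
        buckets[(ts.date().isoformat(), ts.hour)] += 1
    return buckets


def recurring_hours(buckets, min_days=2):
    """Return hours of day that show symptoms on at least min_days dates."""
    # Each key is a distinct (date, hour), so this counts dates per hour.
    days_per_hour = Counter(hour for (_day, hour) in buckets)
    return sorted(h for h, n in days_per_hour.items() if n >= min_days)
```

For example, `recurring_hours(symptom_hours(open("vmkernel.log")))` on a copy of the log lists the hours that repeat across days. ESXi log timestamps are normally in UTC, so convert them to local time before comparing against backup or other job schedules.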