VMs have high latency and may freeze due to repeated D:0x28 (TASK_SET

VMware vSphere ESXi

VMware vSphere ESXi (all versions)

The storage device is overloaded and is failing I/Os with SCSI device status D:0x28 (TASK_SET_FULL).
This may be triggered when a per-device Quality of Service (QoS) IOPS or throughput limit is configured and the actual I/O load exceeds that limit.

Verification:

/var/log/vmkernel.log will contain entries similar to:

vmkwarning: cpu1:2098325)WARNING: ScsiDeviceIO: 1779: Device naa.###################### performance has deteriorated. I/O latency increased from average value of 1796 microseconds to 36501 microseconds.
…
vmkernel: cpu79:2098326)ScsiDeviceIO: 4619: Cmd(0x45da1181e1c0) 0x28, CmdSN 0x8000006a from world 7998099 to dev "naa.######################" failed H:0x0 D:0x28 P:0x0

Large numbers of D:0x28 warnings may be logged.
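
To gauge how many of these failures are logged per device, the log can be scanned with a short script. The following is a minimal Python sketch, not a VMware-provided tool. It assumes the log lines follow the layout of the sample above; the function name count_task_set_full and the device names in the sample data are hypothetical.

```python
import re
from collections import Counter

# Matches the 'to dev "<device>" failed H:.. D:.. P:..' tail of a ScsiDeviceIO line.
# Anchoring on "failed" matters: the 0x28 that follows "Cmd(...)" is the SCSI
# opcode (READ(10)), not the device status.
FAILED_CMD = re.compile(
    r'to dev "(?P<device>[^"]+)" failed '
    r'H:0x[0-9a-fA-F]+ D:(?P<status>0x[0-9a-fA-F]+) P:0x[0-9a-fA-F]+'
)

TASK_SET_FULL = 0x28  # SCSI device status reported as D:0x28


def count_task_set_full(lines):
    """Return a Counter of TASK_SET_FULL failures keyed by device name."""
    counts = Counter()
    for line in lines:
        match = FAILED_CMD.search(line)
        if match and int(match.group("status"), 16) == TASK_SET_FULL:
            counts[match.group("device")] += 1
    return counts


# Illustrative lines modelled on the sample above (device names are made up).
sample = [
    'vmkernel: cpu79:2098326)ScsiDeviceIO: 4619: Cmd(0x45da1181e1c0) 0x28, '
    'CmdSN 0x8000006a from world 7998099 to dev "naa.example1" failed H:0x0 D:0x28 P:0x0',
    'vmkernel: cpu12:2098326)ScsiDeviceIO: 4619: Cmd(0x45da1181e2c0) 0x2a, '
    'CmdSN 0x8000006b from world 7998099 to dev "naa.example1" failed H:0x0 D:0x28 P:0x0',
    # Opcode 0x28 but device status 0x2: must not be counted.
    'vmkernel: cpu3:2098326)ScsiDeviceIO: 4619: Cmd(0x45da1181e3c0) 0x28, '
    'CmdSN 0x8000006c from world 7998099 to dev "naa.example2" failed H:0x0 D:0x2 P:0x0',
]

for device, count in count_task_set_full(sample).most_common():
    print(device, count)
```

To run it against a real log, pass an open file instead of the sample list, for example: with open("/var/log/vmkernel.log", errors="replace") as f: count_task_set_full(f).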

It is important to assess the frequency and severity of the issue.
If the issue occurs infrequently and/or only during short periods when VMs are generating high I/O, the problem primarily relates to peak I/O. Resolution steps that typically help are:
- Redistribute intensive workloads across datastores to avoid exceeding QoS limits on individual devices, where spare capacity remains at the storage level to do so.
- Stagger intensive workloads to reduce IOPS/throughput peaks.
- Enable adaptive queuing at the ESXi level to throttle I/O when TASK_SET_FULL is detected. See Controlling LUN queue depth throttling in VMware ESXi.
If the issue occurs frequently, is more severe in its impact, and is triggered by a variety of workloads under typical load, then while the above steps may help, this may indicate a sizing issue: more IOPS/throughput capability is required at the storage level (e.g. by increasing QoS limits where possible, or by adding additional or faster hardware).
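
One way to judge whether the failures are occasional peaks or a sustained condition is to bucket them by time. The sketch below is a hedged Python illustration only. It assumes each log line begins with an ISO 8601 timestamp (the sample lines in this article omit it), and the function names failures_per_minute and summarize, as well as the sample data, are hypothetical.

```python
import re
from collections import Counter

# Assumed line layout: "<ISO 8601 timestamp> ... failed H:0x.. D:0x28 P:0x..".
# Only the timestamp up to the minute is captured, to use as the bucket key.
TASK_SET_FULL_LINE = re.compile(
    r'^(?P<minute>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}):\d{2}\S*\s'
    r'.*failed H:0x[0-9a-fA-F]+ D:0x28\b'
)


def failures_per_minute(lines):
    """Count D:0x28 failures per minute, keyed by 'YYYY-MM-DDTHH:MM'."""
    buckets = Counter()
    for line in lines:
        match = TASK_SET_FULL_LINE.match(line)
        if match:
            buckets[match.group("minute")] += 1
    return buckets


def summarize(buckets):
    """Return (affected_minutes, peak_minute, peak_count) for the buckets."""
    if not buckets:
        return 0, None, 0
    peak_minute = max(buckets, key=buckets.get)
    return len(buckets), peak_minute, buckets[peak_minute]


def _line(timestamp, status):
    # Helper that builds a made-up log line for the demonstration below.
    return (
        f'{timestamp} cpu1:1000)ScsiDeviceIO: 4619: Cmd(0x1) 0x28, CmdSN 0x1 '
        f'from world 1 to dev "naa.example" failed H:0x0 D:{status} P:0x0'
    )


# Illustrative data only; timestamps and values are invented.
sample = [
    _line("2024-01-01T10:00:01.123Z", "0x28"),
    _line("2024-01-01T10:00:30.456Z", "0x28"),
    _line("2024-01-01T10:05:02.000Z", "0x28"),
    _line("2024-01-01T10:05:03.000Z", "0x2"),  # different status, ignored
]

print(summarize(failures_per_minute(sample)))
```

A handful of affected minutes clustered around known busy periods points towards the peak I/O case, while failures spread across many minutes throughout the day point towards the sizing case. Any threshold for what counts as "many" depends on the environment and is not defined here.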

Changes made to manage such issues should be carefully considered in advance and monitored during implementation.
