VMs have high latency and may freeze due to repeated D:0x28 (TASK_SET_FULL)

Article ID: 423427

Products

VMware vSphere ESXi

Issue/Introduction

  • VMs have high read and write latency and may freeze or fail

  • ESXi hosts may become unresponsive

  • This may occur during periods of higher I/O load

Environment

VMware vSphere ESXi (all versions)

Cause

  • The storage device is overloaded and is failing I/Os with SCSI device status D:0x28 (TASK_SET_FULL).

  • This may be triggered if a Quality of Service (QoS) IOPS or throughput limit is configured per device and the actual I/O load exceeds that limit.


    Verification:


    /var/log/vmkernel.log will contain entries similar to:

    vmkwarning: cpu1:2098325)WARNING: ScsiDeviceIO: 1779: Device naa.###################### performance has deteriorated. I/O latency increased from average value of 1796 microseconds to 36501 microseconds.

    vmkernel: cpu79:2098326)ScsiDeviceIO: 4619: Cmd(0x45da1181e1c0) 0x28, CmdSN 0x8000006a from world 7998099 to dev "naa.######################" failed H:0x0 D:0x28 P:0x0

    A large number of these D:0x28 failures may be logged.
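
To gauge which devices are affected, the failure lines can be counted per device. The following is a minimal sketch, assuming log lines follow the format of the excerpt above; the function name `count_task_set_full` is illustrative and not part of any VMware tooling.

```python
import re
from collections import Counter

# Matches the tail of the failure line shown above:
#   ... to dev "naa...." failed H:0x0 D:0x28 P:0x0
# Only the D: (device status) field is checked. The "0x28" that follows
# "Cmd(...)" earlier in the line is the SCSI opcode, not the status.
FAILED_RE = re.compile(
    r'to dev "(?P<dev>[^"]+)" failed '
    r'H:0x[0-9a-fA-F]+ D:(?P<status>0x[0-9a-fA-F]+) P:0x[0-9a-fA-F]+'
)

TASK_SET_FULL = 0x28


def count_task_set_full(lines):
    """Return a Counter mapping device name to number of D:0x28 failures."""
    counts = Counter()
    for line in lines:
        match = FAILED_RE.search(line)
        if match and int(match.group("status"), 16) == TASK_SET_FULL:
            counts[match.group("dev")] += 1
    return counts
```

Passing an open file handle for a copy of vmkernel.log to `count_task_set_full` and calling `.most_common(5)` on the result would list the five devices with the most TASK_SET_FULL failures.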

Resolution

  • It is important to assess the frequency and severity of the issue.

  • If the issue occurs infrequently and/or only during short periods when VMs are generating high I/O, the problem primarily relates to peak I/O. Resolution steps that will typically help are:
    • redistribute intensive workloads across datastores to avoid overloading individual devices or exceeding their QoS limits, where there is remaining capacity at the storage level to do so
    • stagger intensive workloads to reduce IOPS/throughput peaks
    • enable adaptive queuing at the ESXi level to throttle I/O when TASK_SET_FULL is detected (see Controlling LUN queue depth throttling in VMware ESXi)

  • If the issue occurs frequently, has a more severe impact, and is triggered by a variety of workloads under typical load, then the above steps may still help, but this may indicate a sizing issue: more IOPS/throughput capability is required at the storage level (e.g. by increasing QoS limits where possible, or via additional/faster hardware).
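
The adaptive queuing option listed above can be pictured with a small simulation. This is a simplified model and not ESXi's actual implementation: it assumes the queue depth is halved once a configured number of queue-full responses has been seen, and raised by one after a configured number of successful completions. The class `AdaptiveQueue` and its parameters are hypothetical; `sample_size` and `threshold` are only loosely modelled on the Disk.QFullSampleSize and Disk.QFullThreshold settings, and the default values are illustrative.

```python
class AdaptiveQueue:
    """Toy model of queue depth throttling on TASK_SET_FULL (assumed behavior)."""

    def __init__(self, max_depth=32, sample_size=32, threshold=4, min_depth=1):
        self.max_depth = max_depth      # configured device queue depth
        self.min_depth = min_depth      # never throttle below this
        self.sample_size = sample_size  # queue-full responses before throttling
        self.threshold = threshold      # good completions before recovering
        self.depth = max_depth
        self._full_seen = 0
        self._good_seen = 0

    def on_completion(self, queue_full):
        """Record one I/O completion and return the resulting queue depth."""
        if queue_full:
            # A queue-full response interrupts any run of good completions.
            self._good_seen = 0
            self._full_seen += 1
            if self._full_seen >= self.sample_size:
                self.depth = max(self.min_depth, self.depth // 2)
                self._full_seen = 0
        else:
            self._good_seen += 1
            if self._good_seen >= self.threshold:
                self.depth = min(self.max_depth, self.depth + 1)
                self._good_seen = 0
        return self.depth
```

In this model the host backs off quickly when the array signals that it is full and recovers slowly, which trades some peak throughput for fewer rejected commands.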


Changes made to manage such issues should be carefully considered in advance and monitored during implementation.
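
One way to assess how often the issue occurs, and to compare behavior before and after a change, is to count D:0x28 failures per hour. The sketch below assumes each log line begins with an ISO 8601 timestamp (for example 2024-01-01T10:15:30.123Z), which the excerpts in this article omit. The function names and the thresholds in `classify` are hypothetical placeholders, not VMware guidance, and should be tuned to the environment.

```python
import re
from collections import Counter

# Assumption: lines start with a timestamp such as 2024-01-01T10:15:30.123Z.
# The captured group is the date plus hour, used as the bucket key.
TS_RE = re.compile(r"^(\d{4}-\d{2}-\d{2}T\d{2}):\d{2}:\d{2}")
D28_RE = re.compile(r"failed H:0x[0-9a-fA-F]+ D:0x28 ")


def events_per_hour(lines):
    """Return a Counter mapping 'YYYY-MM-DDTHH' to D:0x28 failure count."""
    buckets = Counter()
    for line in lines:
        ts = TS_RE.match(line)
        if ts and D28_RE.search(line):
            buckets[ts.group(1)] += 1
    return buckets


def classify(buckets, busy_threshold=100, sustained_hours=6):
    """Rough label based on how many hours exceed a failure count.

    Both thresholds are illustrative assumptions only.
    """
    busy = sum(1 for count in buckets.values() if count >= busy_threshold)
    if busy == 0:
        return "negligible"
    return "sustained" if busy >= sustained_hours else "peak-related"
```

A "peak-related" result corresponds to the first case in the Resolution section, while "sustained" suggests the sizing question raised in the second case.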