Micron P420m PCIe SSDs experience command timeout issue

search cancel

Micron P420m PCIe SSDs experience command timeout issue

book

Article ID: 341549

calendar_today

Updated On:

Products

VMware vSphere ESXi

Issue/Introduction

The Micron engineering team has confirmed that a low free block state may cause command timeout issues with Micron P420m PCIe SSDs under certain workloads and device conditions. This article is intended to describe how to detect the problem and how to mitigate it.

Symptoms:

When running a Micron P420m PCIe SSD in an ESXi 5.x/6.0 host, you experience one or more of these symptoms:

You see entries in the /var/log/vmkernel.log files, similar to:

13:48:46.456Z cpu8:33564)mtip32xx_native[00000400]: HeavyTimeoutCheck:IO Timeout: [811980076224(ms)] Tag:1 ACTIVE:15371(ms) I:0x4109d5d7fa00 SN:0x628b154 SA=0x00000002 CI=0x00000000
You see frequent abort requests in the /var/log/vmkernel.log file, over a short period of time:

Line 3877: 13:49:17.589Z cpu23:33168)mtip32xx_native[00000400]: scsiTaskMgmtCommand:VMK Task: ABORT sn=0x628b259 initiator=0x4109d5d7fa00
Line 3882: 13:49:24.890Z cpu8:2263721)mtip32xx_native[00000400]: scsiTaskMgmtCommand:VMK Task: ABORT sn=0x628b268 initiator=0x4109d5d7fa00
Line 3885: 13:49:26.893Z cpu20:2263726)mtip32xx_native[00000400]: scsiTaskMgmtCommand:VMK Task: ABORT sn=0x628b267 initiator=0x4109d5d7fa00
Line 3888: 13:49:36.904Z cpu28:2263755)mtip32xx_native[00000400]: scsiTaskMgmtCommand:VMK Task: ABORT sn=0x628b279 initiator=0x4109d5d7fa00
Applications such as VMware vSAN have marked the device as Permanent Device Loss (PDL)

Note: For detailed steps to identify the issue, see the Additional Information section.

Environment

VMware vSphere ESXi 5.0
VMware vSphere ESXi 6.0
VMware vSphere ESXi 5.1
VMware vSphere ESXi 5.5

Resolution

P420m customers MUST ensure they are running the latest Micron driver and upgrade firmware to version B21801xx from Service Pack 145.07:

Visit the Micron web site and download SP 145.07
A product change notification, PCN 31615, has been issued with a strong recommendation that customers with virtualization workloads download and install the latest support pack.

Mitigation

To mitigate this issue, follow these steps:

Put the affected ESXi host into maintenance mode.
Obtain the latest service pack (145.07 or later) from the appropriate vendor site:
1. Micron products
2. Dell products
3. EMC products: contact your EMC support representative
Instructions for installing the RSSDM command line utility, drivers and firmware are provided in the installation guide. Look in the service pack download for a file named:
P420m_P320h_HHHL_installation_guide.pdf
Update the device software and firmware:
1. Update your installation of RSSDM command line utility.
2. Update the driver for the version of vSphere you are running:
  - ESXi 5.0 and 5.1: mtip32xx-scsi-2.13.2-esxi50-cert.zip
  - ESXi 5.5 and 6.0: mtip32xx-native-3.9.4-esxi55-cert.zip
3. Update the firmware for each affected device.
Reboot the ESXi host.
Verify your firmware revision is correct using the rssdm command line utility with the -L listing option. Run this command to check the firmware revision:
rssdm -L
Exit maintenance mode on the ESXi host.

Note: If you encounter an issue during the resolution/ mitigation process, contact Micron support at [email protected]

Additional Information

To be alerted when this document is updated, click the Subscribe to Article link in the Actions box.

The problem occurs if IO retries are exhausted and IO's begin to fail indefinitely - i.e. the drive stops responding to ALL command requests.

You can review the kernel logs for certain driver messages indicating that there are repeated IO retry and timeout events happening.

The best way to capture kernel logs and other host support data is to collect a support bundle by running the vm-support command at the local ESXi host shell prompt, and copy the resulting tar.gz file to another computer; such as your laptop. With Windows you can extract the files using the free application 7z tools.

There are two kernel log file locations:

the current vmkernel.log file, located at: /var/log/vmkernel.log
The archives (gzip) of previous logs, located at /var/run/vmkernel.gz*

Note: Depending on when the failure occurred you may have to hunt through the archives to find the proper logs for your failure.

Highlight all driver activities between the device, driver, management interface and host IO stack by performing a file grep on the desired kernel log and filter on the text mtip32xx by running this command:
cat /var/log/vmkernel.log | grep mtip32xx
Look for repeated and frequent signatures that occur in relatively short duration between timeouts and retries.

The following text signatures will filter out log messages sufficiently for both vmklinux (ESXi 5.0 and 5.1) and VMK native (ESXi 5.5 and 6.0) drivers. We recommend you search case insensitive for each or all of these signatures at the same time:

timeout
IssueHbaReset
abort

For example, this text comes directly from a vmkernel.log file in which this issue was found to be the root cause.

All Micron driver kernel log entries are tagged with mtip32xx. Additionally the type of driver is appended to the module name to indicate _scsi for vmk linux driver, and _native for vmk native drivers. These designations are detailed at the kernel DDK level and mostly indicate a slightly different log message encoding, between drivers.

Sample vmkernel.log entries:

13:48:46.456Z cpu8:33564)mtip32xx_native[00000400]: HeavyTimeoutCheck:IO Timeout: [811980076224(ms)] Tag:1 ACTIVE:15371(ms) I:0x4109d5d7fa00 SN:0x628b154 SA=0x00000002 CI=0x00000000

13:48:46.456Z cpu8:33564)WARNING: mtip32xx_native[00000400]: CheckIoTimeoutsTimerEvent:(INTERNAL) timeout detected on device 00000400

13:48:46.456Z cpu8:33564)mtip32xx_native[00000400]: LogExceptionSignal:Signal: --|--|--|--|RS|--|--|--|--|--|--|-- (0x0010) Curr: --|--|--|--|--|--|--|--|--|--|--|-- (0x0000)

13:48:46.456Z cpu4:33563)mtip32xx_native[00000400]: LogExceptionProcess:Process: --|--|--|--|RS|--|--|--|--|--|--|-- (0x0010) Curr: --|--|--|--|RS|--|--|--|--|--|--|-- (0x0010)

13:48:46.456Z cpu4:33563)mtip32xx_native[00000400]: ExceptionHandlerWorld:Performing device reset request.

13:48:46.958Z cpu4:33563)mtip32xx_native[00000400]: IssueHbaReset:Issuing HBA reset.

13:48:47.376Z cpu20:34835)World: 14302: VC opID hostd-71c2 maps to vmkernel opID 2f8ef18e

13:48:47.513Z cpu4:33563)mtip32xx_native[00000400]: ProcessActiveCommands:Commands completed: 0, re-issued: 35

13:48:47.516Z cpu4:33563)mtip32xx_native[00000400]: ExceptionHandlerWorld:Device reset complete.

The previous vmkernel.log sample outlines the first occurrence. In this scenario the kernel log entries are marked with mtip32xx_native which is the driver module name for ESXi 5.5 and ESXi 6.0 installations. In square brackets, the PCI Bus:Dev:Func identifier is provided, which is useful when you have multiple devices in a single system.

The line item with HeavyTimeoutCheck:IO Timeout indicates that the driver detected a potential problem in which the device stopped responding to command completions for more than 20 seconds. This is not unreasonable under normal, heavy write workloads, but it is suspicious.

The Tag field indicates the unique HW command slot that the command was issued on; there are 1 to 255 slots; 0 is reserved for internal use. The ACTIVE field indicates the amount of time since the command was issued to the device in milliseconds.

The I: and SN: fields are the SCSI Initiator and Target identifiers for the command in question. In this case, the IO issued at time stamp 13:48:46.456Z has already been outstanding about 15 seconds. In this case, the driver is concerned about the total time since the last IO completion occurred, not the actual outstanding command time. As a result, it is normal to see a device reset followed by a re-issue of any outstanding commands. Notice the line with the text ProcessActiveCommands followed by ExceptionHandlerWorld:Device reset complete. This is where any commands that were outstanding prior to HBA reset are re-issued and the reset process is complete; 35 commands in this example.

The next log text sequence is the beginning of an indication that the device is experiencing the command timeout issue. The primary indicator is the time stamp since the last timeout detected; in this case time stamp 13:48:47.516Z is the end of the timeout event, and the beginning of the next one is time stamp 13:49:06.456Z; a 58s time difference. This is an indicator that the logs should be looked at for other markers, such as command aborts (SCSI task management requests).

Sample vmkernel.log with Timeout and Abort Indicators:

13:49:06.456Z cpu8:33564)mtip32xx_native[00000400]: HeavyTimeoutCheck:IO Timeout: [812000076212(ms)] Tag:1 ACTIVE:18868(ms) I:0x4109d5d7fa00 SN:0x628b259 SA=0x00000002 CI=0x00000000

13:49:06.456Z cpu8:33564)WARNING: mtip32xx_native[00000400]: CheckIoTimeoutsTimerEvent:(INTERNAL) timeout detected on device 00000400

13:49:06.456Z cpu8:33564)mtip32xx_native[00000400]: LogExceptionSignal:Signal: --|--|--|--|RS|--|--|--|--|--|--|-- (0x0010) Curr: --|--|--|--|--|--|--|--|--|--|--|-- (0x0000)

13:49:06.456Z cpu4:33563)mtip32xx_native[00000400]: LogExceptionProcess:Process: --|--|--|--|RS|--|--|--|--|--|--|-- (0x0010) Curr: --|--|--|--|RS|--|--|--|--|--|--|-- (0x0010)

13:49:06.456Z cpu4:33563)mtip32xx_native[00000400]: ExceptionHandlerWorld:Performing device reset request.

13:49:06.959Z cpu4:33563)mtip32xx_native[00000400]: IssueHbaReset:Issuing HBA reset.

13:49:07.514Z cpu4:33563)mtip32xx_native[00000400]: ProcessActiveCommands:Commands completed: 0, re-issued: 33

13:49:17.589Z cpu23:33168)mtip32xx_native[00000400]: scsiTaskMgmtCommand:VMK Task: ABORT sn=0x628b259 initiator=0x4109d5d7fa00

13:49:17.589Z cpu23:33168)mtip32xx_native[00000400]: ahciAbortIO:(curr) HWQD: 36 UAWQD: 0 UAWL: 0 BusyL: 3

13:49:20.022Z cpu20:34880)World: 14302: VC opID hostd-6f58 maps to vmkernel opID c974c357

13:49:22.707Z cpu30:36608)World: 14302: VC opID hostd-6f58 maps to vmkernel opID c974c357

13:49:24.889Z cpu8:2263721)ScsiCore: 63: Starting taskmgmt handler world 2263721/2

...

Continuing with the log text above, we see an abort request being logged by the driver with the tag scsiTaskMgmtCommand:VMK Task: ABORT. Abort requests are rare but can occur under normal drive operation, depending on two factors

The percentage of life remaining on the drive.
Asynchronous power cycle events; including hot plug removal and re-insertion/removal of server power supply.

Adding an additional filter for scsiTaskMgmtCommand allows you to see the overall number of abort requests and the drivers response to those requests.

Sample of vmkernel.log with filter scsiTaskMgmtCommand:

Line 3877: 13:49:17.589Z cpu23:33168)mtip32xx_native[00000400]: scsiTaskMgmtCommand:VMK Task: ABORT sn=0x628b259 initiator=0x4109d5d7fa00

Line 3882: 13:49:24.890Z cpu8:2263721)mtip32xx_native[00000400]: scsiTaskMgmtCommand:VMK Task: ABORT sn=0x628b268 initiator=0x4109d5d7fa00

Line 3885: 13:49:26.893Z cpu20:2263726)mtip32xx_native[00000400]: scsiTaskMgmtCommand:VMK Task: ABORT sn=0x628b267 initiator=0x4109d5d7fa00

Line 3888: 13:49:36.904Z cpu28:2263755)mtip32xx_native[00000400]: scsiTaskMgmtCommand:VMK Task: ABORT sn=0x628b279 initiator=0x4109d5d7fa00

... (165 messages of the same nature)

Line 4287: 13:52:33.528Z cpu16:2263755)mtip32xx_native[00000400]: scsiTaskMgmtCommand:VMK Task: ABORT sn=0x628b3c2 initiator=0x4109d5d7fa00

Line 4463: 13:53:57.843Z cpu4:2263851)mtip32xx_native[00000400]: scsiTaskMgmtCommand:VMK Task: ABORT sn=0x3641c initiator=0x4123e9a9dc40

Line 4496: 13:55:57.300Z cpu6:2263905)mtip32xx_native[00000400]: scsiTaskMgmtCommand:VMK Task: ABORT sn=0x628b3b3 initiator=0x4109d5d7fa00

Line 4516: 13:57:57.311Z cpu9:2263866)mtip32xx_native[00000400]: scsiTaskMgmtCommand:VMK Task: ABORT sn=0x628b3b8 initiator=0x4109d5d7fa00

In this log snippet, filtered by scsiTaskMgmtCommand, you see hundreds of abort command requests over the course of 7 minutes. This is the secondary indicator that there is a timeout issue with the drive.

Next we will take a look at one of the abort requests to explain what the driver is doing and why it indicates a command timeout issue. The following text is gathered by searching the vmkernel.log for a single command serial number, in this case sn = "0x628b3bf".

Sample vmkernel.log following a single command abort request.

Line 4283: 13:52:33.411Z cpu18:2263721)mtip32xx_native[00000400]: scsiTaskMgmtCommand:VMK Task: ABORT sn=0x628b3bf initiator=0x4109d5d7fa00

Line 4548: 13:58:47.184Z cpu20:33563)mtip32xx_native[00000400]: ProcessAbortRequest:aborted command I:0x4109d5d7fa00 SN:0x628b3bf tag:4

Line 4551: 13:58:47.184Z cpu24:32845)ScsiDeviceIO: 2338: Cmd(0x412f7cb94500) 0x28, CmdSN 0x628b3bf from world 0 to dev "t10.ATA_____Micron_P420m2DMTFDGAR700MAX______________0000000014090C072C5F" failed H:0x5 D:0x22 P:0x0 Possible sense data: 0x0 0x0 0x0.

There are many lines of log file text surrounding the above sequence but if we narrow in around Line 4548 (referenced above) we see the following log file entries:

Sample vmkernel.log surrounding command abort request.

13:58:47.184Z cpu18:2263721)mtip32xx_native[00000400]: LogExceptionSignal:Signal: --|--|--|AB|--|--|--|--|--|--|--|-- (0x0008) Curr: --|--|--|--|--|--|--|--|--|--|--|-- (0x0000)

13:58:47.184Z cpu20:33563)mtip32xx_native[00000400]: LogExceptionProcess:Process: --|--|--|AB|--|--|--|--|--|--|--|-- (0x0008) Curr: --|--|--|AB|--|--|--|--|--|--|--|-- (0x0008)

13:58:47.184Z cpu20:33563)mtip32xx_native[00000400]: ProcessAbortRequest:aborted command I:0x4109d5d7fa00 SN:0x628b3bf tag:4

13:58:47.184Z cpu20:33563)mtip32xx_native[00000400]: ExceptionHandlerWorld:Abort scan took 13 (us) to complete, 1 commands aborted.

13:58:47.184Z cpu20:33563)mtip32xx_native[00000400]: IssueHbaReset:Issuing HBA reset.

13:58:47.184Z cpu24:32845)ScsiDeviceIO: 2338: Cmd(0x412f7cb94500) 0x28, CmdSN 0x628b3bf from world 0 to dev "t10.ATA_____Micron_P420m2DMTFDGAR700MAX______________0000000014090C072C5F" failed H:0x5 D:0x22 P:0x0 Possible sense data: 0x0 0x0 0x0.

13:58:47.740Z cpu20:33563)mtip32xx_native[00000400]: ahciCheckHealthStatus:Performing health check...

13:58:47.740Z cpu20:33563)mtip32xx_native[00000400]: IssueComReset:Issuing COM reset.

13:58:47.807Z cpu20:33563)mtip32xx_native[00000400]: ahciCheckHealthStatus:Health Status Check - ReadLogExt 10h buffer:00 00 00 00 00 00 00 00 00 00 00 00 00 00

13:58:47.819Z cpu20:33563)mtip32xx_native[00000400]: ahciReadSmartAttributes:SMART data acquired version 0x000a

13:58:47.819Z cpu20:33563)mtip32xx_native[00000400]: ahciCheckHealthStatus:SMART: Write Protect status: 0% (104857 blocks) on device 00000400

13:58:47.819Z cpu20:33563)mtip32xx_native[00000400]: ProcessActiveCommands:Commands completed: 0, re-issued: 2

13:58:47.824Z cpu5:2263866)WARNING: LSOMCommon: IORETRYParentIODoneCB:1043: Throttled: split status Maximum kernel-level retries exceeded

13:58:47.824Z cpu5:2263866)WARNING: LSOM: RCSsdIoCompletion:166: Throttled: RC SSD IO failed. Wake up 0x412f3ad2ac80 with status Maximum kernel-level retries exceeded

13:58:47.830Z cpu14:2263794)mtip32xx_native[00000400]: LogExceptionSignal:Signal: --|--|--|AB|--|--|--|--|--|--|--|-- (0x0008) Curr: --|--|--|--|--|--|--|--|--|--|--|-- (0x0000)

13:58:47.830Z cpu18:2263721)WARNING: LSOM: RCWbIoCompletion:137: Throttled: WB/CF IO failed. Wake up 0x412f3aca7880 with status Maximum kernel-level retries exceeded

13:58:47.830Z cpu20:33563)mtip32xx_native[00000400]: LogExceptionProcess:Process: --|--|--|AB|--|--|--|--|--|--|--|-- (0x0008) Curr: --|--|--|AB|--|--|--|--|--|--|--|-- (0x0008)

13:58:47.830Z cpu20:33563)mtip32xx_native[00000400]: ExceptionHandlerWorld:No outstanding commands, nothing to abort.

13:58:47.830Z cpu36:33716)WARNING: LSOM: RCDrainAfterBERead:5070: Changing the status of child state from Success to Maximum kernel-level retries exceeded

13:58:47.830Z cpu36:33716)WARNING: LSOMCommon: SSDLOG_WriteLogEntry:599: Throttled: Log has encountered (Maximum kernel-level retries exceeded) error device: t10.ATA_____Micron_P420m2DMTFDGAR700MAX______________0000000014090C072C5F:2

13:58:47.830Z cpu36:33716)WARNING: LSOM: LSOMEventNotify:4574: VSAN device 52e5eda8-9027-aa59-611c-2cd420269225 is under permanent error.

13:58:47.842Z cpu14:2263764)mtip32xx_native[00000400]: LogExceptionSignal:Signal: --|--|--|AB|--|--|--|--|--|--|--|-- (0x0008) Curr: --|--|--|--|--|--|--|--|--|--|--|-- (0x0000)

13:58:47.845Z cpu20:33563)mtip32xx_native[00000400]: LogExceptionProcess:Process: --|--|--|AB|--|--|--|--|--|--|--|-- (0x0008) Curr: --|--|--|AB|--|--|--|--|--|--|--|-- (0x0008)

13:58:47.845Z cpu20:33563)mtip32xx_native[00000400]: ExceptionHandlerWorld:Abort scan took 1 (us) to complete, 0 commands aborted.

13:58:47.854Z cpu18:2263755)mtip32xx_native[00000400]: LogExceptionSignal:Signal: --|--|--|AB|--|--|--|--|--|--|--|-- (0x0008) Curr: --|--|--|--|--|--|--|--|--|--|--|-- (0x0000)

13:58:47.854Z cpu36:33716)WARNING: LSOM: LSOMRecoveryWriteDispatch:687: Unable to write log entry Maximum kernel-level retries exceeded

Here is an explanation of what you are seeing, along with a description of the end results of these log entries.

The log line with the tag ProcessAbortRequest is where the driver has recognized the abort request and is taking action.
The line with tag ExceptionHandlerWorld:Abort scan indicates how long the driver spent actually aborting the IO.
The line with ahciCheckHealthStatus:SMART: Write Protect status: is where the drive reports its percent life remaining to the driver. In this case the drive is at 0% life consumed.

With the drive life at 100%, the number of aborts in the hundreds within a short duration (~5 minutes), it is probable that this drive is experiencing the low free block condition which results in command timeouts. This is the point where applications, such as vSAN, become "unhappy" with the device and begin to mark a device as PDL.

For more information about the latest drivers and firmware for affected products visit Micron at www.micron.com/products/solid-state-storage/bus-interfaces/pcie-ssds

Micron P420m PCIe SSD でコマンドのタイムアウトが発生する

Feedback

thumb_up Yes

thumb_down No