Host experiences retriable errors (H:0xc, H:0x2); XFS on an RDM attached through a vNVMe adapter disables the filesystem after 5 retries.


Article ID: 313979


Updated On:

Products

VMware vSphere ESXi

Issue/Introduction

This article assists in identifying the problem and provides a resolution in a timely manner.

Symptoms:

  • The issue is observed only when a RHEL virtual machine is running XFS configured on physical RDMs

  • There is no similar issue with VMDKs that use the XFS filesystem

  • The problem was first noticed on RHEL 7.1, but it is not exclusive to that release

  • The issue is seen when the host loses some paths to the storage while at least one path is still servicing IOs correctly (a host-side path check is sketched below)
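
As a quick host-side check, the path states for the affected device can be listed with esxcli; the device identifier below is taken from the log excerpt and should be replaced with the actual device in your environment:

# esxcli storage core path list -d naa.60060e8008cafe000050cafe00004154

A mix of dead or unavailable paths together with at least one active path for the device matches the condition described above.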

Log events: 

# ESXi host reports retriable errors such as H:0x2, H:0xc, etc.
2022-05-27T13:33:45.000Z cpu31:2098552)ScsiDeviceIO: 3448: Cmd(0x45ac00289e80) 0x2a, CmdSN 0xa40007 from world 5513955 to dev "naa.60060e8008cafe000050cafe00004154" failed H:0xc D:0x0 P:0x0 Invalid sense data: 0xcf 0x2 0x43.
2022-05-27T13:33:45.000Z cpu31:2098552)ScsiDeviceIO: 3448: Cmd(0x45abc0dc7b40) 0x2a, CmdSN 0x8a0007 from world 5513955 to dev "naa.60060e8008cafe000050cafe00004154" failed H:0xc D:0x0 P:0x0 Invalid sense data: 0xcf 0x2 0x43.
2022-05-27T13:33:45.001Z cpu31:2098552)ScsiDeviceIO: 3448: Cmd(0x45ac0031d540) 0x2a, CmdSN 0x670007 from world 5513955 to dev "naa.60060e8008cafe000050cafe00004154" failed H:0xc D:0x0 P:0x0 Invalid sense data: 0xcf 0x2 0x43.
2022-05-27T13:33:45.001Z cpu31:2098552)ScsiDeviceIO: 3448: Cmd(0x45abc0db0f80) 0x2a, CmdSN 0x1d0007 from world 5513955 to dev "naa.60060e8008cafe000050cafe00004154" failed H:0xc D:0x0 P:0x0 Invalid sense data: 0xcf 0x2 0x43.
# Guest OS RHEL sees IO errors
May 27 15:33:46 sla70116 kernel: [14099949.568253] blk_update_request: I/O error, dev nvme0n9, sector 22792296
May 27 15:33:47 sla70116 kernel: [14099949.960416] blk_update_request: I/O error, dev nvme0n9, sector 15496296
May 27 15:33:47 sla70116 kernel: [14099949.960469] blk_update_request: I/O error, dev nvme0n9, sector 23034984
May 27 15:33:47 sla70116 kernel: [14099949.960500] blk_update_request: I/O error, dev nvme0n9, sector 7813736
May 27 15:33:47 sla70116 kernel: [14099949.960577] blk_update_request: I/O error, dev nvme0n9, sector 23008360
May 27 15:33:47 sla70116 kernel: [14099949.960578] blk_update_request: I/O error, dev nvme0n9, sector 7981672
May 27 15:33:47 sla70116 kernel: [14099949.960581] blk_update_request: I/O error, dev nvme0n9, sector 7264360
May 27 15:33:47 sla70116 kernel: [14099949.960584] blk_update_request: I/O error, dev nvme0n9, sector 7901288
May 27 15:33:47 sla70116 kernel: [14099949.960588] blk_update_request: I/O error, dev nvme0n9, sector 23022184
May 27 15:33:47 sla70116 kernel: [14099949.960592] blk_update_request: I/O error, dev nvme0n9, sector 22760040
# VMX logs show:
2022-05-27T13:33:45.180Z| vcpu-41| I125: NVME-VMK: nvme1:8: WRITE Command failed. Status: 0x0/0x82.
2022-05-27T13:33:45.180Z| vcpu-41| I125: NVME-VMK: nvme1:8: WRITE Command failed. Status: 0x0/0x82.
2022-05-27T13:33:45.180Z| vcpu-41| I125: NVME-VMK: nvme1:8: WRITE Command failed. Status: 0x0/0x82.
2022-05-27T13:33:45.180Z| vcpu-41| I125: NVME-VMK: nvme1:8: WRITE Command failed. Status: 0x0/0x82.
2022-05-27T13:33:45.180Z| vcpu-41| I125: NVME-VMK: nvme1:8: WRITE Command failed. Status: 0x0/0x82.
2022-05-27T13:33:45.180Z| vcpu-41| I125: NVME-VMK: nvme1:8: WRITE Command failed. Status: 0x0/0x82.
2022-05-27T13:33:45.180Z| vcpu-41| I125: NVME-VMK: nvme1:8: WRITE Command failed. Status: 0x0/0x82.
2022-05-27T13:33:45.180Z| vcpu-41| I125: NVME-VMK: nvme1:8: WRITE Command failed. Status: 0x0/0x82.
2022-05-27T13:33:45.180Z| vcpu-41| I125: NVME-VMK: nvme1:8: WRITE Command failed. Status: 0x0/0x82.
### The above events show one occurrence; the errors are not limited to these entries.

Environment

VMware vSphere 6.7.x
VMware vSphere 7.0.x
VMware vSphere 6.x
VMware vSphere 6.5.x

Cause

After H:0xc, H:0x2, and other retriable errors, the host retries the IOs on another surviving path. If there is at least one working path between the host and the storage, the IOs succeed. In the case of RDMs, however, the Pluggable Storage Architecture (PSA) hands control over to the Guest Operating System (GOS, here RHEL), and the GOS is expected to handle the retries. Because the NVMe driver's max_retries parameter defaults to 5 in the guest, the GOS retries the IOs only 5 times before marking them as failed. If the failed IOs are metadata IOs, XFS shuts down the filesystem to protect data integrity.
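
The default retry limit can be confirmed from inside the guest; this is a minimal check, assuming the RHEL kernel exposes the nvme_core module parameters under /sys/module:

# cat /sys/module/nvme_core/parameters/max_retries
5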

Resolution

  • For RHEL guests that use NVMe devices presented from vSphere storage, the guest NVMe driver should be tuned to 30 retries to allow sufficient recovery attempts and avoid errors being returned to the XFS layer.
  • Steps to change the parameter:
SSH to the GOS and run the following command:
# echo 30 > /sys/module/nvme_core/parameters/max_retries
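
This change takes effect immediately but does not persist across reboots. One way to make it persistent, assuming nvme_core is built as a loadable module (the file name below is only illustrative), is a modprobe options file:

# echo "options nvme_core max_retries=30" > /etc/modprobe.d/nvme-max-retries.conf

If nvme_core is built into the kernel instead, the equivalent setting is the kernel command-line parameter nvme_core.max_retries=30. After a reboot, verify the new value with:

# cat /sys/module/nvme_core/parameters/max_retries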


Additional Information

Impact/Risks:
  • IO errors may surface in the GOS when the number of retries exceeds the default maximum of 5.
  • Because XFS disables the filesystem to protect data integrity, the VM may go into a hung state.