Host experiences retriable errors (H:0xc, H:0x2); XFS on an RDM attached through a vNVMe adapter disables the filesystem after 5 retries.


Article ID: 313979


Updated On:

Products

VMware vSphere ESXi

Issue/Introduction

This article assists in identifying the problem and provides a resolution in a timely manner.

Symptoms:

  • The issue is observed only when a RHEL virtual machine is running XFS configured on physical RDMs

  • There is no similar issue with VMDKs that use the XFS filesystem

  • The problem was first noticed on RHEL 7.1, but it is not exclusive to that release

  • The issue is seen when the host loses some paths to the storage while at least one path is still servicing IOs correctly (a host-side path check is sketched below)
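
As a quick host-side check, the path states for the affected device can be listed with esxcli; the device identifier below is taken from the log excerpt and should be replaced with the actual device in your environment:

# esxcli storage core path list -d naa.60060e8008cafe000050cafe00004154

A mix of dead or unavailable paths together with at least one active path for the device matches the condition described above.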

Log events: 

# ESXi host reports retriable errors such as H:0x2, H:0xc, etc.
2022-05-27T13:33:45.000Z cpu31:2098552)ScsiDeviceIO: 3448: Cmd(0x45ac00289e80) 0x2a, CmdSN 0xa40007 from world 5513955 to dev "naa.60060e8008cafe000050cafe00004154" failed H:0xc D:0x0 P:0x0 Invalid sense data: 0xcf 0x2 0x43.
2022-05-27T13:33:45.000Z cpu31:2098552)ScsiDeviceIO: 3448: Cmd(0x45abc0dc7b40) 0x2a, CmdSN 0x8a0007 from world 5513955 to dev "naa.60060e8008cafe000050cafe00004154" failed H:0xc D:0x0 P:0x0 Invalid sense data: 0xcf 0x2 0x43.
2022-05-27T13:33:45.001Z cpu31:2098552)ScsiDeviceIO: 3448: Cmd(0x45ac0031d540) 0x2a, CmdSN 0x670007 from world 5513955 to dev "naa.60060e8008cafe000050cafe00004154" failed H:0xc D:0x0 P:0x0 Invalid sense data: 0xcf 0x2 0x43.
2022-05-27T13:33:45.001Z cpu31:2098552)ScsiDeviceIO: 3448: Cmd(0x45abc0db0f80) 0x2a, CmdSN 0x1d0007 from world 5513955 to dev "naa.60060e8008cafe000050cafe00004154" failed H:0xc D:0x0 P:0x0 Invalid sense data: 0xcf 0x2 0x43.
# Guest OS RHEL sees IO errors
May 27 15:33:46 sla70116 kernel: [14099949.568253] blk_update_request: I/O error, dev nvme0n9, sector 22792296
May 27 15:33:47 sla70116 kernel: [14099949.960416] blk_update_request: I/O error, dev nvme0n9, sector 15496296
May 27 15:33:47 sla70116 kernel: [14099949.960469] blk_update_request: I/O error, dev nvme0n9, sector 23034984
May 27 15:33:47 sla70116 kernel: [14099949.960500] blk_update_request: I/O error, dev nvme0n9, sector 7813736
May 27 15:33:47 sla70116 kernel: [14099949.960577] blk_update_request: I/O error, dev nvme0n9, sector 23008360
May 27 15:33:47 sla70116 kernel: [14099949.960578] blk_update_request: I/O error, dev nvme0n9, sector 7981672
May 27 15:33:47 sla70116 kernel: [14099949.960581] blk_update_request: I/O error, dev nvme0n9, sector 7264360
May 27 15:33:47 sla70116 kernel: [14099949.960584] blk_update_request: I/O error, dev nvme0n9, sector 7901288
May 27 15:33:47 sla70116 kernel: [14099949.960588] blk_update_request: I/O error, dev nvme0n9, sector 23022184
May 27 15:33:47 sla70116 kernel: [14099949.960592] blk_update_request: I/O error, dev nvme0n9, sector 22760040
# VMX logs show:
2022-05-27T13:33:45.180Z| vcpu-41| I125: NVME-VMK: nvme1:8: WRITE Command failed. Status: 0x0/0x82.
2022-05-27T13:33:45.180Z| vcpu-41| I125: NVME-VMK: nvme1:8: WRITE Command failed. Status: 0x0/0x82.
2022-05-27T13:33:45.180Z| vcpu-41| I125: NVME-VMK: nvme1:8: WRITE Command failed. Status: 0x0/0x82.
2022-05-27T13:33:45.180Z| vcpu-41| I125: NVME-VMK: nvme1:8: WRITE Command failed. Status: 0x0/0x82.
2022-05-27T13:33:45.180Z| vcpu-41| I125: NVME-VMK: nvme1:8: WRITE Command failed. Status: 0x0/0x82.
2022-05-27T13:33:45.180Z| vcpu-41| I125: NVME-VMK: nvme1:8: WRITE Command failed. Status: 0x0/0x82.
2022-05-27T13:33:45.180Z| vcpu-41| I125: NVME-VMK: nvme1:8: WRITE Command failed. Status: 0x0/0x82.
2022-05-27T13:33:45.180Z| vcpu-41| I125: NVME-VMK: nvme1:8: WRITE Command failed. Status: 0x0/0x82.
2022-05-27T13:33:45.180Z| vcpu-41| I125: NVME-VMK: nvme1:8: WRITE Command failed. Status: 0x0/0x82.
### The above events show one occurrence; the errors are not limited to these entries.

Environment

VMware vSphere 6.7.x
VMware vSphere 7.0.x
VMware vSphere 6.x
VMware vSphere 6.5.x

Cause

After H:0xc, H:0x2, and other retriable errors, the host retries the IOs on another surviving path. If there is at least one working path between the host and the storage, the IOs succeed. In the case of RDMs, however, the Pluggable Storage Architecture (PSA) hands control over to the Guest Operating System (GOS, here RHEL), and the GOS is expected to handle the retries. Because the NVMe driver's max_retries parameter defaults to 5 in the guest, the GOS retries the IOs only 5 times before marking them as failed. If the failed IOs are metadata IOs, XFS shuts down the filesystem to protect data integrity.
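
The default retry limit can be confirmed from inside the guest; this is a minimal check, assuming the RHEL kernel exposes the nvme_core module parameters under /sys/module:

# cat /sys/module/nvme_core/parameters/max_retries
5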

Resolution

  • For RHEL guests that use NVMe devices presented from vSphere storage, the guest NVMe driver should be tuned to 30 retries to allow sufficient recovery attempts and avoid errors being returned to the XFS layer.
  • Steps to change the parameter:
SSH to the GOS and run the following command:
# echo 30 > /sys/module/nvme_core/parameters/max_retries
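
This change takes effect immediately but does not persist across reboots. One way to make it persistent, assuming nvme_core is built as a loadable module (the file name below is only illustrative), is a modprobe options file:

# echo "options nvme_core max_retries=30" > /etc/modprobe.d/nvme-max-retries.conf

If nvme_core is built into the kernel instead, the equivalent setting is the kernel command-line parameter nvme_core.max_retries=30. After a reboot, verify the new value with:

# cat /sys/module/nvme_core/parameters/max_retries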


Additional Information

Impact/Risks:
  • IO errors may surface in the GOS when the number of retries exceeds the default maximum of 5.
  • Because XFS disables the filesystem to protect data integrity, the VM may go into a hung state.