Qlogic "qlnativefc" driver reports an enormous amount of events with "DIF Error" and I/O errors to the target LUNs
search cancel

Qlogic "qlnativefc" driver reports an enormous amount of events with "DIF Error" and I/O errors to the target LUNs

book

Article ID: 318583

calendar_today

Updated On:

Products

VMware vSphere ESXi

Issue/Introduction

Symptoms:

Symptom - 1:- 

  • "DIF Error" in vmkernel logs

2020-04-06T08:25:12.154Z cpu74:2098447)qlnativefc: vmhba2(af:0.0): iocb(s) 0x430a045f7340 Returned STATUS.
2020-04-06T08:25:12.154Z cpu74:2098447)qlnativefc: vmhba2(af:0.0): DIF ERROR in cmd: 0x28 Type=0x0 lba=0xb100 actRefTag=0x1000000, expRefTag=0xb100, actAppTag=0x0, expAppTag=0x0, actGuard=0x400, expGuard=0xa671.
2020-04-06T08:25:12.154Z cpu74:2098447)ScsiDeviceIO: 3449: Cmd(0x45a74c43f980) 0x28, CmdSN 0x70af4 from world 2100597 to dev "naa.60060e80165251000001973100000105" failed H:0xf D:0x0 P:0x0 Invalid sense data: 0x1a 0x1b 0x45.
2020-04-06T08:25:12.155Z cpu0:2107305)qlnativefc: vmhba2(af:0.0): iocb(s) 0x430a04577780 Returned STATUS.
2020-04-06T08:25:12.155Z cpu0:2107305)qlnativefc: vmhba2(af:0.0): DIF ERROR in cmd: 0x28 Type=0x0 lba=0xb100 actRefTag=0x1000000, expRefTag=0xb100, actAppTag=0x0, expAppTag=0x0, actGuard=0x400, expGuard=0xa671.
2020-04-06T08:25:12.155Z cpu2:2098444)ScsiDeviceIO: 3449: Cmd(0x459b62387340) 0x28, CmdSN 0x70af5 from world 2100597 to dev "naa.60060e80165251000001973100000105" failed H:0xf D:0x0 P:0x0 Invalid sense data: 0x4f 0x2 0x43.

Additional symptoms might also include:

  • Disk I/O operation failures reported by VMs
  • Filesystem in Linux Guest OS become read-only due to underlying disk I/O latency or failure
  • Unresponsive or Sluggish ESXi host 
  • The high number of H:0xf SCSI error codes in vmkernel logs

A large number of issues observed with qlnativefc driver version 3.1.29.0 and qlnativefc driver version 3.1.31.0, but not limited to these driver versions. 
 

Symptom - 2:- 

  • "Data Integrity Field (DIF) Error" in VMkernel logs appears only if the Debugging is enabled in the qlnativefc Driver 
2020-07-31T10:01:02.130Z cpu0:66211)qlnativefc: vmhba1(37:0.0): Data phase error, rediscover DIF capability : senseKey = 0x5 : asc = 0x4b : ascq = 0x82
2020-07-31T10:01:02.143Z cpu17:480587)qlnativefc: vmhba1(37:0.0): Data phase error, rediscover DIF capability : senseKey = 0x5 : asc = 0x4b : ascq = 0x82
2020-07-31T10:01:02.144Z cpu0:2122267)qlnativefc: vmhba1(37:0.0): Data phase error, rediscover DIF capability : senseKey = 0x5 : asc = 0x4b : ascq = 0x82
  • Disk I/O operation failures reported by VMs
  • Unresponsive Virtual Machines 
  • Unresponsive or Sluggish ESXi host 
  • Hostd reporting unresponsive 
  • The VMkernel logs shows too many "State in doubt, requested fast path state update" messages 
2020-07-26T22:40:56.347Z cpu17:66374)WARNING: NMP: nmp_DeviceRequestFastDeviceProbe:237: NMP device "naa.xxx.ID" state in doubt; requested fast path state update...
2020-07-26T22:40:56.546Z cpu17:66374)WARNING: NMP: nmp_DeviceRequestFastDeviceProbe:237: NMP device "naa.xxx.ID" state in doubt; requested fast path state update...
2020-07-26T22:41:56.346Z cpu0:66374)WARNING: NMP: nmp_DeviceRequestFastDeviceProbe:237: NMP device "naa.xxx.ID" state in doubt; requested fast path state update...
  • I/O Error reported to the LUNs very specific to the ones which have DIF disabled from the Storage Array 
2020-07-29T06:15:00.439Z cpu23:65624)ScsiDeviceIO: 3015: Cmd(0x439a86176840) 0x28, CmdSN 0x49855f from world 0 to dev "naa.xxx-ID" failed H:0x5 D:0x0 P:0x0 Invalid sense data: 0x0 0x0 0x0.
2020-07-29T06:15:00.439Z cpu15:2952583)Partition: 427: Failed read for "naa.xxx-ID": I/O error
2020-07-29T06:15:00.439Z cpu15:2952583)Partition: 1007: Failed to read protective mbr on "naa.xxx-ID" : I/O error
  • This was mostly observed with a combination of 3Pardata HP Arrays which have DIF disabled LUNs and qlnativefc driver with DIF enabled by default within the driver
  • HBA Adapter Failed (Read, Blocks Read and Written) are way to high in such cases, refer the HBA stats for the info 
vmhba1:
   Successful Commands: 728172193
   Blocks Read: 44452702311
   Blocks Written: 13163331407
   Read Operations: 325720036
   Write Operations: 355132847
   Reserve Operations: 6142
   Reservation Conflicts: 431652
   Failed Commands: 48551498
   Failed Blocks Read: 403288824638317
   Failed Blocks Written: 311812910615
   Failed Read Operations: 48078480
   Failed Write Operations: 50916
  • The issue was observed in the below versions of drivers 
vSphere 6.5 :- 2.1.96.0
vSphere 6.7 :- 3.1.31.0
The issue was also observed in vSphere 7.0 
  • The issue was also mostly observed with QLogic QLE2690 Single Port 16Gb Fibre Channel Adapters and the same vendor QLogic ISP2532-based 8Gb Fibre Adapters had no issues reported  


Environment

VMware vSphere 7.0.x
VMware vSphere 6.7.x
VMware vSphere 6.5.x

Cause

  • The 3PAR storage reports an illegal request made by the driver.
  • The issue was the Qlogic driver would mistakenly send Data Integrity Field (DIF) enabled I/O to a LUN that did not support it, resulting in the error seen in the logs.

Resolution

  • To Enable Debug logs on qlnativefc driver follow the below steps, NOTE:- The DIF errors in the VMkernel logs are only observed if the driver logging is changed to debug mode.
    • Dynamically without requiring reboot:
      $ /usr/lib/vmware/vmkmgmt_keyval/vmkmgmt_keyval -i MOD_PARM/qlogic -s scsi-qlaenable-log -k DRIVERINFO
    • To make persistent across reboots:
      $ esxcfg-module -s "ql2xextended_error_logging=1" qlnativefc (reboot required)
  • Once the Debug is enabled you will see the DIF errors on the logs reported
  • The issue is fixed in the Qlogic Driver versions mentioned below 
    vSphere 6.5 :- Qlogic Driver version : 2.1.101.0 
    vSphere 6.7 :- Qlogic Driver version : 3.1.36.0  
    vSphere 7.0 :- Qlogic Driver version : 4.1.14.0 
  • Below is the command to disable the Debugging on the qlnativefc driver
    • ‘esxcfg-module -s “ql2xextended_error_logging=0” qlnativefc’  this command will require a reboot
    • “/usr/lib/vmware/vmkmgmt_keyval/vmkmgmt_keyval -i MOD_PARM/qlogic -s scsi-qladisable-log -k DRIVERINFO”  this command does not require a reboot
NOTE:- Symptom -1 mentioned above was only noticed in the combination of vSphere 6.7 version of qlnativefc driver, however the fix and the workaround for both Symptom 1 & 2 are the same and the fixes are available in the above mentioned Driver release.

Workaround:
  • Workaround is to disable the DIF from the driver, follow the steps below, this will address the DIF errors reported
    •  esxcfg-module -s “ql2xt10difvendor=0” qlnativefc
  • ESXi Host Reboot is required once you disable the parameter in the Driver