vSAN -- lsi-msgpt35 -- PSOD -- Stuck I/O -- Multiple Drives missing

Article ID: 317671

Products

VMware vSAN

Issue/Introduction

Symptoms:
Note: This is an extremely rare condition; several factors must coincide to trigger it.
Affects ESXi versions earlier than 7.0 Update 3 (19193900).

When using vSAN with a storage controller that uses the lsi_msgpt35 driver, a host can encounter one or more of the following: a PSOD, stuck I/O, or multiple drives missing.
Messages similar to the following may appear in the vmkernel log (specific times and details will vary), showing path, driver, and SCSI H:0x8 errors:
 
HPP: HppThrottleLogForDevice:1070: Error status H:0x8 D:0x0 P:0x0 . from device naa.xxxxxxxxxxxxxxxx repeated xxxxx times, hppAction = 3
ScsiDeviceIO: 4277: Cmd(0x45bc65297ac0) 0x88, CmdSN 0x6c6e100c from world 0 to dev "naa.xxxxxxxxxxxxxxxx" failed H:0x8 D:0x0 P:0x0
ScsiDeviceIO: 4277: Cmd(0x45bc6527e0c0) 0x28, CmdSN 0x7b456b19 from world 0 to dev "naa.xxxxxxxxxxxxxxxx" failed H:0x8 D:0x0 P:0x0
[HB state abcdef02 offset 4161536 gen 131 stampUS xxxxxxxxxxxxxx uuid xxxxxxxx-xxxxxxxx-xxxx-xxxxxxxxxxxx jrnl <FB 8388608> drv 24.82 lockImpl 4 ip XXX.XXX.XXX.XXX]
lsi_msgpt35_0: _base_static_config_pages: 4929: TimeSyncInterval value read from Manufacturing page-11 is zero. Periodic Time-Sync will be disabled.
lsi_msgpt35_0: _base_display_ioc_capabilities: 4606: SAS3408: FWVersion(14.00.02.00), ChipRevision(0x01), BiosVersion(00.00.00.00)
lsi_msgpt35_0: _base_display_ioc_capabilities: 4613: FWPackageVersion(14.00.02.06)
lsi_msgpt35_0: _base_send_port_enable: 5362: Command terminated due to timeout
lsi_msgpt35_0: _debug_dump: 244: Port enable request dump
lsi_msgpt35_0: offset:data
lsi_msgpt35_0: [0x00]:06000000
WARNING: lsi_msgpt35_0: _base_make_ioc_operational: 5625: Port Enable failed - Timeout
lsi_msgpt35_0: _scsih_remove_device: 9717: ENTER: C0:T1, handle(0x0000), sas_addr(xxxxxxxxxxxxxxxxxx), portId(0)
lsi_msgpt35_0: _scsih_remove_device: 9721: ENTER: enclosure level(0x0000), connector name(C1  )
WARNING: ScsiPath: 11252: Path lost for adapter vmhba0 target 1 channel 0 lun 0
lsi_msgpt35_0: _ctl_process_mpt_command: 1267: Command terminated due to timeout
lsi_msgpt35_0: msgpt_afd_release: 1431: Diagnostic Trace Buffer was already released
ScsiPath: 9180: DeletePath : adapter=vmhba0, channel=0, target=1, lun=0
HPP: HppUnclaimPath:3861: Unclaiming path vmhba0:C0:T1:L0
ScsiDevice: 10527: device mpx.vmhba0:C0:T1:L0 refCount is 3; waiting for 1.
D:0x0 P:0x0 . from device naa.xxxxxxxxxxxxxxxx repeated xxxxx times, hppAction = x
ScsiDevice: 10527: device mpx.vmhba0:C0:T1:L0 refCount is 3; waiting for 1.
ScsiDevice: 10527: device mpx.vmhba0:C0:T1:L0 refCount is 3; waiting for 1.
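
The log excerpt above can be checked for mechanically. The following is a minimal sketch, not an official diagnostic tool: it counts lines that match a few signatures copied from the excerpt. The label names and the function `scan_vmkernel_lines` are hypothetical, and the choice of which messages to treat as signatures is an assumption; a match only indicates that similar messages are present, not that this specific issue occurred.

```python
import re
from collections import Counter

# Signatures taken from the log excerpt in this article.
# The labels on the left are illustrative names, not VMware terminology.
SIGNATURES = {
    "host_status_0x8": re.compile(r"failed H:0x8 D:0x0 P:0x0|Error status H:0x8"),
    "port_enable_timeout": re.compile(r"lsi_msgpt35_\d+: .*Port Enable failed - Timeout"),
    "command_timeout": re.compile(r"lsi_msgpt35_\d+: .*Command terminated due to timeout"),
    "path_lost": re.compile(r"ScsiPath: \d+: Path lost for adapter"),
}


def scan_vmkernel_lines(lines):
    """Count how many lines match each signature."""
    counts = Counter()
    for line in lines:
        for label, pattern in SIGNATURES.items():
            if pattern.search(line):
                counts[label] += 1
    return counts


# Sample lines copied from the excerpt above.
sample = [
    'HPP: HppThrottleLogForDevice:1070: Error status H:0x8 D:0x0 P:0x0 . from device naa.xxxxxxxxxxxxxxxx repeated xxxxx times, hppAction = 3',
    'ScsiDeviceIO: 4277: Cmd(0x45bc65297ac0) 0x88, CmdSN 0x6c6e100c from world 0 to dev "naa.xxxxxxxxxxxxxxxx" failed H:0x8 D:0x0 P:0x0',
    'lsi_msgpt35_0: _base_send_port_enable: 5362: Command terminated due to timeout',
    'WARNING: lsi_msgpt35_0: _base_make_ioc_operational: 5625: Port Enable failed - Timeout',
    'WARNING: ScsiPath: 11252: Path lost for adapter vmhba0 target 1 channel 0 lun 0',
]

result = scan_vmkernel_lines(sample)
for label, count in sorted(result.items()):
    print(f"{label}: {count}")
```

To scan a real log, pass an open file object instead of `sample`; the location of the vmkernel log on a given host is not covered here.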

 


Cause

lsi_msgpt35 driver versions from 15.xx through 18.00.01.00 have a bug related to the uptime of the drive. As a result, a window in which this issue can occur opens every 49 days 17 hours 2 minutes 47.295 seconds of uptime.

The window lasts a few milliseconds, until the counter resets to 0.

During this window, the controller loses access to the drives.


If certain IOCTL commands are issued to the drives within this window, subsequent I/O becomes stuck: I/O timeouts occur, and the stuck I/O cannot be cleared.

Once vSAN detects the stuck I/O, it either initiates a PSOD or takes the affected disk group offline.
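
The interval quoted above equals 4,294,967,295 milliseconds, which is the largest value an unsigned 32-bit counter can hold. That a 32-bit millisecond counter is involved is an inference from the number, not something this article states. Under that assumption, the sketch below reproduces the interval and estimates the time remaining until the next window for a given uptime. The function names are hypothetical, and the article does not specify which uptime clock applies, so the input is whatever uptime value is relevant.

```python
from datetime import timedelta

# Assumption: the counter is unsigned 32-bit and ticks once per millisecond.
COUNTER_MAX_MS = 2**32 - 1  # 4294967295


def format_ms(ms):
    """Render a millisecond count as days, hours, minutes and seconds."""
    td = timedelta(milliseconds=ms)
    hours, rem = divmod(td.seconds, 3600)
    minutes, seconds = divmod(rem, 60)
    millis = td.microseconds // 1000
    return (f"{td.days} days {hours} hours {minutes} minutes "
            f"{seconds}.{millis:03d} seconds")


def ms_until_next_window(uptime_ms):
    """Milliseconds until the counter next reaches its maximum value."""
    period = COUNTER_MAX_MS + 1  # the counter wraps to 0 after its maximum
    return COUNTER_MAX_MS - (uptime_ms % period)


print(format_ms(COUNTER_MAX_MS))

# Example: time remaining for an uptime of exactly 30 days.
thirty_days_ms = 30 * 24 * 60 * 60 * 1000
print(format_ms(ms_until_next_window(thirty_days_ms)))
```

The first line printed matches the interval given in this article. Because the window recurs every time the counter wraps, a long-running host passes through it repeatedly, not just once.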

Resolution

Upgrade to lsi_msgpt35 version 18.00.02.00 or higher (as per vSAN HCL guidance for your build and controller) as soon as possible.
The issue is fixed in the inbox driver included with 7.0 Update 3 (19193900).
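
The version ranges above can be turned into a simple check. This is a minimal sketch under stated assumptions: the function names are hypothetical, the version string is assumed to begin with a dotted numeric part (any suffix after it is ignored), and versions outside the ranges named in this article are reported as "not listed" because the article makes no statement about them.

```python
import re

# Ranges as stated in this article.
AFFECTED_MIN = (15,)           # "15.xx"
AFFECTED_MAX = (18, 0, 1, 0)   # 18.00.01.00, last affected version
FIXED_MIN = (18, 0, 2, 0)      # 18.00.02.00, first fixed version


def parse_driver_version(text):
    """Extract the leading dotted numeric part as a tuple of ints.

    Assumption: anything after the dotted part (for example a
    packaging suffix) is not needed for the comparison.
    """
    match = re.match(r"\s*(\d+(?:\.\d+)*)", text)
    if not match:
        raise ValueError(f"no version number found in {text!r}")
    return tuple(int(part) for part in match.group(1).split("."))


def classify_driver_version(text):
    """Return 'fixed', 'affected', or 'not listed' for a version string."""
    version = parse_driver_version(text)
    if version >= FIXED_MIN:
        return "fixed"
    if AFFECTED_MIN <= version <= AFFECTED_MAX:
        return "affected"
    return "not listed"


# The "-example" suffix is made up for illustration.
for candidate in ["18.00.02.00", "18.00.01.00", "17.00.02.00-example", "14.15.00.00"]:
    print(candidate, "->", classify_driver_version(candidate))
```

A "fixed" result only reflects the version numbers in this article; the vSAN HCL remains the authority on which driver version is supported for a given build and controller.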

Workaround:
If a PSOD is encountered, reboot the host to clear the condition.
If a disk group is taken offline, reboot the host to clear the condition, then recreate the disk group.

Additional Information

See the Lenovo advisory on this issue:

ESXi node PSOD or multiple drives missing when using lsi-msgpt35 driver version equal or prior to 18.00.01.00

Impact/Risks:
This carries the same risk as any PSOD or disk group offline event for vSAN.
During a PSOD, VMs running on the host will crash.
Depending on the storage policy in use and its compliance state, data may be unavailable until the host is rebooted.