vSAN -- lsi_msgpt35 -- PSOD -- Stuck I/O -- Multiple Drives missing

Article ID: 317671

Updated On:

Products

VMware vSAN

Issue/Introduction

Symptoms:
Note: This is an extremely rare condition that is triggered only when multiple factors occur simultaneously.
Affects ESXi versions earlier than 7.0 Update 3 (build 19193900).

When using vSAN with a storage controller that uses the lsi_msgpt35 driver, a host can encounter one or more of the following: a PSOD, stuck I/O, or multiple drives missing.
The vmkernel log may show messages similar to the following (specific times and details will vary), including path, driver, and SCSI H:0x8 errors:
2021-11-24T12:53:09.824Z cpu33:2098069)HPP: HppThrottleLogForDevice:1070: Error status H:0x8 D:0x0 P:0x0 . from device naa.5000c500a19ca6ef repeated 10240 times, hppAction = 3
2021-11-24T12:53:10.596Z cpu12:2098067)ScsiDeviceIO: 4277: Cmd(0x45bc65297ac0) 0x88, CmdSN 0x6c6e100c from world 0 to dev "naa.5000c500a19ca9a3" failed H:0x8 D:0x0 P:0x0
2021-11-24T12:53:11.133Z cpu2:2098067)ScsiDeviceIO: 4277: Cmd(0x45bc6527e0c0) 0x28, CmdSN 0x7b456b19 from world 0 to dev "naa.5000c500a1999bb3" failed H:0x8 D:0x0 P:0x0
2021-11-24T12:53:13.004Z cpu1:10484749)  [HB state abcdef02 offset 4161536 gen 131 stampUS 4294797547629 uuid 615cad72-4a3176a6-3710-4c52624f2444 jrnl <FB 8388608> drv 24.82 lockImpl 4 ip 192.168.195.117]
2021-11-24T12:53:13.061Z cpu1:2103503)lsi_msgpt35_0: _base_static_config_pages: 4929: TimeSyncInterval value read from Manufacturing page-11 is zero. Periodic Time-Sync will be disabled.
2021-11-24T12:53:13.061Z cpu1:2103503)lsi_msgpt35_0: _base_display_ioc_capabilities: 4606: SAS3408: FWVersion(14.00.02.00), ChipRevision(0x01), BiosVersion(00.00.00.00)
2021-11-24T12:53:13.061Z cpu1:2103503)lsi_msgpt35_0: _base_display_ioc_capabilities: 4613: FWPackageVersion(14.00.02.06)
2021-11-24T12:53:13.061Z cpu1:2103503)lsi_msgpt35_0: _base_send_port_enable: 5362: Command terminated due to timeout
2021-11-24T12:53:13.061Z cpu1:2103503)lsi_msgpt35_0: _debug_dump: 244: Port enable request dump
2021-11-24T12:53:13.061Z cpu1:2103503)lsi_msgpt35_0: offset:data
2021-11-24T12:53:13.061Z cpu1:2103503)lsi_msgpt35_0: [0x00]:06000000
2021-11-24T12:53:13.061Z cpu1:2103503)WARNING: lsi_msgpt35_0: _base_make_ioc_operational: 5625: Port Enable failed - Timeout
2021-11-24T12:53:13.061Z cpu56:2097936)lsi_msgpt35_0: _scsih_remove_device: 9717: ENTER: C0:T1, handle(0x0000), sas_addr(0x300705b01088f6e0), portId(0)
2021-11-24T12:53:13.061Z cpu56:2097936)lsi_msgpt35_0: _scsih_remove_device: 9721: ENTER: enclosure level(0x0000), connector name(C1  )
2021-11-24T12:53:13.061Z cpu56:2097936)WARNING: ScsiPath: 11252: Path lost for adapter vmhba0 target 1 channel 0 lun 0
2021-11-24T12:53:13.061Z cpu1:2103503)lsi_msgpt35_0: _ctl_process_mpt_command: 1267: Command terminated due to timeout
2021-11-24T12:53:13.061Z cpu1:2103503)lsi_msgpt35_0: msgpt_afd_release: 1431: Diagnostic Trace Buffer was already released
2021-11-24T12:53:13.061Z cpu4:2097584)ScsiPath: 9180: DeletePath : adapter=vmhba0, channel=0, target=1, lun=0
2021-11-24T12:53:13.061Z cpu4:2097584)HPP: HppUnclaimPath:3861: Unclaiming path vmhba0:C0:T1:L0
2021-11-24T12:53:13.061Z cpu4:2097584)ScsiDevice: 10527: device mpx.vmhba0:C0:T1:L0 refCount is 3; waiting for 1.
D:0x0 P:0x0 . from device naa.5000c500a1999bb3 repeated 81920 times, hppAction = 3
2021-11-24T12:53:13.500Z cpu4:2097584)ScsiDevice: 10527: device mpx.vmhba0:C0:T1:L0 refCount is 3; waiting for 1.
2021-11-24T12:53:13.749Z cpu12:2097584)ScsiDevice: 10527: device mpx.vmhba0:C0:T1:L0 refCount is 3; waiting for 1.
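
As a rough aid for spotting these messages in a saved vmkernel log, the following is a minimal Python sketch. The regular expressions are derived only from the sample lines above, and the names SIGNATURES and scan_vmkernel_log are hypothetical; message wording may differ between builds, so treat a match as a hint and not as a diagnosis.

```python
import re

# Illustrative patterns based on the sample log lines in this article.
# They are assumptions, not an official signature list.
SIGNATURES = {
    "scsi_host_error": re.compile(r"(?:failed|Error status) H:0x8 D:0x0 P:0x0"),
    "port_enable_timeout": re.compile(
        r"lsi_msgpt35_\d+: .*Port Enable failed - Timeout"
    ),
    "path_lost": re.compile(r"ScsiPath: \d+: Path lost for adapter vmhba\d+"),
}


def scan_vmkernel_log(lines):
    """Return a dict mapping each signature name to its number of matching lines."""
    counts = {name: 0 for name in SIGNATURES}
    for line in lines:
        for name, pattern in SIGNATURES.items():
            if pattern.search(line):
                counts[name] += 1
    return counts
```

The function accepts any iterable of strings, so an open log file can be passed directly, for example `scan_vmkernel_log(open(path, errors="replace"))`, where `path` points to a copy of the vmkernel log.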

 


Cause

lsi_msgpt35 driver versions 15.xx through 18.00.01.00 have a bug related to the uptime of the drive, which opens
a potential window for this issue every 49 days 17 hours 2 minutes 47.295 seconds of uptime.
Each window lasts a few milliseconds, until the counter resets to 0.
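
The stated interval equals the largest value an unsigned 32-bit counter can hold when it counts milliseconds. The article does not state the counter's width or unit, so the sketch below rests on that assumption; COUNTER_MAX_MS, wrap_interval, and ms_until_next_wrap are hypothetical names used only to show the arithmetic.

```python
from datetime import timedelta

# Assumption: an unsigned 32-bit counter that advances once per millisecond.
COUNTER_MAX_MS = 2**32 - 1  # 4294967295
COUNTER_PERIOD_MS = 2**32   # the counter returns to 0 after this many ms


def wrap_interval():
    """Uptime at which the assumed counter reaches its maximum value."""
    return timedelta(milliseconds=COUNTER_MAX_MS)


def ms_until_next_wrap(uptime_ms):
    """Milliseconds remaining until the assumed counter next resets to 0."""
    return COUNTER_PERIOD_MS - (uptime_ms % COUNTER_PERIOD_MS)


interval = wrap_interval()
hours, rest = divmod(interval.seconds, 3600)
minutes, seconds = divmod(rest, 60)
print(interval.days, hours, minutes, seconds, interval.microseconds // 1000)
# prints: 49 17 2 47 295
```

Under this assumption the window recurs at every multiple of roughly 49.7 days of uptime, which is why a host can run for weeks before the issue appears.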

During this window the controller loses access to the drives.
If certain IOCTL commands are issued to the drives within this window, further I/O becomes stuck and I/O timeouts occur, with no way to clear the stuck I/O.

Once vSAN detects the stuck I/O, a PSOD is initiated or the affected disk group is taken offline.

Resolution

Upgrade to lsi_msgpt35 version 18.00.02.00 or later (per the vSAN HCL guidance for your build and controller) as soon as possible.
(The issue is fixed in the 7.0 Update 3 (19193900) inbox driver.)
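
To check a driver version string against the affected range given in this article, a small Python sketch such as the following may help. It assumes that "15.xx" means 15.00.00.00 and later and that the version string begins with four dot-separated numbers; parse_driver_version and is_affected are hypothetical helper names, and any build suffix after the fourth number is ignored.

```python
import re

# Range taken from this article; the lower bound is an assumed reading of "15.xx".
FIRST_AFFECTED = (15, 0, 0, 0)
LAST_AFFECTED = (18, 0, 1, 0)


def parse_driver_version(text):
    """Parse the leading 'A.B.C.D' part of a version string into a tuple of ints."""
    match = re.match(r"\s*(\d+)\.(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"unrecognized driver version: {text!r}")
    return tuple(int(part) for part in match.groups())


def is_affected(text):
    """True if the version falls inside the affected range stated in this article."""
    return FIRST_AFFECTED <= parse_driver_version(text) <= LAST_AFFECTED
```

For example, `is_affected("18.00.01.00")` returns True and `is_affected("18.00.02.00")` returns False. The installed driver version can be read from the host's VIB listing (for instance the output of `esxcli software vib list`) and passed in as a string.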

Workaround:
If a PSOD is encountered, reboot the host to clear the condition.
If a disk group goes offline, reboot the host to clear the condition, then recreate the disk group.

Additional Information

Please see the Lenovo advisory on this issue:
https://support.lenovo.com/ie/en/solutions/ht512561-esxi-node-psod-or-multiple-drives-missing-lenovo-thinksystem

Impact/Risks:
This carries the same risk as any PSOD or disk group offline event in vSAN.
During a PSOD, VMs running on the host will crash.
Depending on the storage policy in use and its compliance state, data may be unavailable until the host is rebooted.