Issue:
Hosts becoming unresponsive, datastores becoming inaccessible and storage performance degradation.
Devices observing high latency.
Here are some log excerpts from hosts having this issue.
hostd.log:
YYYY-MM-DDTHH:MM:SS.sssZ info hostd[2100788] [Originator@6876 sub=Vimsvc.ha-eventmgr] Event 3984 : Lost access to volume (Datastore) due to connectivity issues. Recovery attempt is in progress and outcome will be reported shortly.
YYYY-MM-DDTHH:MM:SS.sssZ
info hostd[2100744] [Originator@6876 sub=Vimsvc.ha-eventmgr] Event 4233 : Lost access to volume (Datastore) due to connectivity issues. Recovery attempt is in progress and outcome will be reported shortly.
YYYY-MM-DDTHH:MM:SS.sssZ
info hostd[2100744] [Originator@6876 sub=Vimsvc.ha-eventmgr] Event 4233 : Lost access to volume (Datastore) due to connectivity issues.Recovery attempt is in progress and outcome will be reported shortly.
YYYY-MM-DDTHH:MM:SS.sssZ
info hostd[2100744] [Originator@6876 sub=Hostsvc.VmkVprobSource] VmkVprobSource::Post event: (vim.event.EventEx) {--> key = 6,
--> chainId = 0,
--> createdTime = "1970-01-01T00:00:00Z",
--> userName = "",
--> host = (vim.event.HostEventArgument) {
--> name = "ESXi-1.vmware.local",
--> host = 'vim.HostSystem:ha-host'
--> },
--> eventTypeId = "esx.problem.vmfs.heartbeat.recovered",
--> arguments = (vmodl.KeyAnyValue) [
--> (vmodl.KeyAnyValue) {
--> key = "1",
--> value = ""
--> },
--> (vmodl.KeyAnyValue) {
--> key = "2",
--> value = "Datastore"
--> }
--> ],
--> objectId = "ha-host",
--> objectType = "vim.HostSystem",
--> }
YYYY-MM-DDTHH:MM:SS.sssZ
info hostd[2100744] [Originator@6876 sub=Vimsvc.ha-eventmgr] Event 4234 : Successfully restored access to volume (Datastore) following connectivity issues.
YYYY-MM-DDTHH:MM:SS.sssZ
info hostd[2101035] [Originator@6876 sub=Vimsvc.ha-eventmgr] Event 4209 : Path redundancy to storage device naa.xxxxxxxxxxxxxx degraded. Path vmhba0:C0:T101:L7 is down. Affected datastores: Datastore-1.
YYYY-MM-DDTHH:MM:SS.sss
Z info hostd[2101034] [Originator@6876 sub=Vimsvc.ha-eventmgr] Event 4210 : Path redundancy to storage device naa.xxxxxxxxxxxxxx degraded. Path vmhba0:C0:T100:L7 is down. Affected datastores: Datastore-1.
YYYY-MM-DDTHH:MM:SS.sssZ
info hostd[2101035] [Originator@6876 sub=Vimsvc.ha-eventmgr] Event 4209 : Path redundancy to storage device naa.xxxxxxxxxxxxxx degraded. Path vmhba0:C0:T101:L7 is down. Affected datastores: .
YYYY-MM-DDTHH:MM:SS.sssZ
info hostd[2101034] [Originator@6876 sub=Hostsvc.VmkVprobSource] VmkVprobSource::Post event: (vim.event.EventEx) {--> key = 6,
--> chainId = -1,
--> createdTime = "1970-01-01T00:00:00Z",
--> userName = "",
--> host = (vim.event.HostEventArgument) {
--> name = "ESXi-2.vmware.local",
--> host = 'vim.HostSystem:ha-host'
--> },
--> ds = (vim.event.DatastoreEventArgument) {
--> name = "Datastore-1",
--> datastore = 'vim.Datastore:'
--> },
--> eventTypeId = "esx.problem.storage.redundancy.degraded",
--> arguments = (vmodl.KeyAnyValue) [
--> (vmodl.KeyAnyValue) {
--> key = "1",
--> value = "naa.xxxxxxxxxxxxxx "
--> },
--> (vmodl.KeyAnyValue) {
--> key = "2",
--> value = "vmhba0:C0:T100:L7"
--> },
--> (vmodl.KeyAnyValue) {
--> key = "3",
--> value = (vim.event.DatastoreEventArgument) {
--> name = "Datastore-1",
--> datastore = 'vim.Datastore:'
--> }
--> }
--> ],
--> objectId = " ",
--> objectType = "vim.Datastore",
--> objectName = "Datastore",
--> }
VMkernel.log:
YYYY-MM-DDTHH:MM:SS.sssZ
cpu0:2118328)VMW_SATP_ALUA: satp_alua_issueCommandOnPath:843: Path (vmhba1:C0:T23:L1) command 0xa3 failed with transient error status Transient storage condition, suggest retry. sense data: 0x6 0x29 0x3. Waiting for 20 seconds $
YYYY-MM-DDTHH:MM:SS.sssZ
cpu21:2118328)VMW_SATP_ALUA: satp_alua_issueCommandOnPath:843: Path (vmhba1:C0:T22:L1) command 0xa3 failed with transient error status Transient storage condition, suggest retry. sense data: 0x6 0x29 0x3. Waiting for 20 seconds$
YYYY-MM-DDTHH:MM:SS.sssZ
cpu21:2118328)VMW_SATP_ALUA: satp_alua_issueCommandOnPath:843: Path (vmhba1:C0:T1:L12) command 0xa3 failed with transient error status Transient storage condition, suggest retry. sense data: 0x6 0x29 0x3. Waiting for 20 seconds $
YYYY-MM-DDTHH:MM:SS.sssZ
cpu21:2118328)VMW_SATP_ALUA: satp_alua_issueCommandOnPath:843: Path (vmhba1:C0:T2:L12) command 0xa3 failed with transient error status Transient storage condition, suggest retry. sense data: 0x6 0x29 0x3. Waiting for 20 seconds $
YYYY-MM-DDTHH:MM:SS.sssZ
cpu0:2097996)ScsiCore: 1823: Power-on Reset occurred on vmhba0:C0:T83:L31
YYYY-MM-DDTHH:MM:SS.sssZ
cpu2:2097996)ScsiCore: 1823: Power-on Reset occurred on vmhba0:C0:T84:L31
YYYY-MM-DDTHH:MM:SS.sssZ
cpu2:2097996)ScsiCore: 1823: Power-on Reset occurred on vmhba0:C0:T84:L32
YYYY-MM-DDTHH:MM:SS.sssZ
cpu2:2097996)ScsiCore: 1823: Power-on Reset occurred on vmhba0:C0:T84:L32
YYYY-MM-DDTHH:MM:SS.sssZ
cpu32:2097604)WARNING: nfnic: <1>: fnic_abort_cmd: 3819: Abort for cmd tag: 0x6fd in pending state
YYYY-MM-DDTHH:MM:SS.sssZ
cpu32:2097604)WARNING: nfnic: <1>: fnic_abort_cmd: 3957: Tag: 0x6fd ABTS_STATUS: 0x47, FLAGS: 0x273, STATE: 0x3
YYYY-MM-DDTHH:MM:SS.sssZ
cpu11:2136878)nfnic: <1>: INFO: fnic_abort_cmd: 3805: Abort cmd called for Tag: 0xa4 issued time: 40002 ms CMD_STATE: FNIC_IOREQ_CMD_PENDING CDB Opcode: 0x12 sc:0x45b9423a1e48 flags: 0x3 lun: 6 target: 0x1f0300
YYYY-MM-DDTHH:MM:SS.sssZ
cpu11:2136878)WARNING: nfnic: <1>: fnic_abort_cmd: 3819: Abort for cmd tag: 0x52a in pending state
YYYY-MM-DDTHH:MM:SS.sssZ
cpu35:2098248)nfnic: <1>: INFO: fnic_fcpio_icmnd_cmpl_handler: 1809: io_req: 0x45b953c07040 sc: 0x45b96c1d3888 tag: 0x52a CMD_FLAGS: 0x53 CMD_STATE: FNIC_IOREQ_ABTS_PENDING ABTS pending hdr status: FCPIO_ABORTED scsi_status: 0x0$
YYYY-MM-DDTHH:MM:SS.sssZ
cpu35:2098248)nfnic: <1>: INFO: fnic_fcpio_itmf_cmpl_handler: 2328: fcpio hdr status: FCPIO_ITMF_REJECTED
YYYY-MM-DDTHH:MM:SS.sssZ
cpu35:2098248)WARNING: nfnic: <1>: fnic_fcpio_itmf_cmpl_handler: 2360: abort reject received id: 0x52a
YYYY-MM-DDTHH:MM:SS.sssZ
cpu35:2098248)nfnic: <1>: INFO: fnic_fcpio_itmf_cmpl_handler: 2405: io_req: 0x45b953c07040 sc: 0x45b96c1d3888 id: 0x52a CMD_FLAGS: 0x73 CMD_STATE: FNIC_IOREQ_ABTS_PENDINGhdr status: FCPIO_ITMF_REJECTED ABTS cmpl received
YYYY-MM-DDTHH:MM:SS.sssZ
cpu11:2136878)WARNING: nfnic: <1>: fnic_abort_cmd: 3957: Tag: 0x52a ABTS_STATUS: 0x47, FLAGS: 0x273, STATE: 0x3
YYYY-MM-DDTHH:MM:SS.sssZ
cpu11:2136878)WARNING: nfnic: <1>: fnic_abort_cmd: 3819: Abort for cmd tag: 0x52a in pending state
YYYY-MM-DDTHH:MM:SS.sssZ
cpu11:2136878)WARNING: nfnic: <1>: fnic_abort_cmd: 3957: Tag: 0x52a ABTS_STATUS: 0x47, FLAGS: 0x273, STATE: 0x3
YYYY-MM-DDTHH:MM:SS.sssZ
cpu32:2097604)nfnic: <1>: INFO: fnic_abort_cmd: 3805: Abort cmd called for Tag: 0x6fd issued time: 120002 ms CMD_STATE: FNIC_IOREQ_CMD_PENDING CDB Opcode: 0x28 sc:0x45d96143ad88 flags: 0x3 lun: 13 target: 0x1f1900
YYYY-MM-DDTHH:MM:SS.sssZ
cpu32:2097604)WARNING: nfnic: <1>: fnic_abort_cmd: 3819: Abort for cmd tag: 0x6fd in pending state
YYYY-MM-DDTHH:MM:SS.sssZ
cpu32:2097604)WARNING: nfnic: <1>: fnic_abort_cmd: 3957: Tag: 0x6fd ABTS_STATUS: 0x47, FLAGS: 0x273, STATE: 0x3
YYYY-MM-DDTHH:MM:SS.sssZ
cpu11:2136878)nfnic: <1>: INFO: fnic_abort_cmd: 3805: Abort cmd called for Tag: 0xa4 issued time: 40002 ms CMD_STATE: FNIC_IOREQ_CMD_PENDING CDB Opcode: 0x12 sc:0x45b9423a1e48 flags: 0x3 lun: 6 target: 0x1f0300
YYYY-MM-DDTHH:MM:SS.sssZ cpu5:2098564)WARNING: NMP: nmpCompleteRetryForPath:394: Logical device "naa.xxxxxxxxxxxxxxxxxxxxxxxxxx": awaiting fast path state update before retrying failed command again...
YYYY-MM-DDTHH:MM:SS.sssZ cpu42:2098564)WARNING: NMP: nmp_DeviceRetryCommand:133: Device "naa.xxxxxxxxxxxxxxxxxxxxxxxxxx": awaiting fast path state update for failover with I/O blocked. No prior reservation exists on the device.
YYYY-MM-DDTHH:MM:SS.sssZ cpu46:2098564)WARNING: NMP: nmpCompleteRetryForPath:394: Logical device "naa.xxxxxxxxxxxxxxxxxxxxxxxxxx": awaiting fast path state update before retrying failed command again...
vobd.log:
YYYY-MM-DDTHH:MM:SS.sssZ
: [scsiCorrelator] 10232600491us: [vob.scsi.scsipath.por] Power-on Reset occurred on eui.xxxxxxxxxxxxxx
YYYY-MM-DDTHH:MM:SS.sssZ
: [scsiCorrelator] 10232603661us: [vob.scsi.scsipath.por] Power-on Reset occurred on eui.xxxxxxxxxxxxxx
YYYY-MM-DDTHH:MM:SS.sssZ
: [scsiCorrelator] 10232606294us: [vob.scsi.scsipath.por] Power-on Reset occurred on eui.xxxxxxxxxxxxxx
YYYY-MM-DDTHH:MM:SS.sssZ
: [scsiCorrelator] 10232610858us: [vob.scsi.scsipath.por] Power-on Reset occurred on eui.xxxxxxxxxxxxxx
YYYY-MM-DDTHH:MM:SS.sssZ
: [scsiCorrelator] 10180510538us: [vob.scsi.scsipath.remove] Remove path: vmhba0:C0:T104:L12
YYYY-MM-DDTHH:MM:SS.sssZ
: [scsiCorrelator] 10180593263us: [vob.scsi.scsipath.remove] Remove path: vmhba0:C0:T104:L13
YYYY-MM-DDTHH:MM:SS.sssZ
: [scsiCorrelator] 10180296296us: [vob.scsi.scsipath.remove] Remove path: vmhba0:C0:T104:L14
YYYY-MM-DDTHH:MM:SS.sssZ
: [scsiCorrelator] 10180529601us: [vob.scsi.scsipath.remove] Remove path: vmhba0:C0:T104:L10
YYYY-MM-DDTHH:MM:SS.sssZ
: [scsiCorrelator] 9820348926us: [vob.scsi.scsipath.pathstate.dead] scsiPath vmhba0:C0:T102:L7 changed state from on
YYYY-MM-DDTHH:MM:SS.sssZ
: [scsiCorrelator] 9820348928us: [vob.scsi.scsipath.pathstate.dead] scsiPath vmhba0:C0:T103:L7 changed state from on
YYYY-MM-DDTHH:MM:SS.sssZ
: [scsiCorrelator] 9820348930us: [vob.scsi.scsipath.pathstate.dead] scsiPath vmhba0:C0:T101:L7 changed state from on
YYYY-MM-DDTHH:MM:SS.sssZ
: [scsiCorrelator] 9820348934us: [vob.scsi.scsipath.pathstate.dead] scsiPath vmhba0:C0:T100:L7 changed state from on
YYYY-MM-DDTHH:MM:SS.sssZ
: [scsiCorrelator] 9880205647us: [vob.scsi.scsipath.pathstate.on] scsiPath vmhba0:C0:T113:L12 changed state from dead
YYYY-MM-DDTHH:MM:SS.sssZ
: [scsiCorrelator] 9880205760us: [vob.scsi.scsipath.pathstate.on] scsiPath vmhba0:C0:T113:L1 changed state from dead
YYYY-MM-DDTHH:MM:SS.sssZ
: [scsiCorrelator] 9880206243us: [vob.scsi.scsipath.pathstate.on] scsiPath vmhba0:C0:T113:L8 changed state from dead
YYYY-MM-DDTHH:MM:SS.sssZ
: [scsiCorrelator] 9880206365us: [vob.scsi.scsipath.pathstate.on] scsiPath vmhba0:C0:T113:L5 changed state from dead
YYYY-MM-DDTHH:MM:SS.sssZ
: [scsiCorrelator] 9821589032us: [esx.problem.storage.redundancy.degraded] Path redundancy to storage device eui.xxxxxxxxxxxxxx degraded. Path vmhba0:C0:T102:L7 is down. Affected datastores: "Datastore".
YYYY-MM-DDTHH:MM:SS.sssZ
: [scsiCorrelator] 9821592030us: [esx.problem.storage.redundancy.degraded] Path redundancy to storage device eui.xxxxxxxxxxxxxx degraded. Path vmhba0:C0:T103:L7 is down. Affected datastores: "Datastore".
YYYY-MM-DDTHH:MM:SS.sssZ
: [scsiCorrelator] 9821595008us: [esx.problem.storage.redundancy.degraded] Path redundancy to storage device eui.xxxxxxxxxxxxxx degraded. Path vmhba0:C0:T101:L7 is down. Affected datastores: "Datastore".
YYYY-MM-DDTHH:MM:SS.sssZ
: [scsiCorrelator] 9821597978us: [esx.problem.storage.redundancy.degraded] Path redundancy to storage device eui.xxxxxxxxxxxxxx degraded. Path vmhba0:C0:T100:L7 is down. Affected datastores: "Datastore".
VMware ESXi 7.x
VMware ESXi 8.x
1. Issues with the iSCSI network connectivity
2. Issues with the fabric
3. LUN is unpresented from the host or cluster from the array
4. Faulty SFP module
5. Faulty cables
1. Check and replace any faulty SFP/SFP+ (Small Form-factor Pluggable) modules.
2. Check and replace any faulty cables.
3. Check the fabric switch logs to see if you find any anomalies or errors.
4. Reboot the fabric switches. Check if you find any errors when it's coming back online.
5. Check if the SFPs are negotiating at the right speed. For example - If you find a 16 Gb SFP negotiating at 8 Gb, this must be investigated further or it indicates something wrong with that SFP.
NOTE: HBA driver and firmware must in the Broadcom Compatibility Guide. Please reach out to your server manufacturer or their website to get more information /best practice about the driver and firmware compatible with your HBA hardware.