Dead paths alarms in vCenter reported periodically for a single host and HBA to all storage array targets

Article ID: 425918

Products

VMware vSphere ESXi
VMware vSphere ESX 8.x
VMware vSphere ESXi 8.0

Issue/Introduction

A VCF administrator observes storage path redundancy alarms for a single host and a single HBA, affecting every storage array target that the HBA is zoned to. Events such as the following are logged by vobd:

YYYY-MM-DDThh:mm:ss.nnnZ In(14) vobd[2097956]:  [scsiCorrelator] 2499017659783us: [vob.scsi.scsipath.pathstate.deadver2] scsiPath vmhba64:C0:T1:L0 changed state from on (device ID: naa.60002ac0000000000000############)
YYYY-MM-DDThh:mm:ss.nnnZ In(14) vobd[2097956]:  [scsiCorrelator] 2499039020075us: [esx.problem.storage.redundancy.degraded] Path redundancy to storage device naa.60002ac0000000000000############ degraded. Path vmhba64:C0:T1:L0 is down. Affected datastores: "<Datastore1>".
YYYY-MM-DDThh:mm:ss.nnnZ In(14) vobd[2097956]:  [scsiCorrelator] 2499017660215us: [vob.scsi.scsipath.pathstate.deadver2] scsiPath vmhba64:C0:T1:L1 changed state from on (device ID: naa.60002ac0000000000000############)
YYYY-MM-DDThh:mm:ss.nnnZ In(14) vobd[2097956]:  [scsiCorrelator] 2499039020788us: [esx.problem.storage.redundancy.degraded] Path redundancy to storage device naa.60002ac0000000000000############ degraded. Path vmhba64:C0:T1:L1 is down. Affected datastores: "<Datastore2>".
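To confirm that the dead paths really are confined to one HBA, the vobd events can be grouped by adapter. The following is a minimal Python sketch, not a supported tool: the function name `dead_paths_by_hba` and the regular expression are illustrative assumptions derived only from the sample lines above, and the sample input reuses those lines with their placeholder timestamps.

```python
import re
from collections import defaultdict

# Assumed pattern, derived from the sample vobd lines in this article:
# "[vob.scsi.scsipath.pathstate.deadver2] scsiPath vmhba64:C0:T1:L0 ..."
DEAD_PATH = re.compile(
    r"\[vob\.scsi\.scsipath\.pathstate\.dead\w*\] scsiPath "
    r"(?P<hba>vmhba\d+):C\d+:T(?P<target>\d+):L(?P<lun>\d+)"
)


def dead_paths_by_hba(lines):
    """Return {hba: sorted list of (target, lun)} for dead-path events."""
    paths = defaultdict(set)
    for line in lines:
        match = DEAD_PATH.search(line)
        if match:
            paths[match["hba"]].add((int(match["target"]), int(match["lun"])))
    return {hba: sorted(found) for hba, found in paths.items()}


# Sample input copied from the events above (timestamps are placeholders).
sample = [
    "YYYY-MM-DDThh:mm:ss.nnnZ In(14) vobd[2097956]:  [scsiCorrelator] "
    "2499017659783us: [vob.scsi.scsipath.pathstate.deadver2] scsiPath "
    "vmhba64:C0:T1:L0 changed state from on "
    "(device ID: naa.60002ac0000000000000############)",
    "YYYY-MM-DDThh:mm:ss.nnnZ In(14) vobd[2097956]:  [scsiCorrelator] "
    "2499017660215us: [vob.scsi.scsipath.pathstate.deadver2] scsiPath "
    "vmhba64:C0:T1:L1 changed state from on "
    "(device ID: naa.60002ac0000000000000############)",
]

print(dead_paths_by_hba(sample))
```

If the result contains a single adapter key while other hosts and HBAs zoned to the same targets report nothing, that is consistent with the single-host, single-HBA pattern this article describes.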


Environment

ESXi (All Versions)
QLogic QEDF driver (as an example)

Cause

In /var/log/vmkernel.log, the driver repeatedly reports the remote ports (identified by their port IDs, P_ID) being disabled and then becoming active again shortly afterwards:

YYYY-MM-DDThh:mm:ss.nnnZ In(182) vmkernel: cpu76:2098400)qedf:vmhba64:qedfc_cleanup_rport:1110:Info: ST(RPORT): DISABLED C_ID[0x1]:P_ID[0x3cd40]:T_ID[1]
YYYY-MM-DDThh:mm:ss.nnnZ In(182) vmkernel: cpu76:2098400)qedf:vmhba64:qedfc_cleanup_rport:1110:Info: ST(RPORT): DISABLED C_ID[0x0]:P_ID[0x3cd80]:T_ID[0]
YYYY-MM-DDThh:mm:ss.nnnZ In(182) vmkernel: cpu76:2098400)qedf:vmhba64:qedfc_alloc_conn_id:803:Info: ACTIVE, C_ID[0x1]:P_ID[0x3cd40]:T_ID[1]
YYYY-MM-DDThh:mm:ss.nnnZ In(182) vmkernel: cpu76:2098400)qedf:vmhba64:qedfc_rport_event_handler:1228:Info: ST(RPORT): OFFLOADED C_ID[0x1]:P_ID[0x3cd40]:T_ID[1]
YYYY-MM-DDThh:mm:ss.nnnZ In(182) vmkernel: cpu76:2098400)qedf:vmhba64:qedfc_queue_scsi_scan:4083:Info: C_ID[0x1]:P_ID[0x3cd40]:T_ID[1]
YYYY-MM-DDThh:mm:ss.nnnZ In(182) vmkernel: cpu76:2098400)qedf:vmhba64:qedfc_alloc_conn_id:803:Info: ACTIVE, C_ID[0x0]:P_ID[0x3cd80]:T_ID[0]
YYYY-MM-DDThh:mm:ss.nnnZ In(182) vmkernel: cpu76:2098400)qedf:vmhba64:qedfc_rport_event_handler:1228:Info: ST(RPORT): OFFLOADED C_ID[0x0]:P_ID[0x3cd80]:T_ID[0]
YYYY-MM-DDThh:mm:ss.nnnZ In(182) vmkernel: cpu76:2098400)qedf:vmhba64:qedfc_queue_scsi_scan:4083:Info: C_ID[0x0]:P_ID[0x3cd80]:T_ID[0]
YYYY-MM-DDThh:mm:ss.nnnZ In(182) vmkernel: cpu76:2098400)qedf:vmhba64:qedfc_cleanup_rport:1110:Info: ST(RPORT): DISABLED C_ID[0x1]:P_ID[0x3cd40]:T_ID[1]
YYYY-MM-DDThh:mm:ss.nnnZ In(182) vmkernel: cpu76:2098400)qedf:vmhba64:qedfc_cleanup_rport:1110:Info: ST(RPORT): DISABLED C_ID[0x0]:P_ID[0x3cd80]:T_ID[0]
YYYY-MM-DDThh:mm:ss.nnnZ In(182) vmkernel: cpu76:2098400)qedf:vmhba64:qedfc_alloc_conn_id:803:Info: ACTIVE, C_ID[0x0]:P_ID[0x3cd80]:T_ID[0]
YYYY-MM-DDThh:mm:ss.nnnZ In(182) vmkernel: cpu76:2098400)qedf:vmhba64:qedfc_rport_event_handler:1228:Info: ST(RPORT): OFFLOADED C_ID[0x0]:P_ID[0x3cd80]:T_ID[0]
YYYY-MM-DDThh:mm:ss.nnnZ In(182) vmkernel: cpu76:2098400)qedf:vmhba64:qedfc_queue_scsi_scan:4083:Info: C_ID[0x0]:P_ID[0x3cd80]:T_ID[0]
YYYY-MM-DDThh:mm:ss.nnnZ In(182) vmkernel: cpu76:2098400)qedf:vmhba64:qedfc_alloc_conn_id:803:Info: ACTIVE, C_ID[0x1]:P_ID[0x3cd40]:T_ID[1]
YYYY-MM-DDThh:mm:ss.nnnZ In(182) vmkernel: cpu76:2098400)qedf:vmhba64:qedfc_rport_event_handler:1228:Info: ST(RPORT): OFFLOADED C_ID[0x1]:P_ID[0x3cd40]:T_ID[1]
YYYY-MM-DDThh:mm:ss.nnnZ In(182) vmkernel: cpu76:2098400)qedf:vmhba64:qedfc_queue_scsi_scan:4083:Info: C_ID[0x1]:P_ID[0x3cd40]:T_ID[1]
YYYY-MM-DDThh:mm:ss.nnnZ In(182) vmkernel: cpu76:2098400)qedf:vmhba64:qedfc_cleanup_rport:1110:Info: ST(RPORT): DISABLED C_ID[0x1]:P_ID[0x3cd40]:T_ID[1]
YYYY-MM-DDThh:mm:ss.nnnZ In(182) vmkernel: cpu76:2098400)qedf:vmhba64:qedfc_cleanup_rport:1110:Info: ST(RPORT): DISABLED C_ID[0x0]:P_ID[0x3cd80]:T_ID[0]
YYYY-MM-DDThh:mm:ss.nnnZ In(182) vmkernel: cpu76:2098400)qedf:vmhba64:qedfc_alloc_conn_id:803:Info: ACTIVE, C_ID[0x1]:P_ID[0x3cd40]:T_ID[1]
YYYY-MM-DDThh:mm:ss.nnnZ In(182) vmkernel: cpu76:2098400)qedf:vmhba64:qedfc_rport_event_handler:1228:Info: ST(RPORT): OFFLOADED C_ID[0x1]:P_ID[0x3cd40]:T_ID[1]
YYYY-MM-DDThh:mm:ss.nnnZ In(182) vmkernel: cpu76:2098400)qedf:vmhba64:qedfc_queue_scsi_scan:4083:Info: C_ID[0x1]:P_ID[0x3cd40]:T_ID[1]
YYYY-MM-DDThh:mm:ss.nnnZ In(182) vmkernel: cpu76:2098400)qedf:vmhba64:qedfc_alloc_conn_id:803:Info: ACTIVE, C_ID[0x0]:P_ID[0x3cd80]:T_ID[0]
YYYY-MM-DDThh:mm:ss.nnnZ In(182) vmkernel: cpu76:2098400)qedf:vmhba64:qedfc_rport_event_handler:1228:Info: ST(RPORT): OFFLOADED C_ID[0x0]:P_ID[0x3cd80]:T_ID[0]
YYYY-MM-DDThh:mm:ss.nnnZ In(182) vmkernel: cpu76:2098400)qedf:vmhba64:qedfc_queue_scsi_scan:4083:Info: C_ID[0x0]:P_ID[0x3cd80]:T_ID[0]
YYYY-MM-DDThh:mm:ss.nnnZ In(182) vmkernel: cpu76:2098400)qedf:vmhba64:qedfc_cleanup_rport:1110:Info: ST(RPORT): DISABLED C_ID[0x1]:P_ID[0x3cd40]:T_ID[1]
YYYY-MM-DDThh:mm:ss.nnnZ In(182) vmkernel: cpu76:2098400)qedf:vmhba64:qedfc_cleanup_rport:1110:Info: ST(RPORT): DISABLED C_ID[0x0]:P_ID[0x3cd80]:T_ID[0]
YYYY-MM-DDThh:mm:ss.nnnZ In(182) vmkernel: cpu76:2098400)qedf:vmhba64:qedfc_alloc_conn_id:803:Info: ACTIVE, C_ID[0x1]:P_ID[0x3cd40]:T_ID[1]
YYYY-MM-DDThh:mm:ss.nnnZ In(182) vmkernel: cpu76:2098400)qedf:vmhba64:qedfc_rport_event_handler:1228:Info: ST(RPORT): OFFLOADED C_ID[0x1]:P_ID[0x3cd40]:T_ID[1]
YYYY-MM-DDThh:mm:ss.nnnZ In(182) vmkernel: cpu76:2098400)qedf:vmhba64:qedfc_queue_scsi_scan:4083:Info: C_ID[0x1]:P_ID[0x3cd40]:T_ID[1]
YYYY-MM-DDThh:mm:ss.nnnZ In(182) vmkernel: cpu76:2098400)qedf:vmhba64:qedfc_alloc_conn_id:803:Info: ACTIVE, C_ID[0x0]:P_ID[0x3cd80]:T_ID[0]
YYYY-MM-DDThh:mm:ss.nnnZ In(182) vmkernel: cpu76:2098400)qedf:vmhba64:qedfc_rport_event_handler:1228:Info: ST(RPORT): OFFLOADED C_ID[0x0]:P_ID[0x3cd80]:T_ID[0]
YYYY-MM-DDThh:mm:ss.nnnZ In(182) vmkernel: cpu76:2098400)qedf:vmhba64:qedfc_queue_scsi_scan:4083:Info: C_ID[0x0]:P_ID[0x3cd80]:T_ID[0]
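Counting the state transitions per port ID makes this flapping pattern easier to see than reading the raw log. The sketch below is an illustration only: `count_rport_events` and its regular expression are hypothetical, built from the qedf message layout shown in the sample lines above, and other driver versions may format these messages differently.

```python
import re
from collections import Counter

# Assumed layout, taken from the sample lines in this article:
# "...qedf:vmhba64:<function>:<line>:Info: [ST(RPORT): ]<STATE>[,] C_ID[..]:P_ID[..]"
RPORT_EVENT = re.compile(
    r"qedf:(?P<hba>vmhba\d+):\w+:\d+:Info: "
    r"(?:ST\(RPORT\): )?(?P<state>DISABLED|ACTIVE|OFFLOADED|DEV_LOSS),? "
    r"C_ID\[\w+\]:P_ID\[(?P<port_id>\w+)\]"
)


def count_rport_events(lines):
    """Count remote-port state events keyed by (hba, port_id, state)."""
    counts = Counter()
    for line in lines:
        match = RPORT_EVENT.search(line)
        if match:
            counts[(match["hba"], match["port_id"], match["state"])] += 1
    return counts


# Sample input copied from the log excerpt above (timestamps are placeholders).
prefix = "YYYY-MM-DDThh:mm:ss.nnnZ In(182) vmkernel: cpu76:2098400)qedf:vmhba64:"
sample = [
    prefix + "qedfc_cleanup_rport:1110:Info: ST(RPORT): DISABLED C_ID[0x1]:P_ID[0x3cd40]:T_ID[1]",
    prefix + "qedfc_cleanup_rport:1110:Info: ST(RPORT): DISABLED C_ID[0x0]:P_ID[0x3cd80]:T_ID[0]",
    prefix + "qedfc_alloc_conn_id:803:Info: ACTIVE, C_ID[0x1]:P_ID[0x3cd40]:T_ID[1]",
    prefix + "qedfc_rport_event_handler:1228:Info: ST(RPORT): OFFLOADED C_ID[0x1]:P_ID[0x3cd40]:T_ID[1]",
    prefix + "qedfc_queue_scsi_scan:4083:Info: C_ID[0x1]:P_ID[0x3cd40]:T_ID[1]",
]

for (hba, port_id, state), count in sorted(count_rport_events(sample).items()):
    print(hba, port_id, state, count)
```

A high DISABLED count for every port ID behind one vmhba, with none for the other adapters, points at that adapter's link rather than at a single storage target.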

Eventually, the device loss (DEV_LOSS) timer, which is 10 seconds for the QLogic QEDF driver, expires. The affected paths are then declared dead until the remote ports become ACTIVE again:

YYYY-MM-DDThh:mm:ss.nnnZ In(182) vmkernel: cpu72:2098400)qedf:vmhba64:qedfc_cleanup_rport:1110:Info: ST(RPORT): DISABLED C_ID[0x1]:P_ID[0x3cd40]:T_ID[1]
YYYY-MM-DDThh:mm:ss.nnnZ In(182) vmkernel: cpu72:2098400)qedf:vmhba64:qedfc_cleanup_rport:1110:Info: ST(RPORT): DISABLED C_ID[0x0]:P_ID[0x3cd80]:T_ID[0]

YYYY-MM-DDThh:mm:ss.nnnZ In(182) vmkernel: cpu28:2098097)qedf:vmhba64:qedfc_device_down:318:Info: ST(RPORT): DEV_LOSS C_ID[0x1]:P_ID[0x3cd40]:T_ID[1], Status = Success
YYYY-MM-DDThh:mm:ss.nnnZ In(182) vmkernel: cpu28:2098097)qedf:vmhba64:qedfc_device_down:318:Info: ST(RPORT): DEV_LOSS C_ID[0x0]:P_ID[0x3cd80]:T_ID[0], Status = Success

YYYY-MM-DDThh:mm:ss.nnnZ In(182) vmkernel: cpu72:2098400)qedf:vmhba64:qedfc_alloc_conn_id:803:Info: ACTIVE, C_ID[0x0]:P_ID[0x3cd80]:T_ID[0]
YYYY-MM-DDThh:mm:ss.nnnZ In(182) vmkernel: cpu72:2098400)qedf:vmhba64:qedfc_alloc_conn_id:803:Info: ACTIVE, C_ID[0x1]:P_ID[0x3cd40]:T_ID[1]
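To see which flaps lasted long enough to reach the 10-second DEV_LOSS threshold, the time between each DISABLED event and the next ACTIVE event for the same port ID can be measured. This is a rough sketch under stated assumptions: the helper `outages` is hypothetical, the timestamps in the sample are invented because the excerpts above have theirs redacted, and the comparison against 10 seconds only approximates the driver's own timer.

```python
import re
from datetime import datetime

DEV_LOSS_SECONDS = 10  # value stated in this article for the QEDF driver

# Assumed layout, taken from the sample lines in this article.
EVENT = re.compile(
    r"^(?P<ts>\S+) .*qedf:(?P<hba>vmhba\d+):\w+:\d+:Info: "
    r"(?:ST\(RPORT\): )?(?P<state>DISABLED|ACTIVE),? "
    r"C_ID\[\w+\]:P_ID\[(?P<port_id>\w+)\]"
)


def outages(lines):
    """Yield (hba, port_id, seconds) for each DISABLED -> ACTIVE pair."""
    disabled_at = {}
    for line in lines:
        match = EVENT.search(line)
        if not match:
            continue
        stamp = datetime.strptime(match["ts"], "%Y-%m-%dT%H:%M:%S.%fZ")
        key = (match["hba"], match["port_id"])
        if match["state"] == "DISABLED":
            # Keep the first DISABLED time if several arrive before ACTIVE.
            disabled_at.setdefault(key, stamp)
        elif key in disabled_at:
            yield (*key, (stamp - disabled_at.pop(key)).total_seconds())


# Hypothetical timestamps, for illustration only.
tail = "In(182) vmkernel: cpu72:2098400)qedf:vmhba64:"
sample = [
    "2024-01-01T10:00:00.000Z " + tail + "qedfc_cleanup_rport:1110:Info: ST(RPORT): DISABLED C_ID[0x1]:P_ID[0x3cd40]:T_ID[1]",
    "2024-01-01T10:00:02.500Z " + tail + "qedfc_alloc_conn_id:803:Info: ACTIVE, C_ID[0x1]:P_ID[0x3cd40]:T_ID[1]",
    "2024-01-01T10:05:00.000Z " + tail + "qedfc_cleanup_rport:1110:Info: ST(RPORT): DISABLED C_ID[0x0]:P_ID[0x3cd80]:T_ID[0]",
    "2024-01-01T10:05:12.000Z " + tail + "qedfc_alloc_conn_id:803:Info: ACTIVE, C_ID[0x0]:P_ID[0x3cd80]:T_ID[0]",
]

for hba, port_id, seconds in outages(sample):
    flag = "likely DEV_LOSS" if seconds >= DEV_LOSS_SECONDS else "short flap"
    print(hba, port_id, seconds, flag)
```

Gaps shorter than the threshold correspond to the repeated DISABLED/ACTIVE cycles in the first excerpt. Gaps at or above it should line up with the DEV_LOSS messages and the dead path alarms.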

Resolution

The QLogic QEDF HBA driver uses a keep-alive function to proactively mark paths as down/dead when the DEV_LOSS timer crosses the 10-second threshold.

When this behavior repeats, especially for a single HBA or a single port on that HBA, the usual culprit is a bad cable or SFP, or low light levels on the physical connection between the HBA and the switch port or between the storage array port and the switch port. Either way, this is a physical-layer issue rather than an ESXi software or HBA driver issue. Engage your storage or storage switch team to determine which piece of layer 1 hardware should be replaced.

Additional Information

Japanese version of this KB: https://knowledge.broadcom.com/external/article/431125