Fabric logins fail to complete to specific targets and/or all commands never complete (and aborts fail) on specific storage paths after a Cisco MDS Supervisor switchover

Article ID: 420218

Products

VMware vSphere ESXi
VMware vSphere ESX 8.x
VMware vSphere ESXi 8.0
VMware vSphere ESX 7.x

Issue/Introduction

A VCF Administrator may observe one or more of the following symptoms:

1. HBA(s) with dead paths that do not recover even after a host reboot or HBA reset
2. HBA(s) reporting that they are in a link-down state
3. All SCSI commands on specific path(s) time out, and all aborts for those commands also time out
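On an affected ESXi host, symptoms 1 and 2 can be confirmed with standard esxcli commands (a diagnostic sketch; exact output fields vary by driver and release):

```shell
# Symptom 1: list storage paths and show context around any in the dead state
esxcli storage core path list | grep -B 6 -i "state: dead"

# Symptom 2: show Fibre Channel adapter status; an HBA that cannot complete
# FLOGI typically shows a link-down port state here
esxcli storage san fc list
```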

Environment

ESXi (Any Version)
Cisco MDS switches

Cause

There are three possible scenarios of symptoms observed from ESXi hosts:

1. The ESXi HBA(s) are still logged in to the array target(s), but IO never completes (every SCSI command times out) on specific paths. The HBA sends commands to an affected WWPN, those commands time out, aborts are then sent, and those aborts also time out. The repeated abort timeouts, on top of the command timeouts, can eventually lead the HBA driver to perform a firmware reset of the HBA (observed with Cisco's NFNIC driver)
2. The target has already been logged out (ESXi host reboot or HBA reset), and FLOGIs from the HBA to the Cisco MDS switch fail, resulting in the physical link being reported as "link-down"
3. The target has already been logged out (ESXi host reboot or HBA reset), and PLOGIs fail to register successfully, causing the PLOGI process to loop indefinitely

Example Scenario 1:

Since the Round Robin PSP (the default) is in use, IO is spread across all active working paths. As a result, only the affected paths/target WWPNs see a SCSI command timeout followed by an abort timeout for that same command, for every SCSI command sent:

VM issues abort after command timeout:

2025-11-15T21:35:02.874Z In(182) vmkernel: cpu215:2102358)PVSCSI: 2769: scsi0:0: SCSI ABORT ctx=0x319

Abort issued to target ID 0x6504e0 (this translates to a WWPN) for SCSI Command (sc) 0x45db01babf00:

2025-11-15T21:35:02.874Z In(182) vmkernel: cpu215:2102358)nfnic: <1>: INFO: fnic_taskMgmt: 2196: TaskMgmt abort sc->cdb: 0x2a sllid: 0xffffffffffffffff
2025-11-15T21:35:02.874Z In(182) vmkernel: cpu215:2102358)nfnic: <1>: INFO: fnic_abort_cmd: 3874: Abort cmd called for Tag: 0x312  issued time: 180417 ms CMD_STATE: FNIC_IOREQ_CMD_PENDING CDB Opcode: 0x2a  sc:0x45db01babf00 flags: 0x3 lun: 252 target: 0x6504e0

Abort timeout:

2025-11-15T21:35:02.874Z Wa(180) vmkwarning: cpu215:2102358)WARNING: nfnic: <1>: fnic_abort_cmd: 3889: Abort for cmd tag: 0x312 in pending state
2025-11-15T21:35:04.881Z In(182) vmkernel: cpu130:2099175)nfnic: <1>: INFO: fnic_fcpio_icmnd_cmpl_handler: 1870: io_req: 0x45bb28800270 sc: 0x45db01babf00 tag: 0x312 CMD_FLAGS: 0x53 CMD_STATE: FNIC_IOREQ_ABTS_PENDING ABTS pending hdr status: FCPIO_ABORTED scsi_status: 0x$
2025-11-15T21:35:04.881Z In(182) vmkernel: cpu130:2099175)nfnic: <1>: INFO: fnic_fcpio_itmf_cmpl_handler: 2396: fcpio hdr status: FCPIO_TIMEOUT

This will happen for every single command sent by that HBA to that array target. Depending on the HBA driver, the driver may handle this situation by issuing a firmware reset to the HBA over and over again. In this scenario, Cisco's NFNIC driver issues the firmware reset once enough aborts have timed out within a threshold:

2025-11-15T11:58:36.393Z In(182) vmkernel: cpu12:2101496)nfnic: <1>: INFO: fnic_host_reset: 4839: fnic_reset fnic[1]
2025-11-15T11:58:36.393Z In(182) vmkernel: cpu12:2101496)nfnic: <1>: INFO: fnic_reset: 4807: fnic_reset fnic[1]
2025-11-15T12:39:41.920Z In(182) vmkernel: cpu36:4916745)nfnic: <1>: INFO: fnic_host_reset: 4839: fnic_reset fnic[1]
2025-11-15T12:39:41.920Z In(182) vmkernel: cpu36:4916745)nfnic: <1>: INFO: fnic_reset: 4807: fnic_reset fnic[1]
2025-11-15T13:10:30.086Z In(182) vmkernel: cpu62:2115538)nfnic: <1>: INFO: fnic_host_reset: 4839: fnic_reset fnic[1]
2025-11-15T13:10:30.086Z In(182) vmkernel: cpu62:2115538)nfnic: <1>: INFO: fnic_reset: 4807: fnic_reset fnic[1]
2025-11-15T13:35:23.537Z In(182) vmkernel: cpu50:4876243)nfnic: <1>: INFO: fnic_host_reset: 4839: fnic_reset fnic[1]
2025-11-15T13:35:23.537Z In(182) vmkernel: cpu50:4876243)nfnic: <1>: INFO: fnic_reset: 4807: fnic_reset fnic[1]
2025-11-15T15:47:45.317Z In(182) vmkernel: cpu77:4876243)nfnic: <1>: INFO: fnic_host_reset: 4839: fnic_reset fnic[1]
2025-11-15T15:47:45.317Z In(182) vmkernel: cpu77:4876243)nfnic: <1>: INFO: fnic_reset: 4807: fnic_reset fnic[1]
2025-11-15T17:03:04.911Z In(182) vmkernel: cpu165:4876243)nfnic: <1>: INFO: fnic_host_reset: 4839: fnic_reset fnic[1]
2025-11-15T17:03:04.911Z In(182) vmkernel: cpu165:4876243)nfnic: <1>: INFO: fnic_reset: 4807: fnic_reset fnic[1]
2025-11-15T18:23:02.308Z In(182) vmkernel: cpu246:4876243)nfnic: <1>: INFO: fnic_host_reset: 4839: fnic_reset fnic[1]
2025-11-15T18:23:02.308Z In(182) vmkernel: cpu246:4876243)nfnic: <1>: INFO: fnic_reset: 4807: fnic_reset fnic[1]
2025-11-15T19:10:32.069Z In(182) vmkernel: cpu59:2101955)nfnic: <1>: INFO: fnic_host_reset: 4839: fnic_reset fnic[1]
2025-11-15T19:10:32.069Z In(182) vmkernel: cpu59:2101955)nfnic: <1>: INFO: fnic_reset: 4807: fnic_reset fnic[1]
2025-11-15T19:58:35.703Z In(182) vmkernel: cpu166:3182159)nfnic: <1>: INFO: fnic_host_reset: 4839: fnic_reset fnic[1]
2025-11-15T19:58:35.703Z In(182) vmkernel: cpu166:3182159)nfnic: <1>: INFO: fnic_reset: 4807: fnic_reset fnic[1]
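A repeating reset pattern like the one above can be surfaced with a small log pipeline (a minimal sketch; on a live host the input would be the vmkernel log files, and the two embedded sample lines are copied from the excerpt above to keep the pipeline self-contained):

```shell
# Bucket nfnic firmware resets by hour to spot a repeating reset pattern.
# On a live ESXi host, replace the printf with: cat /var/run/log/vmkernel*
printf '%s\n' \
  '2025-11-15T11:58:36.393Z In(182) vmkernel: cpu12:2101496)nfnic: <1>: INFO: fnic_host_reset: 4839: fnic_reset fnic[1]' \
  '2025-11-15T12:39:41.920Z In(182) vmkernel: cpu36:4916745)nfnic: <1>: INFO: fnic_host_reset: 4839: fnic_reset fnic[1]' \
  | grep 'fnic_host_reset' | cut -c1-13 | sort | uniq -c
```

Several resets within a few hours against the same fnic instance is a strong match for this scenario.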


Example Scenario 2: 

In this scenario, the ESXi host has already been rebooted or the fabric connection has been reset (causing a link-down/link-up event for the HBA). When the link-up event occurs, the HBA is unable to complete a Fabric Login (FLOGI):

Link up for the HBA:

2025-11-09T05:38:49.896Z In(182) vmkernel: cpu31:2098285)nfnic: <2>: INFO: fnic_handle_link: 1001: link status 1 down cnt 3
2025-11-09T05:38:49.896Z In(182) vmkernel: cpu31:2098285)nfnic: <2>: INFO: fnic_handle_link: 1003: old status 0 old down cnt 3
2025-11-09T05:38:49.896Z In(182) vmkernel: cpu31:2098285)nfnic: <2>: INFO: fnic_handle_link: 1068: fnic2: link up
2025-11-09T05:38:49.896Z In(182) vmkernel: cpu31:2098285)nfnic: <2>: INFO: fnic_fdls_link_status_change: 98: fnic2: FDLS link status change link up:1, usefip:0

HBA driver sends FLOGI request:

2025-11-09T05:38:49.896Z In(182) vmkernel: cpu31:2098285)nfnic: <2>: INFO: fdls_send_fabric_flogi: 982: Sending fabric FLOGI for wwpn:0x20000025b5###### Setting FLOGI MFS to 2048
2025-11-09T05:38:49.896Z In(182) vmkernel: cpu31:2098285)nfnic: <2>: INFO: fnic_fdls_link_status_change: 113: speed: link_speed: 20

FLOGI is ignored by the Cisco MDS switch, leaving the HBA effectively in a link-down state:

2025-11-09T05:38:53.898Z In(182) vmkernel: cpu5:2098283)nfnic: <2>: INFO: fdls_send_fabric_abts: 955: FDLS sending fabric abts. iport->fabric.state: 4
2025-11-09T05:38:53.899Z In(182) vmkernel: cpu23:2098286)nfnic: <2>: INFO: fdls_process_fabric_abts_rsp: 3196: Received abts rsp BA_ACC for fabric_state: 4 OX_ID: 0x110
2025-11-09T05:38:53.899Z In(182) vmkernel: cpu23:2098286)nfnic: <2>: INFO: fdls_send_fabric_flogi: 982: Sending fabric FLOGI for wwpn:0x20000025b5###### Setting FLOGI MFS to 2048
2025-11-09T05:38:57.899Z In(182) vmkernel: cpu5:2098283)nfnic: <2>: INFO: fdls_send_fabric_abts: 955: FDLS sending fabric abts. iport->fabric.state: 4
2025-11-09T05:38:57.899Z In(182) vmkernel: cpu23:2098286)nfnic: <2>: INFO: fdls_process_fabric_abts_rsp: 3196: Received abts rsp BA_ACC for fabric_state: 4 OX_ID: 0x110
2025-11-09T05:38:57.899Z In(182) vmkernel: cpu23:2098286)nfnic: <2>: INFO: fdls_send_fabric_flogi: 982: Sending fabric FLOGI for wwpn:0x20000025b5###### Setting FLOGI MFS to 2048
2025-11-09T05:39:01.902Z In(182) vmkernel: cpu5:2098283)nfnic: <2>: INFO: fdls_send_fabric_abts: 955: FDLS sending fabric abts. iport->fabric.state: 4
2025-11-09T05:39:01.902Z In(182) vmkernel: cpu23:2098286)nfnic: <2>: INFO: fdls_process_fabric_abts_rsp: 3196: Received abts rsp BA_ACC for fabric_state: 4 OX_ID: 0x110
2025-11-09T05:39:01.902Z In(182) vmkernel: cpu23:2098286)nfnic: <2>: INFO: fdls_send_fabric_flogi: 982: Sending fabric FLOGI for wwpn:0x20000025b5###### Setting FLOGI MFS to 2048
2025-11-09T05:39:05.903Z In(182) vmkernel: cpu5:2098283)nfnic: <2>: INFO: fdls_send_fabric_abts: 955: FDLS sending fabric abts. iport->fabric.state: 4
2025-11-09T05:39:05.903Z In(182) vmkernel: cpu23:2098286)nfnic: <2>: INFO: fdls_process_fabric_abts_rsp: 3196: Received abts rsp BA_ACC for fabric_state: 4 OX_ID: 0x110
2025-11-09T05:39:05.903Z In(182) vmkernel: cpu23:2098286)nfnic: <2>: INFO: fdls_send_fabric_flogi: 982: Sending fabric FLOGI for wwpn:0x20000025b5###### Setting FLOGI MFS to 2048
2025-11-09T05:39:09.905Z In(182) vmkernel: cpu5:2098283)nfnic: <2>: INFO: fdls_send_fabric_abts: 955: FDLS sending fabric abts. iport->fabric.state: 4
2025-11-09T05:39:09.906Z In(182) vmkernel: cpu23:2098286)nfnic: <2>: INFO: fdls_process_fabric_abts_rsp: 3196: Received abts rsp BA_ACC for fabric_state: 4 OX_ID: 0x110
2025-11-09T05:39:09.906Z In(182) vmkernel: cpu23:2098286)nfnic: <2>: INFO: fdls_send_fabric_flogi: 982: Sending fabric FLOGI for wwpn:0x20000025b5###### Setting FLOGI MFS to 2048
2025-11-09T05:39:13.908Z In(182) vmkernel: cpu5:2098283)nfnic: <2>: INFO: fdls_send_fabric_abts: 955: FDLS sending fabric abts. iport->fabric.state: 4
2025-11-09T05:39:13.908Z In(182) vmkernel: cpu23:2098286)nfnic: <2>: INFO: fdls_process_fabric_abts_rsp: 3196: Received abts rsp BA_ACC for fabric_state: 4 OX_ID: 0x110
2025-11-09T05:39:13.908Z In(182) vmkernel: cpu23:2098286)nfnic: <2>: INFO: fdls_send_fabric_flogi: 982: Sending fabric FLOGI for wwpn:0x20000025b5###### Setting FLOGI MFS to 2048
2025-11-09T05:39:17.911Z In(182) vmkernel: cpu5:2098283)nfnic: <2>: INFO: fdls_send_fabric_abts: 955: FDLS sending fabric abts. iport->fabric.state: 4
2025-11-09T05:39:17.911Z In(182) vmkernel: cpu23:2098286)nfnic: <2>: INFO: fdls_process_fabric_abts_rsp: 3196: Received abts rsp BA_ACC for fabric_state: 4 OX_ID: 0x110
2025-11-09T05:39:17.911Z In(182) vmkernel: cpu23:2098286)nfnic: <2>: INFO: fdls_send_fabric_flogi: 982: Sending fabric FLOGI for wwpn:0x20000025b5###### Setting FLOGI MFS to 2048
2025-11-09T05:39:21.912Z In(182) vmkernel: cpu5:2098283)nfnic: <2>: INFO: fdls_send_fabric_abts: 955: FDLS sending fabric abts. iport->fabric.state: 4
2025-11-09T05:39:21.912Z In(182) vmkernel: cpu23:2098286)nfnic: <2>: INFO: fdls_process_fabric_abts_rsp: 3196: Received abts rsp BA_ACC for fabric_state: 4 OX_ID: 0x110
2025-11-09T05:39:21.912Z In(182) vmkernel: cpu23:2098286)nfnic: <2>: INFO: fdls_error_fabric_disc: 2687: FDLS discovery error from 4 state
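Whether the switch ever accepted the FLOGI can be checked from the MDS side (a sketch; the port-channel number is a placeholder). An HBA stuck in the FLOGI loop above would be absent from the output:

```shell
! On the Cisco MDS: list fabric logins on the port channel the HBA connects
! through; the affected HBA's WWPN will be missing in this scenario
show flogi database interface port-channel 1
```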


Example Scenario 3: 

In this scenario, the ESXi host has already been rebooted or the fabric connection has been reset (causing a link-down/link-up event for the HBA). When the link-up event occurs, the HBA successfully performs a fabric login (FLOGI) to the Cisco MDS switch; however, the subsequent Port Login (PLOGI) registration process times out and then loops indefinitely:

Link Up Event:

2025-09-14T03:25:24.438Z cpu30:2098207)nfnic: <1>: INFO: fnic_handle_link: 1001: link status 1 down cnt 0
2025-09-14T03:25:24.438Z cpu30:2098207)nfnic: <1>: INFO: fnic_handle_link: 1003: old status 0 old down cnt 0
2025-09-14T03:25:24.438Z cpu30:2098207)nfnic: <1>: INFO: fnic_handle_link: 1068: fnic1: link up
2025-09-14T03:25:24.438Z cpu30:2098207)nfnic: <1>: INFO: fnic_fdls_link_status_change: 98: fnic1: FDLS link status change link up:1, usefip:0

Fabric Login is sent and completes:

2025-09-14T03:25:24.440Z cpu30:2098207)nfnic: <1>: INFO: fdls_send_fabric_flogi: 982: Sending fabric FLOGI for wwpn:0x20000025b5###### Setting FLOGI MFS to 2048
2025-09-14T03:25:24.440Z cpu30:2098207)nfnic: <1>: INFO: fnic_fdls_link_status_change: 113: speed: link_speed: 20
2025-09-14T03:25:24.468Z cpu77:2098208)nfnic: <1>: INFO: fdls_process_flogi_rsp: 450: FLOGI response accepted: fcid:0x10b18 fabric_wwpn:0x240a003a########
2025-09-14T03:25:24.469Z cpu62:2098082)nfnic: <1>: INFO: fnic_fcpio_flogi_reg_cmpl_handler: 1203: FLOGI reg succeeded
2025-09-14T03:25:24.469Z cpu62:2098082)nfnic: <1>: INFO: fnic_fcpio_flogi_reg_cmpl_handler: 1228: FLOGI REG done. Waking up
2025-09-14T03:25:24.469Z cpu77:2098208)nfnic: <1>: INFO: fnic_fdls_register_portid: 1998: FLOGI registration success

PLOGI is sent:

2025-09-14T03:25:24.469Z cpu77:2098208)nfnic: <1>: INFO: fdls_send_fabric_plogi: 1007: Sending fabric PLOGI for wwpn:0x20000025b5###### Setting PLOGI MFS to 2048
2025-09-14T03:25:24.469Z cpu77:2098208)nfnic: <1>: INFO: fdls_send_fdmi_plogi: 1022: FDLS send FDMI PLOGI 0x43139b0013d0
2025-09-14T03:25:24.469Z cpu77:2098208)nfnic: <1>: INFO: fdls_process_fabric_plogi_rsp: 560: FDLS process fabric PLOGI response FC_LS_ACC
2025-09-14T03:25:24.469Z cpu77:2098208)nfnic: <1>: INFO: fdls_send_rpn_id: 1058: Sending fabric RPNID for fcid:0x10b18
2025-09-14T03:25:24.469Z cpu77:2098208)nfnic: <1>: INFO: fdls_process_rpn_id_rsp: 2010: FDLS process RPN ID response: 0x0280
2025-09-14T03:25:24.469Z cpu77:2098208)nfnic: <1>: INFO: fdls_send_register_fc4_types: 1210: FDLS sending FC4 Types for fcid:0x10b18

Registration fails: the FC4-types registration hits a 4-second timeout, the abort of that request then hits a 20-second timeout, and the driver restarts the PLOGI process. This repeats indefinitely:

2025-09-14T03:25:48.927Z Wa(180) vmkwarning: cpu25:2098205)WARNING: nfnic: <1>: fdls_fabric_timer_callback: 2852: ABTS timed out for FDLS_STATE_REGISTER_FC4_TYPES. Check fabric controller. Starting PLOGI. 0x43139b0013d0
2025-09-14T03:25:48.927Z In(182) vmkernel: cpu25:2098205)nfnic: <1>: INFO: fdls_send_fabric_plogi: 1007: Sending fabric PLOGI for wwpn:0x20000025b5###### Setting PLOGI MFS to 2048
2025-09-14T03:25:48.927Z In(182) vmkernel: cpu77:2098208)nfnic: <1>: INFO: fdls_process_fabric_plogi_rsp: 560: FDLS process fabric PLOGI response FC_LS_ACC
2025-09-14T03:25:48.927Z In(182) vmkernel: cpu77:2098208)nfnic: <1>: INFO: fdls_send_rpn_id: 1058: Sending fabric RPNID for fcid:0x10b18
2025-09-14T03:25:48.928Z In(182) vmkernel: cpu77:2098208)nfnic: <1>: INFO: fdls_process_rpn_id_rsp: 2010: FDLS process RPN ID response: 0x0280
2025-09-14T03:25:48.928Z In(182) vmkernel: cpu77:2098208)nfnic: <1>: INFO: fdls_send_register_fc4_types: 1210: FDLS sending FC4 Types for fcid:0x10b18
2025-09-14T03:25:52.928Z In(182) vmkernel: cpu25:2098205)nfnic: <1>: INFO: fdls_send_fabric_abts: 955: FDLS sending fabric abts. iport->fabric.state: 7
2025-09-14T03:26:12.932Z Wa(180) vmkwarning: cpu25:2098205)WARNING: nfnic: <1>: fdls_fabric_timer_callback: 2852: ABTS timed out for FDLS_STATE_REGISTER_FC4_TYPES. Check fabric controller. Starting PLOGI. 0x43139b0013d0
2025-09-14T03:26:12.932Z In(182) vmkernel: cpu25:2098205)nfnic: <1>: INFO: fdls_send_fabric_plogi: 1007: Sending fabric PLOGI for wwpn:0x20000025b5###### Setting PLOGI MFS to 2048
2025-09-14T03:26:12.933Z In(182) vmkernel: cpu77:2098208)nfnic: <1>: INFO: fdls_process_fabric_plogi_rsp: 560: FDLS process fabric PLOGI response FC_LS_ACC
2025-09-14T03:26:12.933Z In(182) vmkernel: cpu77:2098208)nfnic: <1>: INFO: fdls_send_rpn_id: 1058: Sending fabric RPNID for fcid:0x10b18
2025-09-14T03:26:12.933Z In(182) vmkernel: cpu77:2098208)nfnic: <1>: INFO: fdls_process_rpn_id_rsp: 2010: FDLS process RPN ID response: 0x0280
2025-09-14T03:26:12.933Z In(182) vmkernel: cpu77:2098208)nfnic: <1>: INFO: fdls_send_register_fc4_types: 1210: FDLS sending FC4 Types for fcid:0x10b18
2025-09-14T03:26:16.935Z In(182) vmkernel: cpu25:2098205)nfnic: <1>: INFO: fdls_send_fabric_abts: 955: FDLS sending fabric abts. iport->fabric.state: 7
2025-09-14T03:26:36.939Z Wa(180) vmkwarning: cpu25:2098205)WARNING: nfnic: <1>: fdls_fabric_timer_callback: 2852: ABTS timed out for FDLS_STATE_REGISTER_FC4_TYPES. Check fabric controller. Starting PLOGI. 0x43139b0013d0
2025-09-14T03:26:36.939Z In(182) vmkernel: cpu25:2098205)nfnic: <1>: INFO: fdls_send_fabric_plogi: 1007: Sending fabric PLOGI for wwpn:0x20000025b5###### Setting PLOGI MFS to 2048
2025-09-14T03:26:36.940Z In(182) vmkernel: cpu56:2098208)nfnic: <1>: INFO: fdls_process_fabric_plogi_rsp: 560: FDLS process fabric PLOGI response FC_LS_ACC
2025-09-14T03:26:36.940Z In(182) vmkernel: cpu56:2098208)nfnic: <1>: INFO: fdls_send_rpn_id: 1058: Sending fabric RPNID for fcid:0x10b18
2025-09-14T03:26:36.940Z In(182) vmkernel: cpu56:2098208)nfnic: <1>: INFO: fdls_process_rpn_id_rsp: 2010: FDLS process RPN ID response: 0x0280
2025-09-14T03:26:36.940Z In(182) vmkernel: cpu56:2098208)nfnic: <1>: INFO: fdls_send_register_fc4_types: 1210: FDLS sending FC4 Types for fcid:0x10b18
2025-09-14T03:26:40.943Z In(182) vmkernel: cpu25:2098205)nfnic: <1>: INFO: fdls_send_fabric_abts: 955: FDLS sending fabric abts. iport->fabric.state: 7
2025-09-14T03:27:00.946Z Wa(180) vmkwarning: cpu25:2098205)WARNING: nfnic: <1>: fdls_fabric_timer_callback: 2852: ABTS timed out for FDLS_STATE_REGISTER_FC4_TYPES. Check fabric controller. Starting PLOGI. 0x43139b0013d0

Resolution

This is a bug on Cisco MDS switches. The MDS Supervisor ACL (Access Control List) process maintains a database of all fabric logins on an F-Port Channel, and this database is regularly synchronized to the standby supervisor. Due to the bug, the synchronization does not work correctly, and the standby supervisor does not receive the full list of fabric logins on the F-Port Channel. After a switchover between supervisors, the new active supervisor's ACL database is therefore missing some of those logins. If a module is replaced or swapped, or new members are added to the F-Port Channel, the missing entries are not programmed on the newly added Port Channel members, leaving those ACLs absent and the corresponding paths lost. IOs that hash to pre-existing Port Channel members still complete, but any IO that hashes to a member with missing ACL entries fails.
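From the MDS side, one way to look for the missing entries is to compare the recorded fabric logins against the name-server database (a sketch using standard NX-OS show commands; the VSAN and port-channel numbers are placeholders):

```shell
! Logins currently recorded on the F-Port Channel
show flogi database interface port-channel 1

! Name-server view of the same VSAN; devices listed here that still fail IO
! through specific port-channel members are candidates for missing ACL entries
show fcns database vsan 10
```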

This is being tracked via the following Cisco MDS bug: CSCwr78631: Devices unable to communicate over an F Port-Channel after a supervisor switchover 

The fix will ship as part of the next NX-OS release, 9.4(5), due sometime in late December 2025 or January 2026.

WORKAROUND:

To work around this issue, perform a shut/no shut on each F-Port Channel to repopulate the ACL database. Until NX-OS 9.4(5) is available, if switch maintenance involving a supervisor switchover is required, perform the shut/no shut on the port channels after the switchover has completed.
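The shut/no shut sequence is applied from NX-OS configuration mode (a sketch; the port-channel number is a placeholder, and this briefly drops every login on that port channel, so run it in a maintenance window):

```shell
! On the Cisco MDS active supervisor
configure terminal
  interface port-channel 1
    shutdown
    no shutdown
end
```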