ESXi path failover not occurring with Cisco MDS switches running NX-OS 7.3

Products

VMware vSphere ESXi

Issue/Introduction

VMware recommends against running NX-OS 7.3 on Cisco MDS switches until this issue is fixed.

Symptoms:

When rebooting a storage array controller, path failover does not occur.
In the /var/log/vmkernel log file, you see that an RSCN (Registered State Change Notification) is received:

2016-07-07T17:46:03.208Z cpu31:33467)<6>host2: disc: Received an RSCN event
2016-07-07T17:46:03.208Z cpu31:33467)<6>host2: disc: Port address format for port (e50900)
2016-07-07T17:46:03.208Z cpu31:33467)<6>host2: disc: RSCN received: not rediscovering. redisc 0 state 9 in_prog 0
2016-07-07T17:46:03.209Z cpu17:33475)<6>host2: rport e50900: ADISC port
2016-07-07T17:46:03.209Z cpu17:33475)<6>host2: rport e50900: sending ADISC from Ready state

You see IO ABORTED and IO TIMEOUT errors:

2016-07-07T17:46:06.148Z cpu4:33330)<7>fnic : 2 :: Abort Cmd called FCID 0xe50900, LUN 0x1 TAG c0 flags 3
2016-07-07T17:46:08.139Z cpu20:33228)<7>fnic : 2 :: abts cmpl recd. id 192 status FCPIO_TIMEOUT
2016-07-07T17:46:08.139Z cpu4:33330)<7>fnic : 2 :: Returning from abort cmd type 2 FAILED
2016-07-07T17:46:08.139Z cpu4:33330)WARNING: LinScsi: SCSILinuxAbortCommands:1891: Failed, Driver fnic, for vmhba4
2016-07-07T17:46:09.143Z cpu50:32913)<7>fnic : 2 :: Abort Cmd called FCID 0xe50900, LUN 0x1 TAG c0 flags 273
2016-07-07T17:46:09.143Z cpu20:33288)<7>fnic : 2 :: abts cmpl recd. id 192 status FCPIO_ABORTED
2016-07-07T17:46:09.143Z cpu50:32913)<7>fnic : 2 :: Returning from abort cmd type 2 FAILED

In addition, the ADISC response returns Error 1:

2016-07-07T17:46:23.211Z cpu55:33447)<6>host2: rport e50900: Received a ADISC response
2016-07-07T17:46:23.211Z cpu55:33447)<6>host2: rport e50900: Error 1 in state ADISC, retries 0
2016-07-07T17:46:23.211Z cpu55:33447)<6>host2: rport e50900: Port entered LOGO state from ADISC state

Notes:
- This continues until the array controller is back online.
- This log excerpt is an example. Date, time, and environmental variables may vary depending on your environment.
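
To check a saved copy of the vmkernel log for this pattern, a short script along the following lines may help. This is a minimal sketch, not a VMware tool: the regular expressions are derived only from the example excerpts above and may need adjusting for other driver versions, and the function names scan_vmkernel_log and matches_signature are hypothetical.

```python
import re

# Patterns derived from the example log excerpts in this article.
# Assumption: message wording matches these examples; adjust if your logs differ.
RSCN_RE = re.compile(r"(host\d+): disc: Received an RSCN event")
ADISC_ERR_RE = re.compile(r"(host\d+): rport ([0-9a-f]+): Error \d+ in state ADISC")
ABORT_FAIL_RE = re.compile(r"fnic : \d+ :: Returning from abort cmd type \d+ FAILED")


def scan_vmkernel_log(lines):
    """Count the symptom lines found in an iterable of log lines."""
    summary = {"rscn_events": 0, "adisc_errors": {}, "failed_aborts": 0}
    for line in lines:
        if RSCN_RE.search(line):
            summary["rscn_events"] += 1
        match = ADISC_ERR_RE.search(line)
        if match:
            # Key by (host, remote port ID) so each affected path is counted separately.
            key = (match.group(1), match.group(2))
            summary["adisc_errors"][key] = summary["adisc_errors"].get(key, 0) + 1
        if ABORT_FAIL_RE.search(line):
            summary["failed_aborts"] += 1
    return summary


def matches_signature(summary):
    """True when both an RSCN and at least one ADISC error were seen."""
    return summary["rscn_events"] > 0 and bool(summary["adisc_errors"])


# Three lines copied from the excerpts above, used as sample input.
SAMPLE = [
    "2016-07-07T17:46:03.208Z cpu31:33467)<6>host2: disc: Received an RSCN event",
    "2016-07-07T17:46:08.139Z cpu4:33330)<7>fnic : 2 :: Returning from abort cmd type 2 FAILED",
    "2016-07-07T17:46:23.211Z cpu55:33447)<6>host2: rport e50900: Error 1 in state ADISC, retries 0",
]

if __name__ == "__main__":
    result = scan_vmkernel_log(SAMPLE)
    print(result)
    print("Signature present:", matches_signature(result))
```

To scan a real log, pass an open file object instead of SAMPLE, for example scan_vmkernel_log(open(path)) where path points to a copy of the log. A match only suggests that the symptoms described here are present; it does not confirm the root cause.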

Cause

When the array controller sends an RSCN to the fabric to notify all members of the zone that it is going offline, libfc issues a GPN_ID (Get Port Name ID) request to the name server to confirm that the port is actually down before attempting to fail over. The switch should reject this request because the port is supposed to be down. However, due to a timing issue in the NX-OS 7.3 firmware, the FCNS database is not cleaned up before the GPN_ID request is received and answered. This reply causes libfc to attempt an Address Discovery (ADISC). Instead of the ports being disabled, the ADISC goes unanswered and times out repeatedly until the array controller is back online.
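
The sequence described above can be illustrated with a toy model. This is only a sketch of the decision flow as described in this article, not libfc or NX-OS code: the NameServer class, its cleanup_before_query flag, the host_reaction_to_rscn function, and the placeholder port name are all hypothetical.

```python
class NameServer:
    """Toy stand-in for the switch's FCNS database (hypothetical, not a real API)."""

    def __init__(self, cleanup_before_query):
        # cleanup_before_query=True models the expected behavior;
        # False models the timing issue, where the stale entry is still present
        # when the GPN_ID request arrives.
        self.cleanup_before_query = cleanup_before_query
        self.entries = {}

    def register(self, port_id, port_name):
        self.entries[port_id] = port_name

    def port_offline(self, port_id):
        if self.cleanup_before_query:
            self.entries.pop(port_id, None)

    def gpn_id(self, port_id):
        """Return the registered port name, or None to model a rejected request."""
        return self.entries.get(port_id)


def host_reaction_to_rscn(name_server, port_id):
    """Model the host's next step after receiving an RSCN for port_id."""
    if name_server.gpn_id(port_id) is None:
        # Request rejected: the port is confirmed down and failover can proceed.
        return "fail over"
    # Request answered: the host believes the port is still present and sends
    # an ADISC, which goes unanswered while the controller is offline.
    return "send ADISC"


def simulate(cleanup_before_query):
    ns = NameServer(cleanup_before_query)
    ns.register("e50900", "placeholder-port-name")  # placeholder value, not real data
    ns.port_offline("e50900")                       # controller reboots
    return host_reaction_to_rscn(ns, "e50900")


if __name__ == "__main__":
    print("Expected behavior:", simulate(cleanup_before_query=True))
    print("With timing issue:", simulate(cleanup_before_query=False))
```

The only difference between the two runs is whether the name server entry is removed before the query is answered, which is the timing dependency the paragraph above describes.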

Resolution

Currently, there is no resolution or workaround.

VMware does not recommend running NX-OS 7.3 on Cisco MDS switches until Cisco provides a code fix, because VMware cannot guarantee a successful path failover during storage array controller maintenance or upgrades that require rebooting the controller.

For more information, see Cisco bug ID CSCva64432.

Note: You need a Cisco login to view the bug details.

Disclaimer: VMware is not responsible for the reliability of any data, opinions, advice, or statements made on third-party websites. Inclusion of such links does not imply that VMware endorses, recommends, or accepts any responsibility for the content of such sites.