Collecting support bundle might cause host failure with vSAN 6.7U1 when using WSFC via vSAN iSCSI Target Service

Article ID: 303630

Updated On:

Products

VMware vSAN

Issue/Introduction

Symptoms:
On vSAN 6.7 U1, when WSFC (Windows Server Failover Cluster) is deployed on LUNs provided by the vSAN iSCSI Target Service, collecting a support bundle from the host that owns the iSCSI Target might cause that host to fail. WSFC can continue to service the application as long as the target ownership fails over to another available node within the cluster.

The following backtraces have been associated with this issue:

2019-04-11T09:47:22.099Z cpu8:2101798)@BlueScreen: CPU 8 / World 2101798 tried to re-acquire lock
2019-04-11T09:47:22.099Z cpu8:2101798)Code start: 0x418019800000 VMK uptime: 1:23:14:56.708
2019-04-11T09:47:22.100Z cpu8:2101798)0x451a6869b990:[0x41801990ac15]PanicvPanicInt@vmkernel#nover+0x439 stack: 0x0
2019-04-11T09:47:22.100Z cpu8:2101798)0x451a6869ba30:[0x41801990ae48]Panic_NoSave@vmkernel#nover+0x4d stack: 0x451a6869ba90
2019-04-11T09:47:22.100Z cpu8:2101798)0x451a6869ba90:[0x41801982e006]LockCheckSelfDeadlockInt@vmkernel#nover+0x5b stack: 0x41801995b91e
2019-04-11T09:47:22.101Z cpu8:2101798)0x451a6869bab0:[0x418019914d66]SP_WaitLock@vmkernel#nover+0x15b stack: 0x451a68623000
2019-04-11T09:47:22.101Z cpu8:2101798)0x451a6869baf0:[0x418019914ddc]SPLockWork@vmkernel#nover+0x29 stack: 0x451a68623000
2019-04-11T09:47:22.102Z cpu8:2101798)0x451a6869bb00:[0x41801993aa7b]World_WakeupWithState@vmkernel#nover+0x20 stack: 0x451a686a3001
2019-04-11T09:47:22.102Z cpu8:2101798)0x451a6869bb20:[0x41801993cb42]WorldWaitTimeout@vmkernel#nover+0x1b stack: 0x201225
2019-04-11T09:47:22.102Z cpu8:2101798)0x451a6869bb30:[0x41801991c612]Timer_BHHandler@vmkernel#nover+0xe3 stack: 0x1441cdb400000
2019-04-11T09:47:22.103Z cpu8:2101798)0x451a6869bbb0:[0x4180198cc9b2]BH_Check@vmkernel#nover+0x77 stack: 0x0
2019-04-11T09:47:22.103Z cpu8:2101798)0x451a6869bc30:[0x418019b02f39]CpuSched_SafePreemptionPoint@vmkernel#nover+0x16 stack: 0x0
2019-04-11T09:47:22.108Z cpu8:2101798)base fs=0x0 gs=0x418042000000 Kgs=0x0



2019-03-20T14:43:06.111Z cpu0:2101340)@BlueScreen: CPU 0 / World 2101340 tried to re-acquire lock
2019-03-20T14:43:06.111Z cpu0:2101340)Code start: 0x41803bc00000 VMK uptime: 8:23:33:30.869
2019-03-20T14:43:06.111Z cpu0:2101340)0x451a6859bb68:[0x41803bd0ac15]PanicvPanicInt@vmkernel#nover+0x439 stack: 0x430dad05c290
2019-03-20T14:43:06.112Z cpu0:2101340)0x451a6859bc08:[0x41803bd0ae48]Panic_NoSave@vmkernel#nover+0x4d stack: 0x451a6859bc68
2019-03-20T14:43:06.112Z cpu0:2101340)0x451a6859bc68:[0x41803bc2e006]LockCheckSelfDeadlockInt@vmkernel#nover+0x5b stack: 0x0
2019-03-20T14:43:06.112Z cpu0:2101340)0x451a6859bc88:[0x41803bd14d66]SP_WaitLock@vmkernel#nover+0x15b stack: 0x451a685a3000
2019-03-20T14:43:06.113Z cpu0:2101340)0x451a6859bcc8:[0x41803bd14ddc]SPLockWork@vmkernel#nover+0x29 stack: 0x451a00000001
2019-03-20T14:43:06.113Z cpu0:2101340)0x451a6859bcd8:[0x41803bd3a866]WorldWaitInt@vmkernel#nover+0x143 stack: 0x451a6859bd28
2019-03-20T14:43:06.114Z cpu0:2101340)0x451a6859bda8:[0x41803c34047f]UserObj_Poll@(user)#<None>+0x190 stack: 0x451a6859be70
2019-03-20T14:43:06.114Z cpu0:2101340)0x451a6859be18:[0x41803c38ab3e]LinuxFileDesc_Select@(user)#<None>+0x9b stack: 0x451a6859be90
2019-03-20T14:43:06.115Z cpu0:2101340)0x451a6859bee8:[0x41803c33b31b]User_LinuxSyscallHandler@(user)#<None>+0x180 stack: 0x451a6859bfc8
2019-03-20T14:43:06.115Z cpu0:2101340)0x451a6859bf28:[0x41803bd2aebc]User_LinuxSyscallHandler@vmkernel#nover+0x1d stack: 0x10b
2019-03-20T14:43:06.115Z cpu0:2101340)0x451a6859bf38:[0x41803bd60066]gate_entry@vmkernel#nover+0x67 stack: 0x0
2019-03-20T14:43:06.120Z cpu0:2101340)base fs=0x0 gs=0x418040000000 Kgs=0x0


Environment

VMware vSAN 6.7.x

Cause

WSFC (Windows Server Failover Cluster) uses SCSI-3 Persistent Reservations (PR) to avoid split-brain scenarios in the failover cluster. When WSFC is configured on vSAN iSCSI Target LUNs, the PR metadata is stored on the vSAN datastore. When a support bundle is collected, this PR metadata is dumped from the live system via a VSI (VMkernel Sysinfo Interface) command. The host may fail while processing that VSI command.

Resolution

This issue has been fixed in vSAN 6.7 U2.

Workaround:
This issue only occurs when collecting a support bundle on vSAN 6.7 U1 with WSFC deployed on the vSAN iSCSI Target Service. There are two ways to avoid it:
1. Preferred: upgrade to vSAN 6.7 U2, then collect the support bundle.
2. If the cluster is still on 6.7 U1, place the host into Maintenance Mode before collecting the support bundle, and take it out of Maintenance Mode after the collection is done (see the sketch after this list).
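Below is a minimal pyVmomi sketch (not part of the original article) illustrating workaround 2: it places the iSCSI Target owner host into Maintenance Mode, generates a support (log) bundle for that host, and then exits Maintenance Mode. The vCenter address, credentials, host name, and the "ensureObjectAccessibility" vSAN decommission mode are assumptions; adjust them for your environment.

# Minimal pyVmomi sketch for workaround 2.
# The vCenter address, credentials, and host name below are placeholders.
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVim.task import WaitForTask
from pyVmomi import vim

si = SmartConnect(host='vcenter.example.com',
                  user='administrator@vsphere.local',
                  pwd='password',
                  sslContext=ssl._create_unverified_context())
content = si.RetrieveContent()

# Locate the ESXi host that currently owns the iSCSI Target
# (see "How to identify the iSCSI Target owner host" below).
view = content.viewManager.CreateContainerView(content.rootFolder,
                                               [vim.HostSystem], True)
host = next(h for h in view.view if h.name == 'esxi-owner.example.com')
view.DestroyView()

# 1. Enter Maintenance Mode. 'ensureObjectAccessibility' is an assumed vSAN
#    decommission mode; choose the mode appropriate for your cluster.
spec = vim.host.MaintenanceSpec(
    vsanMode=vim.vsan.host.DecommissionMode(objectAction='ensureObjectAccessibility'))
WaitForTask(host.EnterMaintenanceMode_Task(timeout=0, maintenanceSpec=spec))

# 2. Generate the support (log) bundle for this host only.
task = content.diagnosticManager.GenerateLogBundles_Task(includeDefault=False,
                                                         host=[host])
WaitForTask(task)
for bundle in task.info.result:   # each result entry carries a download URL
    print(bundle.url)

# 3. Exit Maintenance Mode once the bundle has been downloaded.
WaitForTask(host.ExitMaintenanceMode_Task(timeout=0))
Disconnect(si)

Download the bundle from the printed URL before taking the host out of Maintenance Mode; collecting the bundle through the vSphere Client while the host is in Maintenance Mode works equally well.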

How to identify the iSCSI Target owner host:
In the vCenter GUI, the owning host is shown as "I/O Owner Host" under the "Configure" -> "vSAN iSCSI Service" tab.