Symptoms:
During maintenance or upgrade processes involving the shutdown or reboot of the vSAN backup host, users may notice objects becoming inaccessible in the vSAN environment. This issue may manifest in several ways:
Affected versions: vSAN 7.x, 8.x
Cause:
This rare issue occurs when CMMDS on the vSAN Backup node loses the ability to receive heartbeats (HBs) but continues to transmit them for a brief period, typically while the Backup host is rebooting. Because the Backup node may then promote itself to Leader during the reboot, the cluster can become temporarily partitioned.
Resolution:
vSAN engineering is aware of this issue and is working on a fix, to be included in the next release.
Workaround:
Either wait for the Backup host to finish rebooting, after which the VMs/objects become accessible again, or, if the environment has critical VMs that cannot tolerate a temporary outage, follow the steps below to network-isolate the host.
1) Prior to scheduled maintenance, run the script below on any host in the cluster to identify the cluster Backup node:
echo -e "\nHostname: Backup_UUID"; SCMU=$(esxcli vsan cluster get | grep 'Sub-Cluster Backup' | awk -F ': ' '{print $2}'); cmmds-tool find -f json -t HOSTNAME | grep -E "uuid|content" | sed 'N;s/\n/ /' | awk -F \" '{print $10": " $4}' | sort | grep "$SCMU"
Sample output:
Hostname: Backup_UUID
esxi4.vsancluster.org: ########-####-####-####-#############
2) Once the Backup host is identified, in vCenter select the host > Configure > VMkernel adapters > vSAN vmk, click the three-dot (ellipsis) menu > Edit, and remove the vSAN tag to network-isolate the host.
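For reference, step 2 can also be approximated from the ESXi shell. The sketch below is an assumption, not part of the documented procedure: it only prints the esxcli commands assumed to be the command-line counterparts of removing and restoring the vSAN tag, and it executes nothing. The helper names (vsan_vmks, isolate_cmd, restore_cmd) are hypothetical, and the parsing assumes the "VmkNic Name" and "Traffic Type" fields shown by `esxcli vsan network list`.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical; nothing here changes host state.

# vsan_vmks: read `esxcli vsan network list` output on stdin and print the
# names of the VMkernel interfaces whose traffic type is "vsan".
# Assumes "VmkNic Name" appears before "Traffic Type" within each interface.
vsan_vmks() {
    awk -F ': ' '
        { key = $1; sub(/^[ \t]+/, "", key) }
        key == "VmkNic Name" { name = $2 }
        key == "Traffic Type" && $2 == "vsan" { print name }
    '
}

# isolate_cmd: print the command assumed to untag vSAN traffic on interface $1.
isolate_cmd() {
    printf 'esxcli vsan network remove -i %s\n' "$1"
}

# restore_cmd: print the command assumed to re-tag vSAN traffic on interface $1.
restore_cmd() {
    printf 'esxcli vsan network ip add -i %s\n' "$1"
}

# Usage on the Backup host (review the printed commands before running them):
#   for vmk in $(esxcli vsan network list | vsan_vmks); do
#       isolate_cmd "$vmk"
#       restore_cmd "$vmk"
#   done
```

Printing rather than executing is deliberate: untagging vSAN traffic isolates the host immediately, so the commands should be reviewed and run only on the intended Backup host, and the tag restored once maintenance is complete.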
Note: For cluster upgrades using vLCM, either exclude the Backup host from the cluster-wide upgrade and upgrade it last, after it has been network-isolated with the steps above, or manually upgrade the hosts one at a time, which is not ideal for large clusters.
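When upgrading hosts one at a time, it can help to confirm on each host whether it currently holds the Backup role before rebooting it. The sketch below is a minimal illustration, not part of the documented procedure: the function names (field, is_backup_node) are hypothetical, and it assumes the "Local Node UUID" and "Sub-Cluster Backup UUID" fields printed by `esxcli vsan cluster get`.

```shell
#!/bin/sh
# Sketch only: function names are hypothetical; the script reads text and
# changes nothing on the host.

# field: print the value of the first "Label: value" line on stdin whose
# label (ignoring leading whitespace) equals $1.
field() {
    awk -F ': ' -v label="$1" '
        { key = $1; sub(/^[ \t]+/, "", key) }
        key == label { print $2; exit }
    '
}

# is_backup_node: read `esxcli vsan cluster get` output on stdin.
# Exit status: 0 = this host is the Backup node, 1 = it is not,
#              2 = one of the expected fields was not found.
is_backup_node() {
    info=$(cat)
    local_uuid=$(printf '%s\n' "$info" | field 'Local Node UUID')
    backup_uuid=$(printf '%s\n' "$info" | field 'Sub-Cluster Backup UUID')
    if [ -z "$local_uuid" ] || [ -z "$backup_uuid" ]; then
        return 2
    fi
    [ "$local_uuid" = "$backup_uuid" ]
}

# Usage on an ESXi host:
#   if esxcli vsan cluster get | is_backup_node; then
#       echo "This host is the Backup node: isolate it before rebooting"
#   fi
```

Reading the cluster information from stdin keeps the check independent of the host it runs on, so the same function can be tried against saved output before being used during maintenance.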