Potential data interruption while performing Network Maintenance on a vSAN cluster while the Primary/Backup host is in Maintenance Mode.

Article ID: 326782

Products

VMware vSAN

Issue/Introduction

Symptoms:

  • Planned network maintenance is performed on a vSAN cluster running live production workloads.
  • One or more hosts are in maintenance mode, with no inaccessible objects.
  • One of the hosts in maintenance mode is either the Primary or the Backup host.
  • During the network maintenance, another host that is not in maintenance mode loses heartbeats with the Primary host of the cluster, causing objects to become inaccessible.
  • The vSAN cluster partitions while heartbeats are lost.



Environment

VMware vSAN 7.0.x
VMware vSAN 6.x

Cause

By design, vSAN hosts in maintenance mode remain part of the cluster, and CMMDS (the Cluster Monitoring, Membership, and Directory Service) has no awareness of maintenance mode. If the host in maintenance mode is the Primary or the Backup, it therefore keeps that role: its resources are unavailable to the cluster, but it is still a cluster member.
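
As a rough illustration of this condition, the sketch below reads `esxcli vsan cluster get` output on stdin and flags a host that holds the Master (Primary) or Backup role while it is in maintenance mode. It assumes the output contains `Local Node State` and `Maintenance Mode State` fields and that `OFF` means the host is not in maintenance mode; the function name `role_in_mm` is hypothetical, and the field names should be checked against the output on your build.

```shell
# Hypothetical helper (not part of any VMware tooling).
# Reads `esxcli vsan cluster get` output on stdin and reports whether this
# host is Master/Backup while in maintenance mode.
# Assumption: "Local Node State" and "Maintenance Mode State" fields exist.
role_in_mm() {
    awk -F ': *' '
        { sub(/^[ \t]+/, "") }                      # drop leading indentation
        $1 == "Local Node State"       { role = $2 }
        $1 == "Maintenance Mode State" { mm = $2 }
        END {
            if (role == "" || mm == "")
                print "UNKNOWN: expected fields not found"
            else if ((role == "MASTER" || role == "BACKUP") && mm != "OFF")
                print "WARNING: " role " host is in maintenance mode (" mm ")"
            else
                print "OK: role=" role " maintenance_mode=" mm
        }'
}
```

On a host, it could be invoked as `esxcli vsan cluster get | role_in_mm`.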

When network upgrades, firmware changes, or switch changes are made while this host is in maintenance mode, they can temporarily cause packet drops, asymmetric connectivity, and similar problems, which in turn can cause temporary network partitions in the cluster.
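
One way to spot such a partition from a single host is to compare the number of members that host currently sees with the number it reports when the cluster is healthy. The sketch below is a minimal example of that comparison, assuming `esxcli vsan cluster get` prints a `Sub-Cluster Member Count` field; `check_partition` and its exit codes are hypothetical, and the expected count is a value you supply.

```shell
# Hypothetical helper (not part of any VMware tooling).
# Usage: esxcli vsan cluster get | check_partition <expected_member_count>
# Assumption: the output contains a "Sub-Cluster Member Count" field.
# Exit status: 0 = all members visible, 1 = fewer than expected, 2 = field missing.
check_partition() {
    awk -F ': *' -v expected="$1" '
        { sub(/^[ \t]+/, "") }                      # drop leading indentation
        $1 == "Sub-Cluster Member Count" { seen = $2 + 0; found = 1 }
        END {
            if (!found) {
                print "UNKNOWN: member count not found"
                exit 2
            }
            if (seen < expected + 0) {
                print "PARTITIONED? sees " seen " of " expected " hosts"
                exit 1
            }
            print "OK: sees " seen " of " expected " hosts"
        }'
}
```

Running it on each host before and after a network change shows whether any host has dropped out of the main partition.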

Resolution

  • If you plan any network upgrades or changes on the vSAN cluster, perform them one host at a time if possible.
  • Schedule a maintenance window for any maintenance on a production cluster, in case the maintenance activities do not go smoothly.
  • Before making changes to the Primary or Backup host, reboot it so that a new Primary/Backup host is elected. This keeps maintenance operations on that host from impacting the cluster and causing a production outage.
  • To determine which host is the Primary and which is the Backup, run the following commands on any host in the cluster (the CLI output labels the Primary as "Master"):
echo -e "\nHostname: Master_UUID"; SCMU=$(esxcli vsan cluster get | grep 'Sub-Cluster Master' | awk -F '\: ' '{print $2}'); cmmds-tool find -f json -t HOSTNAME |grep -E "uuid|content"|sed 'N;s/\n/ /'|awk -F \" '{print $10": " $4}'|sort| grep $SCMU

Hostname: Master_UUID
esxi2.companydomain.org: 5f7f09c2-xxxx-xxxx-xxxx-0050560181d5

echo -e "\nHostname: Backup_UUID"; SCMU=$(esxcli vsan cluster get | grep 'Sub-Cluster Backup' | awk -F '\: ' '{print $2}'); cmmds-tool find -f json -t HOSTNAME |grep -E "uuid|content"|sed 'N;s/\n/ /'|awk -F \" '{print $10": " $4}'|sort| grep $SCMU

Hostname: Backup_UUID
esxi3.companydomain.org: 5f7f09d3-xxxx-xxxx-xxxx-0050560181e8
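
If only the UUIDs are needed, both roles can be read in a single pass over the `esxcli vsan cluster get` output. The following is a minimal sketch rather than a replacement for the commands above: it does not resolve hostnames, the function name `vsan_roles` is hypothetical, and it assumes the output carries `Local Node UUID`, `Sub-Cluster Master UUID`, and `Sub-Cluster Backup UUID` fields.

```shell
# Hypothetical helper (not part of any VMware tooling).
# Reads `esxcli vsan cluster get` output on stdin, prints the Master and
# Backup UUIDs, and notes whether the local host holds either role.
# Assumption: the three field names matched below appear in the output.
vsan_roles() {
    awk -F ': *' '
        { sub(/^[ \t]+/, "") }                      # drop leading indentation
        $1 == "Local Node UUID"         { local_uuid = $2 }
        $1 == "Sub-Cluster Master UUID" { master = $2 }
        $1 == "Sub-Cluster Backup UUID" { backup = $2 }
        END {
            print "Master_UUID: " master
            print "Backup_UUID: " backup
            if (local_uuid != "" && local_uuid == master)
                print "This host is the Master (Primary)"
            else if (local_uuid != "" && local_uuid == backup)
                print "This host is the Backup"
        }'
}
```

On a host, it could be invoked as `esxcli vsan cluster get | vsan_roles`.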


Additional Information