If a partition occurs, it can be healed either through a series of manual steps or through a script that runs on all hosts in parallel. To resolve the issue, the witness entry showing Supports Unicast=false must be updated to true.
Note: The automated approach explained below is strongly recommended over the manual approach described here.
Note: These steps need to be completed on each host.

List the current unicast agent configuration on the host:
esxcli vsan cluster unicastagent list
NodeUuid                              IsWitness  Supports Unicast  IP Address  Port   Iface Name  Cert Thumbprint
------------------------------------  ---------  ----------------  ----------  -----  ----------  ---------------
########-####-####-####-########0000          1  false             10.10.20.1  12321
########-####-####-####-########7305          0  true              10.10.16.5  12321              ......
########-####-####-####-########7b06          0  true              10.10.16.6  12321              ......
########-####-####-####-########7b07          0  true              10.10.16.7  12321              ......
########-####-####-####-########7f08          0  true              10.10.16.8  12321              ......
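The stale entry can also be spotted programmatically. Below is a minimal sketch (not part of any product tooling) that parses output shaped like the listing above and flags any witness row whose Supports Unicast column is false; the column layout and sample text are assumptions based on the output shown here:

```python
# Sketch: find witness entries with Supports Unicast=false in the
# output of "esxcli vsan cluster unicastagent list".
# The column order (UUID, IsWitness, SupportsUnicast, IP, Port) is
# assumed to match the sample listing above.

def find_stale_witnesses(listing: str):
    """Return (node_uuid, ip) pairs for witness rows not in unicast mode."""
    stale = []
    for line in listing.splitlines():
        fields = line.split()
        # Data rows have at least UUID, IsWitness, SupportsUnicast, IP, Port,
        # and the IsWitness column is "0" or "1" (header/separator rows are not).
        if len(fields) >= 5 and fields[1] in ("0", "1"):
            uuid, is_witness, supports_unicast, ip = fields[0], fields[1], fields[2], fields[3]
            if is_witness == "1" and supports_unicast == "false":
                stale.append((uuid, ip))
    return stale

# Illustrative sample in the same shape as the listing above.
sample = """\
NodeUuid IsWitness SupportsUnicast IPAddress Port
aaaaaaaa-0000-0000-0000-000000000000 1 false 10.10.20.1 12321
bbbbbbbb-0000-0000-0000-000000000001 0 true 10.10.16.5 12321
"""
print(find_stale_witnesses(sample))
# → [('aaaaaaaa-0000-0000-0000-000000000000', '10.10.20.1')]
```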
Remove the witness entry that has Supports Unicast=false. For example:
esxcli vsan cluster unicastagent remove -a 10.10.20.1
If the witness entry is missing entirely, add it manually:
esxcli vsan cluster unicastagent add -a xxx.xxx.xxx.xxx -u NodeUUID -U 1 -t witness
For example:
esxcli vsan cluster unicastagent add -a 10.10.20.1 -u ########-####-####-####-########7310 -U 1 -t witness
Example:
# python remediate_witness.py -h 10.184.100.218,10.184.107.87,10.184.105.206,10.184.102.27 -a 1.2.3.4 -U ########-####-####-####-########ea91
DEBUG: Remediate hosts: ['10.184.100.218', '10.184.107.87', '10.184.105.206', '10.184.102.27']
10.184.100.218: esxcli vsan cluster unicastagent remove -a 1.2.3.4; esxcli vsan cluster unicastagent add -a 1.2.3.4 -u ########-####-####-####-########ea91 -U 1 -t witness
10.184.107.87: esxcli vsan cluster unicastagent remove -a 1.2.3.4; esxcli vsan cluster unicastagent add -a 1.2.3.4 -u ########-####-####-####-########ea91 -U 1 -t witness
10.184.102.27: esxcli vsan cluster unicastagent remove -a 1.2.3.4; esxcli vsan cluster unicastagent add -a 1.2.3.4 -u ########-####-####-####-########ea91 -U 1 -t witness
10.184.105.206: esxcli vsan cluster unicastagent remove -a 1.2.3.4; esxcli vsan cluster unicastagent add -a 1.2.3.4 -u ########-####-####-####-########ea91 -U 1 -t witness
DEBUG: Running command=esxcli vsan cluster unicastagent remove -a 1.2.3.4; esxcli vsan cluster unicastagent add -a 1.2.3.4 -u ########-####-####-####-########ea91 -U 1 -t witness on host 10.184.100.218
DEBUG: Running command=esxcli vsan cluster unicastagent remove -a 1.2.3.4; esxcli vsan cluster unicastagent add -a 1.2.3.4 -u ########-####-####-####-########ea91 -U 1 -t witness on host 10.184.102.27
DEBUG: Running command=esxcli vsan cluster unicastagent remove -a 1.2.3.4; esxcli vsan cluster unicastagent add -a 1.2.3.4 -u ########-####-####-####-########ea91 -U 1 -t witness on host 10.184.105.206
DEBUG: Running command=esxcli vsan cluster unicastagent remove -a 1.2.3.4; esxcli vsan cluster unicastagent add -a 1.2.3.4 -u ########-####-####-####-########ea91 -U 1 -t witness on host 10.184.107.87
DEBUG: Result rc=0 stdout= stderr= on host 10.184.107.87
DEBUG: Result rc=0 stdout= stderr= on host 10.184.102.27
DEBUG: Result rc=0 stdout= stderr= on host 10.184.105.206
DEBUG: Result rc=0 stdout= stderr= on host 10.184.100.218
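The parallel pattern shown in the debug output above can be sketched as follows. This is an illustrative reimplementation of the idea, not the actual remediate_witness.py; the ssh-based run_on_host transport and all names here are assumptions:

```python
# Sketch: build one remove+add command string and run it on every host
# concurrently, so all hosts flip to the correct witness configuration
# within the APD window. The ssh transport (run_on_host) is illustrative;
# the real script's connection method may differ.
import subprocess
from concurrent.futures import ThreadPoolExecutor

def build_command(witness_ip: str, node_uuid: str) -> str:
    """Compose the remove+add command pair shown in the output above."""
    return (
        f"esxcli vsan cluster unicastagent remove -a {witness_ip}; "
        f"esxcli vsan cluster unicastagent add -a {witness_ip} "
        f"-u {node_uuid} -U 1 -t witness"
    )

def run_on_host(host: str, command: str) -> int:
    """Run the command on an ESXi host over ssh (illustrative transport)."""
    return subprocess.run(["ssh", f"root@{host}", command]).returncode

def remediate(hosts, witness_ip, node_uuid, run=run_on_host):
    """Fire the same command at every host in parallel; return rc per host."""
    command = build_command(witness_ip, node_uuid)
    with ThreadPoolExecutor(max_workers=len(hosts)) as pool:
        futures = {host: pool.submit(run, host, command) for host in hosts}
    return {host: fut.result() for host, fut in futures.items()}
```

The run callable is injectable so the dispatch logic can be exercised without live hosts.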
To prevent the partition from occurring, vCenter Server must be on 6.7 U3 or later. As described above in Case 2, upgrading ESXi to 6.5/6.7 does not by itself trigger the multicast-to-unicast change on the cluster (this can be confirmed by running "esxcli vsan cluster get" on each host).
The above script can therefore be run proactively to fix the witness unicast configuration on all data nodes. It executes the commands on all hosts in parallel, so every host is updated within 30 seconds and the APD timeout is avoided (which is why the manual steps are not recommended). This proactive step ensures that no partition occurs during cluster remediation that would make vCenter Server inaccessible. Run the script as quickly as possible after the ESXi upgrade, and avoid making any configuration changes (see the list below) until the script completes. Afterwards, verify that each host is in the expected state by running the esxcli vsan cluster get command (no partition, and running in unicast mode).
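The per-host check of esxcli vsan cluster get output can also be done programmatically. The sketch below parses "Key: value" lines; the field names ("Unicast Mode Enabled", "Sub-Cluster Member Count") and the sample text are assumptions based on typical 6.7-era output and may vary by build:

```python
# Sketch: confirm a host is in unicast mode and sees the full cluster,
# based on "esxcli vsan cluster get" output. Field names are assumed
# from typical 6.7-era output.

def parse_cluster_get(output: str) -> dict:
    """Turn 'Key: value' lines into a dict of stripped strings."""
    info = {}
    for line in output.splitlines():
        if ":" in line:
            key, _, value = line.partition(":")
            info[key.strip()] = value.strip()
    return info

def is_healthy(output: str, expected_members: int) -> bool:
    """True if unicast mode is on and the member count matches (no partition)."""
    info = parse_cluster_get(output)
    return (
        info.get("Unicast Mode Enabled", "").lower() == "true"
        and info.get("Sub-Cluster Member Count") == str(expected_members)
    )

# Illustrative sample output for a healthy 5-node stretched cluster.
sample = """\
Cluster Information
   Enabled: true
   Sub-Cluster Member Count: 5
   Unicast Mode Enabled: true
"""
print(is_healthy(sample, expected_members=5))  # → True
```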
Note that the script causes a temporary partition (as hosts move from multicast to unicast mode) which heals within about 30 seconds. However, if the script runs into an issue that prevents it from fixing a host's unicast configuration, the resulting cluster partition will not heal without manual intervention. Below are some things to verify before running the script.
As described above in Case 2, the witness unicast configuration update takes place automatically (which we want to avoid) if a cluster remediation event happens. Below is a list of changes and user actions that can indirectly cause a cluster remediation event; these should be avoided after upgrading the ESXi hosts and before running the script. Note that this is not an exhaustive list. Avoid any possible change to the system after the upgrade, and run the script as soon as possible.