vSAN cluster exhibits network partitions and inaccessible objects after restoring vCenter Server from backup

Article ID: 432331

Products

VMware vSAN

Issue/Introduction

After restoring a vCenter Server from a file-based backup (e.g., following a power outage or database corruption), a vSAN cluster may experience a total or partial network partition.

Symptoms may include:

  • ESXi hosts in the same vSphere cluster each report as "Master" of their own 1-node partition (verified via esxcli vsan cluster get).

  • Virtual machines showing as (inaccessible) in the vCenter inventory.

  • vSAN Health service reporting "Sub-cluster member count mismatch" or "Network communication" errors.

  • vmkping -I vmkX tests between the vSAN VMkernel interfaces succeed at the standard MTU (e.g., 1500 MTU / 1472-byte payload) and, where jumbo frames are configured, at the large MTU (e.g., 9000 MTU / 8972-byte payload), yet the cluster fails to form.
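
The payload sizes quoted in the vmkping symptom follow from simple arithmetic: with the don't-fragment flag (-d) set, the largest ICMP payload that fits in a single frame is the MTU minus the 20-byte IPv4 header (no options) and the 8-byte ICMP header. A minimal sketch of that calculation follows; the function and constant names are illustrative only and are not part of any VMware tool.

```python
IPV4_HEADER_BYTES = 20  # IPv4 header without options
ICMP_HEADER_BYTES = 8   # ICMP echo request header


def vmkping_payload(mtu):
    """Largest ICMP payload (the -s value) that fits in one frame of the given MTU."""
    return mtu - IPV4_HEADER_BYTES - ICMP_HEADER_BYTES


for mtu in (1500, 9000):
    print(f"MTU {mtu}: vmkping -s {vmkping_payload(mtu)} -d")
```

If a ping with the computed payload and -d succeeds end to end, the path supports that MTU, which is why a passing test points away from the physical network.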

Environment

VMware vSAN (All Versions)

 

Cause

vSAN uses a Unicast Agent list to manage cluster communication. This list is maintained by the vCenter Server and pushed to the ESXi hosts. When vCenter is restored from a backup, the internal vSAN membership metadata in the vCenter database may be inconsistent with the actual state of the hosts. As a result:

  1. The vCenter Server fails to push the correct Unicast Agent updates to the ESXi hosts.

  2. The ESXi hosts may have empty or stale unicast tables (verified via esxcli vsan cluster unicastagent list).

  3. Without a valid unicast list, hosts cannot "see" their peers to exchange heartbeats, leading to a partition despite physical network connectivity.
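
The effect of an empty or stale table can be illustrated with a small sketch. It assumes that a healthy host's unicast agent list holds an entry for every other cluster member but not for the host itself, and it takes hand-collected IP addresses as input rather than parsing real command output. The addresses (taken from the 192.0.2.0/24 documentation range) and the function name are hypothetical.

```python
def missing_peers(vsan_ips, unicast_tables):
    """Map each host IP to the peer IPs absent from its unicast table.

    vsan_ips: vSAN VMkernel IPs of all hosts that should be in the cluster.
    unicast_tables: host IP -> peer IPs currently listed on that host.
    Hosts with a complete table are left out of the result.
    """
    all_ips = set(vsan_ips)
    gaps = {}
    for ip in vsan_ips:
        expected = all_ips - {ip}  # every member except the host itself
        missing = expected - set(unicast_tables.get(ip, ()))
        if missing:
            gaps[ip] = sorted(missing)
    return gaps


# Hypothetical four-host cluster after a vCenter restore.
hosts = ["192.0.2.11", "192.0.2.12", "192.0.2.13", "192.0.2.14"]
tables = {
    "192.0.2.11": [],                                          # empty table
    "192.0.2.12": ["192.0.2.11"],                              # stale table
    "192.0.2.13": ["192.0.2.11", "192.0.2.12", "192.0.2.14"],  # complete
    "192.0.2.14": ["192.0.2.11", "192.0.2.12", "192.0.2.13"],  # complete
}
print(missing_peers(hosts, tables))
```

Any host that appears in the result has no entry for at least one peer, so it has no address to send heartbeats to for that peer, regardless of whether vmkping between the two succeeds.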

Resolution

To resolve this inconsistency, you must force the vCenter Server to re-synchronize the vSAN cluster membership metadata with the ESXi hosts.

  1. Verify Connectivity: Confirm that the physical network is not the issue by running a ping test between all vSAN VMkernel interfaces:

    vmkping -I vmkX <Peer_vSAN_IP> -s 1472 -d

    If jumbo frames are in use, also test with the larger payload:

    vmkping -I vmkX <Peer_vSAN_IP> -s 8972 -d

  2. Audit Unicast Tables: Check if the unicast agent list is empty or incomplete on the partitioned hosts:

    esxcli vsan cluster unicastagent list
    
  3. Trigger Re-synchronization: Instead of manually editing CLI tables on every host, use the following non-destructive method to force vCenter to rebuild the tables:

    • In the vSphere Client, locate the affected cluster.

    • Select one ESXi host, right-click, and select Connection > Disconnect.

    • Once the host is disconnected, right-click again and select Connection > Connect.

  4. Verification:

    • Re-run esxcli vsan cluster unicastagent list on all nodes; the table should now be populated with all cluster members.

    • Run esxcli vsan cluster get on each host to confirm that all hosts have joined a single partition (e.g., "Sub-cluster Member Count: 4" for a four-host cluster).

    • Verify that virtual objects transition from "Inaccessible" to "Healthy."
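
The member-count check in the verification step can be sketched as a small parser. This assumes the output of esxcli vsan cluster get has been saved as text and consists of "Key: Value" lines; the excerpts below are made up and heavily trimmed, and real output contains more fields and may vary between releases. The function names are hypothetical.

```python
def parse_key_values(text):
    """Turn 'Key: Value' lines into a dict with lower-cased keys."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:  # skip lines without a colon, such as a heading
            fields[key.strip().lower()] = value.strip()
    return fields


def in_single_partition(text, expected_hosts):
    """True if the reported member count equals the expected host count."""
    count = parse_key_values(text).get("sub-cluster member count", "")
    return count.isdigit() and int(count) == expected_hosts


# Made-up, trimmed excerpts for a four-host cluster.
partitioned = """\
Local Node State: MASTER
Sub-cluster Member Count: 1
"""
recovered = """\
Local Node State: AGENT
Sub-cluster Member Count: 4
"""

print(in_single_partition(partitioned, 4))
print(in_single_partition(recovered, 4))
```

The check has to pass on every host; a single host that still reports a member count of 1 indicates that it has not rejoined the cluster.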

Additional Information

For more details on manual unicast agent management and troubleshooting, see KB 317830 - Network partition caused by an invalid/incomplete unicast agent list