vSAN Cluster Recovery Fails with "Timeout, please try again later" in VMware Cloud Foundation (VCF) Tanzu Workload Domain

Article ID: 432518

Products

VMware vSAN

Issue/Introduction

Following planned power maintenance or a cold start of a VMware Cloud Foundation (VCF) environment, the vSAN cluster in a Tanzu Workload Domain may fail to initialize.

Symptoms:

  • Executing the recovery script python /usr/lib/vmware/vsan/bin/reboot_helper.py recover returns the error: Timeout, please try again later.
  • The vSAN cluster remains partitioned: esxcli vsan cluster get shows Sub-Cluster Member Count: 1 on every node, even though the hosts have network connectivity to each other.
  • The command localcli vsan network list returns no output, indicating the vSAN traffic tag is missing from the VMkernel adapters.
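The two command-based symptoms above can be checked together on a host. The following is a minimal sketch, assuming it is pasted into an ESXi shell session; the function names (vsan_member_count, vsan_tag_missing, vsan_symptom_report) are hypothetical helpers written for this illustration and are not part of any VMware tooling. It relies only on the Sub-Cluster Member Count label and the empty-output behavior described in the symptoms.

```shell
# Hypothetical helpers for illustration; not shipped with ESXi or vSAN.

# Print the member count this host reports for its vSAN sub-cluster.
vsan_member_count() {
    esxcli vsan cluster get | awk -F': *' '/Sub-Cluster Member Count/ { print $2 }'
}

# Succeed (exit status 0) when no vSAN-tagged VMkernel adapter is listed.
vsan_tag_missing() {
    [ -z "$(localcli vsan network list)" ]
}

# Summarize both symptoms for the local host.
vsan_symptom_report() {
    count=$(vsan_member_count)
    echo "Sub-Cluster Member Count: ${count:-unknown}"
    if vsan_tag_missing; then
        echo "vSAN network list: empty (traffic tag missing)"
    else
        echo "vSAN network list: populated"
    fi
}
```

Running vsan_symptom_report on each host would print the member count and whether the vSAN network list is empty. A count of 1 combined with an empty list matches the symptoms described in this article.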

Environment

VMware vSAN 8.x

Cause

The vSAN traffic tag was lost or failed to persist on the designated VMkernel adapter (e.g., vmk3) during the host reboot cycle. Without the active vSAN traffic type enabled on the interface, the ESXi hosts cannot participate in the vSAN transport layer, preventing the cluster from forming a single partition.

Resolution

To resolve this issue, the vSAN network configuration must be manually re-asserted on each host to restore the traffic tags before re-running the recovery script.

  1. Confirm which VMkernel adapter is intended for vSAN traffic (commonly vmk3 in VCF Tanzu domains) by checking the IP assignments: esxcfg-vmknic -l

  2. Check the current vSAN network configuration on each host. If the output is blank, the vSAN traffic tag is missing: localcli vsan network list

  3. On each host in the cluster, clear the vSAN network configuration and re-add the VMkernel adapter with the vSAN tag:
    esxcli vsan network clear
    esxcli vsan network ipv4 add -i vmk#  (Note: Replace vmk# with the VMkernel interface identified in Step 1, for example vmk3.)

  4. Ensure the "Traffic Type" now shows vsan: esxcli vsan network list

  5. From one of the ESXi hosts, re-run the reboot helper script: python /usr/lib/vmware/vsan/bin/reboot_helper.py recover
    Example output:
    Begin to recover the cluster ...

    The cluster has been recovered successfully.
    Successfully resumed the cluster.
  6. Verify the object health to ensure no data is inaccessible: esxcli vsan debug object health summary get
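Steps 2 through 4 must be repeated on every host, so they can be wrapped in a small shell function. This is a sketch under stated assumptions, not a supported procedure: the function name vsan_retag is hypothetical, the early return when the list is already populated is an added safety choice that the steps above do not require, and the verification relies only on the interface name and the Traffic Type: vsan text mentioned in Step 4.

```shell
# Hypothetical helper for illustration; not shipped with ESXi or vSAN.
# Usage: vsan_retag vmk3   (pass the adapter identified in Step 1)
vsan_retag() {
    vmk="$1"
    if [ -z "$vmk" ]; then
        echo "usage: vsan_retag <vmk-interface>" >&2
        return 2
    fi

    # Step 2: only act when the vSAN network list is blank (assumed safeguard).
    if [ -n "$(localcli vsan network list)" ]; then
        echo "vSAN network already configured; no change made"
        return 0
    fi

    # Step 3: clear the vSAN network configuration, then re-add the adapter.
    esxcli vsan network clear || return 1
    esxcli vsan network ipv4 add -i "$vmk" || return 1

    # Step 4: confirm the adapter is listed with the vsan traffic type.
    listing=$(esxcli vsan network list)
    if printf '%s\n' "$listing" | grep -q "$vmk" &&
       printf '%s\n' "$listing" | grep -q 'Traffic Type: vsan'; then
        echo "$vmk: vSAN traffic tag restored"
    else
        echo "$vmk: vSAN traffic tag still missing" >&2
        return 1
    fi
}
```

After defining the function in an SSH session on a host, calling vsan_retag with the adapter name from Step 1 would perform the clear, re-add, and verification in one pass. Steps 5 and 6 are still run once, from a single host, after every host has been re-tagged.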