Restoring vSAN File Service Following a Site Failure

Article ID: 439976

Updated On:

Products

VMware vSAN

Issue/Introduction

  • In a vSAN stretched cluster environment, vSAN File Service objects configured with site failure tolerance become inaccessible due to a loss of quorum if both a site and the witness node fail simultaneously.
  • While you may seek to perform a forced takeover of the surviving site when recovery of the failed components is not feasible, the current vSAN site force takeover feature does not include support for recovering vSAN File Service objects.
  • This article outlines the procedure for restoring vSAN File Service functionality after such a site failure.

NOTE: Prior to performing the resolution procedures outlined here, ensure that neither the witness host nor the hosts from the failed site reconnect to the network.

Environment

VMware vSAN 9.1 or above

Resolution

  1. On one ESXi host of the surviving site, run /usr/lib/vmware/vsan/bin/site-takeover.
  2. Identify the vSAN file service objects in the output file.
    If the vCenter Server is a VM on this vSAN datastore and has become inaccessible, power it on and reconnect it to the cluster.
  3. Unmount all file shares on the vSAN File Service clients.
  4. Stop the vsan-health service on the vCenter Server by running the following command:

     vmon-cli -k vsan-health
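Before continuing, it can help to confirm that the service really stopped. The following is a minimal sketch, assuming that `vmon-cli -s <service>` prints a `RunState:` line for the service; the helper name `service_run_state` is hypothetical and not part of any VMware tooling.

```shell
# Hypothetical helper: print the RunState that vMon reports for a service.
# Assumption: `vmon-cli -s <service>` emits a line such as "RunState: STOPPED".
service_run_state() {
    vmon-cli -s "$1" | awk '$1 == "RunState:" {print $2}'
}

# Usage on the vCenter Server, after `vmon-cli -k vsan-health`:
#   [ "$(service_run_state vsan-health)" = "STOPPED" ] \
#       || echo "vsan-health is still running" >&2
```

If the reported state is not STOPPED, repeat the stop command before moving on to the next step.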

  5. Uninstall the vSAN File Service EAM agency and wait for the vSAN File Service VMs on accessible hosts to be removed.

Option 1: Uninstall vSAN File Service EAM Agency via Script

    • Copy the attached script to the vCenter Server and execute the following command:
      python DeleteFileServiceEAMAgency.py -s localhost -u <admin username> -p <admin password> --cluster <cluster name>

Option 2: Uninstall vSAN File Service EAM Agency via vCenter UI

    1. Navigate to the EAM UI in vCenter:
      Administration -> Solutions (vCenter Server Extensions) -> vSphere ESX Agent Manager -> Configure -> ESX Agencies.
    2. Locate the agency named "vsan-file-services" associated with the recovered vSAN cluster. Ensure the state is "Enabled" and select "Delete Agency".

  6. Shut down the vSAN File Service daemons on each ESXi host.

    • Stop the vdfsd and fsvmsockrelay daemons on the ESXi host:
      /etc/init.d/vdfsd stop
      /etc/init.d/fsvmsockrelay stop
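Because the two stop commands must be run on every host, they can be looped over SSH from an administration machine. This is only a sketch under stated assumptions: SSH is enabled on the ESXi hosts, root login is permitted, and the function name and host names are placeholders rather than anything defined by vSAN.

```shell
# Sketch: run the two stop commands from this step on each host given as an argument.
# Assumptions: SSH is enabled on the ESXi hosts and root login is allowed.
stop_file_service_daemons() {
    rc=0
    for host in "$@"; do
        echo "Stopping vdfsd and fsvmsockrelay on $host"
        # Both daemons are stopped even if the first stop fails;
        # the remote exit status is non-zero if either stop failed.
        if ! ssh "root@$host" '/etc/init.d/vdfsd stop; rc=$?; /etc/init.d/fsvmsockrelay stop || rc=$?; exit $rc'; then
            echo "warning: stopping the daemons failed on $host" >&2
            rc=1
        fi
    done
    return "$rc"
}

# Usage (host names are placeholders):
#   stop_file_service_daemons esx-a1.example.com esx-a2.example.com
```

The function returns a non-zero status if any host reported a failure, so that host can be checked by hand before the takeover.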

  7. Reboot all vSAN File Service clients.

  8. Forced Recovery and Site Takeover

    • Enable the recovery of file service objects during the site takeover process:
      vsish -e set /config/VSAN/intOpts/ClomEnableRecoveryOfSkippedObjs 1
      /usr/lib/vmware/vsan/bin/site-takeover
    • Once the site takeover is complete, restore the original setting:
      vsish -e set /config/VSAN/intOpts/ClomEnableRecoveryOfSkippedObjs 0
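The three commands above can be wrapped so that the option is set back to 0 even when the takeover command exits with an error. This is a minimal sketch, not an official tool: the function name and the `SITE_TAKEOVER` override variable are hypothetical, and resetting the option after a failed takeover is a design choice of the sketch (the step above only describes the reset after a completed takeover).

```shell
# Sketch: enable recovery of skipped objects, run the takeover, then reset the option.
# SITE_TAKEOVER is a hypothetical override; it defaults to the path used in this article.
SITE_TAKEOVER=${SITE_TAKEOVER:-/usr/lib/vmware/vsan/bin/site-takeover}
RECOVERY_OPT=/config/VSAN/intOpts/ClomEnableRecoveryOfSkippedObjs

takeover_with_file_service_recovery() {
    # Do not start the takeover if the option could not be enabled.
    vsish -e set "$RECOVERY_OPT" 1 || return 1

    rc=0
    "$SITE_TAKEOVER" || rc=$?

    # Reset the option whether or not the takeover succeeded (assumed original value: 0).
    vsish -e set "$RECOVERY_OPT" 0

    # Report the takeover's own exit status to the caller.
    return "$rc"
}

# Usage on the ESXi host chosen for the takeover:
#   takeover_with_file_service_recovery || echo "site takeover failed" >&2
```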

  9. Final Service Restoration

    • Start the vsan-health service on the vCenter Server again: "vmon-cli -i vsan-health".
    • This should trigger an automatic File Service remediation. If it does not, manually start the remediation from the "Infrastructure Health" section of the vSAN health UI and wait for the VMs to deploy.
    • Finally, re-mount the file shares on the vSAN File Service clients.