Replacing a corrupted or failed WSFC node with a Veeam/Image-level backup.
The need to re-map original Physical Mode RDMs to a new VM object.
Uncertainty regarding the impact of powering off or deleting the original failed VM on the surviving cluster node.
This article provides guidance on replacing a failed node within a Windows Server Failover Cluster (WSFC) using a restored virtual machine while maintaining storage integrity for shared Raw Device Mapping (RDM) disks.
The process involves re-mapping the original RDM pointer files (.vmdk) and maintaining SCSI reservation consistency to prevent cluster downtime or data corruption.
Applies to: VMware vSphere 8.x
The complexity of this recovery arises from the way VMware handles shared storage for clustering:
SCSI Reservations: WSFC relies on SCSI-3 Persistent Reservations to arbitrate disk ownership. If the active node is powered off without first migrating its roles, a stale reservation can leave the shared disks locked or inaccessible to the surviving node.
Pointer Dependency: In many WSFC configurations, "Node B" (DB02) points to RDM metadata files located in the folder of "Node A" (DB01). Deleting the folder of the failed node from the datastore can inadvertently break the storage paths for the surviving node.
Hardware Configuration Mismatch: If the restored VM is not configured with identical SCSI IDs and Bus Sharing settings, the Windows OS will fail to recognize the disks as part of the existing cluster, leading to "Disk Offline" or "Inaccessible" errors in Failover Cluster Manager (see the PowerCLI sketch after this list for a quick way to record the current layout).
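Before making any changes, it can help to record the surviving node's controller and disk layout as a reference for Phase 3. A minimal PowerCLI sketch, assuming an existing vCenter connection and the node names used in this article (adjust to your environment):

```powershell
# Run from a workstation with VMware.PowerCLI installed and connected:
#   Connect-VIServer -Server <vcenter>
$vm = Get-VM -Name 'DB02'

# Controller type and bus sharing mode (Physical is required for
# WSFC nodes running on different ESXi hosts).
Get-ScsiController -VM $vm | Select-Object Name, Type, BusSharingMode

# Each disk's pointer file, type (RawPhysical = physical-mode RDM),
# and SCSI address components.
Get-HardDisk -VM $vm | Select-Object Name, Filename, DiskType,
    @{N='ControllerKey'; E={$_.ExtensionData.ControllerKey}},
    @{N='UnitNumber';    E={$_.ExtensionData.UnitNumber}}
```

The Filename column also reveals the pointer dependency described above: any path under the DB01 folder means DB02's disks depend on files it does not own.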
Phase 1: Pre-Recovery Validation
Before modifying the environment, ensure the surviving node is stable and self-sufficient.
Move Cluster Roles: Open Failover Cluster Manager on DB02, move over any roles still owned by the failed node, and verify that DB02 owns all Cluster Roles and Resources (Disks, IP Address, Quorum). A PowerShell check for this follows this list.
Verify Disk Pointer Paths:
On DB02, go to Edit Settings.
Select each RDM disk and check the File Location.
Scenario A: Path is [Datastore] DB02/DB02.vmdk. DB02 is self-sufficient.
Scenario B: Path is [Datastore] DB01/DB01_1.vmdk. Warning: DB02 depends on DB01’s folder. Do not delete the DB01 folder from the datastore until these paths are redirected.
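The role-ownership check in step 1 can be confirmed from an elevated PowerShell prompt on DB02 (the FailoverClusters module ships with the Failover Clustering feature; the role name in the comment is a hypothetical example):

```powershell
# All groups (roles) should list the surviving node as owner and be Online.
Get-ClusterGroup | Select-Object Name, OwnerNode, State

# Drain anything still owned by the failed node, e.g.:
#   Move-ClusterGroup -Name 'SQL Server (MSSQLSERVER)' -Node 'DB02'

# Clustered disks and the quorum/witness resource should also be Online.
Get-ClusterResource | Where-Object { $_.ResourceType -like 'Physical Disk*' } |
    Select-Object Name, OwnerNode, State
```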
Phase 2: Detaching RDMs from the Original Node
To prevent metadata conflicts, the original VM must release the disks properly.
Power Off: Shut down the original DB01 (failed node).
Remove Disks:
Right-click DB01 > Edit Settings.
Identify the RDM disks. Click the X to remove them.
Critical: Select "Remove from virtual machine".
Do NOT select "Delete files from datastore", as this will destroy the RDM pointer files needed for the restored VM.
Inventory Cleanup: If you intend to reuse the VM name, right-click DB01 and select Remove from Inventory (not Delete from Disk). A scripted version of this phase is sketched below.
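The same detach can be done with PowerCLI. A minimal sketch, assuming the node name DB01 and that every physical-mode RDM on the VM should be released:

```powershell
$vm = Get-VM -Name 'DB01'

# Power off the failed node if it is still running.
if ($vm.PowerState -eq 'PoweredOn') {
    Stop-VM -VM $vm -Confirm:$false
}

# Detach the physical-mode RDMs WITHOUT -DeletePermanently:
# the pointer .vmdk files must stay on the datastore for Phase 3.
Get-HardDisk -VM $vm -DiskType RawPhysical |
    Remove-HardDisk -Confirm:$false

# Optional: free up the VM name without touching any files on disk.
# (Remove-VM without -DeletePermanently only unregisters the VM.)
Remove-VM -VM $vm -Confirm:$false
```

Omitting -DeletePermanently on both Remove-HardDisk and Remove-VM is the scripted equivalent of "Remove from virtual machine" and "Remove from Inventory" in the UI.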
Phase 3: Configuring the Restored VM
Once the VM is restored from backup, it must be "plumbed" correctly to talk to the shared storage.
Add SCSI Controller:
Right-click the Restored VM > Edit Settings.
Add a New SCSI Controller (match the controller type used by DB02, e.g., VMware Paravirtual).
Set SCSI Bus Sharing to Physical.
Attach Existing RDMs:
Select Add New Device > Existing Hard Disk.
Browse to the datastore and select the original .vmdk pointer files removed in Phase 2.
Match SCSI IDs:
Ensure each disk is assigned to the exact same SCSI Unit ID it used previously (e.g., 1:0, 1:1); see the PowerCLI sketch below for a scripted attach and verification.
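These steps can also be done in one pass with PowerCLI. In this sketch, piping the first attached disk into New-ScsiController creates the shared-bus controller; the pointer paths and controller type are assumptions to adapt, and the resulting SCSI addresses should still be checked against the layout recorded in Phase 1:

```powershell
# The restored VM, assumed re-registered under the original name.
$vm = Get-VM -Name 'DB01'

# Attach the first pointer file and place it on a NEW controller
# with physical bus sharing (required for cross-host WSFC).
$disk = New-HardDisk -VM $vm -DiskPath '[Datastore] DB01/DB01_1.vmdk'
$ctrl = New-ScsiController -HardDisk $disk -Type ParaVirtual -BusSharingMode Physical

# Attach any remaining pointer files to the same shared controller.
New-HardDisk -VM $vm -DiskPath '[Datastore] DB01/DB01_2.vmdk' -Controller $ctrl

# Verify the unit numbers match the layout recorded from DB02; PowerCLI
# assigns the next free unit, so correct any mismatch in Edit Settings.
Get-HardDisk -VM $vm | Select-Object Name, Filename,
    @{N='UnitNumber'; E={$_.ExtensionData.UnitNumber}}
```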
Phase 4: Guest OS and Cluster Re-entry
These final steps within Windows ensure the Cluster Service starts without errors; if OS-level issues persist, consulting Microsoft Support may be required. A PowerShell sketch of these checks follows the list.
Network Cleanup:
Open Device Manager on the restored VM.
Select View > Show hidden devices.
Uninstall any "Ghost" Network Adapters (old grayed-out NICs) and reconfigure the static IP on the new NIC.
Disk Verification:
Open Disk Management. The RDMs should appear as Reserved or Offline (controlled by the cluster).
Cluster Join:
Open Failover Cluster Manager.
Start the Cluster Service and verify the node successfully joins the cluster and can take over ownership of disks if a failover is initiated.
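The in-guest cleanup and verification can be sketched in PowerShell as well. Run this inside the restored node; the device-removal command depends on the Windows build, and the role name in the failover test is a hypothetical example:

```powershell
# Ghost NICs: non-present devices report Status 'Unknown' in Get-PnpDevice.
Get-PnpDevice -Class Net | Where-Object { $_.Status -eq 'Unknown' } |
    Select-Object FriendlyName, InstanceId
# Remove each with: pnputil /remove-device <InstanceId>  (newer Windows builds)

# Disks should be visible but cluster-controlled (Offline/Reserved).
Get-Disk | Select-Object Number, FriendlyName, OperationalStatus, IsClustered

# Start the cluster service and confirm the node joins.
Start-Service -Name ClusSvc
Get-ClusterNode | Select-Object Name, State

# Optional failover test: move a role onto the rebuilt node.
Move-ClusterGroup -Name 'SQL Server (MSSQLSERVER)' -Node 'DB01'
```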