vSphere Replication Sync Failures: "Filter Failed" and "DiskQueue Stopped" Errors

Products

VMware vSphere ESXi
VMware Live Recovery

Issue/Introduction

A virtual machine (VM) experiences intermittent replication warnings that result in persistent Recovery Point Objective (RPO) violations. Some synchronization sessions complete successfully, but most fail with a filter error.

Symptoms:

- Replication status displays: Sync stopped: Filter Failed.
- The issue is isolated to a specific high-capacity VM within an environment where other replications are functioning normally.
- System logs indicate a failure in the Host-Based Replication (HBR) update command.

Environment

VMware Live Recovery

Cause

The failure is caused by a stalled Host-Based Replication (HBR) filter on the ESXi host. The vmkernel.log reveals that the REPLICA_UPDATE command failed because the DiskQueue was stopped during dispatch.

Log Snippet:

WARNING: Hbr: ####: Command REPLICA_UPDATE failed (result=Failed) (isFatal=FALSE) (Id=-#########) (GroupID=GID-########-####-####-####-############)
Hbr: ####: Command: REPLICA_UPDATE: error result=Failed gen=-1: Error for (diskId: "RDID-########-####-####-####-############"), (flags: do-not-report): DiskQueue is stopped during dispatch.

This state typically occurs when the filter cannot keep pace with I/O or encounters metadata inconsistencies on very large disks, leading the host to suspend the dispatch queue for that specific replication ID.
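
To find which replication group and disk are affected, a copy of vmkernel.log can be scanned for the message shown in the log snippet. The following is a minimal sketch, not a VMware tool: the function and class names are hypothetical, the patterns are derived only from the snippet above, and pairing a disk error with the most recently seen GroupID is a heuristic that assumes the two messages appear together.

```python
import re
from typing import Iterable, NamedTuple


class StoppedQueueEvent(NamedTuple):
    group_id: str
    disk_id: str


# Patterns are based on the log snippet in this article. Any timestamp or
# other prefix on a real log line is simply ignored by the search.
GROUP_RE = re.compile(r"\(GroupID=(GID-[^)\s]+)\)")
DISK_RE = re.compile(r'diskId: "(RDID-[^"]+)"')
STOPPED_MARKER = "DiskQueue is stopped during dispatch"


def find_stopped_queues(lines: Iterable[str]) -> list[StoppedQueueEvent]:
    """Return one event per line that reports a stopped DiskQueue."""
    events = []
    last_group = "unknown"
    for line in lines:
        group = GROUP_RE.search(line)
        if group:
            # Remember the latest GroupID; the disk error may be on the
            # same line or on the line that follows (heuristic pairing).
            last_group = group.group(1)
        if STOPPED_MARKER in line:
            disk = DISK_RE.search(line)
            disk_id = disk.group(1) if disk else "unknown"
            events.append(StoppedQueueEvent(last_group, disk_id))
    return events


if __name__ == "__main__":
    # Made-up IDs for illustration only.
    sample = [
        "WARNING: Hbr: 1234: Command REPLICA_UPDATE failed (result=Failed) "
        "(isFatal=FALSE) (Id=-1) (GroupID=GID-example-group)",
        'Hbr: 1234: Command: REPLICA_UPDATE: error result=Failed gen=-1: '
        'Error for (diskId: "RDID-example-disk"), (flags: do-not-report): '
        "DiskQueue is stopped during dispatch.",
    ]
    for event in find_stopped_queues(sample):
        print(event.group_id, event.disk_id)
```

To run it against real data, pass an open file handle for a copy of the host's vmkernel.log (located at /var/log/vmkernel.log on ESXi) instead of the sample list.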

Resolution

To resolve the stalled filter and clear the dispatch error, the replication metadata must be refreshed by re-creating the replication task. Follow these detailed steps to ensure a clean reconfiguration:

Stop and Remove Existing Replication:
- Log into the vSphere Client.
- Navigate to Shortcuts > Replication.
- Locate the affected VM (XXXXXXXXXXXXXXX) under Monitor > Outgoing Replications.
- Select the VM, click Actions, and choose Remove. Select the option to "Remove replication from both sites" to ensure the management database is cleared.
Verify Target Site Cleanup:
- Browse the datastore at the target site where the replica resides.
- Ensure there are no orphaned .hbr files or stale temporary tracking files in the VM's folder.
- Note: Do not delete the primary .vmdk files if you intend to use them as seeds.
Reconfigure with Replication Seeds (Recommended for 5TB+):
- Right-click the source VM and select vSphere Replication > Configure Replication.
- Follow the wizard to the Target Location section.
- Select the same datastore and folder where the previous replica disks reside.
- When prompted, select "Use existing disks as replication seeds." This lets the system compare checksums against the existing replica disks and transfer only the data that differs, rather than performing a full 5TB data transfer.
Monitor Initial "Force Re-sync":
- Once the task is saved, the VM will enter an "Initial Full Sync" or "Checksum" state.
- Monitor the Events tab for the VM to ensure the HBR Filter initializes correctly.
- Check vmkernel.log on the source host to confirm the DiskQueue is now in a Started or Dispatching state.
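
The benefit of using replication seeds in the steps above comes from comparing checksums instead of copying everything. The sketch below illustrates that general idea only; it is not vSphere Replication's actual algorithm, and the block size, the SHA-256 hash, and the function names are all assumptions chosen for illustration.

```python
import hashlib

# Tiny block size so the example is easy to follow; this value is an
# illustrative assumption, not what any replication product uses.
BLOCK_SIZE = 4


def block_digests(data: bytes, block_size: int = BLOCK_SIZE) -> list[bytes]:
    """Hash each fixed-size block of the input."""
    return [
        hashlib.sha256(data[i:i + block_size]).digest()
        for i in range(0, len(data), block_size)
    ]


def blocks_to_transfer(source: bytes, seed: bytes,
                       block_size: int = BLOCK_SIZE) -> list[int]:
    """Return indexes of source blocks that are missing from or differ in the seed."""
    src = block_digests(source, block_size)
    dst = block_digests(seed, block_size)
    return [i for i, digest in enumerate(src)
            if i >= len(dst) or digest != dst[i]]


if __name__ == "__main__":
    source_disk = b"AAAABBBBCCCCDDDD"
    seed_disk = b"AAAAXXXXCCCCDDDD"  # only the second block differs
    print(blocks_to_transfer(source_disk, seed_disk))  # prints [1]
```

With a seed that already matches most of the source, only the differing blocks need to cross the network, which is why seeding is attractive for very large disks.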

Additional Information

A simple "Reconfigure" or a restart of the replication management service (RMS) may occasionally clear the alert, but it often fails to reset the underlying DiskQueue state on the ESXi host for high-capacity disks. Fully removing and re-adding the replication task is the most reliable way to ensure a clean filter state and prevent recurring RPO violations.