VMs configured for replication are unexpectedly powered off during snapshot-based backup operations

Article ID: 431361

Products

VMware vSphere ESXi

Issue/Introduction

  • A virtual machine (VM) is configured for replication via vSphere Replication or vCloud Director Availability
  • The vmkernel port that the ESXi host uses to connect to the replication software has been disabled but not deleted
  • Instead, a new vmkernel port has been created for that connection
  • The VM is unexpectedly powered off during an attempt to create a snapshot-based backup
  • The VMX log in the virtual machine folder (/vmfs/volumes/<datastore>/<vm_name>/vmware.log) contains errors similar to the example below:
    ..
    <timestamp> In(05) vcpu-0 - Msg_Post: Error
    <timestamp> In(05) vcpu-0 - [msg.scsi.esx.filterAttachmentFailed] Failed to attach filter 'hbr_filter' to 'scsi0:0': Not found (195887107).
    <timestamp> In(05) vcpu-0 - [msg.scsi.esx.reopenFailed] Failed to reopen disk '/vmfs/volumes/<datastore>/<vm_folder>/<vm_name>-000001.vmdk'.
    <timestamp> In(05) vcpu-0 - ----------------------------------------
    <timestamp> In(05) vcpu-0 - CPT: error syncing group SCSI0
    ..
    <timestamp> In(05) vcpu-0 - Msg_Post: Error
    <timestamp> In(05) vcpu-0 - [msg.checkpoint.continuesync.error] An operation required the virtual machine to quiesce and the virtual machine was unable to continue running.
    <timestamp> In(05) vcpu-0 - ----------------------------------------
    <timestamp> In(05) vcpu-0 - SnapshotVMXTakeSnapshotCB: Failed to quiesce for snapshotting. (mode=1, error=0)
    <timestamp> In(05) vcpu-0 - SnapshotVMXTakeSnapshotComplete: Done with snapshot '<snapshot_name>': ##
    <timestamp> In(05) vcpu-0 - SnapshotVMXTakeSnapshotComplete: Snapshot ## failed: Unable to save snapshot file (13).
    <timestamp> In(05) vcpu-0 - SnapshotVMXTakeSnapshotComplete: Cleaning up incomplete snapshot ##.
    <timestamp> In(05) vcpu-0 - SnapshotVMXTakeSnapshotComplete: Deleting incomplete snapshot ##.
    ..
  • At the same time, the vmkernel warning log, /var/run/log/vmkwarning.log, has entries similar to the ones below:
    ..
    <timestamp> Wa(180) vmkwarning: cpu##:### opID=###)WARNING: Hbr: 1003: Failed to get the netstack for vmknic vmk#: Not found
    <timestamp> Wa(180) vmkwarning: cpu##:### opID=###)WARNING: Hbr: 5756: Failed to create transport to <IP_of_the_replication_software>(groupID=H4-########-####-####-####-############): Not found
    <timestamp> Wa(180) vmkwarning: cpu##:### opID=###)WARNING: Hbr: 8694: Failed to create NetWorker (groupID=H4-########-####-####-####-############): Not found
    <timestamp> Wa(180) vmkwarning: cpu##:### opID=###)WARNING: Hbr: 284: Failed to allocate disk info (diskID=H4D-########-####-####-####-############): Not found
    <timestamp> Wa(180) vmkwarning: cpu##:### opID=###)WARNING: VSCSIFilter: 207: handle ###(GID:###)(vscsi0:0):Error attaching filter 'hbr_filter' to VSCSI_Handle 0x###: Not found
    <timestamp> Wa(180) vmkwarning: cpu##:### opID=###)WARNING: VSCSI: vm ###: 5579: Attaching filter 'hbr_filter' on scsi0:0 failed: Not found (195887107)
    ..
  • A review of the vmkernel port configuration shows that one vmkernel port was disabled but not deleted, while another one exists with the same IP network configuration
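
The two log signatures listed above can be checked together with a short script. The following is a minimal sketch in Python, assuming the contents of vmware.log and vmkwarning.log have already been copied off the host and read into strings. The function name scan_logs and the returned keys are hypothetical, and the patterns are derived only from the excerpts in this article, so the exact message wording may differ between builds.

```python
import re

# Patterns derived from the log excerpts in this article (assumption:
# the wording is stable; the numeric source-line IDs after "Hbr:" may vary).
ATTACH_FAILED = re.compile(
    r"Failed to attach filter 'hbr_filter' to '([^']+)'"
)
NETSTACK_FAILED = re.compile(
    r"Hbr: \d+: Failed to get the netstack for vmknic ([^:\s]+): Not found"
)


def scan_logs(vmware_log: str, vmkwarning_log: str) -> dict:
    """Return the disks and vmknics named in the two failure signatures.

    Hypothetical helper: it only reports whether both signatures are
    present, which is a hint, not a confirmed diagnosis.
    """
    disks = sorted(set(ATTACH_FAILED.findall(vmware_log)))
    vmknics = sorted(set(NETSTACK_FAILED.findall(vmkwarning_log)))
    return {
        "disks": disks,
        "vmknics": vmknics,
        "both_signatures_present": bool(disks and vmknics),
    }
```

If both signatures are present, the vmknic names reported from vmkwarning.log are the ones to compare against the vmkernel port configuration of the host.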

Environment

VMware vSphere ESXi 8.0.x

Cause

This issue occurs because the old vmkernel port, although disabled, still exists. Since vSphere 8.0, ESXi includes logic that, when a vmkernel NIC is deleted, enumerates the remaining NICs to find a connection to the replication software so that it can download the hbr_filter.

That logic is not triggered when the port is only disabled and not removed. Instead, ESXi tries to route the download through the disabled port again and fails. Because the filter cannot be downloaded, it cannot be attached to the disk. The resulting disk access failure leaves the snapshot incomplete, which introduces a risk of data corruption. To prevent such corruption, ESXi powers off the virtual machine.

Resolution

To prevent this issue, do not disable the vmkernel port that ESXi uses to connect to the replication software.

Instead, when configuring a new vmkernel port for replication, delete the old one completely so that ESXi can enumerate the new route to the replication system.
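
To find leftover disabled vmkernel ports on a host, the interface list can be reviewed programmatically. The following is a minimal sketch in Python, assuming the text output of `esxcli network ip interface list` has been saved and is laid out as one block per interface: a header line with the interface name, followed by indented "Key: value" lines that include an "Enabled" field. The function names are hypothetical, and both the layout and the assumption that "Enabled: false" corresponds to the disabled state described in this article should be verified against the actual output of the host.

```python
from typing import Dict, List


def parse_interfaces(text: str) -> List[Dict[str, str]]:
    """Parse blocks of 'Key: value' lines, one block per vmkernel interface.

    Assumption: a line without a colon is a header that starts a new block.
    """
    interfaces: List[Dict[str, str]] = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if ":" not in line:
            current = {"Name": line}
            interfaces.append(current)
        elif current is not None:
            # Split on the first colon only, so values such as MAC
            # addresses keep their own colons.
            key, _, value = line.partition(":")
            current[key.strip()] = value.strip()
    return interfaces


def disabled_interfaces(text: str) -> List[str]:
    """Return the names of interfaces whose 'Enabled' field is 'false'."""
    return [
        nic["Name"]
        for nic in parse_interfaces(text)
        if nic.get("Enabled", "").lower() == "false"
    ]
```

Any interface this reports is only a candidate for review. Whether it is the old replication port, and whether it is safe to delete, still has to be confirmed against the replication configuration before it is removed.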