A replicated VM becomes unresponsive or cannot serve network requests


Article ID: 386695



Products

VMware Live Recovery
VMware vSphere ESXi

Issue/Introduction

 

A. VM Stops Responding to Pings

After enabling replication for the VM or resuming the VM from a 'paused' replication status, the VM stops responding to ping requests. The following command, executed from SSH on the ESXi host where the VM is running, confirms that the vNIC is not replying to ping requests:

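pktcap-uw --switchport xxxx --proto 0x01 --capture VnicTx -o - | tcpdump-uw -enr - host xxx.xxx.xxx.xx

In this command, xxxx is the switch port ID of the VM's vNIC and xxx.xxx.xxx.xx is the IP address of the VM.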
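The capture command needs the switch port ID of the VM's vNIC. The following is a minimal sketch of one way to look it up and assemble the command, not an official procedure. It assumes that `esxcli network vm list` shows the World ID of the VM and that `esxcli network vm port list -w <world-id>` prints one "Port ID: <number>" line per vNIC. The function names `vm_port_ids` and `icmp_capture_cmd` are hypothetical helpers introduced here for illustration.

```shell
#!/bin/sh
# Sketch only. Assumption: "esxcli network vm port list -w <world-id>" prints
# a line of the form "   Port ID: <number>" for each vNIC of the VM.

# Print the switch port ID of every vNIC that belongs to the given world ID.
# The anchored pattern skips other fields such as "Uplink Port ID".
vm_port_ids() {
    esxcli network vm port list -w "$1" |
        awk -F': *' '/^ *Port ID:/ { print $2 }'
}

# Print the ICMP capture command from this article for one port ID and one
# guest IP address, so that it can be reviewed before it is run.
icmp_capture_cmd() {
    printf 'pktcap-uw --switchport %s --proto 0x01 --capture VnicTx -o - | tcpdump-uw -enr - host %s\n' "$1" "$2"
}
```

For example, after reading the World ID from `esxcli network vm list`, `vm_port_ids <world-id>` lists the candidate port IDs, and `icmp_capture_cmd <port-id> <VM-IP>` prints the command to paste into the SSH session.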


B. VMRC Console Session Freezes

The previously active VMRC console session to the VM freezes.

C. Windows Guest OS SCSI Adapter Reset

If the VM is running a Windows guest OS, the system event log may record a local SCSI adapter reset shortly before the console session freezes.

D. VM Liveness Resumes After Disabling/Pausing Replication

Disabling or pausing the replication of the VM causes the VM to resume normal operation.

Environment

vSphere Replication 8.x
vSphere Replication 9.x

Cause

Overview of vSphere Replication with Unmap commands.

How vSphere Replication Works When Using Guest OS Trim/Unmap Commands

 

ESXi Advanced Option "DemandlogFailCollidingUnmap" History

Prior to ESXi 7.0.3:

The hbr_filter driver handled collisions (a sync operation overlapping with an unmap from the guest OS) by triggering on-demand copies of the affected blocks.*1

Drawbacks: This approach required additional READ and WRITE operations to preserve the overlapping regions, which could lead to unexpected delays.

Note: This behavior is similar to setting the later-introduced advanced option "DemandlogFailCollidingUnmap" to 0.


In ESXi 7.0.3:

This release introduced the DemandlogFailCollidingUnmap option.

  • Default value: 1
    • With the default value, the hbr_filter driver blocks SCSI UNMAP commands from the guest OS during a sync operation if these commands overlap with the content being transferred to the target site.
    • The driver reports a 'busy' status (SCSI event status code 08h) to the guest OS, on the assumption that the guest OS will retry the trim/unmap commands later, so that applications running in the VM are not affected.

From ESXi 7.0.3 through 7.0.3 P07, and from ESXi 8.0 through 8.0.1:*2

The DemandlogFailCollidingUnmap option accepted two values: 0 and 1.

  • To check the current value of the option on ESXi 7.0.3 and later:
  $ esxcli system settings advanced list -o /HBR/DemandlogFailCollidingUnmap
  • To change the option:
  $ esxcli system settings advanced set -o /HBR/DemandlogFailCollidingUnmap -i 0
  • To revert to the default mode:
  $ esxcli system settings advanced set -o /HBR/DemandlogFailCollidingUnmap -i 1
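When several hosts need to be checked, it can help to reduce the listing to a one-line report. The sketch below is one possible way to do that. It assumes that the `list` output contains "Int Value:" and "Default Int Value:" lines, as esxcli typically prints for integer advanced options. The names `OPT`, `unmap_opt_field`, and `unmap_opt_report` are hypothetical and are not part of esxcli.

```shell
#!/bin/sh
# Sketch only. Assumption: the esxcli listing contains lines such as
# "   Int Value: 1" and "   Default Int Value: 1".

# Path of the advanced option, as used throughout this article.
OPT=/HBR/DemandlogFailCollidingUnmap

# Print the value of one field (for example "Int Value") from the listing.
# Leading spaces are stripped so that the field name is matched exactly.
unmap_opt_field() {
    esxcli system settings advanced list -o "$OPT" |
        awk -F': *' -v key="$1" '{ sub(/^ +/, "", $1) } $1 == key { print $2 }'
}

# Print the current value and state whether it differs from the default.
unmap_opt_report() {
    cur=$(unmap_opt_field "Int Value")
    def=$(unmap_opt_field "Default Int Value")
    if [ "$cur" = "$def" ]; then
        echo "DemandlogFailCollidingUnmap=$cur (default)"
    else
        echo "DemandlogFailCollidingUnmap=$cur (default is $def)"
    fi
}
```

Running `unmap_opt_report` in an SSH session on the host prints a single line that is easy to compare across hosts.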

 

For ESXi 7.0.3 P08 or later 7.0.3 patches, and ESXi 8.0.2 or later:

The DemandlogFailCollidingUnmap configuration option has been updated to include two additional values (2 and 3). These new values provide more options for the driver's response to unmap/trim commands from the guest OS.

New Values in the DemandlogFailCollidingUnmap Configuration Option:

  • Option value set to "2": 'Check Condition' Error (SCSI event status code 02h)

    • The hbr_filter driver returns a 'check condition' error with no additional sense data. This causes the guest OS to fail the command silently without immediate retry attempts.
    • SSH Command:
    $ esxcli system settings advanced set -o /HBR/DemandlogFailCollidingUnmap -i 2

  • Option value set to "3": 'Success' Response (SCSI event status code 00h)

    • The hbr_filter driver returns a success status to the guest OS's trim/unmap command.
    • SSH Command:
    $ esxcli system settings advanced set -o /HBR/DemandlogFailCollidingUnmap -i 3
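The four values described in this section can be summarized in a small lookup. The sketch below only restates the behavior described in this article and does not query the host. The function name `describe_unmap_opt` is a hypothetical helper.

```shell
#!/bin/sh
# Sketch only: maps each DemandlogFailCollidingUnmap value to the driver
# response described in this article. It does not talk to the host.
describe_unmap_opt() {
    case "$1" in
        0) echo "0: on-demand copy of the colliding blocks" ;;
        1) echo "1: busy (08h), guest OS is expected to retry later (default)" ;;
        2) echo "2: check condition (02h), no sense data, no immediate retry" ;;
        3) echo "3: success (00h) returned to the guest OS" ;;
        *) echo "unknown value: $1" >&2; return 1 ;;
    esac
}
```

For example, `describe_unmap_opt 2` prints the summary for the 'check condition' response, and any value outside 0 to 3 returns a non-zero exit status.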

Resolution

Workaround for Guest Unresponsiveness in the Environment

If guest unresponsiveness is noticed, customers can use either of the following workarounds:

a. Disable Guest OS Unmap

  • Windows OS: Disable delete notifications by setting 'DisableDeleteNotify = 1' (for example, fsutil behavior set DisableDeleteNotify 1). Refer to the Microsoft Learn documentation for detailed fsutil usage.
  • Linux OS: Refer to the vendor's manual for 'fstrim' usage.

b. Adjust ESXi Host Advanced Settings

  • Set the DemandlogFailCollidingUnmap option to 0, 2, or 3. At the time this Knowledge Base article was written, customers running certain Windows versions had reported that changing the value to '2' resolved their issue. The advanced setting takes effect immediately, and no host reboot is required.
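Because values 2 and 3 exist only on newer builds, a guarded wrapper around the `set` command can avoid applying a value that the host does not support. The following is a minimal sketch under stated assumptions: it assumes the `list` output reports the accepted range in a "Max Value:" line and the active setting in an "Int Value:" line. The names `OPT`, `opt_field`, and `set_unmap_opt` are hypothetical.

```shell
#!/bin/sh
# Sketch only. Assumption: the esxcli listing contains "Int Value:" and
# "Max Value:" lines for this option.

OPT=/HBR/DemandlogFailCollidingUnmap

# Print one field (for example "Max Value") from the option listing.
opt_field() {
    esxcli system settings advanced list -o "$OPT" |
        awk -F': *' -v key="$1" '{ sub(/^ +/, "", $1) } $1 == key { print $2 }'
}

# Apply a new value after checking it against the range the host reports,
# then read the option back to confirm that the change is active.
set_unmap_opt() {
    want=$1
    case "$want" in
        0|1|2|3) ;;
        *) echo "value must be 0, 1, 2, or 3" >&2; return 2 ;;
    esac
    max=$(opt_field "Max Value")
    if [ -n "$max" ] && [ "$want" -gt "$max" ]; then
        echo "this host only accepts values up to $max" >&2
        return 1
    fi
    esxcli system settings advanced set -o "$OPT" -i "$want" || return 1
    [ "$(opt_field "Int Value")" = "$want" ]
}
```

For example, `set_unmap_opt 2` applies the value that some customers reported as effective, and `set_unmap_opt 1` reverts to the default. A non-zero exit status means the value was rejected or could not be confirmed.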

Additional Information

*1. The LWD (Light Weight Delta) snapshot gives hbr_filter a bitmap of the blocks it needs to send. When the guest OS modifies blocks before the hbr_filter driver has sent them, the driver blocks the I/O and copies the data off the disk to a 'demand log'. The driver then unblocks the guest I/O for those blocks so that it can proceed. Once all blocks have been sent to the target, the data in the demand log is drained in the order in which the guest OS collisions occurred. This is sometimes referred to as "on-demand copy" or "copy on write".

*2. For the build numbers of these releases, see the article "VMware ESXi build numbers".