1. Virtual machines freeze during incremental syncs or SYNC NOW operations
2. The VM recovers from the frozen state when you PAUSE replication
3. Increasing the RPO does not fix the problem
VMware vSphere 8.0.3, 24674464
VMware vSphere Replication 9.0.2.2, 24628359
The DemandlogMaxCowSizeMB option introduced a defect where DemandlogFailCollidingUnmap (default: 1) no longer works as intended.
When DemandlogFailCollidingUnmap is set to 1, the hbr filter should report "busy" to the guest OS for SCSI UNMAP commands. However, due to the defect, SCSI UNMAP commands are delayed until overlapping ranges are copied to the demand log, potentially causing the guest OS to freeze.
Workaround: Set DemandlogMaxCowSizeMB to 0 to disable it.
Verify the current configuration settings by running these commands:
esxcli system settings advanced list -o /HBR/DemandlogMaxCowSizeMB
   Path: /HBR/DemandlogMaxCowSizeMB
   Type: integer
   Int Value: 4096
   Default Int Value: 4096 (Default value)
   Min Value: 0
   Max Value: 1048576
   String Value:
   Default String Value:
   Valid Characters:
   Description: The summary size of pending COW operations in MB. Reaching this limit can cause further colliding VM IOs to fail.
   Host Specific: false
   Impact: none
esxcli system settings advanced list -o /HBR/DemandlogFailCollidingUnmap
   Path: /HBR/DemandlogFailCollidingUnmap
   Type: integer
   Int Value: 1
   Default Int Value: 1 (Default value)
   Min Value: 0
   Max Value: 5
   String Value:
   Default String Value:
   Valid Characters:
   Description: Fail demand log transaction on WRITE_SAME and UNMAP command collision.
   Host Specific: false
   Impact: none
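The check above can also be scripted. The following is a minimal sketch, not an official tool: the function names hbr_int_value and check_hbr_workaround are hypothetical, and it assumes awk is available in the shell where esxcli runs. It reads the "Int Value" field from the list output and compares both options against the workaround values (0 and 1).

```shell
#!/bin/sh
# Hypothetical helper: print the current "Int Value" from
# `esxcli system settings advanced list -o <option>` output read on stdin.
# The pattern is anchored so that "Default Int Value" is not matched.
hbr_int_value() {
    awk -F': *' '/^ *Int Value:/ { print $2; exit }'
}

# Hypothetical helper: report whether both options hold the workaround values.
check_hbr_workaround() {
    cow=$(esxcli system settings advanced list -o /HBR/DemandlogMaxCowSizeMB | hbr_int_value)
    unmap=$(esxcli system settings advanced list -o /HBR/DemandlogFailCollidingUnmap | hbr_int_value)
    if [ "$cow" = "0" ] && [ "$unmap" = "1" ]; then
        echo "workaround applied"
    else
        echo "workaround NOT applied (DemandlogMaxCowSizeMB=$cow, DemandlogFailCollidingUnmap=$unmap)"
        return 1
    fi
}
```

Calling check_hbr_workaround on a host prints one status line and returns a non-zero exit code if either option differs from the workaround value.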
DemandlogFailCollidingUnmap is set to 1 by default on the ESXi host.
"demandlog_fail_colliding_unmap":1
DemandlogMaxCowSizeMB is set to 4096 MB by default on the ESXi host.
"demandlog_max_cow_size_MB":4096
Modify these settings by running these commands:
esxcli system settings advanced set -o /HBR/DemandlogMaxCowSizeMB -i 0
esxcli system settings advanced set -o /HBR/DemandlogFailCollidingUnmap -i 1
Apply these changes on every ESXi host used for replication. This should resolve the freezing issue.
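Where several hosts take part in replication, the two set commands can be pushed to each host in a loop. This is only a sketch under stated assumptions: SSH must be enabled on every host and root login permitted, and the function name apply_hbr_workaround and any host names passed to it are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: apply the workaround values on each ESXi host
# given as an argument. Assumes SSH access as root to every host.
apply_hbr_workaround() {
    failed=0
    for host in "$@"; do
        # Run both set commands remotely; the second runs only if the first succeeds.
        if ssh "root@$host" \
            'esxcli system settings advanced set -o /HBR/DemandlogMaxCowSizeMB -i 0 &&
             esxcli system settings advanced set -o /HBR/DemandlogFailCollidingUnmap -i 1'
        then
            echo "$host: updated"
        else
            echo "$host: FAILED" >&2
            failed=1
        fi
    done
    # Non-zero if any host could not be updated.
    return "$failed"
}
```

For example, apply_hbr_workaround esx01.example.com esx02.example.com (placeholder names) prints one line per host and returns non-zero if any host failed, so the failed hosts can be revisited manually.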
This issue will be fixed in ESX 8.0.3 Patch 7 and in 9.0.1 and later.