Unusually high IO when vSphere Replication is enabled on vSAN clusters with TRIM/UNMAP enabled.
Article ID: 326772
Products
VMware vSAN
Issue/Introduction
This article describes a known issue and provides information on a fix.
Symptoms:
High IO is observed when vSphere Replication is enabled for VMs residing on vSAN datastore with the TRIM/UNMAP feature enabled.
Environment
VMware vSAN 7.0.x
VMware vSAN 6.x
Cause
When vSphere Replication runs in an environment with TRIM/UNMAP enabled, a disk addition or removal causes an unmap to be issued.
HBR then forwards the data from that disk to vSAN to perform the unmap. vSAN unmaps a maximum of 2TB at a time.
When a large disk (>2TB) is removed, the resulting unmap is throttled and takes several minutes to clean up the data before vSAN returns a response to HBR to remove the disk.
HBR then unmaps the entire disk (the demand log section) at one time. This results in a stunned/unresponsive VM.
Additional high UNMAP load is caused by whole-length demand log truncation, which is triggered after replication starts and again at each regular delta sync interval.
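The throttling described above can be illustrated with a short sketch. The 2TB-per-pass limit is taken from this article; the function name and chunking logic are illustrative assumptions, not actual vSAN internals:

```python
# Illustrative sketch of the per-pass unmap throttle described above.
# The helper name and ceiling-division approach are assumptions for
# illustration; this is not vSAN code.

TWO_TB = 2 * 1024**4  # assumed per-pass unmap limit (2 TiB)

def unmap_passes(disk_bytes: int, max_per_pass: int = TWO_TB) -> int:
    """Number of throttled unmap passes needed to reclaim a disk."""
    # Ceiling division: any remainder requires one more pass.
    return -(-disk_bytes // max_per_pass)

# Removing a 5 TiB disk would be split into three throttled passes,
# each of which must complete before HBR receives its response --
# which is why large disk removals take minutes and can stun the VM.
print(unmap_passes(5 * 1024**4))  # → 3
```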
Resolution
Upgrade vCenter Server and ESXi to version 7.0 Update 2c or higher.
Workaround:
Disable vSAN TRIM/UNMAP in clusters that are impacted.
There is no downtime or maintenance window required to disable the feature; it takes approximately 10-15 minutes to complete.
Please open a case with support if assistance is needed.
VM reboots are not required once the feature is disabled. However, as a precaution, we recommend rebooting the VMs at a convenient time.
Once UNMAP is disabled, you may see an increase in consumed space as deletes will no longer be reclaimed.
If you want the feature enabled again in the future, please be aware of the following:
UNMAP will only impact new deletes. Enabling it will not free up space from previously issued deletes.
Once re-enabled, your VMs will need to be rebooted for UNMAP to take effect.
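The space-accounting behavior in the notes above can be shown with a toy model (all names hypothetical; this is not how vSAN tracks space internally): deletes issued while UNMAP is disabled stay consumed, and re-enabling the feature only reclaims deletes issued afterwards.

```python
# Toy model of the reclamation behavior described above.
# Hypothetical class for illustration only, not vSAN code.

class DatastoreModel:
    def __init__(self):
        self.consumed = 0        # bytes reported as used
        self.unmap_enabled = True

    def write(self, nbytes: int):
        self.consumed += nbytes

    def delete(self, nbytes: int):
        # With UNMAP disabled, deleted blocks are not reclaimed,
        # so consumed space does not decrease.
        if self.unmap_enabled:
            self.consumed -= nbytes

ds = DatastoreModel()
ds.write(100)

ds.unmap_enabled = False   # workaround applied
ds.delete(40)              # not reclaimed: consumed stays at 100

ds.unmap_enabled = True    # feature re-enabled later
ds.write(20)
ds.delete(20)              # only this new delete is reclaimed

print(ds.consumed)  # → 100
```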
Additional Information
Impact/Risks:
Failure to address the issue can lead to performance issues with VMs on the datastore.
Utilizing the workaround process may result in higher space utilization on the datastore.