High Latency observed on the cluster while restoring backup of a VM using Rubrik

Article ID: 403717

Updated On:

Products

VMware vSAN

Issue/Introduction

While restoring a large VM with the Rubrik Backup Appliance, high latency is observed across all hosts and VMs in the cluster, and the entire cluster freezes for a time.

Environment

  • VMware vSAN 8.0.x
  • vSAN deduplication
  • Rubrik

Cause

For an extended period, the incoming write throughput exceeded the rate at which data could be de-staged from the cache tier to the capacity tier on a disk group. This filled the write buffer beyond the congestion threshold, at which point vSAN slows down incoming I/O to avoid overrunning the write buffer.

Understanding Congestion in vSAN
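
The cause described above can be illustrated with a toy model. This is a minimal sketch, not vSAN's actual congestion algorithm: the function name, the buffer size, the threshold fraction, and the throughput figures are all hypothetical values chosen for illustration. It only shows how a sustained gap between incoming and de-stage throughput fills a write buffer until a threshold is reached.

```python
def simulate_write_buffer(incoming_mbps, destage_mbps, buffer_mb,
                          threshold_fraction, seconds):
    """Toy model of a write buffer (all parameters are hypothetical).

    Returns the per-second buffer fill in MB and the first second at which
    the fill reaches the congestion threshold (None if it never does).
    Throttling itself is not modeled.
    """
    threshold_mb = buffer_mb * threshold_fraction
    fill = 0.0
    history = []
    first_congested = None
    for second in range(1, seconds + 1):
        # Net change per second is incoming minus de-staged data,
        # clamped between an empty and a full buffer.
        fill = min(buffer_mb, max(0.0, fill + incoming_mbps - destage_mbps))
        history.append(fill)
        if first_congested is None and fill >= threshold_mb:
            first_congested = second
    return history, first_congested


# Hypothetical restore: 500 MB/s incoming, 300 MB/s de-stage,
# 10,000 MB buffer, threshold at 50% of the buffer.
history, first_congested = simulate_write_buffer(500, 300, 10_000, 0.5, 60)
print("Threshold reached at second:", first_congested)
```

With these illustrative numbers the buffer gains 200 MB per second and reaches the threshold at second 25. If the de-stage rate matches or exceeds the incoming rate, the threshold is never reached, which is what the resolution options aim for.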

Resolution

The following actions, applied to the VM being restored, can help prevent congestion in this case:

  • Increase the stripe width in the storage policy (Change stripe width)
    • This is not guaranteed to improve the situation, because the additional stripes are likely to remain on the same host and disk group.
  • Change the storage policy to RAID 5 to spread more of the data write operations over additional host(s) (Define a storage policy & Different FTT types)
    • This will cause a vSAN resync operation to rebuild the VM objects in the new RAID 5 layout.
  • Disable vSAN deduplication to increase destaging speed (Disable deduplication)
    • This will result in a large vSAN resync operation, as all disk groups must be programmatically recreated in the new format.
  • Limit the IOPS in the storage policy (Changing IOPS in storage policy)
    • Test and adjust the limit to find the value that best maintains performance.
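
When testing IOPS limits, it can help to estimate the throughput a given limit allows. The sketch below assumes that vSAN weights I/O against a 32 KB base size when counting IOPS toward the limit (so a 64 KB I/O counts as two operations); verify this against the documentation for your version. The function names are hypothetical, and rounding partial multiples up is an assumption made for illustration.

```python
import math

# Assumption: base I/O size used to weight operations against the IOPS limit.
NORMALIZED_IO_KB = 32


def weighted_ios(io_size_kb):
    """Number of weighted operations one I/O of the given size counts as.

    Rounding up for sizes that are not a multiple of the base size is an
    assumption of this sketch.
    """
    return max(1, math.ceil(io_size_kb / NORMALIZED_IO_KB))


def max_throughput_mbps(iops_limit, io_size_kb):
    """Approximate throughput ceiling (MB/s) for an IOPS limit and I/O size."""
    actual_ops_per_second = iops_limit / weighted_ios(io_size_kb)
    return actual_ops_per_second * io_size_kb / 1024


# Hypothetical limit of 2000 IOPS with 64 KB restore writes.
print(max_throughput_mbps(2000, 64), "MB/s")
```

Under these assumptions, a limit of 2000 IOPS with 64 KB writes caps the restore at roughly 62.5 MB/s. Comparing that ceiling with the de-stage rate observed on the disk group gives a starting point for testing.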