Troubleshooting Storage Array Integrated VM Backups For Rubrik / Veeam

Article ID: 394775

Updated On: 04-18-2025

Products

VMware vSphere ESXi

Issue/Introduction

This article explains the workflow behind backups that use storage array integration, to assist in troubleshooting backup issues and failures. Backup providers such as Rubrik, Veeam, Cohesity, and others leverage this emerging technology. It requires a storage system with compatible array-based snapshot functionality; Pure Storage, NetApp, EMC, HP 3PAR, and others commonly support it.

In a normal workflow that leverages VMware snapshots for backups, a snapshot is taken and the VMDK disk data is then copied to the backup provider's repository via NFC or another method. While this copy runs, the snapshot disk for the VM is the active disk that receives all new writes, and it continues to grow until the backup completes and the call to delete the snapshot is sent. On very busy systems, these snapshots can grow rapidly.
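To put the growth concern in numbers, the following is a rough sketch in Python. The write rate and durations are hypothetical values chosen for illustration, and the formula is a simple upper bound that assumes every write lands on a block not yet present in the snapshot disk, so real growth is usually lower.

```python
def delta_growth_gib(write_rate_mib_s: float, lifetime_s: float) -> float:
    """Upper-bound growth of a snapshot delta disk, in GiB.

    Assumes every write touches a block not yet in the delta disk;
    rewrites of the same blocks would make real growth smaller.
    """
    return write_rate_mib_s * lifetime_s / 1024


# Hypothetical VM sustaining 50 MiB/s of writes.
full_copy = delta_growth_gib(50, 2 * 3600)  # snapshot held for a 2-hour copy
short_lived = delta_growth_gib(50, 10)      # snapshot held for about 10 seconds

print(f"2-hour snapshot:    up to {full_copy:.1f} GiB")
print(f"10-second snapshot: up to {short_lived:.2f} GiB")
```

Under these assumed numbers the long-lived snapshot can grow to roughly 350 GiB, while the short-lived one stays under 1 GiB.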

Storage array integrated backups minimize the time a VMware snapshot needs to exist by using the following workflow, shown here with Rubrik as the example:

1. Rubrik calls vCenter to create a snapshot of a virtual machine. vCenter hands this task off to the ESXi host, which generates the snapshot.

2. vCenter confirms snapshot creation back to Rubrik.

3. Rubrik calls the storage array to take a storage system snapshot of the LUN(s) associated with the VMDK disks of this virtual machine.

4. Storage array confirms snapshot is taken.

5. Rubrik calls vCenter to delete the VM snapshot that was just created. (Typically, VM snapshots live for about 6-10 seconds before deletion.)

6. Rubrik calls the storage array to present the snapshot as a new LUN, and map it to an ESXi host.

7. Storage array confirms creation of LUN and mapping.

8. Rubrik calls vCenter/ESXi to rescan and resignature the new volume (preventing a duplicate datastore UUID), add the VM targeted by the backup job to inventory from the snapshot-backed LUN, and copy the data to Rubrik's repository.

9. Once Rubrik has copied the data, it first calls vCenter to remove the VM from inventory and unmount the datastore.

10. Once step 9 is confirmed, Rubrik calls the storage array to unmap the LUN from the ESXi host and then delete it.

11. Once the unmap and deletion are confirmed, a storage rescan is performed on the ESXi host to clear the target LUN.
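The ordering of the steps above, and what is left behind when one of them fails, can be sketched in Python. This is a simplified model and not Rubrik's implementation: the `Step` class, `run_workflow`, and the resource names are hypothetical, and every action is a stub rather than a real vCenter or storage array call. Array snapshot retention is not covered in the steps above, so the sketch does not track it.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Step:
    name: str
    action: Callable[[], None]
    creates: Optional[str] = None   # temporary object this step leaves behind
    releases: Optional[str] = None  # temporary object this step cleans up


def run_workflow(steps: List[Step], release: Callable[[str], None],
                 log: List[str]) -> bool:
    """Run steps in order; on failure, release live objects newest first."""
    live: List[str] = []
    for step in steps:
        try:
            step.action()
        except Exception as exc:
            log.append(f"FAILED {step.name}: {exc}")
            for resource in reversed(live):
                try:
                    release(resource)
                    log.append(f"cleaned up {resource}")
                except Exception as cleanup_exc:
                    # A failed cleanup is what leaves an orphaned object.
                    log.append(f"ORPHANED {resource}: {cleanup_exc}")
            return False
        log.append(f"ok {step.name}")
        if step.creates:
            live.append(step.creates)
        if step.releases in live:
            live.remove(step.releases)
    return True


def noop() -> None:
    """Stub standing in for a successful vCenter or array call."""


def failing_copy() -> None:
    raise TimeoutError("host disconnected")  # simulated failure


steps = [
    Step("create VM snapshot", noop, creates="vm-snapshot"),
    Step("create array snapshot", noop),
    Step("delete VM snapshot", noop, releases="vm-snapshot"),
    Step("present snapshot as LUN and map to host", noop, creates="mapped-lun"),
    Step("rescan, resignature, register VM copy", noop, creates="vm-copy"),
    Step("copy data to backup repository", failing_copy),
    Step("unregister VM copy and unmount datastore", noop, releases="vm-copy"),
    Step("unmap and delete LUN", noop, releases="mapped-lun"),
    Step("rescan host storage", noop),
]

log: List[str] = []
released: List[str] = []
succeeded = run_workflow(steps, released.append, log)
print("\n".join(log))
```

In this run the copy step fails after the VM snapshot has already been deleted, so only the registered VM copy and the mapped LUN need cleanup. If the cleanup calls also failed, those two objects would be the orphans to look for.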

Resolution

With this much communication between multiple infrastructure components, unhealthy connections can disrupt the workflow and cause failures.

For example, host disconnects or high network/storage latency can cause jobs to fail and cleanup processes to be skipped, leaving orphaned snapshots, LUNs, or duplicate virtual machines.

In some cases, customers report that the VMware snapshot was not deleted for 10+ minutes even though it is expected to be deleted within seconds of creation. To investigate, determine when the creation and deletion calls occurred (hostd and vmware logs) to establish whether vSphere was slow to act on those requests. In some cases, if the storage system fails to deliver its part of the workflow, the backup software falls back to copying the backup data from the live virtual machine, which holds the snapshot in place until the backup completes.
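The timing check described above can be sketched in Python as follows. This sketch assumes each log line starts with a UTC timestamp such as `2025-04-18T10:15:02.100Z`. The `CREATE_SNAPSHOT` and `REMOVE_SNAPSHOT` markers and the sample lines are placeholders, not real hostd output; substitute whatever strings identify the snapshot create and remove tasks in the logs under review.

```python
from datetime import datetime
from typing import Iterable, Optional

TS_FORMAT = "%Y-%m-%dT%H:%M:%S.%fZ"  # assumed timestamp prefix format


def line_time(line: str) -> datetime:
    """Parse the leading timestamp token of a log line."""
    return datetime.strptime(line.split(maxsplit=1)[0], TS_FORMAT)


def snapshot_lifetime(lines: Iterable[str], create_marker: str,
                      remove_marker: str) -> Optional[float]:
    """Seconds between the first create call and the next remove call."""
    created = None
    for line in lines:
        if created is None and create_marker in line:
            created = line_time(line)
        elif created is not None and remove_marker in line:
            return (line_time(line) - created).total_seconds()
    return None  # no matching create/remove pair was found


# Placeholder lines for illustration only.
sample = [
    "2025-04-18T10:15:02.100Z CREATE_SNAPSHOT vm=app01",
    "2025-04-18T10:15:03.400Z unrelated entry",
    "2025-04-18T10:26:47.900Z REMOVE_SNAPSHOT vm=app01",
]

lifetime = snapshot_lifetime(sample, "CREATE_SNAPSHOT", "REMOVE_SNAPSHOT")
print(f"snapshot lived for {lifetime:.1f} seconds")
```

A lifetime far beyond the typical 6-10 seconds, together with a late remove call, suggests the backup software did not request deletion promptly, for example because it fell back to copying from the live virtual machine.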
