Troubleshooting Disaster Recovery as a Service (DRaaS)

Article ID: 320315

Products

VMware Cloud on AWS

Issue/Introduction

Symptoms:
This is the main KB article for Disaster Recovery as a Service (DRaaS) using VMware Site Recovery (VSR), which comprises both Site Recovery Manager (SRM) and vSphere Replication.

VMware Cloud (VMC)

Amazon Web Services (AWS)

Note: There are now two DRaaS solutions available for VMC on AWS. This KB is about VSR. Please see this blog post regarding VCDR (Datrium).

Resolution

Avoiding Problems using VSR with VMC on AWS

Note: Environments with VSR problems quite often have more than one of the issues below. So, please check the whole list, rather than stopping after addressing one issue. Thanks!
  • Ensure version compatibility: Compatibility and Heterogeneous Configurations across the Paired Sites
  • Use lower-case FQDNs, instead of short names or IP addresses, in your on-premises environment for connectivity to Site Recovery Manager and vSphere Replication.
  • A VPN tunnel or Direct Connect link must be accessible by vSphere Replication. For cloud-to-cloud, a transit gateway would also work.
  • During installation, both on-premises and in the cloud, try to use as many default settings as appropriate.
  • Change the management gateway DNS setting so that it resolves names through an on-premises DNS server.
  • Ensure VMC on AWS management gateway firewall is open, in both directions, for SRM/VR/vCenter. More details here.
  • The list of VMC ESXi host IP addresses is not expected to remain the same. Thus, any on-premises firewall must include the range, rather than specific IP addresses.
  • Firewall rules for SRM in VMC are per instance, based on the extension ID. So, after creating a new Site Recovery instance, also create new management gateway firewall rules for that extension ID. One inbound rule and one outbound rule for vSphere Replication, plus rules for each SRM instance, are needed.
  • Both DNS and network connectivity need to be working in order for DRaaS to operate correctly. Use FQDNs in the SDDC troubleshooting tab for Site Recovery, so that both connectivity and DNS resolution to on-premises systems are tested. The Site Recovery use case is only visible after activating Site Recovery in your cloud SDDC. Any on-premises firewall must allow DNS requests from the VMC management gateway address.
  • The VMC systems are already configured for NTP use. Please ensure NTP is both configured and operational on the following on-premises systems, and that their time aligns with the VMC time:
    • ESXi Hosts
    • vCenter Server
    • SRM Server
    • vSphere Replication Appliance
  • Granting AD users DRaaS permissions KB 81856
  • Ensure none of the systems have a duplicate IP address.
  • Ensure the VMC on AWS management network segment does not overlap with any network segment on-premises, even if there are no duplicate IP addresses.
  • VMware Site Recovery and VMware vCenter Server, as well as the workloads they are protecting, require infrastructure services like DNS, DHCP, NTP, and Active Directory. These must be in place at both the protected and recovery sites.
  • When you force break a pair, it only cleans up the data on one site, so you need to log in to the remote site and run the force break again.
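
As a rough pre-flight aid for two of the items above (lower-case FQDNs and non-overlapping network segments), a small POSIX shell sketch may help. The function names is_lowercase_fqdn and cidrs_overlap are hypothetical and not part of any VMware tool. The FQDN test is a simple pattern check, not full DNS validation, and the overlap test assumes plain IPv4 CIDR notation without leading zeros in the octets.

```shell
#!/bin/sh
# Hypothetical helpers for illustration only; not part of any VMware tooling.
LC_ALL=C; export LC_ALL   # make the [a-z] ranges below byte-based

# Succeeds when $1 looks like a lower-case FQDN: at least two dot-separated
# labels, only a-z, 0-9, "." and "-", and not a bare IPv4 address.
is_lowercase_fqdn() {
    case "$1" in
        *[!a-z0-9.-]*) return 1 ;;   # upper case or unexpected characters
        *.*) ;;                      # contains a dot, keep checking
        *) return 1 ;;               # short name (or empty string)
    esac
    case "$1" in
        .*|*.|*..*) return 1 ;;      # empty label
    esac
    case "$1" in
        *[a-z]*) return 0 ;;         # has letters, so not an IP address
        *) return 1 ;;               # digits and dots only
    esac
}

# Prints the IPv4 address $1 as a 32-bit integer.
ip_to_int() (
    IFS=.
    set -- $1
    echo $(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 ))
)

# Succeeds when the CIDR blocks $1 and $2 share at least one address.
# Two blocks overlap exactly when they agree on the shorter of the two prefixes.
cidrs_overlap() (
    a_len=${1#*/}
    b_len=${2#*/}
    len=$a_len
    [ "$b_len" -lt "$len" ] && len=$b_len
    shift_bits=$(( 32 - len ))
    a=$(ip_to_int "${1%/*}")
    b=$(ip_to_int "${2%/*}")
    [ $(( a >> shift_bits )) -eq $(( b >> shift_bits )) ]
)
```

For example, with made-up values, is_lowercase_fqdn srm01.corp.example.com succeeds while is_lowercase_fqdn SRM01 fails, and cidrs_overlap 10.2.0.0/16 10.2.32.0/20 succeeds, which would indicate that the two segments need to be redesigned.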

 



Known Limitations
  • If your protected site is down, how will you access the recovery site network to run the recovery plan? Please consider this question in your planning and network design.
  • HCX DR, HCX bulk migration, and HCX Replication Assisted vMotion all use vSphere Replication "under the hood", so vSphere Replication cannot also be used on the same VM.
  • DRaaS is not compatible with vCenter Cloud Gateway Appliance.
  • Using quiescing for both vSphere Replication and backups on the same VM is not supported, because concurrent quiescing operations can collide.
  • VMs with shared vDisks are not compatible with vSphere Replication.
  • SRM has the same character limitations for passwords as vCenter.
  • Before vSphere Replication 8.3, the vDisk size could not be changed without first removing replication. Please see What's New April 24, 2020 below.
  • Site Recovery Manager server call-out scripts and Pre-Power On Steps on the recovery Site Recovery Manager server on VMware Cloud on AWS are not supported.
  • Operational Limits of VMware Site Recovery
  • Limitations of Using VMware Site Recovery in a Multi-Site Topology
  • It is normal for VAMI access to fail for VSR appliances, because these are managed by VMware as part of the VMC service.
  • It is not supported to make changes within the guest of a VMware appliance. KB 2090839
 

Troubleshooting Tactics
  • If activation fails, it could be that some back end process has simply not completed. Please wait 15 minutes, then try again.
  • If there is a permissions problem when an operation is run by cloudadmin@vmc.local, then remove any extra permissions that have been added to this account. Such additions break the inheritance of DRaaS permissions.
  • If an operation fails when run by a user account, yet succeeds when run by cloudadmin@vmc.local, then focus troubleshooting on that user account's settings.
  • Test port 443 connectivity to VMC systems from on-premises systems using the following steps. 
1) To get the FQDN of the DRaaS appliances, go to the vmc.vmware.com console for this SDDC, then look in the "Add Ons" tab. The "OPEN SITE RECOVERY" link goes to the vSphere Replication FQDN.
2) Run the following curl commands from the on-premises appliances (vCenter, SRM, and VRMS) to check DNS, routing, firewall, and connectivity. Each command should report that it connected, show some remote certificate information, then display what looks like the source of a small web page.
curl --verbose https://vcenter.sddc-##-##-##-##.vmwarevmc.com/
curl --verbose https://vr.sddc-##-##-##-##.vmwarevmc.com/
  • As a troubleshooting step, bypass any load balancer to rule out possible issues with that device.
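
When the port 443 test above has to be repeated from several appliances, it can be wrapped in a small script. The following is a minimal sketch, not a VMware-provided tool: the function names check_https and check_all and the "OK"/"FAIL" summary format are hypothetical. It reports only whether curl could complete an HTTPS request, so use the --verbose commands above when you need to inspect the certificate or the returned page.

```shell
#!/bin/sh
# Hypothetical wrapper around the curl connectivity test; illustrative only.

# Prints "OK <fqdn>" when curl completes an HTTPS request to the host,
# otherwise prints "FAIL <fqdn>" and returns non-zero. A failure can mean a
# DNS, routing, firewall, or TLS problem; curl's own error goes to stderr.
check_https() {
    if curl --silent --show-error --output /dev/null --max-time 15 "https://$1/"; then
        echo "OK $1"
    else
        echo "FAIL $1"
        return 1
    fi
}

# Checks every FQDN passed as an argument.
# The return status is the number of hosts that failed.
check_all() {
    failures=0
    for fqdn in "$@"; do
        check_https "$fqdn" || failures=$((failures + 1))
    done
    return "$failures"
}
```

A typical call would be check_all followed by the vCenter and vSphere Replication FQDNs of the SDDC, that is, the same names used in the curl commands above with the ## placeholders filled in.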
 

Solutions for Common vSphere Replication Problems
Such as vCenter IP changes, etc.

When vSphere Replication detects that the VM is not supported for quiescing, then it disables that option in the GUI.

Note: Every time the amount of data changing in the VM is larger than normal, there will be an RPO violation because the sync took longer than expected. It is only a problem if the same VM gets RPO violations every sync. Then it could be any of the problems listed here: vSphere Replication RPO Violations
 

During an SDDC upgrade, DRaaS is not upgraded, as that is a separate process. The SDDC Upgrades and Maintenance document includes a section that details the impact of updates on VMware Site Recovery. Also, after phase one of the update, SDDC vCenters have new certificates and services will show a "not connected" state. Please follow KB 78499 so that the on-prem site accepts the new VMC vCenter/PSC certificate.
 

Relevant KBs regarding vDisk size:
  • Please also see the What's New April 24, 2020 note in the related information.
  • VM sync not starting only for a particular VM KB 2061047. This process removes the target data, so, for larger VMs, you may want to use the below KB instead.
  • Resize vSphere Replication Protected Virtual Machine Disk Files KB 77104
 

Relevant KBs regarding incompatible deployments:
  • Unable to create protection group. VRM Server 'vcenter.sddc-##-##-##-##.vmwarevmc.com' is not connected to its paired VRM Server. KB 81291
  • vSphere Replication service is unavailable during site pairing with VMC KB 80788
  • Granting AD users DRaaS permissions KB 81856
 

Relevant KBs regarding DRaaS issues caused by problems on-premises:
  • Unable to reverse replication for the virtual machine in DRaaS KB 79463
  • Unable to complete SRM Site Pairing -- Operation timed out: 300 seconds. KB 80885
  • Checking and clearing VM replication status. KB 2106946
 

Relevant KBs with benign symptoms:
  • Backup proxy stops replicating with message: A new disk was added to a replicated virtual machine. Replication will be paused until the new disk is configured for replication. KB 79247
  • Site Recovery activated, but SRM server not accessible KB 79283
  • Failed to connect to Lookup Service -- invocation failed with org.apache.http.conn.ConnectTimeoutException KB 79066
 

Known DRaaS issues from SRM release notes:
  • vCenter Server shows a warning for expiring evaluation license of an on-premises Site Recovery Manager instance even when paired with a Site Recovery Manager instance in VMware Cloud on AWS
When you pair your on-premises instance of Site Recovery Manager with a Site Recovery Manager instance in VMware Cloud on AWS, the Site Recovery Manager server uses the cloud license. 
 
Workaround: When the on-premises instance of Site Recovery Manager is paired with a cloud site, you can ignore the warning for expiring on-premises license.
 
  • The Summary page of the Site Recovery user interface displays an Unable to retrieve vSphere Replication summary data error message
In SDDC Version 1.8 and later, the vCenter Server Extension Manager does not return vSphere Replication data for users in the Cloud Admin group. When you open the Summary page of a site pairing, the value for the Domain Name/IP for the vSphere Replication appliance is blank and the following error message appears: Unable to retrieve vSphere Replication summary data.
 
Workaround: Ignore the error.
 




Additional Information

As a troubleshooting reference for other VMC on AWS issues, please see KB 77167
 

Product documentation for vSphere Replication

vSphere Replication overview -- This technical overview is well worth the short read. It is high-level, so it is good for managers, prospective users, and a good starting point for anyone ramping to support DRaaS.

vSphere Replication documentation

Understanding vSphere Replication Synchronization Types

vSphere Replication FAQ

vSphere Replication Target Storage Consumption

vSAN Storage Policy Assigned to vSphere Replication Replicas + KB 79833 as a workaround

To get an approximation of the needed vSphere Replication bandwidth, you can use the vSphere Replication Calculator.
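
For a quick sanity check before opening the calculator, the average bandwidth can be roughed out by hand. The sketch below is an assumption-laden back-of-the-envelope estimate, not the calculator's formula: it divides the data changed during one RPO window by the length of that window, and it ignores compression, protocol overhead, bursts, and the initial full sync. The function name estimate_mbps is hypothetical.

```shell
#!/bin/sh
# Hypothetical back-of-the-envelope estimate; not the vSphere Replication
# Calculator's method.

# Prints the average bandwidth in Mbit/s (rounded up) needed to transfer
# $1 megabytes of changed data within an RPO of $2 minutes.
estimate_mbps() {
    changed_mb=$1
    rpo_minutes=$2
    megabits=$(( changed_mb * 8 ))        # 1 byte = 8 bits
    seconds=$(( rpo_minutes * 60 ))
    # Integer ceiling division so the estimate never rounds down.
    echo $(( (megabits + seconds - 1) / seconds ))
}
```

For example, estimate_mbps 4500 60 prints 10: a VM set that changes 4,500 MB per hour with a 60-minute RPO needs roughly 10 Mbit/s on average, before any headroom is added.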
 

Product documentation for DRaaS

Introduction to DRaaS (15-minute read)

OR, more detailed overview of DRaaS (27-minute read)

To compare and contrast DR solutions: Designing a VMware Cloud on AWS Disaster Recovery Solution (11-minute read)

For design considerations when configuring DRaaS with VSR: Designing a VMware Cloud on AWS Disaster Recovery Plan (18-minute read)

VMware Site Recovery Documentation

Site Recovery in an on-premises to cloud environment

Site Recovery in a cloud to cloud environment

VMware Site Recovery in a Multi-Site Topology

DRaaS FAQ (VMware Site Recovery)

DRaaS Roadmap (filter by “Disaster Recovery”)
 


Social media

@VMwareSRM

SRM community VMware Technology Network

Blog posts for all the disaster recovery options

 

Hands-on Labs

https://labs.hol.vmware.com

HOL-2187-02-ISM - VMware Cloud on AWS - Key Use Cases

The above HOL includes a module to simulate setup, replication, and running recovery plans for DRaaS. If you have used these simulations before, you know that you do not actually need to type the values for the fields because it will fill in the correct values regardless of what keys you press.
 

DRaaS digest from VMC on AWS release notes

What's New April 14, 2021

Faster re-protect:
Re-protect your virtual machines significantly faster after a planned recovery. The re-protection operation is especially quick when run shortly after the planned recovery such that the delta between the data on the source and recovery sites is not large. VMware Site Recovery now automatically starts tracking changes on the recovered virtual machine after failover. Only those changes are then replicated to the original protected site when re-protect is run and checksum comparisons can be completely avoided. This capability requires at least vSphere 7.0 Update 2 in your on-premises environment and VMware Cloud on AWS SDDC version 1.14. vSphere Replication 8.4 is also required in both sites.

What's New December 11, 2020 (SDDC Version 1.12v3)

Reduced time needed for reprotect:
The time needed for reprotecting virtual machines after a planned recovery with VMware Site Recovery has been reduced significantly. The reduction in time for reprotecting virtual machines is the largest when the delta between the data on the source site and recovery site is not large. This feature works for cloud-to-cloud DR topology and vSphere Replication on your VMware Cloud on AWS SDDC should be on version 8.3.2 or higher. You can read more about reprotecting virtual machines after a recovery in the VMware Site Recovery documentation.

What's New December 4, 2020 (SDDC Version 1.13)

Minimize security risks by enabling network encryption:
You can enable the network encryption of the replication traffic data for new and existing replications to enhance the security of data transfer. When the network encryption is enabled for a replication, an agent on the source encrypts the replication data on the source ESXi host and sends it to the vSphere Replication appliance on the target site. The vSphere Replication server decrypts the data and sends it to the target datastore. For more information about network encryption, see Network Encryption of Replication Traffic.

What's New June 25, 2020 

Multiple Points in time recovery:
This feature allows the vSphere Replication administrator to configure the retention of replicas from multiple points in time. After a recovery, vSphere Replication presents the retained instances as ordinary virtual machine snapshots. Each replica is a Point in Time (PIT) to which you can revert the virtual machine. You can recover virtual machines at different points in time (PIT), such as the last known consistent state. You can configure the number of retained instances on the Recovery Settings page of the replication configuration wizards. You can view details about the currently retained instances in the replication details panel for a specific replication in vSphere Replication Outgoing and Incoming views.

What's New April 24, 2020 

Seamless disk re-sizing with vSphere Replication for VMware Site Recovery
Seamless disk re-sizing allows customers to increase the virtual disks of virtual machines that are configured for replication, without interruption of ongoing replication. The virtual disk on the target site will be automatically resized. For more information about the feature, see Increasing the Size of Replicated Virtual Disks.

What's New January 16, 2020 (SDDC Version 1.9)

VMware Site Recovery
vSphere Replication Configuration Import/Export Tool: VMware Site Recovery™ now offers vSphere Replication Configuration Import/Export Tool, which can be used to export and import configuration data of replications in vSphere Replication. If you plan to migrate vSphere Replication configuration to a different host, you can use the tool to export replication settings and the related objects into an XML file. You can then import the configuration data from the previously exported file. You can find more details about the tool in VMware Site Recovery documentation covering Exporting and Importing Replication Groups Configuration Data.

There are new known issues for DRaaS as a part of this release. Please visit the VMware Site Recovery Release Notes for more information.

What's New October 21st, 2019

VMware Site Recovery™ now supports replication of up to 1,500 virtual machines to a single target VMware Cloud™ on AWS Software Defined Data Center (SDDC), allowing you to protect larger environments. For more details, see Operational Limits of Site Recovery Manager in the VMware Site Recovery documentation.

What's New June 3rd, 2019 

VMware Site Recovery
Enhancements to Site Recovery UI
Includes ability to import/export configuration, view capacity information in Protection Groups Datastores tab, monitor target datastores in the replication details pane and switch to a dark theme.
 

How To Get Help -- if the above steps and documentation have not resolved your issue.

For consultation, queries, and operational assistance:
For Technical Support troubleshooting of errors and faults -- guidance for opening a support request (SR) with VMware Technical Support:
  • Information to gather:
    • Problem description with error message or fault.
    • What operation was being attempted when the failure occurred?
    • When taking screenshots of error messages, etc., always include the URL in the image.
    • Note date, time, and time zone of each important step, error, or fault.
  • Methods to open an SR:

After the SR is created:
  • VMC on AWS has the logs for the cloud side, so please provide the on-premises logs, and specify if the on-premises site is being used as the protected site or the recovery site. KBs for log collection:
    • For issues running a recovery plan, etc., SRM logs
    • For issues with replication, vSphere Replication logs
    • For issues with authentication, etc., vCenter logs
  • VMware's server for case log uploads offers several upload options. Please see the following VMware Knowledge Base (KB) article for upload instructions. The KB also includes troubleshooting steps for common upload issues.
 
Note: File listing is disabled to address security concerns, so you will not be able to see the files you have uploaded.