[VMC] Troubleshooting Disaster Recovery as a Service (DRaaS)

Article ID: 320315

Products

VMware Cloud on AWS
VMware Site Recovery Manager

Issue/Introduction

Understanding and troubleshooting DRaaS using VMware Site Recovery (VSR). This includes both Site Recovery Manager (SRM) and vSphere Replication. 

Note: There are two DRaaS solutions for VMC on AWS. This article focuses on VSR. 

Environment

VMC on AWS
VMware Site Recovery

Resolution

Avoiding Problems using VSR with VMC on AWS

  • Ensure version compatibility: Compatibility and Heterogeneous Configurations across the Paired Sites
  • Use lower-case FQDNs instead of short names or IP addresses in your on-premises environment for connectivity to Site Recovery Manager and vSphere Replication.
  • A VPN tunnel or Direct Connect link must be accessible by vSphere Replication. For cloud-to-cloud, a transit gateway would also work.
  • During installation, both on-premises and in the cloud, try to use as many default settings as appropriate.
  • Change the management gateway DNS setting so that it resolves names through an on-premises DNS server.
  • Ensure VMC on AWS management gateway firewall is open, in both directions, for SRM/VR/vCenter. More details here.
  • The list of VMC ESXi host IP addresses is not expected to remain the same. Any on-premises firewall rule must reference the address range rather than specific IP addresses.
  • Firewall rules for SRM in VMC are per instance and based on the extension ID. After creating a new Site Recovery instance, create new management gateway firewall rules for that extension ID. One inbound rule and one outbound rule for vSphere Replication, plus rules for each SRM instance, are needed.
  • Both DNS and network connectivity need to be working in order for DRaaS to operate correctly. Use FQDNs in the SDDC troubleshooting tab for Site Recovery so that both connection and DNS tests to on-premises systems are tested. The Site Recovery use case is only visible after activating Site Recovery in the cloud SDDC. Any on-premises firewall must allow DNS requests by the VMC management gateway address.
  • The VMC systems are already configured for NTP use. Ensure NTP is both configured and operational on the following on-premises systems and that their time aligns with the VMC time:
    • ESXi Hosts
    • vCenter Server
    • SRM Server
    • vSphere Replication Appliance
  • Granting AD users DRaaS permissions.
  • Ensure none of the systems have a duplicate IP address.
  • Ensure the VMC on AWS management network segment does not overlap with any network segment on-premises, even if there are no duplicate IP addresses.
  • VMware Site Recovery and VMware vCenter Server, as well as the workloads they are protecting, require infrastructure services like DNS, DHCP, NTP, and Active Directory. These must be in place at both the protected and recovery sites.
  • A force break of a site pair only cleans up the data in one site. Log in to the remote site and run the force break there as well.
  • Whenever the amount of data changing in a VM is larger than normal, the sync takes longer than expected and an RPO violation is reported. This is only a problem if the same VM gets an RPO violation on every sync.
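As a rough illustration of the last point, the sketch below estimates how long a sync would take for a given amount of changed data and compares it with the RPO. It is a back-of-the-envelope calculation only: the function names and the example figures are hypothetical, and it ignores compression, protocol overhead, and how vSphere Replication actually schedules syncs.

```shell
#!/usr/bin/env bash
# Rough estimate only; all names and numbers here are illustrative assumptions.

# sync_seconds CHANGED_MB LINK_MBPS
# Prints the seconds needed to send CHANGED_MB megabytes over a link
# of LINK_MBPS megabits per second, rounded up.
sync_seconds() {
  local changed_mb=$1 link_mbps=$2
  # megabytes -> megabits (x8), then integer division rounded up
  echo "$(( (changed_mb * 8 + link_mbps - 1) / link_mbps ))"
}

# rpo_at_risk CHANGED_MB LINK_MBPS RPO_MINUTES
# Returns success (0) when the estimated sync time exceeds the RPO window.
rpo_at_risk() {
  local secs
  secs=$(sync_seconds "$1" "$2")
  (( secs > $3 * 60 ))
}

# Example usage (hypothetical figures: 18000 MB changed, 100 Mbps link, 15 minute RPO):
#   sync_seconds 18000 100        # prints 1440
#   rpo_at_risk 18000 100 15 && echo "sync likely to exceed the RPO"
```

A single large change that trips this check once is expected behavior; a VM that trips it on every cycle points to an RPO that is too tight for the available bandwidth or change rate.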

Note: With VSR problems, one or more of the above issues can be present in the environment. Check the entire list before addressing just one issue.
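Two items in the list above lend themselves to a quick scripted check: whether a name is a lower-case FQDN (rather than a short name or an IP address), and whether two IPv4 network segments overlap. The following is a minimal sketch, assuming bash 4 or later. The function names, host names, and CIDR ranges are made up for illustration and do not come from any VMware tool.

```shell
#!/usr/bin/env bash
# Illustrative pre-flight helpers; names and example values are hypothetical.

# is_lowercase_fqdn NAME
# Succeeds only if NAME contains a dot, is not a bare IPv4 address,
# and has no upper-case characters.
is_lowercase_fqdn() {
  local name=$1
  [[ $name == *.* ]] || return 1          # short name, no domain part
  [[ $name =~ ^[0-9.]+$ ]] && return 1    # looks like an IPv4 address
  [[ $name == "${name,,}" ]]              # unchanged when lower-cased
}

# ip_to_int A.B.C.D -> prints the address as a 32-bit integer
ip_to_int() {
  local IFS=. a b c d
  read -r a b c d <<< "$1"
  echo "$(( (a << 24) | (b << 16) | (c << 8) | d ))"
}

# cidr_range A.B.C.D/N -> prints "first last" addresses as integers
cidr_range() {
  local ip=${1%/*} prefix=${1#*/} base mask
  base=$(ip_to_int "$ip")
  mask=$(( (0xFFFFFFFF << (32 - prefix)) & 0xFFFFFFFF ))
  echo "$(( base & mask )) $(( (base & mask) | (~mask & 0xFFFFFFFF) ))"
}

# cidrs_overlap CIDR1 CIDR2
# Succeeds if the two IPv4 ranges share at least one address.
cidrs_overlap() {
  local s1 e1 s2 e2
  read -r s1 e1 <<< "$(cidr_range "$1")"
  read -r s2 e2 <<< "$(cidr_range "$2")"
  (( s1 <= e2 && s2 <= e1 ))
}

# Example usage (hypothetical values):
#   is_lowercase_fqdn "srm01.corp.example.com" || echo "use a lower-case FQDN"
#   cidrs_overlap "10.2.0.0/16" "10.2.32.0/20" && echo "segments overlap"
```

The overlap test compares address ranges rather than individual hosts, which matches the guidance that segments must not overlap even when no IP address is duplicated.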


Known Limitations

  • If the protected site is down, how will you access the recovery site network to run the recovery plan? Consider this question in your planning and network design.
  • HCX DR, HCX bulk migration, and HCX Replication Assisted vMotion all use vSphere Replication "under the hood" for their operations. vSphere Replication cannot also be used on the same VM.
  • DRaaS is not compatible with vCenter Cloud Gateway Appliance.
  • Using quiescing for both vSphere Replication and backups on the same VM is not supported, because the two quiescing operations can run at the same time and collide.
  • When vSphere Replication detects that the VM is not supported for quiescing, it disables that option in the GUI.
  • VMs with shared vDisks are not compatible with vSphere Replication.
  • SRM has the same character limitations for passwords as vCenter.
  • In versions earlier than vSphere Replication 8.3, the vDisk size cannot be changed without first removing replication.
  • Site Recovery Manager server call-out scripts and Pre-Power On Steps on the recovery Site Recovery Manager server on VMware Cloud on AWS are not supported.
  • Operational Limits of VMware Site Recovery
  • It is normal for VAMI access to fail for VSR appliances because these are managed by Broadcom as part of the VMC service.
  • Making changes within the guest of a VMware appliance is not supported. See: VMware Virtual Appliances and customizations to operating system and included packages

Troubleshooting Tactics

  • If activation fails, it could be that a backend process has not completed. Wait 15 minutes and then try again.
  • If there is a permissions problem when an operation is run by "[email protected]", remove any extra permissions that have been added to this account. Such additions break the inheritance of DRaaS permissions.
  • If an operation fails when run by a user account, but succeeds when run by "[email protected]", focus troubleshooting on that user account's settings.
  • Test port 443 connectivity to VMC systems from on-premises systems using the following steps. 
1) To get the FQDN of the DRaaS appliances, go to the VMC console for the SDDC, then look in the "Add Ons" tab. The "OPEN SITE RECOVERY" link goes to the vSphere Replication FQDN.
2) Run the following curl commands from the on-premises appliances (vCenter, SRM, and VRMS) to check DNS, routing, firewall, and connectivity. Each command should report that it connected, show some remote certificate information, and then display what looks like the source of a small web page.
curl --verbose https://vcenter.sddc-##-##-##-##.vmwarevmc.com/
curl --verbose https://vr.sddc-##-##-##-##.vmwarevmc.com/
  • As a troubleshooting step, bypass any load balancer to rule out possible issues with that device.
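When the port 443 test above fails, the curl exit status gives a hint about which layer to look at first. The sketch below maps a few well-known curl exit codes to the likely cause. The function names and the wording of the messages are illustrative assumptions, and the host name passed to `check_https` is whatever FQDN was obtained in step 1.

```shell
#!/usr/bin/env bash
# Illustrative helper; function names and message text are hypothetical.

# classify_curl_exit CODE
# Prints a short hint for a curl exit status.
classify_curl_exit() {
  case "$1" in
    0)  echo "ok: connection to port 443 succeeded" ;;
    6)  echo "dns: could not resolve host" ;;
    7)  echo "connect: failed to connect (check routing and firewall)" ;;
    28) echo "timeout: operation timed out (check routing and firewall)" ;;
    35) echo "tls: SSL/TLS handshake failed" ;;
    60) echo "tls: server certificate could not be verified" ;;
    *)  echo "other: see the curl man page for exit code $1" ;;
  esac
}

# check_https HOST
# Runs a quiet HTTPS request against HOST and prints the classified result.
check_https() {
  local host=$1 rc=0
  curl --silent --output /dev/null --connect-timeout 10 --max-time 20 \
    "https://${host}/" || rc=$?
  printf '%s -> %s\n' "$host" "$(classify_curl_exit "$rc")"
}

# Example usage, run from each on-premises appliance (replace the
# placeholders with the FQDNs found in the VMC console):
#   check_https "vcenter.sddc-##-##-##-##.vmwarevmc.com"
#   check_https "vr.sddc-##-##-##-##.vmwarevmc.com"
```

This complements the `--verbose` runs rather than replacing them: the verbose output is still the place to read the certificate details and the returned page.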

SDDC Upgrades

  • DRaaS is not upgraded during an SDDC upgrade. Upgrading DRaaS is a separate process.
  • SDDC Upgrades and Maintenance includes a section that details the impact of updates on VMware Site Recovery.
  • After phase one of the update, SDDC vCenters will have new certificates and services will show a "not connected" state.
  • Follow the steps in "SRM actions after SDDC upgrade or certificate updates" so that the on-prem site accepts the new VMC vCenter/PSC certificate.

Additional Information

Documentation regarding incompatible deployments:

Documentation regarding issues caused by on-premise:

Documentation regarding benign symptoms:

Release notes: