vSphere Replication Troubleshooting

Article ID: 308573

Products

VMware Live Recovery, VMware vSphere ESXi, VMware Cloud on AWS

Issue/Introduction

The purpose of this article is to assist with the troubleshooting of vSphere Replication issues and contains many frequently asked questions.

Resolution

What is Host Based Replication (HBR)?

ESXi hosts running version 5.x and later include a filter module, the HBR filter. Its purpose is to push virtual machine replication data to the vSphere Replication appliance(s). To confirm that the filter is loaded, run this command:

# vmkload_mod -l

The list of loaded modules in the output should include:

hbr_filter

Because this filter runs on the ESXi host, note that the HBR filter uses hostd resources and has its own command set within vim-cmd.

Note: It does not yet have an esxcli equivalent.

These commands are available to manage HBR on an ESXi host.

The commands under hbrsvc/ are:

vmreplica.abort

vmreplica.create

vmreplica.disable

vmreplica.diskDisable

vmreplica.diskEnable

vmreplica.enable

vmreplica.getConfig

vmreplica.getState

vmreplica.pause

vmreplica.queryReplicationState

vmreplica.reconfig

vmreplica.resume

vmreplica.startOfflineInstance

vmreplica.stopOfflineInstance

vmreplica.sync

Usage:

To use these commands, you must first acquire the VM ID of the virtual machine you wish to troubleshoot.

The VM ID is used to uniquely identify this virtual machine on an ESXi host.

To get the VM ID, run this command:

# vim-cmd vmsvc/getallvms
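
As a minimal sketch, the VM ID lookup can be scripted by filtering the getallvms output. The function name vmid_by_name and the VM name web01 are hypothetical, and the sketch assumes that the VM name contains no whitespace, so that the name is the second whitespace-separated field.

```sh
#!/bin/sh
# Print the VM ID of every VM whose Name column equals the given name.
# Reads the output of `vim-cmd vmsvc/getallvms` on stdin.
# Assumption: the VM name has no spaces, so it is field 2 and the ID is field 1.
vmid_by_name() {
    # NR > 1 skips the header line (Vmid, Name, File, ...).
    awk -v name="$1" 'NR > 1 && $2 == name { print $1 }'
}

# Usage on an ESXi host (web01 is a hypothetical VM name):
#   vim-cmd vmsvc/getallvms | vmid_by_name web01
```

If the name contains spaces, read the ID from the full getallvms listing instead.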

Once you have the VM ID you can query the replication state of a chosen virtual machine by running this command:

# vim-cmd hbrsvc/vmreplica.getState VMID

For example, if the VM ID is 1:

/vmfs/volumes/51cb2399-2692ecca-8682-000c299d035f/VM # vim-cmd hbrsvc/vmreplica.getState 1

If replication is not configured on this virtual machine, you see an output similar to:

Retrieve VM running replication state:

(vim.fault.ReplicationVmFault) {

dynamicType = ,

faultCause = (vmodl.MethodFault) null,

reason = "notConfigured",

state = ,

instanceId = ,

vm = 'vim.VirtualMachine:1',

msg = "vSphere Replication operation error: Virtual machine is not configured for replication.",

}
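
The fault output above can be checked from a script by extracting its reason field. The following is a sketch only: the function name replication_fault_reason is hypothetical, and it assumes the reason line has the same layout as in the output shown above.

```sh
#!/bin/sh
# Print the value of the `reason = "..."` field from getState output on stdin.
# Prints nothing when the output contains no such field.
replication_fault_reason() {
    sed -n 's/^[[:space:]]*reason = "\([^"]*\)".*/\1/p'
}

# Usage on an ESXi host, with VM ID 1 as in the example above:
#   reason=$(vim-cmd hbrsvc/vmreplica.getState 1 2>&1 | replication_fault_reason)
#   [ "$reason" = "notConfigured" ] && echo "VM 1 is not configured for replication"
```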

What does the vSphere Replication Appliance do?

The HBR agent runs on the ESXi host and is responsible for sending changed data from a running virtual machine to the vSphere Replication appliance at the DR site. It pushes the changes across the network to the vSphere Replication appliance. When the vSphere Replication appliance receives the changes at the remote site, it applies them to the replica virtual machine disks.

The vSphere Replication appliance is also responsible for managing replication, which gives the administrator visibility of the virtual machine protection status. It also gives the ability to recover virtual machines with a few simple clicks.

Using the vim-cmd hbrsvc/vmreplica commands covered in the first section, you can also report the replication status and, if required, force a sync directly from the host CLI. This is preferable for troubleshooting issues, as the appliance can sometimes end up out of sync with the running jobs.

In vSphere Replication, my replication jobs are being reported as not active

Checklist

Is the virtual machine powered on?
Just one replication job, or many?
Verify the replication state of the virtual machine directly on the ESXi host CLI
- If it is still active, wait for the replication job to finish then refresh the client.
- If the virtual machine replication is not in a running state, try to perform a sync using this command, where VMID is the VM ID of the virtual machine:
  
  # vim-cmd hbrsvc/vmreplica.sync VMID
Check whether the replication job has ever completed successfully. If not, the cause is most likely a port issue (initial replication and ongoing replication use two separate ports). For more information, see vCenter Server and ESXi Server network port requirements for Site Recovery Manager, Port numbers that must be open for vSphere Replication.

Cannot replicate virtual machine as there is another virtual machine with the same instance UUID

In some rare cases the vSphere Replication Management Server (VRMS) database may be left with replication data only at one end (primary or secondary site) and not at both as usual. This will cause further attempts to configure replication for the same virtual machine to fail.

When you try to replicate the virtual machine, you get an error message:

There is another virtual machine 'vm_name' that has the same instance UUID 'xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx' as the one that you try to configure

To resolve duplicate instance UUID problems:

Get the VM ID:

# vim-cmd vmsvc/getallvms
Check the replication state:

# vim-cmd hbrsvc/vmreplica.getState VMID
Next you need to find the replication group managed object id value. You can find this in hostd.log file on the source ESXi server by running this command:

# grep -i "Hbrsvc" /var/log/hostd.log | less
The hostd.log file entry looks similar to:

T [FFB47D20 info 'Hbrsvc'] ReplicationGroup initialized replication successfully (state=inactive) (groupID=GID-e3226007-XXXX-XXXX-XXXX-7d183d5a8c23)
Using a browser, open the VRMS Web UI (at the primary or secondary site) and log in to the Managed Object Browser (MOB) with vCenter Server administrator credentials:

https://:8043/mob/?

Notes:
- In VR 5.1 you need to add the &vmodl=1 switch to the end of the URL.
- The login and password are VSPHERE.LOCAL\Administrator and VC_SSO_password respectively.
Using the GID you found in the hostd.log file, navigate to the entry for the replication job that is failing. This is the URL format:

https://:8043/mob/?moid=GID-xxxx
Select destroy and then click Invoke Method.
Restart the HMS service on the VR appliance by running this command from the VR appliance CLI:

# service hms restart
If this does not successfully remove the GID, run this command on the primary ESXi host to remove the replica mapping:

# vim-cmd hbrsvc/vmreplica.disable VMID
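
The group ID lookup in hostd.log from the steps above can be narrowed down with a small filter. This is a sketch under the assumption that every relevant entry carries the ID in the form (groupID=GID-...), as in the sample entry; the function name extract_group_ids is hypothetical.

```sh
#!/bin/sh
# Print each distinct replication group ID found in log lines on stdin.
# Matches the "(groupID=GID-...)" form shown in the sample hostd.log entry;
# the first letter is matched in either case.
extract_group_ids() {
    sed -n 's/.*([Gg]roupID=\(GID-[0-9A-Za-z-]*\)).*/\1/p' | sort -u
}

# Usage on the source ESXi host:
#   grep -i "Hbrsvc" /var/log/hostd.log | extract_group_ids
```

When several groups are listed, match the ID against the virtual machine in question before destroying anything in the MOB.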

In some cases, this issue is caused by a network problem on a replication site where the replication configuration of a virtual machine has been removed, which leaves the two VR databases inconsistent with each other. To resolve this issue, see Enabling Replication for a virtual machine may fail due to stale replication group GIDs in the VRM database (312696).

Notes:

- The VR MOB is case sensitive for the credentials passed. For example, cloud\srm fails while CLOUD\srm succeeds. The vCenter Server MOB is not case sensitive.
- To check the exact account in use by the vCenter Server, open this URL:
  
  https://VC_IP_address/mob/?moid=SessionManager&doPath=currentSession

Cannot replicate virtual machine on a specific host

Check if the ramdisk on the host is full.

For more information, see:

RAM disk is full (316556)

What is a Recovery Point Objective (RPO)?

An RPO is the amount of time allowed to synchronize changes made on the production virtual machine to the DR virtual machine. Configuring an RPO allows the administrator to set the maximum amount of data they are prepared to lose, in a worst case scenario. The minimum value is 15 minutes.

If the configured RPO value is too short and there are many virtual machines, the WAN link between the primary site and the secondary site may become saturated and may not be able to synchronize all changes within the allotted period. This will trigger an RPO Violation message.

If a virtual machine occasionally generates much more data to replicate than normal, RPO violations are reported for those occasions. This is because the VR algorithm calculates the sync start time based on the average sync time of the last 5 sync jobs.

To calculate the required bandwidth, see: Calculating Bandwidth for vSphere Replication.
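
As a rough illustration of why a short RPO can saturate the link, the sketch below computes the sustained throughput needed to move a given amount of changed data within one RPO window. It is simple arithmetic, not the method from the linked article: it ignores compression, protocol overhead and other traffic, and the function name and the 9000 MB figure are hypothetical.

```sh
#!/bin/sh
# required_mbps CHANGED_MB RPO_MINUTES
# Print the sustained throughput, in megabits per second, needed to transfer
# CHANGED_MB megabytes (10^6 bytes each) within RPO_MINUTES minutes.
required_mbps() {
    awk -v mb="$1" -v min="$2" \
        'BEGIN { printf "%.1f\n", (mb * 8) / (min * 60) }'
}

# Example: 9000 MB of changes across all VMs within a 15 minute RPO.
#   required_mbps 9000 15
```

If the result approaches the usable capacity of the WAN link, RPO violations are likely whenever the change rate peaks.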

Full-Sync checksum optimization

Starting with VR version 5.1, full-sync checksumming is offloaded to the NFC host (the DR site host). In 5.0, the VR server had to read the data from the disk over NFC to calculate a checksum. Since 5.1, the VR server only sends a checksum request to the NFC server, so the disk I/O happens on that host and only the checksum values are sent over the network. This explains the performance improvement between 5.0.x and 5.1.x.

How does VR snapshot consolidation work?

This is how the snapshot structure looks when a test failover is triggered.

Snapshot actions that occur when you trigger a test failover:

Snapshot delta 1 is the virtual machine snapshot that is migrated to the remote site. It is not updated after test failover begins.
Snapshot delta 2 stores all new RPO syncs
Snapshot delta 3 is used for each subsequent sync (the file inflates and then data is committed to delta 2)

Notes:

If you leave a test failover in a failed over state overnight, delta 2 could grow to a very large size by morning
If there are a lot of virtual machines in the test failover, all of them will have snapshots to consolidate, so there may be a significant performance hit.
You do not see the consolidation progress in the UI, so there is no user awareness of this ongoing task. If you need to monitor the progress of the snapshot commit, see Commands to monitor snapshot deletion in VMware ESX/ESXi.

Cleanup tasks that occur when the test failover is complete:

Clean up of file delta 1. The file is discarded
Consolidation of delta 2. All changes are committed to the base disk (including any changes written to delta 3)

Notes:

Depending on the size of the snapshot(s), consolidation can take quite some time.
During the consolidation process, additional RPO syncs will incur increased load

vCenter Server Registration

To remove the vSphere Replication registration from the vCenter Server using the MOB:

Open a browser and go to the vCenter Server MOB using the FQDN or IP address:

https:///mob
Select content > Service Content > content
Select extensionManager > ManagedObjectReference:ExtensionManager > ExtensionManager
Under Methods, select UnregisterExtension
Select the extension Key field and type:

com.vmware.vcHms
Click Invoke Method.
Refresh the page to confirm the vcHms entry has disappeared.

To re-register the vSphere Replication instance with vCenter Server:

Log in to the VR server as the root user
Change to this directory:

# cd /opt/vmware/hms/libs
Run this command to re-register the appliance:

# java -jar va-util.jar -cmd certauth -host -port 80 -user -pass -extkey com.vmware.vcHms -keystore /opt/vmware/hms/security/hms-keystore.jks -keystorealias jetty -keystorepass vmware
Restart the VRMS service using this command:

# service hms restart

What do I do if the vCenter Server DB is restored and vSphere Replication is no longer working?

Test the connection from the ESXi host to the VR Appliance with netcat(nc) and confirm that the connection succeeds:

# nc -z 10.92.5.8 31031
Connection to 10.92.5.8 31031 port [tcp/*] succeeded!

For detailed instructions on using netcat, see Troubleshooting network and TCP/UDP port connectivity issues on ESX/ESXi (341078).

On the production site ESXi host, check the /var/log/vmkernel.log file for these errors:

T cpu8:860121)WARNING: Hbr: 2783: Command INIT_SESSION failed (result=Failed) (isFatal=FALSE) (Id=0) (GroupID=GID-92e26142-6963-4305-a79f-58dbb20a4422)
T cpu8:860121)WARNING: Hbr: 4322: Failed to establish connection to [10.92.5.8]:31031(groupID=GID-92e26142-6963-4305-a79f-58dbb20a4422): Failure

In the /opt/vmware/logs/hms/hms.log file on the VR appliance, search for ssl errors:

# grep -i "ssl" /opt/vmware/logs/hms/hms.log

You see these or similar messages:
javax.net.ssl.SSLHandshakeException: Received fatal alert: certificate_unknown
at com.sun.net.ssl.internal.ssl.Alerts.getSSLException(Unknown Source)
at com.sun.net.ssl.internal.ssl.Alerts.getSSLException(Unknown Source)
at com.sun.net.ssl.internal.ssl.SSLSocketImpl.recvAlert(Unknown Source)
at com.sun.net.ssl.internal.ssl.SSLSocketImpl.readRecord(Unknown Source)
at com.sun.net.ssl.internal.ssl.SSLSocketImpl.performInitialHandshake(Unknown Source)
at com.sun.net.ssl.internal.ssl.SSLSocketImpl.startHandshake(Unknown Source)
at com.sun.net.ssl.internal.ssl.SSLSocketImpl.startHandshake(Unknown Source)
at org.mortbay.jetty.security.SslSocketConnector$SslConnection.run(SslSocketConnector.java:615)

Regenerate the SSL certificates on the VR appliance.
Restart the VRMS service by running this command:

# service hms restart
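
The netcat test in the first step above can be repeated for both replication ports listed later in this article (31031 for initial replication and 44046 for ongoing replication). The following is a sketch: the function name check_vr_ports and the NC override variable are assumptions made for illustration, and the address is the example address used above.

```sh
#!/bin/sh
# Probe command; defaults to nc and can be overridden through the NC variable.
NC=${NC:-nc}

# check_vr_ports HOST
# Report whether each vSphere Replication port on HOST accepts a TCP connection.
check_vr_ports() {
    host=$1
    for port in 31031 44046; do
        if $NC -z "$host" "$port" >/dev/null 2>&1; then
            echo "$port open"
        else
            echo "$port closed"
        fi
    done
}

# Usage from the source ESXi host:
#   check_vr_ports 10.92.5.8
```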

To increase the amount of logs stored on the VR appliance:

Log in to the VR appliance as root.
Open this file on the VR appliance in a text editor:

# vi /opt/vmware/hms/conf/log4j.xml
Find this xml tag:

Edit the line and change the value to 20:
Restart the HMS service by running this command:

/etc/init.d/hms restart

Cannot recover a virtual machine due to a creating failover image error message:

You see entries in the /opt/vmware/logs/hms/hms.log similar to:

ERROR hms.replica [hms-jobs-main-thread-33] (..hms.replica.CreateImageJobImpl) operationID=b6e4eb00-6a19-4a8b-9634-5611219c0985 | Error creating failover image from group instance 'RGID-c65c3526-d389-4a99-9a8a-8eff59ac0fc1' of group 'GID-b71926c9-7a4e-4e61-93e5-27c07434f840_SECONDARY' on VR Server 'localhost.localdom' (address '127.0.0.1'), VR Server group instance id 'replica-20'.

java.lang.NullPointerException

at com.vmware.hms.util.DatastoreHelper.extractUuidFromMountPath(DatastoreHelper.java:115)

at com.vmware.hms.replication.DatastoreInfoMap.getDatastoresByUuid(DatastoreInfoMap.java:249)

at com.vmware.hms.replication.DatastoreInfoMap.getDatastoreByUuidAndDatacenter(DatastoreInfoMap.java:278)

at com.vmware.hms.replica.GroupInstanceImpl.lookupDatastoreCached(GroupInstanceImpl.java:1112)

at com.vmware.hms.replica.GroupInstanceImpl.extractVMImages(GroupInstanceImpl.java:1068)

at com.vmware.hms.replica.GroupInstanceImpl.internalCreateImage(GroupInstanceImpl.java:832)

at com.vmware.hms.replica.CreateImageJobImpl.createImageWithTask(CreateImageJobImpl.java:225)

To resolve the creating failover image error:

Browse to the problematic datastore using the vCenter Managed Object Browser (MOB)
Examine the details of all mounts (host property of the Datastore)
Check the details of mountInfo
If the value of path is unset or empty, the mount is broken and is the cause of the error.
Fix the broken mount and verify the issue is resolved.

SRM test failover fails with a 'passive' replication state error

Triggering a test failover in Site Recovery Manager fails with the error:

Error - VR synchronization failed for VRM group . Remote group 'GID-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx' is in 'passive' replication state.

Open an SSH session to the ESXi host that is hosting the virtual machine
Get the VM ID:

# vim-cmd vmsvc/getallvms
Force a full sync:

# vim-cmd hbrsvc/vmreplica.sync VMID
Wait for the sync to complete
Retry the test failover and confirm the issue is resolved.

Plugin deployment failures, checks and cleanup

If the vSphere Replication (VR) plugin fails to deploy to the vCenter Server Appliance using the vSphere Web Client, you experience one or more of these symptoms:

Plugin download by the vSphere Web Client from the VR machine fails
Plugin deployment in vSphere Web Client fails

To investigate further and resolve this issue:

Check /var/log/vmware/vsphere-client/logs/vsphere_client_virgo.log for hbr messages.
To clean up any VR UI plugin files:
1. Stop the vsphere-client process using this command:
  
  /etc/init.d/vsphere-client stop (or kill -9)
2. Find all hbr-related files and folders under /usr/lib, first by the hbr substring and then by build number, as described in steps 3 to 5.
3. Change to the /usr/lib directory:
  
  # cd /usr/lib
4. Find all hbr entries:
  
  # find . -name '*hbr*'
  
  ./vmware-vsphere-client/server/work/deployer/s/global/122/0/vr-service-6.0.0.3648226.jar/com/vmware/vr/client/hbr
  ./vmware-vsphere-client/server/work/deployer/s/global/122/0/vr-service-6.0.0.3648226.jar/com/vmware/vr/client/hbrservice
  ./vmware-vsphere-client/server/work/deployer/s/global/121/0/topology-service-6.0.0.3648226.jar/com/vmware/vr/client/hbr
5. Find all entries for the build you're using, for example:
  
  # find . -name '*3648226*'
  ./vmware-vsphere-client/server/work/deployer/s/global/122/0/vr-service-6.0.0.3648226.jar
  ./vmware-vsphere-client/server/work/deployer/s/global/123/0/vr-ui-war-6.0.0.3648226.war
  ./vmware-vsphere-client/server/work/deployer/s/global/121/0/topology-service-6.0.0.3648226.jar
  ./vmware-vsphere-client/server/work/deployer/s/global/120/0/hms-vmodl-6.0.0.3648226.jar
6. Delete all files identified in steps 4 and 5.
7. Restart the vsphere-client process using this command:
  
  /etc/init.d/vsphere-client start
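
The two find commands in steps 4 and 5 can be combined into one listing for review before anything is deleted. This is a sketch only: the function name list_vr_plugin_files is hypothetical, and it only prints paths and does not delete them.

```sh
#!/bin/sh
# list_vr_plugin_files ROOT BUILD
# Print every path under ROOT whose name contains "hbr" or the build number,
# sorted and without duplicates. Nothing is removed.
list_vr_plugin_files() {
    root=$1
    build=$2
    find "$root" \( -name '*hbr*' -o -name "*${build}*" \) | sort -u
}

# Usage, with the build number from the example output above:
#   list_vr_plugin_files /usr/lib 3648226
```

Review the listing before deleting, because a substring match on hbr can also catch files that do not belong to the VR plugin.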

Volume Shadow Services (VSS) fails with a virtual disk (.vmdk) error:

If the file is larger than the supported size with snapshots, the VSS snapshot cannot be created on the datastore.

For details, see Creating a snapshot for a virtual machine fails with the error: File is larger than maximum file size supported.

Supportability FAQ

Can I redirect the vSphere Replication traffic to another vmkernel port?

A: No. In versions 5.0, 5.1 and 5.5, vSphere Replication uses the management vmkernel interface to send VR traffic to the destination VR server. In these releases, forcing the traffic over a different vmkernel interface is not supported. This functionality is coming in vSphere Replication 6.0.

Useful commands:

To restart the VRMS service:

# service hms restart

To dump the contents of the HMS DB to a text file:

Log in to the VRMS
Run this command:

# /opt/vmware/vpostgres/1.0/bin/pg_dump -U vrmsdb > filename.txt

When gathering logs from the customer, this information is vital:

The HMS logs from the vSphere Replication appliance.
The ESXi host logs from the server containing the VM in question (HBR logs are tagged as Hbrsvc within the hostd.log and vmkernel.log files).
The virtual machine name and the Datastore it resides on.
Destination ESXi host logs if applicable.

For information on collecting the logs manually, see Collecting the VMware vSphere Replication logs

Additional Information

Analysing and monitoring VR port usage.

To check what ports are currently in use by VR, you can use the attached VR-Tester script. To use the script:

Download and copy Internal_2056086_VR-Tester.txt to the VRMS server
Rename to VR-Tester.sh
Make it executable: chmod +x VR-Tester.sh
Run the script using this command: ./VR-Tester.sh

Further details and instructions are included in comment form within the script.

Note: If you use this script please leave a feedback comment and link your SR.

Useful netstat commands on the VR appliance(s) to test ports:

=================================
Listening On: (no line = nothing)
=================================

VAMI Web UI -- Administrator's Web Browser

netstat -ant | awk '$6 == "LISTEN" && $4 ~ /[\.:]5480$/'
netstat -ant | awk '$6 == "LISTEN" && $5 ~ /[\.:]5480$/'

Management traffic from the vSphere Replication appliance to additional vSphere Replication servers (or just itself)

netstat -ant | awk '$6 == "LISTEN" && $4 ~ /[\.:]8123$/'
netstat -ant | awk '$6 == "LISTEN" && $5 ~ /[\.:]8123$/'

From the ESXi host at the protected site to the vSphere Replication appliance on the recovery site -- Initial Replication

netstat -ant | awk '$6 == "LISTEN" && $4 ~ /[\.:]31031$/'

From the ESXi host at the protected site to the vSphere Replication appliance on the recovery site -- Ongoing Replication

netstat -ant | awk '$6 == "LISTEN" && $4 ~ /[\.:]44046$/'

=========================================
Established Sessions: (no line = nothing)
=========================================

VAMI Web UI -- Administrator's Web Browser

==VR Appliance== ==Local VC==

netstat -ant | awk '$6 == "ESTABLISHED" && $4 ~ /[\.:]5480$/'
netstat -ant | awk '$6 == "ESTABLISHED" && $5 ~ /[\.:]5480$/'

Management traffic from the vSphere Replication appliance to additional vSphere Replication servers (or just itself)

netstat -ant | awk '$6 == "ESTABLISHED" && $4 ~ /[\.:]8123$/'
netstat -ant | awk '$6 == "ESTABLISHED" && $5 ~ /[\.:]8123$/'

From the ESXi host at the protected site to the vSphere Replication appliance on the recovery site -- Initial Replication

==VR Appliance== ==Source ESXi==

netstat -ant | awk '$6 == "ESTABLISHED" && $4 ~ /[\.:]31031$/'

From the ESXi host at the protected site to the vSphere Replication appliance on the recovery site -- Ongoing Replication

==VR Appliance== ==Source ESXi==

netstat -ant | awk '$6 == "ESTABLISHED" && $4 ~ /[\.:]44046$/'

Network File Copy (NFC) connections out to the destination servers (visible during a SYNC)

==VR Appliance== ==Destination ESXi==

netstat -ant | awk '$6 == "ESTABLISHED" && $5 ~ /[\.:]902$/'

HTTP connection out to the destination servers (Destination ESXi's and both vCenters)

netstat -ant | awk '$6 == "ESTABLISHED" && $5 ~ /[\.:]80$/'
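
The netstat filters above all follow the same pattern, so they can be wrapped in one helper that takes the connection state, the address column (4 for local, 5 for foreign) and the port. This is a sketch, and the function name vr_conn_filter is hypothetical.

```sh
#!/bin/sh
# vr_conn_filter STATE COLUMN PORT
# Filter `netstat -ant` output on stdin: keep lines whose state (field 6)
# equals STATE and whose address in field COLUMN ends in ".PORT" or ":PORT".
vr_conn_filter() {
    awk -v state="$1" -v col="$2" -v port="$3" \
        '$6 == state && $col ~ ("[.:]" port "$")'
}

# Usage on the VR appliance:
#   netstat -ant | vr_conn_filter LISTEN 4 31031       # initial replication listener
#   netstat -ant | vr_conn_filter ESTABLISHED 4 44046  # ongoing replication sessions
#   netstat -ant | vr_conn_filter ESTABLISHED 5 902    # NFC connections during a sync
```

As with the original commands, no output means that no matching socket was found.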

SRM documentation:

https://techdocs.broadcom.com/us/en/vmware-cis/live-recovery/site-recovery-manager/8-8.html

vSphere Replication Documentation:

https://techdocs.broadcom.com/us/en/vmware-cis/live-recovery/vsphere-replication/9-0.html

vSphere Replication Compatibility Information

https://techdocs.broadcom.com/us/en/vmware-cis/live-recovery/vsphere-replication/8-7/vr-help-plug-in-8-7.html

Introduction to vSphere Replication:
https://core.vmware.com/resource/vsphere-replication-technical-overview

Operational Limits:

Operational Limits for vSphere Replication

Operational Limits of Site Recovery Manager
vCenter Server and ESXi Server network port requirements for Site Recovery Manager, Port numbers that must be open for vSphere Replication

Creating a snapshot for an ESXi/ESX virtual machine fails with the error: File is larger than maximum file size supported
RAM disk is full
Collecting the VMware vSphere Replication logs
Troubleshooting network and TCP/UDP port connectivity issues on ESX/ESXi

Attachments

Internal_2056086_VR-Tester.txt
