vSphere Replication Troubleshooting

Article ID: 308573


Products

VMware Live Recovery VMware vSphere ESXi VMware Cloud on AWS

Issue/Introduction

The purpose of this article is to assist with the troubleshooting of vSphere Replication issues and contains many frequently asked questions.


Resolution

What is Host Based Replication (HBR)?

 
ESXi 5.x and later hosts have a filter installed, the HBR filter. Its purpose is to push virtual machine replication data to the vSphere Replication Appliance(s). To confirm that the filter is loaded, run this command:

# vmkload_mod -l

The output includes this module:

hbr_filter
 
Because this filter runs on the ESXi server, the HBR filter uses hostd resources. It has its own command set within vim-cmd, which you can use to manage HBR on an ESXi host.

Note: There is currently no esxcli equivalent.

The commands under hbrsvc/ are:
 
vmreplica.abort
vmreplica.create
vmreplica.disable
vmreplica.diskDisable
vmreplica.diskEnable
vmreplica.enable
vmreplica.getConfig
vmreplica.getState
vmreplica.pause
vmreplica.queryReplicationState
vmreplica.reconfig
vmreplica.resume
vmreplica.startOfflineInstance
vmreplica.stopOfflineInstance
vmreplica.sync
 
Usage:
 
To use these commands, you must first acquire the VM ID of the virtual machine you wish to troubleshoot.
 
The VM ID is used to uniquely identify this virtual machine on an ESXi host.
 
To get the VM ID, run this command:
 
# vim-cmd vmsvc/getallvms

Once you have the VM ID, you can query the replication state of the virtual machine by running this command. In this example, the VM ID is 1:

# vim-cmd hbrsvc/vmreplica.getState 1

If replication is not configured on this virtual machine, you see an output similar to:

Retrieve VM running replication state:
(vim.fault.ReplicationVmFault) {
dynamicType = <unset>,
faultCause = (vmodl.MethodFault) null,
reason = "notConfigured",
state = <unset>,
instanceId = <unset>,
vm = 'vim.VirtualMachine:1',
msg = "vSphere Replication operation error: Virtual machine is not configured for replication.",
}
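
As a convenience, the VM ID lookup and the state query can be chained together. The following is a minimal sketch and not part of the original procedure: vmid_for_name is a hypothetical helper, web01 is an invented VM name, and the sketch assumes that the getallvms output has a header line, the Vmid in the first column, and a VM name without spaces in the second column.

```shell
#!/bin/sh
# Hypothetical helper: print the Vmid whose Name column equals "$1".
# Reads `vim-cmd vmsvc/getallvms` style output on stdin.
# Assumption: the VM name contains no spaces, so column 2 is the whole name.
vmid_for_name() {
    # NR > 1 skips the header line; $1 is the Vmid, $2 is the Name.
    awk -v name="$1" 'NR > 1 && $2 == name { print $1 }'
}

# Usage on an ESXi host (commented out because it requires vim-cmd):
#   VMID=$(vim-cmd vmsvc/getallvms | vmid_for_name "web01")
#   [ -n "$VMID" ] && vim-cmd hbrsvc/vmreplica.getState "$VMID"
```

If the name contains spaces, this column-based match does not work and the VM ID should be read from the getallvms output manually.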
 
What does the vSphere Replication Appliance do?
 
The HBR agent runs on the ESXi server and is responsible for sending changed data from a running virtual machine across the network to the vSphere Replication Appliance at the DR site. When the appliance receives the changes, it applies them to the replica virtual machine disks.
 
The vSphere Replication appliance is also responsible for managing replication, which gives the administrator visibility of the virtual machine protection status and the ability to recover virtual machines with a few clicks.

Using the vim-cmd hbrsvc/vmreplica commands described above, you can also report the replication status and, if required, force a sync directly from the host CLI. This is preferable when troubleshooting, because the appliance can sometimes be out of sync with the running jobs.
 
 
In vSphere Replication, my replication jobs are being reported as not active
 
Checklist
  • Is the virtual machine powered on?
  • Is one replication job affected, or many?
  • Verify the replication state of the virtual machine directly on the ESXi host CLI
    • If it is still active, wait for the replication job to finish then refresh the client.
    • If the virtual machine replication is not in a running state, try to perform a sync using this command:

      # vim-cmd hbrsvc/vmreplica.sync VMID
       
  • Check whether the replication job has ever completed successfully. If it has not, the cause is most likely a port issue (initial replication and ongoing replication use two separate ports). For more information, see vCenter Server and ESXi Server network port requirements for Site Recovery Manager and Port numbers that must be open for vSphere Replication.
 
Cannot replicate virtual machine as there is another virtual machine with the same instance UUID
 
In some rare cases the vSphere Replication Management Server (VRMS) database may be left with replication data only at one end (primary or secondary site) and not at both as usual. This will cause further attempts to configure replication for the same virtual machine to fail.
 
When you try to replicate the virtual machine, you get an error message:
 
There is another virtual machine 'vm_name' that has the same instance UUID 'xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx' as the one that you try to configure
 
To resolve duplicate instance UUID problems:
  1. Get the VM ID:

    # vim-cmd vmsvc/getallvms
     
  2. Check the replication state:

    # vim-cmd hbrsvc/vmreplica.getState VMID
     
  3. Next you need to find the replication group managed object id value. You can find this in hostd.log file on the source ESXi server by running this command:

    # grep -i "Hbrsvc" /var/log/hostd.log | less
     
  4. The hostd.log file entry looks similar to:

    T [FFB47D20 info 'Hbrsvc'] ReplicationGroup initialized replication successfully (state=inactive) (groupID=GID-e3226007-XXXX-XXXX-XXXX-7d183d5a8c23)
     
  5. Using a browser, open the VRMS Web UI (at the primary or secondary site) and log in to the Managed Object Browser (MOB) with vCenter Server administrator credentials:

    https://VRMS_address:8043/mob/?

    Notes:
    • In VR 5.1 you need to add the &vmodl=1 switch to the end of the URL.
    • The login and password are VSPHERE.LOCAL\Administrator and VC_SSO_password respectively.
       
  6. Using the GID you found in step 4, navigate to the entry for the replication job you are having issues with. This is the URL format:

    https://VRMS_address:8043/mob/?moid=GID-xxxx
     
  7. Select destroy and then click Invoke Method.
     
  8. Restart the HMS service on the VR appliance by running this command from the VR CLI:

    # service hms restart
     
  9. If this does not successfully remove the GID, run this command on the primary ESXi host to remove the replica mapping:

    # vim-cmd hbrsvc/vmreplica.disable VMID

In some cases, this issue can be caused by a network issue at a replication site where the replication configuration of a virtual machine has been removed. This results in a lack of consistency between the two VR database servers. To resolve this issue, see Enabling Replication for a virtual machine may fail due to stale replication group GIDs in the VRM database (312696).
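
Steps 3 and 4 above involve reading the hostd.log file by eye to find the group ID. The following is a minimal sketch of how the GID values could be pulled out of the Hbrsvc entries instead. extract_gids is a hypothetical helper, and the sketch assumes that the grep on the host supports the -o and -E options.

```shell
#!/bin/sh
# Hypothetical helper: list the unique replication group IDs (GID-...)
# found in Hbrsvc entries of hostd.log style text read from stdin.
# Assumption: grep supports -o (print only the match) and -E.
extract_gids() {
    grep -i 'Hbrsvc' \
        | grep -oE 'GID-[0-9A-Za-z]+(-[0-9A-Za-z]+)*' \
        | sort -u
}

# Usage on the source ESXi host (commented out):
#   extract_gids < /var/log/hostd.log
```

If more than one GID is listed, match it against the virtual machine in question before invoking destroy in the MOB.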

 
Notes:
    • The VR MOB is case sensitive for the credentials passed. For example, cloud\srm fails while CLOUD\srm succeeds. The vCenter Server MOB is not case sensitive.
    • To check the exact account in use by the vCenter Server, open this URL:

      https://VC_IP_address/mob/?moid=SessionManager&doPath=currentSession

Cannot replicate virtual machine on a specific host
 
Check if the ramdisk on the host is full.

For more information, see:

 

 
What is a Recovery Point Objective (RPO)?
 
An RPO is the amount of time allowed to synchronize changes made on the production virtual machine to the DR virtual machine. Configuring an RPO allows the administrator to set the maximum amount of data they are prepared to lose, in a worst case scenario. The minimum value is 15 minutes.

If the configured RPO value is too short and there are many virtual machines, the WAN link between the primary site and the secondary site may become saturated and may not be able to synchronize all changes within the allotted period. This will trigger an RPO Violation message.
 
If a virtual machine occasionally generates much more data to replicate than normal, RPO violations are reported for those occasions. This is because the VR algorithm calculates the sync start time based on the average sync time of the last 5 sync jobs.
 
To calculate the required bandwidth, see: Calculating Bandwidth for vSphere Replication.
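
As a rough illustration of why a short RPO can saturate the link: if a given amount of changed data must cross the WAN within one RPO window, the sustained throughput needed is at least the data size divided by the window length. The sketch below is a simplification that ignores compression, protocol overhead, and competing traffic. required_mbps is a hypothetical helper and the figures are invented for illustration. The bandwidth article referenced above remains the reference for real sizing.

```shell
#!/bin/sh
# Hypothetical helper: minimum sustained throughput, in megabits per second,
# needed to send $1 megabytes of changed data within $2 minutes.
# Uses 1 MB = 8 megabits; ignores compression and protocol overhead.
required_mbps() {
    awk -v mb="$1" -v min="$2" \
        'BEGIN { printf "%.1f\n", (mb * 8) / (min * 60) }'
}

# Example: 9000 MB of changed data with a 15 minute RPO
required_mbps 9000 15    # prints 80.0
```

Summing this figure across all replicated virtual machines gives a first estimate of whether the configured RPO values are achievable on the available link.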

Full-Sync checksum optimization
 
Starting with VR version 5.1, full-sync checksumming is offloaded to the NFC host (the DR site host). In 5.0, the data had to be read from the disk over NFC so that the VR server could calculate the checksum. Since 5.1, the VR server only issues a request to the NFC server to calculate the checksum, so the disk I/O happens on that host and only the checksum values are sent over the network. This explains the performance improvement between 5.0.x and 5.1.x.
 
How does VR snapshot consolidation work?

This is how the snapshot structure looks when a test failover is triggered.
 
 
 
Snapshot actions that occur when you trigger a test failover:
  1. Snapshot delta 1 is the virtual machine snapshot that is migrated to the remote site. It is not updated after test failover begins.
  2. Snapshot delta 2 stores all new RPO syncs.
  3. Snapshot delta 3 is used for each subsequent sync (the file inflates and then data is committed to delta 2).

Notes:
  • If you leave a test failover in a failed over state overnight, delta 2 could grow to a very large size by morning.
  • If there are a lot of virtual machines in the test failover, all of them will have snapshots to consolidate, so there may be a significant performance hit.
  • You do not see the consolidation progress in the UI, so there is no user awareness of this ongoing task. If you need to monitor the progress of the snapshot commit, see Commands to monitor snapshot deletion in VMware ESX/ESXi.

Cleanup tasks that occur when the test failover is complete:
 
  1. Cleanup of delta 1. The file is discarded.
  2. Consolidation of delta 2. All changes are committed to the base disk (including any changes written to delta 3).

Notes:
  • Depending on the size of the snapshot(s), consolidation can take quite some time.
  • During the consolidation process, additional RPO syncs incur increased load.

vCenter Server Registration
 
To remove the vSphere Replication registration from the vCenter Server using the MOB:
 
  1. Open a browser and go to the vCenter Server MOB using the FQDN or IP address:

    https://vCenter_Server_FQDN_or_IP/mob
     
  2. Select content > Service Content > content
  3. Select extensionManager > ManagedObjectReference:ExtensionManager > ExtensionManager
  4. Under Methods, select UnregisterExtension
  5. Select the extension Key field and type:

    com.vmware.vcHms
     
  6. Click Invoke Method.
  7. Refresh the page to confirm the vcHms entry has disappeared.
 
To re-register the vSphere Replication instance with vCenter Server:
  1. Log in to the VR server as the root user
  2. Change to this directory:

    # cd /opt/vmware/hms/libs
     
  3. Run this command to re-register the appliance:

    # java -jar va-util.jar -cmd certauth -host vCenter_Server_address -port 80 -user vCenter_Server_user -pass vCenter_Server_password -extkey com.vmware.vcHms -keystore /opt/vmware/hms/security/hms-keystore.jks -keystorealias jetty -keystorepass vmware
     
  4. Restart the VRMS service using this command:

    # service hms restart
 
 
What do I do if the vCenter Server DB is restored and vSphere Replication is no longer working?
  • In the /var/log/vmkernel.log file on the production site ESXi host, check for these errors:

    T cpu8:860121)WARNING: Hbr: 2783: Command INIT_SESSION failed (result=Failed) (isFatal=FALSE) (Id=0) (GroupID=GID-92e26142-6963-4305-a79f-58dbb20a4422)
    T cpu8:860121)WARNING: Hbr: 4322: Failed to establish connection to [10.92.5.8]:31031(groupID=GID-92e26142-6963-4305-a79f-58dbb20a4422): Failure
  • In the /opt/vmware/logs/hms/hms.log file on the VR appliance, search for ssl errors:

    # grep -i "ssl" /opt/vmware/logs/hms/hms.log

    You see these or similar messages:
    javax.net.ssl.SSLHandshakeException: Received fatal alert: certificate_unknown
    at com.sun.net.ssl.internal.ssl.Alerts.getSSLException(Unknown Source)
    at com.sun.net.ssl.internal.ssl.Alerts.getSSLException(Unknown Source)
    at com.sun.net.ssl.internal.ssl.SSLSocketImpl.recvAlert(Unknown Source)
    at com.sun.net.ssl.internal.ssl.SSLSocketImpl.readRecord(Unknown Source)
    at com.sun.net.ssl.internal.ssl.SSLSocketImpl.performInitialHandshake(Unknown Source)
    at com.sun.net.ssl.internal.ssl.SSLSocketImpl.startHandshake(Unknown Source)
    at com.sun.net.ssl.internal.ssl.SSLSocketImpl.startHandshake(Unknown Source)
    at org.mortbay.jetty.security.SslSocketConnector$SslConnection.run(SslSocketConnector.java:615)
  • Regenerate the SSL certificates on the VR appliance.
     
  • Restart the VRMS service by running this command:

    # service hms restart
 
To increase the amount of logs stored on the VR appliance:
 
  1. Log in to the VR appliance as root.
  2. Open this file on the VR appliance in a text editor:

    # vi /opt/vmware/hms/conf/log4j.xml
     
  3. Find this xml tag:

    Edit the line and change the value to 20:


     
  4. Restart the HMS service by running this command:

    /etc/init.d/hms restart

Cannot recover a virtual machine due to an error creating the failover image
 
You see entries in the /opt/vmware/logs/hms/hms.log similar to:
 
ERROR hms.replica [hms-jobs-main-thread-33] (..hms.replica.CreateImageJobImpl) operationID=b6e4eb00-6a19-4a8b-9634-5611219c0985 | Error creating failover image from group instance 'RGID-c65c3526-d389-4a99-9a8a-8eff59ac0fc1' of group 'GID-b71926c9-7a4e-4e61-93e5-27c07434f840_SECONDARY' on VR Server 'localhost.localdom' (address '127.0.0.1'), VR Server group instance id 'replica-20'.
java.lang.NullPointerException
at com.vmware.hms.util.DatastoreHelper.extractUuidFromMountPath(DatastoreHelper.java:115)
at com.vmware.hms.replication.DatastoreInfoMap.getDatastoresByUuid(DatastoreInfoMap.java:249)
at com.vmware.hms.replication.DatastoreInfoMap.getDatastoreByUuidAndDatacenter(DatastoreInfoMap.java:278)
at com.vmware.hms.replica.GroupInstanceImpl.lookupDatastoreCached(GroupInstanceImpl.java:1112)
at com.vmware.hms.replica.GroupInstanceImpl.extractVMImages(GroupInstanceImpl.java:1068)
at com.vmware.hms.replica.GroupInstanceImpl.internalCreateImage(GroupInstanceImpl.java:832)
at com.vmware.hms.replica.CreateImageJobImpl.createImageWithTask(CreateImageJobImpl.java:225)
 
To resolve the failover image creation error:
  1. Browse to the problematic datastore using the vCenter Managed Object Browser (MOB).
  2. Examine the details of all mounts (the host property of the Datastore).
  3. Check the details of mountInfo.
  4. If the value of path is unset or empty, this is the cause of the failure.
  5. Fix the broken mount and verify the issue is resolved.
 
SRM test failover fails with a 'passive' replication state error
 
Triggering a test failover in Site Recovery Manager fails with the error:


Error - VR synchronization failed for VRM group . Remote group 'GID-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx' is in 'passive' replication state.
 
To resolve this issue:
  1. Open an SSH session to the ESXi host that is hosting the virtual machine.
  2. Get the VM ID:

    # vim-cmd vmsvc/getallvms
     
  3. Force a full sync:

    # vim-cmd hbrsvc/vmreplica.sync VMID
     
  4. Wait for the sync to complete.
  5. Retry the test failover and confirm the issue is resolved.

Plugin deployment failures, checks and cleanup

If the vSphere Replication (VR) plugin fails to deploy to the vCenter Server Appliance using the vSphere Web Client, you experience one or more of these symptoms:
  • Plugin download by the vSphere Web Client from the VR machine fails
  • Plugin deployment in the vSphere Web Client fails

To investigate further and resolve this issue:
  1. Check /var/log/vmware/vsphere-client/logs/vsphere_client_virgo.log for hbr messages.
  2. To clean up any VR UI plugin files:
    1. Stop the vsphere-client process using this command:

      /etc/init.d/vsphere-client stop (or kill -9)
       
    2. Find all hbr related files and folders under /usr/lib, first by the hbr substring and then by build number, as described in the next steps.
    3. Change to the /usr/lib directory:

      # cd /usr/lib
       
    4. Find all hbr entries:

      # find . -name '*hbr*'

      ./vmware-vsphere-client/server/work/deployer/s/global/122/0/vr-service-6.0.0.3648226.jar/com/vmware/vr/client/hbr
      ./vmware-vsphere-client/server/work/deployer/s/global/122/0/vr-service-6.0.0.3648226.jar/com/vmware/vr/client/hbrservice
      ./vmware-vsphere-client/server/work/deployer/s/global/121/0/topology-service-6.0.0.3648226.jar/com/vmware/vr/client/hbr

       
    5. Find all entries for the build you're using, for example:

      # find . -name '*3648226*'
      ./vmware-vsphere-client/server/work/deployer/s/global/122/0/vr-service-6.0.0.3648226.jar
      ./vmware-vsphere-client/server/work/deployer/s/global/123/0/vr-ui-war-6.0.0.3648226.war
      ./vmware-vsphere-client/server/work/deployer/s/global/121/0/topology-service-6.0.0.3648226.jar
      ./vmware-vsphere-client/server/work/deployer/s/global/120/0/hms-vmodl-6.0.0.3648226.jar

       
    6. Delete all files identified by steps 4 and 5.
    7. Restart the vsphere-client process using this command:

      /etc/init.d/vsphere-client start
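
Before deleting anything, it can help to review the two find results as one de-duplicated list. The following is a minimal sketch under that assumption: list_vr_plugin_files is a hypothetical helper that only lists matching paths and deletes nothing, and the build number is passed in by the caller.

```shell
#!/bin/sh
# Hypothetical helper: list (do not delete) paths under directory $1 whose
# names contain "hbr" or the build number $2, de-duplicated and sorted.
list_vr_plugin_files() {
    find "$1" \( -name '*hbr*' -o -name "*$2*" \) | sort -u
}

# Usage on the vCenter Server Appliance (commented out):
#   list_vr_plugin_files /usr/lib 3648226
```

Review the list, then remove the entries manually while the vsphere-client process is stopped.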
 
Volume Shadow Copy Service (VSS) fails with a virtual disk (.vmdk) error:
 
If the virtual disk file is larger than the supported size with snapshots, the VSS snapshot cannot be created on the datastore.

Supportability FAQ
 
Can I redirect the vSphere Replication traffic to another vmkernel port?
 
A: No. vSphere Replication versions 5.0, 5.1 and 5.5 use the management vmkernel interface to send VR traffic to the destination VR server. In these releases, forcing the traffic over a different vmkernel interface is not supported. This functionality is coming in vSphere Replication 6.0.
 
Useful commands:
 
To restart the VRMS service:
 
# service hms restart
 
To dump the contents of the HMS DB to a text file:
  1. Log in to the VRMS
  2. Run this command:

    # /opt/vmware/vpostgres/1.0/bin/pg_dump -U vrmsdb > filename.txt

When gathering logs from the customer, this information is vital:
 
  • The HMS logs from the vSphere Replication appliance.
  • The ESXi host logs from the server containing the VM in question (HBR logs are tagged as Hbrsvc within the hostd.log file and the vmkernel.log file).
  • The virtual machine name and the Datastore it resides on.
  • Destination ESXi host logs if applicable.

For information on collecting the logs manually, see Collecting the VMware vSphere Replication logs.
 

Additional Information

Analysing and monitoring VR port usage.

 
To check what ports are currently in use by VR, you can use the attached VR-Tester script. To use the script:
  1. Download and copy Internal_2056086_VR-Tester.txt to the VRMS server
  2. Rename to VR-Tester.sh
  3. Make it executable: chmod +x VR-Tester.sh
  4. Run the script using this command: ./VR-Tester.sh
Further details and instructions are included in comment form within the script.

Note: If you use this script please leave a feedback comment and link your SR.
 

Useful netstat commands on the VR appliance(s) to test ports:


=================================
Listening On: (no line = nothing)
=================================

VAMI Web UI -- Administrator's Web Browser

netstat -ant | awk '$6 == "LISTEN" && $4 ~ /[\.:]5480$/'
netstat -ant | awk '$6 == "LISTEN" && $5 ~ /[\.:]5480$/'


Management traffic from the vSphere Replication appliance to additional vSphere Replication servers (or just itself)

netstat -ant | awk '$6 == "LISTEN" && $4 ~ /[\.:]8123$/'
netstat -ant | awk '$6 == "LISTEN" && $5 ~ /[\.:]8123$/'

From the ESXi host at the protected site to the vSphere Replication appliance on the recovery site -- Initial Replication

netstat -ant | awk '$6 == "LISTEN" && $4 ~ /[\.:]31031$/'

From the ESXi host at the protected site to the vSphere Replication appliance on the recovery site -- Ongoing Replication

netstat -ant | awk '$6 == "LISTEN" && $4 ~ /[\.:]44046$/'

=========================================
Established Sessions: (no line = nothing)
=========================================

VAMI Web UI -- Administrator's Web Browser

==VR Appliance== ==Local VC==

netstat -ant | awk '$6 == "ESTABLISHED" && $4 ~ /[\.:]5480$/'
netstat -ant | awk '$6 == "ESTABLISHED" && $5 ~ /[\.:]5480$/'


Management traffic from the vSphere Replication appliance to additional vSphere Replication servers (or just itself)

netstat -ant | awk '$6 == "ESTABLISHED" && $4 ~ /[\.:]8123$/'
netstat -ant | awk '$6 == "ESTABLISHED" && $5 ~ /[\.:]8123$/'

From the ESXi host at the protected site to the vSphere Replication appliance on the recovery site -- Initial Replication

==VR Appliance== ==Source ESXi==

netstat -ant | awk '$6 == "ESTABLISHED" && $4 ~ /[\.:]31031$/'

From the ESXi host at the protected site to the vSphere Replication appliance on the recovery site -- Ongoing Replication

==VR Appliance== ==Source ESXi==

netstat -ant | awk '$6 == "ESTABLISHED" && $4 ~ /[\.:]44046$/'

Network File Copy (NFC) connections out to the destination servers (visible during a SYNC)

==VR Appliance== ==Destination ESXi==

netstat -ant | awk '$6 == "ESTABLISHED" && $5 ~ /[\.:]902$/'

HTTP connection out to the destination servers (Destination ESXi's and both vCenters)

netstat -ant | awk '$6 == "ESTABLISHED" && $5 ~ /[\.:]80$/'
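
The individual netstat checks above all follow the same pattern, so they can be wrapped in one function. The following is a minimal sketch, not a replacement for the attached script: conns_on_port is a hypothetical helper, and it assumes the usual netstat -ant column layout, with the local address in column 4, the foreign address in column 5, and the state in column 6.

```shell
#!/bin/sh
# Hypothetical helper: from `netstat -ant` output on stdin, print the lines
# in state $1 whose local ($4) or foreign ($5) address ends in port $2.
conns_on_port() {
    awk -v state="$1" -v port="$2" '
        $6 == state && ($4 ~ ("[.:]" port "$") || $5 ~ ("[.:]" port "$"))
    '
}

# Usage on the VR appliance (commented out):
#   for p in 5480 8123 31031 44046; do
#       echo "== LISTEN $p =="
#       netstat -ant | conns_on_port LISTEN "$p"
#   done
```

As with the commands above, no output for a port means that nothing is listening or connected on it.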

SRM documentation:
 
vSphere Replication Documentation:
 
 
 
Operational Limits:

Attachments

Internal_2056086_VR-Tester.txt