Trace network issues in vSphere Replication (Live Recovery)

Article ID: 375995

Products

VMware Live Recovery

Issue/Introduction

How vSphere Replication (Live Recovery) uses the network
The ESXi hosts in the vCenter inventory initiate replication over port 31031, and the hbr-agent over port 32032, to the vSphere Replication Appliance and add-on vSphere Replication Servers at the target site.

This document covers troubleshooting the network path from source components to target components across the VMware products configured with the vSphere Replication plugin.

The topology map below shows the direction of the required open ports, which must be allowed through the network hops and local routers.
Ports 80 and 443 relate to vCenter, 902 to ESXi hosts, 8043 to the HMS service, 8123 to the VRS service, 31031 to vSphere Replication point-to-point traffic, 32032 to the HBR service, and 5480 to the vSphere Replication VAMI.
 
Graphic design by GrantOrchard
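
When reading netstat or packet capture output, it can help to decode a port number back to the component it belongs to. The following is a minimal sketch of such a lookup, built only from the port list above. The function name vr_port_role is a hypothetical helper and is not part of any VMware tool.

```shell
# Hypothetical helper: map a port number from this article to its component.
# The mapping mirrors the port list in this article and nothing else.
vr_port_role() {
    case "$1" in
        80|443) echo "vCenter" ;;
        902)    echo "ESXi host" ;;
        8043)   echo "HMS service" ;;
        8123)   echo "VRS service" ;;
        31031)  echo "vSphere Replication point-to-point" ;;
        32032)  echo "HBR service" ;;
        5480)   echo "vSphere Replication VAMI" ;;
        *)      echo "unknown"; return 1 ;;
    esac
}
```

For example, vr_port_role 8043 prints "HMS service". Ports that are not in the list print "unknown" and return a non-zero status.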

Environment

vSphere Replication (Live Recovery) 9.0
vSphere Replication 8.0

Cause

For vSphere Replication, the Security Profile must have HBR, hbr-agent, and NFC configured.

The Security Profile is associated with network ports 31031 and 32032. Earlier editions of vSphere Replication used port 44046, which is obsolete in vSphere Replication 8.x versions. Port 902 on the ESXi host is used by the NFC service.

Two components need to be distinguished in vSphere Replication: the vSphere Replication Appliance and the add-on vSphere Replication Servers. The configuration limits allow 9 add-on vSphere Replication Servers.

The vSphere Replication Appliance is the primary server, and the vSphere Replication Servers are secondary servers that communicate with the primary server over port 8123.
 

The hbr-agent service on the hosts is enabled when vSphere Replication is configured with vCenter. The hbr service communicates with the vCenter inventory and populates the vSphere Replication hbrsrv database with all ESXi hosts managed by vCenter.



At the command line of the ESXi host, use the vim-cmd hbrsvc/ commands to review the details of a virtual machine that runs on the host and is configured with vSphere Replication.

Get the Vmid of the virtual machine


[root@esxi:~] vim-cmd vmsvc/getallvms
Vmid     Name                   File                   Guest OS          Version 
11          vmname    [datastore] vmname/vmname.vmx    OS_64Guest       vmx-21

 

These are the tools related to vSphere Replication


[root@esxi:~] vim-cmd hbrsvc/
Commands available under hbrsvc/:
vmreplica.abort              vmreplica.pause
vmreplica.create             vmreplica.queryReplicationState
vmreplica.disable            vmreplica.reconfig
vmreplica.diskDisable        vmreplica.resume
vmreplica.diskEnable         vmreplica.startOfflineInstance
vmreplica.enable             vmreplica.stopOfflineInstance
vmreplica.getConfig          vmreplica.sync
vmreplica.getState

 

Get the destination vSphere Replication Appliance or vSphere Replication Server (add-on) that the virtual machine is replicating to


[root@esxi:~] vim-cmd hbrsvc/vmreplica.getConfig 11
Retrieve VM replication configuration:
       The VM is configured for replication with the following options:
               VM Replication ID = GID-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
               Destination IP Address = x.x.x.x
               Destination Port = 31031
               Recovery Point Objective = 1440
               Quiesce Guest OS = false
               Enable Opportunistic Updates = false
               Network Compression = false
               Network Encryption = false
               Paused for Replication = false

               Disk scsi0:0 is configured for replication:
                              Device key = 2000
                              Replication ID = RDID-d27a4619-157e-414a-9803-427f822a4de5
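
The Destination IP Address and Destination Port values can be pulled out of this output and reused in a connectivity test. This is a minimal sketch, assuming the getConfig output keeps the "name = value" layout shown above. The name replication_target is a hypothetical helper, not a VMware command.

```shell
# Hypothetical helper: read vmreplica.getConfig output on stdin and print
# "<destination ip> <destination port>" on one line.
# Assumes the "name = value" layout shown in this article.
replication_target() {
    awk -F' = ' '
        /Destination IP Address/ { split($2, a, " "); ip = a[1] }
        /Destination Port/       { split($2, a, " "); port = a[1] }
        END { if (ip != "" && port != "") print ip, port }'
}
```

A possible use on the ESXi host, with Vmid 11 from the example:

vim-cmd hbrsvc/vmreplica.getConfig 11 | replication_target

The two values printed can then be passed to nc -z to test the path to the target. Nothing is printed when either value is missing from the input.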

To see whether the virtual machine is replicating, run a sync.

[root@esxi:~] vim-cmd hbrsvc/vmreplica.sync 11
Force a replica synchronzation for the VM:

Get the state of the virtual machine's replication, reported as bytes of data checksummed and transferred. Run the command a few times to see progress.


[root@esxi:~] vim-cmd hbrsvc/vmreplica.getState 11
Retrieve VM running replication state:
The VM is configured for replication. Current replication state: Group: GID-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx (generation=9627309314105)
Group State: full sync (0% done: checksummed 0 bytes of 16 GB, transferred 0 bytes of 0 bytes)
DiskID RDID-d27a4619-157e-414a-9803-427f822a4de5 State: full sync (checksummed 0 bytes of 16 GB, transferred 0 bytes of 0 bytes)

[root@esxi:~] vim-cmd hbrsvc/vmreplica.getState 11
Retrieve VM running replication state:
The VM is configured for replication. Current replication state: Group: GID-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx (generation=9627309314105)
Group State: full sync (60% done: checksummed 9.6 GB of 16 GB, transferred 80 KB of 392 KB)
DiskID RDID-d27a4619-157e-414a-9803-427f822a4de5 State: full sync (checksummed 9.6 GB of 16 GB, transferred 80 KB of 392 KB)
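
To follow progress without reading the whole output each time, the percentage in the Group State line can be extracted. This is a minimal sketch that assumes the "Group State: ... (N% done" wording shown above. The name sync_percent is a hypothetical helper.

```shell
# Hypothetical helper: read vmreplica.getState output on stdin and print
# the "% done" number from the "Group State" line.
# Prints nothing if no such line is present.
sync_percent() {
    sed -n 's/.*Group State: [^(]*(\([0-9][0-9]*\)% done.*/\1/p'
}
```

A possible use on the ESXi host, with Vmid 11 from the example:

vim-cmd hbrsvc/vmreplica.getState 11 | sync_percent

For the second sample above this prints 60. A value that stays the same across several runs suggests the sync is not making progress.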

The time taken to transfer vSphere Replication data over the network is recorded in the ESXi host log /var/run/log/hostd.log.

[root@esxi:/var/run/log] cat hostd.log |less
yyyy-mm-ddThh:mm:ss.msZ info hostd[265382] [Originator@6876 sub=Vimsvc.ha-eventmgr opID=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx-HMS-4293-61-1-f969 user=vpxuser:VSPHERE.LOCAL\Administrator] Event 575 :
 Sync started by System for virtual machine vmname on host esxi.domain.tld in cluster Cluster_name in ha-datacenter.
...
yyyy-mm-ddThh:mm:ss.msZ info hostd[264342] [Originator@6876 sub=Hbrsvc] Replication group (groupID=GID-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx): last delta duration 1136 ms, size 826908 (file transfers duration: 126 ms, prepare delta duration: 0 ms)
yyyy-mm-ddThh:mm:ss.msZ info hostd[264342] [Originator@6876 sub=Vimsvc.ha-eventmgr] Event 576 : Sync completed for virtual machine vmname on host esxi.domain.tld in cluster Cluster_name in ha-datacenter (826908 bytes transferred).
yyyy-mm-ddThh:mm:ss.msZ info hostd[265393] [Originator@6876 sub=Hbrsvc] ReplicationScheduler: stats updated for (groupID=GID-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx): last duration was 1s, bandwidth was 0.79 MB/s; estimated duration is now 1s, estimated bandwidth is 0.79 MB/s.

[root@esxi:/var/run/log] cat hostd.log |grep "last delta";date
yyyy-mm-ddThh:mm:ss.msZ info hostd[265381] [Originator@6876 sub=Hbrsvc] Replication group (groupID=GID-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx): last delta duration 775 ms, size 0 (file transfers duration: 360 ms, prepare delta duration: 17 ms)
yyyy-mm-ddThh:mm:ss.msZ info hostd[264342] [Originator@6876 sub=Hbrsvc] Replication group (groupID=GID-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx): last delta duration 1136 ms, size 826908 (file transfers duration: 126 ms, prepare delta duration: 0 ms)
yyyy-mm-ddThh:mm:ss.msZ info hostd[265393] [Originator@6876 sub=Hbrsvc] Replication group (groupID=GID-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx): last delta duration 1136 ms, size 826908 (file transfers duration: 126 ms, prepare delta duration: 0 ms)
yyyy-mm-ddThh:mm:ss.msZ info hostd[265381] [Originator@6876 sub=Hbrsvc] Replication group (groupID=GID-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx): last delta duration 618 ms, size 0 (file transfers duration: 118 ms, prepare delta duration: 14 ms)
yyyy-mm-ddThh:mm:ss.msZ info hostd[265393] [Originator@6876 sub=Hbrsvc] Replication group (groupID=GID-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx): last delta duration 618 ms, size 0 (file transfers duration: 118 ms, prepare delta duration: 14 ms)
[root@esxi:/vmfs/volumes/611a2ae5-aa26b7de-6322-00505601e86b/log] date
Day Month Day HH:MM:SS UTC YYYY

[root@esxi:/var/run/log] cat hostd.log |grep "last duration"
yyyy-mm-ddThh:mm:ss.msZ info hostd[265381] [Originator@6876 sub=Hbrsvc] ReplicationScheduler: stats updated for (groupID=GID-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx): last duration was 1s, bandwidth was 0.00 MB/s; estimated duration is now 1s, estimated bandwidth is 1.00 MB/s.
yyyy-mm-ddThh:mm:ss.msZ info hostd[265393] [Originator@6876 sub=Hbrsvc] ReplicationScheduler: stats updated for (groupID=GID-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx): last duration was 1s, bandwidth was 0.79 MB/s; estimated duration is now 1s, estimated bandwidth is 0.79 MB/s.
yyyy-mm-ddThh:mm:ss.msZ info hostd[265393] [Originator@6876 sub=Hbrsvc] ReplicationScheduler: stats updated for (groupID=GID-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx): last duration was 1s, bandwidth was 0.00 MB/s; estimated duration is now 1s, estimated bandwidth is 0.79 MB/s.
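
The "last delta" lines can be condensed into a short summary when the log is long. This is a minimal sketch that assumes the "last delta duration N ms, size N" wording shown above. The name delta_summary is a hypothetical helper.

```shell
# Hypothetical helper: read hostd.log lines on stdin and summarize the
# "last delta duration <ms> ms, size <bytes>" entries.
# Prints the number of entries, the total bytes, and the longest duration.
delta_summary() {
    awk '/last delta duration/ {
        dur = 0; size = 0
        for (i = 1; i < NF; i++) {
            # "delta duration" without a colon is the overall delta time;
            # "prepare delta duration:" carries a colon and is skipped.
            if ($i == "delta" && $(i + 1) == "duration") dur = $(i + 2) + 0
            if ($i == "size") size = $(i + 1) + 0
        }
        n++; total += size
        if (dur > max) max = dur
    }
    END { printf "deltas=%d bytes=%d max_ms=%d\n", n, total, max }'
}
```

A possible use on the ESXi host:

grep "last delta" /var/run/log/hostd.log | delta_summary

In the excerpt above the same delta appears to be logged under more than one hostd thread ID, so the totals can count one delta more than once.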

Resolution

To trace the vSphere Replication network, test each connection with the commands below. Replace x.x.x.x with the actual IP addresses in your environment. In these examples only the last octet is shown, so that the direction of the packets is clear when capturing network data.

  • On the ESXi host, use the command nc -z ip_address port (example: source x.x.x.6 to destination x.x.x.12)
  • On the vSphere Replication Appliance, use the command curl -vvv telnet://fqdn|ip_address:port (example: source x.x.x.5 to destination x.x.x.3)
  • On vCenter, use the command curl -v telnet://fqdn|ip_address:port (example: source x.x.x.4 to destination x.x.x.2)
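
When several ports must be tested against the same target, the single checks above can be wrapped in a loop. This is a minimal sketch for the ESXi host, where nc -z is the check used in this article. The function name check_ports and the PROBE variable are hypothetical and exist only so that the probe command can be swapped out.

```shell
# Hypothetical helper: test one host against a list of ports.
# PROBE is the command run as "<PROBE> <host> <port>"; it defaults to the
# nc -z check used in this article.
PROBE="${PROBE:-nc -z}"

check_ports() {
    host="$1"; shift
    for port in "$@"; do
        if $PROBE "$host" "$port" >/dev/null 2>&1; then
            echo "$host:$port open"
        else
            echo "$host:$port blocked"
        fi
    done
}
```

A possible use from the source ESXi host toward the target vSphere Replication Appliance:

check_ports x.x.x.3 31031 32032

A port reported as blocked only means the probe failed from this host. The hop that drops the traffic still has to be found with the packet capture steps in Additional Information.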


From the vCenter command line


Name: vc_fqdn.domain.tld
Address: x.x.x.4

From vCenter (IP x.x.x.4) to local vSphere Replication on port 8043
root@vc [ ~ ]# curl -v telnet://x.x.x.5:8043
* Rebuilt URL to: telnet://x.x.x.5:8043/
* Trying x.x.x.5...
* TCP_NODELAY set
* Connected to x.x.x.5 (x.x.x.5) port 8043 (#0)
 
From vCenter to remote vCenter over port 443
root@vc [ ~ ]# curl -v telnet://x.x.x.2:443
* Rebuilt URL to: telnet://x.x.x.2:443/
*   Trying x.x.x.2...
* TCP_NODELAY set
* Connected to x.x.x.2 (x.x.x.2) port 443 (#0)
 
From the ESXi Host command line

Name: Host.domain.tld
Address: x.x.x.6
From Host to local vSphere Replication on port 80
[root@Host:~] nc -z x.x.x.5 80
Connection to x.x.x.5 80 port [tcp/http] succeeded!
 
From Host to remote vSphere Replication on port 31031
[root@Host:~]  nc -z x.x.x.3  31031
Connection to x.x.x.3 31031 port [tcp/*] succeeded!

 
From the vSphere Replication command line - repeat on the paired site vSphere Replication
 
Name: vr.domain.tld
Address: x.x.x.5
From vSphere Replication to local Host on port 902
root@vr [ ~ ]# curl -v telnet://x.x.x.6:902
*   Trying x.x.x.6:902...
* Connected to x.x.x.6 (x.x.x.6) port 902 (#0)
---
From vSphere Replication to local vCenter on port 80 and 443
root@vr [ ~ ]# curl -v telnet://x.x.x.4:80
* Rebuilt URL to: telnet://x.x.x.4:80/
* Trying x.x.x.4...
* TCP_NODELAY set
* Connected to x.x.x.4 (x.x.x.4) port 80 (#0)
 
root@vr [ ~ ]# curl -v telnet://x.x.x.4:443
* Rebuilt URL to: telnet://x.x.x.4:443/
* Trying x.x.x.4...
* TCP_NODELAY set
* Connected to x.x.x.4 (x.x.x.4) port 443 (#0)
---
From vSphere Replication to remote vCenter on port 80 and 443
root@vr [ ~ ]# curl -v telnet://x.x.x.2:80
*   Trying x.x.x.2:80...
* Connected to x.x.x.2 (x.x.x.2) port 80 (#0)
 
root@vr [ ~ ]# curl -v telnet://x.x.x.2:443
*   Trying x.x.x.2:443...
* Connected to x.x.x.2 (x.x.x.2) port 443 (#0)
 
To check for IP conflict on the vSphere Replication command line 

root@vr [ ~ ]# ifconfig -a 
root@vr [ ~ ]# nslookup <IP address>

Additional Information

In the examples below, the vSphere Replication Appliances are named vr1 and vr2.

Isolating the Network Traffic of vSphere Replication KB 78613

Log in to the vSphere Replication Appliance command line (for example, with a PuTTY session).
If you see only 10-eth0.network, the ESXi host is using the default vmk0 to replicate data.

root@vr1 [ ~ ]# cd /etc/systemd/network
root@vr1 [ /etc/systemd/network ]# ls -l  
-rw-r--r-- 1 root root 197 Jan 13 19:20 10-eth0.network

If you also see 10-eth1.network and 10-eth2.network, a dedicated replication network is configured on the ESXi hosts.

  • eth1: Incoming traffic (from source hosts to appliance)
  • eth2: Outgoing traffic (from appliance to target hosts)

root@vr1 [ /etc/systemd/network ]# ls -l
-rw-r--r-- 1 root root 119 Jan 13 19:20 10-eth0.network <--- Management
-rw-r--r-- 1 root root 117 Jan 13 19:20 10-eth1.network <--- VR Traffic
-rw-r--r-- 1 root root 117 Jan 13 19:20 10-eth2.network <--- VR NFC Traffic

Check the current network connections. The connections on port 8043 are the vSphere Replication point-to-point connections of the paired configuration between vr1 and vr2.

root@vr1 [ /etc/systemd/network ]# netstat -t |egrep -i "State|8043"
Proto Recv-Q         Send-Q      Local Address            Foreign Address              State
tcp6    0              0         vr1.domain.tld:33040     vr2.domain.tld:8043      ESTABLISHED
tcp6    0              0         vr1.domain.tld:36796     vr1.domain.tld:8043      ESTABLISHED
tcp6    0              0         vr1.domain.tld:8043      srm1.domain.tld:50316    ESTABLISHED
tcp6    0              0         vr1.domain.tld:8043      vr1.domain.tld:36796     ESTABLISHED
tcp6    0              0         vr1.domain.tld:8043      vr2.domain.tld:46902     ESTABLISHED
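
The direction of each port 8043 connection can be read from which side holds the port: the local address for inbound connections, the foreign address for outbound ones. This is a minimal sketch that assumes the column layout shown above. The name hms_peers is a hypothetical helper.

```shell
# Hypothetical helper: read netstat connection lines on stdin and label each
# established port 8043 connection as inbound or outbound.
# Assumes the columns: Proto Recv-Q Send-Q Local Foreign State.
hms_peers() {
    awk '$NF == "ESTABLISHED" {
        if ($4 ~ /:8043$/)      print "inbound from " $5
        else if ($5 ~ /:8043$/) print "outbound to " $5
    }'
}
```

A possible use on the appliance:

netstat -t | hms_peers

If the paired appliance (vr2 in this example) never appears in the output, the 8043 path between the two sites is a likely place to start checking.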

On the ESXi host, identify and make a note of the PortNum, the ClientName, and the tmp directory that relate to the replication information you discovered in the vCenter UI for the ESXi host.

Log in to the ESXi host where the replicated VM resides to get the destination vSphere Replication IP and port number.

Run the command: vim-cmd vmsvc/getallvms |grep vm_name
Vmid     Name                   File                   Guest OS          Version 
11          vmname    [datastore] vmname/vmname.vmx    OS_64Guest       vmx-21
 
Use the Vmid with the command syntax: vim-cmd hbrsvc/vmreplica.getConfig <vmid>

Example: vim-cmd hbrsvc/vmreplica.getConfig 11
Retrieve VM replication configuration:
        The VM is configured for replication with the following options:
                VM Replication ID = GID-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
                Destination IP Address = x.x.x.x <--- vSphere Replication target IP, or paired vSphere Replication
                Destination Port = 31031 <--- vSphere Replication target IP port
                Recovery Point Objective = 1440
                Quiesce Guest OS = true
                Enable Opportunistic Updates = false
                Network Compression = false
                Network Encryption = false
                Paused for Replication = false
 
                Disk scsi0:0 is configured for replication:
                        Device key = 2000
                        Replication ID = RDID-d27a4619-157e-414a-9803-427f822a4de5

Log in to the ESXi host where the vSphere Replication Appliance is running.
 
[root@esxi:/tmp] which net-stats
/bin/net-stats
[root@esxi:/tmp]# net-stats -l
PortNum       Type      SubType     SwitchName    MACAddress          ClientName
2214592523     4          0         vSwitch0      xx:xx:xx:xx:xx:xx     vmnic0 <--- for default vmk0 the replication is on this uplink
 
 
How to find the uplink the replication VM is using on the ESXi host
 
[root@esxi:/tmp]# esxcli network vm list
World ID   Name          Num Ports    Networks
--------     ----------      ---------         --------
265948         vr1              1               VM Network - Management
 
[root@esxi:/tmp]# esxcli network vm port list -w 265948
Port ID: 67108899
vSwitch: vSwitch0
Portgroup: VM Network - Management
DVPort ID:
MAC Address: xx:xx:xx:xx:xx:xx
IP Address: x.x.x.x
Team Uplink: vmnic0 <---------- vsphere replication VM vr1 is using vmnic0
Uplink Port ID: 2214592523
Active Filters:
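
The Team Uplink and Uplink Port ID values are the two inputs needed for the packet capture step. This is a minimal sketch that reads them from the port list output, assuming the "name: value" layout shown above. The name capture_targets is a hypothetical helper.

```shell
# Hypothetical helper: read "esxcli network vm port list" output on stdin and
# print "<team uplink> <uplink port id>" on one line.
# Assumes the "name: value" layout shown in this article.
capture_targets() {
    awk -F': *' '
        /Team Uplink:/    { split($2, a, " "); uplink = a[1] }
        /Uplink Port ID:/ { split($2, a, " "); port = a[1] }
        END { if (uplink != "" && port != "") print uplink, port }'
}
```

A possible use on the ESXi host, with World ID 265948 from the example:

esxcli network vm port list -w 265948 | capture_targets

For the sample output above this prints vmnic0 2214592523. Nothing is printed when either value is missing.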
 

Using the pktcap-uw tool in ESXi 5.5 and later KB 2051814

The vmnic is the uplink and the vmk is the kernel port. The PortNum is the virtual switch port id for the uplink. 

To capture packets, run the pktcap-uw command at both sites simultaneously. Edit the switch port ID for the uplink and the vmnic (2214592523 and vmnic0 in this example) to match the replication configuration found in your environment.

[root@esxi:/tmp]# pktcap-uw --switchport 2214592523 -o /tmp/2214592523.pcap & pktcap-uw --uplink vmnic0 -o /tmp/vmnic0.pcap &
or 
[root@esxi:/tmp]# pktcap-uw --trace --ip destination_ip > ip.pcap &
or replace X with vmnic number
[root@esxi:/tmp]# pktcap-uw --dir 2 --uplink vmnicX -o -| tcpdump-uw icmp -enr -

 

You can stop pktcap-uw tracing with the kill command:
 kill $(lsof |grep pktcap-uw |awk '{print $1}'| sort -u)

Run this command to check that all pktcap-uw traces are stopped:
lsof |grep pktcap-uw |awk '{print $1}'| sort -u

 

Read the packet capture on the host with tcpdump-uw as shown below, or download the pcap files and open them in Wireshark.
[root@esxi:/tmp]# tcpdump-uw -ttttnnr 2214592523.pcap |grep 31031