Protection Groups Reporting as 'Not Configured', Resulting in VM Invalid State

Article ID: 397208


Products

VMware Live Recovery

Issue/Introduction

Symptoms:

  • The Protection Groups (PGs) are marked as "Not Configured" in the Site Recovery UI on both the Source and Target sites.

  • The VMs associated with these Protection Groups are in an invalid state with the error "Virtual Machine is no longer protected. The session is not authenticated".

  • The site pairing between the two sites shows as healthy on one site, while the other site reports it as disconnected.

Environment

  • VMware Site Recovery Manager 8.x
  • VMware Site Recovery Manager 9.x
  • vSphere Replication 8.x
  • vSphere Replication 9.x

Cause

  • The root cause of this issue is a network connectivity problem between SRM and vSphere Replication.
  • Ping failures between SRM and the Health Monitoring Service (HMS) on the vSphere Replication (VR) appliance break the communication between the two components.
  • SRM fails to establish a consistent connection with the VR services, resulting in replication disruptions and system instability.
  • The issue is identified by reviewing the /opt/vmware/support/logs/srm/vmware-dr.log file, where it is evident that SRM and VR could not communicate due to ping failures and subsequent connection resets. Below is a breakdown of the logs; a quick log-search sketch follows at the end of this section.

    Logs Indicating Connectivity Failures:

    • Ping failures observed in the logs indicate that SRM was unable to reach the Health Monitoring Service (HMS) hosted on the vSphere Replication (VR) Appliance:

    2025-05-10T10:42:43.921+05:30 warning vmware-dr[01489] [SRM@6876 sub=LocalHms connID=hms-d48f] Ping failed: "14745869678902165130"
    2025-05-10T10:47:06.040+05:30 warning vmware-dr[01319] [SRM@6876 sub=LocalHms connID=hms-d48f] Ping failed: "924065267929586046"
    2025-05-10T10:47:06.105+05:30 warning vmware-dr[01444] [SRM@6876 sub=vmomi.soapStub[44] connID=hms-d48f] Terminating invocation; <SSL(<io_obj p:0x00007f1f6c1ff460, h:96, <TCP '###.##.##.## : 52794'>, <TCP '###.##.##.## : 8043'>>), />, moref: hms.ReplicationManager:replication-manager, method: findReplicationGroup
    2025-05-10T10:49:14.057+05:30 verbose vmware-dr[01495] [SRM@6876 sub=LocalHms connID=hms-d48f] Connect succeeded, new connection context "15506752107471633429"

    These ping failures indicate that the SRM appliance was unable to reach the VR appliance from 10:42:43 until the successful reconnect at 10:49:14, during which the connection was unreliable.

    • SRM logs also show an HMS connection failure:

    2025-05-10T10:44:57.209+05:30 verbose vmware-dr[01319] [SRM@6876 sub=vmomi.soapStub[27] connID=hms-d48f] Resetting stub adapter; <[N7Vmacore4Http3Ext15DrUserAgentImplE:0x00007f1f08058378], />, N2Dr5Fault22HmsConnectionDownFault9ExceptionE(Fault cause: dr.fault.HmsConnectionDownFault

    This log entry signifies that the SRM server is unable to connect to the Health Monitoring Service (HMS) due to an underlying network connectivity issue.

    • The timeout during DNS resolution shows that SRM is also experiencing delays resolving the vCenter and vSphere Replication FQDNs; the warning below records a single resolution of the vCenter FQDN taking 133464 msec (over two minutes), further contributing to the network-related issue:

    2025-05-10T10:42:43.907+05:30 warning vmware-dr[01317] [SRM@6876 sub=IO.Connection opID=9d72322f] Address resolution took too long; <resolver p:0x00007f1f7c0f5bf0, 'dr_vcenter.in:443', next:(null)>, async: true, duration: 133464msec

    • Socket errors related to the broken communication pipe further confirm that network connectivity was interrupted, leading to an explicit closure of the communication channels:

    2025-05-10T11:04:52.181+05:30 error vmware-dr[01479] [SRM@6876 sub=Listener.HTTPService opID=58322f8e-a84f-4d4c-a02b-8085b2fa9b14-loginByToken] [52614] Failed to write to response stream; <<io_obj p:0x00007f1f4c02a768, h:25, <UNIX '/run/vmware/srm/srm-socket'>, <UNIX ''>>, 52614b16-c67f-ba6c-51cf-ab7c339c03db>, N7Vmacore15SystemExceptionE(Broken pipe: The communication pipe/socket is explicitly closed by the remote service.)

    • The vCenter UI logged the event "Cannot resolve the file locations of the production VM for replication", which indicates a failure in the communication between the Live Site Recovery Appliance (SRM) and the vSphere Replication Appliance (VR). This suggests that SRM was unable to locate or resolve the replication file locations for the specified production VMs.

    May 10 10:44:53 DR_vcenter vpxd[6422]: Event [173530208] [1-1] [2025-05-10T05:14:53.2282Z] [vim.event.ExtendedEvent] [warning] [VSHPERE.LOCAL\SRM-36c744bd-22a4-486e-####-a0d2353aee0a] [Datacenter] [173530208] ['VM_Name' in group 'PG_Group': Cannot resolve the file locations of the production VM for replication. (###.##.##.##)]


    • The log file /opt/vmware/support/logs/dr-client/dr.log from the vSphere Replication appliance indicates a loss of connectivity with the SRM appliance. The error message below suggests that the vSphere Replication appliance was unable to establish communication with the SRM server at the specified address:

    2025-05-06 10:35:22,505 [srm-reactive-thread-106] WARN  com.vmware.dr.ui.tools.reactive.impl.PromiseImpl 3110140686583695429 d3798ce5-d385-4043-87c2-f595f145a9e5 getPairSrmSummaryIssues - Function 'com.vmware.srm.client.infrastructure.pc.utils.PCUtil$$Lambda/0x00007f190cf89d40@730528ef' failed.
    java.lang.RuntimeException: No connection to server at: https://###.##.##.##:443/drserver/vcdr/vmomi/sdk
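
    To check for this pattern quickly, search both log files for the signatures shown above. A minimal sketch, assuming the default log locations referenced in this section:

    On the SRM appliance:

    # grep -E "Ping failed|HmsConnectionDownFault|Address resolution took too long" /opt/vmware/support/logs/srm/vmware-dr.log

    On the vSphere Replication appliance:

    # grep "No connection to server" /opt/vmware/support/logs/dr-client/dr.log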

Resolution

To resolve the issue and restore proper functionality to SRM and vSphere Replication (VR), the following actions are recommended:

  1. Network Connectivity Review:

    • Work closely with the network team to identify and resolve any connectivity issues between the SRM appliance and the vSphere Replication appliances. Common issues could involve:

      • Firewall/ACL rules blocking required ports.

      • Network congestion or latency issues affecting communication between the appliances.

      • Routing issues between the two sites, leading to disconnects.

    • Capture the network traffic between the SRM and vSphere Replication appliances using a tool like tcpdump. Analyze the captured packets to identify potential anomalies, such as failed connection attempts, delayed packets, or TCP/IP retransmissions, which may indicate underlying network problems.

      Use the following command to collect a packet capture on the SRM and vSphere Replication appliances:

      # tcpdump -i eth0 -w /tmp/pkt_name.pcap

      This command captures network traffic on the eth0 interface and saves it to the specified file (pkt_name.pcap); stop the capture with Ctrl+C once the issue has been reproduced. Review the capture file with a tool such as Wireshark to pinpoint issues.
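
      To reduce capture noise, the trace can be narrowed to the peer appliance and the HMS port (8043, the port seen in the connection logs above). A sketch with placeholder values; substitute the actual address of the peer appliance:

      # tcpdump -i eth0 host <peer_appliance_IP> and port 8043 -w /tmp/srm_hms.pcap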
  2. DNS Resolution Optimization:

    • Address the DNS resolution delays by ensuring that SRM and vSphere Replication components can resolve fully qualified domain names (FQDNs) within an acceptable timeframe.
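
      To verify resolution speed, time a lookup of the peer FQDNs from each appliance. A minimal sketch, assuming nslookup is available in the appliance shell; a healthy lookup returns in well under a second, in contrast to the 133-second resolution recorded in the logs above:

      # time nslookup <vCenter_FQDN>
      # time nslookup <VR_appliance_FQDN>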

Workaround (Temporary Solution):

If network issues persist and an immediate resolution is needed, restarting the srm-server.service can temporarily restore communication between the SRM and VR appliances and allow the protection groups to return to a healthy state. This is only a temporary measure; the underlying network issue must still be addressed for a permanent fix.
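
A minimal sketch, assuming the appliance uses systemd and the unit is named srm-server.service as referenced above (verify the exact unit name on your build):

# systemctl restart srm-server.service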