Multiple active replications experience RPO violations in VMware Cloud Director Availability 4.x


Article ID: 315055


Updated On:

Products

VMware Cloud Director

Issue/Introduction

Symptoms:

  • In the VMware Cloud Director Availability Provider Portal, multiple replications display RPO Violation notifications.
  • In /opt/vmware/h4/replicator/log/replicator.log on the destination Replicator, you see entries similar to:
2020-07-01 02:56:33.958  WARN - [########-####-####-####-########35bf] [hbr-poller1] c.v.h.r.m.hbr.DestinationGroupMonitor    : The following replication in our internal state is unknown to hbrsrv: H4-########-####-####-####-########810a
2020-07-01 02:56:33.958  WARN - [########-####-####-####-########35bf] [hbr-poller1] c.v.h.r.m.hbr.DestinationGroupMonitor    : The following replication in our internal state is unknown to hbrsrv: H4-########-####-####-####-########5e60
  • In /opt/vmware/h4/lwdproxy/log/lwdproxy.log on the destination Replicator, you see entries similar to:
2020-07-01 02:56:34,218 WARN [Worker-3-1] c.v.h.p.h.InitSessionHandler [InitSessionHandler.java:202] Outbound connect failed for H4-########-####-####-####-########810a
io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: /192.168.1.71:31031
Caused by: java.net.ConnectException: Connection refused
        at java.base/sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
        ...
  • In /var/log/vmware/hbrsrv.log on the destination Replicator, you see entries similar to:
2020-07-01T02:56:27.060+01:00 verbose hbrsrv[05778] [Originator@6876 sub=RemoteDisk] CID changed for disk (/vmfs/volumes/########-####-########a62b/C4-########-####-####-####-########2cce/########-####-####-####-########55ee_cuxxxxxn.vmdk) (prev='99d470046a7aaa53a27faf4c58fc74c1') (current=1eff25fbb42e33563ae78b5bd2f601ed)
2020-07-01T02:56:27.060+01:00 warning hbrsrv[05704] [Originator@6876 sub=NfcConnection] Thread started
2020-07-01T02:56:27.063+01:00 panic hbrsrv[05452] [Originator@6876 sub=Default]
-->
--> Panic: Received fatal signal.
--> Backtrace:
--> [backtrace begin] product: VMware vSphere Replication Server, version: 7.0.0, build: build-16104671, tag: hbrsrv, cpu: x86_64, os: linux, buildType: release
--> backtrace[00] hbrsrv-bin[0x0098FE7F]
--> backtrace[01] hbrsrv-bin[0x009835E9]
--> backtrace[02] hbrsrv-bin[0x00A85893]
...


Note: The preceding log excerpts are only examples. Dates, times, and environment-specific values will vary depending on your environment.
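To confirm these symptoms from the command line, you can search for the same entries directly on the destination Replicator appliance. The following is a minimal sketch only, assuming the default log locations shown in the excerpts above:

# Replications tracked in the internal state but unknown to hbrsrv
grep "unknown to hbrsrv" /opt/vmware/h4/replicator/log/replicator.log
# Failed outbound connections reported by the LWD proxy
grep "Outbound connect failed" /opt/vmware/h4/lwdproxy/log/lwdproxy.log
# HBR service panic and the start of its backtrace
grep -A 10 "Panic: Received fatal signal" /var/log/vmware/hbrsrv.log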

Environment

VMware Cloud Director Availability 4.x

Cause

This issue occurs when an asynchronous NFC setting is enabled in the HBR service on the Replicator appliance, which degrades performance and causes the HBR service to crash.

Resolution

To resolve this issue, disable the asynchronous NFC setting for the HBR service:

Note: This procedure modifies a configuration file. Take a backup of the file before proceeding. A condensed command-line sketch of these steps follows the procedure.
  1. SSH to the Replicator appliance and log in as root.
  2. Navigate to the following directory:
cd /etc/vmware
  3. Determine whether the asynchronous NFC setting is explicitly enabled:
grep useAsyncNfc hbrsrv.xml
  4. If the line is present, edit the hbrsrv.xml file and set the value to false:
<useAsyncNfc>false</useAsyncNfc>
  5. If the line isn't present, add it in the <hbrsrv> section under <!-- NFC Connection --> of the hbrsrv.xml file:
<config>
   ...
   <!-- HBR Configuration Options -->
   <hbrsrv>
      ...
      <!-- NFC Connection -->
      <useAsyncNfc>false</useAsyncNfc>
      ...
   </hbrsrv>
</config>
  6. Restart the HBR service for the changes to take effect:
systemctl restart hbrsrv.service
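
For reference, the steps above can also be condensed into a small shell sketch. This is an illustration only, assuming GNU sed on the appliance, the default /etc/vmware/hbrsrv.xml location, and the <!-- NFC Connection --> comment shown in step 5; review the file manually if your configuration differs:

# Back up the configuration file before making any changes
cp /etc/vmware/hbrsrv.xml /etc/vmware/hbrsrv.xml.bak
if grep -q useAsyncNfc /etc/vmware/hbrsrv.xml; then
    # The element already exists: force its value to false
    sed -i 's|<useAsyncNfc>.*</useAsyncNfc>|<useAsyncNfc>false</useAsyncNfc>|' /etc/vmware/hbrsrv.xml
else
    # The element is missing: insert it under the NFC Connection comment
    sed -i 's|<!-- NFC Connection -->|<!-- NFC Connection -->\n      <useAsyncNfc>false</useAsyncNfc>|' /etc/vmware/hbrsrv.xml
fi
# Restart the HBR service for the change to take effect
systemctl restart hbrsrv.service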