Existing replications fail to synchronize after upgrading the Cloud Director Availability 4.x On-Premises Appliance

Article ID: 315009


Products

VMware Cloud Director
VMware Cloud Director Availability - Disaster Recovery 4.x

Issue/Introduction

Symptoms:

  • Replications configured before the upgrade of the Cloud Director Availability On-Premises Appliance fail to synchronize.

    • On the Destination site, the replication's state remains "Synchronizing".
    • On the Source site, the /opt/vmware/h4/cloud/log/cloud.log file contains a SyncTimeoutException similar to:

      com.vmware.h4.manager.api.exceptions.SyncTimeoutException: Sync timeout for replication 'C4-########-####-####-####-############'

  • New replications configured from the upgraded Cloud Director Availability On-Premises Appliance synchronize successfully.
  • In the /opt/vmware/h4/replicator/log/replicator.log file on the Cloud Replicator Appliance, you see entries similar to:

    yyyy-mm-dd hh:mm:ss,sss DEBUG - [########-####-####-####-###########-##] [job-##] c.v.h.r.replication.SyncSourceJob : Requesting manual sync for H4-########-####-####-####-###########

    yyyy-mm-dd hh:mm:ss,sss  DEBUG - [########-####-####-####-##########-##] [job-##] c.v.h.r.replication.SyncSourceJob : Requesting online instance for vm vm-##
  • In the /opt/vmware/h4/lwdproxy/log/lwdproxy.log file on the Cloud Director Availability On-Premises Appliance, you see entries similar to:
    yyyy-mm-dd hh:mm:ss,sss INFO [Worker-3-3] c.v.h.p.h.InitSessionHandler [InitSessionHandler.java:74] PeerId: null
    yyyy-mm-dd hh:mm:ss,sss WARN [Worker-3-3] c.v.h.p.h.InitSessionHandler [InitSessionHandler.java:224] Handshake relay to server /127.0.0.1:8049 failed for group H4-########-####-####-####-############

    javax.net.ssl.SSLException: SSLEngine closed already
            at io.netty.handler.ssl.SslHandler.wrap(SslHandler.java:848)
            at io.netty.handler.ssl.SslHandler.wrapAndFlush(SslHandler.java:811)

    yyyy-mm-dd hh:mm:ss,sss WARN [Worker-3-3] c.v.h.p.u.TrafficCounter [TrafficCounter.java:65] Unknown counter: ################-########-####-####-####-############-########

    yyyy-mm-dd hh:mm:ss,sss WARN [Worker-3-3] i.n.c.DefaultChannelPipeline [DefaultChannelPipeline.java:1152] An exceptionCaught() event was fired, and it reached at the tail of the pipeline. It usually means the last handler in the pipeline did not handle the exception.
    io.netty.handler.codec.DecoderException: javax.net.ssl.SSLHandshakeException: Received fatal alert: certificate_unknown
            at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:499)
            at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:290)


  • In the /opt/vmware/h4/replicator/log/replicator.log file on the Cloud Director Availability On-Premises Appliance, you see entries similar to the following, containing "FOUND_ONGOING_INSTANCE":

    yyyy-mm-dd hh:mm:ss,sss DEBUG - [UI-########-####-####-####-############-#####-##-##-##] [pc-task-monitor-2] com.vmware.h4.jobengine.JobExecution     : Task 8e7331bd-bf8d-4ff2-bdb4-6481d50f38ff (WorkflowInfo{type='sync', resourceType='replication', resourceId='H4-########-####-####-####-############', isPrivate=false, resourceName='null'}) completed with result SyncRequestResult{instanceKey='replica-########-####-####-####-############', result=FOUND_ONGOING_INSTANCE}


Note: The preceding log excerpts are only examples. Dates, times, identifiers, and other environment-specific values vary between environments.
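
The log signatures above can be checked quickly with a short script run on the relevant appliance. The following is a minimal sketch, assuming the log paths shown in the excerpts above; adjust the paths and patterns if they differ in your environment.

    #!/usr/bin/env python3
    """Minimal sketch: scan Cloud Director Availability logs for the signatures
    described in the symptoms above. Paths and patterns are taken from the log
    excerpts in this article and may differ in your environment."""
    import re
    from pathlib import Path

    # Log file -> pattern that indicates the corresponding symptom.
    CHECKS = {
        "/opt/vmware/h4/cloud/log/cloud.log": r"SyncTimeoutException",
        "/opt/vmware/h4/lwdproxy/log/lwdproxy.log": r"certificate_unknown|SSLEngine closed already",
        "/opt/vmware/h4/replicator/log/replicator.log": r"FOUND_ONGOING_INSTANCE",
    }

    for path, pattern in CHECKS.items():
        log = Path(path)
        if not log.exists():
            print(f"{path}: not present on this appliance")
            continue
        hits = [line.rstrip() for line in log.open(errors="replace") if re.search(pattern, line)]
        print(f"{path}: {len(hits)} matching line(s)")
        for line in hits[-3:]:  # show the most recent matches
            print(f"  {line}")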

Environment

VMware Cloud Director Availability 4.x

Cause

This issue occurs when the upgrade of the Cloud Director Availability On-Premises Appliance regenerates the lightweight delta service certificate on the appliance, but the new certificate is not propagated to the cloud site for the existing replications.
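
To confirm that the certificate presented on the source side has changed, you can print the fingerprint of the certificate currently served by the lightweight delta service endpoint and compare it with the certificate the cloud site still has for the existing replications. The sketch below only prints that fingerprint; the host and port are assumptions based on the relay target seen in the lwdproxy.log excerpt above (127.0.0.1:8049) and may differ in your environment.

    """Minimal sketch: print the SHA-256 fingerprint of the certificate presented
    by a TLS endpoint. The host and port are assumptions taken from the
    lwdproxy.log excerpt above and may differ in your environment."""
    import hashlib
    import ssl

    HOST, PORT = "127.0.0.1", 8049  # assumed LWD relay endpoint from the log excerpt

    # Fetch the peer certificate without validating it; we only want to inspect it.
    pem = ssl.get_server_certificate((HOST, PORT))
    der = ssl.PEM_cert_to_DER_cert(pem)
    digest = hashlib.sha256(der).hexdigest().upper()
    print(":".join(digest[i:i + 2] for i in range(0, len(digest), 2)))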

Resolution

This is a known issue affecting Cloud Director Availability 4.x.

Currently there is no resolution.

Workaround:
To work around this issue, reconfigure the affected replications. The least disruptive way to do this is to toggle a replication setting and then immediately revert it, as described in the following steps (a scripted alternative is sketched after the list).

  1. Log in to the Cloud Director Availability Portal.
  2. Select an affected replication.
  3. Click All actions.
  4. Under Settings, click Replication settings.
  5. Toggle the Compress replication traffic option.
  6. Click Apply.
  7. Click All actions again.
  8. Under Settings, click Replication settings.
  9. Toggle the Compress replication traffic option back to its original setting.
  10. Click Apply.
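
If many replications are affected, the same toggle can be scripted against the Cloud Director Availability REST API instead of repeating the Portal steps for each replication. The sketch below is only an illustration of that approach: the appliance address, authentication header, endpoint path, and payload field are hypothetical placeholders and must be replaced with the values documented for your Cloud Director Availability release.

    """Illustrative sketch only: toggle the compress-replication-traffic setting
    twice for one replication, mirroring the Portal steps above. The endpoint
    path, payload field, and header name are hypothetical placeholders, not the
    documented Cloud Director Availability API."""
    import requests

    APPLIANCE = "https://vcda.example.com"  # hypothetical appliance address
    SESSION_TOKEN = "..."                   # obtained from your normal login flow
    REPLICATION_ID = "C4-00000000-0000-0000-0000-000000000000"

    session = requests.Session()
    session.headers.update({"X-VCAV-Auth": SESSION_TOKEN})  # hypothetical header name
    session.verify = "/path/to/appliance-ca.pem"

    def set_compression(enabled: bool) -> None:
        # Hypothetical settings endpoint; consult the API reference for your release.
        url = f"{APPLIANCE}/vm-replications/{REPLICATION_ID}/settings"
        resp = session.post(url, json={"isCompressionEnabled": enabled})
        resp.raise_for_status()

    # Toggling the setting and restoring it forces the replication settings to be
    # re-applied on the cloud site, the same effect as steps 1-10 above.
    set_compression(False)
    set_compression(True)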