Sitewide RPO violations re-occur for replications in VMware Cloud Director Availability 4.x

Article ID: 315034


Products

VMware Cloud Director

Issue/Introduction

Symptoms:

  • In a Cloud Director Availability On-Premises to Cloud configuration, all replications start to experience RPO violations after running without issue for days or weeks.
  • Rebooting the Cloud Director Availability On-Premises Appliance allows the replications to recover and return to a healthy state.
  • In the /opt/vmware/h4/lwdproxy/log directory, there is a recent file with an .hprof extension and a name similar to java_pid2064.hprof.
  • In the /opt/vmware/h4/lwdproxy/log/lwdproxy.log files, there are out-of-memory errors similar to:
    2023-08-15 00:24:51,434 WARN [Worker-3-1] c.v.h.p.h.InitSessionHandler [InitSessionHandler.java:224] Handshake relay to server null failed for group H4-a1b2c3d4-####-####-####-########bbd
    java.lang.OutOfMemoryError: Java heap space
        at java.base/sun.security.ssl.HandshakeContext.dispatch(HandshakeContext.java:443)
        at java.base/sun.security.ssl.SSLEngineImpl$DelegatedTask$DelegatedAction.run(SSLEngineImpl.java:1074)
        at java.base/sun.security.ssl.SSLEngineImpl$DelegatedTask$DelegatedAction.run(SSLEngineImpl.java:1061)
        at java.base/java.security.AccessController.doPrivileged(Native Method)
        at java.base/sun.security.ssl.SSLEngineImpl$DelegatedTask.run(SSLEngineImpl.java:1008)
        ...
    2023-08-17 13:54:37,101 INFO [main] c.v.h.p.App [App.java:86] Restored routing table from persistent storage.

Note: The preceding log excerpts are only examples. Date, time, and environmental variables may vary depending on your environment.
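The two log-directory symptoms above can be checked from a shell on the appliance. The following is a minimal sketch, not an official diagnostic: the function names, the LOG_DIR variable, and the 7-day window are illustrative assumptions, and it assumes rotated copies of lwdproxy.log are plain text (compressed copies would need a tool such as zgrep instead).

```shell
#!/bin/sh
# Sketch: look for the heap dump and OutOfMemoryError symptoms described above.
# LOG_DIR defaults to the directory named in this article; override it if needed.
LOG_DIR="${LOG_DIR:-/opt/vmware/h4/lwdproxy/log}"

# List heap dumps (*.hprof) in directory $1 modified within the last $2 days
# (the 7-day default is an arbitrary choice for "recent").
recent_heap_dumps() {
    find "$1" -maxdepth 1 -type f -name '*.hprof' -mtime "-${2:-7}" 2>/dev/null
}

# Count OutOfMemoryError lines across lwdproxy.log and its rotated copies in $1.
count_oom_errors() {
    cat "$1"/lwdproxy.log* 2>/dev/null | grep -c 'java.lang.OutOfMemoryError'
}

echo "Recent heap dumps in $LOG_DIR:"
recent_heap_dumps "$LOG_DIR"
echo "OutOfMemoryError lines in lwdproxy.log*: $(count_oom_errors "$LOG_DIR")"
```

A recent .hprof file together with a non-zero error count matches the symptoms listed in this article. Either one alone is only a hint and should be compared against the log excerpt above.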

Environment

VMware Cloud Director Availability 4.x

Cause

This issue occurs when an unusually high number of connections to the lwdproxy service terminate abnormally. The service includes a feature that tracks every connection that died, and the memory used by this tracking grows over time until the Java heap is exhausted.

Resolution

This is a known issue affecting Cloud Director Availability 4.x.
Currently, there is no resolution.

Workaround:
If the cause of the connection terminations cannot be determined in the infrastructure, you can work around this issue by disabling the tracking, which prevents the issue from occurring:

  1. SSH to the On-Premises Cloud Director Availability Appliance and log in as root.
  2. Navigate to the lwdproxy configuration directory:
     cd /opt/vmware/h4/lwdproxy/conf/
  3. Take a backup of the lwdproxy.properties file:
     cp lwdproxy.properties lwdproxy.properties.bak
  4. Open the lwdproxy.properties file in a text editor.
  5. Change the TRAFFIC_ACCOUNTING setting from true to false:
     TRAFFIC_ACCOUNTING=false
  6. Restart the lwdproxy service:
     systemctl restart lwdproxy.service


Note: This change will cause Cloud to On-Premises replications from this location to appear to have no real-time traffic. Historical traffic data, as well as on-premises to cloud and cloud to cloud traffic data, are unaffected.
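The workaround steps can also be scripted. The following is a minimal sketch under stated assumptions, not a supported tool: it replaces the manual text-editor step with GNU sed, assumes the setting is stored as a plain TRAFFIC_ACCOUNTING=true line, and uses a hypothetical function name, CONF_DIR variable, and "apply" argument. Review it before running it as root on the appliance.

```shell
#!/bin/sh
# Sketch: disable TRAFFIC_ACCOUNTING in lwdproxy.properties, keeping a backup.
# CONF_DIR defaults to the directory named in this article.
CONF_DIR="${CONF_DIR:-/opt/vmware/h4/lwdproxy/conf}"

disable_traffic_accounting() {
    conf="$1/lwdproxy.properties"
    [ -f "$conf" ] || { echo "not found: $conf" >&2; return 1; }
    # Back up once; do not overwrite an existing backup on a re-run.
    [ -f "$conf.bak" ] || cp -p "$conf" "$conf.bak" || return 1
    # Assumption: the setting is a plain "TRAFFIC_ACCOUNTING=true" line.
    # Only that line is rewritten; every other line is left untouched.
    sed -i 's/^TRAFFIC_ACCOUNTING=true[[:space:]]*$/TRAFFIC_ACCOUNTING=false/' "$conf" || return 1
    # Succeed only if the file now contains the disabled setting.
    grep -q '^TRAFFIC_ACCOUNTING=false$' "$conf"
}

# Change the appliance only when explicitly asked to: sh <this-script> apply
if [ "${1:-}" = "apply" ]; then
    disable_traffic_accounting "$CONF_DIR" && systemctl restart lwdproxy.service
fi
```

The service is restarted only if the edit is verified, so a file without the expected line is left unchanged and lwdproxy keeps running as before. To revert, copy lwdproxy.properties.bak back over lwdproxy.properties and restart the lwdproxy service again.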