SDDC Manager commonsvcs crashes with status: java.lang.OutOfMemoryError: GC overhead limit exceeded

Article ID: 372617

Updated On:

Products

VMware SDDC Manager, VMware Cloud Foundation, VMware Cloud Foundation 4.x, VMware Cloud Foundation 5.x

Issue/Introduction

Symptoms:

SDDC Manager UI is not accessible.

Workflows in SDDC Manager fail.

The commonsvcs service on the SDDC Manager is not active.

Checking the service status with the command systemctl status commonsvcs.service -l shows the message: java.lang.OutOfMemoryError: GC overhead limit exceeded

Log entries similar to the following appear in /var/log/vmware/vcf/commonsvcs/vcf-commonsvcs.log:

java.lang.RuntimeException: Failed to get instance id and ceip status from common services.
        at com.vmware.vcf.telemetry.vac.ph.client.VcfTelemetryProvider.getTelemetryInfo(VcfTelemetryProvider.java:93)
        at com.vmware.vcf.telemetry.vac.ph.client.VcfTelemetryProvider.initInstanceId(VcfTelemetryProvider.java:40)
        at com.vmware.vcf.telemetry.vac.ph.client.VcfTelemetryProvider.isCeipEnabled(VcfTelemetryProvider.java:70)
        ...
        ...
Caused by: com.vmware.cloud.foundation.rest.commonsvcs.runtime.ApiException: java.net.SocketTimeoutException: timeout
        at com.vmware.cloud.foundation.rest.commonsvcs.runtime.ApiClient.execute(ApiClient.java:845)
        ...
Caused by: java.net.SocketTimeoutException: timeout
        at okio.SocketAsyncTimeout.newTimeoutException(Okio.kt:149)
        at okio.AsyncTimeout.access$newTimeoutException(AsyncTimeout.kt:162)
        ...
Caused by: java.net.SocketException: Socket closed
        at java.net.SocketInputStream.read(SocketInputStream.java:204)
        at java.net.SocketInputStream.read(SocketInputStream.java:141)
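
The symptom checks above can be sketched as a pair of small shell helpers. This is only an illustration: the unit name commonsvcs and the log path are taken from this article, while the function names commonsvcs_state and count_matches are hypothetical and not part of any VMware tooling.

```shell
#!/bin/sh
# Log path as quoted in this article.
LOG=/var/log/vmware/vcf/commonsvcs/vcf-commonsvcs.log

# Print the systemd state of the commonsvcs unit (for example "active" or "failed").
# Assumes the unit is named "commonsvcs".
commonsvcs_state() {
    systemctl is-active commonsvcs
}

# Print the number of lines in file $1 that contain the fixed string $2.
count_matches() {
    grep -c -F -- "$2" "$1"
}
```

For example, commonsvcs_state reports whether the service is running, and count_matches "$LOG" 'Failed to get instance id and ceip status' reports how many times the telemetry failure from the excerpt above has been logged.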

Environment

VMware Cloud Foundation 4.4.x

VMware Cloud Foundation 4.5.x

VMware Cloud Foundation 5.0.x

VMware Cloud Foundation 5.1.x

Cause

The commonsvcs service treats password tasks that are in a CANCELLED state in SDDC Manager as if they were in a PENDING state. When many CANCELLED password tasks accumulate, commonsvcs keeps polling them once per minute and eventually exhausts its memory, which causes the service to crash.

Resolution

This issue will be resolved in a future VCF release.

Workaround:

  • Run the deregister-cancelled-task.sh script (attached below) on the SDDC Manager as root.

    1. Download and transfer the script to the SDDC Manager.

    2. SSH to the SDDC Manager as the vcf user, then switch to root with su.

    3. Update permissions to make the script executable:
      chmod +x deregister-cancelled-task.sh

    4. Execute the script to clean up the CANCELLED password tasks:
      ./deregister-cancelled-task.sh

  • (Optional) Add the script to an existing cron job or create a new crontab entry.
    This automates the task clean-up if password rotation jobs fail regularly and those workflows need to be cancelled frequently.
    If rotation failures are frequent, the recommendation is to run the script once a month.
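
A minimal sketch of the optional cron step follows. The schedule (02:00 on the first day of each month), the script location /root/deregister-cancelled-task.sh, and the helper name monthly_cron_entry are assumptions chosen for illustration; the article does not prescribe any of them.

```shell
#!/bin/sh
# Build a crontab line that runs a script at 02:00 on the 1st of every month.
# $1: absolute path to the clean-up script (location is an assumption).
monthly_cron_entry() {
    printf '0 2 1 * * %s\n' "$1"
}

# To append the entry to root's crontab while keeping existing entries (run as root):
#   ( crontab -l 2>/dev/null; monthly_cron_entry /root/deregister-cancelled-task.sh ) | crontab -
```

Running crontab -l afterwards should list the new line alongside any existing jobs.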

Attachments

deregister-cancelled-task.sh