ESXi host remediation via vLCM stuck at 5% due to NSX upgrade coordinator out of memory



Article ID: 430285


Products

VMware vCenter Server

Issue/Introduction

When attempting to update ESXi hosts using a vSphere Lifecycle Manager (vLCM) Image, the remediation task becomes stuck at 5% during the "Finished compliance check for cluster" phase. Restarting the Update Manager service on the VMware vCenter Server Appliance (VCSA) or restarting HA does not resolve the hang.

Symptoms observed in the NSX Manager logs (nsx-manager subcomponent upgrade-coordinator) include warnings for long-running transactions:

WARN tx-tracer-poller UfoTxnTracingService ... comp="nsx-manager" level="WARNING" subcomp="upgrade-coordinator"] UfoTxnTracingService... long running tx has been running for seconds=3022691
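
To gauge how long the transaction has been stuck, the seconds value in this warning can be extracted and converted to days. The following is a minimal sketch, not a Broadcom-provided tool; the function names are hypothetical, and it assumes the log line contains the "long running tx has been running for seconds=" text exactly as shown above.

```python
import re

# Matches the "seconds=<n>" field of the long-running transaction warning.
_LONG_TX = re.compile(r"long running tx has been running for seconds=(\d+)")


def long_running_tx_seconds(line):
    """Return the reported transaction age in seconds, or None if the line does not match."""
    match = _LONG_TX.search(line)
    return int(match.group(1)) if match else None


def seconds_to_days(seconds):
    """Convert a duration in seconds to days."""
    return seconds / 86400


# Sample line copied from the symptom above (elisions kept as-is).
sample = (
    'WARN tx-tracer-poller UfoTxnTracingService ... comp="nsx-manager" '
    'level="WARNING" subcomp="upgrade-coordinator"] UfoTxnTracingService... '
    "long running tx has been running for seconds=3022691"
)

age = long_running_tx_seconds(sample)
print(age, round(seconds_to_days(age), 1))
```

For the sample line, 3022691 seconds is roughly 35 days.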

Environment

Product: VMware NSX
Versions: Prior to 4.2.1.4
Related Components: vSphere Lifecycle Manager (vLCM), ESXi

Cause

The Upgrade Coordinator service fails because of an out-of-memory condition. This is caused by a large number of threads entering a "WAITING" state, which fills the heap. This behavior is triggered by a known JDK issue (JDK-8330017) in which the ForkJoinPool stops executing tasks due to a Release Count (RC) overflow in the ctl field.
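
If a Java thread dump of the Upgrade Coordinator process is available, the build-up of "WAITING" threads can be confirmed by tallying thread states. The sketch below is illustrative only: it assumes the dump is in the standard HotSpot text format (lines containing "java.lang.Thread.State:"), the function name is hypothetical, and the sample dump is invented for the example rather than taken from an NSX Manager. How the dump is captured is outside the scope of this sketch.

```python
import re
from collections import Counter

# HotSpot thread dumps report each thread's state on a line such as:
#    java.lang.Thread.State: WAITING (parking)
_STATE = re.compile(r"java\.lang\.Thread\.State:\s+([A-Z_]+)")


def count_thread_states(dump_text):
    """Count how many threads are in each state in a thread dump."""
    return Counter(_STATE.findall(dump_text))


# Invented sample dump, for illustration only.
sample_dump = '''
"ForkJoinPool-1-worker-1" #31 daemon prio=5
   java.lang.Thread.State: WAITING (parking)
"ForkJoinPool-1-worker-2" #32 daemon prio=5
   java.lang.Thread.State: WAITING (parking)
"scheduler-1" #40 daemon prio=5
   java.lang.Thread.State: TIMED_WAITING (sleeping)
"main" #1 prio=5
   java.lang.Thread.State: RUNNABLE
'''

counts = count_thread_states(sample_dump)
for state, total in counts.most_common():
    print(state, total)
```

A WAITING count that keeps growing between successive dumps would be consistent with the cause described above, although it is not proof of this specific JDK issue.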

Resolution

This issue is resolved in VMware NSX 4.2.1.4.

For environments running affected versions, Broadcom recommends a rolling reboot of NSX Managers prior to upgrading. Implement the following monthly rolling reboot schedule as a preventative measure:

1. Reboot the first NSX Manager node.
2. Log in to an NSX Manager via SSH as the admin user and verify cluster health by running: get cluster status
3. Once all services report as "up" on all three nodes, reboot the next Manager.
4. Repeat the process for the final Manager node.
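
The gating rule in the steps above (do not reboot the next Manager until every service is "up" on all three nodes) can be sketched as a small check. This is a hedged illustration, not an NSX API client: the function name, the dictionary layout, and the node and service names are all hypothetical, and the status values are assumed to have been transcribed from the get cluster status output by hand or by separate tooling.

```python
def ready_for_next_reboot(service_status_by_node, expected_nodes=3):
    """Return True only if all expected nodes are present and every service on each is "up".

    service_status_by_node maps a node name to a dict of {service name: status string}.
    """
    if len(service_status_by_node) != expected_nodes:
        return False  # a node is missing from the report (or there are extras)
    return all(
        bool(services) and all(state.lower() == "up" for state in services.values())
        for services in service_status_by_node.values()
    )


# Hypothetical status snapshot taken after rebooting the first Manager.
snapshot = {
    "nsx-mgr-01": {"MANAGER": "UP", "HTTPS": "UP"},
    "nsx-mgr-02": {"MANAGER": "UP", "HTTPS": "UP"},
    "nsx-mgr-03": {"MANAGER": "UP", "HTTPS": "DOWN"},
}

print(ready_for_next_reboot(snapshot))
```

With the snapshot shown, one service is still down on the third node, so the check returns False and the next reboot should wait.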

