When updating ESXi hosts with a vSphere Lifecycle Manager (vLCM) image, the remediation task hangs at 5% during the "Finished compliance check for cluster" phase. Restarting the VMware vCenter Server Appliance (VCSA) Update Manager service or HA does not resolve the hang.
Symptoms observed in the NSX Manager logs (nsx-manager subcomponent upgrade-coordinator) include warnings for long-running transactions:
WARN tx-tracer-poller UfoTxnTracingService ... comp="nsx-manager" level="WARNING" subcomp="upgrade-coordinator"] UfoTxnTracingService... long running tx has been running for seconds=3022691
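To gauge how long the transaction has been stuck, the seconds value in the warning can be extracted and converted to days. The following is a minimal sketch, assuming the log line contains the "seconds=<integer>" token exactly as in the excerpt above. The function name long_running_tx_seconds is hypothetical and is not part of any NSX tooling.

```python
import re

# Matches the "seconds=<n>" token seen in the warning excerpt above.
_SECONDS_RE = re.compile(r"long running tx has been running for seconds=(\d+)")


def long_running_tx_seconds(line):
    """Return the reported duration in seconds, or None if the line does not match."""
    match = _SECONDS_RE.search(line)
    return int(match.group(1)) if match else None


# Sample built from the excerpt above; the elided parts ("...") are left elided.
sample = (
    "WARN tx-tracer-poller UfoTxnTracingService ... "
    "long running tx has been running for seconds=3022691"
)
seconds = long_running_tx_seconds(sample)
days = seconds / 86400  # 86400 seconds per day
print(f"{seconds} s is about {days:.1f} days")
```

For the value in the excerpt, this works out to roughly 35 days, meaning the transaction has been open for about five weeks.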
Product: VMware NSX
Versions: Prior to 4.2.1.4
Related Components: vSphere Lifecycle Manager (vLCM), ESXi
The Upgrade Coordinator service fails with an out-of-memory condition. A large number of threads accumulate in the "WAITING" state and exhaust the heap. The trigger is a known JDK issue (JDK-8330017) in which the ForkJoinPool stops executing tasks after the Release Count (RC) in its ctl field overflows.
This issue is resolved in VMware NSX 4.2.1.4.
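Whether a given NSX Manager falls in the affected range can be checked by comparing its version against 4.2.1.4. The following is a rough sketch, assuming the version is available as a dotted numeric string (for example "4.2.1.3", optionally followed by further dotted build fields). The helper names parse_version and is_affected are hypothetical.

```python
FIXED_VERSION = (4, 2, 1, 4)  # first release containing the fix, per this article


def parse_version(text):
    """Turn '4.2.1.3' into (4, 2, 1, 3), using only the first four dotted fields."""
    parts = tuple(int(part) for part in text.strip().split(".")[:4])
    # Pad short versions such as '4.2' with zeros so tuples compare field by field.
    return parts + (0,) * (4 - len(parts))


def is_affected(version_text):
    """Return True if the version is earlier than the fixed release."""
    return parse_version(version_text) < FIXED_VERSION


for version in ("4.2.1.3", "4.2.1.4", "4.2.0"):
    print(version, "affected" if is_affected(version) else "not affected")
```

Comparing tuples of integers rather than raw strings avoids ordering mistakes such as treating "4.10" as earlier than "4.2".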
For environments running affected versions, Broadcom recommends a rolling reboot of NSX Managers prior to upgrading. Implement the following monthly rolling reboot schedule as a preventative measure:
1. Reboot the first NSX Manager node.
2. Log in to an NSX Manager via SSH as an admin user and verify cluster health by running: get cluster status
3. Once all services report as "up" on all three nodes, reboot the next Manager.
4. Repeat the process for the final Manager node.
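The rolling reboot above can be sketched as a loop that reboots one Manager at a time and waits for the cluster to report healthy before moving on. This is an illustrative outline only: the reboot and all_services_up callables are hypothetical placeholders that the operator would have to implement (for example, by running get cluster status over SSH and inspecting the output), and the polling interval and timeout are assumed values, not Broadcom recommendations.

```python
import time


def rolling_reboot(nodes, reboot, all_services_up,
                   poll_seconds=30, timeout_seconds=3600, sleep=time.sleep):
    """Reboot each node in turn, waiting for cluster health between reboots.

    reboot(node) and all_services_up() are supplied by the caller; they are
    placeholders here and do not correspond to any real NSX API.
    """
    for node in nodes:
        reboot(node)
        waited = 0
        while True:
            # Wait first so the node has time to go down before health is checked.
            sleep(poll_seconds)
            waited += poll_seconds
            if all_services_up():
                break
            if waited >= timeout_seconds:
                raise TimeoutError(f"cluster not healthy after rebooting {node}")


# Dry run with stub callables; no real systems are contacted.
rebooted = []
rolling_reboot(
    ["manager-1", "manager-2", "manager-3"],  # hypothetical node names
    reboot=rebooted.append,
    all_services_up=lambda: True,
    sleep=lambda seconds: None,
)
print(rebooted)
```

A real health check should also confirm that the rebooted node actually went down and rejoined, since a status query issued too early may still show the pre-reboot state.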