Unable to vMotion VMs on NSX portgroup - vdl2 down error [JDK Issue]

Products

VMware NSX

Issue/Introduction

You are unable to vMotion virtual machines between ESXi hosts in an NSX environment. When attempting the migration, the following error is displayed:

Currently connected network interface 'Network adapter 1' uses network 'DVSwitch[## ## ## ## ## ## ## ## ## ## ## NSX port group [dvportgroup-####](vdl2 down)', which is not accessible.

The vdl2 component status is reported as down on the affected ESXi host. To verify this, run the following command on that host:

[~] net-dvs -l | grep status.component.vdl2
                com.vmware.common.opaqueDvs.status.component.vdl2 = down ,        propType = RUNTIME
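When checking several hosts, it can help to reduce the output above to the bare status value. The following is a minimal sketch, assuming the output line has the format shown above; the function name vdl2_status is hypothetical and not part of any NSX or ESXi tooling.

```shell
# Reads `net-dvs -l` output on stdin and prints the value of each
# status.component.vdl2 property found (for example "down").
# Assumes the "<property> = <value> , propType = ..." layout shown above.
vdl2_status() {
  sed -n 's/.*status\.component\.vdl2 *= *\([A-Za-z]*\).*/\1/p'
}
```

On the host this could be used as net-dvs -l | vdl2_status. One line is printed per matching property, so a host with several NSX switches may return more than one value.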

Standard recovery steps, such as rebooting the host or unpreparing and re-preparing NSX on the host, do not resolve the issue.
On the ESXi host, the nsxcli -c get controllers and nsxcli -c get managers commands may show the host as connected, but the underlying transport remains impacted.
On the NSX Manager, the affected ESXi host transport nodes (TNs) do not complete full syncs because of the known "Java 11 commonPool" issue.

<NSX manager>#cat /var/log/cloudnet/nsx-ccp.log | grep -i "ForkJoinPool.commonPool" | grep -i "<TN UUID>"
  --------<No results from the above command; the absence of entries confirms the JDK issue>--------

Running get core-dump from the NSX Manager admin shell may show one or more cbm_oom core dump files.

NSX Manager>get core-dump
### ### 21 20## UTC 12:##:30.987
Directory: /image/core

          18#83##6        Jan ## #### 14:##:## UTC core.cbm_oom.#####.#####.hprof.gz
          18#48##7        Jan ## #### 16:##:## UTC core.cbm_oom.#####.#####.hprof.gz

NSX Manager>

On the NSX Manager, tanuki.log shows "Restarting JVM" events:

root@<NSX_Manager> : cat /var/log/cbm/tanuki.log | grep -i "restarting jvm"
STATUS | wrapper | YYYY/MM/DD HH:MM:SS | The JVM has run out of memory.   Restarting JVM.
STATUS | wrapper | YYYY/MM/DD HH:MM:SS | The JVM has run out of memory.   Restarting JVM.
STATUS | wrapper | YYYY/MM/DD HH:MM:SS | The JVM has run out of memory.   Restarting JVM.
root@<NSX_Manager>
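To see whether the out-of-memory restarts recur over time, the matching lines can be grouped by date. This is a rough sketch, assuming the pipe-separated layout shown above with the timestamp in the third field; the function name jvm_restarts_per_day is hypothetical.

```shell
# Prints "<date> <count>" for each day on which tanuki.log recorded a
# "Restarting JVM" event. Takes one or more log file paths as arguments.
# Assumes fields are separated by "|" and the third field is "date time".
jvm_restarts_per_day() {
  awk -F'|' 'tolower($0) ~ /restarting jvm/ { split($3, ts, " "); count[ts[1]]++ }
             END { for (day in count) print day, count[day] }' "$@" | sort
}
```

For example: jvm_restarts_per_day /var/log/cbm/tanuki.log. Restarts spread across many days would be consistent with the repeated CBM out-of-memory crashes described in this article, though the count alone does not prove the cause.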

Environment

VMware NSX 4.2.0.x
VMware NSX 4.2.1.0, 4.2.1.1, 4.2.1.2, 4.2.1.3
VMware NSX 9.0.0.0

Cause

Due to a known JDK issue (JDK-8330017: ForkJoinPool stops executing tasks due to ctl field Release Count (RC) overflow), if the Corfu compactor leader does not change for a long time, it may become unresponsive and stop triggering compaction cycles until the Corfu server is restarted. As a result, the CBM service runs out of memory and crashes.

Resolution

This issue is resolved in VMware NSX 4.2.1.4, 4.2.2.0, 9.0.1.0 and later, available at Broadcom downloads. If you have difficulty finding or downloading the software, review the Download Broadcom products and software KB.
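The affected versions listed in the "Environment" section can be expressed as a simple check, for example when auditing several NSX deployments. This is only a sketch of that list; the function name is_affected_nsx_version is hypothetical, and it assumes a dotted version string, ignoring any trailing build components.

```shell
# Returns 0 (true) if the given NSX version is listed as affected in this
# article, and 1 otherwise. Trailing build components are ignored.
is_affected_nsx_version() {
  case "$1" in
    4.2.0.*)                   return 0 ;;  # 4.2.0.x
    4.2.1.[0-3]|4.2.1.[0-3].*) return 0 ;;  # 4.2.1.0 through 4.2.1.3
    9.0.0.0|9.0.0.0.*)         return 0 ;;  # 9.0.0.0
    *)                         return 1 ;;  # not listed as affected
  esac
}
```

For example: is_affected_nsx_version 4.2.1.3 && echo "upgrade or schedule rolling reboots". A return value of 1 only means the version is not listed in this article; it does not by itself confirm that the version contains the fix.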

To work around this issue, perform a rolling reboot of the NSX Managers, as described in the KB: NSX is impacted by JDK-8330017: ForkJoinPool stops executing tasks due to ctl field Release Count (RC) overflow.

For environments running affected versions (see the "Environment" section), implement a preventative monthly rolling reboot schedule:

1. Reboot the first NSX Manager.
2. SSH to a Manager as the admin user and check cluster health: get cluster status
3. When all services report up on all 3 NSX Manager nodes, reboot the next Manager.
4. Repeat steps 2-3 for the third Manager.

Once all Manager reboots are complete and the cluster is stable, retry the vMotion.
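The "wait until all services report up" part of the rolling reboot steps above can be sketched as a generic polling loop. This is a minimal illustration, not a supported tool: wait_until_healthy is a hypothetical helper, and the health check it calls must be supplied by the operator (for example, a script that inspects the output of get cluster status), because this article does not define one.

```shell
# Usage: wait_until_healthy CHECK_CMD [MAX_TRIES] [DELAY_SECONDS]
# CHECK_CMD is an operator-supplied (hypothetical) command or function that
# exits 0 only when all services report up on all three Manager nodes.
# Returns 0 as soon as CHECK_CMD succeeds, or 1 after MAX_TRIES failures.
wait_until_healthy() {
  check_cmd=$1
  max_tries=${2:-60}   # default: 60 attempts
  delay=${3:-30}       # default: 30 seconds between attempts
  try=1
  while [ "$try" -le "$max_tries" ]; do
    if "$check_cmd"; then
      return 0
    fi
    sleep "$delay"
    try=$((try + 1))
  done
  return 1
}
```

An operator would call this after each reboot and proceed to the next Manager only if it returns 0. The defaults of 60 attempts at 30-second intervals are arbitrary placeholders and should be adjusted to the environment.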
