VM communication issue and ESXi Host transport node in Unknown status with controller connectivity also in Unknown.

Article ID: 398894

Updated On:

Products

VMware NSX

Issue/Introduction

NSX was recently upgraded to 4.2.x.
During ESXi host boot or reboot, the nestdb agent remains down.
Running nsxcli -c get controllers on the host returns: % Failed to get controller list

In NSX Manager, System > Nodes > [Node Name] > Monitor shows the Agent Status as Down.
Log lines similar to the below are encountered in /var/run/log/syslog.log on the ESXi host:
In(14) jumpstart[2099923]: executing start plugin: nsx-pre-nestdb
In(14) jumpstart[2099923]: executing start plugin: nsx-nestdb
In(14) jumpstart[2099923]: nsx-nestdb started.
In(30) NSX[2101624]: nsx-pre-nestdb started
In sequence:
nsx-pre-nestdb begins starting.
nsx-nestdb begins starting while nsx-pre-nestdb is not yet fully started.
nsx-nestdb completes its start sequence before nsx-pre-nestdb has fully started.
nsx-pre-nestdb completes its start sequence. At this point nsx-nestdb has not started correctly.
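
The ordering check described above can be sketched as a small shell function. This is only an illustration: the function name check_nestdb_start_order and its three result words are hypothetical, and it matches only the two "started" messages quoted in this article. It looks at the first occurrence of each message in its input, so it should be fed the lines from a single boot.

```shell
#!/bin/sh
# Hypothetical helper (not part of NSX): reads ESXi syslog lines on stdin and
# reports whether "nsx-nestdb started." was logged before
# "nsx-pre-nestdb started", which is the ordering shown in this article.
check_nestdb_start_order() {
    awk '
        # Remember the line number of the first occurrence of each message.
        /nsx-pre-nestdb started/ { if (!pre) pre = NR }
        /nsx-nestdb started/     { if (!db)  db  = NR }
        END {
            if (!pre || !db)   print "incomplete"    # one message is missing
            else if (db < pre) print "out-of-order"  # nestdb finished first
            else               print "in-order"
        }
    '
}

# Example on a host, using the log path from this article:
#   check_nestdb_start_order < /var/run/log/syslog.log
```

A result of out-of-order corresponds to the sequence listed above; it does not by itself prove that nsx-nestdb is down, so the status check below is still needed.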
nsx-nestdb shows as stopped:
/etc/init.d/nsx-nestdb status
stopped

There are no nsx-nestdb core dumps or other log entries indicating that nsx-nestdb has crashed.
The issue is intermittent. On some reboots nsx-nestdb starts without any problem.
Log lines similar to the below are encountered in /var/run/log/vmkernel.log on the ESXi host:
2025-05-08T01:35:36.424Z In(182) vmkernel: cpu99:2109700)Fil3: 385: Caller Fil3_CreateAndOpenFile vol OS####-########-#######-####-########3fe0 took 35324 ms wantOptlocking: 1,
2025-05-08T01:35:36.428Z In(182) vmkernel: cpu107:2098699)Fil3: 385: Caller Fil3_CreateAndOpenFile vol OS####-########-#######-####-########3fe0 took 79479 ms wantOptlocking: 1,
2025-05-08T01:35:36.429Z In(182) vmkernel: cpu13:2109510)Fil3: 385: Caller Fil3Lookup vol OS####-########-#######-####-########3fe0 took 79469 ms wantOptlocking: 1,
2025-05-08T01:35:36.430Z In(182) vmkernel: cpu177:2109478)Fil3: 385: Caller Fil3Lookup vol OS####-########-#######-####-########3fe0 took 79290 ms wantOptlocking: 1,
2025-05-08T01:35:36.431Z In(182) vmkernel: cpu210:2109603)Fil3: 385: Caller Fil3Lookup vol OS####-########-#######-####-########3fe0 took 75463 ms wantOptlocking: 1,
2025-05-08T01:35:36.432Z In(182) vmkernel: cpu18:2108616 opID=e14f64e6)Fil3: 385: Caller Fil3Lookup vol OS####-########-#######-####-########3fe0 took 36900 ms wantOptlocking: 1,

Note: The preceding log excerpts are only examples. Date, time, and environmental variables may vary depending on your environment.
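
To look for similar slow filesystem operations on a host, the vmkernel.log entries can be filtered by their "took N ms" value. The following is a minimal sketch under stated assumptions: the function name slow_fil3_ops is hypothetical, the default threshold of 10000 ms is an arbitrary value chosen for illustration and not a documented limit, and the parsing relies only on the line layout shown in the excerpts above.

```shell
#!/bin/sh
# Hypothetical helper (not part of ESXi): prints "<ms> <caller>" for every
# Fil3 entry on stdin whose "took N ms" duration is at or above a threshold.
# The 10000 ms default is an illustrative assumption.
slow_fil3_ops() {
    threshold_ms="${1:-10000}"
    awk -v min="$threshold_ms" '
        /Fil3:/ && / took [0-9]+ ms/ {
            caller = "?"; ms = -1
            # The word after "Caller" is the operation name; the word after
            # "took" is the duration in milliseconds.
            for (i = 1; i < NF; i++) {
                if ($i == "Caller") caller = $(i + 1)
                if ($i == "took")   ms = $(i + 1) + 0
            }
            if (ms >= min) print ms, caller
        }
    '
}

# Example on a host, using the log path from this article:
#   slow_fil3_ops 30000 < /var/run/log/vmkernel.log
```

Entries reported around the time of the host boot are the ones relevant to this issue.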

Environment

VMware NSX

Cause

IO delays on the ESXi host transport node, such as the Fil3 operations in vmkernel.log that take tens of seconds, cause the nsx-pre-nestdb script to take longer than expected to complete.
nsx-nestdb is expected to start only after nsx-pre-nestdb has finished executing.
If nsx-nestdb starts before nsx-pre-nestdb completes, nsx-nestdb remains in a down state.

Resolution

There is no fix, as the issue is caused by underlying IO issues.

Use the following workaround to recover the affected ESXi hosts.
Workaround:
Manually start the nestdb service with the following command to resolve the host transport node issue:
/etc/init.d/nsx-nestdb start
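
The status check and the start command can be combined so that the service is started only when it reports stopped. This is a sketch, not a supported tool: the function name recover_nestdb and the NESTDB_INIT variable are hypothetical, and the match on the word "stopped" assumes the status output shown earlier in this article.

```shell
#!/bin/sh
# NESTDB_INIT defaults to the init script named in this article. The variable
# is a hypothetical hook so the function can be dry-run against a stand-in.
NESTDB_INIT="${NESTDB_INIT:-/etc/init.d/nsx-nestdb}"

recover_nestdb() {
    if [ ! -x "$NESTDB_INIT" ]; then
        echo "init script not found: $NESTDB_INIT" >&2
        return 2
    fi
    # Assumption: a down service prints "stopped", as shown in this article.
    status="$("$NESTDB_INIT" status 2>/dev/null)"
    case "$status" in
        *stopped*)
            echo "nsx-nestdb is stopped; starting it"
            "$NESTDB_INIT" start || return 1
            ;;
        *)
            echo "nsx-nestdb does not report stopped; nothing to do"
            ;;
    esac
}

# Example on an affected host:
#   recover_nestdb
```

After the service is started, re-running nsxcli -c get controllers and rechecking the Agent Status in NSX Manager are reasonable ways to confirm recovery.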

Additional Information

If you are running VMware NSX 4.2.x and nsx-nestdb is down but there are no IO errors on the ESXi host, the issue is fixed in NSX 4.2.1.1, available at Broadcom Downloads.
Steps to locate and download Broadcom products and software are available at Download Broadcom products and software:
https://knowledge.broadcom.com/external/article?articleNumber=381319