Edge Transport node is in 'Failed Configuration' state in NSX UI

Products

VMware NSX

Issue/Introduction

Edge Transport node is in 'Failed Configuration' state in NSX UI
Communication issues may be observed with virtual machines if they are on NSX-backed segments related to the affected edge node
Tunnels between the affected edge node and other NSX transport nodes may be down or degraded
Controller connectivity on the affected edge node may be down. For example, running the get controllers command on the affected edge node returns the error: % Failed to get controller list
The controller status for the affected edge node may show as Down in the NSX UI
A message similar to the following may be present in the NSX UI:

Host configuration: Caught MessagingException during host config stage. [TN=TransportNode/c5a965c6-####-####-####-17ad46d9b83c]. Reason: MessagingException
Messages similar to the following may be present in the /var/log/syslog file on the NSX manager nodes:

2026-01-07T00:00:46.024Z ERROR L2HostConfigTaskExecutor5 TransportNodeAsyncServiceImpl 5143 FABRIC [nsx@6876 comp="nsx-manager" errorCode="MP100" level="ERROR" subcomp="manager"] Caught MessagingException during host config stage. [TN=TransportNode/c5a965c6-####-####-####-17ad46d9b83c]. Reason: MessagingException
com.vmware.nsx.messaging.exceptions.MessagingException: null
at com.vmware.nsx.messaging.rpc.RpcManager.invokeOutgoingRequestTimeoutErrorHandler(RpcManager.java:609) ~[?:?]
at com.vmware.nsx.messaging.rpc.RpcManager.access$700(RpcManager.java:66) ~[?:?]
at com.vmware.nsx.messaging.rpc.RpcManager$RequestMapsCleanupTask.runCleanup(RpcManager.java:1026) ~[?:?]
at com.vmware.nsx.messaging.rpc.RpcManager$RequestMapsCleanupTask.run(RpcManager.java:993) ~[?:?]
at java.util.TimerThread.mainLoop(Timer.java:555) ~[?:1.8.0_382]
at java.util.TimerThread.run(Timer.java:505) ~[?:1.8.0_382]
2026-01-07T00:00:46.024Z INFO L2HostConfigTaskExecutor5 TransportNodeStateServiceImpl 5143 FABRIC [nsx@6876 comp="nsx-manager" level="INFO" subcomp="manager"] Incoming Revision: [1024] Obj: [TnStateInternal [id=c5a965c6-####-####-####-17ad46d9b83c, retryCount=0, vmkMigrationFailures=0, revision=1024, stageToStatusMap={HostConfig=TnStageStatus [stageName=HostConfig, status=FAILED, errorCode=8816, errorParams=[c5a965c6-####-####-####-17ad46d9b83c, MessagingException], timeStamp=2026-Jan-07 00.00.46 AM, errorMessage=Caught MessagingException during host config stage. [TN=TransportNode/c5a965c6-####-####-####-17ad46d9b83c]. Reason: MessagingException]}]]
Messages similar to the following may be present in the /var/log/syslog file on the affected edge node:

2025-04-24T14:39:39.017Z ######## NSX 1 SYSTEM [nsx@6876 comp="nsx-edge" subcomp="nsxa" s2comp="nestdb" level="ERROR" errorCode="########"] DB is not connected while performing write operation
2025-04-24T14:39:39.004Z ######## nsxa-systemd-helper 7467 - - 2025-04-24T14:39:39Z nsxa 1 nestdb [ERROR] DB is not connected while performing write operation errorCode="########"
2025-04-24T14:39:39.164Z ######## nsxa-systemd-helper 7467 - - 2025-04-24T14:39:39Z nsxa 1 nestdb [ERROR] DB is not connected while performing write operation errorCode="########"

2026-01-07T10:03:46.392Z ######## NSX 1 - [nsx@6876 comp="nsx-edge" s2comp="nsx-net" tid="11" level="INFO"] StreamSocket[1241234 Open f:28 i:1134202767 -> unix:///var/run/vmware/nestdb/nestdb-server.sock] async_connect
2026-01-07T10:03:46.392Z ######## NSX 1 - [nsx@6876 comp="nsx-edge" s2comp="nsx-net" tid="11" level="INFO"] StreamSocket[1241234 Open f:28 i:1134202767 -> unix:///var/run/vmware/nestdb/nestdb-server.sock] on_connect 2-No such file or directory
2026-01-07T10:03:46.392Z ######## NSX 1 - [nsx@6876 comp="nsx-edge" s2comp="nsx-net" tid="11" level="WARNING"] StreamConnection[1241234 Connecting to unix:///var/run/vmware/nestdb/nestdb-server.sock sid:1241234] Couldn't connect to 'unix:///var/run/vmware/nestdb/nestdb-server.sock' (error: 2-No such file or directory)
2026-01-07T10:03:46.392Z ######## NSX 1 - [nsx@6876 comp="nsx-edge" s2comp="nsx-net" tid="11" level="WARNING"] StreamConnection[1241234 Error to unix:///var/run/vmware/nestdb/nestdb-server.sock sid:-1] Error 2-No such file or directory
2026-01-07T10:03:46.392Z ######## NSX 1 - [nsx@6876 comp="nsx-edge" s2comp="nsx-rpc" tid="11" level="WARNING"] RpcConnection[1241234 Connecting to unix:///var/run/vmware/nestdb/nestdb-server.sock 0] Couldn't connect to unix:///var/run/vmware/nestdb/nestdb-server.sock (error: 2-No such file or directory)
2026-01-07T10:03:46.393Z ######## NSX 3057 - [nsx@6876 comp="nsx-edge" s2comp="nsx-net" tid="3221" level="INFO"] StreamSocket[1241287 Init f:-1 i:-1 -> unix:///var/run/vmware/nestdb/nestdb-server.sock] Created
Checking for running nestdb processes on the affected edge node shows that more than one nestdb process is running:

# ps -ef |grep nestdb |grep -v watchdog
3510 3491 994 nestdb 00:06:26 0.0 0.7 237104 293796 /opt/vmware/nsx-nestdb/bin/nestdb-server --schema /opt/vmware/nsx-nestdb/schema/nestdb.schema --database /config/vmware/nsx/nestdb/db --txn_log_size 209715200 --mem_stats_interval 300 --mem_release_interval 86400 --metrics_text_publisher --metrics_rpc_publisher --listen unix:///var/run/vmware/nestdb/nestdb-server.sock --listen ssl-unix:///var/run/vmware/nestdb/nestdb-server-ssl.sock
1156706 3616 994 nestdb 00:00:00 1.9 0.1 37400 99232 /opt/vmware/nsx-nestdb/bin/nestdb-server --schema /opt/vmware/nsx-nestdb/schema/nestdb.schema --database /config/vmware/nsx/nestdb/db --txn_log_size 209715200 --mem_stats_interval 300 --mem_release_interval 86400 --metrics_text_publisher --metrics_rpc_publisher --listen unix:///var/run/vmware/nestdb/nestdb-server.sock --listen ssl-unix:///var/run/vmware/nestdb/nestdb-server-ssl.sock
1156720 3193 994 nestdb 00:00:00 2.1 0.1 37376 99232 /opt/vmware/nsx-nestdb/bin/nestdb-server --schema /opt/vmware/nsx-nestdb/schema/nestdb.schema --database /config/vmware/nsx/nestdb/db --txn_log_size 209715200 --mem_stats_interval 300 --mem_release_interval 86400 --metrics_text_publisher --metrics_rpc_publisher --listen unix:///var/run/vmware/nestdb/nestdb-server.sock --listen ssl-
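
The manual check above can be wrapped in a small helper so the count is explicit. This is a minimal sketch, not part of any NSX tooling: the function name count_nestdb_servers is hypothetical, and the binary path is taken from the ps output shown above. It reads ps-style text on standard input and counts the nestdb-server lines, skipping the watchdog and any grep process.

```shell
# Hypothetical helper (not shipped with NSX): counts nestdb-server
# processes in ps-style text read from stdin.
# The binary path matches the one in the ps output above.
count_nestdb_servers() {
    awk '
        /\/opt\/vmware\/nsx-nestdb\/bin\/nestdb-server/ &&
        !/watchdog/ && !/grep/ { n++ }
        END { print n + 0 }   # n + 0 prints 0 when nothing matched
    '
}
```

On the affected edge node this could be run as `ps -ef | count_nestdb_servers`. A result greater than 1 matches the symptom described in this article.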

Environment

VMware NSX

Cause

Multiple instances of NestDB are started. This causes unpredictable behavior from the perspective of the NestDB clients, as some clients operate on one instance while other clients operate on another.

The NestDB server startup script, like those of many other LCP daemons, uses pidof to determine whether the process has already been started. If pidof does not report a running process, the startup script launches another instance of the watchdog, which in turn attempts to launch another instance of NestDB.

This works fine under normal circumstances, but by default pidof does not return processes that are in the uninterruptible sleep state (D) or the zombie state (Z) on some Linux distributions, including Ubuntu 20.04 (the Ubuntu version on this Edge VM). A NestDB process in either state is therefore treated as not running, and a second instance is started.
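
The failure mode described above can be sketched as follows. This is illustrative only and is not the actual NSX startup script: start_if_not_running, pid_lookup and launcher are hypothetical names, with pid_lookup standing in for the pidof call. The point is that the launcher runs whenever the lookup reports nothing, even if the daemon exists in a state the lookup skips.

```shell
# Illustrative only: NOT the actual NSX startup script.
# $1 (pid_lookup): command that exits 0 if it finds the daemon, non-zero
#                  otherwise. Hypothetical stand-in for the pidof call.
# $2 (launcher):   command that starts the daemon (hypothetical).
start_if_not_running() {
    pid_lookup=$1
    launcher=$2
    if "$pid_lookup" >/dev/null 2>&1; then
        echo "already running"
    else
        # Reached when the lookup finds nothing, including the case where
        # the process exists but the lookup skips it (D or Z state).
        "$launcher"
    fi
}
```

If the lookup silently skips an existing process, each invocation of such a check starts one more instance, which is consistent with the multiple nestdb-server processes seen in the ps output.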

The following log excerpt shows NestDB in the uninterruptible sleep state (D in the S column):

/var/log/vmware/top-cpu.log:

Tue Sep 05 16:22:17 UTC 2025
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ TGID COMMAND
2##2 nestdb 20 0 83212 24180 14576 D 16.5 0.1 0:00.17 2092 /opt/vmware/nsx-nestdb/bin/nestdb-server --schema /opt/vmware/nsx-nestdb/schema/nestdb.schema --dat+

By default, pidof skips processes in the D and Z states because inspecting them can cause pidof, and the scripts that call it, to hang. For further context, refer to the Ubuntu manual page for pidof(8) or Why is pidof not working.
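
To see which processes are currently in the D or Z state, the state column reported by ps can be filtered directly. The sketch below is a hedged example rather than an NSX-provided tool: list_dz_processes is a hypothetical name, and it assumes input lines of the form "PID STAT COMMAND", as produced by `ps -eo pid=,stat=,comm=`.

```shell
# Hypothetical helper: reads "PID STAT COMMAND" lines from stdin and
# prints only those whose state begins with D (uninterruptible sleep)
# or Z (zombie).
list_dz_processes() {
    awk '$2 ~ /^[DZ]/ { print }'
}
```

On the edge node this could be run as `ps -eo pid=,stat=,comm= | list_dz_processes`. A nestdb-server entry in this output indicates a process in one of the two states described above.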

Resolution

This issue is resolved in VMware NSX 4.2.0, available at Broadcom Downloads.

Workaround:

To work around this issue, reboot the affected edge node.

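The risk of hitting this issue can be reduced by keeping the underlying infrastructure, particularly the disk/storage, healthy, because processes typically enter the uninterruptible sleep state (D) while waiting on I/O.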