Edge Transport node is in 'Failed Configuration' state in NSX UI

Article ID: 396722

Products

VMware NSX

Issue/Introduction

  • Edge Transport node is in 'Failed Configuration' state in NSX UI
  • Communication issues may be observed with virtual machines if they are on NSX-backed segments related to the affected edge node
  • Tunnels between the affected edge node and other NSX transport nodes may be down or degraded
  • Controller connectivity on the affected edge node may be down: running the get controllers command on the node returns the error % Failed to get controller list
    The controller status for the affected edge node may show as Down in the NSX UI
  • A message similar to the following may be present in the NSX UI:

    Host configuration: Caught MessagingException during host config stage. [TN=TransportNode/c5a965c6-####-####-####-17ad46d9b83c]. Reason: MessagingException

  • Messages similar to the following may be present in the /var/log/syslog file on the NSX manager nodes:

    2026-01-07T00:00:46.024Z ERROR L2HostConfigTaskExecutor5 TransportNodeAsyncServiceImpl 5143 FABRIC [nsx@6876 comp="nsx-manager" errorCode="MP100" level="ERROR" subcomp="manager"] Caught MessagingException during host config stage. [TN=TransportNode/c5a965c6-####-####-####-17ad46d9b83c]. Reason: MessagingException
    com.vmware.nsx.messaging.exceptions.MessagingException: null
    at com.vmware.nsx.messaging.rpc.RpcManager.invokeOutgoingRequestTimeoutErrorHandler(RpcManager.java:609) ~[?:?]
    at com.vmware.nsx.messaging.rpc.RpcManager.access$700(RpcManager.java:66) ~[?:?]
    at com.vmware.nsx.messaging.rpc.RpcManager$RequestMapsCleanupTask.runCleanup(RpcManager.java:1026) ~[?:?]
    at com.vmware.nsx.messaging.rpc.RpcManager$RequestMapsCleanupTask.run(RpcManager.java:993) ~[?:?]
    at java.util.TimerThread.mainLoop(Timer.java:555) ~[?:1.8.0_382]
    at java.util.TimerThread.run(Timer.java:505) ~[?:1.8.0_382]
    2026-01-07T00:00:46.024Z  INFO L2HostConfigTaskExecutor5 TransportNodeStateServiceImpl 5143 FABRIC [nsx@6876 comp="nsx-manager" level="INFO" subcomp="manager"] Incoming Revision: [1024] Obj: [TnStateInternal [id=c5a965c6-####-####-####-17ad46d9b83c, retryCount=0, vmkMigrationFailures=0, revision=1024, stageToStatusMap={HostConfig=TnStageStatus [stageName=HostConfig, status=FAILED, errorCode=8816, errorParams=[c5a965c6-####-####-####-17ad46d9b83c, MessagingException], timeStamp=2026-Jan-07 00.00.46 AM, errorMessage=Caught MessagingException during host config stage. [TN=TransportNode/c5a965c6-####-####-####-17ad46d9b83c]. Reason: MessagingException]}]]

  • Messages similar to the following may be present in the /var/log/syslog file on the affected edge node:

    2025-04-24T14:39:39.017Z ######## NSX 1 SYSTEM [nsx@6876 comp="nsx-edge" subcomp="nsxa" s2comp="nestdb" level="ERROR" errorCode="########"] DB is not connected while performing write operation
    2025-04-24T14:39:39.004Z ######## nsxa-systemd-helper 7467 - -  2025-04-24T14:39:39Z nsxa 1 nestdb [ERROR] DB is not connected while performing write operation  errorCode="########"
    2025-04-24T14:39:39.164Z ######## nsxa-systemd-helper 7467 - -  2025-04-24T14:39:39Z nsxa 1 nestdb [ERROR] DB is not connected while performing write operation  errorCode="########"

    2026-01-07T10:03:46.392Z ######## NSX 1 - [nsx@6876 comp="nsx-edge" s2comp="nsx-net" tid="11" level="INFO"] StreamSocket[1241234 Open f:28 i:1134202767  -> unix:///var/run/vmware/nestdb/nestdb-server.sock] async_connect
    2026-01-07T10:03:46.392Z ######## NSX 1 - [nsx@6876 comp="nsx-edge" s2comp="nsx-net" tid="11" level="INFO"] StreamSocket[1241234 Open f:28 i:1134202767  -> unix:///var/run/vmware/nestdb/nestdb-server.sock] on_connect 2-No such file or directory
    2026-01-07T10:03:46.392Z ######## NSX 1 - [nsx@6876 comp="nsx-edge" s2comp="nsx-net" tid="11" level="WARNING"] StreamConnection[1241234 Connecting to unix:///var/run/vmware/nestdb/nestdb-server.sock sid:1241234] Couldn't connect to 'unix:///var/run/vmware/nestdb/nestdb-server.sock' (error: 2-No such file or directory)
    2026-01-07T10:03:46.392Z ######## NSX 1 - [nsx@6876 comp="nsx-edge" s2comp="nsx-net" tid="11" level="WARNING"] StreamConnection[1241234 Error to unix:///var/run/vmware/nestdb/nestdb-server.sock sid:-1] Error 2-No such file or directory
    2026-01-07T10:03:46.392Z ######## NSX 1 - [nsx@6876 comp="nsx-edge" s2comp="nsx-rpc" tid="11" level="WARNING"] RpcConnection[1241234 Connecting to unix:///var/run/vmware/nestdb/nestdb-server.sock 0] Couldn't connect to unix:///var/run/vmware/nestdb/nestdb-server.sock (error: 2-No such file or directory)
    2026-01-07T10:03:46.393Z ######## NSX 3057 - [nsx@6876 comp="nsx-edge" s2comp="nsx-net" tid="3221" level="INFO"] StreamSocket[1241287 Init f:-1 i:-1  -> unix:///var/run/vmware/nestdb/nestdb-server.sock] Created


  • Checking for running nestdb processes on the affected edge node shows that more than one nestdb process is running:

    # ps -ef |grep nestdb |grep -v watchdog
    3510    3491   994 nestdb   00:06:26  0.0  0.7 237104 293796 /opt/vmware/nsx-nestdb/bin/nestdb-server --schema /opt/vmware/nsx-nestdb/schema/nestdb.schema --database /config/vmware/nsx/nestdb/db --txn_log_size 209715200 --mem_stats_interval 300 --mem_release_interval 86400 --metrics_text_publisher --metrics_rpc_publisher --listen unix:///var/run/vmware/nestdb/nestdb-server.sock --listen ssl-unix:///var/run/vmware/nestdb/nestdb-server-ssl.sock
    1156706    3616   994 nestdb   00:00:00  1.9  0.1 37400  99232 /opt/vmware/nsx-nestdb/bin/nestdb-server --schema /opt/vmware/nsx-nestdb/schema/nestdb.schema --database /config/vmware/nsx/nestdb/db --txn_log_size 209715200 --mem_stats_interval 300 --mem_release_interval 86400 --metrics_text_publisher --metrics_rpc_publisher --listen unix:///var/run/vmware/nestdb/nestdb-server.sock --listen ssl-unix:///var/run/vmware/nestdb/nestdb-server-ssl.sock
    1156720    3193   994 nestdb   00:00:00  2.1  0.1 37376  99232 /opt/vmware/nsx-nestdb/bin/nestdb-server --schema /opt/vmware/nsx-nestdb/schema/nestdb.schema --database /config/vmware/nsx/nestdb/db --txn_log_size 209715200 --mem_stats_interval 300 --mem_release_interval 86400 --metrics_text_publisher --metrics_rpc_publisher --listen unix:///var/run/vmware/nestdb/nestdb-server.sock --listen ssl-
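
The process check above can be scripted. The following is a minimal sketch, assuming a root shell on the edge node; the function names count_nestdb_servers and check_nestdb_instances are illustrative and not part of NSX. The filter mirrors the ps/grep command shown above: it counts nestdb-server processes and skips the watchdog and the helper commands themselves.

```shell
#!/bin/sh
# Count nestdb-server processes in `ps -ef` style output read from stdin.
# Lines for the watchdog and for grep/awk helper commands are ignored.
count_nestdb_servers() {
    awk '/nestdb-server/ && !/watchdog/ && !/(grep|awk)/ { n++ } END { print n + 0 }'
}

# Report whether more than one nestdb-server instance is running.
# Returns 1 when duplicates are found, 0 otherwise.
check_nestdb_instances() {
    nestdb_count=$(ps -ef | count_nestdb_servers)
    if [ "$nestdb_count" -gt 1 ]; then
        echo "WARNING: $nestdb_count nestdb-server processes running (expected 1)"
        return 1
    fi
    echo "OK: $nestdb_count nestdb-server process(es) running"
}

# Usage (on the affected edge node):
#   check_nestdb_instances
```

A result greater than one matches the symptom described in this article.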


Environment

VMware NSX 

Cause

Multiple instances of NestDB are started. This causes unpredictable behavior from the perspective of the NestDB clients, as some clients operate on one instance while other clients operate on another.

The NestDB server startup script, like those of many other LCP daemons, uses pidof to determine whether the process has already been started. If it does not detect a running process, the startup script launches another instance of the watchdog, which in turn attempts to launch another instance of NestDB.

This works under normal circumstances, but on some Linux distributions, including Ubuntu 20.04 (the Ubuntu version used by this Edge VM), pidof by default does not return processes that are in the uninterruptible sleep state (D) or the zombie state (Z). When the running NestDB process is in one of these states, the startup script wrongly concludes that NestDB is not running and starts a second instance.

An example of logging in which NestDB is in the uninterruptible sleep state (D) is shown below:

/var/log/vmware/top-cpu.log:

Tue Sep 05 16:22:17 UTC 2025
PID   USER    PR  NI    VIRT    RES      SHR    S  %CPU  %MEM     TIME+    TGID COMMAND
2##2 nestdb   20   0   83212  24180  14576  D  16.5   0.1   0:00.17    2092 /opt/vmware/nsx-nestdb/bin/nestdb-server --schema /opt/vmware/nsx-nestdb/schema/nestdb.schema --dat+
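
To find such samples in a longer log, a small filter can help. This is a minimal sketch that assumes the column layout shown in the excerpt above (the S column is the 8th field and COMMAND starts at the 13th) and that each sample is preceded by a date line ending in "UTC <year>". The function name nestdb_blocked_samples is illustrative.

```shell
#!/bin/sh
# Print "<timestamp> pid=<PID> state=<S>" for every nestdb-server sample
# whose state column is D (uninterruptible sleep) or Z (zombie).
# Reads the files given as arguments, or stdin when none are given.
nestdb_blocked_samples() {
    awk '
        # Remember the most recent date line, e.g. "Tue Sep 05 16:22:17 UTC 2025".
        / UTC [0-9][0-9][0-9][0-9]$/ { ts = $0; next }
        # Process rows: field 8 is the state, field 13 is the command path.
        NF >= 13 && $13 ~ /nestdb-server/ && ($8 == "D" || $8 == "Z") {
            print ts " pid=" $1 " state=" $8
        }
    ' "$@"
}

# Usage (on the affected edge node):
#   nestdb_blocked_samples /var/log/vmware/top-cpu.log
```

Timestamps reported by this filter can then be compared with the time the duplicate NestDB instances were started.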

pidof skips processes in these states by default because attempting to inspect them can cause pidof, and any script that calls it, to hang.
For further context, see the Ubuntu manpage for pidof(8) or "Why is pidof not working".
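
Unlike pidof, ps reports a process regardless of its state, so it can be used to confirm whether a process that pidof does not list is in fact present but blocked. The following is a minimal diagnostic sketch and not the fix delivered in the product; the function name list_blocked_pids is illustrative. It reads the output of ps -eo pid=,stat=,comm= and prints the PIDs of the named command whose state begins with D or Z.

```shell
#!/bin/sh
# Print the PIDs of processes named "$1" that are in uninterruptible
# sleep (D) or zombie (Z) state.
# Expects `ps -eo pid=,stat=,comm=` output on stdin:
#   field 1 = PID, field 2 = state flags (e.g. "Ssl", "D", "Z"), field 3 = command name.
list_blocked_pids() {
    awk -v name="$1" '$3 == name && $2 ~ /^[DZ]/ { print $1 }'
}

# Usage (on the affected edge node), compared with what pidof reports:
#   pidof nestdb-server
#   ps -eo pid=,stat=,comm= | list_blocked_pids nestdb-server
```

If the second command prints a PID that the first does not, the startup script's check would have missed that instance.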

 

Resolution

This issue is resolved in VMware NSX 4.2.0, available at Broadcom Downloads.

Workaround:

To work around this issue, reboot the affected edge node.

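The risk of hitting this issue can be reduced by keeping the underlying infrastructure and disk healthy, because processes typically enter the uninterruptible sleep state (D) while waiting on slow or stalled disk I/O.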