NSX Edge Configuration State "Failed" with "Caught MessagingException"

Article ID: 430888

Products

VMware NSX

Issue/Introduction

  • The NSX Edge Transport Node reports a configuration state of "Failed" in the NSX Manager UI, accompanied by an active alarm for "Failure Domain Down".
  • Clicking the "Failed" status displays the following error message:
    • "Caught MessagingException during host config stage."
  • The /var/log/syslog file on the Edge Node contains entries similar to the following:

xx--yy-zzT09:53:10.272Z edge#### NSX 3612 - [nsx@6876 comp="nsx-edge" s2comp="nsx-net" tid="3630" level="WARNING"] StreamConnection[12625 Connecting to unix:///var/run/vmware/nestdb/nestdb-server.sock 0] Couldn't connect to 'unix:///var/run/vmware/nestdb/nestdb-server.sock' (error: 2-No such file or directory)
xx--yy-zzT09:53:10.272Z edge#### NSX 3314 - [nsx@6876 comp="nsx-edge" s2comp="nestdb-client" tid="3402" level="WARNING"] NestDbClient: failed to get stub to unix:///var/run/vmware/nestdb/nestdb-server.sock, retrying in 5000 ms...

  File "/opt/vmware/nsx-netopa/lib/python/sha/core/profile/_nestdb_client.py", line 277, in _setup_monitor
    NestdbConnCtl.start(self._name)
  File "/opt/vmware/nsx-netopa/lib/python/sha/core/profile/_nestdb_conn.py", line 88, in start
    cls._client.start(name)
  File "/opt/vmware/nsx-netopa/lib/python/sha/core/profile/_nestdb_conn.py", line 38, in start
    self._stub = NsxRpcClient(NestDb_Stub, self._endpoint)
  File "/opt/vmware/nsx-netopa/lib/python/vmware/nsx/rpc/client/client.py", line 105, in __init__
    self._own_connection.Connect(endpoint, sec_ctx=sec_ctx, retry_policy=retry_policy)
  File "/opt/vmware/nsx-netopa/lib/python/vmware/nsx/rpc/client/transport.py", line 224, in Connect
    self._Connect(endpoint, sec_ctx)
  File "/opt/vmware/nsx-netopa/lib/python/vmware/nsx/rpc/client/transport.py", line 233, in _Connect
    self.sock.connect(path)
  File "/opt/vmware/nsx-netopa/lib/python/gevent/_socketcommon.py", line 590, in connect
    self._internal_connect(address)
  File "/opt/vmware/nsx-netopa/lib/python/gevent/_socketcommon.py", line 655, in _internal_connect
    raise _SocketError(result, strerror(result))
FileNotFoundError: [Errno 2] No such file or directory
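
To check how often these nestdb connection failures occur, the syslog can be scanned for the two warning messages shown above. The following is a minimal sketch, not a supported tool: the function names are hypothetical, the patterns are built only from the excerpts in this article, and other message variants are not covered.

```python
import re
from collections import Counter

# Socket path and message fragments copied from the log excerpts in this article.
NESTDB_SOCK = "unix:///var/run/vmware/nestdb/nestdb-server.sock"
PATTERNS = {
    "connect_failed": re.compile(r"Couldn't connect to '" + re.escape(NESTDB_SOCK) + r"'"),
    "stub_retry": re.compile(r"NestDbClient: failed to get stub to " + re.escape(NESTDB_SOCK)),
}


def count_nestdb_errors(lines):
    """Count how many lines match each nestdb connection-failure pattern."""
    counts = Counter()
    for line in lines:
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
    return counts


def scan_syslog(path="/var/log/syslog"):
    """Hypothetical helper: run the counter over a syslog file."""
    with open(path, errors="replace") as handle:
        return count_nestdb_errors(handle)
```

A steadily growing count while the Edge is in the "Failed" state is consistent with the symptom described here, but it does not by itself prove that storage latency is the cause.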

Environment

VMware NSX

Cause

This issue is caused by excessive latency on the underlying physical storage backing the Edge VM, preventing internal services from functioning correctly.

Diagnosis:

  • CPU and memory resources on the Edge VM are healthy, but esxtop on the hosting ESXi server shows exceptionally high DAVG (Device Average Latency) values.
  • To confirm, run esxtop on the ESXi host that hosts the Edge Node and review the DAVG values.
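
If esxtop is captured in batch mode to a CSV file, the peak DAVG per device can be extracted offline. The sketch below rests on an assumption that should be verified against your own capture: that DAVG columns are the ones whose header ends with "Average Driver MilliSec/Command". The function name is hypothetical.

```python
import csv
import io

# Assumption: in esxtop batch output, DAVG appears as columns whose header
# ends with this counter name. Verify against your own capture.
DAVG_COUNTER = "Average Driver MilliSec/Command"


def max_davg_per_column(csv_text, counter=DAVG_COUNTER):
    """Return {column header: highest value seen} for every matching column."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    wanted = [i for i, name in enumerate(header) if name.endswith(counter)]
    peaks = {header[i]: 0.0 for i in wanted}
    for row in reader:
        for i in wanted:
            try:
                value = float(row[i])
            except (IndexError, ValueError):
                continue  # skip short rows and non-numeric cells
            peaks[header[i]] = max(peaks[header[i]], value)
    return peaks
```

A capture could be produced with something like `esxtop -b -d 2 -n 30 > esxtop.csv` and its contents passed to `max_davg_per_column`. The delay and iteration count are illustrative values, not recommendations from this article.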

Impact: The high storage latency prevents the Edge's internal database (nestdb) from performing read/write operations within required timeout windows. This causes internal services to crash or stall, leading to the configuration failure and control plane disconnection.

Resolution

The primary resolution requires addressing the storage performance at the infrastructure level. The NSX Edge failure is a symptom, not the root cause.

  1. Validate Storage Health: Contact the Storage or Virtualization team to investigate the backend storage array for latency issues or high load.

  2. Monitor DAVG: Ensure the Device Average Latency (DAVG) on the ESXi host drops to normal operating levels (typically below 10-20 ms).

  3. Automatic Recovery: Once storage latency normalizes, the NSX Edge services should automatically reconnect to the internal database and the Control Plane.
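
When monitoring DAVG in step 2, it helps to require several consecutive low readings instead of a single one before treating latency as normalized. The sketch below assumes a 20 ms threshold (the upper end of the range given above) and a window of five samples. Both are illustrative assumptions, and the function name is hypothetical.

```python
def davg_is_normal(samples_ms, threshold_ms=20.0, min_samples=5):
    """True when the last `min_samples` DAVG readings are all below the threshold."""
    if len(samples_ms) < min_samples:
        return False  # not enough data to judge
    recent = samples_ms[-min_samples:]
    return all(value < threshold_ms for value in recent)
```

For example, `davg_is_normal([50, 40, 5, 4, 6, 3, 5])` evaluates to `True` because the five most recent readings are all under 20 ms, even though earlier readings were high.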