Troubleshooting NSX Host Agents

Products

VMware NSX

Issue/Introduction

NSX host agents run at the user world layer on the ESXi host. They facilitate the realization of logical switch ports and their associated configuration and properties by interacting with components on the NSX unified appliance and with other components on the ESXi host, such as kernel modules, vSphere libraries, and non-NSX agents.
NSX agents include nsx-proxy, nsx-nestdb, nsx-cfgagent, nsx-opsagent, and nsx-sfhc.

Environment

VMware NSX

VMware NSX-T Data Center

Resolution

1. Host Agent Overview

There are 3 Agents whose status is tracked in the NSX UI:

nsx-cfgagent ← Interacts with dataplane modules like VDL2, KCP, and DFW

When stopped, Host appears Down in UI

nsx-opsagent ← Includes nsx-da (inventory discovery agent which communicates with nestdb) and nsxa (deals with host switch related operations)

When stopped, Host appears Down in UI

nsx-nestdb ← Stores desired state from control plane and runtime state info from dataplane

When stopped, controller connectivity will be Down and 'get controllers' returns 'Failed to get controller list'.

Other agents are not tracked in the UI "Agent Status" pane. They are listed here with what happens when each is stopped:

nsx-proxy ← Interacts with both Policy and CCP on the NSX manager, and nsx-opsagent and nsx-nestdb on host

When stopped, NSX Configuration is listed as "Host Disconnected". Manager connectivity is Down, and Agent Status is not reported.

nsx-sfhc ← This is the installation agent for NSX deployment that communicates with MP

When stopped, UI shows host with status "Install Failed" and View Details is not available.

2. Agent Status in NSX UI

In the NSX UI for version 3.2.2 and later, Host Agent Status for the 3 tracked agents is viewed at: System > Fabric > Hosts > View Details on the Host > Monitor > Agent Status

If any one of these agents (opsagent, cfgagent, nestdb) is stopped, the Host's overall Status shows as Down.

3. Commands

To manage an agent service from the ESXi command line: /etc/init.d/<service name> status|stop|start
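The per-agent status check can be wrapped in a small loop. The following is a minimal sketch, assuming each agent's init script is named after the agent (for example /etc/init.d/nsx-proxy). The function name nsx_agent_status and the INIT_DIR override are hypothetical helpers introduced here for illustration and are not part of NSX.

```shell
# Sketch: print the init-script status of each NSX host agent named in this article.
# INIT_DIR is a hypothetical override for dry runs; it defaults to /etc/init.d.
nsx_agent_status() {
    init_dir="${INIT_DIR:-/etc/init.d}"
    for svc in nsx-proxy nsx-nestdb nsx-cfgagent nsx-opsagent nsx-sfhc; do
        if [ -x "$init_dir/$svc" ]; then
            # The wording of the status output may vary by release; print it as-is.
            printf '%s: %s\n' "$svc" "$("$init_dir/$svc" status 2>&1)"
        else
            printf '%s: init script not found\n' "$svc"
        fi
    done
}
```

After pasting the function into a shell session on the host, run nsx_agent_status. The sketch only reads status; it does not stop or start any service.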

'esxcli network ip connection list | grep 1234' and 'esxcli network ip connection list | grep 1235' show the connections to the Managers and Controller, with a World Name of 'nsx-proxy' ← If nsx-proxy is stopped, these connections are not listed
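The two grep checks can be combined into one summary. This is a sketch only: it assumes the World Name 'nsx-proxy' and the remote port appear on the same output line, that port 1234 is the Manager connection (as in the heartbeat endpoint shown in the Logs section), and that port 1235 is the Controller connection. The function name nsx_proxy_conns is hypothetical.

```shell
# Sketch: count nsx-proxy connections per port from connection-list output on stdin.
# Usage on the host: esxcli network ip connection list | nsx_proxy_conns
nsx_proxy_conns() {
    awk '
        /nsx-proxy/ {
            # Match the port only when it is not followed by another digit,
            # so that a port such as 12345 is not counted as 1234.
            if ($0 ~ /:1234([^0-9]|$)/) mp++
            if ($0 ~ /:1235([^0-9]|$)/) ccp++
        }
        END {
            printf "port 1234 (Manager): %d\n", mp
            printf "port 1235 (Controller): %d\n", ccp
        }'
}
```

A count of 0 for both ports is consistent with nsx-proxy being stopped.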

4. Logs

nsx-proxy logs are in nsx-syslog.log. They show the nsx-proxy agent's connection to the Manager components.

ag -i "nsx-proxy" var/run/log/nsx-syslog.log

nsx-proxy heartbeats from MP appear like this:

2022-08-11T04:25:56Z nsx-proxy: NSX 2101571 - [nsx@6876 comp="nsx-esx" subcomp="nsx-proxy" s2comp="mpa-proxy-lib" tid="2101571" level="INFO"] MessagingClientService: Heartbeat message received in FrameworkUnifiedMsg from endpoint: ssl://10.105.8.11:1234 client_id: a5887a11-c352-426c-951b-b2def1ea4806

2022-08-11T04:26:56Z nsx-proxy: NSX 2101571 - [nsx@6876 comp="nsx-esx" subcomp="nsx-proxy" s2comp="mpa-proxy-lib" tid="2101571" level="INFO"] MessagingClientService: Heartbeat message received in FrameworkUnifiedMsg from endpoint: ssl://10.105.8.11:1234 client_id: a5887a11-c352-426c-951b-b2def1ea4806

To check for gaps in heartbeats from the management plane, run:

ag "Heartbeat message" nsx-syslog* | awk '{print $1}' | cut -d ":" -f3- | sort -V | cut -d":" -f1 | uniq -c

A healthy host normally logs 60 heartbeats per hour (one per minute). An hourly summary that shows heartbeat loss looks like this:

60 2022-08-20T02

60 2022-08-20T03

7 2022-08-20T04 ← only 7 heartbeats from MP this hour

32 2022-08-25T00 ← only 32 heartbeats from MP this hour

60 2022-08-25T01

60 2022-08-25T02
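Hours with heartbeat loss can be picked out of such a summary automatically. The sketch below assumes the two-column "count hour" layout shown above and an expected rate of 60 heartbeats per hour. The function name flag_heartbeat_loss is a hypothetical helper, not an NSX tool.

```shell
# Sketch: read "<count> <YYYY-MM-DDTHH>" lines on stdin and print the hours
# whose count is below the expected number (first argument, default 60).
# Usage: <heartbeat summary command above> | flag_heartbeat_loss
flag_heartbeat_loss() {
    awk -v want="${1:-60}" '
        NF >= 2 && $1 + 0 < want + 0 {
            printf "%s: %d of %d heartbeats\n", $2, $1, want
        }'
}
```

Two caveats apply to this approach. An hour with no heartbeats at all produces no line in the summary, so it cannot be flagged and has to be spotted as a missing hour. The first and last hours covered by a log file may also be partial and fall below 60 without indicating a problem.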

nsx-nestdb logs are in nsx-syslog.log:

ag -i "nestdb" var/run/log/nsx-syslog.log

nsx-cfgagent logs are in nsx-syslog.log:

ag -i "cfgagent" var/run/log/nsx-syslog.log

nsx-opsagent logs are in nsx-syslog.log:

ag -i "opsagent" var/run/log/nsx-syslog.log
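Because all of these agents log to the same file, it can help to hide routine INFO lines when reviewing one agent. The following is a minimal sketch using grep, assuming the level="..." field format seen in the log samples in this article. The function name agent_non_info is hypothetical, and lines at any level other than INFO (not only WARNING) pass through.

```shell
# Sketch: show log lines for one agent, excluding level="INFO" entries.
# $1 = agent string to match (for example opsagent)
# $2 = path to the log file (for example var/run/log/nsx-syslog.log)
agent_non_info() {
    grep -i -- "$1" "$2" | grep -v 'level="INFO"'
}
```

For example, agent_non_info opsagent var/run/log/nsx-syslog.log prints only the opsagent lines that are not INFO.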

If nsxda cannot connect with nestdb, logging in nsx-syslog.log appears like this:

2022-12-01T18:39:50.058Z nsx-opsagent[2103501]: NSX 2103501 - [nsx@6876 comp="nsx-esx" subcomp="opsagent" s2comp="nsxda" tid="2105253" level="WARNING"] Waiting for NestDB to connect.

2022-12-01T18:39:53.931Z nsx-opsagent[2103501]: NSX 2103501 - [nsx@6876 comp="nsx-esx" subcomp="opsagent" s2comp="nsxda" tid="2105252" level="WARNING"] Waiting for NestDB to connect.
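To see how long nsxda was waiting, the first and last occurrence of this warning can be extracted from the log. This is a sketch under the assumption that the timestamp is the first whitespace-separated field of each line, as in the samples above. The function name nestdb_wait_window is hypothetical.

```shell
# Sketch: report how many "Waiting for NestDB to connect" warnings a log file
# contains, with the timestamps of the first and last one.
# $1 = path to the log file (for example var/run/log/nsx-syslog.log)
nestdb_wait_window() {
    awk '
        /Waiting for NestDB to connect/ {
            if (!n) { first = $1 }   # timestamp of the first warning
            last = $1                # updated on every match
            n++
        }
        END {
            if (n) {
                printf "%d warnings between %s and %s\n", n, first, last
            } else {
                print "no NestDB wait warnings found"
            }
        }' "$1"
}
```

A long span between the first and last timestamp suggests that nestdb stayed unreachable for that period, which can then be compared against the nestdb log lines for the same time.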