Troubleshooting NSX Host Agents

Article ID: 381955

Updated On: 12-19-2024

Products

VMware NSX

Issue/Introduction

  • NSX host agents run in the user world layer on the ESXi host. They facilitate the realization of logical switch ports and their associated configuration and properties by interacting with components on the NSX unified appliance, as well as with other components on the ESXi host such as kernel modules, vSphere libraries, and non-NSX agents.
  • NSX agents include nsx-proxy, nsx-nestdb, nsx-cfgagent, nsx-opsagent, and nsx-sfhc.

 

Environment

VMware NSX

VMware NSX-T Data Center

Resolution

1. Host Agent Overview

Three agents have their status tracked in the NSX UI:

nsx-cfgagent     ← Interacts with data plane modules such as VDL2, KCP, and DFW

When stopped, the host appears as Down in the UI.

nsx-opsagent    ← Includes nsxda (the inventory discovery agent, which communicates with nestdb) and nsxa (which handles host switch operations)

When stopped, the host appears as Down in the UI.

nsx-nestdb       ← Stores the desired state from the control plane and runtime state information from the data plane

When stopped, controller connectivity is Down, and the nsxcli command 'get controllers' returns 'Failed to get controller list'.

 

Other agents that are not tracked in the UI "Agent Status" pane, and what happens when they are stopped:

nsx-proxy      ← Interacts with both Policy and CCP on the NSX Manager, and with nsx-opsagent and nsx-nestdb on the host

When stopped, NSX Configuration is listed as "Host Disconnected", Manager connectivity is Down, and Agent Status is not reported.

nsx-sfhc         ← The installation agent for NSX deployment; communicates with the MP (management plane)

When stopped, the UI shows the host with status "Install Failed", and View Details is not available.
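The stop symptoms above can be collected into a lookup table, for example as the first step of a triage script. Below is a minimal Python sketch: the symptoms are transcribed from this section, while STOP_SYMPTOMS, agents_matching, and the exact symptom wording are hypothetical.

```python
from typing import Dict, List

# Symptoms seen when each agent is stopped, transcribed from this article.
# The table layout and wording are illustrative, not an NSX data format.
STOP_SYMPTOMS: Dict[str, List[str]] = {
    "nsx-cfgagent": ["Host appears Down in UI"],
    "nsx-opsagent": ["Host appears Down in UI"],
    "nsx-nestdb": [
        "Controller connectivity Down",
        "'get controllers' returns 'Failed to get controller list'",
    ],
    "nsx-proxy": [
        "NSX Configuration listed as 'Host Disconnected'",
        "Manager connectivity Down",
        "Agent Status not reported",
    ],
    "nsx-sfhc": [
        "Host status 'Install Failed'",
        "View Details not available",
    ],
}


def agents_matching(symptom: str) -> List[str]:
    """Return, sorted, the agents whose stop symptoms contain `symptom` (case-insensitive)."""
    needle = symptom.lower()
    return sorted(
        agent
        for agent, symptoms in STOP_SYMPTOMS.items()
        if any(needle in s.lower() for s in symptoms)
    )


if __name__ == "__main__":
    print(agents_matching("host appears down"))
    print(agents_matching("connectivity down"))
```

A symptom can match several agents (a Down host points at either nsx-cfgagent or nsx-opsagent), so the result is a list of candidates to confirm with the commands and logs in the sections that follow, not a diagnosis.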

2. Agent Status in NSX UI

In NSX 3.2.2 and later, the Agent Status of the three tracked agents is shown in the UI at: System > Fabric > Hosts > View Details on the host > Monitor > Agent Status

If any one of these agents (nsx-opsagent, nsx-cfgagent, nsx-nestdb) is stopped, the host's overall Status shows as Down.
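The roll-up rule described above (any one of the three tracked agents stopped means the host shows Down) can be written as a small function. This is a minimal Python sketch; the input format, a dict from agent name to "UP" or "DOWN", and the function name overall_host_status are assumptions for illustration, not the format NSX itself uses.

```python
from typing import Dict

# The three agents whose state is shown under Monitor > Agent Status.
TRACKED_AGENTS = ("nsx-opsagent", "nsx-cfgagent", "nsx-nestdb")


def overall_host_status(agent_status: Dict[str, str]) -> str:
    """Return "Down" if any tracked agent is not "UP", otherwise "Up".

    A tracked agent missing from `agent_status` is treated as not running.
    Untracked agents such as nsx-proxy are ignored here; per this article,
    they surface as other states (for example "Host Disconnected").
    """
    for agent in TRACKED_AGENTS:
        if agent_status.get(agent, "DOWN").upper() != "UP":
            return "Down"
    return "Up"


if __name__ == "__main__":
    print(overall_host_status({"nsx-opsagent": "UP", "nsx-cfgagent": "DOWN", "nsx-nestdb": "UP"}))
```

Treating a missing entry as stopped is a deliberately conservative choice: an agent that reports nothing should not make the host look healthy.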

 

3. Commands

To manage an agent service from the ESXi command line: /etc/init.d/<service name> status/stop/start (for example, /etc/init.d/nsx-proxy status)
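For scripted checks, the init-script pattern above can be wrapped so that only the agent names and actions from this article are accepted. A minimal Python sketch follows; it assumes each agent's init script is /etc/init.d/<agent name>, and service_command and run_service are hypothetical helpers. run_service only does something useful when executed on the ESXi host itself.

```python
import subprocess
from typing import List

# Agent names from this article; each is assumed to have /etc/init.d/<name>.
KNOWN_AGENTS = {"nsx-proxy", "nsx-nestdb", "nsx-cfgagent", "nsx-opsagent", "nsx-sfhc"}
ACTIONS = {"status", "stop", "start"}


def service_command(agent: str, action: str) -> List[str]:
    """Build the argv for /etc/init.d/<agent> <action>, rejecting unknown input."""
    if agent not in KNOWN_AGENTS:
        raise ValueError(f"unknown agent: {agent}")
    if action not in ACTIONS:
        raise ValueError(f"unsupported action: {action}")
    return [f"/etc/init.d/{agent}", action]


def run_service(agent: str, action: str) -> subprocess.CompletedProcess:
    """Run the init script and capture its combined output (ESXi host only)."""
    return subprocess.run(
        service_command(agent, action),
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,
        check=False,
    )


if __name__ == "__main__":
    print(service_command("nsx-proxy", "status"))
```

Building the argument list separately from running it keeps the validation testable off-host, and passing a list instead of a shell string avoids quoting problems.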

'esxcli network ip connection list | grep 1234' and 'esxcli network ip connection list | grep 1235' show the connections to the Managers (port 1234) and the Controllers (port 1235), listed with a World Name of 'nsx-proxy'     ←  If nsx-proxy is stopped, these connections are not listed
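The same check can be made stricter by parsing the esxcli output instead of using grep. The Python sketch below assumes the tabular layout shown in SAMPLE (Foreign Address as the fifth whitespace-separated field, World Name as the last), which may differ between ESXi releases, so verify it against a real host. nsx_proxy_connections and all addresses except 10.105.8.11 (taken from the heartbeat samples in this article) are hypothetical.

```python
from typing import Dict, List

# nsx-proxy peers by foreign port, as described above.
NSX_PORTS = {"1234": "Manager", "1235": "Controller"}

# Hypothetical sample in the assumed column layout.
SAMPLE = """\
Proto  Recv Q  Send Q  Local Address      Foreign Address    State        World ID  CC Algo  World Name
-----  ------  ------  -----------------  -----------------  -----------  --------  -------  ------------
tcp         0       0  10.105.8.50:29051  10.105.8.11:1234   ESTABLISHED   2101571  newreno  nsx-proxy
tcp         0       0  10.105.8.50:37212  10.105.8.12:1235   ESTABLISHED   2101571  newreno  nsx-proxy
tcp         0       0  10.105.8.50:51234  10.105.8.99:443    ESTABLISHED   2099881  newreno  hostd-worker
"""


def nsx_proxy_connections(esxcli_output: str) -> List[Dict[str, str]]:
    """Return nsx-proxy connections whose foreign port is exactly 1234 or 1235."""
    found = []
    for line in esxcli_output.splitlines():
        fields = line.split()
        if len(fields) < 6 or fields[0] not in ("tcp", "udp"):
            continue  # skip the header, the dashed separator and blank lines
        foreign, world_name = fields[4], fields[-1]
        port = foreign.rsplit(":", 1)[-1]  # rsplit also copes with IPv6 addresses
        if world_name == "nsx-proxy" and port in NSX_PORTS:
            found.append({"peer": foreign, "role": NSX_PORTS[port]})
    return found


if __name__ == "__main__":
    for conn in nsx_proxy_connections(SAMPLE):
        print(conn["role"], conn["peer"])
```

Matching the port on the foreign address avoids false positives from a plain grep 1234, which would also match the third sample row because its local port 51234 contains that string. An empty result corresponds to the "nsx-proxy is stopped" case.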

 

4. Logs

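nsx-proxy logs are in nsx-syslog.log. They show logging for the nsx-proxy agent's connection to the Manager components. The ag commands in this section use paths relative to the root of an extracted ESXi support bundle; on a live host, grep -i against /var/run/log/nsx-syslog.log finds the same lines.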

ag -i "nsx-proxy" var/run/log/nsx-syslog.log

 

nsx-proxy heartbeats from MP appear like this:

2022-08-11T04:25:56Z nsx-proxy: NSX 2101571 - [nsx@6876 comp="nsx-esx" subcomp="nsx-proxy" s2comp="mpa-proxy-lib" tid="2101571" level="INFO"] MessagingClientService: Heartbeat message received in FrameworkUnifiedMsg from endpoint: ssl://10.105.8.11:1234 client_id: a5887a11-c352-426c-951b-b2def1ea4806

2022-08-11T04:26:56Z nsx-proxy: NSX 2101571 - [nsx@6876 comp="nsx-esx" subcomp="nsx-proxy" s2comp="mpa-proxy-lib" tid="2101571" level="INFO"] MessagingClientService: Heartbeat message received in FrameworkUnifiedMsg from endpoint: ssl://10.105.8.11:1234 client_id: a5887a11-c352-426c-951b-b2def1ea4806
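Because heartbeats arrive about once a minute, as the two samples above show, gaps can also be found by comparing consecutive heartbeat timestamps directly instead of counting per hour. The Python sketch below assumes each line starts with an ISO 8601 UTC timestamp as in the samples; the 90-second threshold and the name heartbeat_gaps are illustrative choices, not NSX defaults, and the third sample timestamp is made up to create a gap.

```python
import re
from datetime import datetime, timedelta
from typing import Iterable, List, Tuple

# Leading timestamp to whole seconds; fractions and the trailing Z (UTC) are
# ignored, so all values are compared as naive UTC datetimes.
TS_RE = re.compile(r"^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})")

# Abbreviated heartbeat line in the format shown above (client_id omitted).
HB = ('{} nsx-proxy: NSX 2101571 - [nsx@6876 comp="nsx-esx" subcomp="nsx-proxy" '
      's2comp="mpa-proxy-lib" tid="2101571" level="INFO"] MessagingClientService: '
      'Heartbeat message received in FrameworkUnifiedMsg from endpoint: ssl://10.105.8.11:1234')
SAMPLE = [HB.format(ts) for ts in ("2022-08-11T04:25:56Z", "2022-08-11T04:26:56Z", "2022-08-11T04:31:56Z")]


def heartbeat_gaps(lines: Iterable[str], max_gap: timedelta = timedelta(seconds=90)) -> List[Tuple[datetime, datetime]]:
    """Return (previous, next) heartbeat pairs spaced more than `max_gap` apart."""
    stamps = []
    for line in lines:
        if "Heartbeat message received" not in line:
            continue
        match = TS_RE.match(line)
        if match:
            stamps.append(datetime.strptime(match.group(1), "%Y-%m-%dT%H:%M:%S"))
    stamps.sort()
    return [(prev, nxt) for prev, nxt in zip(stamps, stamps[1:]) if nxt - prev > max_gap]


if __name__ == "__main__":
    for prev, nxt in heartbeat_gaps(SAMPLE):
        print(f"no heartbeat between {prev:%Y-%m-%dT%H:%M:%S} and {nxt:%Y-%m-%dT%H:%M:%S} ({nxt - prev})")
```

This pinpoints each outage to the minute, which complements the coarser hourly view.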

 

To check for gaps in heartbeats from the management plane, run:

ag "Heartbeat message" nsx-syslog* | awk '{print $1}' | cut -d ":" -f3- | sort -V | cut -d":" -f1 | uniq -c

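Heartbeats arrive about once a minute, so a healthy hour shows 60. The cut -d ":" -f3- step strips the file:line: prefix that ag adds to each match when it searches more than one file. Because uniq -c only counts hours that occur in the log, an hour with no heartbeats at all is missing from the output rather than shown as 0. Heartbeat loss looks like this: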

     60 2022-08-20T02
     60 2022-08-20T03
      7 2022-08-20T04    ← only 7 heartbeats from MP this hour
     32 2022-08-25T00    ← only 32 heartbeats from MP this hour
     60 2022-08-25T01
     60 2022-08-25T02
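The hourly bucketing done by the ag | awk | cut | sort | uniq -c pipeline can also be sketched in Python, with the difference that hours containing no heartbeats are reported as 0 instead of being skipped. This minimal sketch reads log lines directly (so there is no file:line: prefix to strip) and assumes each line starts with the timestamp; hourly_heartbeats, EXPECTED_PER_HOUR, and the sample lines are hypothetical.

```python
from collections import Counter
from datetime import datetime, timedelta
from typing import Iterable, List, Tuple

EXPECTED_PER_HOUR = 60  # about one heartbeat per minute
HOUR_FMT = "%Y-%m-%dT%H"

# Hypothetical sample: a full hour, an hour with no heartbeats, then a short hour.
SAMPLE_LINES = (
    ["2022-08-20T02:%02d:10Z nsx-proxy: Heartbeat message received" % m for m in range(60)]
    + ["2022-08-20T04:%02d:10Z nsx-proxy: Heartbeat message received" % m for m in range(7)]
)


def hourly_heartbeats(lines: Iterable[str]) -> List[Tuple[str, int]]:
    """Count heartbeat lines per UTC hour, filling in hours that have none."""
    counts = Counter()
    for line in lines:
        if "Heartbeat message received" in line:
            counts[line[:13]] += 1  # "2022-08-20T04:25:56Z ..." -> "2022-08-20T04"
    if not counts:
        return []
    hour = datetime.strptime(min(counts), HOUR_FMT)
    last = datetime.strptime(max(counts), HOUR_FMT)
    result = []
    while hour <= last:
        key = hour.strftime(HOUR_FMT)
        result.append((key, counts[key]))  # Counter returns 0 for missing hours
        hour += timedelta(hours=1)
    return result


if __name__ == "__main__":
    for hour, count in hourly_heartbeats(SAMPLE_LINES):
        note = "" if count >= EXPECTED_PER_HOUR else "    <- fewer heartbeats than expected"
        print(f"{count:7d} {hour}{note}")
```

Over a multi-day span, such as the jump from 2022-08-20T04 to 2022-08-25T00 in the output above, this prints one zero row per hour. A long run of zeros may mean the log files covering that period were not searched (for example, rotated away) rather than a real outage, so confirm which files were read.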

 

nsx-nestdb logs are in nsx-syslog.log:

ag -i "nestdb" var/run/log/nsx-syslog.log

 

nsx-cfgagent logs are in nsx-syslog.log:

ag -i "cfgagent" var/run/log/nsx-syslog.log

 

nsx-opsagent logs are in nsx-syslog.log:

ag -i "opsagent" var/run/log/nsx-syslog.log
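A case-insensitive search such as ag -i "cfgagent" also matches any line that merely mentions the string. A stricter filter can key on the structured subcomp= and level= fields visible in the sample log lines in this article. The Python sketch below assumes the [nsx@6876 key="value" ...] format of those samples; parse_fields and lines_for are hypothetical helpers. The subcomp values for nsx-cfgagent and nsx-nestdb are not shown in this article, so check a real log line before filtering on them.

```python
import re
from typing import Dict, Iterable, List

# key="value" pairs inside the structured [nsx@6876 ...] block of a line.
FIELD_RE = re.compile(r'(\w+)="([^"]*)"')

# Two lines copied from the samples in this article.
SAMPLE_LINES = [
    '2022-08-11T04:25:56Z nsx-proxy: NSX 2101571 - [nsx@6876 comp="nsx-esx" subcomp="nsx-proxy" '
    's2comp="mpa-proxy-lib" tid="2101571" level="INFO"] MessagingClientService: Heartbeat message '
    'received in FrameworkUnifiedMsg from endpoint: ssl://10.105.8.11:1234 '
    'client_id: a5887a11-c352-426c-951b-b2def1ea4806',
    '2022-12-01T18:39:50.058Z nsx-opsagent[2103501]: NSX 2103501 - [nsx@6876 comp="nsx-esx" '
    'subcomp="opsagent" s2comp="nsxda" tid="2105253" level="WARNING"] Waiting for NestDB to connect.',
]


def parse_fields(line: str) -> Dict[str, str]:
    """Return the key="value" fields of one nsx-syslog line (empty dict if none)."""
    start = line.find("[nsx@")
    end = line.find("]", start)
    if start == -1 or end == -1:
        return {}
    return dict(FIELD_RE.findall(line[start:end]))


def lines_for(lines: Iterable[str], subcomp: str, level: str = "") -> List[str]:
    """Keep lines whose subcomp matches exactly, optionally filtered by level."""
    kept = []
    for line in lines:
        fields = parse_fields(line)
        if fields.get("subcomp") != subcomp:
            continue
        if level and fields.get("level") != level:
            continue
        kept.append(line)
    return kept


if __name__ == "__main__":
    for line in lines_for(SAMPLE_LINES, "opsagent", level="WARNING"):
        print(parse_fields(line)["s2comp"], line.split("] ", 1)[1])
```

Filtering on level as well makes it easy to pull only WARNING or ERROR entries for one agent out of a busy nsx-syslog.log.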

 

If nsxda cannot connect to nestdb, nsx-syslog.log contains entries like these:

2022-12-01T18:39:50.058Z nsx-opsagent[2103501]: NSX 2103501 - [nsx@6876 comp="nsx-esx" subcomp="opsagent" s2comp="nsxda" tid="2105253" level="WARNING"] Waiting for NestDB to connect.

2022-12-01T18:39:53.931Z nsx-opsagent[2103501]: NSX 2103501 - [nsx@6876 comp="nsx-esx" subcomp="opsagent" s2comp="nsxda" tid="2105252" level="WARNING"] Waiting for NestDB to connect.
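To see how long nsxda has been waiting for nestdb, the warning lines can be counted and their first and last timestamps reported. This Python sketch assumes the message text "Waiting for NestDB to connect." and the timestamp format of the two sample lines above; nestdb_wait_summary is a hypothetical helper. Sorting timestamps as strings assumes they all share one format, as they do in these samples.

```python
from typing import Iterable, Optional, Tuple

WAIT_TEXT = "Waiting for NestDB to connect."

# The two sample lines above.
SAMPLE_LINES = [
    '2022-12-01T18:39:50.058Z nsx-opsagent[2103501]: NSX 2103501 - [nsx@6876 comp="nsx-esx" '
    'subcomp="opsagent" s2comp="nsxda" tid="2105253" level="WARNING"] Waiting for NestDB to connect.',
    '2022-12-01T18:39:53.931Z nsx-opsagent[2103501]: NSX 2103501 - [nsx@6876 comp="nsx-esx" '
    'subcomp="opsagent" s2comp="nsxda" tid="2105252" level="WARNING"] Waiting for NestDB to connect.',
]


def nestdb_wait_summary(lines: Iterable[str]) -> Optional[Tuple[int, str, str]]:
    """Return (count, first, last) timestamps of nsxda NestDB wait warnings, or None."""
    stamps = sorted(
        line.split(" ", 1)[0]  # leading timestamp, e.g. 2022-12-01T18:39:50.058Z
        for line in lines
        if WAIT_TEXT in line and 's2comp="nsxda"' in line
    )
    if not stamps:
        return None
    return len(stamps), stamps[0], stamps[-1]


if __name__ == "__main__":
    print(nestdb_wait_summary(SAMPLE_LINES))
```

If the last timestamp is recent and the count keeps growing on repeated runs, the nsx-nestdb service status is a natural next check.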