Edge Agent down Alarm

Products

VMware NSX

Issue/Introduction

Title: Alarm for Edge agent liveliness down
Event ID: edge_health.edge_agent_down

Alarm Description

Purpose: This alarm informs the user that the Edge agent is down or busy. A SHA plug-in checks the edge-agent process every 60 seconds. Edge-agent is declared down if it does not respond for more than 56 seconds. High CPU load is one of the main reasons it can be busy.
Impact: When the Edge agent is down, HA might fail, FIB entries might fail to sync to the datapath, and L2 and L3 topologies might fail to be configured or maintained.
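
The timing rule in the Purpose paragraph can be illustrated with a small shell sketch. This is not the SHA plug-in's code: the function name agent_liveness and its epoch-second arguments are hypothetical, and only the 56-second threshold is taken from the description above.

```shell
# Illustrative only: not the SHA plug-in implementation.
# The threshold comes from the alarm description; everything else is assumed.
DOWN_THRESHOLD_SECS=56

# agent_liveness LAST_RESPONSE_EPOCH NOW_EPOCH
# Prints "down" when the agent has been silent for more than the threshold,
# otherwise prints "up".
agent_liveness() {
    last_response=$1
    now=$2
    silent_for=$((now - last_response))
    if [ "$silent_for" -gt "$DOWN_THRESHOLD_SECS" ]; then
        echo "down"
    else
        echo "up"
    fi
}
```

With this rule, a silence of exactly 56 seconds still counts as up; only a longer silence is reported as down.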

Environment

VMware NSX

Resolution

Steps to resolve
For release 4.2.0 and higher

Recommended Action:

Execute the following command.
Component: Edge Transport Node
user: admin
CLI: get service local-controller state

Sample output:

Edge1> get service local-controller state
Thu Dec 21 2023 UTC 06:59:34.273
Uptime: 320587.483 seconds (since 2023-12-17T13:56:26.81)
Full Sync State : Completed at {num: 2, time: 2023-12-18T08:06:39.60}
IPC Channel State
Datapath Config : Up since 2023-12-17T13:57:50.39
Datapath State : Up since 2023-12-17T13:57:50.35
Routing Service : Up since 2023-12-18T06:15:08.73
BFD Config : None
BFD State : None

If the CLI succeeds, the problem might be transient, possibly caused by high CPU load.
If the CLI fails (no output), continue to the next check.

Run the following command in the root shell of the Edge Transport Node to check whether edge-agent is running.
Component: Edge Transport Node
user: root
CLI: ps auxww | grep edge-agent

Sample output:

root@Edge1:~# ps auxww | grep edge-agent
nsxa 2797 0.0 0.4 133039232968 ? Ssl Dec17 3:24 /opt/vmware/nsx-edge/bin/edge-agent --no-chdir --unixctl=/var/run/vmware/edge/nsxa.ctl --pidfile=/var/run/vmware/edge/nsxa.pid -vconsole:err -vsyslog:info --syslog-method=udp:127.0.0.1
root 2586883 0.0 0.0 6776 2216 pts/0 S+ 06:58 0:00 grep --color=auto edge-agent
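
Note that the grep command itself appears in the listing (the last line of the sample output), so a match on "edge-agent" alone does not prove the process is running. A minimal sketch of a filter that ignores that line, assuming ps auxww output on standard input; the function name edge_agent_running is hypothetical and not part of NSX:

```shell
# edge_agent_running: reads `ps auxww` output on stdin.
# Exits 0 only if some field on some line is a path ending in "/edge-agent"
# (the agent binary). The "grep ... edge-agent" line has no such field,
# so it is not counted.
edge_agent_running() {
    awk '{
        for (i = 1; i <= NF; i++)
            if ($i ~ /\/edge-agent$/) found = 1
    }
    END { exit (found ? 0 : 1) }'
}
```

One possible use from the root shell: ps auxww | edge_agent_running && echo "edge-agent is running"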

If edge-agent is not listed in the above output, start the edge-agent process using the following CLI command.
Component: Edge Transport Node
user: admin
CLI: start service local-controller

Sample Output:

Edge1> start service local-controller
Edge1>

Run the following command in the root shell of the Edge Transport Node and check whether edge-agent is still generating syslog entries. If it is, edge-agent might be busy with other tasks and not responding to the CLI.
Component: Edge Transport Node
user: root
CLI: tail -f /var/log/syslog | grep nsxa

Sample Output:

root@Edge1:~# tail -f /var/log/syslog | grep nsxa
2023-12-19T12:14:10.040Z Edge1 NSX 1 FABRIC [nsx@6876 comp="nsx-edge" subcomp="nsxa" s2comp="ha-cluster" level="INFO"] HA port b1e57b81-ac62-4ad0-91db-ffafe1a09457 IP 169.254.0.2/24 type 2
2023-12-19T12:14:10.040Z Edge1 NSX 1 FABRIC [nsx@6876 comp="nsx-edge" subcomp="nsxa" s2comp="ha-cluster" level="INFO"] HA port b1e57b81-ac62-4ad0-91db-ffafe1a09457 IP 169.254.0.3/24 type 2
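
Because tail -f only shows new lines as they arrive, it can help to know when edge-agent last logged anything. A minimal sketch, assuming syslog lines that begin with a timestamp as in the sample above; the function name last_nsxa_log_time is hypothetical:

```shell
# last_nsxa_log_time: reads syslog lines on stdin and prints the timestamp
# (first whitespace-separated field) of the last line that mentions nsxa.
# Prints nothing if no line matches.
last_nsxa_log_time() {
    awk '/nsxa/ { ts = $1 } END { if (ts != "") print ts }'
}
```

One possible use from the root shell: tail -n 1000 /var/log/syslog | last_nsxa_log_time (the 1000-line window is an arbitrary choice). Comparing the result with the output of date -u gives a rough idea of how long edge-agent has been silent.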

If edge-agent is not generating syslog entries, it might be in a bad state. In either case, collect a support bundle and restart edge-agent using the following CLI command.
Component: Edge Transport Node
user: admin
CLI: restart service local-controller

Sample output:

Edge1>
Edge1> restart service local-controller
Edge1>
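
The checks above can be summarized as a small decision helper. This is only one reading of the steps in this article, not an NSX tool: the function name next_action and its yes/no arguments are hypothetical, and the commands it prints are the NSX CLI commands already shown above.

```shell
# next_action CLI_OK PROCESS_RUNNING   (each argument is "yes" or "no")
# CLI_OK:          did "get service local-controller state" return output?
# PROCESS_RUNNING: is edge-agent listed in the ps output?
# Prints the suggested next step according to this article.
next_action() {
    cli_ok=$1
    process_running=$2
    if [ "$cli_ok" = "yes" ]; then
        echo "likely transient (for example, high CPU load)"
    elif [ "$process_running" = "no" ]; then
        echo "start service local-controller"
    else
        # Whether or not edge-agent is still logging, the article's advice
        # is the same: collect a support bundle and restart the service.
        echo "collect support bundle, then restart service local-controller"
    fi
}
```

For example, next_action no yes prints the support-bundle-and-restart step.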

Maintenance window required for remediation? No