Edge Agent down Alarm

Article ID: 330451

Products

VMware NSX

Issue/Introduction

Title: Alarm for Edge agent liveliness down
Event ID: edge_health.edge_agent_down

Alarm Description

  • Purpose: This alarm informs the user that the edge-agent is down or busy. A SHA plug-in checks the edge-agent process every 60 seconds. The edge-agent is declared down if it does not respond for more than 56 seconds. High CPU load is one of the main reasons the edge-agent can be busy.
  • Impact: If the edge-agent is down, HA might fail, FIB entries might not sync to the datapath, and L2 and L3 topologies might not be configured or maintained.
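
The timing rule in the Purpose bullet can be illustrated with a minimal sketch. This is not the SHA plug-in's code: the names `POLL_INTERVAL_S`, `RESPONSE_TIMEOUT_S`, and `agent_is_down` are hypothetical, and only the 60-second and 56-second values come from this article.

```python
# Illustrative only: not the actual SHA plug-in implementation.
POLL_INTERVAL_S = 60      # how often the monitor checks edge-agent (value from this article)
RESPONSE_TIMEOUT_S = 56   # silence longer than this counts as "down" (value from this article)


def agent_is_down(seconds_without_response: float) -> bool:
    """Return True when edge-agent has been unresponsive for more than 56 seconds."""
    return seconds_without_response > RESPONSE_TIMEOUT_S


# Show how a few silence durations would be classified.
for silence in (5.0, 56.0, 57.5):
    print(f"{silence:5.1f}s without response -> down={agent_is_down(silence)}")
```

Because the threshold sits just under the polling interval, a short stall caused by CPU load would not be expected to raise the alarm under this rule, while a stall that spans nearly a full polling cycle would.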

Environment

VMware NSX

Resolution

Steps to resolve
For release 4.2.0 and higher

Recommended Action:

  • Execute the following command.
    Component: Edge Transport Node
    user: admin
    CLI: get service local-controller state

Sample output:

Edge1> get service local-controller state
Thu Dec 21 2023 UTC 06:59:34.273
Uptime: 320587.483 seconds (since 2023-12-17T13:56:26.81)
Full Sync State : Completed at {num: 2, time: 2023-12-18T08:06:39.60}
IPC Channel State
Datapath Config : Up since 2023-12-17T13:57:50.39
Datapath State : Up since 2023-12-17T13:57:50.35
Routing Service : Up since 2023-12-18T06:15:08.73
BFD Config : None
BFD State : None

If the CLI succeeds, the problem is probably transient and might have been caused by high CPU load.
If the CLI fails (no output), continue to the next check.
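
When the command output is captured (for example over SSH), the check can be scripted. The sketch below is an assumption about how one might parse the output, not an NSX tool: `parse_state` and `cli_responded` are hypothetical helpers, and the parser relies only on the `Name : value` layout visible in the sample output above.

```python
def parse_state(output: str) -> dict[str, str]:
    """Map each 'Name : value' line of the CLI output to a dict entry."""
    fields = {}
    for line in output.splitlines():
        name, sep, value = line.partition(" : ")
        if sep:  # skip lines without the ' : ' separator, such as the timestamp
            fields[name.strip()] = value.strip()
    return fields


def cli_responded(output: str) -> bool:
    """The article treats empty output as a failed CLI call."""
    return bool(output.strip())


# Sample output copied from this article.
SAMPLE = """\
Thu Dec 21 2023 UTC 06:59:34.273
Uptime: 320587.483 seconds (since 2023-12-17T13:56:26.81)
Full Sync State : Completed at {num: 2, time: 2023-12-18T08:06:39.60}
IPC Channel State
Datapath Config : Up since 2023-12-17T13:57:50.39
Datapath State : Up since 2023-12-17T13:57:50.35
Routing Service : Up since 2023-12-18T06:15:08.73
BFD Config : None
BFD State : None
"""

if cli_responded(SAMPLE):
    for name, value in parse_state(SAMPLE).items():
        print(f"{name}: {value}")
else:
    print("No output: continue to the next check")
```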

  • Run the following command in the root shell of the Edge Transport Node to check whether edge-agent is running.
    Component: Edge Transport Node
    user: root
    CLI: ps auxww | grep edge-agent

Sample output:

root@Edge1:~# ps auxww | grep edge-agent
nsxa 2797 0.0 0.4 133039232968 ? Ssl Dec17 3:24 /opt/vmware/nsx-edge/bin/edge-agent --no-chdir --unixctl=/var/run/vmware/edge/nsxa.ctl --pidfile=/var/run/vmware/edge/nsxa.pid -vconsole:err -vsyslog:info --syslog-method=udp:127.0.0.1
root 2586883 0.0 0.0 6776 2216 pts/0 S+ 06:58 0:00 grep --color=auto edge-agent
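
Note that the `grep` process itself appears in the output, as the second line of the sample shows, so a match on the text `edge-agent` alone does not prove the process is running. The sketch below shows one way to tell the two apart when the output is captured by a script. It assumes the standard 11-column `ps aux` layout with the command in the last column; `edge_agent_running` is a hypothetical helper, and the VSZ and RSS values in the example agent line are placeholders.

```python
def edge_agent_running(ps_output: str) -> bool:
    """Return True if a listed process has 'edge-agent' as its executable name."""
    for line in ps_output.splitlines():
        fields = line.split(None, 10)  # 10 fixed columns, then the full command
        if len(fields) < 11:
            continue
        executable = fields[10].split()[0]
        # Compare the basename so that 'grep ... edge-agent' is not counted.
        if executable.rsplit("/", 1)[-1] == "edge-agent":
            return True
    return False


# The grep line is taken from this article; the agent line uses placeholder memory columns.
GREP_ONLY = "root 2586883 0.0 0.0 6776 2216 pts/0 S+ 06:58 0:00 grep --color=auto edge-agent"
WITH_AGENT = (
    "nsxa 2797 0.0 0.4 1000 2000 ? Ssl Dec17 3:24 "
    "/opt/vmware/nsx-edge/bin/edge-agent --no-chdir\n" + GREP_ONLY
)

print(edge_agent_running(GREP_ONLY))
print(edge_agent_running(WITH_AGENT))
```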

  • If edge-agent is not listed in the above output, start the edge-agent process using the following CLI command.
    Component: Edge Transport Node
    user: admin
    CLI: start service local-controller

Sample Output:

Edge1> start service local-controller
Edge1>

  • Run the following command in the root shell of the Edge Transport Node and check whether edge-agent is still writing to syslog. If it is, edge-agent might be busy with other tasks and not responding to the CLI.
    Component: Edge Transport Node
    user: root
    CLI: tail -f /var/log/syslog | grep nsxa

Sample Output:

root@Edge1:~# tail -f /var/log/syslog | grep nsxa
2023-12-19T12:14:10.040Z Edge1 NSX 1 FABRIC [nsx@6876 comp="nsx-edge" subcomp="nsxa" s2comp="ha-cluster" level="INFO"] HA port b1e57b81-ac62-4ad0-91db-ffafe1a09457 IP 169.254.0.2/24 type 2
2023-12-19T12:14:10.040Z Edge1 NSX 1 FABRIC [nsx@6876 comp="nsx-edge" subcomp="nsxa" s2comp="ha-cluster" level="INFO"] HA port b1e57b81-ac62-4ad0-91db-ffafe1a09457 IP 169.254.0.3/24 type 2
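
If the syslog is collected offline rather than watched live, the "still writing to syslog" check can be approximated with a script. The following is a sketch under stated assumptions: `latest_nsxa_timestamp` and `still_logging` are hypothetical helpers, the 300-second window is an arbitrary illustration rather than a documented threshold, and the parser assumes each line starts with a UTC timestamp in the format shown in the sample output.

```python
import re
from datetime import datetime, timezone
from typing import Iterable, Optional

# Matches subcomp=nsxa with or without a surrounding quote character.
NSXA_PATTERN = re.compile(r"subcomp=\W?nsxa\b")


def latest_nsxa_timestamp(lines: Iterable[str]) -> Optional[datetime]:
    """Return the newest timestamp among edge-agent (nsxa) syslog lines, if any."""
    latest = None
    for line in lines:
        if not NSXA_PATTERN.search(line):
            continue
        stamp = line.split(" ", 1)[0]
        try:
            when = datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%S.%fZ")
        except ValueError:
            continue  # skip lines whose first field is not a timestamp
        when = when.replace(tzinfo=timezone.utc)
        if latest is None or when > latest:
            latest = when
    return latest


def still_logging(lines: Iterable[str], now: datetime, window_s: float = 300.0) -> bool:
    """True if edge-agent logged within the last window_s seconds (assumed window)."""
    latest = latest_nsxa_timestamp(lines)
    return latest is not None and (now - latest).total_seconds() <= window_s
```

A call such as `still_logging(open("/var/log/syslog"), datetime.now(timezone.utc))` would then report whether any recent edge-agent entry exists.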

  • If there are no syslog entries from edge-agent, it might be in a bad state. In either case, collect a support bundle and restart edge-agent using the following CLI command.
    Component: Edge Transport Node
    user: admin
    CLI: restart service local-controller

Sample output:

Edge1>
Edge1> restart service local-controller
Edge1>

Maintenance window required for remediation? No
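
The checks in this article form a short decision sequence, which can be summarized as the sketch below. It only restates the steps above for readability; `recommended_action` and its parameters are hypothetical names, and the returned strings are summaries, not CLI output.

```python
def recommended_action(cli_responded: bool, process_listed: bool, syslog_active: bool) -> str:
    """Summarize the next step described in this article for a given set of findings."""
    if cli_responded:
        # 'get service local-controller state' returned output.
        return "Likely transient; CPU load might be high."
    if not process_listed:
        # edge-agent is missing from the 'ps auxww | grep edge-agent' output.
        return "Run: start service local-controller"
    if syslog_active:
        diagnosis = "edge-agent might be busy and not responding to the CLI."
    else:
        diagnosis = "edge-agent might be in a bad state."
    # In either case the article recommends the same remediation.
    return diagnosis + " Collect a support bundle, then run: restart service local-controller"


print(recommended_action(cli_responded=False, process_listed=True, syslog_active=False))
```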