Title: Alarm for Edge agent liveliness down
Event ID: edge_health.edge_agent_down
Alarm Description
Steps to resolve
For release 4.2.0 and higher
Recommended Action:
get service local-controller state
Sample output:
Edge1> get service local-controller state
Thu Dec 21 2023 UTC 06:59:34.273
Uptime: 320587.483 seconds (since 2023-12-17T13:56:26.81)
Full Sync State : Completed at {num: 2, time: 2023-12-18T08:06:39.60}
IPC Channel State
Datapath Config : Up since 2023-12-17T13:57:50.39
Datapath State : Up since 2023-12-17T13:57:50.35
Routing Service : Up since 2023-12-18T06:15:08.73
BFD Config :
NoneBFD State : None
If the CLI succeeds, it might be a transient problem where the CPU load might be high.
If CLI fails (no output), continue to next check.
root
ps auxww | grep edge-agent
Sample output:
root@Edge1:~# ps aux. | grep edge-agent
nsxa 2797 0.0 0.4 133039232968 ? Ssl Dec17 3:24 /opt/vmware/nsx-edge/bin/edge-agent --no-chdir --unixctl=/var/run/vmware/edge/nsxa.ctl --pidfile=/var/run/vmware/edge/nsxa.pid -vconsole:err -vsyslog:info --syslog-method=udp:127.0.0.1
root 2586883 0.0 0.0 6776 2216 pts/0 S+ 06:58 0:00 grep --color=auto edge-agent.
admin
start service local-controller
Sample Output:Edge1> start service local-controller
Edge1>.
root
tail -f /var/log/syslog | grep subcomp=‘nsxa’
Sample Output:
root@Edge1:~# tail -f /var/log/syslog | grep nsxa
2023-12-19T12:14:10.040Z Edge1 NSX 1 FABRIC [nsx@6876 comp=‘nsx-edge’ subcomp=‘nsxa’ s2comp=‘ha-cluster’ level=‘INFO’] HA port b1e57b81-ac62-4ad0-91db-ffafe1a09457 IP 169.254.0.2/24 type 2
2023-12-19T12:14:10.040Z Edge1 NSX 1 FABRIC [nsx@6876 comp=‘nsx-edge’ subcomp=‘nsxa’ s2comp=‘ha-cluster’ level=‘INFO’] HA port b1e57b81-ac62-4ad0-91db-ffafe1a09457 IP 169.254.0.3/24 type 2
restart service local-controller
Sample output:
Edge1>
Edge1> restart service local-controller
Edge1> .
Maintenance window required for remediation? No