Monitor Data Repository node processes for node outage

Products

Network Observability CA Performance Management

Issue/Introduction

When the DX NetOps Performance Management (PM) Data Repository database node(s) go down, how can we monitor the Vertica processes?

Which Vertica processes stop on a Data Repository cluster node when that node leaves the cluster?

Monitoring Vertica processes for a down node.

Environment

All supported DX NetOps Performance Management releases

Cause

Administrators need to be alerted when a database node goes down.

Resolution

The following samples were taken from node0001 in a three-node cluster. Default install paths are shown.

Process list for a running node.

[root@node0001_HostName ~]# ps -ef | grep vertica
dradmin 7618 1 0 Apr30 ? 00:00:00 /bin/bash /opt/vertica/agent/agent.sh /opt/vertica/config/users/dradmin/agent.conf
dradmin 7637 7618 0 Apr30 ? 09:24:20 /opt/vertica/oss/python/bin/python ./simply_fast.py
dradmin 37070 1 0 Apr30 ? 02:37:42 /opt/vertica/spread/sbin/spread -c /loddisk/data/drdata/v_drdata_node0001_catalog/spread.conf -D /opt/vertica/spread/tmp
dradmin 37072 1 6 Apr30 ? 2-22:38:35 /opt/vertica/bin/vertica -D /loddisk/data/drdata/v_drdata_node0001_catalog -C drdata -n v_drdata_node0001 -h <node0001_IP_Address> -p 5433 -P 4803 -Y ipv4 -S 10263370
dradmin 37107 37072 0 Apr30 ? 00:19:45 /opt/vertica/bin/vertica-udx-zygote 13 3 37072 debug-log-off /loddisk/data/drdata/v_drdata_node0001_catalog/UDxLogs 60 14 0

Process list for the same node with the database down on the node.

[root@node0001_HostName ~]# ps -ef | grep vertica
dradmin 7618 1 0 Apr30 ? 00:00:00 /bin/bash /opt/vertica/agent/agent.sh /opt/vertica/config/users/dradmin/agent.conf
dradmin 7637 7618 0 Apr30 ? 09:24:21 /opt/vertica/oss/python/bin/python ./simply_fast.py

The absence of the spread, vertica, and vertica-udx-zygote processes indicates the database is down on the node.

If using process monitoring tools, the missing processes can be used as a trigger to indicate the database is down on that node.
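Where no dedicated process monitoring tool is available, the check described above can be sketched as a small shell function. This is a minimal sketch, not a supported utility: the function name missing_dr_processes is hypothetical, and the executable paths are assumed to match the default install paths in the samples above. It reads a process listing on standard input and prints the names of the expected processes that are not present.

```shell
#!/bin/sh
# Hypothetical helper (not part of the product): reads a process listing on
# stdin and prints, space separated, which of the three database processes
# from the samples above are missing. An empty line means all were found.
# Assumption: executables live in the default install paths.
missing_dr_processes() {
    ps_output="$(cat)"
    missing=""
    for entry in \
        "spread:/opt/vertica/spread/sbin/spread" \
        "vertica:/opt/vertica/bin/vertica" \
        "vertica-udx-zygote:/opt/vertica/bin/vertica-udx-zygote"
    do
        name="${entry%%:*}"   # text before the first colon
        path="${entry#*:}"    # text after the first colon
        # Require whitespace or a line boundary on both sides of the path so
        # that "vertica" does not also match "vertica-udx-zygote".
        if ! printf '%s\n' "$ps_output" |
            grep -Eq "(^|[[:space:]])${path}([[:space:]]|\$)"; then
            missing="${missing:+$missing }$name"
        fi
    done
    printf '%s\n' "$missing"
}
```

A possible use is to pipe the output of ps -eo args (or ps -ef) into the function, for example from a scheduled job, and raise an alert when the result is not empty. Matching on the full executable path also avoids counting the grep command itself, which appears in the output of ps -ef | grep vertica.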

Additional Information

An alternative to monitoring the processes is to use the Events available in PM by default. Notification Rules in PM can be set up to email users when these Events are raised.

The following steps use the currently available system-based Data Repository State Events from the Data Aggregator Data Source.

Create Notification "Data Repository State"

Select Next twice

In Data Source select:

CA Performance Center

Data Aggregator@...

Select Event Type

Select Next

Enable the Email action, the Send Trap action, or both

Note that when a single-node database goes down, the Data Aggregator may already be shutting down due to the loss of the database by the time its Event engine recognizes the problem and tries to raise the default Events. In that case, no Events are raised.