Monitor Data Repository node processes for node outage

Article ID: 217325


Products

DX NetOps CA Performance Management - Usage and Administration

Issue/Introduction

How can the Vertica processes be monitored to detect when a DX NetOps Performance Management (PM) Data Repository database node goes down?

Which Vertica processes stop on a node in a Data Repository database cluster when that node leaves the cluster?

Monitoring Vertica processes for a down node.

Environment

All supported DX NetOps Performance Management releases

Cause

Administrators need to be alerted when a database node goes down.

Resolution

The following samples were taken from node0001 in a three-node cluster. Default install paths are shown.

Process list for a running node.

[root@node0001_HostName ~]# ps -ef | grep vertica
dradmin    7618      1  0 Apr30 ?        00:00:00 /bin/bash /opt/vertica/agent/agent.sh /opt/vertica/config/users/dradmin/agent.conf
dradmin    7637   7618  0 Apr30 ?        09:24:20 /opt/vertica/oss/python/bin/python ./simply_fast.py
dradmin   37070      1  0 Apr30 ?        02:37:42 /opt/vertica/spread/sbin/spread -c /loddisk/data/drdata/v_drdata_node0001_catalog/spread.conf -D /opt/vertica/spread/tmp
dradmin   37072      1  6 Apr30 ?        2-22:38:35 /opt/vertica/bin/vertica -D /loddisk/data/drdata/v_drdata_node0001_catalog -C drdata -n v_drdata_node0001 -h <node0001_IP_Address> -p 5433 -P 4803 -Y ipv4 -S 10263370
dradmin   37107  37072  0 Apr30 ?        00:19:45 /opt/vertica/bin/vertica-udx-zygote 13 3 37072 debug-log-off /loddisk/data/drdata/v_drdata_node0001_catalog/UDxLogs 60 14 0

Process list for the same node with the database down on the node.

[root@node0001_HostName ~]# ps -ef | grep vertica
dradmin    7618      1  0 Apr30 ?        00:00:00 /bin/bash /opt/vertica/agent/agent.sh /opt/vertica/config/users/dradmin/agent.conf
dradmin    7637   7618  0 Apr30 ?        09:24:21 /opt/vertica/oss/python/bin/python ./simply_fast.py

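The absence of the spread, vertica, and vertica-udx-zygote processes indicates that the database is down on the node. The agent.sh and simply_fast.py processes keep running in both cases, so they do not indicate the database state.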

If using process monitoring tools, the missing processes can be used as a trigger to indicate the database is down on that node.
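As a minimal sketch of such a trigger, the following shell function reads "ps -ef" style output and prints the name of each expected process that is missing. The function name missing_dr_processes is hypothetical, and the match strings are assumptions taken from the default install paths in the samples above; adjust them for a non-default install.

```shell
# Sketch only: the match strings assume the default install paths shown in
# the sample process lists. missing_dr_processes is a hypothetical helper.

# Reads a "ps -ef" style listing on stdin and prints the name of each
# expected Data Repository process that does not appear in it.
# Returns 0 when all three processes are present, 1 otherwise.
missing_dr_processes() {
    listing=$(cat)
    missing=0
    # Each entry is name:match-string. The trailing space keeps the
    # "vertica" entry from matching the vertica-udx-zygote command line.
    for entry in \
        "spread:/opt/vertica/spread/sbin/spread " \
        "vertica:/opt/vertica/bin/vertica " \
        "vertica-udx-zygote:/opt/vertica/bin/vertica-udx-zygote "
    do
        name=${entry%%:*}
        pattern=${entry#*:}
        case "$listing" in
            *"$pattern"*) ;;                    # process found
            *) echo "$name"; missing=1 ;;       # process not found
        esac
    done
    return "$missing"
}

# Example use on a node (left commented out so that sourcing this file
# has no side effects):
#   ps -ef | missing_dr_processes || echo "Data Repository is down on $(hostname)"
```

A non-zero return status can serve as the alert condition in a cron job or a monitoring tool's script check, with the printed names showing which processes are missing.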

Additional Information

An alternative to process monitoring is to use the Events that are available in PM by default. Notification Rules in PM can be set up to send emails to users when these Events are raised.

System-based Data Repository State Events are currently available from the Data Aggregator Data Source. To set up a notification for them:

Create a Notification named "Data Repository State".

Select Next twice.

In Data Source, select:
CA Performance Center
Data Aggregator@...

Select the Event Type.

Select Next.

Enable the Email action, the Send Trap action, or both.

 

Note that when a single-node database goes down, the Data Aggregator may already be shutting down due to loss of the database by the time its Event engine recognizes the problem and tries to raise the default Events. In that case, no Events are raised.