When the DX NetOps Performance Management (PM) Data Repository database node(s) go down, how can we monitor the Vertica processes?
Which Vertica processes on a Data Repository database cluster node go down when that node leaves the cluster?
Monitoring Vertica processes for a down node.
All supported DX NetOps Performance Management releases
Need to be alerted when a database node goes down.
The following samples were taken from node0001 in a three-node cluster. Default install paths are shown.
Process list for a running node.
[root@node0001_HostName ~]# ps -ef | grep vertica
dradmin 7618 1 0 Apr30 ? 00:00:00 /bin/bash /opt/vertica/agent/agent.sh /opt/vertica/config/users/dradmin/agent.conf
dradmin 7637 7618 0 Apr30 ? 09:24:20 /opt/vertica/oss/python/bin/python ./simply_fast.py
dradmin 37070 1 0 Apr30 ? 02:37:42 /opt/vertica/spread/sbin/spread -c /loddisk/data/drdata/v_drdata_node0001_catalog/spread.conf -D /opt/vertica/spread/tmp
dradmin 37072 1 6 Apr30 ? 2-22:38:35 /opt/vertica/bin/vertica -D /loddisk/data/drdata/v_drdata_node0001_catalog -C drdata -n v_drdata_node0001 -h <node0001_IP_Address> -p 5433 -P 4803 -Y ipv4 -S 10263370
dradmin 37107 37072 0 Apr30 ? 00:19:45 /opt/vertica/bin/vertica-udx-zygote 13 3 37072 debug-log-off /loddisk/data/drdata/v_drdata_node0001_catalog/UDxLogs 60 14 0
Process list for the same node with the database down on the node.
[root@node0001_HostName ~]# ps -ef | grep vertica
dradmin 7618 1 0 Apr30 ? 00:00:00 /bin/bash /opt/vertica/agent/agent.sh /opt/vertica/config/users/dradmin/agent.conf
dradmin 7637 7618 0 Apr30 ? 09:24:21 /opt/vertica/oss/python/bin/python ./simply_fast.py
The missing spread, vertica, and vertica-udx-zygote processes indicate the database is down on the node.
If using process monitoring tools, the missing processes can be used as a trigger to indicate the database is down on that node.
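The check described above could be scripted along the following lines. This is a minimal sketch, not a supported tool: the function names `missing_vertica_procs` and `check_node` are hypothetical, the binary paths assume the default install location shown in the samples, and `ps -e -o args=` assumes a Linux host with procps.

```shell
#!/bin/sh
# Sketch only: function names are hypothetical and the paths assume the
# default /opt/vertica install location shown in the samples above.

# Reads one command line per row on stdin (for example from `ps -e -o args=`)
# and prints the name of each expected database process that is absent.
missing_vertica_procs() {
    ps_output=$(cat)
    for path in \
        /opt/vertica/spread/sbin/spread \
        /opt/vertica/bin/vertica \
        /opt/vertica/bin/vertica-udx-zygote
    do
        # Require the full binary path followed by a space or end of line,
        # so that "vertica" does not also match "vertica-udx-zygote".
        if ! printf '%s\n' "$ps_output" | grep -Eq "^${path}( |\$)"; then
            printf '%s\n' "${path##*/}"
        fi
    done
}

# Runs the check against the live process table. Returns 1 and prints the
# missing process names when the database appears to be down on this node.
check_node() {
    missing=$(ps -e -o args= | missing_vertica_procs)
    if [ -n "$missing" ]; then
        echo "Data Repository appears down on this node. Missing processes:"
        echo "$missing"
        return 1
    fi
    return 0
}
```

Calling `check_node` from cron or from a monitoring tool's custom check would give a non-zero return code on a node in the second state shown above, where only the agent processes remain.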
An alternative to process monitoring is to use the Events that are available in PM by default. Notification Rules in PM can be set up to email users when these Events are raised.
The following are the currently available system-based Data Repository State Events from the Data Aggregator Data Source.
Create Notification "Data Repository State"
Select Next twice
Select Event Type
Select Next
You can enable Email, Send Trap, or both
Note that when a single-node database goes down, the Data Aggregator may already be shutting down due to loss of the database by the time its Event engine recognizes the problem and tries to raise the default Events. In that case, no Events are raised.