Data Repository cluster regularly shuts down unexpectedly.

Article ID: 8334

Updated On:

Products

CA Infrastructure Management
CA Performance Management - Usage and Administration

Issue/Introduction

Data Repository (DR) Database (DB) stability issues are observed in a three-node cluster. The DB is frequently found down with no known user interaction. After each outage, the DB can be restarted without error.

Environment

All supported DX NetOps Performance Management releases

Cause

A Vertica DB cluster shuts itself down when the majority of its nodes are seen as down. In a standard three-node cluster, this means that if two of the three nodes are seen as down, Vertica protects the remaining node known to be running by shutting it down as well.
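
The shutdown rule described above can be sketched as a small check. This is a minimal illustration, not Vertica's implementation: the function name `cluster_stays_up` is hypothetical, and it assumes the rule is "a strict majority of nodes must be seen as up". The article itself only describes the three-node case.

```python
def cluster_stays_up(total_nodes: int, up_nodes: int) -> bool:
    """Return True if a strict majority of nodes is seen as up.

    Hypothetical helper that models the rule described in this article;
    it is not part of any Vertica API.
    """
    if total_nodes <= 0 or not 0 <= up_nodes <= total_nodes:
        raise ValueError("need total_nodes > 0 and 0 <= up_nodes <= total_nodes")
    # Assumption: the cluster keeps running only while more than half
    # of its nodes are seen as up.
    return up_nodes * 2 > total_nodes


print(cluster_stays_up(3, 2))  # True: one of three nodes is down
print(cluster_stays_up(3, 1))  # False: two of three nodes are down
```

In the three-node case, losing sight of a single node is tolerated, while losing sight of two nodes at once is not.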

In this case, messages in the nodes' vertica.log files point to network disconnects between the nodes as the cause.

This message is observed from the node0001 vertica.log: 

2017-09-30 01:48:59.174 Spread Client:0x93d8550 [Comms] <INFO> nodeSetNotifier: node v_drdata_node0003 left the cluster 

This then aligns with this message in the node0003 vertica.log file: 

2017-09-30 01:48:59.173 Spread Client:0x832f9d0 [Comms] <INFO> NETWORK change with 2 VS sets 

Looking in the node0001 vertica.log we see similar messages for node0002:

2017-09-30 01:48:59.173 Spread Client:0x93d8550 [Comms] <INFO> nodeSetNotifier: node v_drdata_node0002 left the cluster 

The node0002 log file shows the same message that we see in the node0003 log:

2017-09-30 01:48:59.172 Spread Client:0x9019170 [Comms] <INFO> NETWORK change with 2 VS sets 
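
Lining up these messages by hand across several log files is error-prone, so the comparison can be sketched in code. The following is a minimal sketch, assuming the timestamp layout shown in the excerpts above. The function name `merge_membership_events` and the two marker strings are illustrative choices, not part of any Vertica tooling, and the sample lines are the ones quoted in this article.

```python
import re
from datetime import datetime, timedelta

# Timestamp prefix as it appears in the vertica.log excerpts quoted above.
TS_RE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}) ")
# Message fragments taken from the excerpts; other wording is not matched.
MEMBERSHIP_MARKERS = ("left the cluster", "NETWORK change")


def merge_membership_events(logs):
    """Merge membership messages from several node logs into one timeline.

    `logs` maps a label (for example the node name) to an iterable of log
    lines. Returns (timestamp, label, line) tuples sorted by timestamp.
    """
    events = []
    for label, lines in logs.items():
        for line in lines:
            match = TS_RE.match(line)
            if match and any(marker in line for marker in MEMBERSHIP_MARKERS):
                ts = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S.%f")
                events.append((ts, label, line.strip()))
    events.sort(key=lambda event: event[0])
    return events


LOGS = {
    "node0001": [
        "2017-09-30 01:48:59.174 Spread Client:0x93d8550 [Comms] <INFO> nodeSetNotifier: node v_drdata_node0003 left the cluster",
        "2017-09-30 01:48:59.173 Spread Client:0x93d8550 [Comms] <INFO> nodeSetNotifier: node v_drdata_node0002 left the cluster",
    ],
    "node0002": [
        "2017-09-30 01:48:59.172 Spread Client:0x9019170 [Comms] <INFO> NETWORK change with 2 VS sets",
    ],
    "node0003": [
        "2017-09-30 01:48:59.173 Spread Client:0x832f9d0 [Comms] <INFO> NETWORK change with 2 VS sets",
    ],
}

timeline = merge_membership_events(LOGS)
# Time between the first and last membership message, in milliseconds.
span_ms = (timeline[-1][0] - timeline[0][0]) / timedelta(milliseconds=1)
for ts, label, line in timeline:
    print(label, line)
print("span in ms:", span_ms)
```

A span of only a few milliseconds across all nodes, as in this sample, is consistent with a single network event rather than independent node failures.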

That leaves us with a Vertica system that sees node0001 as the lone remaining member of a three-node cluster. This violates K-safety and triggers the shutdown cycle.

2017-09-30 01:48:59.174 Spread Client:0x93d8550 [Recover] <INFO> Cluster partitioned: 3 total nodes, 1 up nodes, 2 down nodes
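
To find every occurrence of this message in a large vertica.log, the line can be parsed into its node counts. This is a minimal sketch: the pattern is derived only from the single line quoted above, so other Vertica releases may word the message differently, and `parse_partition_line` is a hypothetical helper name.

```python
import re

# Pattern built from the "Cluster partitioned" line quoted in this article.
PARTITION_RE = re.compile(
    r"Cluster partitioned: (\d+) total nodes, (\d+) up nodes, (\d+) down nodes"
)


def parse_partition_line(line):
    """Return the node counts from a 'Cluster partitioned' line, or None."""
    match = PARTITION_RE.search(line)
    if match is None:
        return None
    total, up, down = (int(group) for group in match.groups())
    return {"total": total, "up": up, "down": down}


sample = (
    "2017-09-30 01:48:59.174 Spread Client:0x93d8550 [Recover] <INFO> "
    "Cluster partitioned: 3 total nodes, 1 up nodes, 2 down nodes"
)
counts = parse_partition_line(sample)
print(counts)
```

Collecting the timestamps of the matching lines gives the network team a list of outage times to compare against their own monitoring data.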

Resolution

The internal network team will need to determine why the nodes are losing network connectivity to each other at these times.

Once the cause is identified and the network problems between the cluster nodes are resolved, the outages should cease.