
Vertica stops without warning on the CA Performance Management (CAPM) Data Repository (DR)


Article ID: 229196



Products

CA Performance Management - Usage and Administration DX NetOps

Issue/Introduction

Vertica stops for no apparent reason, and does so repeatedly over a number of days. Each time it restarts without issue:

[[email protected]]$ ./admintools -t list_allnodes

 Node              | Host            | State | Version         | DB
-------------------+-----------------+-------+-----------------+--------
 v_drdata_node0001 | xxx.xxx.xxx.xx9 | UP    | vertica-9.1.1.5 | drdata
 v_drdata_node0002 | xxx.xxx.xxx.xx0 | UP    | vertica-9.1.1.5 | drdata
 v_drdata_node0003 | xxx.xxx.xxx.xx1 | UP    | vertica-9.1.1.5 | drdata

It then stops again without any apparent cause. What is causing this? It brings the system to a halt, because the Data Aggregator (DA) can no longer function.

Cause

The cause is not Vertica itself but a network problem: the nodes lose their connection to each other and the cluster goes down, as the following entries in the Vertica log show:

v_drdata_node0001/normal/vertica.log-20211118:2021-11-18 02:15:52.808 Spread Service InOrder Queue:7ff50fc98700 [Comms] <INFO> NETWORK change with 1 VS sets

v_drdata_node0001/normal/vertica.log-20211118:2021-11-18 02:15:52.808 Spread Service InOrder Queue:7ff50fc98700 [Comms] <INFO> VS set #0 (mine) has 2 members (offset=36)

v_drdata_node0001/normal/vertica.log-20211118:2021-11-18 02:15:52.808 Spread Service InOrder Queue:7ff50fc98700 [Comms] <INFO> VS set #0, member 0: #node_a#N172020110139

v_drdata_node0001/normal/vertica.log-20211118:2021-11-18 02:15:52.808 Spread Service InOrder Queue:7ff50fc98700 [Comms] <INFO> VS set #0, member 1: #node_c#N172020110141

v_drdata_node0001/normal/vertica.log-20211118:2021-11-18 02:19:44.978 Spread Service InOrder Queue:7ff50fc98700 [Comms] <INFO> NETWORK change with 2 VS sets

This can happen when network delays or temporary pauses in a virtual environment last longer than the Spread timeout (8 seconds). Without that communication between the nodes, the nodes leave the cluster.
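To see how often this happens, the Spread membership changes can be pulled out of the log. The following is a minimal sketch, assuming the log lines have the same format as the excerpt above. The function name `spread_network_changes` is illustrative and not part of Vertica or CAPM.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not shipped with Vertica): list Spread membership changes.
# Assumes vertica.log lines look like the excerpt above, i.e.
#   <date> <time> ... [Comms] <INFO> NETWORK change with <n> VS sets
# Usage: spread_network_changes /path/to/vertica.log [more logs...]
spread_network_changes() {
  # -h suppresses file-name prefixes so the output is the same for one or many files.
  grep -h 'NETWORK change with [0-9][0-9]* VS sets' "$@" |
    # Keep only the timestamp (to the second) and the reported number of VS sets.
    sed -E 's/^.*([0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2})[^ ]* .*NETWORK change with ([0-9]+) VS sets.*$/\1 vs_sets=\2/'
}
```

Each output line gives the time of one membership change, which can then be compared against events on the network or the virtualization platform.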

Environment

DX NetOps CAPM Release: 20.2

Resolution

This issue is uncommon in a healthy environment, so it can usually be treated as an isolated event. However, you can review the Spread best practices:
       
Vertica KB : Spread Configuration Best Practices

Alternatively, if you are on Vertica 9.2.1 or higher (Vertica is upgraded to 10.x in CAPM 21.2.3 and later), you can raise the Spread timeout to mitigate this kind of error. The value is given in milliseconds, so the following query in Vertica sets it to 35 seconds:

SELECT SET_SPREAD_OPTION( 'TokenTimeout', '35000');

You can check if the value was changed using the following query:

SELECT * FROM V_MONITOR.SPREAD_STATE;

For the situation at hand, you will need to find out what is happening on the network or the VM platform that causes the nodes to lose connection. For example, check whether a maintenance task is being run on the VMs, vCenter, or the network itself at the times the cluster goes down.
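One way to look for a scheduled cause is to count the Spread membership changes per hour of day and compare the busiest hours with known maintenance or backup windows. This is a sketch only, assuming the vertica.log line format shown in the Cause section. The function name `spread_changes_by_hour` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not shipped with Vertica): count Spread
# "NETWORK change" log entries per hour of day.
# Usage: spread_changes_by_hour /path/to/vertica.log [more logs...]
spread_changes_by_hour() {
  grep -h 'NETWORK change with' "$@" |
    # Reduce each line to the two-digit hour of its timestamp.
    sed -E 's/^.*[0-9]{4}-[0-9]{2}-[0-9]{2} ([0-9]{2}):[0-9]{2}:[0-9]{2}.*$/\1/' |
    sort | uniq -c |
    # uniq -c prints "<count> <hour>"; show it as "<hour>:00 <count>".
    awk '{ printf "%s:00 %d\n", $2, $1 }'
}
```

If the counts cluster around the same hour on several days, that points to a recurring job rather than a random network fault.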