Vertica stops for no apparent reason and has done so repeatedly over a number of days. Although it restarts without issue:
[dradmin@dr]$ ./admintools -t list_allnodes
Node | Host | State | Version | DB
-------------------+----------------+-------+-----------------+--------
v_drdata_node0001 | xxx.xxx.xxx.xx9 | UP | vertica-9.1.1.5 | drdata
v_drdata_node0002 | xxx.xxx.xxx.xx0 | UP | vertica-9.1.1.5 | drdata
v_drdata_node0003 | xxx.xxx.xxx.xx1 | UP | vertica-9.1.1.5 | drdata
it stops again without any apparent cause. What is causing this? It brings the system to a halt because the Data Aggregator (DA) can no longer function.
DX NetOps CAPM Release : 20.2
This is not a problem within Vertica itself but a network problem: the nodes lose their connection to each other and the cluster goes down, as shown by the following entries in the vertica.log:
v_drdata_node0001/normal/vertica.log-20211118:2021-11-18 02:15:52.808 Spread Service InOrder Queue:7ff50fc98700 [Comms] <INFO> NETWORK change with 1 VS sets
v_drdata_node0001/normal/vertica.log-20211118:2021-11-18 02:15:52.808 Spread Service InOrder Queue:7ff50fc98700 [Comms] <INFO> VS set #0 (mine) has 2 members (offset=36)
v_drdata_node0001/normal/vertica.log-20211118:2021-11-18 02:15:52.808 Spread Service InOrder Queue:7ff50fc98700 [Comms] <INFO> VS set #0, member 0: #node_a#N172020110139
v_drdata_node0001/normal/vertica.log-20211118:2021-11-18 02:15:52.808 Spread Service InOrder Queue:7ff50fc98700 [Comms] <INFO> VS set #0, member 1: #node_c#N172020110141
v_drdata_node0001/normal/vertica.log-20211118:2021-11-18 02:19:44.978 Spread Service InOrder Queue:7ff50fc98700 [Comms] <INFO> NETWORK change with 2 VS sets
This can happen when network delays, or temporary pauses in a virtual environment, last longer than the Spread timeout (8 seconds by default); the nodes stop hearing from each other and therefore leave the cluster.
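To confirm when the nodes actually left and rejoined the cluster, the node state history can be queried from within Vertica. The following is a minimal sketch, assuming the V_MONITOR.NODE_STATES system table (present in Vertica 9.x) exposes event_timestamp, node_name and node_state columns:
-- List the most recent node state transitions (DOWN, RECOVERING, UP, ...)
-- so they can be matched against the Spread "NETWORK change" messages above.
SELECT event_timestamp, node_name, node_state
FROM V_MONITOR.NODE_STATES
ORDER BY event_timestamp DESC
LIMIT 50;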
In a healthy environment this does not happen often, so it can be treated as an isolated incident; however, it is worth reviewing the Spread best practices:
Vertica KB : Spread Configuration Best Practices
Another alternative, available only on Vertica 9.2.1 or higher (Vertica is upgraded to 10.x in CAPM 21.2.3 and later), is to raise the Spread token timeout to a higher value, for example 35000 ms (35 seconds) instead of the 8-second default, to mitigate this kind of error, using the following query in Vertica:
SELECT SET_SPREAD_OPTION( 'TokenTimeout', '35000');
You can check if the value was changed using the following query:
SELECT * FROM V_MONITOR.SPREAD_STATE;
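Assuming SPREAD_STATE exposes a per-node token_timeout column, as it does on recent Vertica releases, the new value should be reported for every node:
-- Each node is expected to report token_timeout = 35000 after the change.
SELECT node_name, token_timeout
FROM V_MONITOR.SPREAD_STATE;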
In the current situation, you will need to determine what is happening on the network or the VM infrastructure that causes the nodes to lose connectivity, for example a maintenance task running on the VMs, on vCenter, or on the network itself.