The Vertica Spread process crashes intermittently, causing one of the three nodes in the DB cluster to go down. This can happen several times per week. What is causing it?
The following messages appear in vertica.log:
2021-02-25 22:22:21.069 Spread Service InOrder Queue:7f7c28bdb700 [VMPI] <INFO> Removing 45035996273704982 from list of initialized nodes for session v_drdata_node0003-338053:0x27b19
...
2021-02-25 22:22:21.069 Spread Service InOrder Queue:7f7c28bdb700 [VMPI] <INFO> Removing 45035996273705106 from list of initialized nodes for session v_drdata_node0003-338053:0x33da2
Followed by:
2021-02-25 22:22:21.125 Spread Service InOrder Queue:7f7c28bdb700 <LOG> @v_drdata_node0003: 00000/3298: Event Posted: Event Code:3 Event Id:0 Event Severity: Critical [2] PostedTimestamp: 2021-02-25 22:22:21.125343 ExpirationTimestamp: 2089-03-16 01:36:28.125343 EventCodeDescription: Current Fault Tolerance at Critical Level ProblemDescription: Loss of node v_drdata_node0003 will cause shutdown to occur. K=1 total number of nodes=3 DatabaseName: drdata Hostname: Hostserver3
2021-02-25 22:22:21.125 Spread Service InOrder Queue:7f7c28bdb700 [Comms] <INFO> nodeSetNotifier: node v_drdata_node0002 left the cluster
2021-02-25 22:22:21.125 Spread Service InOrder Queue:7f7c28bdb700 [Recover] <INFO> Node left cluster, reassessing k-safety...
2021-02-25 22:22:21.125 Spread Service InOrder Queue:7f7c28bdb700 [Recover] <INFO> Cluster partitioned: 3 total nodes, 1 up nodes, 2 down nodes
2021-02-25 22:22:21.125 Spread Service InOrder Queue:7f7c28bdb700 [Recover] <INFO> Setting node v_drdata_node0003 to UNSAFE
Then:
2021-02-25 22:22:21.125 Spread Service InOrder Queue:7f7c28bdb700 <LOG> @v_drdata_node0003: 00000/3298: Event Posted: Event Code:6 Event Id:5 Event Severity: Informational [6] PostedTimestamp: 2021-02-25 22:22:21.125588 ExpirationTimestamp: 2089-03-16 01:36:28.125588 EventCodeDescription: Node State Change ProblemDescription: Changing node v_drdata_node0003 startup state to UNSAFE DatabaseName: drdata Hostname: Hostserver3
...
2021-02-25 22:22:21.125 Spread Service InOrder Queue:7f7c28bdb700 [Recover] <INFO> Changing node v_drdata_node0003 startup state from UP to UNSAFE
2021-02-25 22:22:21.125 Spread Service InOrder Queue:7f7c28bdb700 <LOG> @v_drdata_node0003: 00000/3298: Event Posted: Event Code:2 Event Id:0 Event Severity: Emergency [0] PostedTimestamp: 2021-02-25 22:22:21.125724 ExpirationTimestamp: 2021-02-25 22:32:21.125724 EventCodeDescription: Loss Of K Safety ProblemDescription: System is not K-safe: K=1 total number of nodes=3 DatabaseName: drdata Hostname: Hostserver3
2021-02-25 22:22:21.126 Spread Mailbox Dequeue:7f7c293dc700 [Comms] <INFO> Spread dequeue thread exiting
2021-02-25 22:22:21.129 Spread Service InOrder Queue:7f7c28bdb700 [Comms] <INFO> stop: disconnecting #node_c#N137172159016 from spread daemon, Mbox=10
2021-02-25 22:22:21.165 Spread Service InOrder Queue:7f7c28bdb700 [Comms] <INFO> Vertica pid=338053; found spread pid=338051 from pidfile /opt/CA/catalog/drdata/v_drdata_node0003_catalog/spread.pid
2021-02-25 22:22:21.168 SafetyShutdown:7f7c04ff9700 [Shutdown] <INFO> Shutting down this node
2021-02-25 22:22:21.248 Init Session:7f7c077fe700-c0000009c75f35 <ERROR> @v_drdata_node0003: 08006/4539: Received no response from v_drdata_node0001, v_drdata_node0002 in transaction bind
LOCATION: sendCall, /scratch_a/release/svrtar14870/vbuild/vertica/Dist/DistCalls.cpp:16222
2021-02-25 22:22:21.249 Init Session:7f7be6fe5700-c0000009c75f36 <ERROR> @v_drdata_node0003: 08006/4539: Received no response from v_drdata_node0001, v_drdata_node0002 in transaction bind
LOCATION: sendCall, /scratch_a/release/svrtar14870/vbuild/vertica/Dist/DistCalls.cpp:16222
2021-02-25 22:22:21.249 Init Session:7f7bebfef700-c0000009c75f33 <ERROR> @v_drdata_node0003: 08006/4539: Received no response from v_drdata_node0001, v_drdata_node0002 in transaction bind
LOCATION: sendCall, /scratch_a/release/svrtar14870/vbuild/vertica/Dist/DistCalls.cpp:16222
2021-02-25 22:22:21.249 Init Session:7f7be591f700-c0000009c75f34 <ERROR> @v_drdata_node0003: 08006/4539: Received no response from v_drdata_node0001, v_drdata_node0002 in transaction bind
LOCATION: sendCall, /scratch_a/release/svrtar14870/vbuild/vertica/Dist/DistCalls.cpp:16222
2021-02-25 22:22:22.000 DiskSpaceRefresher:7f7c293dc700 [Util] <INFO> Task 'DiskSpaceRefresher' enabled
2021-02-25 22:22:22.179 Spread Service InOrder Queue:7f7c28bdb700 [Comms] <INFO> Spread daemon pid=338051 hasterminated
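When this happens several times per week, it can help to pull only the cluster-membership and safety events out of vertica.log instead of reading the whole file. The following is a minimal sketch, not a Vertica or CAPM tool: the regular expressions are assumptions derived only from the line shapes quoted above, and the names LINE_RE, KEY_EVENTS and key_events are hypothetical.

```python
import re

# Assumed line shape, inferred from the excerpt above:
# "<date> <time> <thread name>:<thread id> [Component] <LEVEL> message"
# The "[Component]" part is optional (it is absent on <LOG> and <ERROR> lines).
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}) "
    r"(?P<thread>.+?):(?P<tid>[0-9a-f]+(?:-[0-9a-f]+)?) "
    r"(?:\[(?P<component>\w+)\] )?"
    r"<(?P<level>[A-Z]+)> "
    r"(?P<msg>.*)$"
)

# Hypothetical event names mapped to message fragments seen in the excerpt.
KEY_EVENTS = [
    ("node_left", re.compile(r"node (\S+) left the cluster")),
    ("partitioned", re.compile(
        r"Cluster partitioned: (\d+) total nodes, (\d+) up nodes, (\d+) down nodes")),
    ("unsafe", re.compile(r"Setting node (\S+) to UNSAFE")),
    ("shutdown", re.compile(r"Shutting down this node")),
]


def key_events(lines):
    """Return (timestamp, event name, captured groups) for each matching line."""
    events = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # skips continuation lines such as "LOCATION: ..."
        for name, pattern in KEY_EVENTS:
            hit = pattern.search(m.group("msg"))
            if hit:
                events.append((m.group("ts"), name, hit.groups()))
                break
    return events


# Lines copied from the excerpt above, used as sample input.
SAMPLE = [
    "2021-02-25 22:22:21.125 Spread Service InOrder Queue:7f7c28bdb700 [Comms] <INFO> nodeSetNotifier: node v_drdata_node0002 left the cluster",
    "2021-02-25 22:22:21.125 Spread Service InOrder Queue:7f7c28bdb700 [Recover] <INFO> Cluster partitioned: 3 total nodes, 1 up nodes, 2 down nodes",
    "2021-02-25 22:22:21.125 Spread Service InOrder Queue:7f7c28bdb700 [Recover] <INFO> Setting node v_drdata_node0003 to UNSAFE",
    "2021-02-25 22:22:21.168 SafetyShutdown:7f7c04ff9700 [Shutdown] <INFO> Shutting down this node",
    "LOCATION: sendCall, /scratch_a/release/svrtar14870/vbuild/vertica/Dist/DistCalls.cpp:16222",
]

if __name__ == "__main__":
    for ts, name, details in key_events(SAMPLE):
        print(ts, name, details)
```

To scan a real log, pass an open file object for vertica.log to key_events() in place of SAMPLE. Comparing the event timestamps across all three nodes shows which node saw the others leave first.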
CAPM 3.7 or later
In this instance, Node 3 (v_drdata_node0003) is the cause of the problem, possibly due to the following: