The Vertica Spread process crashes intermittently, causing one of the 3 nodes in the DB cluster to go down. This can happen several times per week. What is the cause?
CAPM 3.7 or later
The following messages appear in vertica.log:
2021-02-25 22:22:21.069 Spread Service InOrder Queue:7f7c28bdb700 [VMPI] <INFO> Removing 45035996273704982 from list of initialized nodes for session v_drdata_node0003-338053:0x27b19
...
2021-02-25 22:22:21.069 Spread Service InOrder Queue:7f7c28bdb700 [VMPI] <INFO> Removing 45035996273705106 from list of initialized nodes for session v_drdata_node0003-338053:0x33da2
Followed by:
2021-02-25 22:22:21.125 Spread Service InOrder Queue:7f7c28bdb700 <LOG> @v_drdata_node0003: 00000/3298: Event Posted: Event Code:3 Event Id:0 Event Severity: Critical [2] PostedTimestamp: 2021-02-25 22:22:21.125343 ExpirationTimestamp: 2089-03-16 01:36:28.125343 EventCodeDescription: Current Fault Tolerance at Critical Level ProblemDescription: Loss of node v_drdata_node0003 will cause shutdown to occur. K=1 total number of nodes=3 DatabaseName: drdata Hostname: <host>
2021-02-25 22:22:21.125 Spread Service InOrder Queue:7f7c28bdb700 [Comms] <INFO> nodeSetNotifier: node v_drdata_node0002 left the cluster
2021-02-25 22:22:21.125 Spread Service InOrder Queue:7f7c28bdb700 [Recover] <INFO> Node left cluster, reassessing k-safety...
2021-02-25 22:22:21.125 Spread Service InOrder Queue:7f7c28bdb700 [Recover] <INFO> Cluster partitioned: 3 total nodes, 1 up nodes, 2 down nodes
2021-02-25 22:22:21.125 Spread Service InOrder Queue:7f7c28bdb700 [Recover] <INFO> Setting node v_drdata_node0003 to UNSAFE
Then:
2021-02-25 22:22:21.125 Spread Service InOrder Queue:7f7c28bdb700 <LOG> @v_drdata_node0003: 00000/3298: Event Posted: Event Code:6 Event Id:5 Event Severity: Informational [6] PostedTimestamp: 2021-02-25 22:22:21.125588 ExpirationTimestamp: 2089-03-16 01:36:28.125588 EventCodeDescription: Node State Change ProblemDescription: Changing node v_drdata_node0003 startup state to UNSAFE DatabaseName: drdata Hostname: <host>
...
2021-02-25 22:22:21.125 Spread Service InOrder Queue:7f7c28bdb700 [Recover] <INFO> Changing node v_drdata_node0003 startup state from UP to UNSAFE
2021-02-25 22:22:21.125 Spread Service InOrder Queue:7f7c28bdb700 <LOG> @v_drdata_node0003: 00000/3298: Event Posted: Event Code:2 Event Id:0 Event Severity: Emergency [0] PostedTimestamp: 2021-02-25 22:22:21.125724 ExpirationTimestamp: 2021-02-25 22:32:21.125724 EventCodeDescription: Loss Of K Safety ProblemDescription: System is not K-safe: K=1 total number of nodes=3 DatabaseName: drdata Hostname: <host>
2021-02-25 22:22:21.126 Spread Mailbox Dequeue:7f7c293dc700 [Comms] <INFO> Spread dequeue thread exiting
2021-02-25 22:22:21.129 Spread Service InOrder Queue:7f7c28bdb700 [Comms] <INFO> stop: disconnecting #node_c#N137172159016 from spread daemon, Mbox=10
2021-02-25 22:22:21.165 Spread Service InOrder Queue:7f7c28bdb700 [Comms] <INFO> Vertica pid=338053; found spread pid=338051 from pidfile /opt/CA/catalog/drdata/v_drdata_node0003_catalog/spread.pid
2021-02-25 22:22:21.168 SafetyShutdown:7f7c04ff9700 [Shutdown] <INFO> Shutting down this node
2021-02-25 22:22:21.248 Init Session:7f7c077fe700-c0000009c75f35 <ERROR> @v_drdata_node0003: 08006/4539: Received no response from v_drdata_node0001, v_drdata_node0002 in transaction bind
LOCATION: sendCall, /scratch_a/release/svrtar14870/vbuild/vertica/Dist/DistCalls.cpp:16222
2021-02-25 22:22:21.249 Init Session:7f7be6fe5700-c0000009c75f36 <ERROR> @v_drdata_node0003: 08006/4539: Received no response from v_drdata_node0001, v_drdata_node0002 in transaction bind
LOCATION: sendCall, /scratch_a/release/svrtar14870/vbuild/vertica/Dist/DistCalls.cpp:16222
2021-02-25 22:22:21.249 Init Session:7f7bebfef700-c0000009c75f33 <ERROR> @v_drdata_node0003: 08006/4539: Received no response from v_drdata_node0001, v_drdata_node0002 in transaction bind
LOCATION: sendCall, /scratch_a/release/svrtar14870/vbuild/vertica/Dist/DistCalls.cpp:16222
2021-02-25 22:22:21.249 Init Session:7f7be591f700-c0000009c75f34 <ERROR> @v_drdata_node0003: 08006/4539: Received no response from v_drdata_node0001, v_drdata_node0002 in transaction bind
LOCATION: sendCall, /scratch_a/release/svrtar14870/vbuild/vertica/Dist/DistCalls.cpp:16222
2021-02-25 22:22:22.000 DiskSpaceRefresher:7f7c293dc700 [Util] <INFO> Task 'DiskSpaceRefresher' enabled
2021-02-25 22:22:22.179 Spread Service InOrder Queue:7f7c28bdb700 [Comms] <INFO> Spread daemon pid=338051 hasterminated
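On a busy node these messages can be buried among unrelated entries, so a short script may help pull out the sequence. The following is a minimal sketch, not a supported tool: the marker substrings are copied from the excerpt above, while the labels and the function name find_spread_events are hypothetical and chosen only for illustration. It assumes each log entry starts with a timestamp in the format shown above.

```python
import re

# Substrings copied from the log excerpt above; the labels on the left
# are illustrative names, not Vertica terminology.
MARKERS = [
    ("node_left", "left the cluster"),
    ("partitioned", "Cluster partitioned"),
    ("unsafe", "startup state from UP to UNSAFE"),
    ("k_safety_lost", "Loss Of K Safety"),
    ("shutdown", "Shutting down this node"),
    ("spread_exit", "Spread daemon pid="),
]

# Assumes entries begin with "YYYY-MM-DD HH:MM:SS.mmm ", as in the excerpt.
TIMESTAMP = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}) ")


def find_spread_events(lines):
    """Return (timestamp, label, line) for each entry matching a marker."""
    events = []
    for line in lines:
        match = TIMESTAMP.match(line)
        if not match:
            continue  # skip continuation lines such as "LOCATION: ..."
        for label, needle in MARKERS:
            if needle in line:
                events.append((match.group(1), label, line.rstrip()))
                break  # record at most one label per entry
    return events


# Example use (the path is an assumption; adjust to the node's catalog directory):
# with open("vertica.log", errors="replace") as fh:
#     for ts, label, _ in find_spread_events(fh):
#         print(ts, label)
```

Running this against the vertica.log of each node and comparing timestamps can show which node reported the partition first.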
In this instance, node 3 (v_drdata_node0003) is the cause of the problem, possibly due to the following: