Vertica spread process keeps crashing intermittently in CA Performance Management (CAPM)

Article ID: 209641

Updated On:

Products

CA Performance Management - Usage and Administration

Issue/Introduction

The Vertica Spread process crashes intermittently, causing one of the 3 nodes in the DB cluster to go down. This can happen several times per week. What is causing it?

 

Cause

The following messages appear in vertica.log on the affected node:

2021-02-25 22:22:21.069 Spread Service InOrder Queue:7f7c28bdb700 [VMPI] <INFO> Removing 45035996273704982 from list of initialized nodes for session v_drdata_node0003-338053:0x27b19

...

2021-02-25 22:22:21.069 Spread Service InOrder Queue:7f7c28bdb700 [VMPI] <INFO> Removing 45035996273705106 from list of initialized nodes for session v_drdata_node0003-338053:0x33da2

Followed by:

2021-02-25 22:22:21.125 Spread Service InOrder Queue:7f7c28bdb700 <LOG> @v_drdata_node0003: 00000/3298: Event Posted: Event Code:3 Event Id:0 Event Severity: Critical [2] PostedTimestamp: 2021-02-25 22:22:21.125343 ExpirationTimestamp: 2089-03-16 01:36:28.125343 EventCodeDescription: Current Fault Tolerance at Critical Level ProblemDescription: Loss of node v_drdata_node0003 will cause shutdown to occur. K=1 total number of nodes=3 DatabaseName: drdata Hostname: Hostserver3

2021-02-25 22:22:21.125 Spread Service InOrder Queue:7f7c28bdb700 [Comms] <INFO> nodeSetNotifier: node v_drdata_node0002 left the cluster

2021-02-25 22:22:21.125 Spread Service InOrder Queue:7f7c28bdb700 [Recover] <INFO> Node left cluster, reassessing k-safety...

2021-02-25 22:22:21.125 Spread Service InOrder Queue:7f7c28bdb700 [Recover] <INFO> Cluster partitioned: 3 total nodes, 1 up nodes, 2 down nodes

2021-02-25 22:22:21.125 Spread Service InOrder Queue:7f7c28bdb700 [Recover] <INFO> Setting node v_drdata_node0003 to UNSAFE

Then:

2021-02-25 22:22:21.125 Spread Service InOrder Queue:7f7c28bdb700 <LOG> @v_drdata_node0003: 00000/3298: Event Posted: Event Code:6 Event Id:5 Event Severity: Informational [6] PostedTimestamp: 2021-02-25 22:22:21.125588 ExpirationTimestamp: 2089-03-16 01:36:28.125588 EventCodeDescription: Node State Change ProblemDescription: Changing node v_drdata_node0003 startup state to UNSAFE DatabaseName: drdata Hostname: Hostserver3

...

2021-02-25 22:22:21.125 Spread Service InOrder Queue:7f7c28bdb700 [Recover] <INFO> Changing node v_drdata_node0003 startup state from UP to UNSAFE

2021-02-25 22:22:21.125 Spread Service InOrder Queue:7f7c28bdb700 <LOG> @v_drdata_node0003: 00000/3298: Event Posted: Event Code:2 Event Id:0 Event Severity: Emergency [0] PostedTimestamp: 2021-02-25 22:22:21.125724 ExpirationTimestamp: 2021-02-25 22:32:21.125724 EventCodeDescription: Loss Of K Safety ProblemDescription: System is not K-safe: K=1 total number of nodes=3 DatabaseName: drdata Hostname: Hostserver3

2021-02-25 22:22:21.126 Spread Mailbox Dequeue:7f7c293dc700 [Comms] <INFO> Spread dequeue thread exiting

2021-02-25 22:22:21.129 Spread Service InOrder Queue:7f7c28bdb700 [Comms] <INFO> stop: disconnecting #node_c#N137172159016 from spread daemon, Mbox=10

2021-02-25 22:22:21.165 Spread Service InOrder Queue:7f7c28bdb700 [Comms] <INFO> Vertica pid=338053; found spread pid=338051 from pidfile /opt/CA/catalog/drdata/v_drdata_node0003_catalog/spread.pid

2021-02-25 22:22:21.168 SafetyShutdown:7f7c04ff9700 [Shutdown] <INFO> Shutting down this node

2021-02-25 22:22:21.248 Init Session:7f7c077fe700-c0000009c75f35 <ERROR> @v_drdata_node0003: 08006/4539: Received no response from v_drdata_node0001, v_drdata_node0002 in transaction bind

        LOCATION:  sendCall, /scratch_a/release/svrtar14870/vbuild/vertica/Dist/DistCalls.cpp:16222

2021-02-25 22:22:21.249 Init Session:7f7be6fe5700-c0000009c75f36 <ERROR> @v_drdata_node0003: 08006/4539: Received no response from v_drdata_node0001, v_drdata_node0002 in transaction bind

        LOCATION:  sendCall, /scratch_a/release/svrtar14870/vbuild/vertica/Dist/DistCalls.cpp:16222

2021-02-25 22:22:21.249 Init Session:7f7bebfef700-c0000009c75f33 <ERROR> @v_drdata_node0003: 08006/4539: Received no response from v_drdata_node0001, v_drdata_node0002 in transaction bind

        LOCATION:  sendCall, /scratch_a/release/svrtar14870/vbuild/vertica/Dist/DistCalls.cpp:16222

2021-02-25 22:22:21.249 Init Session:7f7be591f700-c0000009c75f34 <ERROR> @v_drdata_node0003: 08006/4539: Received no response from v_drdata_node0001, v_drdata_node0002 in transaction bind

        LOCATION:  sendCall, /scratch_a/release/svrtar14870/vbuild/vertica/Dist/DistCalls.cpp:16222

2021-02-25 22:22:22.000 DiskSpaceRefresher:7f7c293dc700 [Util] <INFO> Task 'DiskSpaceRefresher' enabled

2021-02-25 22:22:22.179 Spread Service InOrder Queue:7f7c28bdb700 [Comms] <INFO> Spread daemon pid=338051 hasterminated
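
Because the crash recurs several times per week, it can help to pull the same markers out of each vertica.log and compare occurrences. The following is a minimal Python sketch, not a supported tool. The regular expressions are copied from the messages quoted above and are assumed to match other occurrences; other Vertica versions may word these messages differently. The function name scan_log is hypothetical.

```python
import re
from collections import Counter

# Patterns taken from the log lines quoted in this article (assumption:
# other occurrences of the problem use the same wording).
TIMESTAMP = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}) ")
NODE_LEFT = re.compile(r"nodeSetNotifier: node (\S+) left the cluster")
PARTITIONED = re.compile(
    r"Cluster partitioned: (\d+) total nodes, (\d+) up nodes, (\d+) down nodes"
)
SHUTDOWN = re.compile(r"\[Shutdown\] <INFO> Shutting down this node")
# \s* tolerates both "has terminated" and the "hasterminated" form seen above.
SPREAD_EXIT = re.compile(r"Spread daemon pid=(\d+) has\s*terminated")


def scan_log(lines):
    """Return (events, departures) for an iterable of vertica.log lines.

    events is a list of (timestamp, kind, detail) tuples in file order;
    departures counts how often each node was reported as leaving.
    """
    events = []
    departures = Counter()
    for line in lines:
        ts_match = TIMESTAMP.match(line)
        if not ts_match:
            continue  # continuation lines such as "LOCATION: ..." have no timestamp
        ts = ts_match.group(1)

        m = NODE_LEFT.search(line)
        if m:
            departures[m.group(1)] += 1
            events.append((ts, "node_left", m.group(1)))
            continue

        m = PARTITIONED.search(line)
        if m:
            detail = "up=%s down=%s" % (m.group(2), m.group(3))
            events.append((ts, "partitioned", detail))
            continue

        if SHUTDOWN.search(line):
            events.append((ts, "shutdown", ""))
            continue

        m = SPREAD_EXIT.search(line)
        if m:
            events.append((ts, "spread_exit", m.group(1)))
    return events, departures
```

To use it, open each node's vertica.log, pass the file object to scan_log, and compare the timestamps and the departures counter across nodes. If every node's log reports the same node leaving first, that node is the likely starting point for further checks.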

 

Environment

CAPM 3.7 or later

Resolution

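In this instance, Node 3 (v_drdata_node0003) is the cause of the problem. The log shows it losing contact with v_drdata_node0001 and v_drdata_node0002, being marked UNSAFE, and then shutting down. Possible causes include:

  • network latency between the nodes
  • disk problems on the node
  • time synchronization between the servers: even a one-second difference in system time can cause problems, so check that the clocks are kept in sync using NTP or similar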
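
The time-sync item can be checked with a simple comparison once a clock reading has been taken from each server at roughly the same moment. The following Python sketch only does the comparison. How the readings are collected is left to the administrator, and the function name check_clock_skew, the host names Hostserver1 and Hostserver2, and the sample values are hypothetical. The 1.0 second default threshold comes from the "single second" remark above and is not a documented Vertica limit.

```python
def check_clock_skew(samples, max_skew_seconds=1.0):
    """Compare clock readings taken from each host at about the same moment.

    samples maps a host name to its clock reading in epoch seconds.
    Returns a dict with the largest difference found and the two hosts
    that produced it.
    """
    if len(samples) < 2:
        raise ValueError("need readings from at least two hosts")

    earliest = min(samples, key=samples.get)  # host with the slowest clock
    latest = max(samples, key=samples.get)    # host with the fastest clock
    skew = samples[latest] - samples[earliest]

    return {
        "skew_seconds": skew,
        "earliest": earliest,
        "latest": latest,
        "ok": skew < max_skew_seconds,
    }


# Hypothetical readings; replace with values gathered from the three nodes.
readings = {
    "Hostserver1": 1614291741.0,
    "Hostserver2": 1614291741.0,
    "Hostserver3": 1614291742.5,
}
result = check_clock_skew(readings)
```

With these sample values, the result flags Hostserver3 as the host whose clock is furthest ahead. Note that any delay between taking the individual readings adds to the measured skew, so collect them as close together as possible.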