Vertica spread process keeps crashing intermittently in CA Performance Management (CAPM)

Article ID: 209641

Products

CA Performance Management - Usage and Administration

Issue/Introduction

The Vertica Spread process crashes intermittently, causing one of the three nodes in the database cluster to go down. This can happen several times per week. What causes it?

 

Environment

CAPM 3.7 or later

Cause

The following messages appear in vertica.log:

2021-02-25 22:22:21.069 Spread Service InOrder Queue:7f7c28bdb700 [VMPI] <INFO> Removing 45035996273704982 from list of initialized nodes for session v_drdata_node0003-338053:0x27b19

...

2021-02-25 22:22:21.069 Spread Service InOrder Queue:7f7c28bdb700 [VMPI] <INFO> Removing 45035996273705106 from list of initialized nodes for session v_drdata_node0003-338053:0x33da2

Followed by:

2021-02-25 22:22:21.125 Spread Service InOrder Queue:7f7c28bdb700 <LOG> @v_drdata_node0003: 00000/3298: Event Posted: Event Code:3 Event Id:0 Event Severity: Critical [2] PostedTimestamp: 2021-02-25 22:22:21.125343 ExpirationTimestamp: 2089-03-16 01:36:28.125343 EventCodeDescription: Current Fault Tolerance at Critical Level ProblemDescription: Loss of node v_drdata_node0003 will cause shutdown to occur. K=1 total number of nodes=3 DatabaseName: drdata Hostname: <host>

2021-02-25 22:22:21.125 Spread Service InOrder Queue:7f7c28bdb700 [Comms] <INFO> nodeSetNotifier: node v_drdata_node0002 left the cluster

2021-02-25 22:22:21.125 Spread Service InOrder Queue:7f7c28bdb700 [Recover] <INFO> Node left cluster, reassessing k-safety...

2021-02-25 22:22:21.125 Spread Service InOrder Queue:7f7c28bdb700 [Recover] <INFO> Cluster partitioned: 3 total nodes, 1 up nodes, 2 down nodes

2021-02-25 22:22:21.125 Spread Service InOrder Queue:7f7c28bdb700 [Recover] <INFO> Setting node v_drdata_node0003 to UNSAFE

Then:

2021-02-25 22:22:21.125 Spread Service InOrder Queue:7f7c28bdb700 <LOG> @v_drdata_node0003: 00000/3298: Event Posted: Event Code:6 Event Id:5 Event Severity: Informational [6] PostedTimestamp: 2021-02-25 22:22:21.125588 ExpirationTimestamp: 2089-03-16 01:36:28.125588 EventCodeDescription: Node State Change ProblemDescription: Changing node v_drdata_node0003 startup state to UNSAFE DatabaseName: drdata Hostname: <host>

...

2021-02-25 22:22:21.125 Spread Service InOrder Queue:7f7c28bdb700 [Recover] <INFO> Changing node v_drdata_node0003 startup state from UP to UNSAFE

2021-02-25 22:22:21.125 Spread Service InOrder Queue:7f7c28bdb700 <LOG> @v_drdata_node0003: 00000/3298: Event Posted: Event Code:2 Event Id:0 Event Severity: Emergency [0] PostedTimestamp: 2021-02-25 22:22:21.125724 ExpirationTimestamp: 2021-02-25 22:32:21.125724 EventCodeDescription: Loss Of K Safety ProblemDescription: System is not K-safe: K=1 total number of nodes=3 DatabaseName: drdata Hostname: <host>

2021-02-25 22:22:21.126 Spread Mailbox Dequeue:7f7c293dc700 [Comms] <INFO> Spread dequeue thread exiting

2021-02-25 22:22:21.129 Spread Service InOrder Queue:7f7c28bdb700 [Comms] <INFO> stop: disconnecting #node_c#N137172159016 from spread daemon, Mbox=10

2021-02-25 22:22:21.165 Spread Service InOrder Queue:7f7c28bdb700 [Comms] <INFO> Vertica pid=338053; found spread pid=338051 from pidfile /opt/CA/catalog/drdata/v_drdata_node0003_catalog/spread.pid

2021-02-25 22:22:21.168 SafetyShutdown:7f7c04ff9700 [Shutdown] <INFO> Shutting down this node

2021-02-25 22:22:21.248 Init Session:7f7c077fe700-c0000009c75f35 <ERROR> @v_drdata_node0003: 08006/4539: Received no response from v_drdata_node0001, v_drdata_node0002 in transaction bind

        LOCATION:  sendCall, /scratch_a/release/svrtar14870/vbuild/vertica/Dist/DistCalls.cpp:16222

2021-02-25 22:22:21.249 Init Session:7f7be6fe5700-c0000009c75f36 <ERROR> @v_drdata_node0003: 08006/4539: Received no response from v_drdata_node0001, v_drdata_node0002 in transaction bind

        LOCATION:  sendCall, /scratch_a/release/svrtar14870/vbuild/vertica/Dist/DistCalls.cpp:16222

2021-02-25 22:22:21.249 Init Session:7f7bebfef700-c0000009c75f33 <ERROR> @v_drdata_node0003: 08006/4539: Received no response from v_drdata_node0001, v_drdata_node0002 in transaction bind

        LOCATION:  sendCall, /scratch_a/release/svrtar14870/vbuild/vertica/Dist/DistCalls.cpp:16222

2021-02-25 22:22:21.249 Init Session:7f7be591f700-c0000009c75f34 <ERROR> @v_drdata_node0003: 08006/4539: Received no response from v_drdata_node0001, v_drdata_node0002 in transaction bind

        LOCATION:  sendCall, /scratch_a/release/svrtar14870/vbuild/vertica/Dist/DistCalls.cpp:16222

2021-02-25 22:22:22.000 DiskSpaceRefresher:7f7c293dc700 [Util] <INFO> Task 'DiskSpaceRefresher' enabled

2021-02-25 22:22:22.179 Spread Service InOrder Queue:7f7c28bdb700 [Comms] <INFO> Spread daemon pid=338051 hasterminated
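
Because this sequence can recur several times per week, it may help to pull only the key events out of vertica.log instead of reading it by hand. The following is a minimal sketch, not a supported tool: the function name and event labels are hypothetical, and the patterns are copied from the excerpt above, so other Vertica versions may word these messages differently.

```python
import re

# Patterns copied from the log excerpt in this article (assumption: other
# Vertica versions may phrase these messages differently).
PATTERNS = {
    "node_left": re.compile(r"nodeSetNotifier: node (\S+) left the cluster"),
    "set_unsafe": re.compile(r"Setting node (\S+) to UNSAFE"),
    "k_safety_lost": re.compile(r"EventCodeDescription: Loss Of K Safety"),
    "node_shutdown": re.compile(r"\[Shutdown\] <INFO> Shutting down this node"),
}

# Every real log entry starts with a timestamp; continuation lines such as
# "LOCATION: ..." do not, and are skipped.
TIMESTAMP = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}) ")


def find_spread_events(lines):
    """Yield (timestamp, event_label, node_name_or_None) for matching lines."""
    for line in lines:
        ts = TIMESTAMP.match(line)
        if not ts:
            continue
        for label, pattern in PATTERNS.items():
            match = pattern.search(line)
            if match:
                # Only some patterns capture a node name.
                node = match.group(1) if pattern.groups else None
                yield ts.group(1), label, node
                break
```

An open file object can be passed directly as `lines`, for example `find_spread_events(open(path))`, where `path` points at the vertica.log in the node's catalog directory. Comparing the timestamps of these events across all three nodes shows which node lost contact first.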

 

Resolution

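In this instance, node 3 (v_drdata_node0003) is the cause of the problem. The log shows it losing contact with both other nodes ("Cluster partitioned: 3 total nodes, 1 up nodes, 2 down nodes") and then shutting itself down. Possible causes include:

  • network latency
  • disk problems
  • clock drift between the servers (a difference of even one second in system time can cause problems; check that the clocks are in sync using NTP or similar).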
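
The time sync check in the last item can be sketched as follows. This is an illustration under stated assumptions, not a replacement for NTP monitoring: the function name is hypothetical, the readings are assumed to be epoch seconds sampled on each node at nearly the same moment (for example with `date +%s.%N` on Linux), and the default one-second tolerance simply mirrors the figure mentioned above.

```python
def clock_skew_report(readings, tolerance_s=1.0):
    """Compare clock readings taken on each node at (nearly) the same moment.

    readings: dict mapping hostname -> epoch seconds reported by that host.
    Returns (offsets, out_of_sync), where offsets maps each host to its lead
    over the slowest clock, and out_of_sync is True when the largest lead
    reaches tolerance_s.
    """
    if not readings:
        raise ValueError("readings must contain at least one host")
    # Use the slowest clock as the reference so every offset is >= 0.
    reference = min(readings.values())
    offsets = {host: ts - reference for host, ts in readings.items()}
    out_of_sync = max(offsets.values()) >= tolerance_s
    return offsets, out_of_sync


# Hypothetical readings for the three nodes in this article.
offsets, out_of_sync = clock_skew_report({
    "v_drdata_node0001": 1000.0,
    "v_drdata_node0002": 1000.25,
    "v_drdata_node0003": 1001.5,
})
```

The readings are only as accurate as the time it takes to collect them, so a result close to the tolerance is worth confirming with the NTP client's own status output on each server.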