Vertica database went down due to network communications issue

Products

CA Performance Management - Usage and Administration

DX NetOps

Issue/Introduction

How can we determine if the cause of a Vertica outage was network-related?

The Vertica database was found down and was restarted without further errors. Why did it go down? The system was not rebooted or patched.

Messages like the following examples were found in the vertica.log files on the nodes. These examples are from the vertica.log file on node0001 of a three-node cluster.

The problem presents itself first with these messages:

2024-01-22 01:02:42.299 Spread Service InOrder Queue:0x7ff8d86f7700 [Comms] <INFO> Saw membership message 8192 (0x2000) on V:drdata
2024-01-22 01:02:42.299 Spread Service InOrder Queue:0x7ff8d86f7700 [Comms] <INFO> Saw transitional message; watch for lost daemons
2024-01-22 01:02:42.299 Spread Service InOrder Queue:0x7ff8d86f7700 [Comms] <INFO> Saw membership message 8192 (0x2000) on Vertica:all
2024-01-22 01:02:42.299 Spread Service InOrder Queue:0x7ff8d86f7700 [Comms] <INFO> Saw transitional message; watch for lost daemons
2024-01-22 01:02:42.299 Spread Service InOrder Queue:0x7ff8d86f7700 [Comms] <INFO> Saw membership message 8192 (0x2000) on Vertica:join
2024-01-22 01:02:42.299 Spread Service InOrder Queue:0x7ff8d86f7700 [Comms] <INFO> Saw transitional message; watch for lost daemons
2024-01-22 01:02:42.299 Spread Service InOrder Queue:0x7ff8d86f7700 [Comms] <INFO> Saw membership message 6144 (0x1800) on V:drdata
2024-01-22 01:02:42.311 Spread Service InOrder Queue:0x7ff8d86f7700 [Comms] <INFO> NETWORK change with 1 VS sets
2024-01-22 01:02:42.311 Spread Service InOrder Queue:0x7ff8d86f7700 [Comms] <INFO> VS set #0 (mine) has 1 members (offset=24)
2024-01-22 01:02:42.325 Spread Service InOrder Queue:0x7ff8d86f7700 [Comms] <INFO> VS set #0, member 0: #node_a#N010064099001
2024-01-22 01:02:42.357 Spread Service InOrder Queue:0x7ff8d86f7700 [Comms] <INFO> DB Group changed
2024-01-22 01:02:42.357 Spread Service InOrder Queue:0x7ff8d86f7700 [VMPI] <INFO> DistCall: Set current group members called with 1 members

Next, a message like this appears for one of the other nodes:

2024-01-22 01:02:42.360 Spread Service InOrder Queue:0x7ff8d86f7700 [Comms] <INFO> nodeSetNotifier: node v_drdata_node0002 left the cluster

Then the other node(s) leave:

2024-01-22 01:02:42.496 Spread Service InOrder Queue:0x7ff8d86f7700 [Comms] <INFO> nodeSetNotifier: node v_drdata_node0003 left the cluster
2024-01-22 01:02:42.496 Spread Service InOrder Queue:0x7ff8d86f7700 [Recover] <INFO> Running hooks after detecting a node loss
...
2024-01-22 01:02:42.496 Spread Service InOrder Queue:0x7ff8d86f7700 [Recover] <INFO> Node left cluster, reassessing k-safety...
2024-01-22 01:02:42.496 Spread Service InOrder Queue:0x7ff8d86f7700 [Recover] <INFO> Cluster partitioned: 3 total nodes, 1 up nodes, 2 down nodes
2024-01-22 01:02:42.496 Spread Service InOrder Queue:0x7ff8d86f7700 [Recover] <INFO> Setting node v_drdata_node0001 to UNSAFE

Finally, these messages appear before the database shuts itself down:

2024-01-22 01:02:43.694 Spread Service InOrder Queue:0x7ff8d86f7700 [Comms] <INFO> nodeSetNotifier: node v_drdata_node0001 left the cluster
2024-01-22 01:02:43.694 Spread Service InOrder Queue:0x7ff8d86f7700 [Recover] <INFO> Running hooks after detecting a node loss
2024-01-22 01:02:43.694 Spread Service InOrder Queue:0x7ff8d86f7700 [Recover] <INFO> Node left cluster, reassessing k-safety...
2024-01-22 01:02:43.694 Spread Service InOrder Queue:0x7ff8d86f7700 [Comms] <INFO> Node v_drdata_node0001 erased from nodeToState map
2024-01-22 01:02:43.694 Spread Service InOrder Queue:0x7ff8d86f7700 [Comms] <INFO> NodeHeartbeatManager: SP_stop_monitoring invoked
2024-01-22 01:02:43.694 Spread Service InOrder Queue:0x7ff8d86f7700 [Comms] <WARNING> NodeHeartbeatManager: SP_stop_monitoring failed with return code -18

NOTE: Be sure to also check the rotated (rolled-over) logs, which have the date appended to the file name, to find the offending messages.
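As a rough illustration of the note above, the following Python sketch searches the current vertica.log and any date-suffixed copies in one directory for a given message. The directory argument, the `vertica.log*` glob pattern, and the handling of `.gz` files are assumptions made for illustration; check how logs are actually rotated and named on your Data Repository nodes.

```python
import glob
import gzip
import os
import sys


def find_log_files(log_dir):
    """Return vertica.log plus any copies with a suffix appended to the name."""
    # Assumption: rotated files share the "vertica.log" prefix.
    return sorted(glob.glob(os.path.join(log_dir, "vertica.log*")))


def grep_log(path, needle):
    """Return the lines in one log file that contain the given text."""
    # Assumption: rotated copies may be gzip-compressed.
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", errors="replace") as fh:
        return [line.rstrip("\n") for line in fh if needle in line]


if __name__ == "__main__":
    # Hypothetical usage: pass the directory that holds vertica.log.
    log_dir = sys.argv[1] if len(sys.argv) > 1 else "."
    for path in find_log_files(log_dir):
        for line in grep_log(path, "left the cluster"):
            print(f"{path}: {line}")
```

Running it on each node and comparing the output is one way to confirm that no relevant messages were missed in an older log file.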

Similar messages appear on all nodes. Comparing the message timestamps across nodes shows which nodes left first, and therefore where the issue began.
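The timestamp comparison can be sketched in Python as shown here. This is only a minimal example that assumes the log line layout seen in the excerpts in this article (a leading `YYYY-MM-DD HH:MM:SS.mmm` timestamp and the `nodeSetNotifier: node ... left the cluster` text); the function name `leave_events` and the `SAMPLE` list are made up for the illustration.

```python
import re
from datetime import datetime

# Matches the "left the cluster" lines shown in the excerpts in this article.
LEFT_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}) .*"
    r"nodeSetNotifier: node (?P<node>\S+) left the cluster"
)


def leave_events(lines):
    """Return (timestamp, node_name) tuples sorted from earliest to latest."""
    events = []
    for line in lines:
        match = LEFT_RE.match(line)
        if match:
            ts = datetime.strptime(match.group("ts"), "%Y-%m-%d %H:%M:%S.%f")
            events.append((ts, match.group("node")))
    return sorted(events)


# Lines copied from the node0001 excerpts, deliberately out of order.
SAMPLE = [
    "2024-01-22 01:02:43.694 Spread Service InOrder Queue:0x7ff8d86f7700 [Comms] <INFO> nodeSetNotifier: node v_drdata_node0001 left the cluster",
    "2024-01-22 01:02:42.360 Spread Service InOrder Queue:0x7ff8d86f7700 [Comms] <INFO> nodeSetNotifier: node v_drdata_node0002 left the cluster",
    "2024-01-22 01:02:42.496 Spread Service InOrder Queue:0x7ff8d86f7700 [Comms] <INFO> nodeSetNotifier: node v_drdata_node0003 left the cluster",
]

for ts, node in leave_events(SAMPLE):
    print(ts.isoformat(timespec="milliseconds"), node)
```

For the sample lines, the sorted order is node0002, then node0003, then node0001, matching the sequence described in this article. Feeding in the lines from every node's log gives each node's view of the same event; this assumes the node clocks are reasonably synchronized.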

Environment

All supported DX NetOps Performance Management releases

Cause

Network-related communication failures between nodes.

Resolution

Network connection issues between Vertica nodes, even short-lived or intermittent ones, often result in a down database. When properly configured for k-safety, the Vertica database shuts itself down when enough members of the cluster are seen as lost. In the example above, node0001 sees only 1 of the 3 nodes as up, is set to UNSAFE, and then shuts down.
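To make the "enough members lost" condition more concrete, the sketch below parses the `Cluster partitioned` message from the log excerpt and reports the node counts. The `has_majority` field applies a simple "more than half of the nodes are up" check; that rule is an assumption used for illustration, not a statement of Vertica's exact shutdown logic, and the function name `partition_summary` is hypothetical.

```python
import re

# Matches the "Cluster partitioned" message shown in the log excerpt.
PARTITION_RE = re.compile(
    r"Cluster partitioned: (?P<total>\d+) total nodes, "
    r"(?P<up>\d+) up nodes, (?P<down>\d+) down nodes"
)


def partition_summary(line):
    """Return the node counts from a 'Cluster partitioned' line, or None."""
    match = PARTITION_RE.search(line)
    if match is None:
        return None
    total = int(match.group("total"))
    up = int(match.group("up"))
    down = int(match.group("down"))
    return {
        "total": total,
        "up": up,
        "down": down,
        # Assumption: a strict majority of nodes must remain visible.
        "has_majority": up * 2 > total,
    }


line = (
    "2024-01-22 01:02:42.496 Spread Service InOrder Queue:0x7ff8d86f7700 "
    "[Recover] <INFO> Cluster partitioned: 3 total nodes, 1 up nodes, 2 down nodes"
)
print(partition_summary(line))
```

For the excerpt line, 1 up node out of 3 is not a majority, which is consistent with node0001 being set to UNSAFE in the next log message.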

In a stable network, this is a rare occurrence.

If network-related issues repeatedly cause outages, use the Vertica tools to determine whether the network meets requirements. If it does not, making the changes needed to meet them may resolve the problem. The vnetperf tool is useful for checking whether network requirements are being met. See the Run Data Repository Diagnostic Utilities documentation topic for more information.

If the system meets requirements and network-related outages persist, engage your internal network team for further assistance.