How can we determine if the cause of a Vertica outage was network-related?
The Vertica database was found down and was restarted without further errors. Why was it down? The system was not rebooted or patched.
Messages like the examples seen here were found in the vertica.log files on the nodes. These are from the node0001 vertica.log file in a three-node cluster.
The problem presents itself first with these messages:
Then we would see something like this for one of the other nodes:
2024-01-22 01:02:42.360 Spread Service InOrder Queue:0x7ff8d86f7700 [Comms] <INFO> nodeSetNotifier: node v_drdata_node0002 left the cluster
Then the other node(s) leave:
2024-01-22 01:02:42.496 Spread Service InOrder Queue:0x7ff8d86f7700 [Comms] <INFO> nodeSetNotifier: node v_drdata_node0003 left the cluster
2024-01-22 01:02:42.496 Spread Service InOrder Queue:0x7ff8d86f7700 [Recover] <INFO> Running hooks after detecting a node loss
...
2024-01-22 01:02:42.496 Spread Service InOrder Queue:0x7ff8d86f7700 [Recover] <INFO> Node left cluster, reassessing k-safety...
2024-01-22 01:02:42.496 Spread Service InOrder Queue:0x7ff8d86f7700 [Recover] <INFO> Cluster partitioned: 3 total nodes, 1 up nodes, 2 down nodes
2024-01-22 01:02:42.496 Spread Service InOrder Queue:0x7ff8d86f7700 [Recover] <INFO> Setting node v_drdata_node0001 to UNSAFE
Lastly, this is seen before the DB shuts itself down:
2024-01-22 01:02:43.694 Spread Service InOrder Queue:0x7ff8d86f7700 [Comms] <INFO> nodeSetNotifier: node v_drdata_node0001 left the cluster
2024-01-22 01:02:43.694 Spread Service InOrder Queue:0x7ff8d86f7700 [Recover] <INFO> Running hooks after detecting a node loss
2024-01-22 01:02:43.694 Spread Service InOrder Queue:0x7ff8d86f7700 [Recover] <INFO> Node left cluster, reassessing k-safety...
2024-01-22 01:02:43.694 Spread Service InOrder Queue:0x7ff8d86f7700 [Comms] <INFO> Node v_drdata_node0001 erased from nodeToState map
2024-01-22 01:02:43.694 Spread Service InOrder Queue:0x7ff8d86f7700 [Comms] <INFO> NodeHeartbeatManager: SP_stop_monitoring invoked
2024-01-22 01:02:43.694 Spread Service InOrder Queue:0x7ff8d86f7700 [Comms] <WARNING> NodeHeartbeatManager: SP_stop_monitoring failed with return code -18
NOTE: Be sure to also check the rotated log files, which have the date appended to the file name, to find the offending messages.
Similar messages appear on all nodes. Comparing the message timestamps across nodes shows which nodes left first, and therefore where the issue began.
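The timestamp comparison described above can be sketched in Python. This is a minimal illustration, not a supported tool: it assumes the log lines follow the exact format of the excerpts in this article, and the names `LEFT_RE` and `departures` are hypothetical helpers introduced here.

```python
import re
from datetime import datetime

# Matches the "left the cluster" lines in the format shown in this article.
LEFT_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}) .*"
    r"nodeSetNotifier: node (\S+) left the cluster"
)

def departures(lines):
    """Return (timestamp, node_name) pairs, earliest departure first."""
    events = []
    for line in lines:
        match = LEFT_RE.match(line)
        if match:
            stamp = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S.%f")
            events.append((stamp, match.group(2)))
    return sorted(events)

# Sample lines copied from the node0001 excerpts in this article.
sample = [
    "2024-01-22 01:02:43.694 Spread Service InOrder Queue:0x7ff8d86f7700 [Comms] <INFO> nodeSetNotifier: node v_drdata_node0001 left the cluster",
    "2024-01-22 01:02:42.496 Spread Service InOrder Queue:0x7ff8d86f7700 [Comms] <INFO> nodeSetNotifier: node v_drdata_node0003 left the cluster",
    "2024-01-22 01:02:42.360 Spread Service InOrder Queue:0x7ff8d86f7700 [Comms] <INFO> nodeSetNotifier: node v_drdata_node0002 left the cluster",
]

for stamp, node in departures(sample):
    print(stamp.isoformat(sep=" ", timespec="milliseconds"), node)
```

To run it against a real log, pass an open file object to `departures` instead of `sample`. Repeat for the vertica.log of each node, including any rotated files, and compare the earliest entries.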
All supported DX NetOps Performance Management releases
Network-related communication failures between nodes.
Network-related connection issues between Vertica nodes, even if short-lived or intermittent, will often result in a down database. The Vertica DB, when configured properly for k-safety, will shut itself down when enough members of the cluster are seen as lost, as in the excerpts above where node0001 saw two of the three nodes as down.
In a stable network, this is a rare occurrence.
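As a rough illustration of "enough members lost", the sketch below applies a simplified majority check. It assumes the database needs strictly more than half of its nodes up to keep running. The real shutdown decision also depends on the configured k-safety and on data availability, neither of which is modeled here. The function name `has_quorum` is hypothetical.

```python
def has_quorum(total_nodes, up_nodes):
    """Simplified check: True when strictly more than half of the nodes are up."""
    return up_nodes * 2 > total_nodes

# Three-node cluster with one node lost: a majority remains.
print(has_quorum(3, 2))
# Three-node cluster with two nodes lost, as in the
# "3 total nodes, 1 up nodes, 2 down nodes" log message.
print(has_quorum(3, 1))
```

Under this assumption, a three-node cluster tolerates the loss of one node but not two, which matches the sequence seen in the log excerpts.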
If network-related issues are consistently causing outages, use the Vertica tools to determine whether the network meets requirements. If it does not, making the changes needed to meet those requirements may resolve the problem. The vnetperf tool is useful for checking whether network requirements are being met. See the Run Data Repository Diagnostic Utilities documentation topic for more information.
If the system meets requirements and consistent network-related outages are still seen, engage your internal network teams for further assistance.