Data Repository node does not go to DOWN status when its 'data' volume runs out of disk space

Products

CA Performance Management Network Observability

Issue/Introduction

Three-node Data Repository cluster.

All 3 nodes in the Data Repository cluster showed status UP, but one of them (Node0003) had disk space problems. At some point that node's file system was 100% full.
The customer expected the out-of-space node to be shut down and set to status DOWN, so that the Data Aggregator (DA) could no longer reach it, would switch to one of the other running nodes, and would keep writing data to the database. Instead, the affected node kept answering the DA's heartbeat checks.

vertica.log shows messages of this type:

....
2021-03-07 02:00:14.934 Init Session:7f5515ff9700 [VMPI] <INFO> TMOperation: Error on Moveout: (Table: ehealth.nrm_power_env_sensor_rate) (Projection: ehealth.nrm_power_env_sensor_rate_super_seg_b0): Could not write to [/opt/vertica/data/capm/v_capm_node0003_data]: Volume [/opt/vertica/data/capm/v_capm_node0003_data] has insufficient space.

2021-03-07 02:00:14.934 Init Session:7f5515ff9700 <ERROR> @v_capm_node0003: 53100/2927: Could not write to [/opt/vertica/data/capm/v_capm_node0003_data]: Volume [/opt/vertica/data/capm/v_capm_node0003_data] has insufficient space.

2021-03-07 02:00:18.774 TM Moveout:7f5696ffd700 <ERROR> @v_capm_node0003: {threadShim} 53100/2927: Could not write to [/opt/vertica/data/capm/v_capm_node0003_data]: Volume [/opt/vertica/data/capm/v_capm_node0003_data] has insufficient space.
LOCATION: reserveSpace, /scratch_a/release/svrtar14870/vbuild/vertica/Workload/ResourceManager.cpp:4816
....

So there is clearly a disk space issue on this node, and it is with the 'data' volume.
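
To confirm which volume is affected without reading the log by hand, the messages above can be counted per volume. The following is a minimal Python sketch, not a CA or Vertica tool. It assumes the message wording matches the excerpt above (other Vertica versions may phrase it differently), and the function names are hypothetical.

```python
import re
from collections import Counter

# Pattern taken from the log excerpt above; the wording is an assumption
# for any Vertica version other than the one that produced that excerpt.
INSUFFICIENT_SPACE = re.compile(r"Volume \[([^\]]+)\] has insufficient space")


def count_full_volume_errors(lines):
    """Count, per volume path, the log lines that report insufficient space."""
    counts = Counter()
    for line in lines:
        match = INSUFFICIENT_SPACE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def scan_log_file(path):
    """Convenience wrapper (hypothetical): scan a vertica.log file on disk."""
    with open(path, errors="replace") as log_file:
        return count_full_volume_errors(log_file)
```

Calling `scan_log_file` on the node's vertica.log returns a `Counter` whose keys are the volume paths named in the errors, so a 'data' path such as the one in the excerpt stands out immediately.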

Under this condition, is the affected node expected to shut down (go to DOWN state) and leave the load to the other nodes, instead of continuing to respond to DA heartbeat requests?

Environment

Release : Any

Component : IM Data Storage

Cause

The DA heartbeat command makes an individual call to each node in the cluster to check its UP status. It does not check whether the disk on the node is full.
As long as the heartbeat sent to the database comes back, the DA assumes the node is up and can be sent commands to run.
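
Because the heartbeat does not look at disk usage, free space on the 'data' volume has to be watched separately. The sketch below is one possible way to do that with the Python standard library, run locally on each Data Repository node. It is not part of the DA or of Vertica. The function names and the 90% warning threshold are assumptions chosen for illustration, and the path in the usage note is the one from the log excerpt.

```python
import shutil


def percent_used(total_bytes, used_bytes):
    """Return used space as a percentage of total space."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return 100.0 * used_bytes / total_bytes


def check_volume(path, warn_at_percent=90.0):
    """Report usage for the file system that holds `path`.

    warn_at_percent is an arbitrary illustrative threshold,
    not a Vertica or DA setting.
    """
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    used_pct = percent_used(usage.total, usage.used)
    return {
        "path": path,
        "used_percent": round(used_pct, 1),
        "low_space": used_pct >= warn_at_percent,
    }
```

For example, `check_volume("/opt/vertica/data/capm/v_capm_node0003_data")` on the affected node would return `low_space` as `True` once usage reaches the chosen threshold, even though the node still answers heartbeats.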

Resolution

According to Vertica's advice, a full 'catalog' disk volume causes the node to shut down, while with a full 'data' disk volume the node can stay up.

When the 'data' disk fills up, warnings appear in the logs and some queries may fail, but the database and the node can keep operating despite those messages.
When the 'catalog' disk fills up, the errors are more severe because the information held in the catalog is critical to the node's operation, and the node can crash and go DOWN as a result.

Refer to the information from the link below:
https://www.vertica.com/docs/10.1.x/HTML/Content/Authoring/AdministratorsGuide/ManageDiskSpace/ManagingDiskSpace.htm?Highlight=Managing%20Disk%20Space

Managing Disk Space
Vertica detects and reports low disk space conditions in the log file so you can address the issue before serious problems occur. It also detects and reports low disk space conditions via SNMP traps if enabled.

Critical disk space issues are reported sooner than other issues. For example, running out of catalog space is fatal; therefore, Vertica reports the condition earlier than less critical conditions. To avoid database corruption when the disk space falls beyond a certain threshold, Vertica begins to reject transactions that update the catalog or data.

In conclusion, this is expected functionality:

When the 'data' volume is full, errors appear in the logs, but the node stays UP.
When the 'catalog' volume is full, the node is expected to go DOWN.