Data loss as a result of a data partition (disk) failure in the Data Repository.

Article ID: 48588

Updated On:

Products

CA Infrastructure Performance

Issue/Introduction

Description:

Root Cause:

Issue with 3rd Party Software - Vertica Database
When a data partition fails on a single node in a clustered Data Repository server, the Vertica process on that node continues to run instead of shutting down as it should. The Data Aggregator continues to transact with the failed node and receives a large number of transaction failures, which prevents it from loading data into any of the nodes in the cluster, resulting in data loss.

Impact:

This defeats the high availability solution. The clustered Data Repository model was adopted so that when one node fails, the remaining nodes continue to accept data and no data is lost. It is this smooth failover that does not occur when the problem described above is encountered.

To date, no customers have run into this issue. The current GA version (IM 2.2) uses the same version of Vertica (6.0.2).

Symptoms:

  • Synchronization failures between the CA Performance Center and Data Aggregator
  • CA Performance Center reports may show "No Data to display" in all Data Aggregator views
  • Data Aggregator process is down
  • You may see errors such as the following in the Data Aggregator's karaf.out and karaf.log files when this problem occurs:

ERROR: Insufficient resources to execute plan on pool general [Timedout waiting for resource request: Request exceeds limits: Memory(KB) Exceeded: Requested = 7274517, Free = 6705162 (Limit = 58273960, Used = 51568798)]
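A quick way to check whether a Data Aggregator is hitting these failures is to scan its karaf logs for the error signature shown above. The sketch below is illustrative; the helper name is ours, and the log path you pass in will depend on your Data Aggregator installation:

```shell
#!/bin/sh
# count_resource_failures LOGFILE
# Prints the number of Vertica resource-failure errors found in LOGFILE.
# Pass the path to your DA's karaf.log or karaf.out.
count_resource_failures() {
  grep -c "Insufficient resources to execute plan" "$1"
}
```

For example, `count_resource_failures /path/to/karaf.log` printing a non-zero count is a strong hint that the Data Aggregator is being rejected by a Vertica node in this state.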

Solution:

Workaround:

Vertica is aware of this issue and provided a script to identify a partition problem and shut down that particular Vertica node.

Instructions:

  1. Copy the check_dir.sh script to each data repository node under the "/opt/vertica" directory.

  2. As the root Linux user, edit the crontab by running:

    crontab -e

  3. Add the following line to the crontab:

    */5 * * * * /opt/vertica/check_dir.sh <db_admin_user> <catalog_dir> <data_dir> > /opt/vertica/check_dir_output.txt

    for example:

    */5 * * * * /opt/vertica/check_dir.sh dradmin /catalog /data > /opt/vertica/check_dir_output.txt

    This runs the check_dir.sh script every 5 minutes to verify that the dradmin user can read from and write to the data and catalog directories.

  4. Save your changes and exit crontab.

    Under the data and catalog directories, a file named "diskHeartbeat" will be created. This file contains the last date and time at which the script was able to successfully write to the disk.
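The vendor-supplied check_dir.sh is not reproduced here, but the heartbeat check it performs can be sketched roughly as follows. This is an illustrative approximation only, not the actual script, and the admintools invocation in the comment is an assumption about how the local node would be stopped:

```shell
#!/bin/sh
# Illustrative sketch of a disk heartbeat check (NOT the vendor's check_dir.sh).
# Usage: heartbeat_check <dir>...
# Writes a diskHeartbeat file with the current timestamp into each directory
# and verifies it can be read back. Returns non-zero if any directory fails.
heartbeat_check() {
  status=0
  for dir in "$@"; do
    hb="$dir/diskHeartbeat"
    # On a failed partition, this write typically surfaces as an I/O error.
    if date > "$hb" 2>/dev/null && [ -r "$hb" ]; then
      echo "OK: $dir"
    else
      echo "FAIL: $dir"
      status=1
    fi
  done
  return $status
}

# In the vendor workaround, a failed check would trigger a shutdown of the
# local Vertica node, e.g. (assumed invocation):
#   /opt/vertica/bin/admintools -t stop_node -s "$(hostname)"
```

Shutting the node down cleanly on a failed heartbeat is the point of the workaround: once the node is down, the Data Aggregator stops transacting with it and fails over to the surviving nodes in the cluster.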

Environment

Release:
Component: IMAGGR