search cancel

Data Repository does not start after "RollBack Database to Last Good Epoch" with error: "Epoch number must be 'last' or between..."

book

Article ID: 255898

calendar_today

Updated On:

Products

DX NetOps CA Performance Management - Usage and Administration CA Infrastructure Performance

Issue/Introduction

Tried starting Vertica Database with --force option. It does not start

Tried starting from "RollBack Database to Last Good Epoch" configuration from Advanced menu and also tried starting database manually but getting following error:

[[email protected] v_drdata_node0001_catalog]$ /opt/vertica/bin/adminTools -t restart_db -d drdata -e 2390915
Info: no password specified, using none
Invalid value for last good epoch: '2390915'
Epoch number must be 'last' or between 2394299 and 2390915 inclusive

When looking at the Epoch log, the following is shown:

[[email protected] v_drdata_node0001_catalog]$ less Epoch.log

Last good epoch: 0x247b83 ended at '2000-01-01 13:00:00+13'
Last good catalog version: 0x3c2be6
K-safety: 0
AHM: 0x2488bb ended at '2022-11-19 08:23:22.549881+13

Environment

DX NetOps CAPM Release : 22.2

Cause

You have to use the for the AHM (Ancient History Mark) in the Epoch.log. In the following example, this would be 0x2488bb. This converts to 2394299 in decimal

Last good epoch: 0x247b83 ended at '2000-01-01 13:00:00+13'
Last good catalog version: 0x3c2be6
K-safety: 0
AHM: 0x2488bb ended at '2022-11-19 08:23:22.549881+13

However, if Last Good Epoch is different, then this is a problem. If epoch/AHM is good, then those two date/times should be the same. But the 'Last good epoch" is 0x247b83 (2390915 in decimal) 1/1/2000 which is the dawn of time for Vertica.

So the DB not starting is due to AHM being greater than Last good epoch (LGE) instead of the same.  This is due to some projection holding the epoch LGE. 

 

Resolution

The following steps should resolve it:

  1. First, start the db in unsafe mode

    admintools -t start_db -d -U

  2. Enter vSQL:

    /opt/vertica/bin/vsql -h localhost drdata

  3. Run the following query to find the list of projections behind AHM.

    SELECT e.node_name, t.table_schema, t.table_name, e.projection_schema, e.projection_name, checkpoint_epoch FROM projection_checkpoint_epochs e, projections p, tables t WHERE e.projection_id = p.projection_id and p.anchor_table_id = t.table_id and not (is_temp_table) and is_behind_ahm and e.is_up_to_date;

    This should output something similar to:

       node_name       | table_schema |     table_name     | projection_schema |            projection_name            | checkpoint_epoch

    -------------------+--------------+--------------------+-------------------+---------------------------------------+------------------

    v_drdata_node0001  | dragg        | attribute_instance | dragg             | attribute_instance_super_2            |          2390915

    v_drdata_node0001  | dragg        | attribute_instance | dragg             | attribute_instance_attribute_id_first |          2390915

    (2 rows)

  4. Then run;

    select get_current_epoch() CE,get_last_good_epoch() LGE,get_ahm_epoch() AHM,(get_current_epoch() - get_last_good_epoch()) CeLGDiff,(get_last_good_epoch() - get_ahm_epoch()) LgeAHmDiff ,get_ahm_time(), get_expected_recovery_epoch();

    This should result in:

    INFO 4544:  Recovery Epoch Computation:

  5. Node Dependencies:

    1 - cnt: 219

    1 - name: v_drdata_node0001

    Nodes certainly in the cluster:

            Node 0(v_drdata_node0001), epoch 2390915

    Filling more nodes to satisfy node dependencies:

    Data dependencies fulfilled, remaining nodes LGEs don't matter:

    --

       CE    |   LGE   |   AHM   | CeLGDiff | LgeAHmDiff |                  get_ahm_time                   | get_expected_recovery_epoch

    ---------+---------+---------+----------+------------+-------------------------------------------------+-----------------------------

    2394301 | 2390915 | 2394299 |     3386 |      -3384 | Current AHM Time: 2022-11-19 08:23:22.549881+13 |                     2390915

    (1 row)

  6. Once you find the projection holding the epoch < AHM, then disable it by running the below query:

    SELECT do_tm_task('abortrecovery', '');
  7. Confirm current LGE again.
     
    select get_current_epoch() CE,get_last_good_epoch() LGE,get_ahm_epoch() AHM,(get_current_epoch() - get_last_good_epoch()) CeLGDiff,(get_last_good_epoch() - get_ahm_epoch()) LgeAHmDiff ,get_ahm_time(), get_expected_recovery_epoch();

  8. Shutdown the database which started in unsafe mode.

         drdata=> select shutdown('true');

               
              Wait for several minutes until the database stops

     9. Start the DB in normal mode.

Additional Information

https://support.microfocus.com/kb/kmdoc.php?id=KM03806541

 

Attachments