Data Repository doesn't start and epoch after outage is far in the past

Products

Network Observability CA Performance Management

Issue/Introduction

Data Repository does not start after "RollBack Database to Last Good Epoch" with error: "Epoch number must be 'last' or between..."

Tried starting Vertica Database with --force option. It does not start

Tried starting from "RollBack Database to Last Good Epoch" configuration from Advanced menu and also tried starting database manually but getting following error:

[dradmin@dr v_drdata_node0001_catalog]$ /opt/vertica/bin/adminTools -t restart_db -d drdata -e 2390915
Info: no password specified, using none
Invalid value for last good epoch: '2390915'
Epoch number must be 'last' or between 2394299 and 2390915 inclusive

When looking at the Epoch log, the following is shown:

[dradmin@dr v_drdata_node0001_catalog]$ less Epoch.log

Last good epoch: 0x247b83 ended at '2000-01-01 13:00:00+13'
Last good catalog version: 0x3c2be6
K-safety: 0
AHM: 0x2488bb ended at '2022-11-19 08:23:22.549881+13

Environment

All supported DX NetOps Performance Management releases

Cause

You have to use the for the AHM (Ancient History Mark) in the Epoch.log. In the following example, this would be 0x2488bb. This converts to 2394299 in decimal

Last good epoch: 0x247b83 ended at '2000-01-01 13:00:00+13'
Last good catalog version: 0x3c2be6
K-safety: 0
AHM: 0x2488bb ended at '2022-11-19 08:23:22.549881+13

However, if Last Good Epoch is different, then this is a problem. If epoch/AHM is good, then those two date/times should be the same. But the 'Last good epoch" is 0x247b83 (2390915 in decimal) 1/1/2000 which is the dawn of time for Vertica.

So the DB not starting is due to AHM being greater than Last good epoch (LGE) instead of the same. This is due to some projection holding the epoch LGE.

Resolution

The following steps should resolve it:

First, start the db in unsafe mode

admintools -u <dbAdminUser> -t start_db -d drdata --unsafe

Where <dbAdminUser> is the dradmin or equivalent OS user that administers the DB.

This should result in a message "Database <dbName>: Startup Succeeded. All Nodes are UP"
Enter vSQL:

/opt/vertica/bin/vsql -h localhost drdata
Run the following query to find the list of projections behind AHM.

SELECT e.node_name, t.table_schema, t.table_name, e.projection_schema, e.projection_name, checkpoint_epoch FROM projection_checkpoint_epochs e, projections p, tables t WHERE e.projection_id = p.projection_id and p.anchor_table_id = t.table_id and not (is_temp_table) and is_behind_ahm and e.is_up_to_date;

This should output something similar to:

node_name | table_schema | table_name | projection_schema | projection_name | checkpoint_epoch

-------------------+--------------+--------------------+-------------------+---------------------------------------+------------------

v_drdata_node0001 | dragg | attribute_instance | dragg | attribute_instance_super_2 | 2390915

v_drdata_node0001 | dragg | attribute_instance | dragg | attribute_instance_attribute_id_first | 2390915

(2 rows)
Then run;

select get_current_epoch() CE,get_last_good_epoch() LGE,get_ahm_epoch() AHM,(get_current_epoch() - get_last_good_epoch()) CeLGDiff,(get_last_good_epoch() - get_ahm_epoch()) LgeAHmDiff ,get_ahm_time(), get_expected_recovery_epoch();
This should output something like:

INFO 4544: Recovery Epoch Computation:

Node Dependencies:

1 - cnt: 219

1 - name: v_drdata_node0001

Nodes certainly in the cluster:

Node 0(v_drdata_node0001), epoch 2390915

Filling more nodes to satisfy node dependencies:

Data dependencies fulfilled, remaining nodes LGEs don't matter:

--

CE | LGE | AHM | CeLGDiff | LgeAHmDiff | get_ahm_time | get_expected_recovery_epoch

---------+---------+---------+----------+------------+-------------------------------------------------+-----------------------------

2394301 | 2390915 | 2394299 | 3386 | -3384 | Current AHM Time: 2022-11-19 08:23:22.549881+13 | 2390915

(1 row)
Once you find the projection holding the epoch < AHM, then disable it by running the below query. Replace <table_schema> and <projection_name> with the values seen in the output from the command in step 3 above.

SELECT do_tm_task('abortrecovery', '<table_schema>.<projection_name>');

In this instance it would be two commands to run.

SELECT do_tm_task('abortrecovery', 'dragg.attribute_instance_super_2');
SELECT do_tm_task('abortrecovery', 'dragg.attribute_instance_attribute_id_first');
Confirm current LGE again.

select get_current_epoch() CE,get_last_good_epoch() LGE,get_ahm_epoch() AHM,(get_current_epoch() - get_last_good_epoch()) CeLGDiff,(get_last_good_epoch() - get_ahm_epoch()) LgeAHmDiff ,get_ahm_time(), get_expected_recovery_epoch();
Shutdown the database which started in unsafe mode.

drdata=> select shutdown('true');

Wait for several minutes until the database stops

9. Start the DB normally using adminTools.

Additional Information

https://support.microfocus.com/kb/kmdoc.php?id=KM03806541