Data Repository does not start after "RollBack Database to Last Good Epoch" with error: "Epoch number must be 'last' or between..."
Tried starting Vertica Database with --force option. It does not start
Tried starting from "RollBack Database to Last Good Epoch" configuration from Advanced menu and also tried starting database manually but getting following error:
[dradmin@dr v_drdata_node0001_catalog]$ /opt/vertica/bin/adminTools -t restart_db -d drdata -e 2390915
Info: no password specified, using none
Invalid value for last good epoch: '2390915'
Epoch number must be 'last' or between 2394299 and 2390915 inclusive
When looking at the Epoch log, the following is shown:
[dradmin@dr v_drdata_node0001_catalog]$ less Epoch.log
Last good epoch: 0x247b83 ended at '2000-01-01 13:00:00+13'
Last good catalog version: 0x3c2be6
K-safety: 0
AHM: 0x2488bb ended at '2022-11-19 08:23:22.549881+13
All supported DX NetOps Performance Management releases
You have to use the for the AHM (Ancient History Mark) in the Epoch.log. In the following example, this would be 0x2488bb
. This converts to 2394299 in decimal
Last good epoch: 0x247b83 ended at '2000-01-01 13:00:00+13'
Last good catalog version: 0x3c2be6
K-safety: 0
AHM: 0x2488bb ended at '2022-11-19 08:23:22.549881+13
However, if Last Good Epoch is different, then this is a problem. If epoch/AHM is good, then those two date/times should be the same. But the 'Last good epoch" is 0x247b83
(2390915 in decimal) 1/1/2000 which is the dawn of time for Vertica.
So the DB not starting is due to AHM being greater than Last good epoch (LGE) instead of the same. This is due to some projection holding the epoch LGE.
The following steps should resolve it:
admintools -u <dbAdminUser> -t start_db -d drdata --unsafe
/opt/vertica/bin/vsql -h localhost drdata
SELECT e.node_name, t.table_schema, t.table_name, e.projection_schema, e.projection_name, checkpoint_epoch FROM projection_checkpoint_epochs e, projections p, tables t WHERE e.projection_id = p.projection_id and p.anchor_table_id = t.table_id and not (is_temp_table) and is_behind_ahm and e.is_up_to_date;
node_name | table_schema | table_name | projection_schema | projection_name | checkpoint_epoch
-------------------+--------------+--------------------+-------------------+---------------------------------------+------------------
v_drdata_node0001 | dragg | attribute_instance | dragg | attribute_instance_super_2 | 2390915
v_drdata_node0001 | dragg | attribute_instance | dragg | attribute_instance_attribute_id_first | 2390915
(2 rows)
Then run; select get_current_epoch() CE,get_last_good_epoch() LGE,get_ahm_epoch() AHM,(get_current_epoch() - get_last_good_epoch()) CeLGDiff,(get_last_good_epoch() - get_ahm_epoch()) LgeAHmDiff ,get_ahm_time(), get_expected_recovery_epoch();
This should output something like:INFO 4544: Recovery Epoch Computation:
Node Dependencies:
1 - cnt: 219
1 - name: v_drdata_node0001
Nodes certainly in the cluster:
Node 0(v_drdata_node0001), epoch 2390915
Filling more nodes to satisfy node dependencies:
Data dependencies fulfilled, remaining nodes LGEs don't matter:
--
CE | LGE | AHM | CeLGDiff | LgeAHmDiff | get_ahm_time | get_expected_recovery_epoch
---------+---------+---------+----------+------------+-------------------------------------------------+-----------------------------
2394301 | 2390915 | 2394299 | 3386 | -3384 | Current AHM Time: 2022-11-19 08:23:22.549881+13 | 2390915
(1 row)
SELECT do_tm_task('abortrecovery', '<table_schema>.<projection_name>');
SELECT do_tm_task('abortrecovery', 'dragg.attribute_instance_super_2');
SELECT do_tm_task('abortrecovery', 'dragg.attribute_instance_attribute_id_first');
Confirm current LGE again.
select get_current_epoch() CE,get_last_good_epoch() LGE,get_ahm_epoch() AHM,(get_current_epoch() - get_last_good_epoch()) CeLGDiff,(get_last_good_epoch() - get_ahm_epoch()) LgeAHmDiff ,get_ahm_time(), get_expected_recovery_epoch();
Wait for several minutes until the database stops
9. Start the DB normally using adminTools.