Vertica fails to start after an unclean shutdown: the catalog read fails with a Python "KeyError: 'oid'"
SYMPTOMS:
Database fails to start with "Unable to read database catalogs".
AdminTools "Roll Back Database to Last Good Epoch" fails because Epoch.log is missing.
Forced start (admintools -t start_db -F) exits early during catalog initialization.
AdminTools log shows a Traceback in compute_vdatabase.py at self.oid = nodedeets['oid'].
ENVIRONMENT:
Database: Vertica 23.4.0.12
Product: NetOps 24.3.1
Setup: Single Vertica node in a Disaster Recovery configuration (two identical Data Repository environments)
PREREQUISITES:
Administrator access (dradmin).
Access to a redundant or healthy Vertica node if performing a copy cluster.
STEPS:
1. ASSESS CATALOG STATUS:
Check whether the Vertica port is listening and whether the catalog directory is empty.
Command: ss -atupn
EXPECTED: The Vertica port (5433 by default) is not listening, and /Catalog may be empty if the catalog is corrupted.
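The check in this step can be scripted. The following is a minimal sketch, not a supported tool: DB_PORT defaults to 5433 (Vertica's default client port) and CATALOG_DIR is a hypothetical placeholder that must be replaced with the real catalog path for your database. The helper names port_listening and dir_is_empty are illustrative only.

```shell
#!/bin/sh
# Sketch only. DB_PORT and CATALOG_DIR are assumptions; adjust to your install.
DB_PORT="${DB_PORT:-5433}"                      # 5433 is Vertica's default client port
CATALOG_DIR="${CATALOG_DIR:-/path/to/Catalog}"  # hypothetical placeholder path

# Reads `ss -atupn` output on stdin; succeeds if a TCP socket is in
# LISTEN state on the port given as $1.
port_listening() {
    awk -v port="$1" '$2 == "LISTEN" && $5 ~ (":" port "$") { found = 1 }
                      END { exit !found }'
}

# Succeeds if $1 is an existing directory with no entries.
dir_is_empty() {
    [ -d "$1" ] && [ -z "$(ls -A "$1" 2>/dev/null)" ]
}

# Only query sockets when ss is available on this host.
if command -v ss >/dev/null 2>&1; then
    if ss -atupn 2>/dev/null | port_listening "$DB_PORT"; then
        echo "Port $DB_PORT: listening"
    else
        echo "Port $DB_PORT: not listening"
    fi
fi

if dir_is_empty "$CATALOG_DIR"; then
    echo "$CATALOG_DIR: empty"
elif [ -d "$CATALOG_DIR" ]; then
    echo "$CATALOG_DIR: contains files"
else
    echo "$CATALOG_DIR: not found"
fi
```

A "not listening" result together with an empty catalog directory matches the expected state described in this step.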
2. ATTEMPT CATALOG RECOVERY (OPTIONAL):
Attempt a forced start to recover the catalog metadata.
Command: /opt/vertica/bin/admintools -t start_db -d [database_name] -F
NOTE: If this fails with "KeyError: 'oid'", the catalog is likely too corrupted for local recovery.
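To confirm that the forced start failed with this specific error, the AdminTools log can be searched for the traceback. The sketch below assumes the log is at /opt/vertica/log/adminTools.log (the usual default; override LOG if yours differs). The helper name has_oid_keyerror is illustrative.

```shell
#!/bin/sh
# Sketch only. LOG is an assumed default location; adjust if needed.
LOG="${LOG:-/opt/vertica/log/adminTools.log}"

# Succeeds if the file given as $1 contains the KeyError: 'oid' line.
has_oid_keyerror() {
    grep -q "KeyError: 'oid'" "$1" 2>/dev/null
}

if has_oid_keyerror "$LOG"; then
    echo "Found KeyError: 'oid' in $LOG; most recent context:"
    # Show each match with the five lines before it, keeping the latest 20 lines.
    grep -n -B5 "KeyError: 'oid'" "$LOG" | tail -n 20
else
    echo "No KeyError: 'oid' found in $LOG"
fi
```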
3. PERFORM COPY CLUSTER FROM REDUNDANT NODE:
If a redundant node or healthy environment exists, use copycluster to restore the catalog and data.
Command: /opt/vertica/bin/vbr.py --task copycluster --config-file /opt/vertica/config/copycluster.ini
EXPECTED: Data syncs to the destination cluster and the catalog is reinitialized.
4. COMPLETE REBUILD (IF PREVIOUS STEPS FAIL):
If copycluster fails with "Catalog bootstrap failed", a complete rebuild of the database may be required.
Action: Back up and then delete the existing /data and /catalog directories, then re-run the restore or copycluster process.
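The backup-then-delete action can be sketched as follows. This is an illustration under stated assumptions, not a supported procedure: BACKUP_ROOT is a hypothetical placeholder, the directories to process are passed as arguments, and the helper backup_then_remove is an illustrative name. It removes a directory only after its archive has been written and read back successfully. Run it only with the database stopped.

```shell
#!/bin/sh
# Sketch only: archive a directory, verify the archive lists cleanly,
# and only then remove the original. BACKUP_ROOT is an assumption.
BACKUP_ROOT="${BACKUP_ROOT:-/path/to/backups}"   # hypothetical placeholder

backup_then_remove() {
    src="$1"
    dest_root="$2"
    [ -d "$src" ] || { echo "skip: $src is not a directory" >&2; return 1; }
    mkdir -p "$dest_root" || return 1
    archive="$dest_root/$(basename "$src")-$(date +%Y%m%d%H%M%S).tar.gz"
    # -C keeps paths in the archive relative to the parent directory.
    tar -czf "$archive" -C "$(dirname "$src")" "$(basename "$src")" || return 1
    # Refuse to delete anything unless the archive can be read back.
    tar -tzf "$archive" >/dev/null || return 1
    rm -rf -- "$src"
    # Print the archive path so the caller can record it.
    echo "$archive"
}

# Directories are taken from the command line; with no arguments nothing runs.
for dir in "$@"; do
    backup_then_remove "$dir" "$BACKUP_ROOT"
done
```

Saved under a hypothetical name such as rebuild_prep.sh, it would be invoked as: sh rebuild_prep.sh /data /catalog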
VERIFY SUCCESS:
Run /opt/vertica/bin/admintools -t list_allnodes.
Confirm the Node State is UP.
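The verification can also be automated. The sketch below assumes that list_allnodes prints a pipe-separated table whose third column is the node state (Node | Host | State | Version | DB); confirm this against the actual output on your system. The helper name all_nodes_up is illustrative.

```shell
#!/bin/sh
# Sketch only. Assumes a pipe-separated table with State in column 3.
ADMINTOOLS="${ADMINTOOLS:-/opt/vertica/bin/admintools}"

# Reads list_allnodes output on stdin; succeeds only if at least one
# node row is present and every node row has State equal to UP.
all_nodes_up() {
    awk -F'|' '
        NF >= 3 {
            state = $3
            gsub(/[[:space:]]/, "", state)
            if (state == "State") next      # skip the header row
            rows++
            if (state != "UP") bad++
        }
        END { exit !(rows > 0 && bad == 0) }'
}

# Only call admintools when it is installed on this host.
if [ -x "$ADMINTOOLS" ]; then
    if "$ADMINTOOLS" -t list_allnodes | all_nodes_up; then
        echo "All nodes are UP"
    else
        echo "At least one node is not UP (or no nodes were listed)" >&2
    fi
fi
```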
RECOMMENDATION:
Running Data Repository on a single Vertica node is not recommended.
Run a cluster of at least three nodes to provide redundancy.