The copy cluster task between our production and backup Vertica instances now fails after upgrading to version 10.1.1.20. It had been working reliably since version 9, and the script is unchanged. Now we get vague error messages and the copy cluster fails.
Every time I run the copy cluster, it reports that one of the 5 nodes (which node varies randomly) is missing a critical file. For example:
]# cat copy_cluster.log
stop the db
Database drdata stopped successfully
Starting copy of database drdata.
Participating nodes: v_drdata_node0001, v_drdata_node0002, v_drdata_node0003, v_drdata_node0004, v_drdata_node0005.
Snapshotting database.
Snapshot complete.
sync the db
Error: Missing critical file: [XX.XX.XX.XXX]:/opt/catalog/drdata/v_drdata_node0005_catalog/Snapshots/Copy_drdata.txt
Copycluster FAILED.
In this case it says node 5 is missing the file, yet when I go to that node the file clearly does exist:
# cd /opt/catalog/drdata/v_drdata_node0005_catalog/Snapshots/
# ll
total 649780
-rw------- 1 dradmin verticadba 642861002 Mar 20 09:00 Copy_drdata.ctlg
-rw------- 1 dradmin verticadba 11996782 Mar 20 09:00 Copy_drdata.files
-rw------- 1 dradmin verticadba 10500768 Mar 20 09:00 Copy_drdata.manifest
-rw------- 1 dradmin verticadba 5284 Mar 20 09:00 Copy_drdata.txt
-rw------- 1 dradmin verticadba 0 Mar 20 09:00 Copy_drdata.udfs
I have run this multiple times and the error is always the same, except it names a different node each time (node 5, then node 3, node 4, etc.).
Environment: Dx NetOps Performance Management 22.2
Cause: a defect in the Vertica vbr.py script.
The fix was to change one line in the /opt/vertica/bin/vbr.py script.
From:
session_host, db_paths[next(iter(self._participating_nodes))], snap_name)
To this:
session_host, db_paths[init_node], snap_name)
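The one-line fix makes sense of the symptom: `self._participating_nodes` is a set, and `next(iter(...))` on a set returns an arbitrary element, so the snapshot path was built against a different, effectively random node on each run, while `init_node` pins it to the known initiator. A minimal sketch of the difference, using hypothetical node names and paths (not the actual vbr.py internals):

```python
# Sketch only: simplified stand-ins for vbr.py's participating-node set and
# per-node catalog paths. Names below are assumptions for illustration.
participating_nodes = {
    "v_drdata_node0001", "v_drdata_node0002", "v_drdata_node0003",
    "v_drdata_node0004", "v_drdata_node0005",
}
db_paths = {n: f"/opt/catalog/drdata/{n}_catalog" for n in participating_nodes}

# Buggy behavior: a Python set is unordered, so this picks an arbitrary
# node -- matching the error naming a different node on each run.
arbitrary_node = next(iter(participating_nodes))
buggy_path = db_paths[arbitrary_node]

# Fixed behavior: always resolve the path against the initiator node
# (init_node in vbr.py), which is deterministic.
init_node = "v_drdata_node0001"  # assumption: the node running the backup
fixed_path = db_paths[init_node]
```

The fixed lookup always yields the same path for the initiator node, so the snapshot validation checks the node that actually holds the files.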