The vbr copycluster task was failing with the following rsync errors in the vbr log:
VbrError: On host 10.73.87.137: Error accessing remote storage: failed accessing remote storage on hostname009: rsync: read errors mapping "/data/drdata/v_drdata_node0002_data/169/02f4d452012bc54adfb7d8d989929a2900b000060592f871_0.gt": Input/output error (5)
rsync: read errors mapping "/CA/data/drdata/v_drdata_node0002_data/673/02f4d452012bc54adfb7d8d989929a2900b0000605b2abb1_0.gt": Input/output error (5)
rsync: read errors mapping "/CA/data/drdata/v_drdata_node0002_data/743/02f4d452012bc54adfb7d8d989929a2900b0000605b261bf_0.gt": Input/output error (5)
rsync: read errors mapping "/CA/data/drdata/v_drdata_node0002_data/753/02f4d452012bc54adfb7d8d989929a2900b0000605b261c9_0.gt": Input/output error (5)
rsync: read errors mapping "/CA/data/drdata/v_drdata_node0002_data/169/02f4d452012bc54adfb7d8d989929a2900b000060592f871_0.gt": Input/output error (5)
rsync: read errors mapping "/CA/data/drdata/v_drdata_node0002_data/673/02f4d452012bc54adfb7d8d989929a2900b0000605b2abb1_0.gt": Input/output error (5)
rsync: read errors mapping "/CA/data/drdata/v_drdata_node0002_data/743/02f4d452012bc54adfb7d8d989929a2900b0000605b261bf_0.gt": Input/output error (5)
rsync: read errors mapping "/CA/data/drdata/v_drdata_node0002_data/753/02f4d452012bc54adfb7d8d989929a2900b0000605b261c9_0.gt": Input/output error (5)
ERROR: 169/02f4d452012bc54adfb7d8d989929a2900b000060592f871_0.gt failed verification -- update discarded.
ERROR: 673/02f4d452012bc54adfb7d8d989929a2900b0000605b2abb1_0.gt failed verification -- update discarded.
ERROR: 743/02f4d452012bc54adfb7d8d989929a2900b0000605b261bf_0.gt failed verification -- update discarded.
ERROR: 753/02f4d452012bc54adfb7d8d989929a2900b0000605b261c9_0.gt failed verification -- update discarded.
rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1042) [sender=3.0.7]
: returncode=23
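For reference, the copycluster task that produced these errors is run with vbr against a copycluster configuration file; the exact config file name and path below are assumptions for illustration only:
/opt/vertica/bin/vbr --task copycluster --config-file /home/dbadmin/copycluster.ini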
We then checked /var/log/messages on the source system:
Mar 19 22:35:38 txanunxlipcp004 kernel: end_request: I/O error, dev sdb, sector 468409304
Mar 19 22:36:27 txanunxlipcp004 kernel: megaraid_sas 0000:03:00.0: 1251514 (701058948s/0x0002/FATAL) - Unrecoverable medium error during recovery on PD 02(e0x20/s2) at f8b3371
Mar 19 22:36:37 txanunxlipcp004 kernel: megaraid_sas 0000:03:00.0: 1251520 (701058959s/0x0002/FATAL) - Unrecoverable medium error during recovery on PD 02(e0x20/s2) at f8b3371
Mar 19 22:36:39 txanunxlipcp004 kernel: megaraid_sas 0000:03:00.0: 1251522 (701058961s/0x0002/FATAL) - Unrecoverable medium error during recovery on PD 02(e0x20/s2) at f8b33a8
Mar 19 22:36:49 txanunxlipcp004 kernel: megaraid_sas 0000:03:00.0: 1251528 (701058971s/0x0002/FATAL) - Unrecoverable medium error during recovery on PD 02(e0x20/s2) at f8b3371
Mar 19 22:36:51 txanunxlipcp004 kernel: megaraid_sas 0000:03:00.0: 1251530 (701058973s/0x0002/FATAL) - Unrecoverable medium error during recovery on PD 02(e0x20/s2) at f8b33a8
These unrecoverable medium errors indicated a failing physical disk on the source system.
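A quick way to confirm this kind of disk problem on the source node is to scan the kernel log for I/O and medium errors (shown as an example; the log file location can differ by distribution):
grep -iE "I/O error|medium error" /var/log/messages | tail -20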
We then shut down the database on the affected node and set the Ancient History Mark (AHM) so the node could be rebuilt, bypassing the bad areas of the disk and allowing the copycluster task to complete.
Release : 20.2
Component :
Stop Vertica on the problem node using the following command:
/opt/vertica/bin/admintools -t stop_node --hosts xx.xx.xx.xx
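Optionally, confirm the node is down before proceeding. A minimal check, assuming the database name drdata used in the restart command later in this procedure:
/opt/vertica/bin/admintools -t view_cluster -d drdata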
Run the following vsql command to set the Ancient History Mark (AHM) in the database. First enter the vsql prompt, then run the command, then quit to exit the prompt.
/opt/vertica/bin/vsql -U <dbAdminUser> -w <dbAdminPassword>
select make_ahm_now(true);
\q
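To verify that the AHM actually advanced, you can compare the AHM epoch with the current epoch from the same vsql prompt before quitting (a quick sanity check, not required by the procedure):
select get_ahm_epoch(), get_current_epoch();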
Restart the database on the problem node with the following command:
/opt/vertica/bin/admintools -t restart_node -d drdata --hosts x.x.x.x
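Once the node is back up, its state can be checked from any surviving node, and the copycluster task can be retried. The config file path below is the same assumed example as above:
/opt/vertica/bin/vsql -U <dbAdminUser> -w <dbAdminPassword> -c "select node_name, node_state from nodes;"
/opt/vertica/bin/vbr --task copycluster --config-file /home/dbadmin/copycluster.ini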
See the following KB article to rebuild the problem Vertica node:
https://knowledge.broadcom.com/external/article?articleId=8078