A client uses gpbackup version 1.30.5 to perform large-scale full backups, ranging from 90 TB to 432 TB, which are stored on a Data Domain system. These backups are then replicated to a disaster recovery (DR) site using the ddboost plugin. Replication transfers the entire backup from the production Data Domain to a second Data Domain at the DR site; the two systems are located in different data centers.
The replication jobs are managed by gpbackup_manager version 1.8.3 and fail sporadically. When a failure occurs, it is typically detected and logged by the ddboost plugin after the job has been running for several hours.
In most cases, a single restart is enough to complete the replication successfully. Occasionally, a second restart is needed before the job completes.
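Because one or two restarts usually suffice, the restart can be wrapped in a small retry loop. The following is a minimal sketch, not part of gpbackup_manager: the function name run_with_retries and its parameters are hypothetical, and it assumes the replication command exits with a non-zero status when it fails.

```python
import subprocess
import sys
import time


def run_with_retries(cmd, max_attempts=3, delay_seconds=0.0):
    """Run cmd until it exits with status 0 or max_attempts is reached.

    Returns the 1-based number of the attempt that succeeded,
    or 0 if every attempt failed.
    Assumption: a failed replication job exits with a non-zero status.
    """
    for attempt in range(1, max_attempts + 1):
        result = subprocess.run(cmd)
        if result.returncode == 0:
            return attempt
        print(
            f"attempt {attempt} failed with exit code {result.returncode}",
            file=sys.stderr,
        )
        # Wait before the next attempt, but not after the final one.
        if attempt < max_attempts:
            time.sleep(delay_seconds)
    return 0
```

To use it, pass the same replicate-backup command line used for the original job as a list of arguments, with max_attempts=3 to allow the initial run plus two restarts, and a delay long enough for a transient network problem to clear.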
GPDB 6.27.0
gpbackup 1.30.5
gpbackup_manager 1.8.3
Network instability
To troubleshoot gpbackup_manager replication issues, you should check the following logs:
gpbackup_manager logs:
Located in the default log directory (usually /home/gpadmin/gpAdminLogs)
Look for files named gpbackup_manager_<timestamp>.log
DD Boost plugin logs:
Check for gpbackup_ddboost_plugin.log in the log directory
Verbose logging output:
Run gpbackup_manager with the --verbose flag to get more detailed error information
Example: gpbackup_manager replicate-backup YYYYMMDDHHMMSS --plugin-config /path/to/config.yaml --verbose
Job history in Replication Monitor:
If SQL Server replication is also involved, check its Replication Monitor for detailed agent histories (this does not apply to the gpbackup_manager jobs themselves)
Data Domain system logs:
Review logs on both source and target Data Domain systems for replication-related issues
Network logs:
Check network logs between the primary and DR Data Domain systems for connectivity issues
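The log checks above can be partly automated. The sketch below scans a log directory for gpbackup_manager log files and collects lines that look like warnings or errors. The function name find_problem_lines is hypothetical, and the bracketed severity markers ([ERROR], [CRITICAL], [WARNING]) are an assumption about the log line format; adjust the pattern if the logs at your site differ.

```python
import glob
import os
import re

# Assumption: log lines carry a bracketed severity such as [ERROR].
SEVERITY = re.compile(r"\[(ERROR|CRITICAL|WARNING)\]")


def find_problem_lines(log_dir, pattern="gpbackup_manager_*.log"):
    """Return (file name, line number, text) for each line with a matching severity."""
    hits = []
    for path in sorted(glob.glob(os.path.join(log_dir, pattern))):
        # errors="replace" keeps the scan going if a log contains odd bytes.
        with open(path, errors="replace") as handle:
            for number, line in enumerate(handle, start=1):
                if SEVERITY.search(line):
                    hits.append((os.path.basename(path), number, line.rstrip("\n")))
    return hits
```

For example, find_problem_lines("/home/gpadmin/gpAdminLogs") scans the default log directory mentioned above; passing a different pattern argument points the same scan at the DD Boost plugin log.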
Steps to simulate network instability between Data Domain systems:
1. Set up two Data Domain systems with replication.
2. Populate Data Domain 1 with a large amount of data.
3. Start replication from Data Domain 1 to Data Domain 2.
4. Kill Data Domain 2 while replication is running to simulate network flakiness.