How to use different options when using gpbackup with --plugin-config to back up Data Domain

Article ID: 296538

Updated On:

Products

VMware Tanzu Greenplum

Issue/Introduction

This article goes through the most common options used when running gpbackup with the Data Domain plugin, as well as what these options mean and how they affect the backup process.

It also summarizes how a backup to Data Domain proceeds.

Environment

Product Version: 5.28

Resolution

Process

As with all gpbackup backups, the utility gets the list of tables it needs to back up and starts to issue COPY commands with the ON SEGMENT option.

Each segment starts copying its portion of the table and then waits for the next table.

No segment moves on to the next table until all segments have completed the copy of the current table. Therefore, if there is data skew across segments, it may affect the performance of the backup.

For a backup to the Data Domain, the output of each COPY command is sent to a named pipe that is open between the segment and the Data Domain.
20220118:09:56:18 gpbackup:gpadmin:gpdb-single-m:015538-[DEBUG]:-Writing data for table public.test12 to file (table 12 of 12)
20220118:09:56:18 gpbackup:gpadmin:gpdb-single-m:015538-[DEBUG]:-Worker 0: COPY public.test12 TO PROGRAM '(test -p "<SEG_DATA_DIR>/gpbackup_<SEGID>_20220118095550_pipe_15538_49367" || (echo "Pipe not found <SEG_DATA_DIR>/gpbackup_<SEGID>_20220118095550_pipe_15538_49367">&2; exit 1)) && cat - > <SEG_DATA_DIR>/gpbackup_<SEGID>_20220118095550_pipe_15538_49367' WITH CSV DELIMITER ',' ON SEGMENT IGNORE EXTERNAL PARTITIONS;
Therefore, all of the data is read from the segment and sent to the Data Domain.
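
For reference, a minimal invocation might look like the sketch below. The file paths, database name, and all Data Domain values are placeholders, and the exact keys accepted in the plugin configuration file depend on the gpbackup_ddboost_plugin version installed, so verify them against the plugin documentation for your release.

# /home/gpadmin/ddboost_config.yaml -- example plugin configuration (placeholder values)
executablepath: /usr/local/greenplum-db/bin/gpbackup_ddboost_plugin
options:
  hostname: "dd.example.com"
  username: "ddboost_user"
  password: "changeme"
  storage_unit: "GPDB"
  directory: "gpdb_backups"

# Run the backup through the Data Domain plugin
gpbackup --dbname mydb --plugin-config /home/gpadmin/ddboost_config.yaml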

Deduplication

One of the main benefits of Data Domain is its ability to perform deduplication. 

This means that when you write a file to the Data Domain, it checks whether that data is already saved; if not, it saves the file. If the data is already saved, it just creates the metadata for it.

This requires less storage capacity and less time when writing to the Data Domain.

For backups from Greenplum, this usually means that the first backup to Data Domain can take a long time, and backup times should then start to reduce (as long as most of the data has not changed between backups).

The deduplication check is done on the Data Domain side, so the data is still sent to it. 

If you have a large database where only a few tables change, then the backups should be quick. However, if you have a smaller database where all the tables change frequently, then the backups can take much longer.

This also means that if you run a backup with any option that changes the contents of the data files (--no-compression, --single-data-file, --leaf-partition-data) and then run another backup without that option, it is treated as a new backup: the data files sent to the Data Domain are different, so new data has to be written. A minimal illustration follows below.
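
As a minimal illustration of that last point, keeping the option set identical between runs gives the Data Domain the chance to deduplicate the second run against the first (the database name and configuration path are placeholders):

# First backup writes the data files to the Data Domain
gpbackup --dbname mydb --plugin-config /home/gpadmin/ddboost_config.yaml --single-data-file

# A later backup with the same options produces data files in the same layout,
# so unchanged data can be deduplicated on the Data Domain side
gpbackup --dbname mydb --plugin-config /home/gpadmin/ddboost_config.yaml --single-data-file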


Compression

If you are backing up tables that contain compressed data, the backup uncompresses the data when it reads it with the COPY command.

It is advised that the --no-compression option is used because compressed data does not allow DD Boost to do any deduplication.

If you do not use this option, the data is still uncompressed by the COPY command and is only compressed again as it is sent, so using --no-compression may actually cause the backup itself to take longer.
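
For example, a backup run with gpbackup's own compression disabled so that DD Boost can deduplicate the data stream might look like this (the database name and configuration path are placeholders):

gpbackup --dbname mydb --plugin-config /home/gpadmin/ddboost_config.yaml --no-compression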


Single File

We recommend using the --single-data-file option because multiple data files may cause additional overhead on the Data Domain file system, resulting in longer backup times.
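
Combining this with the compression advice above, a backup using both recommended options might look like the following sketch (placeholders as before):

gpbackup --dbname mydb --plugin-config /home/gpadmin/ddboost_config.yaml \
         --no-compression --single-data-file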


Partition Tables

When the --leaf-partition-data option is used, gpbackup backs up each leaf partition as a separate table.

If you don't use this option, all partitions under the parent table are backed up as one table.

As mentioned above, the backup only moves on to the next table once all the segments have completed the previous one. With this option on, the parent table is split into its individual partitions, so there can be more waiting on segments to finish than without this option.

Therefore, with many partitioned tables and data skew, this option can cause backups to run longer.

The downside of backing up a parent table as one table (that is, not using --leaf-partition-data) is that restores can take longer and you can only restore parent tables, not individual leaf partitions.
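
As a sketch, the commands below back up leaf partitions individually and then restore a single leaf partition from that backup set; the timestamp, table name, and paths are placeholders:

# Back up each leaf partition as its own table
gpbackup --dbname mydb --plugin-config /home/gpadmin/ddboost_config.yaml --leaf-partition-data

# Restore only one leaf partition from that backup
gprestore --timestamp 20220118095550 --plugin-config /home/gpadmin/ddboost_config.yaml \
          --include-table public.sales_1_prt_jan2022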


Backup Versions

As the gpbackup and gprestore utilities are continually improving, it is recommended to use the latest version to take advantage of all bug fixes and improvements.
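
You can confirm the versions currently installed with the utilities' --version flag:

gpbackup --version
gprestore --version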
