Data Repository database backup fails with rsync error

Article ID: 100907

Updated On:

Products

CA Infrastructure Management

CA Performance Management - Usage and Administration

Issue/Introduction

The Vertica Data Repository backup fails and reports the following error message:

Error: Timed out connecting to rsync daemon on host <nodeHostName>.
Backup FAILED.

Environment

All supported CA Performance Management releases

Cause

A common cause is one or more previously failed attempts to run the backup job. A failed backup often leaves broken rsync connections behind on the database cluster nodes.

These rsync connections are started by a parent process. When that process dies and breaks the connection, the rsync processes keep running with their own (child) PID but with a parent PID that no longer exists.

This leaves them in a disowned, orphaned state (referred to in this article as zombie rsync processes) in which they must be shut down manually.

The leftover processes prevent new rsync connections from being created and trigger the error message shown above.

Resolution

Use either of these commands to list and identify the existing rsync processes on each node in the database cluster. 
  • ps -ef | grep rsync 
  • lsof -i :50000 -S 
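Check whether any rsync process is listed with a child PID but no valid parent PID. On Linux, an orphaned process is typically re-parented to PID 1, so it may show a parent PID of 1 rather than a blank value. If such processes are present, continue with the steps below. If not, there may be other causes yet to be identified; open a support case for additional investigation.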
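The check above can be sketched as a small filter that makes the PID and parent PID columns explicit. This is a minimal sketch, not part of the product: the function name list_orphaned_rsync is hypothetical, and it assumes that an orphaned process shows a parent PID of 1, which is typical on Linux but may differ on systems that use another process as the reaper.

```shell
# Hypothetical helper: reads "ps -eo pid,ppid,etime,args" output on stdin,
# keeps the header line, and prints only rsync processes whose parent PID is 1.
# Assumption: orphaned processes have been re-parented to PID 1.
list_orphaned_rsync() {
    awk 'NR == 1 { print; next }        # keep the column header
         $4 ~ /rsync/ && $2 == 1 { print }  # $2 is PPID, $4 is the command name
    '
}
```

Run it on each node as: ps -eo pid,ppid,etime,args | list_orphaned_rsync. Only the first word of the command line is matched, so the awk process itself is not reported.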

If rsync processes are listed with a child PID but no parent PID, shut each one down with the 'kill -9 <PID>' command, where <PID> is the child PID of the zombie rsync process. Then run the ps or lsof command above again to confirm the process has stopped and no longer appears in the results.
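The kill-and-confirm step can be combined into one helper. The following is a minimal sketch under stated assumptions: the function name kill_and_verify is hypothetical, and it uses 'kill -0' (which sends no signal and only tests whether the PID can still be signalled) in place of re-reading the ps or lsof output. It assumes the caller is the process owner or root.

```shell
# Hypothetical helper: send SIGKILL to one PID, then poll until it is gone.
# Usage: kill_and_verify <PID>
kill_and_verify() {
    pid="$1"
    # Fail early if the PID does not exist or cannot be signalled.
    kill -9 "$pid" || return 1
    tries=0
    # kill -0 succeeds for as long as the PID still exists.
    while kill -0 "$pid" 2>/dev/null; do
        tries=$((tries + 1))
        if [ "$tries" -ge 5 ]; then
            echo "PID $pid is still present after $tries checks"
            return 1
        fi
        sleep 1
    done
    echo "PID $pid is no longer running"
}
```

Pass the child PID taken from the ps or lsof output. A non-zero return means the process was not found or was still present after about five seconds, in which case the listing should be reviewed again by hand.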

In some cases the rsync processes show both a parent PID and a child PID, along with parameters tying them to the /opt/vertica/bin directory. No rsync processes should be running unless a database backup is actively running. It is appropriate to kill these processes as well, using the 'kill -9 <PID>' command, if both of the following are true:
  • The parent PID is still listed but is invalid: it no longer exists, or it now belongs to another, unrelated process.
  • There is no active database backup running.
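To judge whether a listed parent PID is still valid, it helps to see which process, if any, currently owns that PID. The following is a minimal sketch; the function name rsync_parent_owner is hypothetical, and it only reports what it finds. Deciding whether the parent is unrelated, and confirming that no backup is running, remain manual checks.

```shell
# Hypothetical helper: reads "ps -eo pid=,ppid=,args=" output on stdin and,
# for every rsync process, prints: PID, parent PID, and the command that
# currently owns the parent PID (or a note if no such process exists).
rsync_parent_owner() {
    awk '
        { ppid[$1] = $2; cmd[$1] = $3 }   # remember parent and command name per PID
        END {
            for (p in cmd) {
                if (cmd[p] !~ /rsync/) continue
                parent = ppid[p]
                print p, parent, ((parent in cmd) ? cmd[parent] : "(no such process)")
            }
        }'
}
```

Run it on each node as: ps -eo pid=,ppid=,args= | rsync_parent_owner. A parent shown as "(no such process)", or a parent command that has nothing to do with the backup, matches the first condition in the list above.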

Additional Information

In a multi-node cluster, check every member of the cluster. If one node is missed and still has zombie rsync processes left over, the backup will fail again.
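Checking every node can be scripted so that none is skipped. This is a minimal sketch, assuming SSH access from one host to all cluster nodes; the function name check_cluster_rsync and the host names in the usage line are hypothetical placeholders.

```shell
# Hypothetical helper: list rsync processes on each node given as an argument.
# Assumption: the current user can ssh to every node without a prompt.
check_cluster_rsync() {
    for node in "$@"; do
        echo "== $node =="
        # [r]sync keeps grep from matching its own command line.
        ssh "$node" "ps -eo pid,ppid,etime,args | grep '[r]sync'" \
            || echo "no rsync processes listed on $node (or the ssh connection failed)"
    done
}
```

Call it with the real node names, for example: check_cluster_rsync node1 node2 node3. Any node that still lists rsync processes while no backup is running needs the cleanup described in the Resolution section.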