GPRECOVERSEG Fails with Error: "Cannot Write: No Space Left on Device"

Article ID: 294699

Products

Services Suite

Issue/Introduction

Symptoms:

When gprecoverseg is used to recover segments, it fails with the error "Cannot write: No space left on device."

20170126:11:19:30:120429 gprecoverseg:hawqmaster:gpadmin-[INFO]:-Starting gprecoverseg with args: -i /tmp/gprecoverseg -F
(...)
20170126:11:19:50:120429 gprecoverseg:hawqmaster:gpadmin-[INFO]:-2 segment(s) to recover
20170126:11:19:50:120429 gprecoverseg:hawqmaster:gpadmin-[INFO]:-Ensuring 2 failed segment(s) are stopped
...
20170126:11:19:54:120429 gprecoverseg:hawqmaster:gpadmin-[INFO]:-Cleaning files from 2 segment(s)
.........
20170126:11:20:03:120429 gprecoverseg:hawqmaster:gpadmin-[INFO]:-Building template directory
20170126:11:20:03:120429 gprecoverseg:hawqmaster:gpadmin-[INFO]:-Creating template
20170126:11:20:04:120429 gprecoverseg:hawqmaster:gpadmin-[INFO]:-Starting copy of segment dbid 2 to location /tmp/GPSQL/gpsql_template20170126_112003
20170126:11:21:10:120429 gprecoverseg:hawqmaster:gpadmin-[CRITICAL]:-Error occurred: non-zero rc: 2
 Command was: '/bin/tar -C /tmp/GPSQL/gpsql_template20170126_112003 -xf /tmp/GPSQL/gpsql_template20170126_112003/hawq_template20170126_112004'
rc=2, stdout='', stderr='/bin/tar: ./pg_distributedlog/016F: Wrote only 7680 of 10240 bytes
/bin/tar: ./pg_distributedlog/0170: Cannot write: No space left on device
/bin/tar: ./pg_distributedlog/0171: Cannot write: No space left on device
/bin/tar: ./pg_distributedlog/0172: Cannot write: No space left on device
/bin/tar: ./pg_distributedlog/0173: Cannot write: No space left on device
/bin/tar: ./pg_distributedlog/0174: Cannot write: No space left on device
/bin/tar: ./pg_distributedlog/0175: Cannot write: No space left on device

(...)

/bin/tar: ./postgresql.conf: Cannot write: No space left on device
/bin/tar: ./postmaster.pid: Cannot write: No space left on device
/bin/tar: Exiting with failure status due to previous errors
'
Traceback (most recent call last):
 File "/usr/local/hawq/ext/python/lib/python2.6/logging/__init__.py", line 769, in emit
 stream.write(fs % msg)
IOError: [Errno 28] No space left on device

The / partition is 100% full:

[root@hawq21 ~]# df -h
Filesystem  Size  Used  Avail  Use%  Mounted on
/dev/sda5   9.8G  9.8G     0G  100%  /
tmpfs       2.9G     0   2.9G    0%  /dev/shm
/dev/sda1   477M   41M   411M    9%  /boot
/dev/sda7    55G   22G    31G   41%  /data
/dev/sda2    20G   45M    19G    1%  /home
/dev/sda3   9.8G   24M   9.2G    1%  /tmp
[root@hawq21 ~]#

Cause

  • When gprecoverseg runs on HAWQ 1.x, the master copies the entire segment directory from one of the running segments into the master's /tmp directory to create a template.
  • In the example above, the segment with DBID 2 was chosen as the template source.
  • The template is compressed and copied to /tmp/GPSQL/gpsql_template<TIMESTAMP>.
  • The template is then extracted, and pg_log and other directories are removed. Because the extracted template can be large, this step can fail with the "No space left on device" error.
  • If X segments are being recovered, the template is extracted X times into the /tmp directory, which increases the risk of running out of space.
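As a rough illustration of why recovering more segments needs more space, the arithmetic can be sketched as below. This is only an estimate under the assumption that the compressed archive and all X extracted copies exist on disk at the same time; the function name and the sample sizes are hypothetical and are not taken from HAWQ itself.

```shell
# estimate_template_kb ARCHIVE_KB EXTRACTED_KB SEGMENTS
# Prints a worst-case estimate (in KB) of the space the template needs,
# assuming (not confirmed by the HAWQ source) that the compressed archive
# and one extracted copy per recovered segment coexist on disk.
estimate_template_kb() {
    archive_kb=$1     # size of the compressed template archive
    extracted_kb=$2   # size of the template once extracted
    segments=$3       # number of segments being recovered (X)
    echo $(( archive_kb + extracted_kb * segments ))
}

# Illustrative numbers only: a 1 GB archive that extracts to 4 GB,
# with 2 segments to recover.
estimate_template_kb 1048576 4194304 2
```

If the estimate is close to or above the free space reported by df, the recovery is likely to hit the error shown in the symptoms.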

Resolution

Workaround

Make sure the free space on / is larger than the segment directory that is chosen as the copy source. If there is not enough free space, run du -sh ./* in that segment directory to see where space is being used, and move log files out of the pg_log directory on the running segment that the files are copied from.
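The space comparison can be scripted before running gprecoverseg. The following is a minimal sketch that uses only du and df; the function name has_room and the paths in the usage comment are hypothetical, and the check compares the full source directory size (including pg_log) against the free space on the target filesystem.

```shell
# has_room SOURCE_DIR TARGET_DIR
# Succeeds (exit 0) if the filesystem holding TARGET_DIR has more free
# space than SOURCE_DIR currently occupies. Sizes are in 1 KB blocks.
has_room() {
    # Total size of the source segment directory
    needed_kb=$(du -sk "$1" | awk '{print $1}')
    # Available space on the target filesystem (-P keeps one line per filesystem)
    free_kb=$(df -Pk "$2" | awk 'NR==2 {print $4}')
    echo "needed=${needed_kb}KB free=${free_kb}KB"
    [ "$free_kb" -gt "$needed_kb" ]
}

# Hypothetical paths; substitute the real segment data directory:
# has_room /data/hawq/segment1 / || echo "Not enough space for the template"
```

When several segments are recovered at once, the required space should be multiplied accordingly, since the template is extracted once per segment.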

Once gprecoverseg completes, the log files can be moved back into the pg_log directory on the source segment.

Alternatively, recover the segments in smaller groups instead of all at once, using gprecoverseg -i <file>.
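One way to build the smaller groups is to split the recovery configuration file into several files and run gprecoverseg once per file. The sketch below assumes the file lists exactly one segment per line with no header or comment lines; the function name and file names are hypothetical. It only prints the commands to run so that each group can be reviewed and started manually.

```shell
# split_recovery_file FILE GROUP_SIZE PREFIX
# Splits FILE into pieces of GROUP_SIZE lines each (named PREFIXaa,
# PREFIXab, ...) and prints one gprecoverseg command per piece.
# Assumes one segment per line in FILE.
split_recovery_file() {
    split -l "$2" "$1" "$3"
    for part in "$3"*; do
        echo "gprecoverseg -i $part"
    done
}

# Hypothetical usage: recover one segment at a time
# split_recovery_file /tmp/gprecoverseg 1 /tmp/gprecoverseg_group_
```

Running the groups one after another keeps the number of extracted template copies in /tmp low at any given time.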