Symptoms:
Running gprecoverseg in verbose mode fails with the error message "ValueError: Invalid Literal for int() with Base 10", and no specific or evident error appears in the master, primary, or mirror logs.
When the database is restarted, gpstop throws a WARNING:
[WARNING]:-Unable to clean shared memory ('NoneType' object has no attribute 'rc')
Because a postmaster.pid file still exists on the problematic segment, gprecoverseg cannot complete the recovery operation. Before the failed segment can go into recovery, all postgres processes belonging to it must be stopped and its postmaster.pid file must be removed.
1. Find the unclean shared memory segment using the KB article "How Greenplum cleans up the shared memory".
2. Locate the data directory for the failed segments:
gpstate:sdw2:gpadmin-[INFO]:-   Segment   Port    Config status   Status
gpstate:sdw2:gpadmin-[INFO]:-   sdw2      50003   Up              Unknown -- unable to load segment status
gpstate:sdw2:gpadmin-[INFO]:-   sdw2      50005   Up              Unknown -- unable to load segment status
gpstate:sdw2:gpadmin-[INFO]:-   sdw2      40001   Up              Unknown -- unable to load segment status
gpstate:sdw2:gpadmin-[INFO]:-   sdw2      40003   Up              Unknown -- unable to load segment status
3. SSH to the corresponding segment server. Check whether a postmaster.pid file still exists in the data directory of each failed segment that is marked down in gp_segment_configuration.
[gpadmin@sdw2 base]$ ssh bdtcstr21n14
[gpadmin@sdw2 data4]$ ls -ltrh gpseg1*/postmas*
-rw------- 1 gpadmin gpadmin  22 Nov 10 09:38 gpseg135/postmaster.pid
-rw------- 1 gpadmin gpadmin 157 Nov 10 09:38 gpseg135/postmaster.opts
-rw------- 1 gpadmin gpadmin  22 Nov 10 09:38 gpseg143/postmaster.pid
-rw------- 1 gpadmin gpadmin  22 Nov 10 09:38 gpseg141/postmaster.pid
-rw------- 1 gpadmin gpadmin 157 Nov 10 09:38 gpseg143/postmaster.opts
-rw------- 1 gpadmin gpadmin 157 Nov 10 09:38 gpseg141/postmaster.opts
4. Remove postmaster.pid files that still exist on failed segments.
Note: Do NOT remove the postmaster.pid file from valid segments that are up and running.
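Before removing a postmaster.pid file in step 4, it can help to confirm that the PID recorded in it no longer belongs to a running process. The following is a minimal sketch and not part of the original procedure. The function name pid_file_state is hypothetical, and the sketch assumes that the first line of postmaster.pid holds the postmaster process ID, as it does in PostgreSQL. Run it as the user that owns the segment processes, because kill -0 also fails for processes owned by another user. It only reports a state and never deletes anything.

```shell
#!/usr/bin/env bash
# Report the state of postmaster.pid in one segment data directory.
# Prints one of: missing, invalid, running, stale.
# Hypothetical helper for illustration; it never removes any file.
pid_file_state() {
    local datadir="$1"
    local pidfile="$datadir/postmaster.pid"
    local pid

    # No pid file at all: nothing to clean up in this directory.
    if [ ! -f "$pidfile" ]; then
        echo "missing"
        return 0
    fi

    # Assumption: the first line of postmaster.pid is the postmaster PID.
    pid="$(head -n 1 "$pidfile" | tr -d '[:space:]')"

    # Reject empty or non-numeric content instead of passing it to kill.
    case "$pid" in
        ''|*[!0-9]*)
            echo "invalid"
            return 0
            ;;
    esac

    # kill -0 sends no signal; it only tests whether the process exists
    # and can be signalled by the current user.
    if kill -0 "$pid" 2>/dev/null; then
        echo "running"
    else
        echo "stale"
    fi
}
```

For example, calling pid_file_state with the full path to the gpseg135 data directory prints "stale" when the recorded process is gone. A "running" result means some live process holds that PID. Because PIDs can be reused, that process may be unrelated to the segment, so inspect it with ps before deciding what to do.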