gprecoverseg - [WARNING]:-Failed to start segment. REASON: PG_CTL failed

Article ID: 296258

Products

VMware Tanzu Greenplum

Issue/Introduction

gprecoverseg fails with the following error (the same error can occur with gpstart):
20200405:03:12:25:012055 gprecoverseg:mdw:gpadmin-[INFO]:----------------------------------------------------------
20200405:03:12:25:012055 gprecoverseg:mdw:gpadmin-[INFO]:-Recovery type              = Standard
20200405:03:12:25:012055 gprecoverseg:mdw:gpadmin-[INFO]:----------------------------------------------------------
20200405:03:12:25:012055 gprecoverseg:mdw:gpadmin-[INFO]:-Recovery 1 of 1
20200405:03:12:25:012055 gprecoverseg:mdw:gpadmin-[INFO]:----------------------------------------------------------
20200405:03:12:25:012055 gprecoverseg:mdw:gpadmin-[INFO]:-   Synchronization mode                        = Incremental
20200405:03:12:25:012055 gprecoverseg:mdw:gpadmin-[INFO]:-   Failed instance host                        =  sdw15 
20200405:03:12:25:012055 gprecoverseg:mdw:gpadmin-[INFO]:-   Failed instance address                     = sdw15cc
20200405:03:12:25:012055 gprecoverseg:mdw:gpadmin-[INFO]:-   Failed instance directory                   = /data2/primary/gpseg77
20200405:03:12:25:012055 gprecoverseg:mdw:gpadmin-[INFO]:-   Failed instance port                        = 1030
20200405:03:12:25:012055 gprecoverseg:mdw:gpadmin-[INFO]:-   Failed instance replication port            = 1094
20200405:03:12:25:012055 gprecoverseg:mdw:gpadmin-[INFO]:-   Recovery Source instance host               = sdw18
20200405:03:12:25:012055 gprecoverseg:mdw:gpadmin-[INFO]:-   Recovery Source instance address            = sdw18cc
20200405:03:12:25:012055 gprecoverseg:mdw:gpadmin-[INFO]:-   Recovery Source instance directory          = /data1/mirror/gpseg77
20200405:03:12:25:012055 gprecoverseg:mdw:gpadmin-[INFO]:-   Recovery Source instance port               = 1153
20200405:03:12:25:012055 gprecoverseg:mdw:gpadmin-[INFO]:-   Recovery Source instance replication port   = 1217
20200405:03:12:25:012055 gprecoverseg:mdw:gpadmin-[INFO]:-   Recovery Target                             = in-place
20200405:03:12:25:012055 gprecoverseg:mdw:gpadmin-[INFO]:----------------------------------------------------------
20200405:03:12:25:012055 gprecoverseg:mdw:gpadmin-[INFO]:-1 segment(s) to recover
20200405:03:12:25:012055 gprecoverseg:mdw:gpadmin-[INFO]:-Ensuring 1 failed segment(s) are stopped
20200405:03:12:25:012055 gprecoverseg:mdw:gpadmin-[INFO]:-Ensuring that shared memory is cleaned up for stopped segments
20200405:03:12:31:012055 gprecoverseg:mdw:gpadmin-[INFO]:-Updating configuration with new mirrors
20200405:03:12:31:012055 gprecoverseg:mdw:gpadmin-[INFO]:-Updating mirrors
20200405:03:12:37:012055 gprecoverseg:mdw:gpadmin-[INFO]:-Starting mirrors
20200405:03:12:37:012055 gprecoverseg:mdw:gpadmin-[INFO]:-era is 05f98c23bc49c962_200405025756
20200405:03:12:38:012055 gprecoverseg:mdw:gpadmin-[INFO]:-Commencing parallel primary and mirror segment instance startup, please wait...
20200405:03:22:44:012055 gprecoverseg:mdw:gpadmin-[INFO]:-Process results...
20200405:03:22:44:012055 gprecoverseg:mdw:gpadmin-[WARNING]:-Failed to start segment.  The fault prober will shortly mark it as down. Segment: sdw15:/data2/primary/gpseg77:content=77:dbid=128:mode=r:status=d: REASON: PG_CTL failed.
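Once the fault prober marks the instance down, the cluster state can be confirmed from the master host (a quick check; gpstate -e lists segments with status issues):
[gpadmin@mdw ~]$ gpstate -e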


Environment

Product Version: 4.3

Resolution

To find out why the segment did not start, check the startup log in the failed instance's data directory. In this case that is /data2/primary/gpseg77 on host sdw15.
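The log can be viewed directly on the segment host, for example (a sketch; this assumes the default startup.log location under the data directory's pg_log subdirectory):
[gpadmin@sdw15 ~]$ tail -n 20 /data2/primary/gpseg77/pg_log/startup.log
In this case the log contains: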
2020-04-05 03:39:02.807658 EDT,,,p17621,th702179104,,,,0,,,seg-1,,,,,"LOG","00000","removing all temporary files",,,,,,,,"RemovePgTempFiles","fd.c",1873,
2020-04-05 03:39:02.809065 EDT,,,p17621,th702179104,,,,0,,,seg-1,,,,,"LOG","XX000","could not bind IPv4 socket: Address already in use (pqcomm.c:456)",,"Is another postmaster already running on port 1030? If not, wait a few seconds and retry.",,,,,,"StreamServerPort","pqcomm.c",456,
2020-04-05 03:39:02.809984 EDT,,,p17621,th702179104,,,,0,,,seg-1,,,,,"LOG","XX000","could not create IPv6 socket: Address family not supported by protocol (pqcomm.c:390)",,,,,,,,"StreamServerPort","pqcomm.c",390,
2020-04-05 03:39:02.810163 EDT,,,p17621,th702179104,,,,0,,,seg-1,,,,,"WARNING","01000","could not create listen socket for ""*""",,,,,,,,"PostmasterMain","postmaster.c",1361,
2020-04-05 03:39:02.810291 EDT,,,p17621,th702179104,,,,0,,,seg-1,,,,,"FATAL","XX000","could not create any TCP/IP sockets (postmaster.c:1366)",,,,,,,,"PostmasterMain","postmaster.c",1366,
1    0xb0d20e postgres errstart (elog.c:502)
2    0x8fed37 postgres PostmasterMain (postmaster.c:1365)
3    0x7feb6f postgres main (main.c:206)
4    0x3186c1ed20 libc.so.6 __libc_start_main + 0x100
5    0x4bee79 postgres <symbol not found> + 0x4bee79
There can be many reasons why a segment fails to start. One of them is shown in the startup.log above: "could not bind IPv4 socket: Address already in use".

The log also gives a hint: "Is another postmaster already running on port 1030?"

We can check whether port 1030 is in use:
[gpadmin@sdw15 pg_log]$  netstat -nalp | grep 1030
(Not all processes could be identified, non-owned process info
 will not be shown, you would have to be root to see it all.)
tcp        0      0 127.0.0.1:1030              0.0.0.0:*                   LISTEN      - 
As shown above, the port is indeed in use by another process, but because netstat was run as gpadmin rather than root, the process cannot be identified. Running the same command as root:
[root@sdw15 ~]# netstat -nalp | grep 1030
tcp        0      0 127.0.0.1:1030              0.0.0.0:*                   LISTEN      10741/vnetd
Now we can see the port is held by process 10741 (vnetd). To fix the problem, this process should be terminated so that port 1030 is released for the segment. As a permanent solution, the OS administrator should be asked to reconfigure any service that uses ports in the Greenplum port range, to avoid this situation in the future.
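A sketch of these steps, using the PID and hosts from this example (stopping a third-party daemon such as vnetd should be coordinated with whoever owns it):
[root@sdw15 ~]# kill 10741
[root@sdw15 ~]# netstat -nalp | grep 1030
[gpadmin@mdw ~]$ gprecoverseg -a
The second command should return nothing once the port is free, after which gprecoverseg -a re-runs the recovery from the master without prompting. As one possible permanent measure (the range below is an example only; substitute the port range actually assigned to the cluster), the Greenplum ports can be reserved from the kernel's ephemeral port pool on every segment host, so that other services are never assigned those ports dynamically:
[root@sdw15 ~]# sysctl -w net.ipv4.ip_local_reserved_ports=1025-1300
To survive reboots, the same setting should also be added to /etc/sysctl.conf.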