gprecoverseg fails to start segment instances for recovery, and the segment log file shows an "Address already in use" error:
2014-03-25 08:04:42.549867 IST,"gpadmin","template1",p29722,th2145324448,"172.28.12.250","63543",2014-03-25 08:04:42 IST,94211700,con157232,cmd2,seg95,,dx429781,x94211700,sx1,"LOG","00000","duration: 0.863 ms",,,,,,"SET client_min_messages TO 'ERROR'",0,,"postgres.c",1806,
2014-03-25 08:05:01.316974 IST,,,p28737,th2145324448,,,,0,,,seg-1,,,,,"LOG","00000","filerep main process (PID 5681) exited with exit code 0",,,,,,,0,,"postmaster.c",5810,
2014-03-25 08:05:01.327313 IST,,,p29755,th2145324448,,,,0,,,seg-1,,,,,"LOG","00000","mirror transition, primary address(port) 'sdw16-2(41005)' mirror address(port) 'sdw15-1(51002)'",,,,,"mirroring role 'primary role' mirroring state 'resync' segment state 'not initialized' process name(pid) 'filerep main process(29755)' filerep state 'not initialized' ",,0,,"cdbfilerep.c",3440,
2014-03-25 08:05:01.329499 IST,,,p29756,th2145324448,,,,0,,,seg-1,,,,,"LOG","XX000","could not bind IPv4 socket: Address already in use (pqcomm.c:447)",,"Is another postmaster already running on port 41005? If not, wait a few seconds and retry.",,,,,0,,"pqcomm.c",447,
2014-03-25 08:05:01.329546 IST,,,p29756,th2145324448,,,,0,,,seg-1,,,,,"WARNING","XX000","could not start listener, host:'sdw16-2' port:'41005': Address already in use (cdbfilerepconnserver.c:65)",,,,,"mirroring role 'primary role' mirroring state 'resync' segment state 'transition to resync' process name(pid) 'primary receiver ack process(29756)' filerep state 'initialization and recovery' ",,0,,"cdbfilerepconnserver.c",65,
2014-03-25 08:05:01.329635 IST,,,p28737,th2145324448,,,,0,,,seg-1,,,,,"WARNING","01000","PostmasterPrimaryMirrorTransition (4) Finished with Error",,,,,,,0,,"primary_mirror_mode.c",1698,
2014-03-25 08:05:01.331743 IST,,,p29754,th2145324448,"127.0.0.1","10575",2014-03-25 08:05:01 IST,0,,,seg-1,,,,,"WARNING","01000","PrimaryMirrorTransitionRequest (4) Result: Transition to primary/mirror mode PrimarySegment, data state InResync resulted in Error",,,,,,,0,,"primary_mirror_mode.c",1306,
Some Greenplum segment instance processes use fixed port numbers (listed in "gp_segment_configuration.port" and "gp_segment_configuration.replication_port"). Other processes (database or non-database) do not use specific ports but rely on the kernel to select port numbers for them automatically.
If the kernel hands one of the fixed ports to another process through such an automatic assignment while a segment instance is down, the segment instance will fail with the above error when it later attempts to start, because it cannot listen on its fixed ports.
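The mechanism can be reproduced outside the database. The following is a minimal sketch, not Greenplum code: the function name demo_port_conflict is hypothetical, and the loopback address is used only to keep the demonstration local. One socket lets the kernel choose its port, and a second socket then fails when it insists on that same fixed port.

```python
import errno
import socket


def demo_port_conflict():
    """Return the errno raised when binding to a port the kernel already handed out.

    Hypothetical helper for illustration only.
    """
    # Port 0 asks the kernel to choose a free port automatically.
    first = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    first.bind(("127.0.0.1", 0))
    first.listen(1)
    taken_port = first.getsockname()[1]

    # A second socket that requires this specific port cannot bind to it.
    second = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        second.bind(("127.0.0.1", taken_port))
        return None  # no conflict occurred
    except OSError as exc:
        return exc.errno
    finally:
        second.close()
        first.close()


if __name__ == "__main__":
    code = demo_port_conflict()
    # Print the symbolic errno name if one is known.
    print(errno.errorcode.get(code, code))
```

The second bind() plays the role of the segment instance: it needs one particular port and has no fallback if the kernel has already given that port away.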
1. Identify which processes own the ports that the specific segment instance needs.
To identify the port numbers which the specific segment instance needs to use, connect to the database and check the "gp_segment_configuration" table entry for the respective segment instance:
select port, replication_port from gp_segment_configuration where dbid = <instance_dbid>;
Once the port numbers are known, use the following command to identify which processes own these ports on the specific segment server:
netstat -anp | grep <port_number>
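Where netstat is not available, the kernel's socket table can be read directly. The sketch below is an assumption-laden alternative, not part of the original procedure: the helper name sockets_on_port is hypothetical, it covers IPv4 TCP sockets only (/proc/net/tcp), and it reports socket states rather than owning processes, so netstat -anp or a similar tool is still needed to get the PID.

```python
# Subset of the hexadecimal state codes used in /proc/net/tcp.
TCP_STATES = {"01": "ESTABLISHED", "06": "TIME_WAIT", "0A": "LISTEN"}


def sockets_on_port(proc_net_tcp_text, port):
    """Return (local_address_hex, state) pairs for sockets whose local port matches.

    Hypothetical helper; expects the text content of /proc/net/tcp.
    """
    matches = []
    for line in proc_net_tcp_text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 4:
            continue
        local_addr, state = fields[1], fields[3]
        # The local address is HEXIP:HEXPORT; the port is base 16.
        local_port = int(local_addr.rsplit(":", 1)[1], 16)
        if local_port == port:
            matches.append((local_addr, TCP_STATES.get(state, state)))
    return matches


if __name__ == "__main__":
    # 41005 is the primary port from the log excerpt above.
    with open("/proc/net/tcp") as fh:
        print(sockets_on_port(fh.read(), 41005))
```

A match in a state other than LISTEN usually indicates an outgoing connection that was given the port automatically, which is the situation described above.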
2. Make sure that the OS configuration is appropriate:
The Linux kernel has two relevant parameters that control the automatic allocation of network ports:
net.ipv4.ip_local_port_range - Range from which network ports are allocated for automatic port assignments.
net.ipv4.ip_local_reserved_ports - Ports and port ranges that are never allocated for automatic port assignments.
To avoid the "Address already in use" error, the fixed port numbers used by Greenplum Database processes need to be excluded from the kernel pool of automatically assigned ports. This can be done either by listing them in "net.ipv4.ip_local_reserved_ports" or by configuring the database cluster to use only ports outside the range specified in "net.ipv4.ip_local_port_range". Please refer to the OS documentation on how to change the kernel parameter values.
After the kernel parameter values are configured properly, the database needs to be restarted.
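The check described above can be sketched as follows. This is an illustrative helper, not a Greenplum utility: the function names are hypothetical, the /proc path is the standard Linux location of the parameter, and the set of reserved ports is assumed to have been parsed already from net.ipv4.ip_local_reserved_ports.

```python
def ports_at_risk(fixed_ports, local_port_range, reserved_ports):
    """Return the fixed ports that the kernel could still assign automatically.

    Hypothetical helper: a port is at risk when it lies inside
    ip_local_port_range and is not listed in ip_local_reserved_ports.
    """
    low, high = local_port_range
    return sorted(
        p for p in fixed_ports
        if low <= p <= high and p not in reserved_ports
    )


def read_local_port_range(path="/proc/sys/net/ipv4/ip_local_port_range"):
    """Read the (low, high) automatic port range from the running kernel."""
    with open(path) as fh:
        low, high = fh.read().split()
    return int(low), int(high)


if __name__ == "__main__":
    # Example values only: ports taken from the log excerpt, nothing reserved.
    print(ports_at_risk([41005, 51002], read_local_port_range(), set()))
```

An empty result means none of the given ports can be taken by an automatic assignment under the current kernel settings.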
Notes regarding syntax for net.ipv4.ip_local_reserved_ports (from the Linux kernel documentation):
ip_local_reserved_ports - list of comma separated ranges
Specify the ports which are reserved for known third-party applications. These ports will not be used by automatic port assignments (e.g. when calling connect() or bind() with port number 0). Explicit port allocation behavior is unchanged.
The format used for both input and output is a comma separated list of ranges (e.g. "1,2-4,10-10" for ports 1, 2, 3, 4 and 10). Writing to the file will clear all previously reserved ports and update the current list with the one given in the input.
Note that ip_local_port_range and ip_local_reserved_ports settings are independent and both are considered by the kernel when determining which ports are available for automatic port assignments.
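Because writing the parameter replaces the whole list, it helps to merge new ports into the existing value rather than overwrite it. The sketch below shows one way to read and produce the range syntax quoted above; both function names are hypothetical, and the code only manipulates strings, it does not write to the kernel.

```python
def parse_reserved_ports(text):
    """Expand a value such as "1,2-4,10-10" into a set of port numbers."""
    ports = set()
    for item in text.strip().split(","):
        item = item.strip()
        if not item:
            continue  # empty value means nothing is reserved
        start, _, end = item.partition("-")
        first = int(start)
        last = int(end) if end else first
        ports.update(range(first, last + 1))
    return ports


def format_reserved_ports(ports):
    """Collapse a set of ports into a comma separated list of ranges."""
    ranges = []
    for port in sorted(ports):
        if ranges and port == ranges[-1][1] + 1:
            ranges[-1][1] = port  # extend the current consecutive run
        else:
            ranges.append([port, port])
    return ",".join(str(a) if a == b else f"{a}-{b}" for a, b in ranges)


if __name__ == "__main__":
    # Merge the ports from the log excerpt into an example existing value.
    current = parse_reserved_ports("1,2-4,10-10")
    print(format_reserved_ports(current | {41005, 51002}))
```

The resulting string is what would be assigned to net.ipv4.ip_local_reserved_ports using the method described in the OS documentation.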
3. Depending on what processes hold the needed ports:
- If these processes are non-database processes, they need to be stopped and the "gprecoverseg" command should be re-executed.
- If the processes are database processes, the database needs to be restarted.