When recovering a segment after a failure, "gprecoverseg" was run to synchronise the acting primary with the acting mirror.
Then, to rebalance, "gprecoverseg -r" was run to fail the segment(s) back to their normal primaries.
The "gprecoverseg -r" command reported:
DETAIL: could not connect to server: Connection refused
Is the server running on host "10.10.10.20" and accepting
TCP/IP connections on port 40004?
(seg12 10.10.10.20:40004)
The segment log of the segment referenced in the above error message shows:
2024-05-06 04:06:47.262731 EDT,,,p19701,th-107673472,,,,0,,,seg12,,,,,"ERROR","XX000","could not send end-of-streaming message to primary: no COPY in progress
",,,,,,,0,,"libpqwalreceiver.c",267,"Stack trace:
1 0xbffbdc postgres errstart (elog.c:557)
2 0xa54cb0 postgres <symbol not found> (libpqwalreceiver.c:265)
3 0xa48a90 postgres WalReceiverMain (walreceiver.c:541)
4 0x78c07a postgres AuxiliaryProcessMain (bootstrap.c:438)
5 0xa1a6bc postgres <symbol not found> (postmaster.c:5762)
6 0xa1c70f postgres <symbol not found> (postmaster.c:2152)
7 0x7f4af6f5d630 libpthread.so.0 <symbol not found> + 0xf6f5d630
8 0x7f4af63d6b23 libc.so.6 __select + 0x13
9 0x6b253a postgres <symbol not found> (postmaster.c:1891)
10 0xa1de96 postgres PostmasterMain (postmaster.c:1520)
11 0x6b7011 postgres main (main.c:205)
12 0x7f4af6303555 libc.so.6 __libc_start_main + 0xf5
13 0x6c2ebc postgres <symbol not found> + 0x6c2ebc
"
2024-05-06 04:06:47.269643 EDT,,,p37443,th-107673472,,,,0,,,seg12,,,,,"LOG","00000","record with zero length at 121/83B63BE8",,,,,,,0,,"xlog.c",4376,
2024-05-06 04:06:47.462553 EDT,,,p1134,th-107673472,,,,0,,,seg12,,,,,"ERROR","XX000","could not connect to the primary server: could not connect to server: Connection refused
Is the server running on host ""sdw3"" (10.10.10.21) and accepting
TCP/IP connections on port 50004?
",,,,,,,0,,"libpqwalreceiver.c",154,"Stack trace:
1 0xbffbdc postgres errstart (elog.c:557)
2 0xa54410 postgres <symbol not found> (libpqwalreceiver.c:112)
The "startup.log" file in the segment's log directory shows:
2024-05-06 02:42:55.761845 EDT,,,p37373,th-107673472,,,,0,,,seg12,,,,,"LOG","XX000","could not bind IPv4 socket: Address already in use",,"Is another postmaster already running on port 40004? If not, wait a few seconds and retry.",,,,,,"StreamServerPort","pqcomm.c",506,
The segment ports fall inside the range that the operating system draws ephemeral ports from, so the system can assign a segment's port to an unrelated process.
If another process is holding the port when the segment starts or is recovered, the segment cannot bind to the port, and it will not try again until it is restarted.
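The following is a minimal sketch for checking whether a segment port is exposed to this problem on a Linux host. The helper name in_ephemeral_range is hypothetical (it is not a Greenplum utility); when no range is passed, it reads the kernel's current range from /proc/sys/net/ipv4/ip_local_port_range.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds (exit 0) if PORT lies inside the inclusive
# range LOW..HIGH. When LOW and HIGH are omitted, the current ephemeral
# range is read from the kernel.
in_ephemeral_range() {
    local port=$1 low=${2:-} high=${3:-}
    if [ -z "$low" ] || [ -z "$high" ]; then
        read -r low high < /proc/sys/net/ipv4/ip_local_port_range
    fi
    [ "$port" -ge "$low" ] && [ "$port" -le "$high" ]
}

# Example usage (40004 is the segment port from the error above):
#   in_ephemeral_range 40004 && echo "port 40004 can be handed out as an ephemeral port"
# To see which process currently holds the port, run as root:
#   ss -tanp 'sport = :40004'
```

If the function succeeds for a segment port, that port is not protected and should be reserved as described below.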
Restart the database with:
gpstop -af
gpstart -a
Reserve the ports used by the segments by setting "net.ipv4.ip_local_reserved_ports" in the /etc/sysctl.conf file on all the hosts in the cluster.
Get the ports used by the segments, both primary and mirrors:
gpadmin=# select distinct port from gp_segment_configuration order by 1;
port
-------
5432 <--- This is the coordinator port
30080
30081
30082
30083
35080
35081
35082
35083
(9 rows)
In the above example, the segments are using ports 30080-30083 and ports 35080-35083.
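On clusters with many segments, the port list can be collapsed into ranges mechanically. The sketch below is one way to do it; collapse_ports is a hypothetical helper name, and it assumes the ports arrive on standard input separated by spaces, commas, or newlines.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads port numbers on stdin and prints them as a
# comma-separated list in which consecutive ports are merged into ranges,
# the format expected by net.ipv4.ip_local_reserved_ports.
collapse_ports() {
    tr -s ' ,\n' '\n' | grep -E '^[0-9]+$' | sort -n -u | awk '
        function fmt(a, b) { return (a == b) ? a : a "-" b }
        NR == 1        { start = prev = $1; next }
        $1 == prev + 1 { prev = $1; next }          # still inside a run
                       { out = out sep fmt(start, prev); sep = ","
                         start = prev = $1 }        # run ended, start a new one
        END            { if (NR) print out sep fmt(start, prev) }
    '
}

# Example usage, assuming psql can connect to the gpadmin database and that
# the coordinator row (content = -1) should be left out:
#   psql -At -d gpadmin -c "select distinct port from gp_segment_configuration where content >= 0" | collapse_ports
```

For the ports listed above, this prints 30080-30083,35080-35083.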
Add the following lines to the /etc/sysctl.conf file on all hosts in the cluster:
# Allow ephemeral ports between 10000 and 65535
net.ipv4.ip_local_port_range = 10000 65535
# Keep ephemeral ports off the segment ports. Choose values appropriate for each cluster.
net.ipv4.ip_local_reserved_ports = 30080-30083,35080-35083
Placing the values in the /etc/sysctl.conf file will ensure that the values are set on the next reboot of the host.
To set them immediately without a reboot, run the following as root on each host:
sysctl net.ipv4.ip_local_port_range='10000 65535'
sysctl net.ipv4.ip_local_reserved_ports=30080-30083,35080-35083 # Choose appropriate values for each cluster
NOTE: Values set with the above commands last only until the next host reboot, at which point the settings in the /etc/sysctl.conf file take effect.
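After applying the settings, it can be worth confirming on each host that every segment port is covered. The sketch below assumes a Linux host; port_is_reserved is a hypothetical helper name, and when no list is passed it reads the live value from /proc/sys/net/ipv4/ip_local_reserved_ports.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds (exit 0) if PORT is covered by LIST, a
# comma-separated list of ports and ranges such as "5432,30080-30083".
# When LIST is omitted, the kernel's current reserved list is used.
port_is_reserved() {
    local port=$1 list=${2-}
    if [ "$#" -lt 2 ]; then
        list=$(cat /proc/sys/net/ipv4/ip_local_reserved_ports)
    fi
    local IFS=, entry low high
    for entry in $list; do
        low=${entry%-*}     # text before the hyphen (whole entry if none)
        high=${entry#*-}    # text after the hyphen (whole entry if none)
        if [ "$port" -ge "$low" ] && [ "$port" -le "$high" ]; then
            return 0
        fi
    done
    return 1
}

# Example usage for the ports in the example above:
#   for p in 30080 30081 30082 30083 35080 35081 35082 35083; do
#       port_is_reserved "$p" || echo "port $p is NOT reserved"
#   done
```

No output from the loop means every listed port is reserved on that host.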