Sometimes, due to multiple node failovers, PostgreSQL may reach a state where a replica fails to recover on its own and manual intervention is required.
If this occurs, follow the steps below.
Step 1: Check the Cluster Status
Before taking any action, it’s important to check the current state of the PostgreSQL cluster. You can do this using the patroni-status private webhook.
Parameters Required:
namespace (string): The namespace where your PostgreSQL cluster is running.
Example
Generate a security token
export K8S_TOKEN=$(kubectl get secrets hooks-server-synthetic-checker-krp -n vmsp-platform -ojsonpath={.data.token} | base64 -d)
Trigger the patroni-status webhook
curl -k -XPOST -H "Authorization: Bearer $K8S_TOKEN" -d '{"namespace": "testnd1"}' https://<node_ip>:30005/webhooks/vmsp-platform/database/patroni-status
Get the status of the webhook
curl -k -XGET -H "Authorization: Bearer $K8S_TOKEN" https://<node_ip>:30005/webhooks/vmsp-platform/database/patroni-status | jq -r '.output'
Response
[ { ""Cluster"": ""pdtest1"", ""Member"": ""pdtest1-0"", ""Host"": "" 198.#.#.# "", ""Role"": ""Leader"", ""State"": ""running"", ""TL"": 2 }, { ""Cluster"": ""pdtest1"", ""Member"": ""pdtest1-1"", ""Host"": "" 198.#.#.# "", ""Role"": ""Sync Standby"", ""State"": ""streaming"", ""TL"": 2, ""Lag in MB"": 0 }, { ""Cluster"": ""pdtest1"", ""Member"": ""pdtest1-2"", ""Host"": "" 198.#.#.# "", ""Role"": ""Sync Standby"", ""State"": ""streaming"", ""TL"": 2, ""Lag in MB"": 0 }]
Expected Output for a Healthy Cluster:
One member in the Leader role.
Remaining members in Sync Standby/Replica roles.
All replicas are in streaming state.
All members are on the same timeline.
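These checks can also be scripted. The snippet below is a minimal sketch, assuming the webhook's .output field contains the JSON array shown in the example response and that K8S_TOKEN is already exported; STATUS, LEADERS, NOT_STREAMING and TIMELINES are illustrative variable names.
# Sketch: fetch the patroni-status output and verify the health conditions listed above
STATUS=$(curl -sk -XGET -H "Authorization: Bearer $K8S_TOKEN" \
  https://<node_ip>:30005/webhooks/vmsp-platform/database/patroni-status | jq -r '.output')
LEADERS=$(echo "$STATUS" | jq '[.[] | select(.Role == "Leader")] | length')                                   # exactly one Leader expected
NOT_STREAMING=$(echo "$STATUS" | jq '[.[] | select(.Role != "Leader" and .State != "streaming")] | length')   # replicas not streaming
TIMELINES=$(echo "$STATUS" | jq '[.[].TL] | unique | length')                                                 # all members on the same timeline
if [ "$LEADERS" -eq 1 ] && [ "$NOT_STREAMING" -eq 0 ] && [ "$TIMELINES" -eq 1 ]; then
  echo "Cluster looks healthy"
else
  echo "Cluster needs attention: leaders=$LEADERS non-streaming=$NOT_STREAMING timelines=$TIMELINES"
fi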
Step 2: Identify Unhealthy Replica
If one or more replicas are stuck in problematic states, like:
starting
start failed
restart failed
and remain in this state for an extended period, it usually indicates that the replica is unable to recover automatically.
Example of a degraded cluster
+ Cluster: pdtest1 (7481767616172089438) ---------+--------------+----+-----------+
| Member    | Host      | Role         | State        | TL | Lag in MB |
+-----------+-----------+--------------+--------------+----+-----------+
| pdtest1-0 | 198.#.#.# | Leader       | running      | 5  | 0         |
| pdtest1-1 | 198.#.#.# | Replica      | start failed |    | unknown   |
| pdtest1-2 | 198.#.#.# | Sync Standby | streaming    | 5  | 0         |
+-----------+-----------+--------------+--------------+----+-----------+
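For convenience, the stuck members can be extracted from the same patroni-status output. This is a sketch only; it assumes STATUS holds the JSON array captured in the Step 1 sketch.
# Sketch: list members stuck in states that normally require a manual reinit
echo "$STATUS" | jq -r '.[] | select(.State == "starting" or .State == "start failed" or .State == "restart failed") | .Member'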
Step 3: Reinitialize the Replica
In such cases, you can use the replica-reinit webhook to manually reinitialize the problematic replica.
Note:
Ensure the replica is not in streaming state and is not on the current timeline before triggering reinit.
This should be treated as a last resort when all automatic recovery options have failed.
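The two preconditions in the note above can be checked from the same output before triggering the webhook. The snippet below is a sketch, assuming STATUS holds the JSON array from the Step 1 sketch and REPLICA is the candidate pod name.
# Sketch: confirm the candidate replica is neither streaming nor on the leader's timeline
REPLICA=pdtest1-1
echo "$STATUS" | jq -r --arg m "$REPLICA" '
  (map(select(.Role == "Leader")) | first | .TL) as $leader_tl
  | .[] | select(.Member == $m)
  | if .State != "streaming" and .TL != $leader_tl
    then "\(.Member) meets the reinit preconditions (state=\(.State))"
    else "\(.Member) is still streaming or on the current timeline - do not reinit"
    end'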
Parameters Required:
name (string): Name of your PostgreSQL cluster.
namespace (string): The namespace where your PostgreSQL cluster is running.
replica (string): The exact name of the replica pod that needs to be reinitialized.
Example:
Generate a security token:
export K8S_TOKEN=$(kubectl get secrets hooks-server-synthetic-checker-krp -n vmsp-platform -ojsonpath={.data.token} | base64 -d)
Trigger the reinit-replica webhook
curl -k -XPOST -H "Authorization: Bearer $K8S_TOKEN" -d '{"name":"pdtest1", "namespace": "testnd1", "replica":"pdtest1-1"}' https://<node_ip>:30005/webhooks/vmsp-platform/database/reinit-replica
Get the status of the webhook
curl -k -XGET -H "Authorization: Bearer $K8S_TOKEN" https://<node_ip>:30005/webhooks/vmsp-platform/database/reinit-replica | jq -r '.output'
Response
Checking if all PostgreSQL Pods are running...
All PostgreSQL Pods are running.
Verifying replica status using patronictl list...
Defaulted container ""postgres"" out of: postgres, metrics-exporter[{""Cluster"": ""pdtest1"", ""Member"": ""pdtest1-0"", ""Host"": ""98.#.#.# "", ""Role"": ""Leader"", ""State"": ""running"", ""TL"": 2}, {""Cluster"": ""pdtest1"", ""Member"": ""pdtest1-1"", ""Host"": "" 198.#.#.# "", ""Role"": ""Replica"", ""State"": ""starting"", ""Lag in MB"": ""unknown""}, {""Cluster"": ""pdtest1"", ""Member"": ""pdtest1-2"", ""Host"": "" 198.#.#.# "", ""Role"": ""Sync Standby"", ""State"": ""streaming"", ""TL"": 2, ""Lag in MB"": 0}]
Leader is running: pdtest1-0
Confirming SeaweedFS Pods are running and accessible...
All SeaweedFS Pods are running.
Reinitializing the replica 'pdtest1-1' using patronictl reinit...
Defaulted container ""postgres"" out of: postgres, metrics-exporter+ Cluster: pdtest1 (7481767616172089438) +--------------+----+-----------+| Member | Host | Role | State | TL | Lag in MB |+-----------+-------------+--------------+--------------+----+-----------+| pdtest1-0 | 198.#.#.# | Leader | running | 5 | 0 || pdtest1-1 | 198.#.#.# | Replica | start failed | | unknown || pdtest1-2 | 198.#.#.# | Sync Standby | streaming | 5 | 0 |+-----------+-------------+--------------+--------------+----+-----------+Success: reinitialize for member pdtest1-1
Final Step: Monitor the Cluster Again
After reinit is done, wait for the replica to resync and join the cluster. Use the patroni-status webhook again to confirm that:
The replica is now in the Replica/Sync Standby role.
It is streaming from the leader.
It has joined the current timeline.
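If you prefer to poll rather than re-run the check manually, a loop like the one below can be used. It is a sketch only: the 30-second interval and 60 attempts are arbitrary, and REPLICA is assumed to be the pod that was reinitialized.
# Sketch: poll the patroni-status webhook until the replica is streaming again
REPLICA=pdtest1-1
for i in $(seq 1 60); do
  STATE=$(curl -sk -XGET -H "Authorization: Bearer $K8S_TOKEN" \
    https://<node_ip>:30005/webhooks/vmsp-platform/database/patroni-status \
    | jq -r '.output' | jq -r --arg m "$REPLICA" '.[] | select(.Member == $m) | .State')
  echo "attempt $i: $REPLICA state=$STATE"
  [ "$STATE" = "streaming" ] && break
  sleep 30
done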
Alternative: manually execute reinit commands for the replica nodes (applicable to versions such as vCF 9.0, where webhooks were not yet shipped).
1. Find the postgres leader pod
$ kubectl get pod -n prelude -lapplication=spilo,spilo-role=master
NAME             READY   STATUS    RESTARTS   AGE
vcfapostgres-1   3/3     Running   0          53m
The patronictl commands will be executed against the leader pod.
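To avoid copying the pod name by hand, the leader can be captured into a shell variable. This is a sketch using the same label selector as above; LEADER_POD is an illustrative variable name.
# Sketch: store the current leader pod name for the commands that follow
LEADER_POD=$(kubectl get pod -n prelude -lapplication=spilo,spilo-role=master -o jsonpath='{.items[0].metadata.name}')
echo "Leader pod: $LEADER_POD"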
2. Verify that the postgres leader is Running. If it is not, it is not safe to perform the remediation.
$ kubectl exec vcfapostgres-1 -n prelude -- patronictl list
Defaulted container "postgres" out of: postgres, metrics-exporter, walg
+ Cluster: vcfapostgres (7545852715054780503) -+-----------+----+-----------+
| Member         | Host      | Role         | State     | TL | Lag in MB |
+----------------+-----------+--------------+-----------+----+-----------+
| vcfapostgres-0 | 198.#.#.# | Replica      | starting  |    | unknown   |
| vcfapostgres-1 | 198.#.#.# | Leader       | running   | 2  |           |
+----------------+-----------+--------------+-----------+----+-----------+
Here vcfapostgres-1 is the Leader and its State is running, so it is safe to continue.
NOTE: If both Replicas are listed, only Replicas NOT in the streaming state need to be re-initialized. If only one Replica is listed, as in this example, re-initializing it will allow the second Replica to start once the first Replica starts.
3. Re-initialize the stuck Replica.
The form of the patronictl command is: patronictl reinit vcfapostgres <replica name> --force
DO NOT SPECIFY THE LEADER NAME (vcfapostgres-1 in this example) OR THE POSTGRES CLUSTER WILL LIKELY BE BROKEN
$ kubectl exec vcfapostgres-1 -n prelude -- patronictl reinit vcfapostgres vcfapostgres-0 --force
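Because reinitializing the leader can break the cluster, it may be worth wrapping the command in a small guard. The following is a sketch only; it assumes your Patroni version supports patronictl list --format json (verify before relying on it), and TARGET is the replica you intend to reinitialize.
# Sketch: refuse to run reinit if the target member is currently the Leader
TARGET=vcfapostgres-0
ROLE=$(kubectl exec vcfapostgres-1 -n prelude -- patronictl list --format json \
  | jq -r --arg m "$TARGET" '.[] | select(.Member == $m) | .Role')
if [ "$ROLE" = "Leader" ]; then
  echo "Refusing to reinit the leader ($TARGET)"
else
  kubectl exec vcfapostgres-1 -n prelude -- patronictl reinit vcfapostgres "$TARGET" --force
fi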
4. Periodically execute Step 2 until the Replica is in the running state. At this point you will either reinit the other Replica or simply wait for it to start if it was not listed.
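A simple way to do this is to watch patronictl list from the leader pod; the 30-second interval below is only an example.
# Sketch: re-check the cluster every 30 seconds until the replica reports the running/streaming state
watch -n 30 "kubectl exec vcfapostgres-1 -n prelude -- patronictl list"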