After upgrading the Postgres Operator from version 3.0.0 to 4.2.4, Postgres instances fail to become fully healthy.
The Postgres instance remains stuck at 3/4 readiness, with the pg-container failing.
The pod logs show the following error:
CRITICAL: system ID mismatch, node <instance-name> belongs to a different cluster:
<system_id_1> != <system_id_2>
Each PostgreSQL cluster has a unique internal system identifier, stored in the data directory (PGDATA).
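To see the system identifier currently recorded in the data directory, pg_controldata can be run inside the database container. This is a hedged sketch: the pod name, namespace, and PGDATA path are placeholders, and the container name pg-container is taken from the symptom above; adjust all of them to your deployment.

```shell
# Print the system identifier stored in PGDATA (placeholders must be
# replaced with real values for your cluster).
kubectl exec -n <namespace> <postgres-pod-name> -c pg-container -- \
  pg_controldata <pgdata-path> | grep "Database system identifier"
```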
In the Operator 3.x architecture, Patroni stores cluster metadata and state information, including leader information and cluster identity, in Kubernetes ConfigMaps.
During the operator upgrade from 3.0.0 to 4.2.4:
The PostgreSQL data directory remains intact.
However, stale Patroni cluster metadata stored in Kubernetes ConfigMaps may persist.
If the metadata in these ConfigMaps contains a system identifier that does not match the one stored in the PostgreSQL data directory, Patroni detects a mismatch.
When this occurs, Patroni prevents PostgreSQL from starting and logs: CRITICAL: system ID mismatch
The issue is therefore caused by stale or inconsistent Patroni cluster metadata remaining after the operator upgrade, leading to a mismatch between:
The PostgreSQL data directory system ID
The system ID stored in Kubernetes ConfigMaps
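The ConfigMap side of the comparison can be inspected directly. When Patroni uses Kubernetes as its DCS, it records the cluster's system ID in the initialize annotation of the -config ConfigMap; the cluster name and namespace below are placeholders.

```shell
# Print the system ID Patroni recorded in the ConfigMap, so it can be
# compared against the one reported by pg_controldata in PGDATA.
kubectl get cm <cluster-name>-config -n <namespace> \
  -o jsonpath='{.metadata.annotations.initialize}'
```

If this value differs from the identifier in the data directory, Patroni will refuse to start PostgreSQL with the CRITICAL error shown above.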
The issue can be resolved by removing the stale Patroni ConfigMaps and allowing them to be recreated automatically.
Step 1: For the affected cluster, delete the following ConfigMaps:
kubectl delete cm <cluster-name>-config -n <namespace>
kubectl delete cm <cluster-name>-custom-config -n <namespace>
kubectl delete cm <cluster-name>-leader -n <namespace>
kubectl delete cm <cluster-name>-sync -n <namespace>
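The four deletions above follow a fixed naming pattern, so they can be generated from the cluster name. A minimal sketch, assuming a cluster named hippo in namespace postgres (both placeholders), that prints the exact commands for review before anything is actually deleted:

```shell
# Placeholder values; substitute your real cluster name and namespace.
CLUSTER=hippo
NAMESPACE=postgres

# Print the delete command for each stale Patroni ConfigMap so it can be
# reviewed (pipe to "sh" only after verifying the output).
for suffix in config custom-config leader sync; do
  echo "kubectl delete cm ${CLUSTER}-${suffix} -n ${NAMESPACE}"
done
```

Reviewing the generated commands before executing them avoids deleting ConfigMaps belonging to another cluster in the same namespace.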
Step 2: Delete all pods belonging to that instance so they restart with the recreated metadata:
kubectl delete pod <postgres-pod-name> -n <namespace>
(Repeat for each pod in the instance.)
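Once the pods restart, the ConfigMaps are recreated automatically with the system ID taken from the data directory. A hedged verification sketch (pod name, cluster name, and namespace are placeholders; the timeout is an arbitrary choice):

```shell
# Wait for the restarted instance pod to report Ready.
kubectl wait --for=condition=Ready pod/<postgres-pod-name> \
  -n <namespace> --timeout=300s

# Confirm the Patroni ConfigMaps were recreated for the cluster.
kubectl get cm -n <namespace> | grep <cluster-name>
```

The instance should now reach full readiness, and the CRITICAL: system ID mismatch error should no longer appear in the pg-container logs.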