The connection state "waiting for replication" usually appears when primary-to-mirror synchronization is slow: it indicates that a backend process on the primary is waiting for its WAL to be synchronized to the mirror. This typically happens at the transaction commit stage, when the WAL has already been flushed to local disk on the primary but still needs to be transmitted to the mirror and flushed to the mirror's disk.
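To confirm which backends on a segment host are stuck in this state, the process titles can be checked directly, since a waiting backend reports the WAL location it is waiting on in its title. A minimal check (the exact title text can vary between Greenplum versions):

    # On the primary segment host: list backends waiting on the mirror to acknowledge WAL
    ps -ef | grep postgres | grep 'waiting for'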
Seeing a large number of these messages can have several causes:
- Network I/O issues between the primary and mirror hosts
- Heavy workload or disk I/O on the remote (mirror) machine
- The mirror is in catchup state; running "select * from gp_stat_replication;" on the current primary helps confirm this (see the query sketch after this list).
- A large number of concurrent write requests.
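A minimal query sketch for the catchup check mentioned above, using the gp_stat_replication view; the column names here follow the Greenplum 6 view and may differ slightly in other versions:

    -- state shows 'catchup' while the mirror is still replaying backlogged WAL,
    -- and 'streaming' once it has fully caught up
    select gp_segment_id, state, sent_location, flush_location, replay_location, sync_state
    from gp_stat_replication;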
As a first action item, gather the following on both the primary and mirror hosts:
- iostat -dx 1 15
- netstat -i [Run this at least 3-4 times to gather consistent output]
- Check whether the cluster uses an architecture similar to GBB [primaries on NVMe drives, mirrors on slower SSDs]
- The issue can also occur if the current primary has been marked down and has left behind orphaned sessions stuck in "waiting for XXX/XXXXXXXX replication". In this case, try connecting to the primary in utility mode and issuing pg_terminate_backend on the orphaned connections (a sketch follows this list).
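A rough sketch of that cleanup, assuming a Greenplum 6-style utility-mode connection (gp_session_role is the GP6 GUC; newer versions use gp_role=utility), with the segment host, port, and PID as placeholders to fill in from the environment and the ps output:

    # Connect directly to the affected primary segment in utility mode
    PGOPTIONS='-c gp_session_role=utility' psql -h <primary_host> -p <primary_port> postgres

    -- Inside the utility-mode session: terminate one orphaned backend at a time,
    -- using the PID seen waiting in the ps output
    select pg_terminate_backend(<orphaned_pid>);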
In the iostat and netstat outputs above, watch for the following (a collection sketch follows this list):
- High await or w_await values on the mirror mount points.
- Increasing RX-ERR or RX-DRP counts in the netstat -i output. [Collect 4-5 runs of netstat to check whether these packet counters keep growing.]
- Gather /var/log/messages on the mirror host to check for disk or controller issues.
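A simple collection sketch that ties the above together; the output directory, run counts, and sleep interval are arbitrary choices, and reading /var/log/messages typically requires root:

    # Run on both the primary and mirror hosts
    OUT=/tmp/repl_diag.$(hostname).$(date +%Y%m%d_%H%M%S)
    mkdir -p "$OUT"

    # Disk latency: watch await / w_await on the mirror mount points
    iostat -dx 1 15 > "$OUT/iostat.out"

    # Interface errors: several runs so growth in RX-ERR / RX-DRP is visible
    for i in 1 2 3 4 5; do
        netstat -i >> "$OUT/netstat.out"
        sleep 5
    done

    # Kernel messages pointing at disk or controller issues (mirror host)
    grep -iE 'error|fail|scsi|reset' /var/log/messages > "$OUT/messages_errors.out"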