Monitoring PostgreSQL Replication Lag Using Prometheus Postgres Exporter

search cancel

Monitoring PostgreSQL Replication Lag Using Prometheus Postgres Exporter

book

Article ID: 427872

calendar_today

Updated On:

Products

VMware Tanzu for Postgres VMware Tanzu Data Suite VMware Tanzu Data Suite VMware Tanzu Greenplum VMware Tanzu Greenplum / Gemfire

Issue/Introduction

While setting up Disaster Recovery (DR) monitoring in HUB using Prometheus Postgres Exporter, there could be a need to monitor replication replay lag from the primary PostgreSQL database. PostgreSQL provides detailed replication lag information through the following query on the primary node:

SELECT
  application_name,
  state,
  write_lag,
  flush_lag,
  replay_lag
FROM pg_stat_replication;

Although this query returns valid time-based lag values for each connected replica, these fields are not exposed as Prometheus metrics by default in postgres_exporter.

The exporter does provide several replication-related metrics such as:

pg_replication_lag_seconds
pg_stat_replication_pg_wal_lsn_diff
pg_replication_slot_*
pg_stat_replication_reply_time

However, the metric pg_replication_lag_seconds typically reports:

0 on the primary
Meaningful lag values only when scraped from replica (standby/DR) nodes

This could raise questions about:

Exporting custom queries to capture write_lag, flush_lag, and replay_lag
Calculating time-based lag on the primary using pg_stat_replication_reply_time

Resolution

1. Recommended Metric for Primary-Side Monitoring

For replication monitoring from the primary database, the supported and recommended metric is:

pg_stat_replication_pg_wal_lsn_diff

This metric:

Is exposed on the primary node
Reports WAL lag per replica
Measures replication delay in bytes
Requires no custom exporter configuration

Example Output:

pg_stat_replication_pg_wal_lsn_diff{
  application_name="<app_name>",
  client_addr="<IP>,
  slot_name="<name>",
  state="streaming"
} 0

This provides clear visibility into DR replica health directly from the primary.

2. Time-Based Replication Lag Monitoring

Time-based lag (pg_replication_lag_seconds) is not exposed per replica on the primary by postgres_exporter.

To monitor replication lag in seconds, follow this approach:

Scrape Prometheus metrics from all PostgreSQL nodes
Identify replica nodes using:

pg_replication_is_replica == 1

Use pg_replication_lag_seconds from replica nodes, not the primary

3. Custom Query Considerations

postgres_exporter does not export custom SQL queries by default
Calculating lag using:

current_time - pg_stat_replication_reply_time

is not supported via built-in exporter metrics

Direct SQL queries can still be used for manual checks or diagnostics, but not for Prometheus-based monitoring unless a custom exporter configuration is implemented

Feedback

thumb_up Yes

thumb_down No