Monitoring PostgreSQL Replication Lag Using Prometheus Postgres Exporter
search cancel

Monitoring PostgreSQL Replication Lag Using Prometheus Postgres Exporter

book

Article ID: 427872

calendar_today

Updated On:

Products

VMware Tanzu for Postgres VMware Tanzu Data Suite VMware Tanzu Data Suite VMware Tanzu Greenplum VMware Tanzu Greenplum / Gemfire

Issue/Introduction

While setting up Disaster Recovery (DR) monitoring in HUB using Prometheus Postgres Exporter, there could be a need to monitor replication replay lag from the primary PostgreSQL database. PostgreSQL provides detailed replication lag information through the following query on the primary node:

SELECT
  application_name,
  state,
  write_lag,
  flush_lag,
  replay_lag
FROM pg_stat_replication;

 

Although this query returns valid time-based lag values for each connected replica, these fields are not exposed as Prometheus metrics by default in postgres_exporter.

The exporter does provide several replication-related metrics such as:

  • pg_replication_lag_seconds

  • pg_stat_replication_pg_wal_lsn_diff

  • pg_replication_slot_*

  • pg_stat_replication_reply_time

However, the metric pg_replication_lag_seconds typically reports:

  • 0 on the primary

  • Meaningful lag values only when scraped from replica (standby/DR) nodes

This could raise questions about:

  • Exporting custom queries to capture write_lag, flush_lag, and replay_lag

  • Calculating time-based lag on the primary using pg_stat_replication_reply_time

Resolution

1. Recommended Metric for Primary-Side Monitoring

For replication monitoring from the primary database, the supported and recommended metric is:

pg_stat_replication_pg_wal_lsn_diff

 

This metric:

  • Is exposed on the primary node

  • Reports WAL lag per replica

  • Measures replication delay in bytes

  • Requires no custom exporter configuration

Example Output:

pg_stat_replication_pg_wal_lsn_diff{
  application_name="<app_name>",
  client_addr="<IP>,
  slot_name="<name>",
  state="streaming"
} 0

 

This provides clear visibility into DR replica health directly from the primary.

 

2. Time-Based Replication Lag Monitoring

Time-based lag (pg_replication_lag_seconds) is not exposed per replica on the primary by postgres_exporter.

To monitor replication lag in seconds, follow this approach:

  • Scrape Prometheus metrics from all PostgreSQL nodes

  • Identify replica nodes using:

pg_replication_is_replica == 1

Use pg_replication_lag_seconds from replica nodes, not the primary

 

3. Custom Query Considerations

  • postgres_exporter does not export custom SQL queries by default

  • Calculating lag using:

current_time - pg_stat_replication_reply_time

is not supported via built-in exporter metrics

  • Direct SQL queries can still be used for manual checks or diagnostics, but not for Prometheus-based monitoring unless a custom exporter configuration is implemented