Tanzu Postgres for Kubernetes upgrade from 3.0 to 4.2.4 fails with error "requested timeline N is not a child of this server's history"

Article ID: 440815


Products

VMware Tanzu Data Services

Issue/Introduction

When upgrading Tanzu Postgres for Kubernetes from 3.0 to 4.2.4, some Postgres instances may fail while others succeed. For a failed instance, the container log reports:

  INFO: unable to find 0000012345 in the archive asynchronously
  INFO: archive-get command end: completed successfully (104ms)
  FATAL:  requested timeline 20 is not a child of this server's history
  DETAIL:  Latest checkpoint is at AB/12345676 on timeline 19, but in the history of the requested timeline, the server forked off from that timeline at AB/12345879.

Cause

The expected .history file does not exist in the /pgsql/data/pg_wal directory, which causes the above error. The history file should always exist when a cluster has gone through one or more promotions. The immediate impact is that any new replica attempting to join will fail.
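
As a quick manual check, the name of the history file for a given timeline can be derived and looked up in pg_wal. The following is a minimal sketch, not the R&D script: PostgreSQL names timeline history files with the timeline ID as eight uppercase hexadecimal digits followed by .history, so timeline 20 from the log above corresponds to 00000014.history. The function names are hypothetical, and the default directory is the path mentioned above.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical, not part of any product tooling.

# Print the history file name for a decimal timeline ID.
history_file_name() {
  printf '%08X.history\n' "$1"
}

# Return 0 if the history file for timeline $1 exists in directory $2
# (defaults to the pg_wal path mentioned in this article).
# Timeline 1 never has a history file, so it is treated as present.
has_history_file() {
  tl=$1
  wal_dir=${2:-/pgsql/data/pg_wal}
  [ "$tl" -eq 1 ] && return 0
  [ -f "$wal_dir/$(history_file_name "$tl")" ]
}
```

Run inside the instance's data container, `has_history_file 19 || echo "history file missing"` would look for 00000013.history. The current timeline can be read with `SELECT timeline_id FROM pg_control_checkpoint();`, assuming psql access to the instance.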

 

Resolution

The R&D team provided a script (sonic-issue-v3.sh) to validate each instance before starting the upgrade:

./precheck.sh cluster_name instance_name

The script generates output like the following:

Cluster with issue:
  Test[1] Current TL history file : FAIL
          00000008.history missing from pg_wal/ — standbys will fail to initialize via streaming

  Test[2] Stale/Next TL check     : SAFE
          no higher timeline history file found in local pg_wal/ or archive


Cluster with no issue:
  Test[1] Current TL history file : PASS
          00000008.history exists in pg_wal/ — standbys can initialize via streaming

  Test[2] Stale/Next TL check     : SAFE
          no higher timeline history file found in local pg_wal/ or archive
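
When many instances need checking, the verdicts can be pulled out of the script output programmatically. The sketch below assumes the output format shown above, with lines of the form `Test[N] ... : RESULT`. The function names `test_result` and `upgrade_advice` are hypothetical and are not part of the provided script.

```shell
#!/bin/sh
# Sketch only: parses precheck output in the format shown in this article.

# Read precheck output on stdin and print the verdict of test number $1.
test_result() {
  sed -n "s/^[[:space:]]*Test\[$1\][^:]*:[[:space:]]*\([A-Z]*\).*/\1/p"
}

# Read precheck output on stdin and print the action this article
# recommends for each verdict.
upgrade_advice() {
  out=$(cat)
  t1=$(printf '%s\n' "$out" | test_result 1)
  t2=$(printf '%s\n' "$out" | test_result 2)
  if [ "$t1" = "FAIL" ]; then
    echo "Test 1 failed: disable HA and wait for single-node state before upgrading"
  fi
  if [ "$t2" = "FAIL" ]; then
    echo "Test 2 failed: higher risk; restore from the pre-upgrade backup if the cluster does not come up"
  fi
  if [ "$t1" = "PASS" ] && [ "$t2" = "SAFE" ]; then
    echo "No issue detected"
  fi
}
```

A possible invocation is `./precheck.sh cluster_name instance_name | upgrade_advice`, repeated for each instance.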

 

For clusters with the issue (history file missing), the workaround is as follows.

Clusters failing Test 1

    Disable HA before upgrading and wait until the cluster has reduced to a single node:

highAvailability:
  enabled: false 
  readReplicas: 0

Do not proceed with the upgrade until the cluster confirms single-node state.
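
The setting above can be applied with kubectl. This is a sketch under assumptions that should be verified against your environment: that both fields live under `spec` of the instance's `postgres` custom resource, and that the instance name and namespace are passed in by the operator. The function names are hypothetical.

```shell
#!/bin/sh
# Sketch only. Assumption: the highAvailability fields sit under .spec
# of the "postgres" custom resource for the instance.

# Print a JSON merge patch equivalent to the YAML fragment above.
ha_disable_patch() {
  printf '%s\n' '{"spec":{"highAvailability":{"enabled":false,"readReplicas":0}}}'
}

# Apply the patch to instance $1 in namespace $2.
disable_ha() {
  kubectl patch postgres "$1" -n "$2" --type merge -p "$(ha_disable_patch)"
}

# List the pods in namespace $1 so the single-node state can be
# confirmed by inspection before the upgrade starts.
list_pods() {
  kubectl get pods -n "$1"
}
```

For example, `disable_ha instance_name namespace` followed by repeated `list_pods namespace` calls until only one data pod for the instance remains.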

 

Clusters failing Test 2

These clusters carry stale history files and are at higher risk during the upgrade. If a cluster fails to come up after the upgrade, do not attempt manual intervention; restore immediately from the backup taken before the upgrade.

Attachments

precheck.sh