Aria Automation Portal is down with one of the postgres pods showing 0/1 Running

Products

VCF Operations/Automation (formerly VMware Aria Suite)

Issue/Introduction

The Aria Automation portal is down and users cannot reach it.

Run 'kubectl get nodes -n prelude' - All nodes show in a ready state.
Run '/opt/health/run.sh' on each node - It comes back with no errors on any node.
Run 'kubectl get pods -n prelude -o wide' and you see:
- ```
postgres-# 0/1 Running #
```
- That postgres pod also shows a number of restarts.
- This indicates that the affected postgres pod is not healthy.
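
On a cluster with many pods, the unhealthy postgres pod can be picked out of the listing with a small filter. The following is a minimal sketch and not part of the product tooling: the helper name `unready_postgres_pods` is hypothetical, and it assumes the default `kubectl get pods` column order (NAME, READY, STATUS).

```shell
# Hypothetical helper: reads `kubectl get pods` output on stdin and prints
# the postgres pods whose READY count is below the desired count.
unready_postgres_pods() {
  awk '$1 ~ /^postgres-/ {
         split($2, ready, "/")        # READY column, e.g. "0/1"
         if (ready[1] != ready[2])    # fewer ready containers than desired
           print $1, $2, $3           # NAME READY STATUS
       }'
}

# Usage on an appliance node:
# kubectl get pods -n prelude -o wide | unready_postgres_pods
```

An empty result means every postgres pod reports all of its containers as ready.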

Run 'vracli status' and see:

{
   "Node name": "postgres-#",
   "vra_status": "error",
   "vra_error": "error: unable to upgrade connection: container not found (\"control\")"
}

Find out which node is running this pod:

kubectl get pods -n prelude -o wide --selector=app=postgres

SSH to the appliance running this pod and run:

cd /services-logs/prelude/
cd postgres-# (the number will be specific to the appliance)
less postgres.log

You will observe output such as:

#-#-# #:#:#.# UTC [#] HINT:  If this has occurred more than once some data might be corrupted and you might need to choose an earlier recovery target.
#-#-# #:#:#.# UTC [#] LOG:  entering standby mode
#-#-# #:#:#.# UTC [#] FATAL:  requested timeline # does not contain minimum recovery point #/### on timeline #
#-#-# #:#:#.# UTC [#] LOG:  startup process (PID #) exited with exit code #
#-#-# #:#:#.# UTC [#] LOG:  aborting startup due to startup process failure
#-#-# #:#:#.# UTC [#] LOG:  database system is shut down
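
Instead of paging through the log, the FATAL line can be searched for directly. This is a minimal sketch: the helper name `has_timeline_mismatch` is hypothetical, and the pattern is derived only from the sample output above.

```shell
# Hypothetical helper: prints the matching lines and exits 0 when the given
# postgres log contains the timeline mismatch FATAL message; exits non-zero
# when no such line is found.
has_timeline_mismatch() {
  grep -E 'FATAL: +requested timeline [0-9]+ does not contain minimum recovery point' "$1"
}

# Usage (path taken from the steps above; replace # with the pod number):
# has_timeline_mismatch /services-logs/prelude/postgres-#/postgres.log
```

A match suggests the node is affected by the condition described in this article; no match suggests the pod is failing for a different reason.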

Environment

Aria Automation 8.18.x

Cause

After a crash, the WAL checkpoint timeline of the secondary (standby) database is ahead of that of the primary database.

The timeline on the secondary database node will be higher than the minimum recovery point timeline found on the primary database node.

Resolution

⚠️ WARNING: Please ensure that you have taken a valid snapshot and backup ahead of running these destructive commands.

SSH to the affected node and run:
1. Go to the Database directory.
  1. ```
  cd /data/db/
```
2. Remove the copy of the live Database files on the affected Secondary node.
  1. ```
  rm -r /data/db/live/*
```
3. Remove any flag files that set special modes (Standby, Debug, etc.):
  1. ```
  rm -r /data/db/flags/*
```
4. Restart the "postgres" pods:
  1. ```
  kubectl delete pods -n prelude --selector=app=postgres
```
Run 'kubectl get pods -n prelude -o wide --selector=app=postgres'
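
Rather than re-running that command by hand, the check can be polled. The sketch below is an illustration only: the function names, the default of 30 attempts, and the 10 second delay are assumptions, while the namespace, selector, and expected count of 3 pods come from this article.

```shell
# Count postgres pods that are Running with all containers ready.
postgres_ready_count() {
  kubectl get pods -n prelude --selector=app=postgres --no-headers |
    awk '{ split($2, r, "/"); if (r[1] == r[2] && $3 == "Running") n++ }
         END { print n + 0 }'
}

# Hypothetical helper: wait_for_postgres [expected] [attempts] [delay_seconds]
# Returns 0 once the expected number of pods is ready, 1 if attempts run out.
wait_for_postgres() {
  wfp_expected=${1:-3}
  wfp_attempts=${2:-30}
  wfp_delay=${3:-10}
  wfp_i=0
  while [ "$wfp_i" -lt "$wfp_attempts" ]; do
    if [ "$(postgres_ready_count)" -eq "$wfp_expected" ]; then
      return 0
    fi
    wfp_i=$((wfp_i + 1))
    sleep "$wfp_delay"
  done
  return 1
}

# Usage:
# wait_for_postgres && echo "all postgres pods are 1/1 Running"
```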
Once all 3 postgres pods come up without issue (showing 'postgres-# 1/1 Running'), SSH to the node running the 'postgres-0' pod and run:
1. ```
kubectl exec -it -n prelude postgres-0 -- bash
```
Once connected to the postgres primary pod shell, change user to "postgres":
1. ```
su - postgres
```
Then show the postgres cluster configuration:
1. ```
repmgr -f /etc/repmgr.conf cluster show
```

You should see output similar to the following, with one running primary and two running standbys on the same timeline:

 ID  | Name                                          | Role    | Status    | Upstream                                      | Location | Priority | Timeline | Connection string
-----+-----------------------------------------------+---------+-----------+-----------------------------------------------+----------+----------+----------+-------------------------------------------------------------------------------------------------------------------------------------------------
 100 | postgres-0.postgres.prelude.svc.cluster.local | primary | * running |                                               | default  | 100      | 6        | host=postgres-0.postgres.prelude.svc.cluster.local dbname=repmgr-db user=repmgr-db passfile=/run/repmgr-db.cred connect_timeout=10 keepalives=1
 101 | postgres-1.postgres.prelude.svc.cluster.local | standby |   running | postgres-0.postgres.prelude.svc.cluster.local | default  | 99       | 6        | host=postgres-1.postgres.prelude.svc.cluster.local dbname=repmgr-db user=repmgr-db passfile=/run/repmgr-db.cred connect_timeout=10 keepalives=1
 102 | postgres-2.postgres.prelude.svc.cluster.local | standby |   running | postgres-0.postgres.prelude.svc.cluster.local | default  | 98       | 6        | host=postgres-2.postgres.prelude.svc.cluster.local dbname=repmgr-db user=repmgr-db passfile=/run/repmgr-db.cred connect_timeout=10 keepalives=1
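
The table above can also be checked mechanically. This is a minimal sketch assuming the pipe-separated layout shown in the sample output and a three-node cluster; the helper name `check_repmgr_cluster` is hypothetical.

```shell
# Hypothetical helper: reads `repmgr cluster show` output on stdin and
# exits 0 only when it lists exactly one running primary and two running
# standbys.
check_repmgr_cluster() {
  awk -F'|' '
    NF >= 4 {
      role = $3; status = $4
      gsub(/^[ \t]+|[ \t]+$/, "", role)      # trim column padding
      gsub(/^[ \t]+|[ \t]+$/, "", status)
      if (role == "primary" && status == "* running") primaries++
      else if (role == "standby" && status == "running") standbys++
    }
    END { exit !(primaries == 1 && standbys == 2) }
  '
}

# Usage, as the postgres user inside the postgres-0 pod:
# repmgr -f /etc/repmgr.conf cluster show | check_repmgr_cluster && echo "cluster healthy"
```

The header and separator rows are ignored because they never match a role and status pair.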

Type 'exit' and hit Enter.
Type 'exit' and hit Enter again. You should now be back on the appliance shell.
Run '/opt/scripts/deploy.sh' from the Primary node.