VMware Aria Automation Postgres service fails with CrashLoopBackOff state: 'database system was interrupted while in recovery'


Article ID: 372624


Products

VMware Aria Suite

Issue/Introduction

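This article documents a known issue in which Postgres pods in a VMware Aria Automation cluster enter the CrashLoopBackOff state, and the workaround to fix it. The affected pods appear as follows: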

 

# kubectl get pods -n prelude -l name=postgres -o wide

NAME         READY   STATUS             RESTARTS   AGE   IP            NODE           NOMINATED NODE   READINESS GATES
postgres-0   1/1     Running            0          33m   10.xx.xx.xx   vRaFQDN.com    <none>           <none>
postgres-1   0/1     CrashLoopBackOff   11         33m   10.xx.x.xx    vRa1FQDN.com   <none>           <none>
postgres-2   0/1     CrashLoopBackOff   11         33m   10.xx.x.xx    vRa2FQDN.com   <none>           <none>
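
To pick out only the failing pods and their nodes from this output, a small filter along the following lines may help. It is a minimal sketch: `crashing_pods` is a hypothetical helper name (not part of kubectl), and it assumes the column order shown above, with STATUS in the third column and NODE in the seventh.

```shell
# Hypothetical helper: print "<pod> <node>" for every pod whose STATUS is
# CrashLoopBackOff. Reads `kubectl get pods ... -o wide` output on stdin.
# Assumes the column layout shown above (STATUS = column 3, NODE = column 7).
crashing_pods() {
  awk 'NR > 1 && $3 == "CrashLoopBackOff" { print $1, $7 }'
}
```

It could be used as `kubectl get pods -n prelude -l name=postgres -o wide | crashing_pods`.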

 

Log snippets from /service-logs/prelude/file-logs/postgres.log:

database system was interrupted while in recovery
2024-07-16 02:22:43.969 UTC [136] HINT: If this has occurred more than once some data might be corrupted and you might need to choose an earlier recovery target.
2024-07-16 02:22:44.045 UTC [136] LOG: entering standby mode
2024-07-16 02:22:44.049 UTC [136] FATAL: recovery aborted because of insufficient parameter settings
2024-07-16 02:22:44.049 UTC [136] DETAIL: max_connections = 4100 is a lower setting than on the primary server, where its value was 4450.
2024-07-16 02:22:44.049 UTC [136] HINT: You can restart the server after making the necessary configuration changes.
2024-07-16 02:22:44.050 UTC [134] LOG: startup process (PID 136) exited with exit code 1
2024-07-16 02:22:44.050 UTC [134] LOG: aborting startup due to startup process failure
2024-07-16 02:22:44.066 UTC [134] LOG: database system is shut down
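
The DETAIL line in this log carries the two values that matter: the standby's setting and the primary's. As a rough sketch, they can be pulled out of the log with sed. The function name `max_conn_mismatch` is hypothetical, and the pattern assumes the exact message wording shown above.

```shell
# Hypothetical helper: print the standby and primary max_connections values
# reported in the most recent matching DETAIL line of the given postgres.log.
# Assumes the message wording shown in the log excerpt above.
max_conn_mismatch() {
  sed -n 's/.*DETAIL: *max_connections = \([0-9][0-9]*\) is a lower setting than on the primary server, where its value was \([0-9][0-9]*\).*/standby=\1 primary=\2/p' "$1" | tail -n 1
}
```

For example, `max_conn_mismatch /service-logs/prelude/file-logs/postgres.log` run against the log above would report the standby value 4100 and the primary value 4450.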

Environment

VMware Aria Automation 8.x

Cause

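This can occur because of a race condition between the Postgres pods or a split-brain condition caused by network isolation. As the postgres.log excerpt shows, the standby nodes end up with a lower max_connections value than the primary, so recovery aborts and the pod restarts repeatedly.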

Resolution

First, determine whether the split brain is caused by network isolation by following this article: https://knowledge.broadcom.com/external/article/317721.

If that article does not resolve the issue, proceed with the steps below:

  1. Open an SSH session to each Aria Automation node where Postgres is crashing. The command 'kubectl get pods -n prelude -l name=postgres -o wide' shows which Postgres pods are in CrashLoopBackOff status and the node each one runs on. On each of those nodes, open /data/db/live/postgresql.conf and check the max_connections value.

    For example, in a 3-node Aria Automation cluster, the primary node might have a value of 4450, whereas the other 2 problematic nodes might have a different value (4100).

  2. Change max_connections to match the primary's value (4450 in this example) on the problematic nodes where the Postgres service is crashing.
  3. Then restart the Postgres pods using the command below:

        kubectl delete pod -n prelude <Postgres_Pod_Name1>; kubectl delete pod -n prelude <Postgres_Pod_Name2>;

     For example, using the problematic pods identified in step 1 ('kubectl get pods -n prelude -l name=postgres -o wide'):
        kubectl delete pod -n prelude postgres-1; kubectl delete pod -n prelude postgres-2;

  4. This should bring up the Postgres pods on all nodes.
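
The check and edit in steps 1 and 2 can be sketched as two small shell functions. This is an illustration under assumptions, not the documented procedure: the function names are hypothetical, the config file path is passed as an argument so the functions can be tried on a copy first, and they assume the setting is written in the usual `max_connections = <number>` form.

```shell
# Hypothetical helpers for inspecting and changing max_connections.

# Print the active (uncommented) max_connections value from a postgresql.conf.
get_max_connections() {
  sed -n 's/^[[:space:]]*max_connections[[:space:]]*=[[:space:]]*\([0-9][0-9]*\).*/\1/p' "$1" | tail -n 1
}

# Set max_connections to the given value, keeping the original as <file>.bak.
# Only the number is replaced; any trailing comment on the line is preserved.
set_max_connections() {
  conf=$1
  value=$2
  cp -p "$conf" "$conf.bak" || return 1
  sed "s/^\([[:space:]]*max_connections[[:space:]]*=[[:space:]]*\)[0-9][0-9]*/\1${value}/" "$conf.bak" > "$conf"
}
```

On a problematic node, `get_max_connections /data/db/live/postgresql.conf` would show the current value, and `set_max_connections /data/db/live/postgresql.conf 4450` would raise it to the primary's value from the example before the pod is deleted in step 3.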