DX APM Command Center - How to fix corrupted DB data files?

Article ID: 275806


Products

DX Application Performance Management

Issue/Introduction

The DX Platform ACC pods are unable to initialize:

kubectl get pods --namespace dxi -o wide | grep -v Running

ng-acc-configserver-cp-deployment-#######-###     0/1     Init:CrashLoopBackOff   9 (99s ago)     24m              

ng-acc-configserver-db-deployment-#######-###     0/1     CrashLoopBackOff        9 (2m56s ago)   24m         

ng-acc-configserver-deployment-#######-###        0/1     Init:CrashLoopBackOff   9 (108s ago)    24m    

 

kubectl -n dxi logs ng-acc-configserver-db-deployment-#######-###

Defaulted container "ng-acc-configserver-db-container" out of: ng-acc-configserver-db-container, init-fs (init)

PostgreSQL Database directory appears to contain a database; Skipping initialization

waiting for server to start....2023-11-06 14:19:05.943 UTC [16]: [1] LOG:  pgaudit extension initialized

2023-11-06 14:19:05.944 UTC [16]: [2] LOG:  starting PostgreSQL 12.12 on x86_64-alpine-linux-gnu, compiled by gcc (Alpine 11.2.1_git20220219) 11.2.1 20220219, 64-bit

2023-11-06 14:19:05.944 UTC [16]: [3] LOG:  listening on Unix socket "/run/postgresql/.s.PGSQL.5432"

2023-11-06 14:19:06.064 UTC [17]: [1] LOG:  database system was shut down at 2023-11-05 04:50:29 UTC

2023-11-06 14:19:06.069 UTC [17]: [2] LOG:  invalid resource manager ID in primary checkpoint record

2023-11-06 14:19:06.069 UTC [17]: [3] PANIC:  could not locate a valid checkpoint record

.2023-11-06 14:19:07.582 UTC [16]: [4] LOG:  startup process (PID 17) was terminated by signal 6: Aborted

2023-11-06 14:19:07.582 UTC [16]: [5] LOG:  aborting startup due to startup process failure

2023-11-06 14:19:07.649 UTC [16]: [6] LOG:  database system is shut down

stopped waiting

pg_ctl: could not start server

Examine the log output
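
The corruption signature in the log above can also be checked for automatically. The following is a minimal sketch, not part of the product tooling: the function name has_checkpoint_corruption is hypothetical, and the pattern is taken from the PANIC line shown in this article.

```shell
# Hypothetical helper: reads PostgreSQL log text on stdin and succeeds (exit 0)
# only if it contains the PANIC message about the missing checkpoint record.
has_checkpoint_corruption() {
  grep -q 'PANIC:.*could not locate a valid checkpoint record'
}

# Example usage (the pod name is a placeholder):
#   kubectl -n dxi logs <ng-acc-configserver-db pod> | has_checkpoint_corruption \
#     && echo "checkpoint record is damaged"
```

If the function succeeds, the log matches the symptom described in this article; other startup failures will not match and need separate investigation.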

Environment

DX Platform 23.x

Cause

The PostgreSQL data files are corrupted, typically as a result of a sudden termination of the PostgreSQL (PG) pod.
 

Resolution

 

OPTION 1) [RECOMMENDED] Use Postgres tools to fix the corrupted data files

 
This option requires running the Postgres tools against the corrupted data files.
One approach is to run a NEW pod that mounts the pg data files and provides the pg tools (see the steps below):
    - if it succeeds, there is no data loss
    - note that these steps are manual and require k8s/OS knowledge, because a pod with the mounted pg data and tooling is deployed and used manually.
 
INSTRUCTIONS:

1) Download the attached DE570003-ng-acc-configserver-db-cli.yml deployment file, which starts a pod with pg tooling and the mounted volumes.

2) Modify the following 4 items in the yml file:

  • <registry>/acc-postgresql:<23.1.0.x> - use the same acc-postgresql image as used in ng-acc-configserver-db
  • namespace - default is dxi; update it as required with your namespace
  • persistentVolumeClaim - default is dxi; update it as required
  • serviceAccount - default is dxi; update it to dxi-acc
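
The values for these items can usually be read from the existing cluster objects. The following is a sketch under assumptions: it takes the deployment name ng-acc-configserver-db-deployment and the service account dxi-acc from this article, and the function names and the NAMESPACE variable are illustrative only.

```shell
# Namespace of the DX Platform installation (dxi is the default in this article).
NAMESPACE="${NAMESPACE:-dxi}"
DB_DEPLOYMENT="ng-acc-configserver-db-deployment"

# Print the image used by the existing DB deployment (reuse it in the yml file).
db_image() {
  kubectl -n "$NAMESPACE" get deployment "$DB_DEPLOYMENT" \
    -o jsonpath='{.spec.template.spec.containers[*].image}'
}

# Print the PersistentVolumeClaim name(s) mounted by the existing DB deployment.
db_pvc() {
  kubectl -n "$NAMESPACE" get deployment "$DB_DEPLOYMENT" \
    -o jsonpath='{.spec.template.spec.volumes[*].persistentVolumeClaim.claimName}'
}

# Confirm that the dxi-acc service account exists in the namespace.
acc_service_account() {
  kubectl -n "$NAMESPACE" get serviceaccount dxi-acc
}
```

Run db_image, db_pvc and acc_service_account from a shell that has access to the cluster, then copy the printed values into the yml file.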

3) Save it and deploy it using k8s cli or UI:

    kubectl -n <namespace> apply -f <yml file>

4) Fix the Postgres DB:

a) scale down ng-acc-configserver-db deployment

kubectl -n <your-namespace> scale deployment ng-acc-configserver-db-deployment --replicas=0

b) scale up the NEW ng-acc-configserver-db-cli deployment: this creates a new pod that has the Postgres tooling and the Postgres data directories mounted

kubectl -n <your-namespace> scale deployment ng-acc-configserver-db-cli --replicas=1


- open a terminal in the ng-acc-configserver-db-cli deployment's pod:

     kubectl -n <your-namespace> exec -it <your ng-acc-configserver-db-cli pod id> -- /bin/bash
 

- run the pg_resetwal command:

     pg_resetwal /var/lib/postgresql/data/

       The command reports any issues and fixes; otherwise, the expected output is: "Write-ahead log reset"
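
Before resetting the WAL, it can be useful to preview what pg_resetwal would change. The following is a minimal sketch, assuming the same data directory as above: the -n (--dry-run) option of pg_resetwal prints the control values and exits without modifying anything. The variable PGDATA_DIR and the function name preview_wal_reset are illustrative only.

```shell
# Data directory as used in this article.
PGDATA_DIR="/var/lib/postgresql/data/"

# Preview only: with -n (--dry-run), pg_resetwal prints the control values it
# would write and exits without modifying the data directory.
preview_wal_reset() {
  pg_resetwal -n "$PGDATA_DIR"
}
```

Call preview_wal_reset inside the ng-acc-configserver-db-cli pod, review the output, and then run the actual pg_resetwal command shown above.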

5) scale down the deployment ng-acc-configserver-db-cli

kubectl -n <your-namespace> scale deployment ng-acc-configserver-db-cli --replicas=0

6) scale up ng-acc-configserver-db deployment

kubectl -n <your-namespace> scale deployment ng-acc-configserver-db-deployment --replicas=1

7) check the ng-acc-configserver-db pod log and verify that the DB is up:

the expected message is "database system is ready to accept connections"

kubectl -n <your-namespace> logs ng-acc-configserver-db-deployment-####-### 
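
The log check in this step can be scripted. The following is a sketch, assuming the deployment name from this article; the function name db_is_ready and the NAMESPACE variable are illustrative. It reads the log through the deployment, so no pod id is needed.

```shell
# Namespace of the DX Platform installation (dxi is the default in this article).
NAMESPACE="${NAMESPACE:-dxi}"

# Succeeds (exit 0) only if the DB deployment's log contains the ready message.
db_is_ready() {
  kubectl -n "$NAMESPACE" logs deployment/ng-acc-configserver-db-deployment \
    | grep -q 'database system is ready to accept connections'
}

# Example usage:
#   db_is_ready && echo "DB is up" || echo "DB is not ready yet"
```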

8) scale up the rest of the ACC pods

kubectl -n <namespace> scale deployment ng-acc-configserver-cp-deployment --replicas=1
kubectl -n <namespace> scale deployment ng-acc-configserver-deployment --replicas=1
kubectl -n <namespace> scale deployment ng-acc-repository-deployment --replicas=1

9) Verify ACC is working as expected:

- check that all ACC pods are up and running.
- log in to APM and check that the Agent packages are available.
- open Command Center and check that historical information is available.
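
The first check can be scripted. The following is a minimal sketch: the function name acc_pods_not_running is hypothetical, and it assumes that all ACC pod names start with ng-acc-, as in the pod names shown in this article.

```shell
# Reads `kubectl get pods` output on stdin and prints the name and status of
# every ng-acc-* pod whose STATUS column is not "Running".
# Prints nothing when all ACC pods are running.
acc_pods_not_running() {
  awk '$1 ~ /^ng-acc-/ && $3 != "Running" { print $1, $3 }'
}

# Example usage:
#   kubectl -n <namespace> get pods --no-headers | acc_pods_not_running
```

Note that a pod can be in the Running status without being ready, so the READY column is still worth a look.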

 

OPTION 2) Recreate the ACC Postgres database

 
In this case, the following will be lost:
- all existing ACC data
- tenant system data
- the link (id) between the ACC partition and the partition in the datastore partition registration
 
Recreating the pg database is recommended only if the installation is a new one.
 

Additional Information

https://knowledge.broadcom.com/external/article/190815/aiops-troubleshooting-common-issues-and.html#mcetoc_1f7qcopf91v9 

Attachments

DE570003-ng-acc-configserver-db-cli.yml