DX APM Command Center - How to fix corrupted DB data files?


Article ID: 275806


DX Application Performance Management


The DX Platform ACC pods are unable to initialize:

kubectl get pods --namespace dxi -o wide | grep -v Running

ng-acc-configserver-cp-deployment-#######-###     0/1     Init:CrashLoopBackOff   9 (99s ago)     24m              

ng-acc-configserver-db-deployment-#######-###     0/1     CrashLoopBackOff        9 (2m56s ago)   24m         

ng-acc-configserver-deployment-#######-###        0/1     Init:CrashLoopBackOff   9 (108s ago)    24m    


kubectl logs -n dxi ng-acc-configserver-db-deployment-#######-###

Defaulted container "ng-acc-configserver-db-container" out of: ng-acc-configserver-db-container, init-fs (init)

PostgreSQL Database directory appears to contain a database; Skipping initialization

waiting for server to start....2023-11-06 14:19:05.943 UTC [16]: [1] LOG:  pgaudit extension initialized

2023-11-06 14:19:05.944 UTC [16]: [2] LOG:  starting PostgreSQL 12.12 on x86_64-alpine-linux-gnu, compiled by gcc (Alpine 11.2.1_git20220219) 11.2.1 20220219, 64-bit

2023-11-06 14:19:05.944 UTC [16]: [3] LOG:  listening on Unix socket "/run/postgresql/.s.PGSQL.5432"

2023-11-06 14:19:06.064 UTC [17]: [1] LOG:  database system was shut down at 2023-11-05 04:50:29 UTC

2023-11-06 14:19:06.069 UTC [17]: [2] LOG:  invalid resource manager ID in primary checkpoint record

2023-11-06 14:19:06.069 UTC [17]: [3] PANIC:  could not locate a valid checkpoint record

.2023-11-06 14:19:07.582 UTC [16]: [4] LOG:  startup process (PID 17) was terminated by signal 6: Aborted

2023-11-06 14:19:07.582 UTC [16]: [5] LOG:  aborting startup due to startup process failure

2023-11-06 14:19:07.649 UTC [16]: [6] LOG:  database system is shut down

stopped waiting

pg_ctl: could not start server

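Examine the log output: the PANIC line "could not locate a valid checkpoint record" shows that PostgreSQL cannot start from the existing data files.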
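
As a rough aid for spotting this failure in a long log, the check below searches for the PANIC message from the transcript above. It is only a sketch: has_checkpoint_panic is a hypothetical helper name, and it assumes the message text matches the PostgreSQL 12 output shown in this article.

```shell
# Return 0 if the PostgreSQL log on stdin contains the checkpoint failure
# shown above, non-zero otherwise.
# has_checkpoint_panic is a hypothetical helper name, not part of any product.
has_checkpoint_panic() {
  grep -q 'PANIC: *could not locate a valid checkpoint record'
}

# Usage (the pod name is a placeholder):
#   kubectl logs -n dxi <db-pod-name> | has_checkpoint_panic \
#     && echo "checkpoint record is corrupted"
```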


DX Platform 23.x


The DB data files are corrupted, typically because the PG pod was terminated suddenly.



OPTION 1) [RECOMMENDED] Use Postgres tools to fix the corrupted data files

This option requires running pg tools against the corrupted data files.
One approach is to run a NEW pod that mounts the pg data files and provides the pg tools (see the steps below):
    - if it is successful, there is no data loss
    - note that these steps are manual and require k8s/o~s knowledge, because a pod with mounted pg data + tooling is deployed and used manually.

1) Download the attached DE570003-ng-acc-configserver-db-cli.yml deployment, which starts a pod with pg tooling and mounted volumes.

2) Modify the following 4 items in the yml file:

  • <registry>/acc-postgresql:<23.1.0.x> - use the same acc-postgresql image as used in ng-acc-configserver-db
  • namespace - default is dxi; update it as required to match your namespace
  • persistentVolumeClaim - default is dxi; update it as required
  • serviceAccount - default is dxi; update it to dxi-acc
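
The values to enter in the yml file can usually be read from the existing DB deployment. The sketch below is one way to do that with kubectl JSONPath output. show_db_deployment_settings is a hypothetical helper name, and the sketch assumes the DB container is the first container in the pod spec (the "Defaulted container" log line above suggests this) and that the deployment mounts its data through a persistentVolumeClaim.

```shell
# Print the image, PVC name(s) and service account of the existing DB
# deployment so they can be copied into the CLI yml file.
# show_db_deployment_settings is a hypothetical helper name.
# Assumption: the DB container is containers[0] in the pod template.
show_db_deployment_settings() {
  ns="${1:-dxi}"
  dep="ng-acc-configserver-db-deployment"
  printf 'image: '
  kubectl -n "$ns" get deployment "$dep" \
    -o jsonpath='{.spec.template.spec.containers[0].image}'
  printf '\npersistentVolumeClaim: '
  kubectl -n "$ns" get deployment "$dep" \
    -o jsonpath='{.spec.template.spec.volumes[*].persistentVolumeClaim.claimName}'
  printf '\nserviceAccount: '
  kubectl -n "$ns" get deployment "$dep" \
    -o jsonpath='{.spec.template.spec.serviceAccountName}'
  printf '\n'
}

# Usage: show_db_deployment_settings <your-namespace>
```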

3) Save it and deploy it using k8s cli or UI:

    kubectl -n <namespace> apply -f <yml file>

4) Fix the Postgres DB:

a) scale down ng-acc-configserver-db deployment

kubectl -n <your-namespace> scale deployment ng-acc-configserver-db-deployment --replicas=0

b) scale up the NEW ng-acc-configserver-db-cli deployment: this creates a new pod that has the Postgres tooling and mounts the Postgres directories

kubectl -n <your-namespace> scale deployment ng-acc-configserver-db-cli --replicas=1

- open a terminal in the ng-acc-configserver-db-cli deployment's pod:

     kubectl -n <your-namespace> exec -it <your ng-acc-configserver-db-cli pod id> -- /bin/bash

- run the pg_resetwal command:

     pg_resetwal /var/lib/postgresql/data/

       the command reports any issues it finds and fixes; otherwise, the expected output is: "Write-ahead log reset"
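
Before resetting the WAL, it may help to preview what pg_resetwal would change. The sketch below uses the tool's -n (dry run) option, which prints the values it would write and exits without modifying anything. find_cli_pod and resetwal_dry_run are hypothetical helper names, and the pod lookup assumes the pod name contains "ng-acc-configserver-db-cli".

```shell
# Find the name of the ng-acc-configserver-db-cli pod in a namespace.
# find_cli_pod is a hypothetical helper name.
find_cli_pod() {
  kubectl -n "${1:-dxi}" get pods -o name \
    | grep 'ng-acc-configserver-db-cli' | head -n 1
}

# Dry run: print what pg_resetwal would change without modifying the data.
# resetwal_dry_run is a hypothetical helper name.
resetwal_dry_run() {
  ns="${1:-dxi}"
  pod="$(find_cli_pod "$ns")"
  if [ -z "$pod" ]; then
    echo "no ng-acc-configserver-db-cli pod found in $ns" >&2
    return 1
  fi
  kubectl -n "$ns" exec "$pod" -- pg_resetwal -n /var/lib/postgresql/data/
}

# Usage: resetwal_dry_run <your-namespace>
```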

5) scale down the deployment ng-acc-configserver-db-cli

kubectl -n <your-namespace> scale deployment ng-acc-configserver-db-cli --replicas=0

6) scale up ng-acc-configserver-db deployment

kubectl -n <your-namespace> scale deployment ng-acc-configserver-db-deployment --replicas=1

7) check the ng-acc-configserver-db pod log and verify that the DB is up now:

the expected message is "database system is ready to accept connections"

kubectl -n <your-namespace> logs ng-acc-configserver-db-deployment-####-### 
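
Instead of re-running the log command by hand, the check can be polled. The following is a minimal sketch, assuming the logs can be read through the deployment name. db_is_ready, wait_for_db and the SLEEP_SECONDS variable are hypothetical names introduced here for illustration.

```shell
# Return 0 if the DB deployment's log contains the readiness message.
# db_is_ready is a hypothetical helper name.
db_is_ready() {
  kubectl -n "${1:-dxi}" logs deployment/ng-acc-configserver-db-deployment 2>/dev/null \
    | grep -q 'database system is ready to accept connections'
}

# Poll until the DB is ready or the number of tries is used up.
# $1 = namespace (default dxi), $2 = number of tries (default 30).
# SLEEP_SECONDS (hypothetical variable) sets the pause between tries.
wait_for_db() {
  ns="${1:-dxi}"
  tries="${2:-30}"
  while [ "$tries" -gt 0 ]; do
    db_is_ready "$ns" && return 0
    tries=$((tries - 1))
    sleep "${SLEEP_SECONDS:-10}"
  done
  return 1
}

# Usage: wait_for_db <your-namespace> && echo "DB is up"
```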

8) scale up the rest of the ACC pods

kubectl -n <namespace> scale deployment ng-acc-configserver-cp-deployment --replicas=1
kubectl -n <namespace> scale deployment ng-acc-configserver-deployment --replicas=1
kubectl -n <namespace> scale deployment ng-acc-repository-deployment --replicas=1
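
The three scale commands can also be run as a loop that waits for each rollout to finish. This is only a sketch: scale_up_acc is a hypothetical helper name, the deployment names are inferred from the pod names in the listing at the top of this article, and the 300s timeout is an arbitrary choice.

```shell
# Scale each remaining ACC deployment to 1 replica and wait for its rollout.
# scale_up_acc is a hypothetical helper name.
# Assumption: deployment names match the pod name prefixes shown earlier.
scale_up_acc() {
  ns="${1:-dxi}"
  for dep in ng-acc-configserver-cp-deployment \
             ng-acc-configserver-deployment \
             ng-acc-repository-deployment; do
    kubectl -n "$ns" scale deployment "$dep" --replicas=1 || return 1
    kubectl -n "$ns" rollout status deployment "$dep" --timeout=300s || return 1
  done
}

# Usage: scale_up_acc <namespace>
```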

9) Verify ACC is working as expected:

- check that all acc pods are up and running.
- log in to APM and check that the Agent packages are available
- open Command Center, check that historical information is available.
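
The first check can be narrowed to the ACC pods only. The sketch below lists ACC pods whose STATUS column is not Running. acc_pods_not_running is a hypothetical helper name, and it assumes that ACC pod names start with "ng-acc-", as in the listing at the top of this article. An empty result suggests all ACC pods are running, although a Running pod can still be not ready, so the READY column is worth a look as well.

```shell
# Print the name and status of every ACC pod that is not in Running state.
# acc_pods_not_running is a hypothetical helper name.
# Assumption: ACC pod names start with "ng-acc-".
acc_pods_not_running() {
  kubectl -n "${1:-dxi}" get pods --no-headers \
    | awk '$1 ~ /^ng-acc-/ && $3 != "Running" { print $1, $3 }'
}

# Usage: acc_pods_not_running <namespace>
```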


OPTION 2) Recreate the ACC Postgres database

In this case, the following will be lost:
- all existing ACC data
- tenant system data
- the link (id) between the ACC partition and the partition in the datastore partition registration
Recreating the pg database is therefore recommended only if the installation is a new one.

Attachment: DE570003-ng-acc-configserver-db-cli.yml