AIOps - Unable to login to DX due to postgres database corruption

Products

DX Operational Intelligence DX Application Performance Management

Issue/Introduction

Unable to login to DX UI

“Failed to login. Verify that user ID and password are correct”

Environment

DX Platform 2x

Cause

Corruption of the Postgres

Possible reasons:

1) NFS service stop working

2) Out of disk space in NFS

Symptoms:

Use-case #1 - postgres pod is in "CrashLoopBackOff"

Postgres pod is not running and you cannot ssh the pod:

kubectl get pods | grep postgres

Use-case #2: some of the database are corrupted, in this example "dsp_db"

postgres pod is in "Running" status:

kubectl get pods | grep postgres

In the postgres log, there are messages indicating that dsp_db database is corrupted

cd <NFS>/axaservices/pg-data/userdata/pg_log
tail -f postgresql-<day>.log

Resolution

Recommendation #1: Make sure NFS service is up and running

a) make sure nfs server is up and running:

systemctl status nfs-server

if not:
systemctl restart nfs

b) showmount should return list all nfs server and clients as defined in the /etc/exports, run below command from all openshift/k8s nodes:

showmount -e <NFS server>

if you get the below error then it means that there are some communication or network issues:

clnt_create: RPC: Port mapper failure - Unable to receive: errno 113 (No route to host)

make sure your firewall is configured correctly in the NFS server:

firewall-cmd --permanent --add-service=nfs
firewall-cmd --permanent --add-service=mountd
firewall-cmd --permanent --add-service=rpc-bind
firewall-cmd --reload

Recommendation #2: Restore the postgres database backup

IMPORTANT: one of the best practices is to regularly backup the postgres database in case of pogres db corruption:

https://techdocs.broadcom.com/us/en/ca-enterprise-software/it-operations-management/dx-platform-on-premise/20-2/installing/Upgrade-DX-Platform/Pre-Upgrade-Tasks.html#concept.dita_b213bb9e-a8ba-4a31-a1e2-d16aed126a1d_Backupdatabase

Use-case #1 - postgres pod is in "CrashLoopBackOff"

Reset PostgreSQL Transaction Logs (Xlogs) :

This will allow to clean up the transaction log as it is prevents posgres to startup, resetting the transaction log may allow PostgreSQL to start again.

1. Go to Openshift Console | Applications | pods

2. Open posgres pod

3. Click "Debug in Terminal" as illustrated below

4. Enter:

which pg_resetxlog
pg_resetxlog -D /var/lib/pgsql/data/userdata

5. delete the pod so a new one is created

Click Actions > Delete

6. Verification:

a) Check postgres pod is running:

oc get pods | grep postgres

b) Open the new postgres pod | Logs, check there is not errors

c) Go to the Terminal and check you can see all database, especially the aoplatform database

\l
\c aoplatform aopuser
select * from mcc_tenant;

Use-case #2: Some of the databases are corrupted, in this example "dsp_db"

1. Perform a backup:

a) Backup "aoplatform" database (this is a main and primary database that store all users and tenants information).

a) obtain the postgres pod name
kubectl get pods -ndxi | grep postgres

b) connect to posgres pod
kubectl exec -it <postgresql-pod> -ndxi bash

c) backup aoplatform
pg_dump -U aopuser -F p aoplatform > aoplatform.sql

d) exit from pod

e) copy aoplatform.sql from pod to local system for example to /
Example: kubectl cp dxi/<postgresql-pod>:/opt/app-root/src/aoplatform.sql /

b) Scale down to zero the postgresql deployment

kubectl scale deploy postgresql --replicas=0 -ndxi
Or using Kubernetes Console

c) Backup the entire postgres databases

mv <NFS>/axaservices/pg-data/ <NFS>/axaservices/pg-data.bck

4. Recreate the databases at their original state

a) Created a new folder <NFS>/axaservices/pg-data and grant 26:26 permission

mkdir <NFS>/axaservices/pg-data
chown -R 26:26 <NFS>/axaservices/pg-data

c) Scale up the postgresql deployment using kubectl or kubernetes console:

kubectl scale deploy postgresql --replicas=1 -ndxi

At this point a new set of databases will be created

d) Delete the newly created aoplatform database:

dropdb aoplatform;

NOTE: you might not be able to delete the database because sessions are in use, in this case you can proceed as below:

1) obtain the postgres pod name
kubectl get pods -ndxi | grep post

2) connect to posgres pod
kubectl exec -it <postgresql-pod> -ndxi bash

3) Identify open session
psql -U postgres -d postgres

SELECT * FROM pg_stat_activity;

4) \q
5) run: kill <all the pids sessions>

e) Restore the orignial aoplatform database that was not damaged and includes all your users and tenants information.

1) connect to posgres pod
kubectl exec -it <postgresql-pod> -ndxi bash

2) copy aoplatform.sql from local system to new pod
Example: kubectl cp /aoplatform.sql dxi/<postgresql-pod>:/opt/app-root/src/

3) restore aoplatform
createdb aoplatform;
psql -U aopuser -d aoplatform -f aoplatform.sql

NOTE: If you need to restore the original postgres database:
- scale down the postgres deployment
- restore the postgres pg-data fodler backup, in this example: /nfs/ca/dxi/axaservices/pg-data.bck
- scale up postgres to revert back all above changes.

5. Backup up postgres as indicated in documentation

https://techdocs.broadcom.com/us/en/ca-enterprise-software/it-operations-management/dx-platform-on-premise/20-2/installing/Upgrade-DX-Platform/Pre-Upgrade-Tasks.html#concept.dita_b213bb9e-a8ba-4a31-a1e2-d16aed126a1d_Backupdatabase

Additional Information

DX - Unable to login to tenant or masteradmin console :
https://knowledge.broadcom.com/external/article/185182