ALERT: Some images may not load properly within the Knowledge Base Article. If you see a broken image, please right-click and select 'Open image in a new tab'. We apologize for this inconvenience.

AIOps - Unable to login to DX due to postgres database corruption

book

Article ID: 184125

calendar_today

Updated On:

Products

DX Operational Intelligence DX Application Performance Management

Issue/Introduction

Unable to login to DX UI

“Failed to login. Verify that user ID and password are correct”

Cause

Corruption of the Postgres

Possible reasons:

1) NFS service stop working 

2) Out of disk space in NFS 

Symptoms:

Use-case #1 - postgres pod is in "CrashLoopBackOff"

Postgres pod is not running and you cannot ssh the pod:

kubectl get pods | grep postgres


Use-case #2: some of the database are corrupted, in this example "dsp_db" 

postgres pod is in "Running" status:

kubectl get pods | grep postgres

In the postgres log, there are messages indicating that dsp_db database is corrupted

cd <NFS>/axaservices/pg-data/userdata/pg_log
tail -f postgresql-<day>.log

 

NOTE: You can to list all the databases in postgres as below:

a) obtain the postgres pod name
kubectl get pods -ndxi | grep post
postgresql-77c878cc47-76hwm                          1/1       Running       0          26s

b) login to pod
kubectl exec -it postgresql-77c878cc47-76hwm -ndxi bash

c) list databases:

psql -U postgres -d postgres
postgres=# \l
                                 List of databases
    Name    |  Owner   | Encoding |  Collate   |   Ctype    |   Access privileges
------------+----------+----------+------------+------------+-----------------------
 aoplatform | aopuser  | UTF8     | en_US.utf8 | en_US.utf8 | =Tc/aopuser          +
            |          |          |            |            | aopuser=CTc/aopuser
 apmpe      | apmpe    | UTF8     | en_US.utf8 | en_US.utf8 |
 cpa        | aopuser  | UTF8     | en_US.utf8 | en_US.utf8 | =Tc/aopuser          +
            |          |          |            |            | aopuser=CTc/aopuser
 doi        | aopuser  | UTF8     | en_US.utf8 | en_US.utf8 | =Tc/aopuser          +
            |          |          |            |            | aopuser=CTc/aopuser
 dsp_db     | aopuser  | UTF8     | en_US.utf8 | en_US.utf8 | =Tc/aopuser          +
            |          |          |            |            | aopuser=CTc/aopuser
 dxi        | dxi      | UTF8     | en_US.utf8 | en_US.utf8 |
 postgres   | postgres | UTF8     | en_US.utf8 | en_US.utf8 |
 template0  | postgres | UTF8     | en_US.utf8 | en_US.utf8 | =c/postgres          +
            |          |          |            |            | postgres=CTc/postgres
 template1  | postgres | UTF8     | en_US.utf8 | en_US.utf8 | =c/postgres          +
            |          |          |            |            | postgres=CTc/postgres
(9 rows)

postgres=# \q

Environment

DX Operational Intelligence 1.3.x, 20.x +
DX Application Performance Management 11.x, 20.x+

Resolution

Recommendation #1: Make sure NFS service is up and running

a) make sure nfs server is up and running:

systemctl status nfs-server

if not:
systemctl restart nfs

b) showmount should return list all nfs server and clients as defined in the /etc/exports, run below command from all openshift/k8s nodes:

showmount -e <NFS server>

if you get the below error then it means that there are some communication or network issues:

clnt_create: RPC: Port mapper failure - Unable to receive: errno 113 (No route to host)

make sure your firewall is configured correctly in the NFS server:

firewall-cmd --permanent --add-service=nfs
firewall-cmd --permanent --add-service=mountd
firewall-cmd --permanent --add-service=rpc-bind
firewall-cmd --reload

Recommendation #2: Restore the postgres database backup

IMPORTANT: one of the best practices is to regularly backup the postgres database in case of pogres db corruption:

https://techdocs.broadcom.com/us/en/ca-enterprise-software/it-operations-management/dx-platform-on-premise/20-2/installing/Upgrade-DX-Platform/Pre-Upgrade-Tasks.html#concept.dita_b213bb9e-a8ba-4a31-a1e2-d16aed126a1d_Backupdatabase

 
Use-case #1 - postgres pod is in "CrashLoopBackOff"
 
Reset PostgreSQL Transaction Logs (Xlogs) :
This will allow to clean up the transaction log as it is prevents posgres to startup, resetting the transaction log may allow PostgreSQL to start again.
 
1. Go to Openshift Console | Applications | pods
2. Open posgres pod 
3. Click "Debug in Terminal" as illustrated below
4. Enter:
which pg_resetxlog
pg_resetxlog -D /var/lib/pgsql/data/userdata
 
 
5. delete the pod so a new one is created
Click Actions > Delete
 
 
6. Verification:
 
a) Check postgres pod is running:
 
oc get pods | grep postgres
 
b) Open the new postgres pod | Logs, check there is not errors, for example:
 
 
c) Go to the Terminal and check you can see all database, especially the aoplatform database
 
\l
\c aoplatform aopuser
select * from mcc_tenant;
 
 
 
Use-case #2: Some of the databases are corrupted, in this example "dsp_db" 
 
1. Perform a backup:
 
a) Backup "aoplatform" database (this is a main and primary database that store all users and tenants information). 

a) obtain the postgres pod name
kubectl get pods -ndxi | grep postgres

b) connect to posgres pod
kubectl exec -it <postgresql-pod> -ndxi bash

c) backup aoplatform
pg_dump -U aopuser -F p aoplatform > aoplatform.sql
 
d) exit from pod

e) copy aoplatform.sql  from pod to local system for example to /
Example: kubectl cp dxi/<postgresql-pod>:/opt/app-root/src/aoplatform.sql /

b) Scale down to zero the postgresql deployment

kubectl scale deploy postgresql --replicas=0 -ndxi
Or using Kubernetes Console

 
c) Backup the entire postgres databases 

mv <NFS>/axaservices/pg-data/  <NFS>/axaservices/pg-data.bck

4. Recreate the databases at their original state
 
a) Created a new folder <NFS>/axaservices/pg-data and grant 26:26 permission

mkdir <NFS>/axaservices/pg-data
chown -R 26:26 <NFS>/axaservices/pg-data

c) Scale up the postgresql deployment using kubectl or kubernetes console:

kubectl scale deploy postgresql --replicas=1 -ndxi

At this point a new set of databases will be created
 
d) Delete the newly created aoplatform database:
 
dropdb aoplatform;

NOTE: you might not be able to delete the database because sessions are in use, in this case you can proceed as below:

   1) obtain the postgres pod name
   kubectl get pods -ndxi | grep post

   2) connect to posgres pod
   kubectl exec -it <postgresql-pod> -ndxi bash

   3) Identify open session
   psql -U postgres -d postgres

   SELECT * FROM pg_stat_activity;

   4) \q
   5) run:    kill <all the pids sessions> 

e) Restore the orignial aoplatform database that was not damaged and includes all your users and tenants information.

1) connect to posgres pod
kubectl exec -it <postgresql-pod> -ndxi bash

2) copy aoplatform.sql from local system to new pod
Example: kubectl cp /aoplatform.sql dxi/<postgresql-pod>:/opt/app-root/src/

3) restore aoplatform
createdb aoplatform;
psql -U aopuser -d aoplatform -f aoplatform.sql

 
NOTE: If you need to restore the original postgres database:
- scale down the postgres deployment
- restore the postgres pg-data fodler backup, in this example: /nfs/ca/dxi/axaservices/pg-data.bck
- scale up postgres to revert back all above changes.
 
5. Backup up postgres as indicated in documentation
 
 
 

Additional Information


DX - Unable to login to tenant or masteradmin console :
https://knowledge.broadcom.com/external/article/185182

Attachments