After upgrading to 4.0.x or 4.1.x, one or more of the following symptoms are seen in the UI.
Checking the status of the platform DB pod by running the following command as root on the CLI of the NSX Manager shows that postgresql-ha-postgresql-0 is in a CrashLoopBackOff state:
# napp-k get pods | grep postgresql-ha-postgresql
nsxi-platform   postgresql-ha-postgresql-0   0/1   CrashLoopBackOff
Note: The results may also include the pods metrics-postgresql-ha-postgresql-0 and metrics-postgresql-ha-postgresql-1. If these are also in a CrashLoopBackOff state, refer to "Metrics pods in crash loopback state on NSX Application Platform" for more information.
If the above command shows more than one platform postgresql server, i.e., you see pod names such as 'postgresql-ha-postgresql-1', please contact GSS.
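To separate the platform DB pods from the metrics DB pods, the list can be filtered as shown below. This is a minimal sketch that assumes napp-k passes its arguments through to kubectl and that the output format matches the example above.

# Sketch: list only the platform DB pods (excludes metrics-postgresql-ha-postgresql-*)
napp-k get pods | grep 'postgresql-ha-postgresql' | grep -v 'metrics'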
A further check of the postgresql-ha-postgresql-0 logs indicates disk space issues such as the following:
# napp-k logs -c postgresql postgresql-ha-postgresql-0 | grep PANIC
PANIC: could not write to file "pg_wal/xlogtemp.154": No space left on device
This has been fixed in NAPP 4.2.0 and beyond.
Step#1: Follow the resolution steps below for the NSX Application Platform Health alarm 'Platform DB Disk usage high/very high' and increase the storage by 10Gi. Since the disk is already exhausted, the pod will not be able to come up on its own and the later steps cannot be carried out until the storage is increased. After increasing the storage, wait for the postgresql-ha-postgresql-0 pod to reach a Running state.
SSH on to one of the NSX Manager nodes.
a. napp-k edit pvc data-postgresql-ha-postgresql-0
b. Change the spec -> resources -> requests -> storage value and save (note that this editor uses the same command structure as vim). A non-interactive alternative is sketched after step (c).
Note: The recommendation is to increase the storage by at least 10Gi. Confirm that the datastore backing the worker nodes has enough available space for the increase in storage.
c. napp-k delete pod postgresql-ha-postgresql-0
Note: This is a safe action and is needed for the storage change to take effect.
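If editing the PVC interactively is inconvenient, the same change can be applied with a patch. This is a minimal sketch, assuming napp-k forwards standard kubectl arguments, that the storage class allows volume expansion, and that 40Gi is the new target size; substitute the size that fits your environment.

# Sketch (assumption: 40Gi is the new target size; adjust as needed)
napp-k patch pvc data-postgresql-ha-postgresql-0 \
  --type merge \
  -p '{"spec":{"resources":{"requests":{"storage":"40Gi"}}}}'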
Step#2: Checking the disk usage on the postgresql-ha-postgresql-0 pod indicates that most of the disk space under /bitnami/postgresql/data is taken up by /bitnami/postgresql/data/pg_wal:
# napp-k exec postgresql-ha-postgresql-0 -- df -h /bitnami/postgresql
Filesystem      Size   Used  Avail  Use%  Mounted on
/dev/sdi         30G  19.8G    10G   67%  /bitnami/postgresql

# napp-k exec postgresql-ha-postgresql-0 -- du -h /bitnami/postgresql/data | sort -hr
19.8G   /bitnami/postgresql/data
19.5G   /bitnami/postgresql/data/pg_wal
The only thing to check and confirm here is that the pg_wal directory is taking up most of the disk space. Note that the disk size was just increased in Step#1, so in this example the original overall disk space was 20Gi and the 30G shown above reflects the increase. The increase made in Step#1 should be able to absorb the additional disk requirements; revisit Step#1 if a further disk space increase is needed.
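To focus on pg_wal alone without listing every directory, the following one-liner reports just that path. This is a sketch under the assumption that the pod name, container name, and mount path match the output shown above.

# Sketch: report only the pg_wal directory size
napp-k exec postgresql-ha-postgresql-0 -c postgresql -- du -sh /bitnami/postgresql/data/pg_wal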
Step#3: Checking the replication details on postgresql indicates the presence of one or more "inactive" replication slots:
napp-k exec -it postgresql-ha-postgresql-0 -c postgresql -- /bin/bash
PGPASSWORD=$POSTGRES_PASSWORD psql -w -U "postgres" -d "postgres" -h 127.0.0.1
select pg_is_in_recovery();
SELECT * FROM pg_replication_slots WHERE active_pid IS NULL OR active_pid NOT IN(SELECT pid FROM pg_stat_replication) AND active IS NOT TRUE\x\g\x
-[ RECORD 1 ]-------+-----------------
slot_name           | repmgr_slot_1001
plugin              |
slot_type           | physical
datoid              |
database            |
temporary           | f
active              | f
active_pid          |
xmin                |
catalog_xmin        |
restart_lsn         | 0/237FB820
confirmed_flush_lsn |
wal_status          | extended
safe_wal_size       |
-[ RECORD 2 ]-------+-----------------
slot_name           | repmgr_slot_1002
plugin              |
slot_type           | physical
datoid              |
database            |
temporary           | f
active              | f
active_pid          |
xmin                |
catalog_xmin        |
restart_lsn         | 0/237FB820
confirmed_flush_lsn |
wal_status          | extended
safe_wal_size       |
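To see how much WAL each slot is holding back, the distance between the current WAL position and each slot's restart_lsn can also be queried. This is a sketch using standard PostgreSQL functions (pg_current_wal_lsn and pg_wal_lsn_diff, available in PostgreSQL 10 and later) and is not part of the documented procedure.

-- Sketch: show WAL retained by each replication slot
SELECT slot_name,
       active,
       restart_lsn,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots;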
Delete the inactive replication slots by following the instructions below.
Execute the following commands on the only platform DB pod, named 'postgresql-ha-postgresql-0', which is also the master node (where 'select pg_is_in_recovery()' returned f).
napp-k exec -it postgresql-ha-postgresql-0 -c postgresql -- /bin/bash
PGPASSWORD=$POSTGRES_PASSWORD psql -w -U "postgres" -d "postgres" -h 127.0.0.1
# in the psql prompt, invoke commands to clean up the above inactive replication slots whose names start with 'repmgr_slot_', e.g.,
SELECT pg_drop_replication_slot('repmgr_slot_1002');
SELECT pg_drop_replication_slot('repmgr_slot_1001');
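If there are several inactive slots, they can be dropped in a single statement instead of one at a time. This is a sketch using standard PostgreSQL functions; the LIKE pattern assumes all leftover slots follow the 'repmgr_slot_' naming shown above.

-- Sketch: drop every inactive repmgr slot in one pass
SELECT slot_name, pg_drop_replication_slot(slot_name)
FROM pg_replication_slots
WHERE slot_name LIKE 'repmgr_slot_%' AND NOT active;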
Wait for a few minutes and re-execute the commands from the Step#2 section.
You should see that the disk usage on the master node has gone down and that the usage is similar for both postgresql pods.
Wait for the other platform pods to recover. If required, delete any pods still in a CrashLoopBackOff state with the command: napp-k delete pod <pod-name>
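If several pods remain stuck in CrashLoopBackOff, they can be deleted in one pass so that they are recreated. This is a sketch that assumes the napp-k output uses the namespace/name/ready/status column layout shown earlier; review the pod list before deleting.

# Sketch: delete all pods currently reported in CrashLoopBackOff
for p in $(napp-k get pods | grep CrashLoopBackOff | awk '{print $2}'); do
  napp-k delete pod "$p"
done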