1. One of the following issues is observed on the NSX UI:
Metrics services will be down.
2. Checking the status of the pods by running the following command as root on the NSX Manager CLI shows the metrics-postgresql-ha-postgresql-0 and metrics-postgresql-ha-postgresql-1 pods in a CrashLoopBackOff state:
napp-k get pods | grep metrics-postgresql-ha-postgresql
nsxi-platform   metrics-postgresql-ha-postgresql-0   0/1   CrashLoopBackOff
nsxi-platform   metrics-postgresql-ha-postgresql-1   0/1   CrashLoopBackOff
3. A further check of the metrics-postgresql-ha-postgresql-0 logs indicates disk space issues:
napp-k logs -c postgresql metrics-postgresql-ha-postgresql-0 | grep PANIC
or
napp-k logs -c postgresql metrics-postgresql-ha-postgresql-1 | grep PANIC

would show an entry similar to the one below:
PANIC: could not write to file "pg_wal/xlogtemp.154": No space left on device
One or more inactive replication slots cause WAL files to build up, eventually leading to "No space left on device" errors that crash the PostgreSQL server.
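To illustrate the cause, the amount of WAL held back by each replication slot can be checked with a query like the sketch below, run with psql on the master pod as described in Step #3 of the workaround. This query is illustrative only and is not part of the documented procedure; note that pg_current_wal_lsn() will error if run on the standby.

-- Sketch: show how much WAL each slot is retaining (master only)
SELECT slot_name,
       active,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots;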
This will be fixed in future NSX-T releases.
Workaround:
Step #1: Follow the resolution steps below for the NSX Application Platform Health alarm "Metrics Disk usage high/very high" and increase the storage by 10Gi. Since the disk is already exhausted, the pods cannot come up to carry out the later steps until the storage is increased. Wait for the metrics-postgresql-ha-postgresql pods to be in a Running state before proceeding.
a) napp-k edit pvc data-metrics-postgresql-ha-postgresql-0
b) Change the spec->resources->requests->storage value and save (note: this editor uses the same commands as vi/VIM). An example of the relevant spec fragment is shown at the end of Step #1.
Note: The recommendation is to increase the storage by at least 10Gi.
Please confirm the datastore backing the worker nodes has enough available space for the storage increase.
c) napp-k delete pod metrics-postgresql-ha-postgresql-0
Note: This is a safe action and is needed for the storage change to take effect.
d) Repeat steps a, b and c for the second replica.
Note: For steps a and b use the PVC name data-metrics-postgresql-ha-postgresql-1, while for step c use the pod name metrics-postgresql-ha-postgresql-1.
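For reference, the field changed in step b looks similar to the fragment below once the PVC opens in the editor (other fields omitted). The 20Gi value is only an illustrative assumption for an original 10Gi request; use your environment's current value plus at least 10Gi.

spec:
  resources:
    requests:
      storage: 20Gi   # example only: previous size increased by at least 10Gi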
Step #2: Checking the disk usage on the PostgreSQL master pod indicates that most of the disk space under /bitnami/postgresql/data is taken up by /bitnami/postgresql/data/pg_wal, while the disk usage of the standby pod is comparatively very low.
napp-k exec metrics-postgresql-ha-postgresql-0 -- du -h /bitnami/postgresql/data | sort -hr
9.8G    /bitnami/postgresql/data
9.5G    /bitnami/postgresql/data/pg_wal

napp-k exec metrics-postgresql-ha-postgresql-1 -- du -h /bitnami/postgresql/data | sort -hr
367M    /bitnami/postgresql/data
NOTE: The disk usage stats can be the opposite if metrics-postgresql-ha-postgresql-1 is the master. The point to confirm here is that the disk usage of one of the pods is much higher than the other and that the pg_wal directory is taking up most of that space.
If you do not see a major difference between the disk usage of the two pods, you can stop here; this indicates the disk usage is due to a high scale of metrics in your environment and is expected. The storage increase from Step #1 should take care of the increased disk requirements, and you can revisit Step #1 if a further disk space increase is needed.
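As an optional extra check (not part of the original steps), overall usage and free space on the data volume can be viewed with df, assuming the df utility is available in the postgresql container image:

napp-k exec metrics-postgresql-ha-postgresql-0 -- df -h /bitnami/postgresql
napp-k exec metrics-postgresql-ha-postgresql-1 -- df -h /bitnami/postgresql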
Step #3: Checking the replication details in PostgreSQL indicates the presence of an "inactive" replication slot.
1. napp-k exec -it metrics-postgresql-ha-postgresql-0 bash
2. PGPASSWORD=$POSTGRES_PASSWORD psql -w -U "postgres" -d "postgres" -h 127.0.0.1
3. select pg_is_in_recovery();
If this returns f you are on the master node; execute #4. Otherwise exit and repeat #1 to #3 for metrics-postgresql-ha-postgresql-1, followed by #4 if #3 returns f.
4. SELECT * FROM pg_replication_slots WHERE active_pid IS NULL OR active_pid NOT IN(SELECT pid FROM pg_stat_replication) AND active IS NOT TRUE\x\g\x
The output should look similar to the one below. In case you see one or more entries in the response, proceed to the remediation steps in Step #4.
-[ RECORD 1 ]-------+-----------------
slot_name           | repmgr_slot_1002
plugin              |
slot_type           | physical
datoid              |
database            |
temporary           | f
active              | f
active_pid          |
xmin                |
catalog_xmin        |
restart_lsn         | 0/237FB820
confirmed_flush_lsn |
wal_status          | extended
safe_wal_size       |
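Tip: if preferred, the recovery check from #1 to #3 above can be run as a single non-interactive command. The sketch below assumes the same pod name and POSTGRES_PASSWORD environment variable used in the steps above:

napp-k exec -it metrics-postgresql-ha-postgresql-0 -- bash -c 'PGPASSWORD=$POSTGRES_PASSWORD psql -w -U postgres -d postgres -h 127.0.0.1 -c "select pg_is_in_recovery();"'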
Step #4: Delete the inactive replication slots following the instructions below.
Execute the following commands from the master node, i.e. the one where select pg_is_in_recovery() returned f.
1. napp-k exec -it <master-postgresql-pod-name> bash
   e.g. napp-k exec -it metrics-postgresql-ha-postgresql-0 bash
2. PGPASSWORD=$POSTGRES_PASSWORD psql -w -U "postgres" -d "postgres" -h 127.0.0.1
3. /* Function checks for inactive replication slots and drops them */
CREATE OR REPLACE FUNCTION clear_inactive_replication_slots() RETURNS void AS $$
DECLARE
    slot_names varchar;
BEGIN
    FOR slot_names IN
        SELECT slot_name FROM pg_replication_slots
        WHERE active_pid IS NULL
           OR active_pid NOT IN (SELECT pid FROM pg_stat_replication) AND active IS NOT TRUE
    LOOP
        RAISE INFO 'Deleting inactive replication slot %', slot_names;
        PERFORM pg_drop_replication_slot(slot_names);
    END LOOP;
END;
$$ LANGUAGE plpgsql;

/* Execute the inactive replication slot cleanup */
SELECT clear_inactive_replication_slots();
You should see a response similar to the one below.
INFO: Deleting inactive replication slot repmgr_slot_1002
clear_inactive_replication_slots
----------------------------------
(1 row)
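Alternatively, the same cleanup can be done without creating a function. The single statement below is a sketch (not the documented method) that drops every slot matching the inactive condition from Step #3; run it on the master only:

SELECT slot_name, pg_drop_replication_slot(slot_name)
FROM pg_replication_slots
WHERE active_pid IS NULL OR active_pid NOT IN (SELECT pid FROM pg_stat_replication) AND active IS NOT TRUE;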
Step #5: Wait for a few minutes and execute the commands from the Step #2 section again.
You should see that the disk usage of the master node has gone down and that the disk usage of both PostgreSQL pods is now similar.
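Optionally, to confirm the cleanup worked, a slot check like the sketch below can be re-run on the master via the same psql session as in Step #3; an empty result means no inactive replication slots remain.

SELECT slot_name, active FROM pg_replication_slots WHERE active IS NOT TRUE;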
Step #6: Wait for the other metrics pods to recover. If required, delete any pods still in a CrashLoopBackOff state with the command: napp-k delete pod <pod-name>
The metrics service on the UI should be UP.