NAPP Platform PostgreSQL Pod Issue After Deleting NSX Intelligence
All versions of NAPP up to 4.2.0.
WAL files can build up in the platform PostgreSQL server when NSX Intelligence is deactivated without cleaning up the Debezium logical replication slot it created. Activating NSX Intelligence creates a logical replication slot; if NSX Intelligence is later deactivated without removing that slot, the slot remains inactive, PostgreSQL keeps retaining WAL for it, and the WAL files continue to grow until they fill the disk.
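Once the database pod is reachable again (after Step #1), the retention caused by the slot can be quantified from the psql prompt described in Step #3. The following query is an optional diagnostic sketch, not part of the official procedure; it reports approximately how much WAL each slot is holding back (the wal_status column exists only on PostgreSQL 13 and later and can be omitted on older versions):
SELECT slot_name, active, wal_status, pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal FROM pg_replication_slots;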
Step #1
Follow the resolution steps below for the NSX Application Platform Health alarm "Platform DB Disk usage high/very high" and increase the storage by 10Gi. Because the disk is already exhausted, the pods cannot come up to carry out the later steps until the storage is increased. Wait for the postgresql-ha-postgresql-0 pod to reach the Running state.
Make sure NSX Intelligence is not active in the system.
SSH to one of the NSX Manager nodes.
As the root user, execute the following commands:
a. napp-k edit pvc data-postgresql-ha-postgresql-0
b. Change the spec->resources->requests->storage value and save (note that this editor uses the same commands as vi).
Note: The recommendation is to increase the storage by at least 10Gi. Confirm that the datastore backing the worker nodes has enough available space for the increase.
c. napp-k delete pod postgresql-ha-postgresql-0
Note: This is a safe action and is needed for the storage change to take effect.
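To verify that the resize was picked up, you can check the PVC capacity and pod status from the same NSX Manager shell. These are standard kubectl subcommands and are assumed to be available through the napp-k wrapper; they are shown here as an optional check, not part of the official procedure:
napp-k get pvc data-postgresql-ha-postgresql-0
napp-k get pod postgresql-ha-postgresql-0
The CAPACITY column should reflect the new size once the volume expansion completes, and the pod STATUS should return to Running.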
Step #2
Checking the disk usage on the postgresql-ha-postgresql-0 pod shows that most of the space under /bitnami/postgresql/data is taken up by /bitnami/postgresql/data/pg_wal:
# napp-k exec postgresql-ha-postgresql-0 -- df -h /bitnami/postgresql
Filesystem Size Used Avail Use% Mounted on
/dev/sdi 30G 19.8G 10G 67% /bitnami/postgresql
# napp-k exec postgresql-ha-postgresql-0 -- du -h /bitnami/postgresql/data | sort -hr
19.8G /bitnami/postgresql/data
19.5G /bitnami/postgresql/data/pg_wal
The only thing to check and confirm here is that the pg_wal directory is taking up most of the disk space. Note that the disk size was just increased by 10Gi in Step #1, so the 30G shown here corresponds to an original overall disk size of 20Gi.
The increase made in Step #1 should cover the additional disk requirements. If further disk space is still needed at this point, revisit Step #1.
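As an optional cross-check, you can count the WAL segment files directly; each segment is 16MB by default, so the segment count multiplied by 16MB should roughly match the du figure above. This read-only command is a sketch and not part of the official procedure:
napp-k exec postgresql-ha-postgresql-0 -- bash -c 'ls /bitnami/postgresql/data/pg_wal | wc -l'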
Step #3
Checking the replication details on PostgreSQL reveals the presence of an "inactive" replication slot.
napp-k exec -it postgresql-ha-postgresql-0 -c postgresql -- /bin/bash
PGPASSWORD=$POSTGRES_PASSWORD psql -w -U "postgres" -d "postgres" -h 127.0.0.1
select pg_is_in_recovery();
Since there is only one PostgreSQL server running, this should return 'f', indicating that it is the master node.
SELECT * FROM pg_replication_slots WHERE active_pid IS NULL OR active_pid NOT IN(SELECT pid FROM pg_stat_replication) AND active IS NOT TRUE\x\g\x
The output should look similar to the one below. If you see one or more entries in the response, proceed with the remediation steps in Step #4.
-[ RECORD 1 ]-------+-----------------
slot_name | debezium
plugin | pgoutput
slot_type | logical
datoid | 16577
database | pace
temporary | f
active | f
active_pid |
xmin |
catalog_xmin | 22306
restart_lsn | 0/BE0D368
confirmed_flush_lsn | 0/237FB820
wal_status | reserved
safe_wal_size |
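In the sample record above, safe_wal_size is empty, which on PostgreSQL 13 and later indicates that max_slot_wal_keep_size is set to -1 (unlimited), so WAL keeps accumulating for as long as the slot remains. If you want to confirm the setting, it can be checked from the same psql session; this is an optional check, not part of the official procedure:
SHOW max_slot_wal_keep_size;
A value of -1 means slot retention is unlimited.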
Step #4
Delete the inactive replication slot by following the instructions below.
Execute the following commands on the platform DB pod named 'postgresql-ha-postgresql-0', which is also the master node where 'select pg_is_in_recovery()' returned 'f'.
napp-k exec -it postgresql-ha-postgresql-0 -c postgresql -- /bin/bash
PGPASSWORD=$POSTGRES_PASSWORD psql -w -U "postgres" -d "postgres" -h 127.0.0.1
# In the psql prompt, drop the inactive replication slot identified above (named "debezium" in this example):
SELECT pg_drop_replication_slot('debezium');
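After dropping the slot, it is worth confirming that it no longer appears, and optionally forcing a checkpoint so that PostgreSQL recycles the old WAL segments sooner (WAL removal normally happens at the next checkpoint). This is a sketch of an optional verification, not a required step:
SELECT slot_name FROM pg_replication_slots;
CHECKPOINT;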
Step #5
Wait for a few minutes and re-run the commands from the Step #2 section.
You should see that the disk usage on the node has gone down.
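As a quicker spot check, the pg_wal usage alone can be queried from the NSX Manager shell; the figure should be substantially lower than the 19.5G seen earlier once old segments have been recycled:
napp-k exec postgresql-ha-postgresql-0 -- du -sh /bitnami/postgresql/data/pg_wal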
Another scenario that can lead to the same symptom described in this article occurs after users delete NSX Intelligence from their setup. When this happens, an inactive replication slot can remain in the platform PostgreSQL, which leads to the pod's storage being filled by WAL files.
https://knowledge.broadcom.com/external/article?legacyId=97178