After upgrading to 4.0.x or 4.1.x, one or more of the following symptoms are seen in the UI.
Checking the status of the platform DB pod by running the following command as root on the CLI of the NSX Manager shows that postgresql-ha-postgresql-0 is in a CrashLoopBackOff state:
# napp-k get pods | grep postgresql-ha-postgresql
nsxi-platform   postgresql-ha-postgresql-0   0/1   CrashLoopBackOff
Note: The results may also include the pods metrics-postgresql-ha-postgresql-0 and metrics-postgresql-ha-postgresql-1. If these are also in a CrashLoopBackOff state, refer to "Metrics pods in crash loopback state on NSX Application Platform" for more information.
If the above command shows more than one platform postgresql server, i.e., you see pod names such as 'postgresql-ha-postgresql-1', please contact GSS.
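To separate the platform DB pods from the metrics DB pods, the list can be filtered as shown below. This is a minimal sketch that assumes napp-k passes its arguments through to kubectl and that the output format matches the example above.

# Sketch: list only the platform DB pods (excludes metrics-postgresql-ha-postgresql-*)
napp-k get pods | grep 'postgresql-ha-postgresql' | grep -v 'metrics'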
A further check of the postgresql-ha-postgresql-0 logs indicates disk space issues such as the following:
# napp-k logs -c postgresql postgresql-ha-postgresql-0 | grep PANIC
PANIC: could not write to file "pg_wal/xlogtemp.154": No space left on device
This has been fixed in NAPP 4.2.0 and beyond.
Step#1: Follow the resolution steps below for the NSX Application Platform Health alarm 'Platform DB Disk usage high/very high' and increase the storage by 10Gi. Since the disk is already exhausted, the pod will not be able to come up on its own and the later steps cannot be carried out until the storage is increased. After increasing the storage, wait for the postgresql-ha-postgresql-0 pod to reach a Running state.
SSH on to one of the NSX Manager nodes.
a. napp-k edit pvc data-postgresql-ha-postgresql-0
b. Change the spec -> resources -> requests -> storage value and save (note that this editor uses the same command structure as vim). A non-interactive alternative is sketched after step (c).
Note: The recommendation is to increase the storage by at least 10Gi. Confirm that the datastore backing the worker nodes has enough available space for the increase in storage.
c. napp-k delete pod postgresql-ha-postgresql-0
Note: This is a safe action and is needed for the storage change to take effect.
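If editing the PVC interactively is inconvenient, the same change can be applied with a patch. This is a minimal sketch, assuming napp-k forwards standard kubectl arguments, that the storage class allows volume expansion, and that 40Gi is the new target size; substitute the size that fits your environment.

# Sketch (assumption: 40Gi is the new target size; adjust as needed)
napp-k patch pvc data-postgresql-ha-postgresql-0 \
  --type merge \
  -p '{"spec":{"resources":{"requests":{"storage":"40Gi"}}}}'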
Step#2: Checking the disk usage on the postgresql-ha-postgresql-0 pod indicates that most of the disk space under /bitnami/postgresql/data is taken up by /bitnami/postgresql/data/pg_wal:
# napp-k exec postgresql-ha-postgresql-0 -- df -h /bitnami/postgresql
Filesystem      Size   Used  Avail  Use%  Mounted on
/dev/sdi         30G  19.8G    10G   67%  /bitnami/postgresql

# napp-k exec postgresql-ha-postgresql-0 -- du -h /bitnami/postgresql/data | sort -hr
19.8G   /bitnami/postgresql/data
19.5G   /bitnami/postgresql/data/pg_wal
The only thing to check and confirm here is that the pg_wal directory is taking up most of the disk space. Note that the disk size was just increased in Step#1, so in this example the original overall disk space was 20Gi and the 30G shown above reflects the increase. The increase made in Step#1 should be able to absorb the additional disk requirements; revisit Step#1 if a further disk space increase is needed.
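To focus on pg_wal alone without listing every directory, the following one-liner reports just that path. This is a sketch under the assumption that the pod name, container name, and mount path match the output shown above.

# Sketch: report only the pg_wal directory size
napp-k exec postgresql-ha-postgresql-0 -c postgresql -- du -sh /bitnami/postgresql/data/pg_wal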
Step#3: Checking the replication details on postgresql indicates the presence of one or more "inactive" replication slots:
napp-k exec -it postgresql-ha-postgresql-0 -c postgresql -- /bin/bash
PGPASSWORD=$POSTGRES_PASSWORD psql -w -U "postgres" -d "postgres" -h 127.0.0.1
select pg_is_in_recovery();
SELECT * FROM pg_replication_slots WHERE active_pid IS NULL OR active_pid NOT IN(SELECT pid FROM pg_stat_replication) AND active IS NOT TRUE\x\g\x
-[ RECORD 1 ]-------+-----------------
slot_name           | repmgr_slot_1001
plugin              |
slot_type           | physical
datoid              |
database            |
temporary           | f
active              | f
active_pid          |
xmin                |
catalog_xmin        |
restart_lsn         | 0/237FB820
confirmed_flush_lsn |
wal_status          | extended
safe_wal_size       |
-[ RECORD 2 ]-------+-----------------
slot_name           | repmgr_slot_1002
plugin              |
slot_type           | physical
datoid              |
database            |
temporary           | f
active              | f
active_pid          |
xmin                |
catalog_xmin        |
restart_lsn         | 0/237FB820
confirmed_flush_lsn |
wal_status          | extended
safe_wal_size       |
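To see how much WAL each slot is holding back, the distance between the current WAL position and each slot's restart_lsn can also be queried. This is a sketch using standard PostgreSQL functions (pg_current_wal_lsn and pg_wal_lsn_diff, available in PostgreSQL 10 and later) and is not part of the documented procedure.

-- Sketch: show WAL retained by each replication slot
SELECT slot_name,
       active,
       restart_lsn,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots;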
Delete the inactive replication slots by following the instructions below.
Execute the following commands on the only platform DB pod, named 'postgresql-ha-postgresql-0', which is also the master node (where 'select pg_is_in_recovery()' returned f).
napp-k exec -it postgresql-ha-postgresql-0 -c postgresql -- /bin/bash
PGPASSWORD=$POSTGRES_PASSWORD psql -w -U "postgres" -d "postgres" -h 127.0.0.1
# in the psql prompt, invoke commands to clean up the above inactive replication slots whose names start with 'repmgr_slot_', e.g.,
SELECT pg_drop_replication_slot('repmgr_slot_1002');
SELECT pg_drop_replication_slot('repmgr_slot_1001');
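If there are several inactive slots, they can be dropped in a single statement instead of one at a time. This is a sketch using standard PostgreSQL functions; the LIKE pattern assumes all leftover slots follow the 'repmgr_slot_' naming shown above.

-- Sketch: drop every inactive repmgr slot in one pass
SELECT slot_name, pg_drop_replication_slot(slot_name)
FROM pg_replication_slots
WHERE slot_name LIKE 'repmgr_slot_%' AND NOT active;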
Wait for a few minutes and re-execute the commands from the Step#2 section.
You should see that the disk usage on the master node has gone down and that the usage is similar for both postgresql pods.
Wait for the other platform pods to recover. If required, delete any pods still in a CrashLoopBackOff state with the command: napp-k delete pod <pod-name>
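If several pods remain stuck in CrashLoopBackOff, they can be deleted in one pass so that they are recreated. This is a sketch that assumes the napp-k output uses the namespace/name/ready/status column layout shown earlier; review the pod list before deleting.

# Sketch: delete all pods currently reported in CrashLoopBackOff
for p in $(napp-k get pods | grep CrashLoopBackOff | awk '{print $2}'); do
  napp-k delete pod "$p"
done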