NAPP Platform Postgresql Pod in CrashLoopBackOff State after Upgrading NAPP

Article ID: 320807

Updated On:

Products

VMware vDefend Firewall

Issue/Introduction

Symptoms:

After upgrading to 4.0.x or 4.1.x, one or more of the following are seen in the UI:

  1. An alarm for the Configuration Database
  2. System → NSX Application Platform shows the status as Degraded

Checking the status of the platform DB pod by running the following command as root on the CLI of the NSX Manager shows that postgresql-ha-postgresql-0 is in a CrashLoopBackOff state:

# napp-k get pods | grep postgresql-ha-postgresql
  
nsxi-platform postgresql-ha-postgresql-0 0/1 CrashLoopBackOff


Note: The results also include the pods metrics-postgresql-ha-postgresql-0 and metrics-postgresql-ha-postgresql-1. If they also show up in a CrashLoopBackOff state, please refer to the article "Metrics pods in crash loopback state on NSX Application Platform" for more information.

If the above command shows more than one platform postgresql server, i.e., you see names such as 'postgresql-ha-postgresql-1', please contact GSS.

A further check of the postgresql-ha-postgresql-0 logs indicates disk space issues such as the following:

# napp-k logs -c postgresql postgresql-ha-postgresql-0 | grep PANIC
 
PANIC: could not write to file "pg_wal/xlogtemp.154": No space left on device


Environment

VMware NSX 4.1.1

Cause

Some prior versions (before 4.0.1) of the NSX Application Platform configuration database ran with 3 replicas: postgresql-ha-postgresql-0, postgresql-ha-postgresql-1, and postgresql-ha-postgresql-2. After the upgrade, the number of platform DB pods is reduced to 1. As a result, postgresql-ha-postgresql-1 and postgresql-ha-postgresql-2 are no longer running, but they remain registered as standby nodes (stale replication entries in postgresql). Even though there is no replicated DB anymore, the master node still tries to connect to them and fails. This leads to the master node keeping all DB updates (write-ahead log, or WAL) on disk, waiting for the replicas to catch up with these updates, until the disk fills up.
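
Note: To illustrate the cause (this is a diagnostic sketch, not a required step), the following query, run from the psql session described in Step #3 below, estimates how much WAL each replication slot is pinning on disk. pg_current_wal_lsn and pg_wal_lsn_diff are standard PostgreSQL functions; the slot names and sizes in your environment will differ. An inactive slot retaining a large amount of WAL is what fills pg_wal:

    -- approximate bytes of WAL retained on disk because of each replication slot
    SELECT slot_name, active, wal_status,
           pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) AS retained_wal_bytes
    FROM pg_replication_slots
    ORDER BY retained_wal_bytes DESC;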

Resolution

This has been fixed in NAPP 4.2.0 and beyond. 


Workaround:

Step #1

Follow the resolution steps below for the NSX Application Platform Health alarm 'Platform DB Disk usage high/very high' and increase the storage by 10Gi. Since the disk is already exhausted, the pod will not be able to come up to carry out the later steps until the storage is increased. Wait for the postgresql-ha-postgresql-0 pod to be in a Running state.

SSH on to one of the NSX Manager Nodes.
As the root user, execute the following commands: 

             a. napp-k edit pvc data-postgresql-ha-postgresql-0

             b. Change the spec->resources->requests->storage value and save (note that this editor
                  uses the same key bindings as vi/Vim).
                     Note: The recommendation is to increase the storage by at least 10Gi.
                     Please confirm the datastore backing the worker nodes has enough
                     available space for the increase in storage.

            c.  napp-k delete pod postgresql-ha-postgresql-0
                     Note: This is a safe action and is needed for the storage change to take effect.
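
Optional check: before proceeding, you can confirm that the PVC reflects the new size and that the pod is back up. The commands below are a sketch using standard kubectl options through the napp-k alias; the value returned depends on the size you configured above, and the new capacity may only be reported after the pod restart in sub-step c.

# napp-k get pvc data-postgresql-ha-postgresql-0 -o jsonpath='{.status.capacity.storage}'
# napp-k get pod postgresql-ha-postgresql-0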




Step #2

Checking the disk usage on the postgresql-ha-postgresql-0 pod indicates most of the disk space under /bitnami/postgresql/data is taken up by /bitnami/postgresql/data/pg_wal 

# napp-k exec postgresql-ha-postgresql-0 -- df -h /bitnami/postgresql
Filesystem      Size  Used Avail Use% Mounted on
/dev/sdi         30G  19.8G  10G 67% /bitnami/postgresql
 
# napp-k exec postgresql-ha-postgresql-0 -- du -h /bitnami/postgresql/data | sort -hr
19.8G    /bitnami/postgresql/data
19.5G    /bitnami/postgresql/data/pg_wal

The only thing to check and confirm here is that the pg_wal directory is taking up most of the disk space. Note that the disk size was just increased in Step #1; the originally available disk space in this example was 20Gi.

The storage increase from Step #1 should take care of the additional disk requirements. At this point you can revisit Step #1 in case a further disk space increase is needed.
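
As an additional sanity check (optional, a diagnostic sketch only), you can count the WAL segment files directly; each segment is 16MB by default, so an unusually large count corresponds to the pg_wal growth shown above:

# napp-k exec postgresql-ha-postgresql-0 -- ls /bitnami/postgresql/data/pg_wal | wc -l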

Step #3

Checking the replication details on postgresql indicates the presence of one or more "inactive" replication slots.

  1. napp-k exec -it postgresql-ha-postgresql-0 -c postgresql -- /bin/bash
  2. PGPASSWORD=$POSTGRES_PASSWORD psql -w -U "postgres" -d "postgres" -h 127.0.0.1
  3. select pg_is_in_recovery();
    Since there is only 1 postgresql server running, it should return 'f' as it's the master node
  4. SELECT * FROM pg_replication_slots WHERE active_pid IS NULL OR active_pid NOT IN(SELECT pid FROM pg_stat_replication) AND active IS NOT TRUE\x\g\x
    The output should look similar to the one below. If you see one or more entries in the response, proceed to the remediation steps in Step #4.
-[ RECORD 1 ]-------+-----------------
slot_name           | repmgr_slot_1001
plugin              |
slot_type           | physical
datoid              |
database            |
temporary           | f
active              | f
active_pid          |
xmin                |
catalog_xmin        |
restart_lsn         | 0/237FB820
confirmed_flush_lsn |
wal_status          | extended
safe_wal_size       |
-[ RECORD 2 ]-------+-----------------
slot_name           | repmgr_slot_1002
plugin              |
slot_type           | physical
datoid              |
database            |
temporary           | f
active              | f
active_pid          |
xmin                |
catalog_xmin        |
restart_lsn         | 0/237FB820
confirmed_flush_lsn |
wal_status          | extended
safe_wal_size       |

Step #4

Delete the inactive replication slots by following the instructions below.
Execute the following commands on the only platform DB pod, 'postgresql-ha-postgresql-0', which is also the master node where 'select pg_is_in_recovery()' returned 'f'.

  1. napp-k exec -it postgresql-ha-postgresql-0 -c postgresql -- /bin/bash
  2. PGPASSWORD=$POSTGRES_PASSWORD psql -w -U "postgres" -d "postgres" -h 127.0.0.1
  3. # in the psql prompt, invoke commands to clean up the above inactive replication slots whose names start with 'repmgr_slot_', e.g.,

    SELECT pg_drop_replication_slot('repmgr_slot_1002');
    SELECT pg_drop_replication_slot('repmgr_slot_1001');
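
To verify the cleanup, you can re-run the inactive-slot check from Step #3 at the psql prompt (a minimal verification sketch using the same system view); it should now return no rows:

    -- expect zero rows once the stale slots have been dropped
    SELECT slot_name, active FROM pg_replication_slots WHERE active IS NOT TRUE;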

Step #5

Wait a few minutes and then execute the commands from the Step #2 section again.

You should see that the disk usage on the master node (postgresql-ha-postgresql-0) has gone down and that pg_wal no longer takes up most of the space under /bitnami/postgresql/data.
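
For example, a condensed re-check of just the WAL directory (same pod and path as in Step #2) should now show pg_wal reduced to a small size:

# napp-k exec postgresql-ha-postgresql-0 -- du -sh /bitnami/postgresql/data/pg_wal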

Step #6

Wait for the other platform pods to recover. If required, delete any pods that are still in a CrashLoopBackOff state with the command:  napp-k delete pod <pod-name>
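
If you need to identify which pods are still unhealthy before deleting them, a simple filter such as the sketch below can help; pods in Completed state belong to finished jobs and can be ignored:

# napp-k get pods | grep -Ev 'Running|Completed'
# napp-k delete pod <pod-name>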