Metrics pods in CrashLoopBackOff state on NSX Application Platform

Article ID: 345819


Updated On:

Products

VMware NSX Networking

Issue/Introduction

This article provides steps to remediate Metrics pods in a CrashLoopBackOff state on NSX Application Platform.

Symptoms:

1.  One of the following issues is observed in the NSX UI:

  • Alarm for Metrics Delivery Failure.
  • System → NSX Application Platform shows the status is Degraded and the Metrics service is down.     

2.  Checking the status of the pods, by running the following command as root on the CLI of the NSX Manager, shows the metrics-postgresql-ha-postgresql-0 and metrics-postgresql-ha-postgresql-1 pods in a CrashLoopBackOff state:

napp-k get pods | grep metrics-postgresql-ha-postgresql
 
nsxi-platform metrics-postgresql-ha-postgresql-0 0/1 CrashLoopBackOff 
nsxi-platform metrics-postgresql-ha-postgresql-1 0/1 CrashLoopBackOff 
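
Optionally, to see more detail on why a pod is restarting (last state, exit code, restart count), you can describe it. This assumes napp-k forwards the describe subcommand to kubectl in the same way it forwards the get, logs and exec subcommands used in this article:

napp-k describe pod metrics-postgresql-ha-postgresql-0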

3. A further check of the metrics-postgresql-ha-postgresql pod logs indicates disk space issues:

napp-k logs -c postgresql metrics-postgresql-ha-postgresql-0 | grep PANIC
or
napp-k logs -c postgresql metrics-postgresql-ha-postgresql-1 | grep PANIC
 
Either command will show an entry similar to the one below:
 
PANIC:  could not write to file "pg_wal/xlogtemp.154": No space left on device
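
As an additional confirmation that the backing volume itself is full, you can check the filesystem usage inside the pod. This is an optional check and assumes the df utility is available in the postgresql container image:

napp-k exec metrics-postgresql-ha-postgresql-0 -c postgresql -- df -h /bitnami/postgresql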

 

 


Environment

VMware NSX-T Data Center

Cause

One or more inactive replication slots prevent WAL files from being recycled. The WAL files accumulate until the volume runs out of space, and the resulting "No space left on device" errors cause the PostgreSQL server to crash.
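
To see how much WAL each replication slot is retaining, the following query can be run from psql on the master pod (see Step #3 of the workaround for how to connect). This is standard PostgreSQL and assumes a version where pg_current_wal_lsn() is available (PostgreSQL 10 or later):

SELECT slot_name, active,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots;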

Resolution

This will be fixed in future NSX-T releases.

Workaround:

Step #1: Increase the storage for the metrics postgresql volumes by at least 10Gi using the steps below (these are the same resolution steps as for the NSX Application Platform Health alarm "Metrics Disk usage high/very high"). Since the disk is already exhausted, the pods cannot come up, and the remaining steps in this article cannot be carried out until more storage is provided. After completing this step, wait for the metrics-postgresql-ha-postgresql pods to reach a Running state.

  1. SSH on to one of the NSX Manager nodes.
  2. As the root user, execute the following commands:

     a) napp-k edit pvc data-metrics-postgresql-ha-postgresql-0

     b) Change the spec -> resources -> requests -> storage value and save (this editor uses the same commands as vi). See the example after this list.

        Note: The recommendation is to increase the storage by at least 10Gi.
        Confirm that the datastore backing the worker nodes has enough available space for the storage increase.

     c) napp-k delete pod metrics-postgresql-ha-postgresql-0

        Note: Deleting the pod is safe and is needed for the storage change to take effect.

  3. Repeat steps a, b and c for the second replica. In steps a and b the PVC name is data-metrics-postgresql-ha-postgresql-1, while in step c the pod name is metrics-postgresql-ha-postgresql-1.
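
For reference, the field to change inside the editor looks similar to the fragment below. The 30Gi value is only an illustration; use your current value plus at least 10Gi:

spec:
  resources:
    requests:
      storage: 30Gi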

Step #2: Check the disk usage on the postgresql pods. On the master pod, most of the disk space under /bitnami/postgresql/data is taken up by /bitnami/postgresql/data/pg_wal, while the disk usage of the standby pod is comparatively very low.

napp-k exec metrics-postgresql-ha-postgresql-0 -- du -h /bitnami/postgresql/data | sort -hr
 
9.8G    /bitnami/postgresql/data
9.5G    /bitnami/postgresql/data/pg_wal
  
napp-k exec metrics-postgresql-ha-postgresql-1 -- du -h /bitnami/postgresql/data | sort -hr
367M    /bitnami/postgresql/data

NOTE: The disk usage stats can be the opposite if metrics-postgresql-ha-postgresql-1 is the master.

The key point to confirm is that the disk usage of one pod is much higher than the other, and that the pg_wal directory accounts for most of that space.

If you do not see a major difference between the disk usage of the two pods, stop here; this indicates the disk usage is driven by a high scale of metrics in your environment and is expected.

In that case, the storage increase from Step #1 should cover the increased disk requirements; revisit Step #1 if a further increase is needed.
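
To compare this usage against the provisioned capacity of each volume, you can also list the PVCs. This assumes napp-k forwards the get pvc subcommand to kubectl in the same way it forwards get pods:

napp-k get pvc | grep data-metrics-postgresql-ha-postgresql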

Step #3: Checking the replication details on postgresql indicates the presence of an "inactive" replication slot.

1. napp-k exec -it metrics-postgresql-ha-postgresql-0 bash
 
2. PGPASSWORD=$POSTGRES_PASSWORD psql -w -U "postgres" -d "postgres" -h 127.0.0.1
 
3. select pg_is_in_recovery();
 
If this returns f, you are on the master node; proceed to #4.
Otherwise, exit and repeat #1 to #3 on metrics-postgresql-ha-postgresql-1, then execute #4 once #3 returns f.
 
4. SELECT * FROM pg_replication_slots WHERE active_pid IS NULL OR active_pid NOT IN(SELECT pid FROM pg_stat_replication) AND active IS NOT TRUE\x\g\x
  
The output should look similar to the one below. If you see one or more entries in the response, proceed to Step #4.
 
-[ RECORD 1 ]-------+-----------------
slot_name           | repmgr_slot_1002
plugin              |
slot_type           | physical
datoid              |
database            |
temporary           | f
active              | f
active_pid          |
xmin                |
catalog_xmin        |
restart_lsn         | 0/237FB820
confirmed_flush_lsn |
wal_status          | extended
safe_wal_size       |

Step #4: Delete the inactive replication slots using the instructions below.
Execute the following commands on the master node, i.e. the one where select pg_is_in_recovery() returned f.

1. napp-k exec -it <master-postgresql-pod-name> bash 
   e.g. napp-k exec -it metrics-postgresql-ha-postgresql-0 bash
2. PGPASSWORD=$POSTGRES_PASSWORD psql -w -U "postgres" -d "postgres" -h 127.0.0.1
3. /* Function checks for inactive replication slots and drops them */
CREATE OR REPLACE FUNCTION clear_inactive_replication_slots() RETURNS void as $$
DECLARE
    slot_names varchar;
BEGIN
 
FOR slot_names IN SELECT slot_name FROM pg_replication_slots WHERE active_pid IS NULL OR active_pid NOT IN(SELECT pid FROM pg_stat_replication) AND active IS NOT TRUE
    LOOP
        RAISE INFO 'Deleting inactive replication slot %', slot_names;
        PERFORM pg_drop_replication_slot(slot_names);
    END LOOP;
 
END;
$$ LANGUAGE plpgsql;
 
/*Execute the inactive replication slot cleanup*/
SELECT clear_inactive_replication_slots();

You should see a response similar to the one below.

INFO:  Deleting inactive replication slot repmgr_slot_1002
 clear_inactive_replication_slots
----------------------------------

(1 row)
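
To confirm the cleanup, re-run the query from Step #3; it should now return no rows:

SELECT * FROM pg_replication_slots WHERE active_pid IS NULL OR active_pid NOT IN(SELECT pid FROM pg_stat_replication) AND active IS NOT TRUE\x\g\x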

Step #5: Wait a few minutes and re-run the commands from Step #2.

You should see that the disk usage of the master pod has gone down and that the usage of both postgresql pods is now similar.
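
For a quicker check of just the WAL directory, the same du utility from Step #2 can be pointed at pg_wal:

napp-k exec metrics-postgresql-ha-postgresql-0 -- du -sh /bitnami/postgresql/data/pg_wal
napp-k exec metrics-postgresql-ha-postgresql-1 -- du -sh /bitnami/postgresql/data/pg_wal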

Step #6: Wait for the other metrics pods to recover. If required, delete any pods still in CrashLoopBackOff with the command:  napp-k delete pod <pod-name>
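
To list any metrics pods that are still not running, a simple filter on the pod listing used earlier can help (Completed entries, if any, can be ignored):

napp-k get pods | grep metrics | grep -vE 'Running|Completed'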

The metrics service on the UI should be UP.

 

Additional Information

Impact/Risks:

Metrics services will be down.