Metrics pods in crash loopback state after upgrading NAPP to 4.1.1

Products

VMware NSX

Issue/Introduction

One or more of the following symptoms are seen in the NSX UI:

Alarm for Metrics Delivery Failure

System → NSX Application Platform shows the status is Degraded and the Metrics service is down.

Checking the status of the pods by running the following command as root on the NSX Manager CLI shows that the metrics-postgresql-ha-postgresql-0 and/or metrics-postgresql-ha-postgresql-1 pods are in a CrashLoopBackOff state:

napp-k get pods | grep metrics-postgresql-ha-postgresql

nsxi-platform metrics-postgresql-ha-postgresql-0 0/1 CrashLoopBackOff
nsxi-platform metrics-postgresql-ha-postgresql-1 0/1 CrashLoopBackOff
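When the pod list is long, the affected pods can be picked out with a small filter. The following is a minimal sketch, not part of the official procedure: the function name crashlooping_pods is hypothetical, and it only assumes that the pod name and the CrashLoopBackOff status appear as whitespace-separated fields, as in the sample output above.

```shell
# Print the names of metrics-postgresql pods reported as CrashLoopBackOff.
# Reads "get pods" style output on stdin. It scans every field for the pod
# name, so it works with or without a leading namespace column.
crashlooping_pods() {
  awk '
    /CrashLoopBackOff/ {
      for (i = 1; i <= NF; i++)
        if ($i ~ /^metrics-postgresql-ha-postgresql-[0-9]+$/) print $i
    }'
}

# Usage on the NSX Manager, as root:
#   napp-k get pods | crashlooping_pods
```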

Checking the logs of the affected pods shows host name resolution failures:

napp-k logs -c postgresql metrics-postgresql-ha-postgresql-0
or
napp-k logs -c postgresql metrics-postgresql-ha-postgresql-1

The logs show entries similar to the following:

2023-12-18T11:39:00.475814367Z stderr F could not translate host name "metrics-postgresql-ha-postgresql-2.metrics-postgresql-ha-postgresql-headless.nsxi-platform.svc.cluster.local" to address: Name or service not known

2023-12-18T11:39:00.475817531Z stderr F

2023-12-18T11:39:00.475819849Z stderr F [2023-12-18 11:39:00] [DETAIL] attempted to connect using:

2023-12-18T11:39:00.475825843Z stderr F user=repmgr password=aTOOprLBMMf7SIXtzIABSnWG1sVXWaY0 connect_timeout=20 dbname=repmgr host=metrics-postgresql-ha-postgresql-2.metrics-postgresql-ha-postgresql-headless.nsxi-platform.svc.cluster.local port=5432 fallback_application_name=repmgr options=-csearch_path=

2023-12-18T11:39:00.475829208Z stderr F [2023-12-18 11:39:00] [ERROR] unable connect to upstream node (ID: 1002), terminating

2023-12-18T11:39:00.475832067Z stderr F [2023-12-18 11:39:00] [HINT] upstream node must be running before repmgrd can start
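To confirm which peer the pod is failing to resolve, the host name can be extracted from the log text. This is a sketch only: the function name unresolved_hosts is hypothetical, and it relies on the "could not translate host name" message having the wording shown in the log excerpt above.

```shell
# Print each distinct host name that failed to resolve.
# Reads pod log text on stdin.
unresolved_hosts() {
  sed -n 's/.*could not translate host name "\([^"]*\)" to address.*/\1/p' | sort -u
}

# Usage on the NSX Manager, as root:
#   napp-k logs -c postgresql metrics-postgresql-ha-postgresql-0 | unresolved_hosts
```

If the result names metrics-postgresql-ha-postgresql-2, the symptoms match the issue described in this article.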

Environment

VMware NSX

Cause

Before 4.1.1, the metrics-postgresql service ran with 3 replicas: metrics-postgresql-ha-postgresql-0, metrics-postgresql-ha-postgresql-1 and metrics-postgresql-ha-postgresql-2. After the upgrade, the number of replicas is reduced to 2, so only metrics-postgresql-ha-postgresql-0 and metrics-postgresql-ha-postgresql-1 are available.

In this case, metrics-postgresql-ha-postgresql-2 is still registered as a standby node (a stale entry in postgresql) even though the replica count is now 2. The primary node tries to connect to it and fails because the pod no longer exists.
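The reasoning can be expressed as a simple check. A StatefulSet with N replicas creates pods with ordinals 0 to N-1 by default, so any registered node whose ordinal is N or higher refers to a pod that no longer exists. The sketch below is illustrative only, and the function name is_stale_member is hypothetical.

```shell
# Succeed (exit status 0) when the host's pod ordinal is not created at the
# given replica count, i.e. the entry is stale.
#   $1 = pod name or fully qualified host name
#   $2 = current replica count of the StatefulSet
is_stale_member() {
  host=$1
  replicas=$2
  pod=${host%%.*}      # strip the DNS suffix, leaving the pod name
  ordinal=${pod##*-}   # trailing number after the last hyphen
  [ "$ordinal" -ge "$replicas" ]
}

# Example:
#   is_stale_member metrics-postgresql-ha-postgresql-2 2 && echo "stale entry"
```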

Resolution

Scale metrics-postgresql-ha-postgresql to 3 replicas

# napp-k scale statefulsets metrics-postgresql-ha-postgresql --replicas=3

Unregister metrics-postgresql-ha-postgresql-2 as a standby

# napp-k exec -it metrics-postgresql-ha-postgresql-0 bash

# repmgr -f build/repmgr/conf/repmgr.conf cluster show --compact

This will show you the current cluster details.

Check the Role and Status values for metrics-postgresql-ha-postgresql-0 and metrics-postgresql-ha-postgresql-1. The status of each of these 2 nodes should be running, with one of them running as primary.

If there is an entry for metrics-postgresql-ha-postgresql-2 with the role standby, unregister the standby via

# repmgr standby unregister -f build/repmgr/conf/repmgr.conf --node-id=<ID of metrics-postgresql-ha-postgresql-2>

The ID is typically 1002; confirm it against the cluster show output before unregistering.
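Rather than reading the ID off the table by eye, it can be looked up by node name. This is a sketch under assumptions: the function name repmgr_node_id is hypothetical, and it assumes the compact cluster show output is a pipe-separated table with the ID in the first column and the node name in the second, with nodes named after their pods. Verify against the actual output before using the result.

```shell
# Print the repmgr node ID for the node named in $1.
# Reads "repmgr cluster show --compact" output on stdin.
# Assumed layout:  ID | Name | Role | Status | ...
repmgr_node_id() {
  awk -F'|' -v name="$1" '
    {
      id = $1; node = $2
      gsub(/^[ \t]+|[ \t]+$/, "", id)     # trim surrounding whitespace
      gsub(/^[ \t]+|[ \t]+$/, "", node)
      if (node == name && id ~ /^[0-9]+$/) print id
    }'
}

# Usage inside the metrics-postgresql-ha-postgresql-0 pod:
#   repmgr -f build/repmgr/conf/repmgr.conf cluster show --compact \
#     | repmgr_node_id metrics-postgresql-ha-postgresql-2
```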

Check the cluster status again to confirm that the change has taken effect.

# repmgr -f build/repmgr/conf/repmgr.conf cluster show --compact

Exit from the pod.

Scale metrics-postgresql-ha-postgresql to 2 replicas

# napp-k scale statefulsets metrics-postgresql-ha-postgresql --replicas=2

Check that the 2 metrics-postgresql pods come up, then wait for the other metrics pods to recover. If required, delete any pods that remain in CrashLoopBackOff:

# napp-k delete pod <pod-name>
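The recovery check can be scripted in the same spirit. The sketch below is not part of the official procedure: the function name postgresql_pods_ready is hypothetical, and it assumes the READY and STATUS columns immediately follow the pod name, as in the sample output earlier in this article.

```shell
# Succeed (exit status 0) only if at least one metrics-postgresql pod is
# listed and every listed one is fully ready (e.g. 1/1) and Running.
# Reads "get pods" style output on stdin.
postgresql_pods_ready() {
  awk '
    {
      for (i = 1; i < NF; i++)
        if ($i ~ /^metrics-postgresql-ha-postgresql-[0-9]+$/) {
          seen++
          split($(i + 1), r, "/")   # READY column, e.g. "1/1"
          if (r[1] != r[2] || $(i + 2) != "Running") bad++
        }
    }
    END { if (seen == 0 || bad > 0) exit 1 }'
}

# Usage on the NSX Manager, as root:
#   napp-k get pods | postgresql_pods_ready && echo "postgresql pods are ready"
```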

IMPORTANT:

After completing the steps above, execute step #3 in the Workaround section of the KB article "Metrics pods in crash loopback state on NSX Application Platform".

If you see any replication slots with active = f, also follow steps #4, #5 and #6 in the Workaround section of that KB.