One or more of the following symptoms are seen:
Alarm for Metrics Delivery Failure
System → NSX Application Platform shows the status is Degraded and the Metrics service is down.
Checking the status of the pods as root on the CLI of the NSX Manager shows that the metrics-postgresql-ha-postgresql-0 and/or metrics-postgresql-ha-postgresql-1 pods are in a CrashLoopBackOff state.
A further check into the metrics-postgresql-ha-postgresql-0 logs indicates host translation issues
VMware NSX
Before 4.1.1, the metrics-postgresql service ran with 3 replicas: metrics-postgresql-ha-postgresql-0, metrics-postgresql-ha-postgresql-1 and metrics-postgresql-ha-postgresql-2. After the upgrade, the number of replicas is reduced to 2, so only metrics-postgresql-ha-postgresql-0 and metrics-postgresql-ha-postgresql-1 are available.
In this case, metrics-postgresql-ha-postgresql-2 is still registered as a standby node (a stale entry in PostgreSQL) even though the replica count is 2, so the primary node tries to connect to it and fails.
Scale metrics-postgresql-ha-postgresql to 3 replicas
# napp-k scale statefulsets metrics-postgresql-ha-postgresql --replicas=3
Unregister metrics-postgresql-ha-postgresql-2 as a standby
# napp-k exec -it metrics-postgresql-ha-postgresql-0 bash
# repmgr -f build/repmgr/conf/repmgr.conf cluster show --compact
Sample Output
postgres@metrics-postgresql-ha-postgresql-0:/opt$ repmgr -f build/repmgr/conf/repmgr.conf cluster show --compact
 ID   | Name                               | Role    | Status               | Upstream | Location | Prio. | TLI
------+------------------------------------+---------+----------------------+----------+----------+-------+-----
 1000 | metrics-postgresql-ha-postgresql-0 | primary | * running            |          | default  | 100   | 17
 1001 | metrics-postgresql-ha-postgresql-1 | standby | * running            |          | default  | 100   | 18
 1002 | metrics-postgresql-ha-postgresql-2 | standby | * running            |          | default  | 100   | 19
This shows the current cluster details.
Check the Role and Status values for metrics-postgresql-ha-postgresql-0 and metrics-postgresql-ha-postgresql-1. The status of both should be either running or running as primary.
If you see an entry for metrics-postgresql-ha-postgresql-2 with the role standby, unregister it:
# repmgr standby unregister -f build/repmgr/conf/repmgr.conf --node-id=<ID of metrics-postgresql-ha-postgresql-2>
The ID should ideally be 1002.
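Rather than assuming the ID, it can be read from the cluster show output. The following is a minimal sketch, not part of the official procedure: find_node_id is a hypothetical helper, the sample table mirrors the Sample Output earlier in this article, and it assumes awk is available wherever the output is processed.

```shell
#!/bin/sh
# find_node_id is a hypothetical helper, not a repmgr command.
# It reads `repmgr ... cluster show --compact` output on stdin and
# prints the ID of the row whose Name column equals the given node name.
find_node_id() {
    awk -F'|' -v name="$1" '
        {
            id = $1; node = $2
            # Trim the padding that repmgr adds around each column.
            gsub(/^[ \t]+|[ \t]+$/, "", id)
            gsub(/^[ \t]+|[ \t]+$/, "", node)
            if (node == name) print id
        }'
}

# Sample input copied from the Sample Output in this article.
sample_output() {
cat <<'EOF'
 ID   | Name                               | Role    | Status               | Upstream | Location | Prio. | TLI
------+------------------------------------+---------+----------------------+----------+----------+-------+-----
 1000 | metrics-postgresql-ha-postgresql-0 | primary | * running            |          | default  | 100   | 17
 1001 | metrics-postgresql-ha-postgresql-1 | standby | * running            |          | default  | 100   | 18
 1002 | metrics-postgresql-ha-postgresql-2 | standby | * running            |          | default  | 100   | 19
EOF
}

# Look up the ID of the stale standby in the sample table.
sample_output | find_node_id metrics-postgresql-ha-postgresql-2
```

Against a live cluster, the output of the cluster show command would be piped into find_node_id in place of sample_output. If nothing is printed, the node is not registered and there is nothing to unregister.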
Check the cluster status again to confirm that the change is reflected.
# repmgr -f build/repmgr/conf/repmgr.conf cluster show --compact
Exit from the pod.
Scale metrics-postgresql-ha-postgresql to 2 replicas
# napp-k scale statefulsets metrics-postgresql-ha-postgresql --replicas=2
Check that the 2 metrics-postgresql pods come up. Wait for the other metrics pods to recover. If required, delete any pods that remain in a CrashLoopBackOff state:
# napp-k delete pod <pod-name>
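To see which pods still need attention, the pod listing can be filtered by status. This is only a sketch under stated assumptions: crashlooping_pods is a hypothetical helper, the sample listing and the pod name metrics-example-pod are made up for illustration, and it assumes that napp-k passes a get pods subcommand through to kubectl in the same way it does for scale, exec and delete in this article.

```shell
#!/bin/sh
# crashlooping_pods is a hypothetical helper. It reads a pod listing in
# the usual NAME / READY / STATUS / RESTARTS / AGE layout on stdin and
# prints the names of pods whose STATUS column is CrashLoopBackOff.
crashlooping_pods() {
    # NR > 1 skips the header row; STATUS is the third column.
    awk 'NR > 1 && $3 == "CrashLoopBackOff" { print $1 }'
}

# Made-up sample listing, for illustration only.
sample_pods() {
cat <<'EOF'
NAME                                 READY   STATUS             RESTARTS   AGE
metrics-postgresql-ha-postgresql-0   1/1     Running            0          5m
metrics-postgresql-ha-postgresql-1   1/1     Running            0          4m
metrics-example-pod                  0/1     CrashLoopBackOff   12         40m
EOF
}

# List the pods in the sample that are still crash looping.
sample_pods | crashlooping_pods
```

On the NSX Manager, the equivalent would be to pipe the output of napp-k get pods into crashlooping_pods and then delete each listed pod with napp-k delete pod <pod-name>.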
IMPORTANT:
Follow this up by executing step #3 in the Workaround section of "Metrics pods in crash loopback state on NSX Application Platform".
If you see any replication slots with active = f, follow the remaining steps #4, #5 and #6 in the Workaround section of that KB.