Avi Load Balancer Controller Node Postgres Memory Leak - METRIC_DBSYNC_FAILURE

Article ID: 384404

Updated On:

Products

VMware Avi Load Balancer

Issue/Introduction

The Avi controller nodes could experience high memory usage, and VirtualService / Service Engine metrics do not load in the UI.

A PostgreSQL connection leak can lead to the following symptoms:

     1. The analytics dashboard may fail to display metrics for VirtualServices and Service Engines, potentially coinciding with "METRICS_DBSYNC_FAILURE" alerts on the controller.

     2. The metricapi_server.service is in a failed state on the controller node.

     3. A similar event can be seen in the Controller UI.

Environment

May occur in all cloud types

Affected Versions:

Controller versions 22.1.5 and earlier
Controller version 30.1.1

Cause

The Metrics Manager has a bug that prevents PostgreSQL database connections from being closed properly. The cleanup process occasionally fails and the destructor for db_connection is not always invoked, leaving connections open and resulting in a connection leak.
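
For illustration only (this is not the actual Metrics Manager source; the db_connection class, DSN, and helper names below are hypothetical), a minimal Python sketch of the leak-prone pattern and a leak-safe alternative, assuming a psycopg2-based wrapper:

    import psycopg2

    class db_connection:
        # Leak-prone pattern: cleanup is left to the destructor.
        def __init__(self, dsn):
            self.conn = psycopg2.connect(dsn)  # each instance opens one backend visible in pg_stat_activity

        def query(self, sql):
            with self.conn.cursor() as cur:
                cur.execute(sql)
                return cur.fetchall()

        def __del__(self):
            # __del__ is not guaranteed to run (lingering references, reference cycles,
            # interpreter shutdown), so the backend can be left behind in the "idle" state.
            self.conn.close()

    def safe_query(dsn, sql):
        # Leak-safe pattern: close the connection explicitly, even when the query fails.
        conn = psycopg2.connect(dsn)
        try:
            with conn.cursor() as cur:
                cur.execute(sql)
                return cur.fetchall()
        finally:
            conn.close()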

Resolution

  • Logs show a PostgreSQL replication failure in the postgres-5049 logs:

Path: /var/log/upstart:

[2023-12-30 07:55:16,829] ERROR [postgres_service.failure_count:101] [POSTGRES][metrics_db][localhost] Postgres is UP on the leader, yet replication is not working!
[2023-12-30 07:55:17,503] INFO [postgres_service.check_streaming_from_primary:306] pg_stat_wal_receiver table:
[2023-12-30 07:55:17,507] WARNING [postgres_service.<module>:447] [POSTGRES][metrics_db][localhost] 1 replication failures
[2024-01-08 09:45:16,131] WARNING [postgres_service.run_pg_task:139] Failed to execute task write_heartbeat
[2024-01-08 09:45:16,162] ERROR [postgres_service.write_heartbeat:180] [POSTGRES][metrics_db][localhost] Could not write the heartbeat to the database
[2024-01-08 09:45:17,072] ERROR [postgres_service.failure_count:101] [POSTGRES][metrics_db][localhost] Postgres is UP on the leader, yet replication is not working!
[2024-01-08 09:45:17,903] INFO [postgres_service.check_streaming_from_primary:306] pg_stat_wal_receiver table:
[2024-01-08 09:45:17,910] WARNING [postgres_service.<module>:447] [POSTGRES][metrics_db][localhost] 1 replication failures

  • Check the metricapi_server.service status on all controller nodes:
    systemctl status metricapi_server.service


  • To identify the issue, check the total number of PostgreSQL connections on all controller nodes in the cluster:
    sudo -u postgres psql -p 5000 -d avi -c "SELECT count(*) from pg_stat_activity"
     
  • The following commands help identify the applications holding connections, along with each process ID, state, timestamps, and a substring of the query:
    sudo -u postgres psql -p 5000 -d avi -c "SELECT pid, state, state_change, query_start, backend_start, substring(query, 1, 100) from pg_stat_activity"
    sudo -u postgres psql -p 5049 -d metrics -c "SELECT pid, state, state_change, query_start, backend_start, substring(query, 1, 100) from pg_stat_activity"
    
    From the output, check how many connections are in the idle state and the state_change timestamp (it shows how long each connection has been idle).
    
       pid   | state  |         state_change          |          query_start          |         backend_start         |                                              substring
    ---------+--------+-------------------------------+-------------------------------+-------------------------------+------------------------------------------------------------------------------------------------------
        3988 |        |                               |                               | 2025-02-28 03:01:10.858488+00 |
        3990 |        |                               |                               | 2025-02-28 03:01:10.859047+00 |
        4737 | idle   | 2025-04-29 02:08:51.671159+00 | 2025-04-29 02:08:51.671132+00 | 2025-02-28 03:01:21.896646+00
        4207 | idle   | 2025-04-29 14:35:23.991781+00 | 2025-04-29 14:35:23.991678+00 | 2025-02-28 03:01:16.728935+00
        4209 | idle   | 2025-04-29 14:35:23.89019+00  | 2025-04-29 14:35:23.890147+00 | 2025-02-28 03:01:16.793347+00
    
    
  • Idle connections should not exceed approximately 200-300 (see the connection-count commands after this list).
  • The following commands give a detailed view of the PostgreSQL processes, listing all listening and non-listening sockets:
    sudo netstat -planet | grep 5000
    sudo netstat -planet | grep 5049
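
As a quick check (a minimal variation of the pg_stat_activity queries above, using the same ports and database names already shown), the following commands count connections grouped by state; a large and growing number of idle sessions points to the leak:

    sudo -u postgres psql -p 5000 -d avi -c "SELECT state, count(*) FROM pg_stat_activity GROUP BY state ORDER BY 2 DESC"
    sudo -u postgres psql -p 5049 -d metrics -c "SELECT state, count(*) FROM pg_stat_activity GROUP BY state ORDER BY 2 DESC"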

The issue has been fixed in the following versions:

  • Bug ID - AV-193663
  • Release Notes - AV-193663: Metrics Manager's database connections with Postgres are unclosed, causing a connection leak.
  • Workaround -
    Option 1:
    Restart the process-supervisor service on the leader node. In a three-node cluster setup, this triggers a cluster failover.
    sudo systemctl restart process-supervisor.service

    Option 2: Soft reboot the controller leader by running the following command:

    reboot -f


  • Fix Version - 30.2.1, 30.1.2, 22.1.6, 22.1.5-2p2

 

Additional Information