The Avi Controller nodes can experience high memory usage, and VirtualService / Service Engine metrics do not load in the UI.
A PostgreSQL connection leak can lead to the following symptoms:
1. The analytics dashboard may fail to display metrics for VirtualServices and Service Engines, potentially coinciding with "METRICS_DBSYNC_FAILURE" alerts on the controller
2. The metricapi_server.service is in a failed state on the Controller node
3. Similar events can be seen in the Controller UI
This issue may occur in all cloud types.
Affected Versions:
Controller versions 22.1.5 and earlier
Controller version 30.1.1
The Metrics Manager has a bug that prevents PostgreSQL database connections from being closed properly: the cleanup process occasionally fails and the destructor for db_connection is not always invoked, leaving connections open and resulting in a connection leak.
The following errors are seen in the logs under /var/log/upstart:
[2023-12-30 07:55:16,829] ERROR [postgres_service.failure_count:101] [POSTGRES][metrics_db][localhost] Postgres is UP on the leader, yet replication is not working!
[2023-12-30 07:55:17,503] INFO [postgres_service.check_streaming_from_primary:306] pg_stat_wal_receiver table:
[2023-12-30 07:55:17,507] WARNING [postgres_service.<module>:447] [POSTGRES][metrics_db][localhost] 1 replication failures
[2024-01-08 09:45:16,131] WARNING [postgres_service.run_pg_task:139] Failed to execute task write_heartbeat
[2024-01-08 09:45:16,162] ERROR [postgres_service.write_heartbeat:180] [POSTGRES][metrics_db][localhost] Could not write the heartbeat to the database
[2024-01-08 09:45:17,072] ERROR [postgres_service.failure_count:101] [POSTGRES][metrics_db][localhost] Postgres is UP on the leader, yet replication is not working!
[2024-01-08 09:45:17,903] INFO [postgres_service.check_streaming_from_primary:306] pg_stat_wal_receiver table:
[2024-01-08 09:45:17,910] WARNING [postgres_service.<module>:447] [POSTGRES][metrics_db][localhost] 1 replication failures
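To confirm these errors on a controller node, the logs under /var/log/upstart can be searched for the signatures above; the exact file names may vary between releases, so a recursive grep is a reasonable starting point:
grep -r "yet replication is not working" /var/log/upstart/
grep -r "Could not write the heartbeat" /var/log/upstart/
Next, check the status of the metrics API service and the number of PostgreSQL connections on the controller leader: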
systemctl status metricapi_server.service
sudo -u postgres psql -p 5000 -d avi -c "SELECT count(*) from pg_stat_activity"
sudo -u postgres psql -p 5000 -d avi -c "SELECT pid, state, state_change, query_start, backend_start, substring(query, 1, 100) from pg_stat_activity"
sudo -u postgres psql -p 5049 -d metrics -c "SELECT pid, state, state_change, query_start, backend_start, substring(query, 1, 100) from pg_stat_activity"
From the output, check how many connections are in the idle state and the state_change timestamp, which indicates since when each connection has been idle. For example:
pid | state | state_change | query_start | backend_start | substring
---------+--------+-------------------------------+-------------------------------+-------------------------------+------------------------------------------------------------------------------------------------------
3988 | | | | 2025-02-28 03:01:10.858488+00 |
3990 | | | | 2025-02-28 03:01:10.859047+00 |
4737 | idle | 2025-04-29 02:08:51.671159+00 | 2025-04-29 02:08:51.671132+00 | 2025-02-28 03:01:21.896646+00
4207 | idle | 2025-04-29 14:35:23.991781+00 | 2025-04-29 14:35:23.991678+00 | 2025-02-28 03:01:16.728935+00
4209 | idle | 2025-04-29 14:35:23.89019+00 | 2025-04-29 14:35:23.890147+00 | 2025-02-28 03:01:16.793347+00
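As an additional check, the connections can be summarized by state, or filtered to those that have been idle for an extended period (the one-hour interval below is only an example and can be adjusted):
sudo -u postgres psql -p 5049 -d metrics -c "SELECT state, count(*) FROM pg_stat_activity GROUP BY state"
sudo -u postgres psql -p 5049 -d metrics -c "SELECT count(*) FROM pg_stat_activity WHERE state = 'idle' AND state_change < now() - interval '1 hour'"
The open TCP connections to the PostgreSQL ports can also be inspected (port 5000 is the avi database, port 5049 is the metrics database):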
sudo netstat -planet | grep 5000
sudo netstat -planet | grep 5049
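For a rough count of established connections per database port, the netstat output can be filtered further, for example:
sudo netstat -planet | grep ':5000' | grep ESTABLISHED | wc -l
sudo netstat -planet | grep ':5049' | grep ESTABLISHED | wc -l
Note that connections made from the same node appear twice (once per endpoint), so treat these numbers as an approximation.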
The issue has been fixed on the following versions:
Until an upgrade is possible, the leaked connections can be reclaimed with one of the following workarounds:
Option 1: Restart the process supervisor on the controller leader by running the following command:
sudo systemctl restart process-supervisor.service
Option 2: Soft reboot the controller leader by running the following command:
reboot -f
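After the restart or reboot, the connection count should drop back to a normal level; the earlier queries can be re-run to confirm, for example:
sudo -u postgres psql -p 5000 -d avi -c "SELECT count(*) from pg_stat_activity"
sudo -u postgres psql -p 5049 -d metrics -c "SELECT count(*) from pg_stat_activity"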