Avi Load Balancer Controller Node Postgres Memory Leak - METRIC_DBSYNC_FAILURE

Article ID: 384404

Updated On:

Products

VMware Avi Load Balancer

Issue/Introduction

The Avi controller nodes could experience high memory usage, and VirtualService / Service Engine metrics do not load in the UI.

A PostgreSQL connection leak can lead to the following symptoms:

     1. The analytics dashboard may fail to display metrics for VirtualServices and Service Engines, potentially coinciding with "METRICS_DBSYNC_FAILURE" alerts on the controller.

     2. The metricapi_server.service is in a failed state on the controller node.

     3. A similar event can be seen in the Controller UI.

Environment

May occur in all cloud types

Affected Versions:

Controller versions 22.1.5 and earlier
Controller version 30.1.1

Cause

The Metrics Manager has a bug that prevents PostgreSQL database connections from being closed properly. The cleanup process occasionally fails and the destructor for db_connection is not always invoked, leaving connections open and resulting in a connection leak.
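
For illustration only (this is not the actual Metrics Manager source; the db_connection class, DSN, and helper names below are hypothetical), a minimal Python sketch of the leak-prone pattern and a leak-safe alternative, assuming a psycopg2-based wrapper:

    import psycopg2

    class db_connection:
        # Leak-prone pattern: cleanup is left to the destructor.
        def __init__(self, dsn):
            self.conn = psycopg2.connect(dsn)  # each instance opens one backend visible in pg_stat_activity

        def query(self, sql):
            with self.conn.cursor() as cur:
                cur.execute(sql)
                return cur.fetchall()

        def __del__(self):
            # __del__ is not guaranteed to run (lingering references, reference cycles,
            # interpreter shutdown), so the backend can be left behind in the "idle" state.
            self.conn.close()

    def safe_query(dsn, sql):
        # Leak-safe pattern: close the connection explicitly, even when the query fails.
        conn = psycopg2.connect(dsn)
        try:
            with conn.cursor() as cur:
                cur.execute(sql)
                return cur.fetchall()
        finally:
            conn.close()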

Resolution

  • Logs show a PostgreSQL replication failure in the postgres-5049 logs:

Path: /var/log/upstart:

[2023-12-30 07:55:16,829] ERROR [postgres_service.failure_count:101] [POSTGRES][metrics_db][localhost] Postgres is UP on the leader, yet replication is not working!
[2023-12-30 07:55:17,503] INFO [postgres_service.check_streaming_from_primary:306] pg_stat_wal_receiver table:
[2023-12-30 07:55:17,507] WARNING [postgres_service.<module>:447] [POSTGRES][metrics_db][localhost] 1 replication failures
[2024-01-08 09:45:16,131] WARNING [postgres_service.run_pg_task:139] Failed to execute task write_heartbeat
[2024-01-08 09:45:16,162] ERROR [postgres_service.write_heartbeat:180] [POSTGRES][metrics_db][localhost] Could not write the heartbeat to the database
[2024-01-08 09:45:17,072] ERROR [postgres_service.failure_count:101] [POSTGRES][metrics_db][localhost] Postgres is UP on the leader, yet replication is not working!
[2024-01-08 09:45:17,903] INFO [postgres_service.check_streaming_from_primary:306] pg_stat_wal_receiver table:
[2024-01-08 09:45:17,910] WARNING [postgres_service.<module>:447] [POSTGRES][metrics_db][localhost] 1 replication failures

  • Check the metricapi_server.service status on all controller nodes:
    systemctl status metricapi_server.service


  • To identify the issue, check the total number of PostgreSQL connections on all controller nodes in the cluster:
    sudo -u postgres psql -p 5000 -d avi -c "SELECT count(*) from pg_stat_activity"
     
  • The following commands help identify the applications holding connections, along with each process ID, state, timestamps, and a substring of the query:
    sudo -u postgres psql -p 5000 -d avi -c "SELECT pid, state, state_change, query_start, backend_start, substring(query, 1, 100) from pg_stat_activity"
    sudo -u postgres psql -p 5049 -d metrics -c "SELECT pid, state, state_change, query_start, backend_start, substring(query, 1, 100) from pg_stat_activity"
    
    From the output, check how many connections are in the idle state and the state_change timestamp (it shows how long each connection has been idle).
    
       pid   | state  |         state_change          |          query_start          |         backend_start         |                                              substring
    ---------+--------+-------------------------------+-------------------------------+-------------------------------+------------------------------------------------------------------------------------------------------
        3988 |        |                               |                               | 2025-02-28 03:01:10.858488+00 |
        3990 |        |                               |                               | 2025-02-28 03:01:10.859047+00 |
        4737 | idle   | 2025-04-29 02:08:51.671159+00 | 2025-04-29 02:08:51.671132+00 | 2025-02-28 03:01:21.896646+00
        4207 | idle   | 2025-04-29 14:35:23.991781+00 | 2025-04-29 14:35:23.991678+00 | 2025-02-28 03:01:16.728935+00
        4209 | idle   | 2025-04-29 14:35:23.89019+00  | 2025-04-29 14:35:23.890147+00 | 2025-02-28 03:01:16.793347+00
    
    
  • Idle connections should not exceed approximately 200-300 (see the connection-count commands after this list).
  • The following commands give a detailed view of the PostgreSQL processes, listing all listening and non-listening sockets:
    sudo netstat -planet | grep 5000
    sudo netstat -planet | grep 5049
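
As a quick check (a minimal variation of the pg_stat_activity queries above, using the same ports and database names already shown), the following commands count connections grouped by state; a large and growing number of idle sessions points to the leak:

    sudo -u postgres psql -p 5000 -d avi -c "SELECT state, count(*) FROM pg_stat_activity GROUP BY state ORDER BY 2 DESC"
    sudo -u postgres psql -p 5049 -d metrics -c "SELECT state, count(*) FROM pg_stat_activity GROUP BY state ORDER BY 2 DESC"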

The issue has been fixed in the following versions:

  • Bug ID - AV-193663
  • Release Notes - AV-193663: Metrics Manager's database connections with Postgres are unclosed, causing a connection leak.
  • Workaround -
    Option 1:
    Restart the process-supervisor service on the leader node. In a three-node cluster setup, this triggers a cluster failover.
    sudo systemctl restart process-supervisor.service

    Option 2: Soft reboot the controller leader by running the following command:

    reboot -f


  • Fix Version - 30.2.1, 30.1.2, 22.1.6, 22.1.5-2p2

 

Additional Information