When querying the Prometheus pool analytics endpoint, the API returns an HTTP 200 response code but no pool data.
Avi Load Balancer and Prometheus
Scalability Issue: The current architecture exhibits limitations in efficiently gathering metrics data when the number of configured pools is large.
API Timeout: The fixed timeout duration between the AnalyticsPortal and the metricsapi_server is 30 seconds. This duration is insufficient for completing metrics collection operations when a high number of pools are present.
Non-Configurable Parameter: As of the current release [31.1.x], the API timeout value is a static, hard-coded parameter within the controller software and cannot be adjusted by the user or administrator.
This document outlines a resolution to address the API timeout issues encountered when retrieving Prometheus metrics data from the Avi Controller, particularly in environments with a large number of configured pools. The current fixed API timeout of 30 seconds between the AnalyticsPortal and the metricsapi_server often proves insufficient for comprehensive data collection from numerous entities.
As previously identified, the primary challenge is that fetching metrics data for a high volume of pools exceeds the default 30-second API timeout. This limitation results in incomplete or failed metrics retrieval, impacting the visibility and monitoring capabilities for large-scale deployments. The lack of a configurable timeout parameter on the controller exacerbates this issue.
entity_id Filtering
The proposed resolution involves making multiple, segmented API calls to the Avi Controller's Prometheus metrics endpoint. Each call will leverage the entity_id query parameter to filter the requested data, with a strict limit on the number of entity_id values included in a single request.
Instead of attempting to retrieve metrics for all pools in a single API call that is likely to time out, the total set of pool UUIDs is divided into smaller batches. A separate API request is then made for each batch, so that the number of entity_id values in any given URL remains below the threshold that triggers a timeout.
The API calls should adhere to the following format:
https://<controller_ip>/api/analytics/prometheus-metrics/pool?entity_id=<pool_uuid1>,<pool_uuid2>,...,<pool_uuidN>
Where:
<controller_ip>: The IP address or hostname of the Avi Controller.
pool: Specifies that metrics for pool entities are being requested.
entity_id: A comma-separated list of pool UUIDs for which metrics are desired.
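As an illustration of the URL format above, the following minimal Python sketch assembles a request URL from a controller address and a list of pool UUIDs. The function name build_pool_metrics_url, the hostname, and the sample UUIDs are hypothetical and used only for this example; authentication is not shown.

```python
from typing import Sequence
from urllib.parse import quote


def build_pool_metrics_url(controller: str, pool_uuids: Sequence[str]) -> str:
    """Build the Prometheus pool metrics URL for the given pool UUIDs.

    `controller` is the IP address or hostname of the Avi Controller.
    """
    if not pool_uuids:
        raise ValueError("at least one pool UUID is required")
    # Percent-encode each UUID individually so the separating commas stay literal.
    entity_ids = ",".join(quote(uuid, safe="") for uuid in pool_uuids)
    return (
        f"https://{controller}/api/analytics/prometheus-metrics/pool"
        f"?entity_id={entity_ids}"
    )


if __name__ == "__main__":
    # Hypothetical controller hostname and pool UUIDs.
    print(build_pool_metrics_url("controller.example.com", ["pool-aaa", "pool-bbb"]))
```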
entity_id Limitation
To effectively circumvent the API timeout, the maximum number of entity_id values included in a single API call must be limited to 100. This limit is crucial for ensuring that each individual request completes within the existing 30-second timeout window.
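The 100-value limit can be enforced with a small batching helper. The sketch below is one possible approach; the names batch_uuids and MAX_ENTITY_IDS_PER_CALL are hypothetical, and the sample UUIDs are placeholders.

```python
from typing import Iterator, List, Sequence

# Maximum number of entity_id values per request, as stated in this article.
MAX_ENTITY_IDS_PER_CALL = 100


def batch_uuids(
    uuids: Sequence[str], size: int = MAX_ENTITY_IDS_PER_CALL
) -> Iterator[List[str]]:
    """Yield consecutive slices of `uuids`, each holding at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(uuids), size):
        yield list(uuids[start:start + size])


if __name__ == "__main__":
    # 250 placeholder UUIDs are split into batches of 100, 100, and 50.
    sample = [f"pool-{i}" for i in range(250)]
    print([len(batch) for batch in batch_uuids(sample)])
```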
To implement this resolution, the following steps should be followed:
Obtain All Pool UUIDs: Gather the complete list of UUIDs for all pools from which metrics are required.
Batching: Divide the comprehensive list of pool UUIDs into batches, with each batch containing a maximum of 100 UUIDs.
Generate API Calls: For each batch, construct a unique API URL using the format shown above, populating the entity_id parameter with the comma-separated UUIDs from that specific batch.
Execute Calls: Execute each generated API call sequentially or in parallel (with appropriate rate limiting to avoid overwhelming the controller) to retrieve the metrics data.
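The steps above can be sketched as a short Python script that batches the UUIDs, builds one URL per batch, and executes the calls sequentially with a pause between them. This is a minimal sketch under stated assumptions: the function names, the pause duration, and the headers argument are hypothetical; how the pool UUIDs are obtained and how requests are authenticated depend on the environment and are not shown.

```python
import time
import urllib.request
from typing import Callable, Dict, List, Optional, Sequence

BATCH_SIZE = 100  # maximum entity_id values per call, per this article


def http_get(url: str, headers: Optional[Dict[str, str]] = None,
             timeout: float = 30.0) -> str:
    """Fetch a URL and return the response body as text.

    Authentication headers are environment-specific and must be supplied
    by the caller; none are assumed here.
    """
    request = urllib.request.Request(url, headers=headers or {})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")


def collect_pool_metrics(
    controller: str,
    pool_uuids: Sequence[str],
    fetch: Callable[[str], str] = http_get,
    pause_seconds: float = 0.5,
) -> List[str]:
    """Retrieve pool metrics in batches and return one response body per batch."""
    responses: List[str] = []
    for start in range(0, len(pool_uuids), BATCH_SIZE):
        batch = pool_uuids[start:start + BATCH_SIZE]
        url = (
            f"https://{controller}/api/analytics/prometheus-metrics/pool"
            f"?entity_id={','.join(batch)}"
        )
        if responses:
            # Simple rate limiting: pause between consecutive calls
            # (the 0.5 s default is an arbitrary illustrative value).
            time.sleep(pause_seconds)
        responses.append(fetch(url))
    return responses
```

The fetch parameter lets the HTTP call be swapped out, for example with a function that adds session credentials or with a stub during testing. Sequential execution is the simplest option; a parallel variant would need its own concurrency limit to avoid overwhelming the controller.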