Alarm Description:
Purpose: Indicates continuous failure to transmit metrics to the NSX Application Platform (NAPP)
Impact: Dashboard and API will not show the metrics from these nodes
Maintenance window required for remediation? No
NSX Application Platform (NAPP)
If NAPP is not deployed but you are still seeing metrics delivery failure alarms, the global metrics configuration must be explicitly disabled.
1. Verify Deployment Status: Run a GET request on any NSX Manager (replace <NSX_IP> with your manager's IP):
GET https://<NSX_IP>/policy/api/v1/infra/sites/default/napp/deployment/platform/status
Expected response if completely undeployed: a status of NOT_DEPLOYED.
2. Fetch Current Global Metrics Configuration:
If the status returns NOT_DEPLOYED, fetch the current metric configuration using a GET request:
GET https://<NSX_IP>/policy/api/v1/infra/metric-global-config
3. Disable Global Metrics Ingestion:
Take the JSON payload from the step above, change the "enabled" flag from true to false, and send it back via a PATCH request to the same endpoint:
PATCH https://<NSX_IP>/policy/api/v1/infra/metric-global-config
Example Request Body:
{
"enabled": false,
"id": "metric-global-config",
"resource_type": "MetricGlobalConfig"
// ... keep the rest of the original fetched JSON intact
}
4. Validate:
Perform the GET request from Step 2 again to ensure "enabled": false is retained.
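Steps 2 through 4 amount to a read-modify-write cycle. The following is a minimal Python sketch of the "modify" part only, assuming the fetched configuration is a JSON object with a top-level "enabled" key as in the example request body. The function name build_disable_payload and the sample dictionary are illustrative and not part of the NSX API; sending the result still requires an authenticated PATCH to the endpoint from Step 3.

```python
import copy
import json


def build_disable_payload(current_config: dict) -> dict:
    """Return a copy of the fetched config with "enabled" set to False.

    Every other key is carried over untouched, so the body sent back
    matches what the GET in Step 2 returned.
    """
    if "enabled" not in current_config:
        raise KeyError('fetched config has no "enabled" key')
    payload = copy.deepcopy(current_config)
    payload["enabled"] = False
    return payload


# Illustrative input only: a real response contains more fields.
fetched = {
    "enabled": True,
    "id": "metric-global-config",
    "resource_type": "MetricGlobalConfig",
}

body = json.dumps(build_disable_payload(fetched), indent=2)
print(body)
```

Working on a copy keeps the original response available for comparison during the validation in Step 4.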
If NAPP is actively deployed, you need to identify why the metrics transmission is failing by pulling the specific error status code.
NSX Build >= 4.2: Look directly at the UI alarm description; the failure status code is highlighted there.
NSX Build < 4.2: Identify the "Reported by node" inside the alarm, log into that specific node, and access the logs:
NSX Manager / NSX Edge: /var/log/syslog*
ESXi Host: /var/run/log/nsx-syslog*
Run the following command to find the precise failure block:
grep "Failed to send one msg" {log_file}
Find the respective status code discovered in the step above and apply the relevant workaround:
| Status Code | Potential Root Cause & Remediation Steps |
| UNAUTHENTICATED | * Please refer directly to the internal documentation for KB 316739. |
| or | * Firewall Blockages: Check if a physical/logical firewall between the Transport Node (Edge/ESXi) and NAPP is blocking port HTTPS (443). Reference VMware Ports & Protocols for specifics. * Target Verification: Run a * Widespread Alerts: If multiple Edges/Hosts show this concurrently, check for an active |
| RESOURCE_EXHAUSTED | * Disk Pressure Guardrails: This happens when PostgreSQL pods run low on space and pause ingestion to safeguard data. * Check if the alarm mentions "System is out of resources, metrics ingestion is currently paused" or if a corresponding "Metrics Disk usage high" alarm exists. * Check for split-brain issues between database pods. If split-brain is present, perform a rollout restart: |
| PERMISSION_DENIED | * Auth-Server Pod Failure: Check the Envoy proxy log flags via the manager console: * If you see a * If stopped, contact support for service recovery. |
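When several rotated log files have to be checked, the grep search for "Failed to send one msg" described before the table can also be scripted. This is a rough Python sketch, assuming only that the status code appears somewhere on the same line as the failure marker. The exact log line layout is not documented here, so the code looks for any standard gRPC status code name instead of relying on a fixed position. The function name count_failure_codes and the NO_CODE_FOUND label are illustrative.

```python
import re
from collections import Counter
from typing import Iterable

MARKER = "Failed to send one msg"

# Standard gRPC status code names; the table covers a subset of them.
GRPC_CODES = (
    "CANCELLED", "UNKNOWN", "INVALID_ARGUMENT", "DEADLINE_EXCEEDED",
    "NOT_FOUND", "ALREADY_EXISTS", "PERMISSION_DENIED", "RESOURCE_EXHAUSTED",
    "FAILED_PRECONDITION", "ABORTED", "OUT_OF_RANGE", "UNIMPLEMENTED",
    "INTERNAL", "UNAVAILABLE", "DATA_LOSS", "UNAUTHENTICATED",
)
CODE_RE = re.compile(r"\b(" + "|".join(GRPC_CODES) + r")\b")


def count_failure_codes(lines: Iterable[str]) -> Counter:
    """Count status codes on lines that contain the failure marker."""
    counts: Counter = Counter()
    for line in lines:
        if MARKER not in line:
            continue  # not a metrics delivery failure line
        match = CODE_RE.search(line)
        counts[match.group(1) if match else "NO_CODE_FOUND"] += 1
    return counts


# Usage (path taken from the log locations listed earlier):
#   with open("/var/log/syslog", errors="replace") as handle:
#       print(count_failure_codes(handle))
```

A count per code makes it easier to see whether one status dominates before picking a row from the table.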
Note: If the remediation steps above do not clear the alert, restart the SHA agent manually on the "Reported by node":
NSX Manager / NSX Edge: service nsx-sha restart
ESXi Host: /etc/init.d/netopad restart
If the target Address is METRICS_MUX: This indicates the manager is trying to forward telemetry onwards to NSX+. If you encounter UNAVAILABLE here, confirm the manager is actively onboarded to NSX+ and check for missing or mismatched FQDN mappings in the X509v3 Subject Alternative Name (SAN) of your CA/Self-Signed certificate. Reset using service nsx-metrics-agents restart.
NAPP Status: Stable (Metrics and Intelligence services are reported as UP).
Alarms: Multiple intermittent alerts are raised and automatically resolved stating:
"Metrics Delivery Failure: Failed to deliver metrics to target. Status code: RESOURCE_EXHAUSTED"
Resource Profile: Core services and metric resources show minimal usage on the surface.
The system detected disk pressure on the metrics-postgresql pods. As a built-in safety guardrail, it temporarily paused metrics ingestion to prevent database corruption or total disk exhaustion. This is indicated by the following signature in the logs:
W1013 15:30:01.533623 39 db_access.cc:6042] DB disk usage percentage crossed 90 pausing ingestion
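To check collected pod logs for this signature programmatically, a small pattern match is enough. The sketch below is built solely from the log line quoted above and assumes the wording stays the same across builds; the function name paused_threshold is illustrative.

```python
import re
from typing import Optional

# Pattern taken from the log signature quoted above.
PAUSE_RE = re.compile(
    r"DB disk usage percentage crossed (\d+) pausing ingestion"
)


def paused_threshold(line: str) -> Optional[int]:
    """Return the crossed percentage if the line reports paused ingestion."""
    match = PAUSE_RE.search(line)
    return int(match.group(1)) if match else None


line = ("W1013 15:30:01.533623 39 db_access.cc:6042] "
        "DB disk usage percentage crossed 90 pausing ingestion")
print(paused_threshold(line))  # prints 90
```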
Verify if a split-brain condition exists between the active database pods (metrics-postgresql-ha-postgresql-0 and -1).
Exec into Pod 0 and check the cluster status:
napp-k exec -it metrics-postgresql-ha-postgresql-0 bash -- repmgr -f build/repmgr/conf/repmgr.conf cluster show --compact
Exec into Pod 1 and check the cluster status:
napp-k exec -it metrics-postgresql-ha-postgresql-1 bash -- repmgr -f build/repmgr/conf/repmgr.conf cluster show --compact
Remediation: If a split-brain condition is visible from the outputs, perform a rolling restart of the statefulset to force re-election:
napp-k rollout restart statefulset metrics-postgresql-ha-postgresql
Log into both Postgres replicas to identify the primary write node and confirm that the disk usage helper functions exist.
Connect to Pod 0 Database:
napp-k exec -it metrics-postgresql-ha-postgresql-0 bash
PGPASSWORD=$POSTGRES_PASSWORD psql -w -U postgres -d metrics -h 127.0.0.1
Run the following validation queries:
SELECT pg_is_in_recovery();
SELECT get_disk_usage_percentage();
SELECT clear_inactive_replication_slots();
(Type \q to exit the Postgres prompt when done).
Connect to Pod 1 Database:
napp-k exec -it metrics-postgresql-ha-postgresql-1 bash
PGPASSWORD=$POSTGRES_PASSWORD psql -w -U postgres -d metrics -h 127.0.0.1
Run the same validation queries:
SELECT pg_is_in_recovery();
SELECT get_disk_usage_percentage();
SELECT clear_inactive_replication_slots();
Note: The node where pg_is_in_recovery() returns f (false) is your current Primary Write Node.
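The rule in the note can be written down as a short check. The following Python sketch assumes the pg_is_in_recovery() results from both pods have been recorded by hand as booleans (f as False, t as True); the function names are illustrative. Anything other than exactly one primary is only flagged here, and may point back to the split-brain check in the previous step.

```python
from typing import Dict, List


def find_primaries(in_recovery: Dict[str, bool]) -> List[str]:
    """Return pods where pg_is_in_recovery() was false (psql prints f)."""
    return sorted(pod for pod, recovering in in_recovery.items()
                  if not recovering)


def describe_roles(in_recovery: Dict[str, bool]) -> str:
    """Summarize the roles implied by the recorded query results."""
    primaries = find_primaries(in_recovery)
    if len(primaries) == 1:
        return f"primary: {primaries[0]}"
    if not primaries:
        return "no primary: every pod is in recovery"
    return "multiple primaries: " + ", ".join(primaries)


results = {
    "metrics-postgresql-ha-postgresql-0": False,  # psql printed f
    "metrics-postgresql-ha-postgresql-1": True,   # psql printed t
}
print(describe_roles(results))
```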
If Step 2 throws errors indicating that functions do not exist (e.g., ERROR: function get_disk_usage_percentage() does not exist), you must recreate them.
Log into the Primary Node (where recovery returned f) and execute the following database scripts sequentially:
CREATE OR REPLACE FUNCTION get_disk_usage_percentage()
RETURNS integer as $$
DECLARE
    disk_usage integer;
BEGIN
    CREATE TEMP TABLE IF NOT EXISTS tmp_sys_df (content text) ON COMMIT DROP;
    /* Get the df output from the OS and extract the percentage usage value */
    COPY tmp_sys_df FROM PROGRAM 'OUTPUT=$(df $PGDATA | tail -n +2) && echo $OUTPUT';
    disk_usage=(SELECT SPLIT_PART(SPLIT_PART(content, ' ', 5),'%',1)::integer FROM tmp_sys_df);
    RETURN disk_usage;
END;
$$ LANGUAGE plpgsql;
CREATE OR REPLACE FUNCTION clear_inactive_replication_slots()
RETURNS void as $$
DECLARE
    slot_names varchar;
    node_count integer;
    replication_slots_count integer;
    inactive_replication_slots_count integer;
BEGIN
    CREATE TEMP TABLE IF NOT EXISTS tmp_cluster_nodes (content text) ON COMMIT DROP;
    /* Get the number of cluster nodes from the repmgr db */
    COPY tmp_cluster_nodes FROM PROGRAM 'PGPASSWORD=$REPMGR_PASSWORD psql -w -U $REPMGR_USERNAME -d $REPMGR_DATABASE -h 127.0.0.1 -t -c ''select count(node_id) from nodes;''';
    node_count=(SELECT TRIM(content)::integer FROM tmp_cluster_nodes LIMIT 1);
    RAISE INFO 'Number of nodes in the cluster: %', node_count;
    /* Get total replication slots */
    replication_slots_count=(SELECT count(slot_name) from pg_replication_slots);
    RAISE INFO 'Number of replication slots: %', replication_slots_count;
    /* Get inactive replication slots count */
    inactive_replication_slots_count=(SELECT count(slot_name) FROM pg_replication_slots WHERE active_pid IS NULL OR active_pid NOT IN(SELECT pid FROM pg_stat_replication) AND active IS NOT TRUE);
    RAISE INFO 'Number of inactive replication slots: %', inactive_replication_slots_count;
    /* Safely drop inactive slots only if total slots match or exceed cluster node count */
    IF replication_slots_count >= node_count THEN
        FOR slot_names IN SELECT slot_name FROM pg_replication_slots WHERE active_pid IS NULL OR active_pid NOT IN(SELECT pid FROM pg_stat_replication) AND active IS NOT TRUE
        LOOP
            RAISE INFO 'Dropping inactive replication slot: %', slot_names;
            PERFORM pg_drop_replication_slot(slot_names);
        END LOOP;
    ELSE
        IF inactive_replication_slots_count > 0 THEN
            RAISE INFO 'Number of replication slots is less than number of nodes. Skipping drop optimization.';
        ELSE
            RAISE INFO 'No inactive replication slots found.';
        END IF;
    END IF;
END;
$$ LANGUAGE plpgsql;
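The parsing inside get_disk_usage_percentage() combines shell word splitting with SPLIT_PART, which can be hard to follow. The Python sketch below restates the same steps for illustration only: the unquoted echo $OUTPUT collapses runs of whitespace into single spaces, after which the fifth space-separated field is the Use% column of df. The sample df output is hypothetical (device, sizes, and mount point are made up), and the function name is illustrative.

```python
def usage_percent_from_df(df_output: str) -> int:
    """Mirror the SQL parsing: skip the header, take field 5, strip '%'."""
    # Like `tail -n +2`: drop the header line.
    data_lines = df_output.splitlines()[1:]
    # Like unquoted `echo $OUTPUT`: collapse all whitespace to single spaces.
    collapsed = " ".join(" ".join(data_lines).split())
    # Like SPLIT_PART(content, ' ', 5): fifth field is the Use% column.
    fifth_field = collapsed.split(" ")[4]
    # Like SPLIT_PART(..., '%', 1)::integer: keep the digits before '%'.
    return int(fifth_field.split("%")[0])


# Hypothetical df output; the device, sizes, and mount point are made up.
sample = (
    "Filesystem     1K-blocks     Used Available Use% Mounted on\n"
    "/dev/sdb        10190100  9272991    900725  92% /data\n"
)
print(usage_percent_from_df(sample))  # prints 92
```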
Once both functions have been created on the primary database instance, run the cleanup and then re-check disk usage:
SELECT clear_inactive_replication_slots();
SELECT get_disk_usage_percentage();
Verify that both commands return results with no syntax or missing-function errors. Once the inactive replication slots have been dropped, disk pressure on the database pods should decline and the RESOURCE_EXHAUSTED alerts should clear automatically.
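To make the guard in clear_inactive_replication_slots() easier to reason about before running it, the selection logic can be restated outside the database. This Python sketch is an illustration under stated assumptions: each slot is a plain dictionary carrying the slot_name, active, and active_pid columns of pg_replication_slots, and replication_pids stands for the pid values in pg_stat_replication. The function names and sample slot names are hypothetical.

```python
from typing import Dict, List, Set


def is_inactive(slot: Dict, replication_pids: Set[int]) -> bool:
    """Same predicate as the SQL WHERE clause (AND binds tighter than OR)."""
    pid = slot["active_pid"]
    return pid is None or (
        pid not in replication_pids and slot["active"] is not True
    )


def slots_to_drop(slots: List[Dict], node_count: int,
                  replication_pids: Set[int]) -> List[str]:
    """Return slot names that would be dropped, or [] if the guard blocks it."""
    if len(slots) < node_count:
        return []  # fewer slots than cluster nodes: the function skips the drop
    return [s["slot_name"] for s in slots if is_inactive(s, replication_pids)]


# Hypothetical cluster state: one streaming slot and two stale ones.
slots = [
    {"slot_name": "slot_a", "active": True, "active_pid": 101},
    {"slot_name": "slot_b", "active": False, "active_pid": None},
    {"slot_name": "slot_c", "active": False, "active_pid": None},
]
print(slots_to_drop(slots, node_count=2, replication_pids={101}))
```

Note that a slot with a NULL active_pid is selected regardless of the other two conditions, because of how AND and OR group in the WHERE clause.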