Metrics Delivery Failure alarm related to NSX Application Platform (NAPP) in NSX UI

Article ID: 441414

Updated On:

Products

VMware NSX
VMware vDefend Firewall with Advanced Threat Prevention

Issue/Introduction

Alarm Description:

Purpose: Indicates a continuous failure to transmit metrics to NSX Application Platform (NAPP)
Impact: The dashboard and API will not show metrics from the affected nodes
Maintenance window required for remediation? No

Environment

NSX Application Platform (NAPP)

Resolution

Scenario A: NAPP is NOT Deployed / Has Been Removed

If NAPP is not deployed but you are still seeing metrics delivery failure alarms, the global metrics configuration must be explicitly disabled.

1. Verify Deployment Status: Send a GET request to any NSX Manager (replace <NSX_IP> with your manager's IP):

GET https://<NSX_IP>/policy/api/v1/infra/sites/default/napp/deployment/platform/status

Expected response if completely undeployed:

{     
    "overall_status": "NOT_DEPLOYED",     
    "percentage": 0 
}

2. Fetch Current Global Metrics Configuration:

If the status returns NOT_DEPLOYED, fetch the current metric configuration using a GET request:

GET https://<NSX_IP>/policy/api/v1/infra/metric-global-config

3. Disable Global Metrics Ingestion:

Take the JSON payload from the step above, change the "enabled" flag from true to false, and send it back via a PATCH request to the same endpoint:

PATCH https://<NSX_IP>/policy/api/v1/infra/metric-global-config

Example Request Body:

{     
    "enabled": false,
    "id": "metric-global-config",
    "resource_type": "MetricGlobalConfig"
    // ... keep the rest of the original fetched JSON intact
}

4. Validate:

Perform the GET request from Step 2 again to ensure "enabled": false is retained.
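
The four steps above can be sketched in Python. This is a minimal illustration rather than a supported tool: the helper names (napp_is_undeployed, build_disable_request, metrics_disabled) are hypothetical, the address 192.0.2.10 is a placeholder, and authentication and certificate handling are omitted because this article does not specify them. The sketch only builds the PATCH request; it does not send it.

```python
import copy
import json
import urllib.request

CONFIG_PATH = "/policy/api/v1/infra/metric-global-config"


def napp_is_undeployed(status_body: str) -> bool:
    """Step 1: True when the platform status response reports NOT_DEPLOYED."""
    return json.loads(status_body).get("overall_status") == "NOT_DEPLOYED"


def build_disable_request(nsx_ip: str, fetched_config: dict) -> urllib.request.Request:
    """Step 3: build (but do not send) the PATCH that disables metrics."""
    body = copy.deepcopy(fetched_config)  # keep every fetched field intact
    body["enabled"] = False               # the only value that changes
    return urllib.request.Request(
        f"https://{nsx_ip}{CONFIG_PATH}",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PATCH",
    )


def metrics_disabled(config_body: str) -> bool:
    """Step 4: True when the re-fetched configuration has "enabled": false."""
    return json.loads(config_body).get("enabled") is False


# Example using the payloads shown in this article (placeholder address).
status = '{"overall_status": "NOT_DEPLOYED", "percentage": 0}'
fetched = {"enabled": True, "id": "metric-global-config",
           "resource_type": "MetricGlobalConfig"}
if napp_is_undeployed(status):
    request = build_disable_request("192.0.2.10", fetched)
    print(request.get_method(), request.full_url)
```

Copying the fetched configuration before changing it keeps every other field exactly as the manager returned it, which is what step 3 asks for.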

Scenario B: NAPP IS Deployed

If NAPP is actively deployed, you need to identify why the metrics transmission is failing by pulling the specific error status code.

1. Identify the Status Code

NSX Build >= 4.2: Check the alarm description in the UI; the failure status code is highlighted there.
NSX Build < 4.2: Note the "Reported by node" shown in the alarm, log in to that node, and search its logs:
- NSX Manager / NSX Edge: /var/log/syslog*
- ESXi Host: /var/run/log/nsx-syslog*
Run the following command against the relevant log file to find the failure entries:
  grep "Failed to send one msg" {log_file}
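
When the log contains many matches, it can help to tally which status codes appear. The following is a rough sketch under stated assumptions: the function name is hypothetical, the exact layout of the log entry is not shown in this article, and the sample lines are made up purely to exercise the code. It only looks for the status code names listed in the remediation table on lines that contain the failure marker.

```python
from collections import Counter

MARKER = "Failed to send one msg"
KNOWN_STATUS_CODES = (
    "UNAUTHENTICATED",
    "UNAVAILABLE",
    "DEADLINE_EXCEEDED",
    "RESOURCE_EXHAUSTED",
    "PERMISSION_DENIED",
)


def count_status_codes(lines):
    """Count known status codes on lines that contain the failure marker."""
    counts = Counter()
    for line in lines:
        if MARKER not in line:
            continue  # same filter as the grep command
        for code in KNOWN_STATUS_CODES:
            if code in line:
                counts[code] += 1
    return counts


# Hypothetical log lines; real entries will have a different layout.
sample_lines = [
    "<timestamp> Failed to send one msg ... RESOURCE_EXHAUSTED",
    "<timestamp> unrelated entry mentioning UNAVAILABLE",
    "<timestamp> Failed to send one msg ... RESOURCE_EXHAUSTED",
]
print(count_status_codes(sample_lines).most_common())
```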

2. Remediation Based on Status Code

Find the respective status code discovered in the step above and apply the relevant workaround:

Status Code: UNAUTHENTICATED
- Refer to the documentation in KB 316739.

Status Code: `UNAVAILABLE` or `DEADLINE_EXCEEDED`
- Firewall Blockages: Check whether a physical or logical firewall between the Transport Node (Edge/ESXi) and NAPP is blocking HTTPS (port 443). Refer to VMware Ports & Protocols for specifics.
- Target Verification: Run a `GET` request to `/api/v1/infra/sites/napp/registration` on the managing NSX Manager. Check that the returned `"ingress_ip_address"` field matches your `{metrics_target_address}`, and ensure that DNS lookup resolves properly on the reporting node.
- Widespread Alerts: If multiple Edges/Hosts show this alarm concurrently, check for an active `nsx_application_platform_communication.manager_disconnected` alarm and resolve that first.

Status Code: RESOURCE_EXHAUSTED
- Disk Pressure Guardrails: This happens when the PostgreSQL pods run low on disk space and pause ingestion to safeguard data.
- Check whether the alarm mentions "System is out of resources, metrics ingestion is currently paused" or whether a corresponding "Metrics Disk usage high" alarm exists.
- Check for split-brain issues between the database pods. If split-brain is present, perform a rollout restart: `napp-k rollout restart statefulset metrics-postgresql-ha-postgresql`

Status Code: PERMISSION_DENIED
- Auth-Server Pod Failure: Check the Envoy proxy log flags from the manager console: `napp-k logs {projectcontour-envoy-pod-name} -c envoy -n projectcontour`
- If you see a `UAEX` flag (UnauthorizedExternalService), it usually means the internal `auth-server` pod has crashed or stopped. Check its status using: `napp-k get pods | grep auth`
- If it has stopped, contact support for service recovery.
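
The target verification for UNAVAILABLE or DEADLINE_EXCEEDED can be illustrated with a short Python sketch. It assumes the registration response has already been fetched as text and that both values are bare host names or IP addresses; the function names and the sample address are hypothetical. The DNS check is only meaningful when run on the reporting node itself.

```python
import json
import socket


def ingress_matches_target(registration_body: str, metrics_target_address: str) -> bool:
    """Compare "ingress_ip_address" from the response with the metrics target."""
    registration = json.loads(registration_body)
    ingress = str(registration.get("ingress_ip_address", ""))
    return ingress.strip().lower() == metrics_target_address.strip().lower()


def resolves(hostname: str) -> bool:
    """Return True when the local resolver can resolve the name (or IP)."""
    try:
        socket.getaddrinfo(hostname, 443)
        return True
    except socket.gaierror:
        return False


# Placeholder response body for illustration only.
sample_registration = '{"ingress_ip_address": "192.0.2.20"}'
print(ingress_matches_target(sample_registration, "192.0.2.20"))
```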

Note: If the remediation steps above do not clear the alert, restart the SHA agent manually on the "Reported by node":

NSX Manager / NSX Edge: service nsx-sha restart
ESXi Host: /etc/init.d/netopad restart

If the target address is METRICS_MUX: The manager is trying to forward telemetry onward to NSX+. If you encounter UNAVAILABLE in this case, confirm that the manager is onboarded to NSX+ and check for missing or mismatched FQDN entries in the X509v3 Subject Alternative Name (SAN) of your CA-signed or self-signed certificate. Then restart the metrics agents using service nsx-metrics-agents restart.
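
As an illustration of the SAN check, the sketch below tests whether an FQDN appears among the DNS entries of a certificate. It assumes the certificate is available as the dictionary that Python's ssl.SSLSocket.getpeercert() returns for a validated peer; the function name and host names are hypothetical, and wildcard entries are deliberately not handled.

```python
def fqdn_in_san(cert: dict, fqdn: str) -> bool:
    """Return True when fqdn exactly matches a DNS entry in subjectAltName."""
    wanted = fqdn.lower().rstrip(".")
    for kind, value in cert.get("subjectAltName", ()):
        # Entries are (type, value) pairs such as ("DNS", "host.example.com").
        if kind == "DNS" and value.lower().rstrip(".") == wanted:
            return True
    return False


# Hypothetical certificate content for illustration only.
sample_cert = {
    "subjectAltName": (
        ("DNS", "nsx-mgr-01.example.com"),
        ("IP Address", "192.0.2.10"),
    )
}
print(fqdn_in_san(sample_cert, "nsx-mgr-01.example.com"))
```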

Additional Information

Symptom & Behavior

NAPP Status: Stable (Metrics and Intelligence services are reported as UP).
Alarms: Multiple intermittent alerts are raised and automatically resolved stating:
"Metrics Delivery Failure: Failed to deliver metrics to target. Status code: RESOURCE_EXHAUSTED"
Resource Profile: Core services and metric resources show minimal usage on the surface.

Root Cause

The system detected disk pressure on the metrics-postgresql pods. As a built-in safety guardrail, it temporarily paused metrics ingestion to prevent database corruption or total disk exhaustion. This is indicated by the following signature in the logs:

W1013 15:30:01.533623 39 db_access.cc:6042] DB disk usage percentage crossed 90 pausing ingestion
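
To spot this signature in collected logs, a regular expression can extract the threshold from the message. This is a small sketch with a hypothetical function name; it assumes the message text matches the line shown above.

```python
import re

PAUSE_PATTERN = re.compile(
    r"DB disk usage percentage crossed (\d+) pausing ingestion"
)


def pause_threshold(line: str):
    """Return the threshold from a pause message, or None if absent."""
    match = PAUSE_PATTERN.search(line)
    return int(match.group(1)) if match else None


log_line = (
    "W1013 15:30:01.533623 39 db_access.cc:6042] "
    "DB disk usage percentage crossed 90 pausing ingestion"
)
print(pause_threshold(log_line))  # prints 90
```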

Step-by-Step Triage & Database Remediation

Step 1: Check for Database Split-Brain

Verify if a split-brain condition exists between the active database pods (metrics-postgresql-ha-postgresql-0 and -1).

Exec into Pod 0 and check the cluster status:

napp-k exec -it metrics-postgresql-ha-postgresql-0 bash -- repmgr -f build/repmgr/conf/repmgr.conf cluster show --compact

Exec into Pod 1 and check the cluster status:

napp-k exec -it metrics-postgresql-ha-postgresql-1 bash -- repmgr -f build/repmgr/conf/repmgr.conf cluster show --compact

Remediation: If a split-brain condition is visible from the outputs, perform a rolling restart of the statefulset to force re-election:
```
napp-k rollout restart statefulset metrics-postgresql-ha-postgresql
```
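
Comparing the two outputs by eye can be error-prone, so the following sketch shows one way to compare them. It is an assumption-laden illustration: it expects the pipe-separated table that repmgr prints for cluster show (ID, Name, Role, Status, and further columns), the function names are hypothetical, and the sample outputs are made up. It flags a possible split-brain when the two views together name more than one running primary, and it does not interpret other warning statuses repmgr may print.

```python
def primaries_in_view(cluster_show_output: str) -> set:
    """Return node names that one pod's view lists as a running primary."""
    names = set()
    for line in cluster_show_output.splitlines():
        cells = [cell.strip() for cell in line.split("|")]
        # Data rows start with a numeric node ID; header and ruler rows do not.
        if len(cells) >= 4 and cells[0].isdigit():
            role, status = cells[2].lower(), cells[3].lower()
            if role == "primary" and "running" in status:
                names.add(cells[1])
    return names


def looks_like_split_brain(view_pod0: str, view_pod1: str) -> bool:
    """True when the two views together report more than one primary."""
    return len(primaries_in_view(view_pod0) | primaries_in_view(view_pod1)) > 1


# Hypothetical outputs for illustration only.
HEALTHY = """\
 ID   | Name   | Role    | Status    | Upstream
------+--------+---------+-----------+----------
 1000 | node-0 | primary | * running |
 1001 | node-1 | standby |   running | node-0
"""
DISAGREEING = """\
 ID   | Name   | Role    | Status    | Upstream
------+--------+---------+-----------+----------
 1000 | node-0 | standby |   running | node-1
 1001 | node-1 | primary | * running |
"""
print(looks_like_split_brain(HEALTHY, DISAGREEING))
```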

Step 2: Validate Inactive Replication Slots

Log in to both PostgreSQL pods to identify the primary (write) node and to confirm that the disk usage and replication slot helper functions are available.

Connect to Pod 0 Database:

napp-k exec -it metrics-postgresql-ha-postgresql-0 bash
PGPASSWORD=$POSTGRES_PASSWORD psql -w -U postgres -d metrics -h 127.0.0.1

Run the following validation queries:

SELECT pg_is_in_recovery(); 
SELECT get_disk_usage_percentage(); 
SELECT clear_inactive_replication_slots();

(Type \q to exit the Postgres prompt when done).

Connect to Pod 1 Database:

napp-k exec -it metrics-postgresql-ha-postgresql-1 bash
PGPASSWORD=$POSTGRES_PASSWORD psql -w -U postgres -d metrics -h 127.0.0.1

Run the same validation queries:

SELECT pg_is_in_recovery(); 
SELECT get_disk_usage_percentage(); 
SELECT clear_inactive_replication_slots();

Note: The node where pg_is_in_recovery() returns f (False) is your current Primary Write Node.
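
The rule in the note can be expressed as a small helper. This sketch uses a hypothetical function name and assumes you have recorded the pg_is_in_recovery() result (t or f) for each pod; it raises an error when zero or several pods report f, since that would itself warrant a closer look at the cluster state.

```python
def find_primary(recovery_by_pod: dict) -> str:
    """Return the pod whose pg_is_in_recovery() result is f (false)."""
    primaries = [
        pod for pod, value in recovery_by_pod.items()
        if value.strip().lower() in ("f", "false")
    ]
    if len(primaries) != 1:
        raise ValueError(f"expected exactly one primary, found {len(primaries)}")
    return primaries[0]


results = {
    "metrics-postgresql-ha-postgresql-0": "t",
    "metrics-postgresql-ha-postgresql-1": "f",
}
print(find_primary(results))  # prints metrics-postgresql-ha-postgresql-1
```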

Step 3: Define Missing Disk and Replication Maintenance Functions

If Step 2 throws errors indicating that functions do not exist (e.g., ERROR: function get_disk_usage_percentage() does not exist), you must recreate them.

Log in to the primary node (where pg_is_in_recovery() returned f) and run the following SQL scripts in order:

A. Create Disk Usage Tracking Function

CREATE OR REPLACE FUNCTION get_disk_usage_percentage() 
RETURNS integer as $$ 
DECLARE 
    disk_usage integer; 
BEGIN 
    CREATE TEMP TABLE IF NOT EXISTS tmp_sys_df (content text) ON COMMIT DROP;
    
    /* Get the df output from the OS and extract the percentage usage value */
    COPY tmp_sys_df FROM PROGRAM 'OUTPUT=$(df $PGDATA | tail -n +2) && echo $OUTPUT'; 
    disk_usage=(SELECT SPLIT_PART(SPLIT_PART(content, ' ', 5),'%',1)::integer FROM tmp_sys_df); 
    RETURN disk_usage; 
END; 
$$ LANGUAGE plpgsql;
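
To make the parsing in this function easier to follow, here is a rough Python equivalent of what it does with the df output: skip the header line, collapse whitespace, take the fifth field, and strip the percent sign. The function name and the sample df output are hypothetical, and the sketch assumes the standard df column order with Use% in the fifth column.

```python
def disk_usage_percentage(df_output: str) -> int:
    """Extract the Use% value from df output, as the SQL function does."""
    data_lines = df_output.strip().splitlines()[1:]      # tail -n +2
    collapsed = " ".join(" ".join(data_lines).split())   # echo $OUTPUT
    fifth_field = collapsed.split(" ")[4]                # SPLIT_PART(..., ' ', 5)
    return int(fifth_field.split("%")[0])                # SPLIT_PART(..., '%', 1)


# Hypothetical df output for illustration only.
sample_df = (
    "Filesystem     1K-blocks     Used Available Use% Mounted on\n"
    "/dev/sdb1      103081248 94834752   3005136  97% /data\n"
)
print(disk_usage_percentage(sample_df))  # prints 97
```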

B. Create Inactive Replication Slot Cleanup Function

CREATE OR REPLACE FUNCTION clear_inactive_replication_slots() 
RETURNS void as $$ 
DECLARE 
    slot_names varchar; 
    node_count integer; 
    replication_slots_count integer; 
    inactive_replication_slots_count integer; 
BEGIN
    CREATE TEMP TABLE IF NOT EXISTS tmp_cluster_nodes (content text) ON COMMIT DROP; 
    
    /* Get the number of cluster nodes from the repmgr db */
    COPY tmp_cluster_nodes FROM PROGRAM 'PGPASSWORD=$REPMGR_PASSWORD psql -w -U $REPMGR_USERNAME -d $REPMGR_DATABASE -h 127.0.0.1 -t -c ''select count(node_id) from nodes;'''; 
    node_count=(SELECT TRIM(content)::integer FROM tmp_cluster_nodes LIMIT 1); 
    RAISE INFO 'Number of nodes in the cluster: %', node_count;

    /* Get total replication slots */
    replication_slots_count=(SELECT count(slot_name) from pg_replication_slots); 
    RAISE INFO 'Number of replication slots: %', replication_slots_count;

    /* Get inactive replication slots count */
    inactive_replication_slots_count=(SELECT count(slot_name) FROM pg_replication_slots WHERE active_pid IS NULL OR active_pid NOT IN(SELECT pid FROM pg_stat_replication) AND active IS NOT TRUE); 
    RAISE INFO 'Number of inactive replication slots: %', inactive_replication_slots_count;

    /* Safely drop inactive slots only if total slots match or exceed cluster node count */
    IF replication_slots_count >= node_count THEN 
        FOR slot_names IN SELECT slot_name FROM pg_replication_slots WHERE active_pid IS NULL OR active_pid NOT IN(SELECT pid FROM pg_stat_replication) AND active IS NOT TRUE 
        LOOP 
            RAISE INFO 'Dropping inactive replication slot: %', slot_names; 
            PERFORM pg_drop_replication_slot(slot_names); 
        END LOOP; 
    ELSE 
        IF inactive_replication_slots_count > 0 THEN 
            RAISE INFO 'Number of replication slots is less than number of nodes. Skipping drop optimization.'; 
        ELSE 
            RAISE INFO 'No inactive replication slots found.'; 
        END IF; 
    END IF; 
END; 
$$ LANGUAGE plpgsql;
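
The drop decision in this function can be modeled outside the database to see which slots it would select. The sketch below is an illustrative model, not the function itself: the Slot type, function names, and slot names are hypothetical. It mirrors the SQL operator precedence, where AND binds tighter than OR, so a slot counts as inactive when active_pid is NULL, or when its pid is absent from pg_stat_replication and active is not true.

```python
from typing import NamedTuple, Optional


class Slot(NamedTuple):
    slot_name: str
    active: Optional[bool]
    active_pid: Optional[int]


def is_inactive(slot: Slot, replication_pids: set) -> bool:
    """Mirror the WHERE clause: A OR (B AND C)."""
    return slot.active_pid is None or (
        slot.active_pid not in replication_pids and slot.active is not True
    )


def slots_to_drop(slots, node_count: int, replication_pids: set) -> list:
    """Return slot names the function would drop for the given state."""
    if len(slots) >= node_count:
        return [s.slot_name for s in slots if is_inactive(s, replication_pids)]
    return []  # fewer slots than nodes: nothing is dropped


# Hypothetical slot state for illustration only.
slots = [Slot("slot_a", True, 101), Slot("slot_b", False, None)]
print(slots_to_drop(slots, node_count=2, replication_pids={101}))
```

With these sample values only slot_b is selected, because it has no active_pid, while slot_a is still attached to a replication process.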

Step 4: Execute Maintenance & Verification

Once both functions have been created on the primary database instance, run the cleanup and then recheck disk usage:

SELECT clear_inactive_replication_slots();
SELECT get_disk_usage_percentage();

Verify that both commands complete without syntax or missing-function errors. Once the cleanup finishes, disk pressure on the database pods should decline and the RESOURCE_EXHAUSTED alarms should clear automatically.
