Alarm Description:
Example: 20xx-xx-xxTxx:xx:xx.xxxZ nsx-manager NSX 255969 - [nsx@676 comp="nsx-manager" subcomp="nsx-sha" username="nsx-sha" level="WARNING" s2comp="tsdb-sender-metrics_mux"] Failed to send one msg timestamp: 1#012entity: POLICY_EDGE_NODE#012entity_id: "<UUID>"#012node_id: "<UUID>"#012nsx_site_id: "<UUID>"#012tenant_id: "<ID>"#012org_id: "<UUID>"#012system {#012 obj_id: "1"#012 mem_available: 1#012}#012:#012 <_InactiveRpcError of RPC that terminated with:#012#011status = StatusCode.UNAVAILABLE#012#011details = "failed to connect to all addresses"#012#011debug_error_string = "{"created":"@1671532749.874223758","description":"Failed to pick subchannel","file":"src/core/ext/filters/client_channel/client_channel.cc","file_line":3260,"referenced_errors":[{"created":"@1671532749.874222022","description":"failed to connect to all addresses","file":"src/core/lib/transport/error_utils.cc","file_line":167,"grpc_status":14}]}"#012>#012 Traceback (most recent call last):#012 File "/opt/vmware/nsx-netopa/lib/python/sha/core/channel/provider/tsdb_provider.py", line 596, in send_metrics#012 response = self._metric_stub.MetricsUpdate(msg, timeout=transmit_timeout,#012 File "/opt/vmware/nsx-netopa/lib/python/grpc/_channel.py", line 946, in __call__#012 return _end_unary_response_blocking(state, call, False, None)#012 File "/opt/vmware/nsx-netopa/lib/python/grpc/_channel.py", line 849, in _end_unary_response_blocking#012 raise _InactiveRpcError(state)#012grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:#012#011status = StatusCode.UNAVAILABLE#012#011details = "failed to connect to all addresses"#012#011debug_error_string = "{"created":"@1671532749.874223758","description":"Failed to pick subchannel","file":"src/core/ext/filters/client_channel/client_channel.cc","file_line":3260,"referenced_errors":[{"created":"@1671532749.874222022","description":"failed to connect to all 
addresses","file":"src/core/lib/transport/error_utils.cc","file_line":167,"grpc_status":14}]}"#012>
Example:
Using Postman or another API client, with admin authorization (Basic Auth), send a GET request to:
https://<nsx-manager-domain/ip>/policy/api/v1/infra/metric-global-config
The output looks similar to:
{
  "enabled": true,
  "deployment_id": "62acda7c-a3fb-4e4d-b784-2abbf6eb86c5",
  "resource_type": "MetricGlobalConfig",
  "id": "metric-global-config",
  "display_name": "metric-global-config",
  "path": "/infra/metric-global-config",
  "relative_path": "metric-global-config",
  "parent_path": "/infra",
  "remote_path": "",
  "unique_id": "ebcdee3a-ad57-443c-b490-44a1df70173e",
  "realization_id": "ebcdee3a-ad57-443c-b490-44a1df70173e",
  "owner_id": "1ff09d96-522a-495a-b464-429779595f01",
  "marked_for_delete": false,
  "overridden": false,
  "_system_owned": false,
  "_protection": "NOT_PROTECTED",
  "_create_time": 1732176932165,
  "_create_user": "system",
  "_last_modified_time": 1732310715434,
  "_last_modified_user": "napp_platform_egress",
  "_revision": 1
}
Now set the "enabled" flag to false in the body above, paste it into the request body, and send a PATCH request to the same URL.
Validate with another GET to confirm that "enabled" is false.
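The same GET/PATCH flow can be scripted without Postman. A minimal curl-based sketch follows; the manager address and credentials are placeholders, and the saved body is trimmed to the relevant fields for illustration:

```shell
# Placeholder NSX Manager address; authenticate as admin (Basic Auth).
NSX="nsx-manager.example.com"
URL="https://${NSX}/policy/api/v1/infra/metric-global-config"

# 1) Save the current config (uncomment against a live manager):
#    curl -sk -u admin "$URL" -o body.json
# For illustration here, use a trimmed copy of the GET response above.
cat > body.json <<'EOF'
{
  "enabled": true,
  "resource_type": "MetricGlobalConfig",
  "id": "metric-global-config"
}
EOF

# 2) Flip the enabled flag to false in the saved body.
sed 's/"enabled": true/"enabled": false/' body.json > body.patched.json

# 3) PATCH the modified body back, then GET again to verify:
#    curl -sk -u admin -X PATCH "$URL" -H 'Content-Type: application/json' -d @body.patched.json
#    curl -sk -u admin "$URL"
grep '"enabled"' body.patched.json
```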
"System is out of resources, metrics ingestion is currently paused. Please check for relevant alarms and perform the recommended actions."
“NSX Application Platform Health alarm : Metrics Disk usage high/very high”
Example:
root@nsx-mgr-1:~# napp-k logs projectcontour-envoy-8qjqz -c envoy -n projectcontour
[2024-05-31T08:55:21.662Z] "POST /MetricsMgrGrpc/StatusMetricsHealthCheck HTTP/2" 200 UAEX 0 0 0 - "10.20.132.100" "grpc-python/1.47.5 grpc-c/25.0.0 (linux; chttp2)" "57526c59-####-####-####-##########ae" "cloudnative.nsbucqesystem.net:443" "-"
Example alarm:
Metrics Delivery Failure: Failed to deliver metrics to target. Status code: RESOURCE_EXHAUSTED
Issue/Introduction
- NAPP stable; Metrics and Intelligence features are up.
- Multiple "Metrics Delivery Failure" alarms are raised and resolved repeatedly:
"Metrics Delivery Failure: Failed to deliver metrics to target. Status code: RESOURCE_EXHAUSTED"
- The core-services resources allocated to Metrics are barely being used.
Environment
NAPP 4.2.0.0.0.24124105
NSX 4.2.0.2.0.24278659
Cause:
The system detected disk pressure on the metrics PostgreSQL pods and, as a guardrail, paused metrics ingestion to avoid putting additional pressure on the disk. This guardrail is what raised the alarms:
{"log":"2024-10-13T15:30:01.533773131Z stderr F W1013 15:30:01.533623 39 db_access.cc:6042] DB disk usage percentage crossed 90 pausing ingestion"
Resolution
1) Check whether there is a split-brain across the metrics-postgresql-ha-postgresql pods. If a split-brain is found, run "napp-k rollout restart statefulset metrics-postgresql-ha-postgresql". To check, run the following in each pod:
napp-k exec -it metrics-postgresql-ha-postgresql-0 bash
repmgr -f build/repmgr/conf/repmgr.conf cluster show --compact
napp-k exec -it metrics-postgresql-ha-postgresql-1 bash
repmgr -f build/repmgr/conf/repmgr.conf cluster show --compact
2) Run the following checks on both pods:
A) On metrics-postgresql-ha-postgresql-0
napp-k exec -it metrics-postgresql-ha-postgresql-0 bash
PGPASSWORD=$POSTGRES_PASSWORD psql -w -U postgres -d metrics -h 127.0.0.1
SELECT pg_is_in_recovery();
SELECT get_disk_usage_percentage();
SELECT clear_inactive_replication_slots();
B) On metrics-postgresql-ha-postgresql-1
napp-k exec -it metrics-postgresql-ha-postgresql-1 bash
PGPASSWORD=$POSTGRES_PASSWORD psql -w -U postgres -d metrics -h 127.0.0.1
SELECT pg_is_in_recovery();
SELECT get_disk_usage_percentage();
SELECT clear_inactive_replication_slots();
If any of the above commands returns one of the following errors:
ERROR: function get_disk_usage_percentage() does not exist (OR) ERROR: function clear_inactive_replication_slots() does not exist
execute the following psql commands on the node where SELECT pg_is_in_recovery() returns f (the primary).
/* Function returns the current disk usage percentage of the Postgres data dir */
CREATE OR REPLACE FUNCTION get_disk_usage_percentage() RETURNS integer AS $$
DECLARE
    disk_usage integer;
BEGIN
    CREATE TEMP TABLE IF NOT EXISTS tmp_sys_df (content text) ON COMMIT DROP;
    /* Get the df output from the OS and extract the percentage usage value from the same */
    COPY tmp_sys_df FROM PROGRAM 'OUTPUT=$(df $PGDATA | tail -n +2) && echo $OUTPUT';
    disk_usage=(SELECT SPLIT_PART(SPLIT_PART(content, ' ', 5),'%',1)::integer FROM tmp_sys_df);
    RETURN disk_usage;
END;
$$ LANGUAGE plpgsql;
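The COPY ... FROM PROGRAM line in the function above reads the df line for the data directory and extracts the Use% column. The same extraction can be sanity-checked from a shell, using / in place of $PGDATA for illustration:

```shell
# Mirror the function's parsing: take the df line for the filesystem, grab the
# 5th field (Use%), and strip the trailing '%'. -P forces one line per filesystem.
usage=$(df -P / | tail -n +2 | awk '{print $5}' | tr -d '%')
echo "$usage"
```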
/* Function checks for inactive replication slots and drops them */
CREATE OR REPLACE FUNCTION clear_inactive_replication_slots() RETURNS void AS $$
DECLARE
    slot_names varchar;
    node_count integer;
    replication_slots_count integer;
    inactive_replication_slots_count integer;
BEGIN
    /* Get the number of nodes in the cluster from the repmgr db */
    CREATE TEMP TABLE IF NOT EXISTS tmp_cluster_nodes (content text) ON COMMIT DROP;
    COPY tmp_cluster_nodes FROM PROGRAM 'PGPASSWORD=$REPMGR_PASSWORD psql -w -U $REPMGR_USERNAME -d $REPMGR_DATABASE -h 127.0.0.1 -t -c ''select count(node_id) from nodes;''';
    node_count=(SELECT TRIM(content)::integer FROM tmp_cluster_nodes LIMIT 1);
    RAISE INFO 'Number of nodes in the cluster %', node_count;
    /* Get the total replication slots */
    replication_slots_count=(SELECT count(slot_name) FROM pg_replication_slots);
    RAISE INFO 'Number of replication slots %', replication_slots_count;
    /* Get the inactive replication slots count */
    inactive_replication_slots_count=(SELECT count(slot_name) FROM pg_replication_slots WHERE active_pid IS NULL OR active_pid NOT IN (SELECT pid FROM pg_stat_replication) AND active IS NOT TRUE);
    RAISE INFO 'Number of inactive replication slots %', inactive_replication_slots_count;
    /* Drop inactive replication slots only if the number of replication slots is greater than or equal to the number of nodes in the cluster */
    IF replication_slots_count >= node_count THEN
        FOR slot_names IN SELECT slot_name FROM pg_replication_slots WHERE active_pid IS NULL OR active_pid NOT IN (SELECT pid FROM pg_stat_replication) AND active IS NOT TRUE
        LOOP
            RAISE INFO 'Dropping inactive replication slot %', slot_names;
            PERFORM pg_drop_replication_slot(slot_names);
        END LOOP;
    ELSE
        IF inactive_replication_slots_count > 0 THEN
            RAISE INFO 'Number of replication slots less than number of nodes. Not dropping the inactive replication slots';
        ELSE
            RAISE INFO 'No inactive replication slots';
        END IF;
    END IF;
END;
$$ LANGUAGE plpgsql;
/* Execute the inactive replication slot cleanup once upfront */
SELECT clear_inactive_replication_slots();
Once this is done, execute
SELECT get_disk_usage_percentage();
SELECT clear_inactive_replication_slots();
and make sure there are no errors.
Once this completes successfully, the alarms should clear, and entries like the ones below should no longer appear in the logs on the node where SELECT pg_is_in_recovery() returns f:
{"log":"2024-10-15T08:15:00.879763632Z stdout F 2024-10-15 08:15:00.879 GMT [9981] ERROR: function get_disk_usage_percentage() does not exist at character 8"
{"log":"2024-10-15T08:15:00.879799552Z stdout F 2024-10-15 08:15:00.879 GMT [9981] HINT: No function matches the given name and argument types. You might need to add explicit type casts."
{"log":"2024-10-15T08:15:00.8798088Z stdout F 2024-10-15 08:15:00.879 GMT [9981] STATEMENT: SELECT get_disk_usage_percentage();"