Alarm Description:
Example: 20xx-xx-xxTxx:xx:xx.xxxZ nsx-manager NSX 255969 - [nsx@676 comp="nsx-manager" subcomp="nsx-sha" username="nsx-sha" level="WARNING" s2comp="tsdb-sender-metrics_mux"] Failed to send one msg timestamp: 1#012entity: POLICY_EDGE_NODE#012entity_id: "<UUID>"#012node_id: "<UUID>"#012nsx_site_id: "<UUID>"#012tenant_id: "<ID>"#012org_id: "<UUID>"#012system {#012 obj_id: "1"#012 mem_available: 1#012}#012:#012 <_InactiveRpcError of RPC that terminated with:#012#011status = StatusCode.UNAVAILABLE#012#011details = "failed to connect to all addresses"#012#011debug_error_string = "{"created":"@1671532749.874223758","description":"Failed to pick subchannel","file":"src/core/ext/filters/client_channel/client_channel.cc","file_line":3260,"referenced_errors":[{"created":"@1671532749.874222022","description":"failed to connect to all addresses","file":"src/core/lib/transport/error_utils.cc","file_line":167,"grpc_status":14}]}"#012>#012 Traceback (most recent call last):#012 File "/opt/vmware/nsx-netopa/lib/python/sha/core/channel/provider/tsdb_provider.py", line 596, in send_metrics#012 response = self._metric_stub.MetricsUpdate(msg, timeout=transmit_timeout,#012 File "/opt/vmware/nsx-netopa/lib/python/grpc/_channel.py", line 946, in __call__#012 return _end_unary_response_blocking(state, call, False, None)#012 File "/opt/vmware/nsx-netopa/lib/python/grpc/_channel.py", line 849, in _end_unary_response_blocking#012 raise _InactiveRpcError(state)#012grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:#012#011status = StatusCode.UNAVAILABLE#012#011details = "failed to connect to all addresses"#012#011debug_error_string = "{"created":"@1671532749.874223758","description":"Failed to pick subchannel","file":"src/core/ext/filters/client_channel/client_channel.cc","file_line":3260,"referenced_errors":[{"created":"@1671532749.874222022","description":"failed to connect to all 
addresses","file":"src/core/lib/transport/error_utils.cc","file_line":167,"grpc_status":14}]}"#012>
Example:
Using Postman or another API client, with admin authorization (Basic Auth), send a GET request to:
https://<nsx-manager-domain/ip>/policy/api/v1/infra/metric-global-config
The output looks similar to:
{
  "enabled": true,
  "deployment_id": "62acda7c-a3fb-4e4d-b784-2abbf6eb86c5",
  "resource_type": "MetricGlobalConfig",
  "id": "metric-global-config",
  "display_name": "metric-global-config",
  "path": "/infra/metric-global-config",
  "relative_path": "metric-global-config",
  "parent_path": "/infra",
  "remote_path": "",
  "unique_id": "ebcdee3a-ad57-443c-b490-44a1df70173e",
  "realization_id": "ebcdee3a-ad57-443c-b490-44a1df70173e",
  "owner_id": "1ff09d96-522a-495a-b464-429779595f01",
  "marked_for_delete": false,
  "overridden": false,
  "_system_owned": false,
  "_protection": "NOT_PROTECTED",
  "_create_time": 1732176932165,
  "_create_user": "system",
  "_last_modified_time": 1732310715434,
  "_last_modified_user": "napp_platform_egress",
  "_revision": 1
}
Now set the "enabled" flag to false in the body above, paste it into the request body, and send a PATCH request to the same URL.
Validate with another GET to confirm that "enabled" is false.
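The same GET/PATCH flow can be scripted without Postman. A minimal curl-based sketch follows; the manager address and credentials are placeholders, and the saved body is trimmed to the relevant fields for illustration:

```shell
# Placeholder NSX Manager address; authenticate as admin (Basic Auth).
NSX="nsx-manager.example.com"
URL="https://${NSX}/policy/api/v1/infra/metric-global-config"

# 1) Save the current config (uncomment against a live manager):
#    curl -sk -u admin "$URL" -o body.json
# For illustration here, use a trimmed copy of the GET response above.
cat > body.json <<'EOF'
{
  "enabled": true,
  "resource_type": "MetricGlobalConfig",
  "id": "metric-global-config"
}
EOF

# 2) Flip the enabled flag to false in the saved body.
sed 's/"enabled": true/"enabled": false/' body.json > body.patched.json

# 3) PATCH the modified body back, then GET again to verify:
#    curl -sk -u admin -X PATCH "$URL" -H 'Content-Type: application/json' -d @body.patched.json
#    curl -sk -u admin "$URL"
grep '"enabled"' body.patched.json
```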
"System is out of resources, metrics ingestion is currently paused. Please check for relevant alarms and perform the recommended actions."
“NSX Application Platform Health alarm : Metrics Disk usage high/very high”
Example:
root@nsx-mgr-1:~# napp-k logs projectcontour-envoy-8qjqz -c envoy -n projectcontour
[2024-05-31T08:55:21.662Z] "POST /MetricsMgrGrpc/StatusMetricsHealthCheck HTTP/2" 200 UAEX 0 0 0 - "10.20.132.100" "grpc-python/1.47.5 grpc-c/25.0.0 (linux; chttp2)" "57526c59-####-####-####-##########ae" "cloudnative.nsbucqesystem.net:443" "-"
Example alarm:
Metrics Delivery Failure: Failed to deliver metrics to target. Status code: RESOURCE_EXHAUSTED
Issue/Introduction
- NAPP stable; Metrics and Intelligence features are up.
- Multiple "Metrics Delivery Failure" alarms are raised and resolved repeatedly:
"Metrics Delivery Failure: Failed to deliver metrics to target. Status code: RESOURCE_EXHAUSTED"
- The core-services resources allocated to Metrics are barely being used.
Environment
NAPP 4.2.0.0.0.24124105
NSX 4.2.0.2.0.24278659
Cause:
The system detected disk pressure on the metrics PostgreSQL pods and, as a guardrail, paused metrics ingestion to avoid putting additional pressure on the disk. This guardrail is what raised the alarms:
{"log":"2024-10-13T15:30:01.533773131Z stderr F W1013 15:30:01.533623 39 db_access.cc:6042] DB disk usage percentage crossed 90 pausing ingestion"
Resolution
1) Check whether there is a split-brain across the metrics-postgresql-ha-postgresql pods. If a split-brain is found, run "napp-k rollout restart statefulset metrics-postgresql-ha-postgresql". To check, run the following in each pod:
napp-k exec -it metrics-postgresql-ha-postgresql-0 bash
repmgr -f build/repmgr/conf/repmgr.conf cluster show --compact
napp-k exec -it metrics-postgresql-ha-postgresql-1 bash
repmgr -f build/repmgr/conf/repmgr.conf cluster show --compact
2) Run the following checks on both pods:
A) On metrics-postgresql-ha-postgresql-0
napp-k exec -it metrics-postgresql-ha-postgresql-0 bash
PGPASSWORD=$POSTGRES_PASSWORD psql -w -U postgres -d metrics -h 127.0.0.1
SELECT pg_is_in_recovery();
SELECT get_disk_usage_percentage();
SELECT clear_inactive_replication_slots();
B) On metrics-postgresql-ha-postgresql-1
napp-k exec -it metrics-postgresql-ha-postgresql-1 bash
PGPASSWORD=$POSTGRES_PASSWORD psql -w -U postgres -d metrics -h 127.0.0.1
SELECT pg_is_in_recovery();
SELECT get_disk_usage_percentage();
SELECT clear_inactive_replication_slots();
If any of the above commands returns one of the following errors:
ERROR: function get_disk_usage_percentage() does not exist (OR) ERROR: function clear_inactive_replication_slots() does not exist
execute the following psql commands on the node where SELECT pg_is_in_recovery() returns f (the primary).
/* Function returns the current disk usage percentage of the Postgres data dir */
CREATE OR REPLACE FUNCTION get_disk_usage_percentage() RETURNS integer AS $$
DECLARE
    disk_usage integer;
BEGIN
    CREATE TEMP TABLE IF NOT EXISTS tmp_sys_df (content text) ON COMMIT DROP;
    /* Get the df output from the OS and extract the percentage usage value from the same */
    COPY tmp_sys_df FROM PROGRAM 'OUTPUT=$(df $PGDATA | tail -n +2) && echo $OUTPUT';
    disk_usage=(SELECT SPLIT_PART(SPLIT_PART(content, ' ', 5),'%',1)::integer FROM tmp_sys_df);
    RETURN disk_usage;
END;
$$ LANGUAGE plpgsql;
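The COPY ... FROM PROGRAM line in the function above reads the df line for the data directory and extracts the Use% column. The same extraction can be sanity-checked from a shell, using / in place of $PGDATA for illustration:

```shell
# Mirror the function's parsing: take the df line for the filesystem, grab the
# 5th field (Use%), and strip the trailing '%'. -P forces one line per filesystem.
usage=$(df -P / | tail -n +2 | awk '{print $5}' | tr -d '%')
echo "$usage"
```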
/* Function checks for inactive replication slots and drops them */
CREATE OR REPLACE FUNCTION clear_inactive_replication_slots() RETURNS void AS $$
DECLARE
    slot_names varchar;
    node_count integer;
    replication_slots_count integer;
    inactive_replication_slots_count integer;
BEGIN
    /* Get the number of nodes in the cluster from the repmgr db */
    CREATE TEMP TABLE IF NOT EXISTS tmp_cluster_nodes (content text) ON COMMIT DROP;
    COPY tmp_cluster_nodes FROM PROGRAM 'PGPASSWORD=$REPMGR_PASSWORD psql -w -U $REPMGR_USERNAME -d $REPMGR_DATABASE -h 127.0.0.1 -t -c ''select count(node_id) from nodes;''';
    node_count=(SELECT TRIM(content)::integer FROM tmp_cluster_nodes LIMIT 1);
    RAISE INFO 'Number of nodes in the cluster %', node_count;
    /* Get the total replication slots */
    replication_slots_count=(SELECT count(slot_name) FROM pg_replication_slots);
    RAISE INFO 'Number of replication slots %', replication_slots_count;
    /* Get the inactive replication slots count */
    inactive_replication_slots_count=(SELECT count(slot_name) FROM pg_replication_slots WHERE active_pid IS NULL OR active_pid NOT IN (SELECT pid FROM pg_stat_replication) AND active IS NOT TRUE);
    RAISE INFO 'Number of inactive replication slots %', inactive_replication_slots_count;
    /* Drop inactive replication slots only if the number of replication slots is greater than or equal to the number of nodes in the cluster */
    IF replication_slots_count >= node_count THEN
        FOR slot_names IN SELECT slot_name FROM pg_replication_slots WHERE active_pid IS NULL OR active_pid NOT IN (SELECT pid FROM pg_stat_replication) AND active IS NOT TRUE
        LOOP
            RAISE INFO 'Dropping inactive replication slot %', slot_names;
            PERFORM pg_drop_replication_slot(slot_names);
        END LOOP;
    ELSE
        IF inactive_replication_slots_count > 0 THEN
            RAISE INFO 'Number of replication slots less than number of nodes. Not dropping the inactive replication slots';
        ELSE
            RAISE INFO 'No inactive replication slots';
        END IF;
    END IF;
END;
$$ LANGUAGE plpgsql;
/* Execute the inactive replication slot cleanup once upfront */
SELECT clear_inactive_replication_slots();
Once this is done, execute
SELECT get_disk_usage_percentage();
SELECT clear_inactive_replication_slots();
and make sure there are no errors.
Once this completes successfully, the alarms should clear, and entries like the ones below should no longer appear in the logs on the node where SELECT pg_is_in_recovery() returns f:
{"log":"2024-10-15T08:15:00.879763632Z stdout F 2024-10-15 08:15:00.879 GMT [9981] ERROR: function get_disk_usage_percentage() does not exist at character 8"
{"log":"2024-10-15T08:15:00.879799552Z stdout F 2024-10-15 08:15:00.879 GMT [9981] HINT: No function matches the given name and argument types. You might need to add explicit type casts."
{"log":"2024-10-15T08:15:00.8798088Z stdout F 2024-10-15 08:15:00.879 GMT [9981] STATEMENT: SELECT get_disk_usage_percentage();"