vDefend SSP Alarm: Metrics Service status is degraded or down

Article ID: 384112

Updated On:

Products

VMware vDefend Firewall VMware vDefend Firewall with Advanced Threat Prevention

Issue/Introduction

An alarm is raised with one of the following descriptions:

"Metrics service {service-name} is degraded"
"Metrics service {service-name} is down"

For example:
Metrics service metrics-manager is degraded
Metrics service metrics-query-server is down

The {service-name} can be any of the following:

1. metrics-manager
2. metrics-query-server
3. metrics-app-server
4. metrics-db-helper
5. metrics-postgresql-ha-postgresql
6. metrics-postgresql-ha-pgpool
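
When triaging many alarms, it can help to extract the service name and state from the alarm description before choosing a remediation step. The following is a minimal sketch, assuming the description matches the two templates above exactly. The function name parse_alarm is hypothetical and is not part of the product.

```shell
#!/bin/sh
# parse_alarm DESCRIPTION
# Prints "<service-name> <state>" when DESCRIPTION matches
# "Metrics service <service-name> is degraded|down" and the service name
# is one of the six listed above. Returns 1 otherwise.
# (Hypothetical helper for illustration only.)
parse_alarm() {
    parsed=$(printf '%s\n' "$1" |
        sed -nE 's/^Metrics service ([a-z-]+) is (degraded|down)$/\1 \2/p')
    svc=${parsed% *}    # drop the trailing state word
    case "$svc" in
        metrics-manager|metrics-query-server|metrics-app-server|\
        metrics-db-helper|metrics-postgresql-ha-postgresql|\
        metrics-postgresql-ha-pgpool)
            printf '%s\n' "$parsed" ;;
        *)
            return 1 ;;
    esac
}
```

For example, parse_alarm 'Metrics service metrics-manager is degraded' prints the service name followed by the state, and an unrecognized service name produces no output.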

If the alarm stays open for more than 30 minutes, or if it occurs repeatedly, proceed to the Resolution section.

Environment

vDefend SSP >= 5.0

Cause

One or more replica pods of the Metrics service {service-name} are not in a running state.

Resolution

Maintenance window required for remediation?
No

Steps to resolve:

(1) Restart the affected Metrics service deployment or statefulset. This should resolve transient issues.

Log in to the SSPI root shell.

If {service-name} is metrics-postgresql-ha-postgresql, run:

'k rollout restart statefulset metrics-postgresql-ha-postgresql -n nsxi-platform'

For any other {service-name}, run:

'k rollout restart deployment {service-name} -n nsxi-platform'
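
The choice between the two restart commands can be sketched as a small shell function. This is an illustration, not an official tool: the function name restart_cmd is hypothetical, and it only prints the command so that it can be reviewed before being run. It assumes, as the step above states, that metrics-postgresql-ha-postgresql is the only service restarted as a statefulset.

```shell
#!/bin/sh
# restart_cmd SERVICE
# Prints the rollout restart command for the given Metrics service.
# Only metrics-postgresql-ha-postgresql is treated as a statefulset;
# every other service name is treated as a deployment.
# (Hypothetical helper; it prints the command and does not execute it.)
restart_cmd() {
    svc="$1"
    case "$svc" in
        metrics-postgresql-ha-postgresql) kind="statefulset" ;;
        *)                                kind="deployment" ;;
    esac
    printf 'k rollout restart %s %s -n nsxi-platform\n' "$kind" "$svc"
}
```

Calling restart_cmd metrics-manager prints the deployment form of the command, which can then be pasted into the SSPI root shell.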

(2) Wait ~5 minutes, then verify whether the {service-name} pods recovered:

'k get pods -n nsxi-platform | grep {service-name}'
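
To make the verification less dependent on reading the output by eye, the pod listing can be filtered for pods that have not recovered. The sketch below assumes the default 'k get pods' column layout (NAME, READY, STATUS, ...) and that pod names begin with the service name. The function name not_ready_pods is hypothetical.

```shell
#!/bin/sh
# not_ready_pods SERVICE
# Reads 'k get pods' output on stdin and prints NAME, READY and STATUS for
# each pod of SERVICE that is not Running or whose READY count is incomplete.
# (Hypothetical helper; assumes the default column layout.)
not_ready_pods() {
    awk -v svc="$1" '
        index($1, svc) == 1 {            # pod name starts with the service name
            split($2, ready, "/")        # READY column looks like "1/1"
            if ($3 != "Running" || ready[1] != ready[2])
                print $1, $2, $3
        }'
}
```

A possible invocation is 'k get pods -n nsxi-platform | not_ready_pods {service-name}'. Empty output means no matching pod is unhealthy, but it is also what you see when no pod matches the name at all, so check the unfiltered listing if in doubt.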
(3) If the service does not recover, run the script below to check the health of metrics-postgresql-ha-pgpool.
This script monitors CPU utilization and increases the CPU limit if required. Copy the attached script from this KB (check_and_fix_metrics-postgresql-ha-pgpool_health.sh) to the SSPI appliance, then run:

chmod +x check_and_fix_metrics-postgresql-ha-pgpool_health.sh
./check_and_fix_metrics-postgresql-ha-pgpool_health.sh

Wait ~20 minutes, then check whether the alarm has auto-resolved.

If the alarm persists, check for the following:

Disk usage alarms: https://knowledge.broadcom.com/external/article?articleNumber=384110
Memory usage alarms: https://knowledge.broadcom.com/external/article?articleNumber=384111
CPU usage alarms: https://knowledge.broadcom.com/external/article?articleNumber=384109

If none of the above applies, open a ticket with Broadcom Support.

Attachments

check_and_fix_metrics-postgresql-ha-pgpool_health.sh.sh
