vDefend SSP Alarm: Metrics Service status is degraded or down

Article ID: 384112


Updated On:

Products

VMware vDefend Firewall
VMware vDefend Firewall with Advanced Threat Prevention

Issue/Introduction

Alarms with any of the following descriptions:

"Metrics service {service-name} is degraded"
"Metrics service {service-name} is down"

e.g.,
Metrics service metrics-manager is degraded
Metrics service metrics-query-server is down

The {service-name} can be any of the following:

    1. metrics-manager
    2. metrics-query-server
    3. metrics-app-server
    4. metrics-db-helper
    5. metrics-postgresql-ha-postgresql
    6. metrics-postgresql-ha-pgpool

If the alarm stays open for more than 30 minutes or occurs multiple times, proceed to the Resolution section.

Environment

vDefend SSP >= 5.0

Cause

One or more replica pods of Metrics service {service-name} are not in a running state.
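
To confirm, you can list the metrics pods and look for replicas that are not Running or not fully ready (run from the SSPI root shell, where 'k' is the kubectl alias used in the steps below):

  k get pods -n nsxi-platform | grep metrics

Pods that are crash-looping, Pending, or showing fewer ready containers than expected in the READY column indicate the affected service.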

Resolution

Maintenance window required for remediation?
No

Steps to resolve:


(1) Restart the affected Metrics service deployment/statefulset (this should resolve transient issues); a combined command sketch follows the list below.

    • Log into the SSPI root shell.

    • If {service-name} is metrics-postgresql-ha-postgresql, run:

      'k rollout restart statefulset metrics-postgresql-ha-postgresql -n nsxi-platform'

    • Otherwise, run:

      'k rollout restart deployment {service-name} -n nsxi-platform'
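
If you prefer a single sequence, both cases can be combined in a small shell sketch (an illustration, assuming the 'k' kubectl alias available in the SSPI root shell; set SERVICE_NAME to the affected service from the alarm):

  # Set to the affected metrics service named in the alarm description
  SERVICE_NAME=metrics-manager
  if [ "$SERVICE_NAME" = "metrics-postgresql-ha-postgresql" ]; then
      # This service runs as a StatefulSet
      k rollout restart statefulset "$SERVICE_NAME" -n nsxi-platform
  else
      # The remaining metrics services run as Deployments
      k rollout restart deployment "$SERVICE_NAME" -n nsxi-platform
  fi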



(2) Wait ~5 minutes, then verify whether the {service-name} pods recovered:

      'k get pods -n nsxi-platform | grep {service-name}'
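
If you would rather block until the restart completes instead of polling, 'kubectl rollout status' can be used (a sketch; the 300-second timeout is an assumption matching the ~5 minute wait, and for metrics-postgresql-ha-postgresql substitute 'statefulset' for 'deployment'):

  k rollout status deployment {service-name} -n nsxi-platform --timeout=300s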

 

(3) If the service does not recover, run the attached script to check the health of metrics-postgresql-ha-pgpool.
The script monitors CPU utilization and increases the CPU limit if required. Copy the attached script (check_and_fix_metrics-postgresql-ha-pgpool_health.sh) from this KB to the SSPI appliance, then run:

chmod +x check_and_fix_metrics-postgresql-ha-pgpool_health.sh
./check_and_fix_metrics-postgresql-ha-pgpool_health.sh
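
For context, the kind of check and remediation the script performs can be sketched as follows (an illustration only, not the attached script; the 2-CPU limit value is an assumption):

  NS=nsxi-platform
  # Inspect current CPU usage of the pgpool pods ('k top' requires cluster metrics to be available)
  k top pods -n "$NS" | grep metrics-postgresql-ha-pgpool
  # If utilization sits at or near the current limit, raise the CPU limit on the deployment
  k set resources deployment metrics-postgresql-ha-pgpool -n "$NS" --limits=cpu=2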


Wait ~20 minutes and check whether the alarm auto-resolves.

If the alarm persists after completing the steps above, please open a ticket with Broadcom Support.

 

Attachments

check_and_fix_metrics-postgresql-ha-pgpool_health.sh