Alarms with any of the following descriptions:
"Metrics service {service-name} is degraded"
"Metrics service {service-name} is down"
For example:
Metrics service metrics-manager is degraded
Metrics service metrics-query-server is down
The {service-name} can be any of the following
If the alarm stays open for more than 30 minutes or occurs multiple times, proceed to the Resolution section.
One or more replica pods of Metrics service {service-name} are not in a running state
Maintenance window required for remediation?
No
Steps to resolve:
(1) Restart the affected Metrics service deployment or statefulset (this should resolve transient issues).
For a statefulset such as metrics-postgresql-ha-postgresql, run 'k rollout restart statefulset metrics-postgresql-ha-postgresql -n nsxi-platform'
For a deployment, run 'k rollout restart deployment {service-name} -n nsxi-platform'
(2) Wait ~5 minutes, then verify whether the {service-name} pods recovered:
'k get pods -n nsxi-platform | grep {service-name}'
(3) If the service does not recover, run the script below to check the health of metrics-postgresql-ha-pgpool.
This script monitors CPU utilization and increases the CPU limit if required. Copy the attached script from this KB (check_and_fix_metrics-postgresql-ha-pgpool_health.sh) to the SSPI appliance, then run:
chmod +x check_and_fix_metrics-postgresql-ha-pgpool_health.sh
./check_and_fix_metrics-postgresql-ha-pgpool_health.sh
Wait ~20 minutes, then check whether the alarm has auto-resolved.
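The pod check in step (2) can be narrowed to show only unhealthy replicas. The following is a minimal sketch, not part of the attached KB script: the function name not_running_pods is hypothetical, and it assumes the default column layout of the pod listing (NAME, READY, STATUS, RESTARTS, AGE) and that 'k' on the appliance is an alias for kubectl. It matches pod names that begin with the service name, which is slightly stricter than the grep in step (2).

```shell
#!/bin/sh
# Hypothetical helper: reads pod listing output on stdin and prints the names
# of pods for the given service that are not fully ready and Running.
# Assumed columns: NAME READY STATUS RESTARTS AGE
not_running_pods() {
    awk -v svc="$1" '
        index($1, svc) == 1 {          # pod name starts with the service name
            split($2, ready, "/")      # READY looks like "1/1"
            if ($3 != "Running" || ready[1] != ready[2]) print $1
        }'
}

# Example usage against a live cluster (assumes kubectl is on the PATH):
#   kubectl get pods -n nsxi-platform | not_running_pods metrics-manager
```

An empty result suggests that all replicas of the service are running and ready. Any pod names printed are the ones to inspect further.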
If the alarm persists, check for the following
If none of the above is applicable, please open a ticket with Broadcom Support.