Alarms with any of the following descriptions:
"Metrics service {service-name} is degraded"
"Metrics service {service-name} is down"
For example:
Metrics service metrics-manager is degraded
Metrics service metrics-query-server is down
The {service-name} can be any of the following
If the alarm stays open for more than 30 minutes or occurs multiple times, proceed to the Resolution section.
One or more replica pods of Metrics service {service-name} are not in a running state
Maintenance window required for remediation?
No
Steps to resolve:
Try restarting the deployment/statefulset. This should resolve any transient issues.
Run 'k rollout restart statefulset metrics-postgresql-ha-postgresql -n nsxi-platform'
Run 'k rollout restart deployment {service-name} -n nsxi-platform'
Wait for ~5 minutes and check if the {service-name} pods recover
Wait for ~20 minutes and check if the alarm is auto-resolved
If the alarm persists, check for the following
If none of the above is applicable, please open a ticket with Broadcom Support.
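
The restart-and-check steps above can be sketched as a small shell helper. This is a minimal sketch, not an official tool: it assumes the `k` command used in the steps is an alias for `kubectl`, and that pod names begin with the name of their deployment or statefulset. The function names `restart_and_wait` and `not_running_pods` are hypothetical and introduced only for illustration.

```shell
#!/usr/bin/env bash
# Sketch only. Assumption: `kubectl` is the binary behind the `k` alias.
NAMESPACE="nsxi-platform"

# Restart a workload and wait up to 5 minutes for the rollout to finish.
# $1 = kind (deployment or statefulset), $2 = workload name.
# Returns non-zero if the restart fails or the rollout does not complete in time.
restart_and_wait() {
  local kind="$1" name="$2"
  kubectl rollout restart "$kind" "$name" -n "$NAMESPACE" || return 1
  kubectl rollout status "$kind" "$name" -n "$NAMESPACE" --timeout=300s
}

# Print the names of pods whose name starts with $1 and whose STATUS column
# is anything other than "Running".
# Assumption: pod names are prefixed with the workload name.
not_running_pods() {
  local name="$1"
  kubectl get pods -n "$NAMESPACE" --no-headers |
    awk -v svc="$name" 'index($1, svc) == 1 && $3 != "Running" { print $1 }'
}
```

For example, `restart_and_wait statefulset metrics-postgresql-ha-postgresql` followed by `restart_and_wait deployment metrics-manager` mirrors the two restart commands, and `not_running_pods metrics-manager` prints nothing once every matching pod reports Running. The alarm itself may still take around 20 minutes to auto-resolve after the pods recover.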