Prometheus server failing every 5 minutes

Article ID: 423461

Products

VCF Operations

Issue/Introduction

  • Prometheus server CPU utilization spikes every 5 minutes.
  • The Prometheus server may restart every 5 minutes, coinciding with the collection cycle.

Environment

Aria Operations Management Pack for Kubernetes - Version 2.2

Cause

This is caused by custom, user-defined Prometheus queries.
When complex or resource-intensive user-defined queries (defined in the configuration files) are run, Prometheus can become overloaded, causing the Prometheus pod to restart.
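To find which user-defined queries are the most expensive, one option is to time each of them directly against Prometheus. The Python sketch below is a minimal example that sends each query to the standard Prometheus HTTP API instant-query endpoint (/api/v1/query) and ranks the queries by response time and by the number of series returned. The server URL, the sample queries, and the helper names (time_query, rank_queries) are hypothetical placeholders; measured times include network latency, so treat them as relative rather than absolute costs. Running expensive queries adds load of its own, so it may be best to test them outside the collection cycle.

```python
import json
import time
import urllib.parse
import urllib.request
from typing import Callable, List, Tuple

# Hypothetical endpoint: replace with the Prometheus URL configured in the Kubernetes MP.
PROMETHEUS_URL = "http://prometheus.example:9090"


def http_fetch(url: str) -> bytes:
    """Perform a GET request and return the raw response body."""
    with urllib.request.urlopen(url, timeout=60) as resp:
        return resp.read()


def time_query(query: str, base_url: str = PROMETHEUS_URL,
               fetch: Callable[[str], bytes] = http_fetch) -> Tuple[float, int]:
    """Run one instant query via /api/v1/query; return (seconds, result length).

    For vector and matrix results, the result length is the number of series.
    """
    url = f"{base_url}/api/v1/query?" + urllib.parse.urlencode({"query": query})
    start = time.perf_counter()
    body = json.loads(fetch(url))
    elapsed = time.perf_counter() - start
    if body.get("status") != "success":
        raise RuntimeError(f"query failed: {body.get('error')}")
    return elapsed, len(body["data"]["result"])


def rank_queries(queries: List[str], **kwargs) -> List[Tuple[str, float, int]]:
    """Time every query and return (query, seconds, series), slowest first."""
    results = []
    for q in queries:
        seconds, series = time_query(q, **kwargs)
        results.append((q, seconds, series))
    return sorted(results, key=lambda r: r[1], reverse=True)


# Example usage (hypothetical queries to audit):
# for q, s, n in rank_queries(["up", "sum by (pod) (rate(container_cpu_usage_seconds_total[5m]))"]):
#     print(f"{s:8.3f}s  {n:6d} series  {q}")
```

Queries that take the longest or return the most series are the first candidates to simplify or remove.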

Resolution

  • To fix this issue, review the user-defined queries and simplify or remove those that are too resource-intensive.


  • The following workaround can mitigate the issue. If the issue persists, apply the fix above.
    Reduce the number of concurrent threads that the Kubernetes MP uses to call Prometheus:
  1. SSH to the Aria Operations node where the Kubernetes adapter is running.
  2. Edit the properties file "/usr/lib/vmware-vcops/user/plugins/inbound/KubernetesAdapter3/conf/kubernetes_adapter.properties" and reduce the value of "THREAD_POOL_SIZE" to a conservative value of about 3.
  3. Save the file, then restart the adapter.
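Step 2 can also be scripted. The Python sketch below sets THREAD_POOL_SIZE in the properties file and keeps a ".bak" copy of the original first. It assumes the file uses simple one-line "KEY=VALUE" (or "KEY: VALUE") entries without backslash line continuations; the helper name set_property is hypothetical. The adapter still has to be restarted afterwards, as described in step 3.

```python
import re
import shutil
from pathlib import Path

# Path taken from the steps above; confirm it on your Aria Operations node.
PROPS = Path("/usr/lib/vmware-vcops/user/plugins/inbound/KubernetesAdapter3/conf/"
             "kubernetes_adapter.properties")


def set_property(path: Path, key: str, value: str, backup: bool = True) -> bool:
    """Set key to value in a simple .properties file (hypothetical helper).

    Returns True if an existing entry was replaced, False if the key was appended.
    Commented-out lines (starting with '#') are left untouched.
    """
    text = path.read_text()
    if backup:
        # Keep a copy of the original file next to it (overwrites an older .bak).
        shutil.copy2(path, path.with_name(path.name + ".bak"))
    # Match "KEY=old" or "KEY : old" on its own line, keeping the key and separator.
    pattern = re.compile(rf"^([ \t]*{re.escape(key)}[ \t]*[=:][ \t]*).*$", re.MULTILINE)
    new_text, count = pattern.subn(lambda m: m.group(1) + value, text)
    if count == 0:
        # Key not present: append it as a new entry.
        if new_text and not new_text.endswith("\n"):
            new_text += "\n"
        new_text += f"{key}={value}\n"
    path.write_text(new_text)
    return count > 0


# Example usage (run as a user with write access to the file):
# set_property(PROPS, "THREAD_POOL_SIZE", "3")
```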