Telegraf collected service objects in custom group switching policy and still triggering alert: "Telegraf Agent monitored service is not running"

Article ID: 405316

Updated On:

Products

VCF Operations/Automation (formerly VMware Aria Suite)

Issue/Introduction

  • You have created a custom group for custom monitored services that are collected by the Telegraf agent.
  • The custom group applies a custom policy to these services to disable certain alerts for a known issue with collecting these services.
  • The alert generated is: "Telegraf Agent monitored service is not running"
  • The policy on these services intermittently switches back to the default policy, which triggers the alerts that the custom policy was expected to disable.
  • After a period of time, the policy switches back to the custom policy and the alerts clear.
  • You need to understand what is causing the policy change and what can be done to prevent it in the future.

Environment

Aria Operations 8.18.3

Cause

In this particular use case, the service objects met all of the following conditions, which together caused the behavior:

  • The objects are in a custom group.
  • A custom policy is applied to the group to disable the alert.
  • Whenever the Telegraf agent tried to collect the service(s), collection failed with a "Permission is Denied" error in the Application Monitoring adapter instance.
  • Because of the collection failure, the related service objects were always in a 'No Data Receiving' state.

As a result, the objects went through a specific lifecycle of events within the product, which caused the policy change and the alerts:

  1. Telegraf can never collect the service, so Aria Operations puts these resources into the 'No Data Receiving' state.
  2. If a resource remains in the 'No Data Receiving' state for 2,016 five-minute collection cycles (7 days), the Application Monitoring adapter instance moves the resource into the 'Not Existing' state.
  3. When a resource is moved into the 'Not Existing' state, it is unloaded from the cache.
  4. This can result in an empty custom group (all members removed).
  5. In the next collection cycle, all the 'Not Existing' resources are re-discovered.
  6. When the resources are re-discovered, they are loaded back into the cache.
  7. When a resource is loaded back into the cache, it is put into the default policy.
  8. The default policy has the alert enabled, so the alert is triggered.
  9. The custom group membership refresh runs every 20 minutes, so within 20 minutes (the next time this refresh runs) the re-discovered resources are put back into the custom group, which restores the correct policy and cancels the alert.
  10. Telegraf collection of the resources (custom services) continues to fail, which puts them back into the 'No Data Receiving' state, and the process repeats.
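
The lifecycle above can be sketched as a small simulation. This is an illustrative model only, not Aria Operations code: the `ServiceObject` class, the `simulate` function, and the state and policy strings are hypothetical names. The sketch assumes that the group refresh falls on multiples of 20 minutes from the start of the simulation and that re-discovery happens exactly one cycle after the move to 'Not Existing'. The 5-minute cycle, the 2,016-cycle limit, and the 20-minute refresh interval are taken from the steps above.

```python
from __future__ import annotations

from dataclasses import dataclass

CYCLE_MINUTES = 5            # collection cycle length (from the steps above)
NO_DATA_CYCLES_LIMIT = 2016  # cycles in 'No Data Receiving' before 'Not Existing'
GROUP_REFRESH_MINUTES = 20   # custom group membership refresh interval


@dataclass
class ServiceObject:
    """Hypothetical model of one service whose collection always fails."""
    state: str = "No Data Receiving"
    policy: str = "custom"
    no_data_cycles: int = 0
    alert_active: bool = False


def simulate(total_minutes: int) -> list[tuple[int, str]]:
    """Return (minute, event) pairs for one permanently failing service."""
    obj = ServiceObject()
    events = []
    for minute in range(CYCLE_MINUTES, total_minutes + 1, CYCLE_MINUTES):
        if obj.state == "Not Existing":
            # Steps 5-8: re-discovered, reloaded into the cache under the
            # default policy, which has the alert enabled.
            obj.state = "No Data Receiving"
            obj.policy = "default"
            obj.no_data_cycles = 0
            obj.alert_active = True
            events.append((minute, "re-discovered; default policy; alert triggered"))
        else:
            # Steps 1-4: another failed collection cycle.
            obj.no_data_cycles += 1
            if obj.no_data_cycles >= NO_DATA_CYCLES_LIMIT:
                obj.state = "Not Existing"
                events.append((minute, "moved to Not Existing; unloaded from cache"))
        # Step 9: the group refresh restores the custom policy.
        # Assumption: refreshes land on multiples of 20 minutes.
        if minute % GROUP_REFRESH_MINUTES == 0 and obj.policy == "default":
            obj.policy = "custom"
            obj.alert_active = False
            events.append((minute, "group refresh; custom policy; alert cancelled"))
    return events


if __name__ == "__main__":
    # 2,016 cycles of 5 minutes is 10,080 minutes, i.e. 7 days.
    assert CYCLE_MINUTES * NO_DATA_CYCLES_LIMIT == 7 * 24 * 60
    for minute, event in simulate(10100):
        print(minute, event)
```

Under these assumptions, `simulate(10100)` yields three events: the move to 'Not Existing' at minute 10080 (7 days), re-discovery with the alert at minute 10085, and the group refresh cancelling the alert at minute 10100. In a real environment the length of the alert window depends on where the re-discovery falls relative to the 20-minute refresh.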

 

Resolution

  • There is currently no resolution for this issue within Aria Operations itself.
  • If custom services monitored by Telegraf agents consistently fail to collect because of a permission issue or another environment issue, resolve that underlying issue so that the objects no longer enter the 'No Data Receiving' state.
  • Alternatively, to avoid this behavior in the meantime, disable custom service monitoring for the affected service(s) on the Manage Telegraf Agents page until the permission or environment issue is resolved.