You have a custom group for custom monitored services that are collected by the Telegraf agent.
A custom policy applied to this group disables certain alerts because of a known issue with collecting these services.
The alert generated is: "Telegraf Agent monitored service is not running"
The policy on these services switches back to the default policy, seemingly at random, and triggers the alerts that were expected to be disabled.
After a period of time, the services return to the custom policy and the alerts clear.
You need to understand what causes the policy change and what can be done to prevent it in the future.
Environment
Aria Operations 8.18.3
Cause
In this particular use case, the service objects met all of the following conditions, which together caused the behavior:
The objects are in a custom group
A custom policy is applied to the group to disable the alert
Collection of the service(s) by the Telegraf agent always failed with a "Permission is Denied" error in the Application Monitoring Adapter instance
Due to the collection failure, the related service objects were always in a "No Data Receiving" state
This put the objects into a specific lifecycle of events within the product, which caused the policy change and the alerts:
Telegraf can never collect the service, so Aria Operations puts these resources into the 'No Data Receiving' state.
If a resource is in the 'No Data Receiving' state for 2,016 five-minute cycles (7 days) the Application Monitoring adapter instance moves the resource into the 'Not Existing' state.
When a resource is moved into the 'Not Existing' state, it is unloaded from the cache.
This can result in an empty custom group (all members removed)
In the next collection cycle, all the 'Not Existing' resources are re-discovered
When the resources are re-discovered, they are loaded back into the cache
When a resource is loaded back into the cache, it is then put into the default policy
This default policy has the alert enabled, which then gets triggered
The custom group membership refresh runs every 20 minutes. Within 20 minutes (the next time this refresh runs), the re-discovered resources are added back to the custom group, which returns them to the correct policy and cancels the alert.
Collection of the resources (custom services) by Telegraf continues to fail, which puts them back into the 'No Data Receiving' state, and the process repeats.
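The timing in the lifecycle above can be checked with a short calculation. The following Python sketch is illustrative only: the constant and function names are hypothetical, and the only inputs are the values stated in this article (five-minute collection cycles, a threshold of 2,016 cycles, and a 20-minute group membership refresh). It does not call any Aria Operations API.

```python
from datetime import timedelta

# Values taken from the article; the names are illustrative, not product settings.
COLLECTION_CYCLE = timedelta(minutes=5)
CYCLES_BEFORE_NOT_EXISTING = 2016
GROUP_REFRESH_INTERVAL = timedelta(minutes=20)


def time_until_not_existing() -> timedelta:
    """Time spent in 'No Data Receiving' before the move to 'Not Existing'."""
    return COLLECTION_CYCLE * CYCLES_BEFORE_NOT_EXISTING


def max_default_policy_window() -> timedelta:
    """Longest time a re-discovered resource can wait for the next
    custom group membership refresh, per the article."""
    return GROUP_REFRESH_INTERVAL


def cycles_in_refresh_window() -> int:
    """Number of whole collection cycles that fit in one refresh interval."""
    return GROUP_REFRESH_INTERVAL // COLLECTION_CYCLE


if __name__ == "__main__":
    print("Time until 'Not Existing':", time_until_not_existing())
    print("Maximum wait for group refresh:", max_default_policy_window())
    print("Collection cycles in that window:", cycles_in_refresh_window())
```

Under these assumptions, the unexpected alerts would be expected to recur roughly once every seven days and to clear within about 20 minutes each time.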
Resolution
There is currently no resolution for this issue from an Aria Operations standpoint.
If custom services monitored by Telegraf agents consistently fail to collect because of a permission issue or another environment issue, that underlying issue must be resolved.
Alternatively, to avoid this behavior, disable custom service monitoring for the affected service(s) from the Manage Telegraf Agents page until the permission or environment issue is resolved.