AWS probe loses connection to an account or goes into a failure state, and its configuration is lost

Products

DX Unified Infrastructure Management (Nimsoft / UIM)
CA Unified Infrastructure Management On-Premise (Nimsoft / UIM)
CA Unified Infrastructure Management SaaS (Nimsoft / UIM)

Issue/Introduction

For some time, whenever the AWS probe becomes overloaded or loses its connection to an AWS account, it starts to lose the account configuration stored on the probe. The probe then sends false alarms stating that an account cannot be reached, because the account information is no longer present in the raw config.

When this happens, the raw config is truncated (very small), and the properties folder, which holds the account information and secret credentials needed to contact the account, is missing.
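The symptom described above (a very small config with no properties folder) can be checked for with a short script. The following is only a minimal sketch: it assumes the probe CFG uses the usual `<section>` ... `</section>` layout and that the folder appears as a section named `properties`. The 1024-byte threshold and both function names are hypothetical choices for illustration, not values taken from the product.

```python
import re


def missing_sections(cfg_text, required=("properties",)):
    """Return the required section names that do not appear in cfg_text."""
    # Assumption: a section opens with a line such as <properties>.
    # Closing tags such as </properties> are excluded by the pattern.
    present = set(
        re.findall(r"^\s*<([^/<>][^<>]*)>\s*$", cfg_text, flags=re.MULTILINE)
    )
    return [name for name in required if name not in present]


def looks_truncated(cfg_text, min_bytes=1024, required=("properties",)):
    """Heuristic: flag a config that is very small or lacks a required section."""
    too_small = len(cfg_text.encode("utf-8")) < min_bytes  # hypothetical threshold
    return too_small or bool(missing_sections(cfg_text, required))
```

A check like this could run against a copy of the CFG file before and after a configuration change, so that a truncated file is noticed before the probe starts raising false alarms.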

Error messages in the log:

Jan 15 15:51:31:635 [pool-9-thread-1, aws] HealthRSSDataCollector::Error reading from, could not create Reader https://<example.com>/rss/xxxxxxx-us-east-1.rss
Jan 15 15:51:31:635 [pool-9-thread-1, aws] HealthRSSDataCollector::https://<example.com>/rss/xxxxxxx-us-east-1.rss
Jan 15 15:51:31:635 [pool-9-thread-1, aws] java.io.FileNotFoundException: https://<example.com>/rss/xxxxxxx-us-east-1.rss
    at sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1898)
    at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1500)
    at sun.net.www.protocol.https.HttpsURLConnectionImpl.getInputStream(HttpsURLConnectionImpl.java:268)
    at com.nimsoft.probe.application.aws.impl.health.HealthRssDataCollector.get(HealthRssDataCollector.java:131)
    at com.nimsoft.probe.application.aws.impl.health.HealthRssDataCollector.sendAlarm(HealthRssDataCollector.java:318)
    at com.nimsoft.probe.application.aws.impl.health.HealthRssDataCollector.buildRssURL(HealthRssDataCollector.java:207)
    at com.nimsoft.probe.application.aws.impl.health.HealthRssDataCollector.run(HealthRssDataCollector.java:91)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:750)

Environment

UIM 23.4 CU9
AWS Probe version 5.44
PPM (on the hub) 20.46

Cause

The probe configuration needed adjustment (Java memory and thread count; see Resolution).

Resolution

Increased the aws probe Java memory to 4 GB and 6 GB (from 2 GB and 4 GB).
Increased the thread count to 100 (from the default of 50).

According to the techdocs, the thread count is set to 50 by default:

thread_count: maximum number of threads that the probe can execute simultaneously.

"Recommend using the maximum number of 50, but you can also increase this count if the probe is running on a system with high CPU and memory."

Additional Information

Scalability limitation: up to 5 AWS accounts per aws probe instance when the configuration is scripted

As per Development, configuration through any kind of script is neither tested nor recommended. As witnessed during the Webex session, the probe CFG appears to become corrupted at some point when scripts are used to configure the probe.

Customers who use any kind of scripted configuration will have to manage it on their own.

If the aws probe is configured manually, it should work as expected. In that case, we have not seen any specific limitations in our experience with other customers.

There can be hardware resource limitations, which can be addressed according to the number of profiles you want to monitor. Adjusting the Java memory parameters such as Xmx and Xms may also help to some extent, as may adding virtual processors.

MCS and probe configuration packages are the currently supported means of bulk configuration change for deployments. Services must be engaged for any custom scripting solution for probe configuration (for example, via API); this remains outside the scope of support.

That said, when using a scripted approach to AWS configuration, we recommend no more than 5 AWS accounts per aws probe instance.
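As a worked example of this recommendation, the sketch below splits a list of accounts into groups of at most 5, one group per aws probe instance. The function name and the account labels are hypothetical. Only the limit of 5 comes from the recommendation above.

```python
def plan_probe_instances(accounts, max_per_instance=5):
    """Split accounts into groups of at most max_per_instance, in order."""
    if max_per_instance < 1:
        raise ValueError("max_per_instance must be at least 1")
    return [
        accounts[i:i + max_per_instance]
        for i in range(0, len(accounts), max_per_instance)
    ]


# Hypothetical example: 12 accounts need 3 probe instances (5 + 5 + 2).
example_accounts = [f"account-{n}" for n in range(1, 13)]
example_plan = plan_probe_instances(example_accounts)
```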

The aws probe keeps a copy of the CFG keys and values in memory in addition to the CFG file, so CFG values may be overwritten while the probe is busy with its core functionality during regular poll cycles. You can try deactivating the aws probe, updating the values through the API, and then activating the probe. This ensures that the probe picks up the latest values from the CFG and runs without any unexpected overwriting or corruption.
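The deactivate, update, activate order described above can be sketched as a small wrapper. This is an illustration only and not a supported tool: `probe_deactivated` is a hypothetical helper, and the `deactivate` and `activate` callables are placeholders for whatever mechanism your environment uses to stop and start the probe. No real UIM API call is shown or implied.

```python
from contextlib import contextmanager


@contextmanager
def probe_deactivated(deactivate, activate):
    """Deactivate the probe, run the wrapped block, then always reactivate."""
    deactivate()
    try:
        yield
    finally:
        # Reactivate even if the update fails, so the probe is not left stopped.
        activate()
```

A hypothetical usage, with all three functions supplied by the caller:

```python
with probe_deactivated(deactivate_aws_probe, activate_aws_probe):
    update_config_via_api()
```

Reactivating in the `finally` clause is a design choice. If you would rather inspect the CFG by hand after a failed update, call the activation step explicitly instead.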