Issue: When stress testing with thousands of alerts in UIM, the syncing to Spectrum OneClick stops.
Required: As we want to move from VAIM to UIM, we need to ship alerts from UIM to Spectrum with the spectrumgtw probe.
Observed under normal load: unidirectional alerts UIM-->Spectrum work fine. Alerts are synced to Spectrum every 30 seconds.
We needed to stress test the generation and syncing of alerts, as network outages can potentially create thousands of alerts. Creating a large number of simultaneous alerts appears to break the syncing of alerts from UIM to Spectrum OneClick.
Spectrum 20.2.3 (10.4.2.1.73)
Spectrumgtw probe 8.68HF1
nas probe 9.32
ems probe 10.25
alarm_enrichment probe 9.32
trellis probe 20.30
What I have tried so far.
1) Spectrumgtw Probe
Tried restarting the spectrumgtw probe. This syncs the alerts once, but the probe is then unable to sync every 30 seconds.
I can't see anything of interest in the logs.
2) UIM side
Found a procedure in an old KB article that seemed worth a try.
1 - deactivate ems, nas, alarm_enrichment, spectrumgtw
2 - delete the following folder in the ems directory
\Nimsoft\probes\service\ems\db (delete the db folder as there are changes in 10.17 and we need to have a clean db)
3 - delete the spectrumgtw cache (to ensure we start with clean data)
\Nimsoft\probes\gateway\spectrumgtw\cache (delete the cache dir)
4 - delete ems db
5 - activate the alarm_enrichment probe
activate the nas probe
activate the ems probe
6 - deactivate and then reactivate trellis (I don't think this was needed, but I saw it in a KB)
7 - activate the spectrumgtw probe
This does not seem to do much. On one occasion the alerts started flowing about 20 minutes after doing this procedure, but I have not been able to reproduce that.
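The two delete steps in the procedure above could be scripted roughly as follows. This is only a sketch: the relative paths come from steps 2 and 3, but the install root passed in is an assumption, the function name clear_probe_state is hypothetical, and the probes must already be deactivated by hand before running it.

```python
import shutil
from pathlib import Path

# Relative paths taken from steps 2 and 3 of the procedure.
STATE_DIRS = [
    Path("probes") / "service" / "ems" / "db",
    Path("probes") / "gateway" / "spectrumgtw" / "cache",
]


def clear_probe_state(nimsoft_root, dry_run=True):
    """Return the state directories found under nimsoft_root.

    With dry_run=True (the default) nothing is deleted, so the list can be
    reviewed first. With dry_run=False each directory found is removed.
    """
    root = Path(nimsoft_root)
    found = []
    for rel in STATE_DIRS:
        target = root / rel
        if target.is_dir():
            found.append(target)
            if not dry_run:
                shutil.rmtree(target)
    return found
```

Running it once with the default dry_run=True shows what would be removed before anything is actually deleted.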
3) Spectrum side
In our lab we have two Spectrum landscapes that we are using so that we can test multitenancy. I brought the main SpectroSERVER and the secondary SpectroSERVER down to Inactive, rebooted the main SpectroSERVER, and restarted the databases on both.
The result was that about 6 or 7 alerts came through and then syncing stopped again.
On two occasions the alerts started flowing again on their own: once overnight, and the other time about 20 minutes after doing the remedial actions in #2 (UIM side) above.
When everything is working, I can reproduce the issue and make the alerts stop syncing.
I do this by generating an "alert storm" of 5000 alerts. This effectively breaks the syncing of alerts, but I cannot find what is actually breaking.
I can provide any other info requested: logs, server specs, heap sizes. I can share a screen here to troubleshoot as well.
Our plan is to retire VAIM and use UIM exclusively for server alerting. We must use Spectrum Global Collections for the Enterprise Command Centre Dashboard views as this is what they use.
We cannot move UIM to production until we isolate the issue we are having and document a complete recovery procedure should support staff need to get the alerts moving again.
Release : 20.3
Component : UIM - SPECTRUMGTW
1 increase spectrumgtw memory allocation
raw configure > startup > options >
Change from:
-Xms32m -Xmx1536m -Duser.language=en -Duser.country=US
to:
-Xms64m -Xmx3072m -Duser.language=en -Duser.country=US
OK > OK > Apply > deactivate > activate
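Before applying the new options, it can help to confirm that the heap values really did go up. The sketch below assumes the first option string listed in this step is the existing setting and the second is the replacement; heap_mb is a hypothetical helper and only understands sizes with a k, m or g suffix.

```python
import re

# Multipliers to convert a JVM size suffix to megabytes.
_UNITS = {"k": 1 / 1024, "m": 1, "g": 1024}


def heap_mb(options, flag):
    """Return the size in MB given for a JVM flag such as '-Xmx'.

    Returns None if the flag is absent or has no k/m/g suffix.
    """
    match = re.search(re.escape(flag) + r"(\d+)([kKmMgG])", options)
    if match is None:
        return None
    return int(match.group(1)) * _UNITS[match.group(2).lower()]


old = "-Xms32m -Xmx1536m -Duser.language=en -Duser.country=US"
new = "-Xms64m -Xmx3072m -Duser.language=en -Duser.country=US"

for flag in ("-Xms", "-Xmx"):
    print(flag, heap_mb(old, flag), "->", heap_mb(new, flag))
```

With the two strings from this step, the maximum heap doubles from 1536 MB to 3072 MB.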
2 increase sync interval
raw configure > setup > alarm > Periodic_Alarm_Sync_Interval
This value is in seconds. The current interval may be too short to process the alarm flood.
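As a rough way to pick a starting value, the interval can be estimated from the size of the flood and the rate at which alarms are processed. This is a back-of-the-envelope sketch, not how the probe itself works: the throughput figure must be measured in your own environment, and min_sync_interval and the 1.5 headroom factor are assumptions for illustration.

```python
import math


def min_sync_interval(pending_alarms, alarms_per_second, headroom=1.5):
    """Smallest whole-second interval in which one sync cycle could finish.

    pending_alarms: alarms waiting to be synced (e.g. the size of the storm).
    alarms_per_second: measured processing rate (an assumption to be verified).
    headroom: safety factor applied on top of the raw processing time.
    """
    if alarms_per_second <= 0:
        raise ValueError("alarms_per_second must be positive")
    return math.ceil(pending_alarms / alarms_per_second * headroom)


# 5000 alarms is the storm size from the test; 50 alarms/s is a made-up rate.
print(min_sync_interval(5000, 50))  # prints 150
```

Under those assumed numbers, a 30-second interval would be far too short, which is consistent with the advice to raise Periodic_Alarm_Sync_Interval.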
3 increase the Tomcat payload size
As of Spectrum 10.4.1 the Tomcat payload size is configurable.
Increase it and restart Tomcat for the change to take effect.