How can we optimize nas probe performance and scalability to ensure stability and consistency?
How do we automate and ensure the success of nas maintenance and administration?
Unexpected alarm issues/anomalies can be caused by nas administration/maintenance failures when the local nas SQLite .db files grow too large (transactionlog.db > 300-500 MB) or the backend nas tables grow too large (more than 1 million rows).
NAS Architecture
Offload nas AO profiles/preprocessing rules from the Primary whenever possible. The inventory of alarms being sent to any single instance of the nas probe can be significantly reduced through the use of secondary nas'es, since each secondary nas probe then manages a 'subset' of the overall alarm inventory and any associated rules and/or scripts.
If you choose to deploy remote/secondary nas'es to cut down on the primary nas overhead, then once the secondary nas probes have been distributed and the Auto Operator logic migrated to preprocessor logic (as needed), you'll need to set up forwarding/replication rules on each nas probe:
You can deploy a nas on each of the busiest hubs/tunnel concentrators. This nas would only run the AO Profiles and preprocessing scripts that prevent alarms from 'propagating.' For example, if it makes sense, you can close the alarms locally if they are generated by robots attached to that hub. Otherwise, you can use nas replication to move any open alarms to the Primary hub where the primary nas resides. This nas would then be responsible for running all remaining nas AO profiles/processing scripts.
Do not enable the NiS Bridge on ANY secondary hub (including the HA backup hub). By default, it will not even be present if you simply install the nas probe, but if you start by copying the nas.cfg from the primary hub nas, it will be there and it will be enabled.
Only one nas can write to the backend nas tables at a time. If a second nas has its NiS Bridge enabled while the Primary nas is running, duplicate key errors will appear in the nas log. For example:
nas: COM Error [0x80040e2f] IDispatch error #3119 - [Microsoft OLE DB Provider for SQL Server] Violation of UNIQUE KEY constraint 'UQ__NAS_TRAN__475AXXXXXXXF79D5'. Cannot insert duplicate key in object 'dbo.NAS_TRANSACTION_SUMMARY'. The duplicate key value is (XLXXXX1264-3XX86)
Monitoring Governance/Alarm Reduction
It is tempting to enable a lot of monitoring when you first deploy UIM, but over time this can cause havoc. Best Practice is to only enable alarm thresholds for Key Performance Indicators (KPIs) that have some upstream effect on the business. Try starting with a maximum of 5 KPIs per application/technology. Ask Support if we have any suggested KPIs for vmware, Citrix, Netapp, Nutanix, Exchange, etc. That is the base starting point - a small number of KPIs (key metrics). Always ask why it's important to collect the data (QOS or alarms), how often it needs to be collected and why, and how long it should be stored. Keep all of these monitoring aspects to a minimum. Note that some probes have monitoring enabled 'right out of the box' for many metrics - but customers must always decide which ones MUST be kept enabled versus the nice-to-haves.
UIM suppresses like alarms and updates the alarm count. But should we keep generating alarms in the hundreds/thousands and let the alarm counts increase exponentially? No - this adversely affects nas performance, scalability and nas housekeeping (maintenance), as well as alarm display reliability and overall use of system resources. Any alarm suppression count > 100 begs the question: will the nas be able to handle more and more alarms with ever higher counts? Not without running into performance problems, nas sync issues, delays in alarm processing, or display issues in the alarm view. You should take the necessary time to review and adjust the monitoring policy and process for alarms and related ticket handling.
Why allow this to happen if no actions will be taken to alleviate the issue/resolve the problem? For example, what use is there in letting 8000+ Robot Inactive alarms keep incrementing when nothing is being done about them? It's just 'noise,' and it places an unnecessary load on the environment which usually worsens over time.
Local nas database files (transactionlog.db, database.db)
Note that in the nas GUI there are two tabs, "Transaction Log" and "NiS Bridge," which independently set the alarm data retention rates for the tables. The "Transaction Log" tab sets the retention for the SQLite database. If this is set lower than the retention in "NiS Bridge," then you will see fewer alarms. For example, if the transaction log is set to 30 days but NiS Bridge is set to 90 days, when you drop the nas tables and re-sync, it will only be able to sync back 30 days of alarms because that is all that is retained in the SQLite database.
In general, it is best to keep the local nas database files small: transactionlog.db less than 300 MB and database.db (live alarms) less than 100 MB. Alarms live in two different places: in the nas (the alarm subconsole), active alarms are in the database.db file, and historical data (cleared alerts) is in the transactionlog.db file. Both files are SQLite databases, and a free SQLite browser can be used to query against and extract the data if necessary.
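As a hedged illustration of the kind of read-only check you can run in an SQLite browser against transactionlog.db (the internal table and column names below - NAS_TRANSACTION_LOG with a unix-epoch time column - are assumptions; list the tables first and verify against your own file):

-- list the tables actually present in transactionlog.db
SELECT name FROM sqlite_master WHERE type = 'table';

-- count transaction rows per day to see how quickly the log is growing
SELECT date(time, 'unixepoch') AS day, COUNT(*) AS row_count
FROM NAS_TRANSACTION_LOG
GROUP BY day
ORDER BY day DESC;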
In very large UIM environments where there are frequent alarms and high counts, it is best to keep the transaction log settings as low as possible but not so low that it causes you to miss messages/activity that you need to troubleshoot.
Backend nas database tables (NiS Bridge)
In the Alarm Console in UMP/OC, the data is pulled from the NAS_ALARMS table (active alarms), the NAS_TRANSACTION_LOG table (transaction history) and the NAS_TRANSACTION_SUMMARY table (transaction summary). So you can query against both of those 'transaction' tables, but be aware that the data is only retained for as long as the 'Transaction Log Management' time periods specified in the nas.
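For example, here is a hedged sketch of a read-only lookup against the backend tables (the nimid, suppcount and time column names are assumptions - verify them against your schema before relying on them):

-- summary row for a single alarm (one row per alarm)
SELECT nimid, suppcount, message
FROM dbo.NAS_TRANSACTION_SUMMARY
WHERE nimid = '<alarm nimid>';

-- full transaction history for the same alarm, newest first
SELECT *
FROM dbo.NAS_TRANSACTION_LOG
WHERE nimid = '<alarm nimid>'
ORDER BY [time] DESC;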
If any of the nas DB tables listed below grow too large (more than approx. 1M rows), you may start to see unexpected results, e.g., alarm delays, missed alarms, nas 'sync' issues, scripts not firing or not firing on time, alarms not being generated due to rules not firing, alarms not clearing intermittently, etc.
- NAS_ALARMS
- NAS_TRANSACTION_SUMMARY
- NAS_TRANSACTION_LOG
If one or more queries being run somehow involve the backend nas tables, e.g., searching alarm history in UMP/USM/OC, the underlying query/script may respond slowly or hang and never finish, e.g., if the NAS_TRANSACTION_SUMMARY table is greater than 100M rows.
Note that millions of rows in the NAS_TRANSACTION_LOG table will cause performance issues with the nas probe, and the NiS sync process will take a long time. This also adversely affects making configuration changes.
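A quick, hedged way to check the three tables against the approx. 1M-row guideline (SQL Server syntax shown; adjust for other database vendors):

-- row counts for the three nas backend tables
SELECT 'NAS_ALARMS' AS table_name, COUNT(*) AS row_count FROM dbo.NAS_ALARMS
UNION ALL
SELECT 'NAS_TRANSACTION_SUMMARY', COUNT(*) FROM dbo.NAS_TRANSACTION_SUMMARY
UNION ALL
SELECT 'NAS_TRANSACTION_LOG', COUNT(*) FROM dbo.NAS_TRANSACTION_LOG;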
You can either manually run a delete statement to remove rows from that table based on a given date, or TRUNCATE the table (a hedged sketch follows the reference below). However, see the KB Article listed below for a more solid and potentially permanent approach, one that enables the nas housekeeping to complete its run and, in turn, minimizes the size of that table over time.
NAS_TRANSACTION_LOG table administration (housekeeping) unsuccessful and not being maintained
https://knowledge.broadcom.com/external/article/113086
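As a sketch of the manual cleanup mentioned above (assuming 'time' is the table's datetime column - verify the column name first, take a database backup, and ideally deactivate the nas while you run this):

-- delete transaction-log rows older than 30 days
DELETE FROM dbo.NAS_TRANSACTION_LOG
WHERE [time] < DATEADD(day, -30, GETDATE());

-- or empty the table entirely (faster, but removes ALL transaction history)
TRUNCATE TABLE dbo.NAS_TRANSACTION_LOG;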
Nimsoft Alarm Server (NAS) Tips and Techniques
Try to avoid having all of your nas AO Profiles, preprocessing rules and scripts running on the Primary hub because eventually, you may have unexpected results due to the load.
Try to avoid using very frequent intervals, such as a large number of nas AO rules running every minute.
Make sure that no 'custom' probes are generating unnecessary noise / traffic to the nas increasing its overhead.
To analyze any custom probe/script take a close look at what the probe/script is doing in terms of alarm processing and decide if it's optimal or not. In one case, we found a significant number of clear messages being sent unnecessarily, which added to the nas overhead. In this case, the nas AO was auto-acknowledging them and adding unneeded events to the transactionlog.db.
Be aware that the nas option "On arrival" applies to the arrival of the message that triggered the AO. So if your script modifies the alarm, that alarm is put back on the message BUS and "arrives" again. Essentially, this creates a loop where every update to the message causes another update.
The "On overdue age" option is vaguely named but translates to "Run once the alert is older than x." It fires only once because the alert is only ever older than the specified value once.
Small values of overdue age, e.g., 1-3 seconds, may yield unexpected results. There appears to be a point where the nas takes a timestamp, writes it to the local database, and then looks up everything that should run; if the write took longer than that one second to get committed to the database, you'll lose it. In the scheme of things, 5 or more seconds is probably better than trying to use 1-4 seconds.
Also, note that the nas only runs 1 profile at a time, so if you have a bunch of profiles firing, you could see some unexpected results as some get delayed because of the others that are running.
Do not use, or only use sparingly, the "On Interval" option.
Ensure that all of your rules are 100% unique so that you don’t have any redundant/competing rules or rules that are in conflict with one another.
In some severe cases, where the number of open alarms in UIM has reached a very high number, e.g., 5000+ or even much more, one or more of the following steps may be required to make the message environment more stable:
1. Reduce alarms
Adjust robot settings to eliminate/reduce robot inactive alarms, reduce metrics that are not true KPIs and have no business reason/impact associated with the alarm, etc.
Any/all alarms with high counts need to be closely examined. This is an ongoing effort until unnecessary alarms/alarm counts are at a reasonable level.
2. Adjust the Transaction log retention to very low values
This is so the transactionlog.db and database.db are reduced.
In one large global environment, we set it to 1, 2, & 3 with 1-hour admin intervals, but perhaps 1, 5, and 15 may be enough.
This depends on the size/scale of your environment.
The NiS Bridge settings can be set lower than their defaults of 7, 30 and 90, to 7, 15 and 30. Again, keep the 1-hour Administration interval if possible.
If your nas Administration ('housekeeping,' as it is noted in the log) cannot complete within 1 hour with reasonable settings, you probably still have too many alarms for the nas to manage successful housekeeping.
3. Start offloading nas AO profiles or preprocessing rules ASAP to one or more nas'es on secondary/remote hubs.
4. If the nas housekeeping job cannot finish, you'll see evidence of that in the nas.log at loglevel 5 (search for 'housekeeping').
To alleviate this issue, you may have to manually reduce the size of some of the backend nas tables by deleting rows that are older than a given date (see the batched sketch below).
Once the nas tables are at a reasonable size, e.g., less than 1M rows, the housekeeping will most likely complete, and the hourly administration should continue without issue but you should check it over a few days to be sure.
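When the table holds millions of rows, a single DELETE can run for a very long time and bloat the database transaction log. A hedged, batched approach keeps each transaction small (SQL Server syntax; the 'time' column name is an assumption and the batch size is tunable):

-- delete in batches of 10,000 rows to limit locking and log growth
DECLARE @rows INT = 1;
WHILE @rows > 0
BEGIN
    DELETE TOP (10000) FROM dbo.NAS_TRANSACTION_LOG
    WHERE [time] < DATEADD(day, -30, GETDATE());
    SET @rows = @@ROWCOUNT;
END;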
5. Increase monitoring intervals so that they are not unnecessarily creating alarms or increasing alarm count due to their frequency.
For example, probes like snmpcollector monitoring a ton of devices every 3 or 5 minutes, cdm iostat monitoring every 1 or 5 minutes, and so forth.
6. Regarding the transaction log retention settings, the 'retention countdown' doesn't start until the active alarm is cleared/acknowledged.
Alarms with extreme suppression counts often originated long ago and have a lifespan well beyond this setting.
In other words, for any given alarm, if the retention setting is 7 days, that 7 days doesn’t start until the active alarm is cleared/ack’d.
NAS NIS Bridge Administration (old data housekeeping) fails
https://knowledge.broadcom.com/external/article/33674
For every single active alarm, there is 1 entry in the NAS_ALARMS and NAS_TRANSACTION_SUMMARY tables, but in the NAS_TRANSACTION_LOG there is, at a minimum, one row for every recurrence of an active alarm that is suppressed. For example, if there is one active alarm with a suppression count of 1000, that means there are, at a minimum, 1000 records/rows in the NAS_TRANSACTION_LOG table.
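To see which alarms are driving this one-to-many buildup, a hedged query like the following can rank them (the nimid and suppcount column names are assumptions; verify against your schema):

-- top 20 alarms by the number of transaction-log rows they account for
SELECT TOP (20) s.nimid, s.suppcount, COUNT(l.nimid) AS log_rows
FROM dbo.NAS_TRANSACTION_SUMMARY s
LEFT JOIN dbo.NAS_TRANSACTION_LOG l ON l.nimid = s.nimid
GROUP BY s.nimid, s.suppcount
ORDER BY log_rows DESC;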
This buildup is mirrored in the local transactionlog.db, which degrades nas performance once the file becomes too large, e.g., > 300 MB. It also comes into play during nas housekeeping/cleanup or any query against that local file; in other words, the nas housekeeping may not complete.
Also, every time you open or close the nas GUI, the nas 'sync' occurs. A very large transaction log table affects/delays this sync action and may also fill up disk space.
Reduce alarms/alarm counts
If the local nas database.db contains a large number of alarms, e.g., > 3000, and the transactionlog.db is also large, e.g., GBs, with very high counts for many alarms, one or more of the following efforts should be started as soon as possible to slowly but surely reduce alarm 'bloat' and allow the nas environment to function without further random, intermittent, unexpected results/issues.
Document all alarms with high alarm counts; copy alarms with counts > 500 into Excel for later reference during the alarm reduction project.
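A hedged extraction query for this step (the hostname and suppcount column names are assumptions; paste the result grid into Excel):

-- open alarms with suppression counts above 500, highest first
SELECT nimid, hostname, message, suppcount
FROM dbo.NAS_ALARMS
WHERE suppcount > 500
ORDER BY suppcount DESC;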
The bottom line regarding extremely high alarm counts is that they exist and persist usually because no one knows how to resolve them, cares enough about the alarm to resolve them, or has the time to address them. That said, there is no use in allowing alarms to persist and build up, e.g., for weeks or months, unless there is a solid plan in place to manage them.
Review NAS Architecture
The nas architecture should be reviewed, and wherever the load on the main nas can be reduced, that should be addressed as per this nas best practices KB Article.
High-Level Alarm Policy or Principles
Any given alarm should have either a valid business reason, a technical reason, or both, and if it doesn't, it should be seriously considered for elimination.
Reduce alarm frequency
Reexamine the monitoring interval frequency and increase it if there isn't a good business reason to have it set to a low value. For example, CDM iostat monitoring, website monitoring, url monitoring, CPU/Memory/Disk, device monitoring, vmware monitoring, and so on.
Administer defunct systems - For robot inactive alarms, implement a plan to effectively decommission any robots that need to be removed from the monitoring environment. Refer to KB Articles on decommissioning robots.
Disk monitoring
If any tablespaces (datafiles) being monitored are set to Autoextend (common in Oracle DBs), only set up monitoring to alert when they are close to reaching their maximum size.
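For Oracle, a hedged check of how close each autoextend datafile is to its maximum size (note this compares allocated bytes to maxbytes; measuring true space usage would also require dba_free_space):

-- autoextend datafiles ranked by how close they are to their maximum size
SELECT tablespace_name,
       file_name,
       ROUND(bytes / 1048576) AS current_mb,
       ROUND(maxbytes / 1048576) AS max_mb,
       ROUND(100 * bytes / maxbytes, 1) AS pct_of_max
FROM dba_data_files
WHERE autoextensible = 'YES'
  AND maxbytes > 0
ORDER BY pct_of_max DESC;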
Reduce alarm 'noise'
Eliminate any/all unnecessary alarm thresholds unless they serve some related business purpose or are required for critical resource monitoring.
CPU monitoring
Decide upon valid threshold values for Windows versus UNIX/Linux, e.g., 99% for 20 minutes/n samples on Windows, versus 99% for 5 minutes on UNIX/Linux.
Physical Memory
Alert when 10% remains, or whatever makes the most sense based on your requirements for each application/OS/server type.
Process monitoring - don't monitor a process very frequently unless there is a good reason to do so, and that reason may only be temporary.
Live alarms
Ideally, for a single customer, live alarms should be kept at what a UIM Administrator would consider a manageable level, e.g., fewer than 2k; that will depend on the environment and the extensiveness of the monitoring footprint.
To start with, you could delete any alarms older than a week or two based on Origin Time. Then start doing monitoring governance on those alarms that no one is responding to - turn them off (disable the thresholds). The underlying root problem for performance is most commonly the total number of active alarms and alarms with high counts.
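To identify candidates by origin time, a hedged read-only query (the time_origin column name is an assumption, and if your schema stores epoch seconds, convert accordingly; perform the actual deletion through the alarm console, not SQL):

-- open alarms older than 14 days by origin time, oldest first
SELECT nimid, time_origin, hostname, message, suppcount
FROM dbo.NAS_ALARMS
WHERE time_origin < DATEADD(day, -14, GETDATE())
ORDER BY time_origin ASC;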