How can we optimize nas probe performance and scalability to ensure stability and consistency?
Unexpected alarm issues/anomalies can be caused by nas administration/maintenance failures due to the size of the local nas SQLite .db files (transactionlog.db > 300-500 MB), or backend nas tables growing too large (>1M rows).
- growth of environment/nas configuration
Release : 8.51 or higher
Component : UIM - NAS
- Offload nas AO profiles/preprocessing rules from the Primary whenever possible. The inventory of alarms being sent to any one nas probe can be significantly reduced through the use of secondary nas'es - as you'll have each secondary nas probe managing a 'subset' of the overall alarm inventory.
- If you choose to deploy remote/secondary nas’es to cut down on the primary nas overhead, then once the secondary nas probes have been distributed and the Auto Operator logic migrated to preprocessor logic (as needed), you'll need to set up forwarding / replication rules on each nas probe.
- You can deploy a nas on each of the busiest hubs/tunnel concentrators. This nas would only run the AO Profiles and preprocessing scripts that prevent alarms from 'propagating.' For example, if it makes sense, you can close the alarms locally if they are generated by robots attached to that hub. Otherwise you can use nas replication to move any open alarms to the Primary hub where the primary nas resides. This nas would then be responsible for running all remaining AO profiles/processing scripts.
- Do not enable the NiS Bridge on any secondary hub (including the HA backup hub). A fresh nas installation does not include the NiS Bridge settings, but if you start by copying the nas.cfg from the primary hub nas, the settings will be present and enabled. Only one nas can write to the backend nas tables at a time.
Monitoring Governance/Alarm Reduction
It is tempting to enable/'turn on' a lot of monitoring when you first deploy UIM but over time this can cause havoc. Best Practice is to only enable alarm thresholds for Key Performance Indicators (KPIs), that are associated with an upstream effect on business in some way. Try starting with a maximum of 5 KPIs per application/technology. Ask Support if we have any suggested KPIs for vmware, Citrix, Netapp, Nutanix, Exchange, etc. That is the base starting point - a small number of KPIs (key metrics). Always ask, why it’s important to collect the data (QOS or alarms), how often it needs to be collected and why, and how long it should be stored. Keep all of these monitoring aspects to a minimum. Note that some probes have monitoring enabled 'right out of the box' for many metrics - but customers must always decide which ones MUST be kept enabled versus the nice-to-have's.
UIM suppresses like alarms and updates the alarm count. But should alarms be allowed to keep recurring hundreds or thousands of times while the counts climb without limit? This is not good practice: it can adversely affect nas performance, scalability and housekeeping (maintenance), as well as alarm display, reliability and system resource usage. Any alarm with a suppression count > 100 raises the question of whether the nas can keep handling more alarms with ever-higher counts. It cannot do so without running into performance or nas sync issues, delays in alarm processing, or display issues in the alarm view. Take the time to review and adjust your monitoring policy and the process for handling alarms and related tickets.
Why allow this to happen if no action will be taken to resolve the problem? For example, of what use is it to have 8000+ Robot Inactive alarms whose counts keep increasing when nothing is being done about them? It's just noise, and it places an unnecessary load on the environment that usually worsens over time.
Local nas database files (transactionlog.db, database.db)
Note that in the nas GUI there are two tabs, "Transaction Log" and "NiS Bridge", which independently set the alarm data retention periods. The "Transaction Log" tab sets the retention for the local SQLite database. If this is set lower than the retention in "NiS Bridge", you will see fewer alarms. For example, if the transaction log is set to 30 days but NiS Bridge is set to 90 days, then when you drop the nas tables and re-sync, only 30 days of alarms can be synced back because that is all that is retained in the SQLite database.
In general, it is best to keep the local nas database files small: transactionlog.db under 300 MB and database.db (live alarms) under 100 MB. Alarms live in two different places in the nas (the alarm subconsole): active alarms are in the database.db file, and historical data (cleared alarms) is in the transactionlog.db file. Both are SQLite databases, and a free SQLite browser can be used to query them and extract data if necessary.
In very large UIM environments with frequent alarms and high counts, it is best to keep the transaction log retention settings as low as possible.
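As a rough illustration of the size guidance above, the following sketch checks the two local files against the suggested limits and lists the row count of every table in a SQLite file. The `nas_dir` argument is a placeholder for wherever the nas probe directory lives in your installation, and no table names are assumed: the tables are discovered from `sqlite_master`. Run it against a copy of the files if you are unsure about touching the live ones.

```python
import os
import sqlite3
from pathlib import Path

MB = 1024 * 1024
# Suggested upper sizes from this article (in MB).
SIZE_LIMITS_MB = {"transactionlog.db": 300, "database.db": 100}


def check_nas_db_sizes(nas_dir, limits_mb=None):
    """Return a warning string for each local nas file above its limit."""
    limits_mb = SIZE_LIMITS_MB if limits_mb is None else limits_mb
    warnings = []
    for name, limit in limits_mb.items():
        path = os.path.join(nas_dir, name)
        if not os.path.exists(path):
            continue  # file not present in this directory
        size_mb = os.path.getsize(path) / MB
        if size_mb > limit:
            warnings.append(f"{name}: {size_mb:.1f} MB exceeds {limit} MB")
    return warnings


def table_row_counts(db_path):
    """Open a SQLite file read-only and return {table_name: row_count}."""
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"
    conn = sqlite3.connect(uri, uri=True)
    try:
        names = [row[0] for row in conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table'")]
        return {
            name: conn.execute(f'SELECT COUNT(*) FROM "{name}"').fetchone()[0]
            for name in names
        }
    finally:
        conn.close()
```

Calling `check_nas_db_sizes(nas_dir)` returns an empty list when both files are within the suggested sizes.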
Backend nas database tables (NiS Bridge)
In the Alarm Console in UMP, the data is pulled from the NAS_ALARMS table (active alarms), the NAS_TRANSACTION_LOG table (transaction history) and the NAS_TRANSACTION_SUMMARY table (transaction summary). You can query both of those 'transaction' tables, but be aware that data is only kept for the retention periods specified under 'Transaction Log Management' in the nas.
If the local nas database files become too large, or any backend nas table grows beyond 1M rows, you may start to see unexpected results, e.g., alarm delays, missed alarms, nas 'sync' issues, scripts not firing or not firing on time, alarms intermittently not clearing, etc.
If one or more queries being run somehow involves the backend NAS tables, e.g., in UMP/USM searching alarm history, the query may be slow or become hung and never finish, e.g., if the NAS_TRANSACTION_SUMMARY table was > 100M rows.
Note that millions of rows in the nas_transaction_log table will cause performance issues with the nas probe, and the NiS sync process will take a long time, which delays configuration changes. You can either run a DELETE statement to remove rows from that table older than a given date, or TRUNCATE the table. See the article listed below for a more solid and potentially permanent approach, which enables the nas housekeeping to complete its run and in turn limits the size of that table over time.
NAS_TRANSACTION_LOG table growing indefinitely
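The date-based delete mentioned above can be sketched as a batched purge, so that no single statement has to remove millions of rows at once. This is only an illustration: it uses Python's built-in `sqlite3` module as a stand-in, the `rowid`/`LIMIT` form is SQLite syntax, and the timestamp column name is passed in because the real column name in your backend table is not stated here and must be checked against your schema. On a SQL Server, MySQL or Oracle backend the equivalent statement will look different.

```python
import sqlite3


def purge_older_than(conn, table, ts_column, cutoff, batch_size=10000):
    """Delete rows where ts_column < cutoff, batch by batch.

    Returns the total number of rows removed. The table and column
    names are supplied by the caller and are assumptions to verify.
    """
    removed = 0
    while True:
        cur = conn.execute(
            f"DELETE FROM {table} WHERE rowid IN ("
            f"SELECT rowid FROM {table} WHERE {ts_column} < ? LIMIT ?)",
            (cutoff, batch_size),
        )
        conn.commit()  # commit each batch to keep transactions short
        if cur.rowcount == 0:
            return removed
        removed += cur.rowcount
```

For example, with a test table holding one row per day for 30 days, purging everything before day 15 in batches of 5 removes 14 rows and leaves 16.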
Tips and Techniques
- Try to avoid having all of your nas AO Profiles, preprocessing rules and scripts running on the Primary hub because eventually you will have unexpected results due to the load on the nas.
- Try to avoid using very frequent intervals, such as a large number of nas AO rules running every minute.
- Make sure that no 'custom' probes are generating unnecessary noise / traffic to the nas and increasing its overhead. To analyze a custom probe/script, take a close look at what it is doing in terms of alarm processing and decide whether it's optimal or not. In one case, we found a significant number of clear messages being sent unnecessarily, which added to nas overhead: the nas AO was auto-acknowledging them and adding unneeded events to the transactionlog.db.
- Note that "on arrival" applies to the arrival of the message that triggered the AO. So if your script modifies the alarm, that alarm is put back on the bus and "arrives" again. So, essentially what you do in this situation is create a loop where every time you update the message it causes another update.
- The "On overdue age" option is vaguely named but translates to "Run once the alert is older than this." It fires only once because the alert is only ever older than the specified value once.
- Small values of overdue age e.g., 1-4 seconds may yield unexpected results. It seems that there is a point where it takes a time stamp and then writes that to the local database and then it looks up everything that should run and if it took longer than that one second to get committed to the database, you'll lose it. In the scheme of things, 5 seconds is probably better than one.
- Also note that the nas only runs 1 profile at a time so if you have a bunch firing, you could see some unexpected results as some get delayed because of the others.
- Avoid the "On Interval" option, or use it only sparingly.
- Ensure that all of your rules are 100% unique so that you don’t have any redundant/competing rules or rules that are in conflict with one another.
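To see why many profiles firing together can produce delays, consider a deliberately simplified model of the "one profile at a time" behaviour described above. It is not a description of nas internals, and the run times are made-up inputs: it only shows how waiting time accumulates when work is processed serially.

```python
def start_delays(run_times_s):
    """Wait (in seconds) before each profile starts when all are queued
    at the same moment and run strictly one after another."""
    delays = []
    elapsed = 0.0
    for run_time in run_times_s:
        delays.append(elapsed)
        elapsed += run_time
    return delays
```

With three profiles taking 2, 3 and 1.5 seconds, `start_delays([2, 3, 1.5])` gives waits of 0, 2 and 5 seconds, so the third profile starts later than a 1-4 second overdue age would suggest.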
In some severe cases, where the number of open alarms in UIM has reached a very high number, e.g., 5000+ or even much more, it may require one or more of the following steps to make the message environment more stable. For example:
1. Reduce alarms, e.g., adjust hub/robot settings to eliminate/reduce robot inactive alarms, and reduce the number of monitored metrics that are not KPIs and have no business reason/impact associated with their alarms/thresholds. Any/all alarms with high counts need to be closely examined. This should be an ongoing effort until unnecessary alarms and/or alarm counts are at a reasonable level.
2. Adjust the Transaction log retention to very low values so the transactionlog.db and database.db are reduced. In one large global environment, we set it to 1, 2 and 3 with 1-hour administration intervals (default), but perhaps 1, 5, and 15 may be enough - it depends on the size and scale of your global UIM environment.
The NiS Bridge settings can be set lower than their defaults of 7, 30 and 90, to 7, 15 and 30. Again, keep the 1 hour Administration interval if possible.
If your nas Administration ("housekeeping", as it is noted in the log) cannot complete within 1 hour with reasonable settings, you probably still have too many alarms.
3. Start offloading nas AO profiles or preprocessing rules ASAP to one or more nas'es on remote hubs.
4. If the nas housekeeping job cannot finish, you'll see evidence of that in the nas.log at loglevel 5 (search for housekeeping). To alleviate this issue, you may have to manually reduce the size of some of the backend nas tables by deleting rows that are older than a given date/timestamp.
Once the nas tables are at a reasonable size, e.g., less than 1M rows, the housekeeping will most likely complete and the hourly administration should continue without issue but you should check it over a few days to be sure.
5. Increase monitoring intervals so that they are not unnecessarily creating alarms or increasing alarm count due to their frequency. For example, probes like snmpcollector monitoring a ton of devices every 3 or 5 minutes, cdm iostat monitoring every 1 minute and so forth.
6. With regard to the transaction log retention settings, the 'retention countdown' doesn’t start until the active alarm is cleared/acknowledged. Alarms with extreme suppression counts have often been open for far longer than this setting. In other words, for any given alarm, if the retention setting is 7 days, that 7 days doesn’t start until the active alarm is cleared/ack’d.
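The retention countdown in step 6 can be expressed as a small calculation. This is a sketch of the rule as stated there, with hypothetical function and parameter names. The actual removal would happen at a later administration run, so the result is the earliest time an alarm becomes eligible, not the time it is deleted.

```python
from datetime import datetime, timedelta


def earliest_purge_time(cleared_at, retention_days):
    """Return when a cleared alarm becomes eligible for removal.

    An alarm that is still active (cleared_at is None) has no purge
    time, however long ago it originated.
    """
    if cleared_at is None:
        return None
    return cleared_at + timedelta(days=retention_days)
```

An alarm that originated on 1 January but was only cleared on 1 March, with a 7-day retention, becomes eligible on 8 March. Until it is cleared, its transaction history keeps accumulating.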
NAS NIS Bridge Administration (old data housekeeping) fails
Why do high alarm counts adversely affect the nas probe functionality, scalability and/or performance?
For every single active alarm there is 1 entry in each of the NAS_ALARMS and NAS_TRANSACTION_SUMMARY tables, but in the NAS_TRANSACTION_LOG table there is, at a minimum, one row for every suppressed reoccurrence of an active alarm. For example, if there is one active alarm with a suppression count of 1000, there are at least 1000 records/rows in the NAS_TRANSACTION_LOG table. The local transactionlog.db builds up in the same way, which degrades nas performance once it becomes too large, e.g., > 300 MB. This may also come into play during nas housekeeping/cleanup or any query against the local transactionlog.db on the file system. Also, every time you open or close the nas GUI, the nas 'sync' occurs. A very large transaction log table slows this sync and also fills up disk space.
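The arithmetic above can be turned into a quick lower-bound estimate. This sketch simply sums the suppression counts of the active alarms and compares the total with the 1M-row guideline used in this article. The function names are illustrative, and the real table will usually hold more rows than this minimum.

```python
ROW_GUIDELINE = 1_000_000  # backend table size guideline from this article


def min_transaction_log_rows(suppression_counts):
    """Lower bound on NAS_TRANSACTION_LOG rows for the active alarms."""
    return sum(suppression_counts)


def exceeds_guideline(suppression_counts, limit=ROW_GUIDELINE):
    """True if the active alarms alone imply more rows than the limit."""
    return min_transaction_log_rows(suppression_counts) > limit
```

A single alarm with a count of 1000 implies at least 1000 rows. 200 alarms with a count of 6000 each imply at least 1,200,000 rows, which is already past the guideline before any cleared alarms are counted.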
If the local nas database.db contains a large number of alarms, e.g., > 3000, and the transactionlog.db is also large, e.g., GBs, with very high alarm counts for many alarms, one or more of the following efforts should be started as soon as possible to slowly but surely reduce alarm 'bloat' and allow the nas environment to function without any further random, intermittent, unexpected results/issues.
NAS Architecture should be reviewed and in all cases where the load on the main nas can be reduced, that should be addressed as per this nas best practices KB Article.
- High-Level Alarm Policy or principles - any given alarm should have either a business reason, a technical reason or both, and if it doesn't, it should be seriously considered for elimination.
- Document all alarms with high alarm counts, copy alarms with high counts > 500, into Excel for later reference during the alarm reduction project.
- Reassess and adjust thresholds if they are too low. Best practice is to enable baselining for KPIs (3-5) to identify the normal behaviour of any business/technical metric and, once the baseline is developed, adjust (fine-tune) the threshold values to significantly reduce alarm frequency.
- Reduce alarm frequency - re-examine the monitoring interval frequency and increase it if there isn't a good business reason to have it set to a low value. For example, CDM iostat monitoring, website monitoring, url monitoring, CPU/Memory/Disk, device monitoring and so on.
- Administer defunct systems - For robot inactive alarms, implement a plan to effectively decommission any robots that need to be removed from the monitoring environment. Refer to KB Articles on decommissioning robots.
- Disk monitoring - if any tablespaces (datafiles) being monitored are set to Autoextend, only set up monitoring for when they are close to reaching their maximum size. Autoextend is common in Oracle DBs.
- Reduce general alarm noise - Eliminate any/all unnecessary alarm thresholds unless they serve some related business purpose or are required for critical resource monitoring.
- CPU monitoring - decide upon valid threshold values for Windows versus UNIX/Linux, e.g., 99% for 20 minutes/n samples on Windows, versus 99% for 5 minutes on UNIX/Linux.
- Physical Memory - when 10% remains or whatever makes the most sense based on your requirements for each application/OS/server type...
- Process monitoring - don't monitor a process very frequently unless there is a good reason to do so. Perhaps the reason may be temporary.
- Live alarms for a single customer should ideally be kept at what a UIM Administrator considers a manageable level, e.g., perhaps fewer than 2k. That will depend on the environment and the extent of the monitoring footprint.
- The bottom line to extremely high alarm counts is that they exist and persist usually because no one knows what to do to resolve it or cares enough about the alarm to resolve it, or has the time to address it. That said, there is no use in allowing the alarm to continue to persist and build up, e.g., for weeks or months, unless there is a solid plan in place to manage them.
You can undertake efforts to reduce the overall load on the nas but if no monitoring governance is undertaken, and alarms are not resolved, and/or they never will be, the alarm thresholds should be disabled otherwise the alarms will simply continue to build up day after day. Alarms are worthless unless some plan of action is put into place to handle them.
To start with, you could delete any alarms older than a week or two based on Origin Time. Then start applying monitoring governance to the alarms that no one is responding to: turn them off (disable the thresholds). The underlying root problem is most commonly the total number of active alarms and the number of alarms with high counts.
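The triage steps described in this article (documenting alarms with counts > 500 for later reference, and finding alarms older than a week or two by origin time) could be sketched as follows. The alarm records here are plain dictionaries with hypothetical keys `message`, `count` and `origin_time`. How you export the alarms from your environment, and what the real field names are, is left open.

```python
import csv
from datetime import datetime, timedelta

FIELDS = ["message", "count", "origin_time"]  # hypothetical field names


def high_count_alarms(alarms, threshold=500):
    """Alarms whose count exceeds the threshold, highest count first."""
    selected = [a for a in alarms if a["count"] > threshold]
    return sorted(selected, key=lambda a: a["count"], reverse=True)


def stale_alarms(alarms, now, max_age_days=14):
    """Alarms whose origin time is older than max_age_days before now."""
    cutoff = now - timedelta(days=max_age_days)
    return [a for a in alarms if a["origin_time"] < cutoff]


def write_csv(alarms, fh):
    """Write the selected alarms as CSV (opens directly in Excel)."""
    writer = csv.DictWriter(fh, fieldnames=FIELDS)
    writer.writeheader()
    for alarm in alarms:
        writer.writerow({key: alarm[key] for key in FIELDS})
```

Keeping the selection separate from the export means the same lists can feed both the reference spreadsheet and the review of which thresholds to disable.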