How can we optimize nas probe performance and scalability to ensure stability?
- growth of environment/nas configuration
Release : 8.51 or higher
Component : UIM - NAS
- Offload nas AO profiles/preprocessing rules from the Primary whenever possible. The inventory of alarms being sent to any one nas probe can be significantly reduced through the use of secondary nas'es - as you'll have each secondary nas probe managing a 'subset' of the overall alarm inventory.
- If you choose to deploy remote/secondary nas'es to reduce the overhead on the primary nas, then once the secondary nas probes have been distributed and the Auto Operator logic migrated to preprocessor logic (as needed), you'll need to set up forwarding/replication rules on each nas probe.
- You can deploy a nas on each of the busiest hubs/tunnel concentrators. This nas would only run the AO Profiles and preprocessing scripts that prevent alarms from 'propagating.' For example, if it makes sense, you can close the alarms locally if they are generated by robots attached to that hub. Otherwise you can use nas replication to move any open alarms to the Primary hub where the primary nas resides. The primary nas would then be responsible for running all remaining AO profiles/processing scripts.
- Do not enable the NiS Bridge on any secondary hub (including the HA backup hub). By default, it is not present if you just install the nas probe, but if you start by copying the nas.cfg from the primary hub nas, it will be there and it will be enabled. Only one nas can write to the backend nas tables at any one time.
Monitoring Governance/Alarm Reduction
It is tempting to enable/'turn on' a lot of monitoring when you first deploy UIM, but over time this can cause havoc. Best practice is to enable alarm thresholds only for Key Performance Indicators (KPIs) that are associated with an upstream effect on the business in some way. Try starting with a maximum of 5 KPIs per application/technology. Ask Support if we have any suggested KPIs for VMware, Citrix, NetApp, Nutanix, Exchange, etc. That is the base starting point - a small number of KPIs (key metrics). Always ask why it's important to collect the data (QOS or alarms), how often it needs to be collected and why, and how long it should be stored. Keep all of these monitoring aspects to a minimum. Note that some probes have monitoring enabled 'right out of the box' for many metrics - but customers must always decide which ones MUST be kept enabled versus the nice-to-haves.
UIM suppresses like alarms and updates the alarm count. But should alarms keep being generated in the hundreds or thousands while the alarm counts climb unchecked? This is not good practice, as it can adversely affect nas performance, scalability and nas housekeeping (maintenance), as well as alarm displays, reliability and use of system resources. For any alarm with a suppression count > 100, ask whether the nas will be able to handle more and more alarms with higher and higher counts. It will not, at least not without running into performance problems, nas sync issues, processing delays, or display issues in the alarm view. Take the necessary time to review and adjust your monitoring policy and your process for alarms and related ticket handling.
Why allow this to happen if no action will be taken to resolve the underlying problem? For example, of what use is it to have 8000+ Robot Inactive alarms whose counts keep increasing when nothing is being done about them? It's just noise, and it places an unnecessary load on the environment which usually worsens over time.
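To make this review concrete, the sketch below lists alarms whose suppression count exceeds the 100 mark mentioned above, highest first. It is only an illustration: it assumes the alarms have already been exported into plain dictionaries, and the keys `source`, `message` and `count` are hypothetical names, not UIM field names.

```python
from typing import Iterable

# Threshold taken from the text above: counts over 100 deserve a review.
REVIEW_THRESHOLD = 100


def noisy_alarms(alarms: Iterable[dict], threshold: int = REVIEW_THRESHOLD) -> list:
    """Return alarms whose suppression count exceeds the threshold, highest first."""
    flagged = [a for a in alarms if a.get("count", 0) > threshold]
    return sorted(flagged, key=lambda a: a["count"], reverse=True)


# Hypothetical sample data, e.g. taken from an exported alarm list.
sample = [
    {"source": "robot-a", "message": "Robot inactive", "count": 8000},
    {"source": "robot-b", "message": "CPU usage high", "count": 12},
    {"source": "robot-c", "message": "Disk usage above threshold", "count": 150},
]

for alarm in noisy_alarms(sample):
    print(f'{alarm["count"]:>6}  {alarm["source"]}  {alarm["message"]}')
```

Each alarm on the resulting list is a candidate for either a threshold change or an agreed handling process.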
Local nas database files (transactionlog.db, database.db)
Note that in the nas GUI there are two tabs, "Transaction Log" and "NiS Bridge", which independently set the alarm data retention for their respective tables. The "Transaction Log" tab sets the retention for the local SQLite database. If this is set lower than the retention in "NiS Bridge", you will see fewer alarms after a re-sync. For example, if the transaction log is set to 30 days but NiS Bridge is set to 90 days, then when you drop the tables and re-sync, only 30 days of alarms can be synced back, because that is all that is retained in the SQLite database.
In general, it is best to keep the local nas database files small: transactionlog.db under 300 MB and database.db (live alarms) under 100 MB. Alarms live in two different places: in the nas (the alarm subconsole), active alarms are in the database.db file, and historical data (cleared alerts) is in the transactionlog.db file. These are SQLite databases, and a free SQLite browser can be used to query them and extract the data if necessary.
In very large UIM environments where there are frequent alarms and high counts, it is best to keep the transaction log settings as low as 1, 2 and 3 respectively. You can then set the nas NiS Bridge to 7, 10, and 20 respectively and query the backend nas tables when needed.
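The relationship between the two retention settings can be expressed as a one-line rule. The function below is a minimal sketch of that rule, assuming the smaller of the two values is what limits a re-sync; the function name is illustrative only.

```python
def resync_window_days(transaction_log_days: int, nis_bridge_days: int) -> int:
    """Days of alarm history available after dropping the backend tables and re-syncing.

    The local SQLite transaction log is the source for the re-sync, so it cannot
    supply more history than it retains itself.
    """
    return min(transaction_log_days, nis_bridge_days)


# The example from the text: 30 days locally, 90 days in the NiS Bridge.
print(resync_window_days(30, 90))  # 30
```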
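The size guideline above is easy to check with a small script. This is a sketch only: the directory passed in is whatever folder holds the nas database files in your installation, and the function and variable names are illustrative.

```python
from pathlib import Path

MB = 1024 * 1024
# Guideline sizes from the text above, in MB.
SIZE_LIMITS_MB = {"transactionlog.db": 300, "database.db": 100}


def oversized_db_files(nas_dir: str, limits_mb: dict = SIZE_LIMITS_MB) -> list:
    """Return one warning line for each nas database file at or above its guideline size."""
    warnings = []
    for name, limit_mb in limits_mb.items():
        path = Path(nas_dir) / name
        if not path.is_file():
            continue  # file is not present in this directory
        size_mb = path.stat().st_size / MB
        if size_mb >= limit_mb:
            warnings.append(f"{name}: {size_mb:.1f} MB (guideline: under {limit_mb} MB)")
    return warnings
```

Calling `oversized_db_files()` with the nas probe directory returns an empty list when both files are within the guideline.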
Backend nas database tables (NiS Bridge)
In the Alarm Console in UMP, the data is pulled from the NAS_ALARMS table (active alarms), the NAS_TRANSACTION_LOG table (transaction history) and the NAS_TRANSACTION_SUMMARY table (transaction summary). You can query both of the 'transaction' tables, but be aware that the data is only stored for as long as the 'Transaction Log Management' time periods specify.
If any of the nas DB tables become too large, or any of the backend DB tables grow beyond 1M rows, you may start to see unexpected results, e.g., alarm delays, missed alarms, nas 'sync' issues, scripts not firing or not firing on time, alarms intermittently not clearing, etc.
If one or more queries being run somehow involves the backend NAS tables, e.g., in UMP/USM searching alarm history, the query may be slow or become hung and never finish, e.g., if the NAS_TRANSACTION_SUMMARY table was > 100M rows.
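A quick row count of the three backend tables shows whether any of them has passed the 1M-row mark. The sketch below uses Python's built-in sqlite3 module purely as a stand-in so that it runs anywhere; against a real UIM backend you would pass a connection from the driver for your database instead, which is an assumption about your environment.

```python
import sqlite3

ROW_LIMIT = 1_000_000  # guideline from the text above
NAS_TABLES = ("NAS_ALARMS", "NAS_TRANSACTION_LOG", "NAS_TRANSACTION_SUMMARY")


def tables_over_limit(conn, limit: int = ROW_LIMIT) -> dict:
    """Return {table: row_count} for each backend nas table above the limit."""
    over = {}
    cur = conn.cursor()
    for table in NAS_TABLES:  # fixed names from the tuple above, safe to interpolate
        cur.execute(f"SELECT COUNT(*) FROM {table}")
        count = cur.fetchone()[0]
        if count > limit:
            over[table] = count
    return over


# Stand-in demo database with a tiny limit, for illustration only.
demo = sqlite3.connect(":memory:")
for table in NAS_TABLES:
    demo.execute(f"CREATE TABLE {table} (id INTEGER)")
demo.executemany("INSERT INTO NAS_TRANSACTION_LOG VALUES (?)", [(i,) for i in range(5)])
print(tables_over_limit(demo, limit=3))  # {'NAS_TRANSACTION_LOG': 5}
```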
Note that millions of rows in the nas_transaction_log table can cause performance issues with the nas probe, and the NiS sync process will take a long time, which delays configuration changes. Do NOT drop the nas_transaction_log table. You can either run a DELETE statement to remove rows from that table based on a given date, or TRUNCATE the table. However, see the article listed below for a more solid and potentially permanent approach, which enables the nas housekeeping to complete its run and in turn keeps the size of that table down over time.
NAS_TRANSACTION_LOG table growing indefinitely
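If you do delete rows by date, doing it in batches keeps each transaction small. The sketch below shows the idea against an in-memory SQLite stand-in. The `time` column name and the text timestamp format are assumptions for the demo, and the batching syntax shown is SQLite-specific, so it would need adapting to your backend database. Take a backup before running any delete against production tables.

```python
import sqlite3
from datetime import datetime, timedelta

TS_FORMAT = "%Y-%m-%d %H:%M:%S"


def delete_older_than(conn, cutoff: datetime, batch_size: int = 10_000) -> int:
    """Delete NAS_TRANSACTION_LOG rows older than cutoff in batches; return rows deleted."""
    total = 0
    while True:
        cur = conn.execute(
            "DELETE FROM NAS_TRANSACTION_LOG WHERE rowid IN ("
            "SELECT rowid FROM NAS_TRANSACTION_LOG WHERE time < ? LIMIT ?)",
            (cutoff.strftime(TS_FORMAT), batch_size),
        )
        conn.commit()  # commit per batch so each transaction stays small
        total += cur.rowcount
        if cur.rowcount < batch_size:
            return total


# Stand-in demo: five rows aged 0, 10, 20, 30 and 40 days.
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE NAS_TRANSACTION_LOG (nimid TEXT, time TEXT)")
now = datetime(2024, 6, 1)
rows = [(f"id{i}", (now - timedelta(days=i * 10)).strftime(TS_FORMAT)) for i in range(5)]
demo.executemany("INSERT INTO NAS_TRANSACTION_LOG VALUES (?, ?)", rows)

deleted = delete_older_than(demo, now - timedelta(days=15), batch_size=2)
print(deleted)  # 3
```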
Tips and Techniques
- Try to avoid having all of your nas AO Profiles, preprocessing rules and scripts running on the Primary hub because eventually you will have unexpected results due to the load on the nas.
- Try to avoid using very frequent intervals, such as a large number of nas AO rules running every minute.
- Make sure that no 'custom' probes are generating unnecessary noise/traffic to the nas and increasing its overhead. To analyze any custom probe/script, take a close look at what it is doing in terms of alarm processing and decide whether it's optimal or not. In one case, we found a significant number of clear messages being sent unnecessarily, which added to nas overhead. The nas AO was auto-acknowledging them and adding unneeded events to the transactionlog.db.
- Note that "on arrival" applies to the arrival of the message that triggered the AO. So if your script modifies the alarm, that alarm is put back on the bus and "arrives" again. So, essentially what you do in this situation is create a loop where every time you update the message it causes another update.
- The "On overdue age" option is vaguely named but translates to "Run once the alert is older than this." It fires only once because the alert is only ever older than the specified value once.
- Small values of overdue age e.g., 1-4 seconds are unreliable. It seems that there is a point where it takes a time stamp and then writes that to the local database and then it looks up everything that should run and if it took longer than that one second to get committed to the database, you'll lose it. In the scheme of things, 5 seconds is probably better than one.
- Also note that the nas only runs 1 profile at a time so if you have a bunch firing, you could see some unexpected results as some get delayed because of the others.
- Do not use the "On Interval" option, or use it only sparingly.
- Ensure that all of your rules are 100% unique so that you don’t have any redundant/competing rules or rules that are in conflict with one another.
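The "on arrival" loop described in the tips above can be illustrated with a toy simulation. This is not nas code and does not use any nas API: the queue stands in for the message bus, and the `handled` flag is a hypothetical marker showing one way a script could avoid re-triggering itself.

```python
from collections import deque


def run_on_arrival(initial_alarm: dict, handler, max_events: int = 10) -> int:
    """Deliver an alarm to handler; every update the handler returns 'arrives' again.

    Returns the number of times the handler ran (capped at max_events).
    """
    bus = deque([initial_alarm])
    runs = 0
    while bus and runs < max_events:
        alarm = bus.popleft()
        runs += 1
        updated = handler(alarm)
        if updated is not None:
            bus.append(updated)  # the modified alarm is put back on the bus
    return runs


def naive_handler(alarm: dict):
    # Always rewrites the alarm, so every run triggers another arrival.
    return {**alarm, "message": alarm["message"] + " [checked]"}


def guarded_handler(alarm: dict):
    # Only updates alarms that it has not already handled.
    if alarm.get("handled"):
        return None
    return {**alarm, "handled": True}


alarm = {"message": "Disk full", "handled": False}
print(run_on_arrival(alarm, naive_handler))    # 10 (stopped only by the cap)
print(run_on_arrival(alarm, guarded_handler))  # 2
```

The naive handler never stops on its own, while the guarded handler runs once for the original alarm and once more for its own update, then stops.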
In some severe cases, where the number of open alarms in UIM has reached a very high number, e.g., 5000+ or even much more, it may require one or more of the following steps to make the message environment more stable. For example:
1. Reduce alarms, e.g., adjust hub/robot settings to eliminate or reduce robot inactive alarms, and reduce the number of monitored metrics that are not KPIs and have no business reason or impact associated with their alarms/thresholds. Any/all alarms with high counts need to be closely examined. This should be an ongoing effort until unnecessary alarms and/or alarm counts are at a reasonable level.
2. Adjust the Transaction log retention to very low values so the transactionlog.db and database.db are reduced. In one large global environment, we set it to 1, 2 and 3 with 1-hour administration intervals (default), but perhaps 1, 5, and 15 may be enough - it depends on the size and scale of your global UIM environment.
The NiS Bridge settings can be set lower than their defaults of 7, 30 and 90, to 7, 15 and 30. Again, keep the 1 hour Administration interval if possible.
If your nas Administration ("housekeeping", as it is noted in the log) cannot complete within 1 hour with reasonable settings, you probably still have too many alarms.
3. Start offloading nas AO profiles or preprocessing rules ASAP to one or more nas'es on remote hubs.
4. If the nas housekeeping job cannot finish, you'll see evidence of that in the nas.log at loglevel 5 (search for housekeeping). To alleviate this issue, you may have to manually reduce the size of some of the backend nas tables by deleting rows that are older than a given date/timestamp.
Once the nas tables are at a reasonable size, e.g., less than 1M rows, the housekeeping will most likely complete and the hourly administration should continue without issue but you should check it over a few days to be sure.
5. Increase monitoring intervals so that they are not unnecessarily creating alarms or increasing alarm counts due to their frequency. Examples include probes like snmpcollector monitoring a large number of devices every 3 or 5 minutes, cdm iostat monitoring every 1 minute, and so forth.
6. With regard to the transaction log retention settings, the 'retention countdown' doesn't start until the active alarm is cleared/acknowledged. Alarms with extreme suppression counts have often been open for far longer than this setting. In other words, for any given alarm, if the retention setting is 7 days, those 7 days don't start until the active alarm is cleared/ack'd.
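The retention countdown in step 6 can be written out as simple date arithmetic. The function below is a sketch of that rule only; its name and arguments are illustrative, and it does not read anything from the nas.

```python
from datetime import datetime, timedelta
from typing import Optional


def earliest_purge_time(cleared_at: Optional[datetime], retention_days: int) -> Optional[datetime]:
    """Return when an alarm becomes eligible for removal, or None while it is still active."""
    if cleared_at is None:
        return None  # countdown has not started: the alarm is still open
    return cleared_at + timedelta(days=retention_days)


# An alarm raised on 1 January but not cleared until 1 March, with 7-day retention.
cleared = datetime(2024, 3, 1)
print(earliest_purge_time(cleared, 7))  # 2024-03-08 00:00:00
print(earliest_purge_time(None, 7))     # None
```

The date the alarm was raised plays no part in the calculation, which is why long-lived alarms outlast the retention setting.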
NAS NIS Bridge Administration (old data housekeeping) fails