How can we optimize nas probe performance and scalability to ensure stability and consistency?
How do we automate and ensure the success of nas maintenance and administration?
Unexpected alarm issues/anomalies can be caused by nas administration/maintenance failures when the local nas SQLite .db files grow too large (e.g., transactionlog.db > 300-500 MB) or when the backend nas tables grow too large (greater than 1 million rows).
Offload nas AO profiles/preprocessing rules from the Primary whenever possible. The inventory of alarms sent to any single instance of the nas probe can be significantly reduced through the use of secondary nas'es, since each secondary nas probe then manages a 'subset' of the overall alarm inventory and any associated rules and/or scripts.
If you choose to deploy remote/secondary nas'es to cut down on the primary nas overhead, then once the secondary nas probes have been distributed and the Auto Operator logic migrated to preprocessing logic (as needed), you will need to set up forwarding/replication rules on each nas probe:
You can deploy a nas on each of the busiest hubs/tunnel concentrators. This nas would only run the AO Profiles and preprocessing scripts that prevent alarms from 'propagating.' For example, if it makes sense, you can close the alarms locally if they are generated by robots attached to that hub. Otherwise, you can use nas replication to move any open alarms to the Primary hub where the primary nas resides. This nas would then be responsible for running all remaining nas AO profiles/processing scripts.
Do not enable the NiS Bridge on any secondary hub (including the HA backup hub). By default it is not even present if you simply install the nas probe, but if you start by copying the nas.cfg from the primary hub nas, it will be there and it will be enabled.
Only a single nas can write to the backend nas table at one time. If a second nas NiS Bridge is enabled at the same time as the Primary nas is running, it will cause duplicate key errors in the nas log. For example:
nas: COM Error [0x80040e2f] IDispatch error #3119 - [Microsoft OLE DB Provider for SQL Server] Violation of UNIQUE KEY constraint 'UQ__NAS_TRAN__475AXXXXXXXF79D5'. Cannot insert duplicate key in object 'dbo.NAS_TRANSACTION_SUMMARY'. The duplicate key value is (XLXXXX1264-3XX86)
It is tempting to enable a lot of monitoring when you first deploy UIM, but over time this can cause havoc. Best practice is to enable alarm thresholds only for Key Performance Indicators (KPIs) that have some upstream effect on the business. Try starting with a maximum of 5 KPIs per application/technology; ask Support whether there are suggested KPIs for VMware, Citrix, NetApp, Nutanix, Exchange, etc. That is the base starting point: a small number of KPIs (key metrics). Always ask why it is important to collect the data (QoS or alarms), how often it needs to be collected and why, and how long it should be stored, and keep all of these monitoring aspects to a minimum. Note that some probes have monitoring enabled 'right out of the box' for many metrics, but customers must always decide which ones MUST stay enabled versus the nice-to-haves.
UIM suppresses 'like' alarms and updates the alarm count. But should alarms keep being generated in the hundreds or thousands, letting suppression counts grow exponentially? This is not good practice: it adversely affects nas performance, scalability and housekeeping (maintenance), as well as alarm display, reliability and overall system resource usage. Any alarm with a suppression count greater than 100 begs the question of whether the nas can keep handling more and more alarms with higher and higher counts; in practice it cannot without running into performance, display or nas sync issues, and even delays in alarm processing. Take the necessary time to review and adjust the monitoring policy and the process for alarms and related ticket handling.
Why allow this to happen if no action will be taken to alleviate the issue or resolve the problem? For example, what use is there in letting 8000+ Robot Inactive alarms keep incrementing when nothing is being done about them? It is just 'noise,' and it places an unnecessary load on the environment that usually worsens over time.
Note that in the nas GUI there are two tabs, "Transaction Log" and "NiS Bridge," which independently set the alarm data retention for their respective stores. The "Transaction Log" tab sets the retention for the local SQLite database. If it is set lower than the retention on the "NiS Bridge" tab, you will see fewer alarms: for example, if the transaction log is set to 30 days but NiS Bridge is set to 90 days, then when you drop the backend nas tables and re-sync, only 30 days of alarms can be synced back, because that is all the SQLite database retains.
In general, it is best to keep the local nas database files small: transactionlog.db less than 300 MB and database.db (live alarms) less than 100 MB. Alarms live in two different places: in the nas (the alarm subconsole), active alarms are stored in the database.db file, and historical data (cleared alerts) in the transactionlog.db file. These are SQLite databases, and a free SQLite browser can be used to query them and extract the data if necessary.
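For example, the following is a minimal sketch of inspecting the local transaction log with the sqlite3 command-line shell. The install path shown is only a typical default, and the internal table/column names are assumptions based on the backend schema described below, so list the tables first and adjust accordingly:

sqlite3 "C:\Program Files (x86)\Nimsoft\probes\service\nas\transactionlog.db"
.tables
-- Table and column names below are assumptions; verify them against the .tables output.
SELECT COUNT(*) FROM NAS_TRANSACTION_LOG;   -- number of historical rows retained locally
SELECT MIN(time) FROM NAS_TRANSACTION_LOG;  -- oldest retained entry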
In very large UIM environments where there are frequent alarms and high counts, it is best to keep the transaction log settings as low as possible but not so low that it causes you to miss messages/activity that you need to troubleshoot.
In the Alarm Console in UMP/OC, the data is pulled from the NAS_ALARMS table (active alarms), the NAS_TRANSACTION_LOG table (transaction history) and the NAS_TRANSACTION_SUMMARY table (transaction summary). You can query against both of the 'transaction' tables, but be aware that the data is only stored for as long as the 'Transaction Log Management' retention periods specified in the nas.
If any of the nas DB tables listed below become too large (approximately 1 million rows or more), you may start to see unexpected results, e.g., alarm delays, missed alarms, nas 'sync' issues, scripts not firing or firing late, alarms not being generated because rules do not fire, alarms intermittently failing to clear, etc.
- NAS_ALARMS
- NAS_TRANSACTION_SUMMARY
- NAS_TRANSACTION_LOG
If any queries being run involve the backend NAS tables, e.g., searching alarm history in UMP/USM/OC, the underlying query or Java code may respond slowly, or hang and never finish, e.g., if the NAS_TRANSACTION_SUMMARY table has grown beyond 100 million rows. A quick row-count check is sketched below.
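As a quick health check, the illustrative queries below (run against the UIM backend database) report the row counts of these tables so you can see whether any of them has crossed the rough thresholds above. Note that on a badly bloated database even a plain COUNT(*) can take a while:

SELECT COUNT(*) AS nas_alarms_rows FROM NAS_ALARMS;
SELECT COUNT(*) AS nas_transaction_summary_rows FROM NAS_TRANSACTION_SUMMARY;
SELECT COUNT(*) AS nas_transaction_log_rows FROM NAS_TRANSACTION_LOG;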
Note that millions of rows in the nas_transaction_log table will cause performance issues with the nas probe, and the NiS sync process will take a long time. This also adversely affects making configuration changes.
You can either manually run a DELETE statement to remove rows from that table older than a given date, or TRUNCATE the table (a rough sketch follows the article title below), but see the article listed below for a more solid and potentially permanent approach, which enables the nas housekeeping to complete its run and, in turn, minimizes the size of that table over time.
NAS_TRANSACTION_LOG table administration (housekeeping) unsuccessful and not being maintained
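If you do decide to trim the table manually, the statements below are a hedged sketch only (SQL Server syntax, using the 'time' column referenced later in this article); test them outside production hours and prefer the housekeeping-based fix from the linked article:

-- Remove history older than 30 days (adjust to match your nas retention setting).
DELETE FROM NAS_TRANSACTION_LOG WHERE time < DATEADD(day, -30, GETDATE());
-- For very large tables, consider deleting in batches, e.g., DELETE TOP (10000) ... in a loop,
-- to limit database transaction log growth.
-- Or remove all history at once:
TRUNCATE TABLE NAS_TRANSACTION_LOG;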
Try to avoid having all of your nas AO profiles, preprocessing rules and scripts running on the Primary hub, because eventually you may see unexpected results due to the load.
Try to avoid using very frequent intervals, such as a large number of nas AO rules running every minute.
Make sure that no 'custom' probes are generating unnecessary noise/traffic to the nas, increasing its overhead.
To analyze any custom probe/script, take a close look at what the probe/script is doing in terms of alarm processing and decide whether it is optimal. In one case, we found a significant number of clear messages being sent unnecessarily, which added to the nas overhead; the nas AO was auto-acknowledging them and adding unneeded events to the transactionlog.db.
Be aware that the nas option "On arrival" applies to the arrival of the message that triggered the AO. So if your script modifies the alarm, that alarm is put back on the message bus and "arrives" again; essentially, you create a loop where every update to the message causes another update.
The "On overdue age" option is vaguely named but translates to "Run once the alert is older than x." It fires only once because the alert is only ever older than the specified value once.
Small values of overdue age, e.g., 1-3 seconds, may yield unexpected results. There appears to be a point where the nas takes a timestamp, writes it to the local database, and then looks up everything that should run; if the commit to the database takes longer than that one second, the event is lost. In the scheme of things, 5 or more seconds is better than trying to use 1-4 seconds.
Also, note that the nas runs only one profile at a time, so if you have many profiles firing, you could see unexpected results as some are delayed by the others that are running.
Do not use, or only use sparingly, the "On Interval" option.
Ensure that all of your rules are 100% unique so that you don’t have any redundant/competing rules or rules that are in conflict with one another.
In some severe cases, where the number of open alarms in UIM has reached a very high number, e.g., 5000 or even much more, one or more of the following steps may be required to make the messaging environment more stable. For example:
NAS NIS Bridge Administration (old data housekeeping) fails
For every single active alarm there is one entry in the NAS_ALARMS and NAS_TRANSACTION_SUMMARY tables, but in the NAS_TRANSACTION_LOG there is, at a minimum, one row for every recurrence of an active alarm that is suppressed. For example, if there is one active alarm with a suppression count of 1000, there are at a minimum 1000 records/rows in the NAS_TRANSACTION_LOG table.
As a result, the local transactionlog.db also builds up, which decreases nas performance once it becomes too large, e.g., more than 300 MB. This also comes into play during nas housekeeping/cleanup and any query against the local transactionlog.db file on the file system; in other words, the nas housekeeping may not complete.
Also, every time you open or close the nas GUI, the nas 'sync' occurs. A very large transaction log table affects/delays this sync action and may also fill up disk space.
nas transaction log housekeeping entries look like this in the nas.log when loglevel is set to 5 (debug):
nas: Transaction-log database housekeeping used 27ms.
nas: Transaction-log database housekeeping scheduled to Tue Jun 13 00:30, 2023
In the nas GUI 'Status' tab window, right-click in the empty space, then choose 'Advanced' and select 'Reorganize' database.
At the same time, have the nas.log open in IM and you should see entries like the following:
nas: Nis-Bridge: Transaction-log administration succeeded deleting 5000 transaction entries older than 30 days.
nas: nisRun: finishing batch before NTL compression operations
nas: Nis-Bridge: Transaction-log administration succeeded compressing 5000 transaction entries older than 7 days.
nas: nisRun: finishing batch before NTS cleanup operations
nas: NiS-Bridge: Transaction-log administration used 19ms
For added performance, you can try increasing the nis_trans_delete_size setting from 5000 to 10000, so that every 5 minutes 10,000 records are deleted from the nas transaction log (if they exist).
Then check the nas.log to see how many milliseconds the delete took to complete, e.g.,
nas: NiS-Bridge: Transaction-log administration used 19ms
If the oldest 'time' in the nas_transaction_log table (e.g., from select top(1) time from nas_transaction_log order by time asc) lines up with the nas transaction log configuration setting, then the housekeeping is working as expected.
The resulting row's datetime should line up with the nas configuration setting for nas history, e.g., 30 days.
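Here is a hedged example of that check (SQL Server syntax), comparing the oldest retained row against a 30-day retention setting:

SELECT MIN(time) AS oldest_entry FROM NAS_TRANSACTION_LOG;   -- oldest row still retained
SELECT DATEADD(day, -30, GETDATE()) AS retention_boundary;   -- expected cutoff for 30-day retention
-- If oldest_entry is noticeably older than retention_boundary, housekeeping is falling behind.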
If the local nas database.db contains a large number of alarms, e.g., more than 3000, and the transactionlog.db is also large, e.g., gigabytes, with very high alarm counts for many alarms, one or more of the following efforts should be started as soon as possible to slowly but surely reduce alarm 'bloat' and allow the nas environment to function without further random, intermittent, unexpected results/issues.
Document all alarms with high alarm counts; copy alarms with counts greater than 500 into Excel for later reference during the alarm reduction project (a query to help identify them is sketched below).
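One way to build that list is to rank alarms by how many rows they account for in NAS_TRANSACTION_LOG. The sketch below assumes the alarm identifier column is named nimid; verify the column name against your schema before relying on it:

-- Top 50 alarms by number of transaction-log rows (the nimid column name is an assumption).
SELECT TOP (50) nimid, COUNT(*) AS occurrences
FROM NAS_TRANSACTION_LOG
GROUP BY nimid
ORDER BY occurrences DESC;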
The bottom line regarding extremely high alarm counts is that they usually exist and persist because no one knows how to resolve them, cares enough about the alarm to resolve it, or has the time to address it. That said, there is no use in allowing the alarms to continue to persist and build up, e.g., for weeks or months, unless there is a solid plan in place to manage them.
Review NAS Architecture
The nas architecture should be reviewed, and in all cases where the load on the main nas can be reduced, that should be addressed as per this nas best practices KB article.
High-Level Alarm Policy or Principles
Any given alarm should have either a valid business reason, a technical reason, or both, and if it doesn't, it should be seriously considered for elimination.
Reduce alarm frequency
Reexamine the monitoring interval frequency and increase it if there is not a good business reason to have it set to a low value. For example, CDM iostat monitoring every 1 or 5 minutes, website monitoring, URL monitoring, CPU/memory/disk monitoring, device monitoring, VMware monitoring, and so on.
Administer defunct systems - For robot inactive alarms, implement a plan to effectively decommission any robots that need to be removed from the monitoring environment. Please refer to KB Articles on decommissioning robots.
Disk monitoring
'Autoextend' is common in Oracle databases; if any tablespaces (datafiles) being monitored are set to Autoextend, only set up monitoring for when they are close to reaching their maximum size.
Reduce alarm 'noise'
Eliminate any/all unnecessary alarm thresholds unless they serve some related business purpose or are required for business-critical resource monitoring.
CPU monitoring
Decide upon valid and appropriate threshold values for each OS platform, e.g., 99% for 20 minutes/n samples on Windows versus 99% for 5 minutes on UNIX/Linux.
Physical Memory
Alert when 10% of available physical memory remains, or whatever makes the most sense based on your requirements for each application, OS, server type or device.
Process monitoring - do not monitor a process very frequently unless there is a good reason to do so; the reason may be temporary, e.g., peak periods, a high-revenue season, etc.
Live alarms
Ideally, for a single customer, live alarms should be kept at what a UIM administrator considers a manageable level, e.g., fewer than 2,000; that will depend on the environment and the extent of the monitoring footprint.
To start with, you could delete any alarms older than a week or two based on Origin Time. Then start doing monitoring governance on those alarms that no one is responding to - turn them off (disable the thresholds). The underlying root problem for poor or inconsistent nas performance is most commonly the total number of active alarms and alarms with very high counts, e.g., thousands, tens of thousands or hundreds of thousands.
nas administration and maintenance settings
Where are the alarms stored in UIM?
NAS nisqueue.db grows intermittently and alarms are delayed in OC (broadcom.com)