nas Best Practices, Tips and Techniques (for Large Environments)

Article ID: 192089

Products

  • DX Unified Infrastructure Management (Nimsoft / UIM)
  • Unified Infrastructure Management for Mainframe
  • CA Unified Infrastructure Management SaaS (Nimsoft / UIM)

Issue/Introduction

  • How can we optimize nas probe performance and scalability to ensure stability and consistency?

  • How do we automate and ensure the success of nas maintenance and administration?

  • Unexpected alarm issues/anomalies can be caused by nas administration/maintenance failures due to the size of the local nas SQLite .db files (transactionlog.db > 300-500 MB), or backend nas tables growing too large (greater than 1 million rows).

  • Why is the nas taking a long time to synchronize?

Environment

  • Release: 8.51 or higher
  • Component: UIM - NAS

Cause

  • growth of monitored environment
  • nas housekeeping failures
  • nas configuration
  • too many alarms
  • alarm growth
  • age of alarms dating back months or years
  • nas table sizes/growth
  • nas performance/stability

Resolution


NAS Architecture

Offload nas AO profiles/preprocessing rules from the Primary whenever possible. The inventory of alarms being sent to any single instance of the nas probe can be significantly reduced through the use of secondary nas'es - as you'll have each secondary nas probe managing a 'subset' of the overall alarm inventory and any associated rules and/or scripts.

If you choose to deploy remote/secondary nas'es to reduce the primary nas overhead, then once the secondary nas probes have been distributed and the Auto Operator logic has been migrated to preprocessor logic (as needed), you'll need to set up forwarding/replication rules on each nas probe:

  1. On each secondary nas probe, under Setup -> Forwarding / Replication, create a new rule to forward 'All events to destination' (one direction), with the destination alarm server being the Primary hub's nas probe. If the nas and alarm_enrichment are needed on a secondary hub, e.g., for Auto Operator rules and preprocessing, then use nas forwarding/replication to move the alarms up to the Primary rather than ATTACH and GET queues - otherwise you may see unexpected alarm results, such as duplicate alarms or alarms not clearing.

  2. On the Primary nas probe, create a similar replication rule to forward 'As event responder' back to each secondary nas probe. (In reality, these replication queues should be built automatically from the first step, but double-check that they're constructed correctly).

You can deploy a nas on each of the busiest hubs/tunnel concentrators. This nas would only run the AO profiles and preprocessing scripts that prevent alarms from 'propagating.' For example, if it makes sense, you can close alarms locally when they are generated by robots attached to that hub. Otherwise, you can use nas replication to move any open alarms to the Primary hub where the primary nas resides. The primary nas is then responsible for running all remaining nas AO profiles/preprocessing scripts.

Do not enable the NiS Bridge on any secondary hub's nas (including the HA backup hub's). By default, it is not even present if you simply install the nas probe, but if you start by copying the nas.cfg from the primary hub's nas, it will be there and it will be enabled.

Only a single nas can write to the backend nas table at one time. If a second nas NiS Bridge is enabled at the same time as the Primary nas is running, it will cause duplicate key errors in the nas log. For example:

nas: COM Error [0x80040e2f] IDispatch error #3119 - [Microsoft OLE DB Provider for SQL Server] Violation of UNIQUE KEY constraint 'UQ__NAS_TRAN__475AXXXXXXXF79D5'. Cannot insert duplicate key in object 'dbo.NAS_TRANSACTION_SUMMARY'. The duplicate key value is (XLXXXX1264-3XX86)

Monitoring Governance/Alarm Reduction

It is tempting to enable/'turn on' a lot of monitoring when you first deploy UIM, but over time this can cause havoc. Best practice is to enable alarm thresholds only for Key Performance Indicators (KPIs) that have an upstream effect on the business in some way. Try starting with a maximum of 5 KPIs per application/technology. Ask Support whether there are suggested KPIs for VMware, Citrix, NetApp, Nutanix, Exchange, etc. That is the base starting point - a small number of KPIs (key metrics). Always ask why it's important to collect the data (QOS or alarms), how often it needs to be collected and why, and how long it should be stored. Keep all of these monitoring aspects to a minimum. Note that some probes have monitoring enabled 'right out of the box' for many metrics - customers must always decide which ones MUST stay enabled versus the nice-to-haves.

UIM suppresses 'like' alarms and updates the alarm count, but should we keep generating alarms in the hundreds or thousands and let the counts grow unchecked? This is not good practice: it adversely affects nas performance, scalability and housekeeping (maintenance), as well as alarm display reliability, and it consumes system resources. Any alarm suppression count greater than 100 raises the question of whether the nas can keep handling more and more alarms with higher and higher counts - it cannot do so indefinitely without running into performance issues, nas sync issues, delays in alarm processing, or display problems in the alarm view. Take the necessary time to review and adjust your monitoring policy and your process for alarm and ticket handling.

Why allow this to happen if no action will be taken to alleviate the issue or resolve the problem? For example, of what use is it to have 8000+ Robot Inactive alarms continuously incrementing when nothing is being done about them? It is just 'noise', and it places an unnecessary load on the environment that usually worsens over time.

Local nas database files (transactionlog.db, database.db)

Note that in the nas GUI there are two tabs, "Transaction Log" and "NiS Bridge", which independently set the alarm data retention for the corresponding stores. The "Transaction Log" tab sets the retention for the SQLite database. If this is set lower than the retention in "NiS Bridge", you will see fewer alarms. For example, if the transaction log is set to 30 days but NiS Bridge is set to 90 days, then when you drop the nas tables and re-sync, only 30 days of alarms can be synced back, because that is all the SQLite database retains.

In general, it is best to keep the local nas database files small: transactionlog.db less than 300 MB, and database.db (live alarms) less than 100 MB. Alarms live in two different places: in the nas (the alarm subconsole), active alarms are in the database.db file and historical data (cleared alerts) is in the transactionlog.db file. These are SQLite databases, and a free SQLite browser can be used to query and extract the data if necessary.
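For example, if you copy transactionlog.db to a workstation (work on a copy, not the live file), you can list its tables with a standard SQLite query before digging further. This is only a hedged sketch; the exact table names inside the .db files vary by nas version, so confirm them first:

   -- List the tables inside the copied SQLite .db file, then query whichever
   -- table holds the data you are after
   select name from sqlite_master where type = 'table';
   -- select count(*) from <table_name>;   -- substitute a name returned above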

In very large UIM environments where there are frequent alarms and high counts, it is best to keep the transaction log settings as low as possible but not so low that it causes you to miss messages/activity that you need to troubleshoot.


Backend nas database tables (NiS Bridge)

In the Alarm Console in UMP/OC, the data is pulled from the NAS_ALARMS table (active alarms), the NAS_TRANSACTION_LOG table (transaction history) and the NAS_TRANSACTION_SUMMARY table (transaction summary). You can query both of the 'transaction' tables directly, but be aware that the data is only retained for as long as the 'Transaction Log Management' time periods specified in the nas.

If any of the nas DB tables listed below grow too large (larger than approximately 1M rows), you may start to see unexpected results, e.g., alarm delays, missed alarms, nas 'sync' issues, scripts not firing or not firing on time, alarms not being generated because rules do not fire, alarms intermittently not clearing, etc.

   - NAS_ALARMS
   - NAS_TRANSACTION_SUMMARY
   - NAS_TRANSACTION_LOG

If queries that involve the backend NAS tables are being run, e.g., searching alarm history in UMP/USM/OC, the underlying query/script may respond slowly or hang and never finish, e.g., if the NAS_TRANSACTION_SUMMARY table is larger than 100M rows.
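On Microsoft SQL Server (the dialect shown elsewhere in this article), a low-impact way to check how large these tables have grown is sp_spaceused, which reports row counts from metadata rather than scanning the table. Adjust the schema/table names to match your database:

   exec sp_spaceused 'dbo.NAS_ALARMS';
   exec sp_spaceused 'dbo.NAS_TRANSACTION_SUMMARY';
   exec sp_spaceused 'dbo.NAS_TRANSACTION_LOG';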

Note that millions of rows in the nas_transaction_log will cause performance issues with the nas probe, and the NiS sync process will take a long time. This also adversely affects making configuration changes.

You can either manually run a delete statement to remove rows from that table based on a given date, or TRUNCATE the table, but see the KB article linked below for a more durable and potentially permanent approach that enables nas housekeeping to complete its run and, in turn, keeps the size of that table down over time.

NAS_TRANSACTION_LOG table administration (housekeeping) unsuccessful and not being maintained
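If you do take the manual route in the meantime, date-based DELETE examples are shown later in this article (under 'Additional comments'). For the TRUNCATE alternative, a minimal hedged sketch follows; TRUNCATE removes ALL rows, so use it only if none of the transaction history needs to be retained:

   -- Removes ALL rows from the table and cannot be filtered by date.
   -- Consider backing up the table first if there is any doubt.
   truncate table NAS_TRANSACTION_LOG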


Nimsoft Alarm Server (NAS) Tips and Techniques

  • Try to avoid having all of your nas AO Profiles, preprocessing rules and scripts running on the Primary hub because eventually, you may have unexpected results due to the load.

  • Try to avoid using very frequent intervals, such as a large number of nas AO rules running every minute.

  • Make sure that no 'custom' probes are generating unnecessary noise/traffic to the nas, increasing its overhead.
    To analyze any custom probe/script, take a close look at what it is doing in terms of alarm processing and decide whether it is optimal. In one case, we found a significant number of clear messages being sent unnecessarily, which added to the nas overhead; the nas AO was auto-acknowledging them and adding unneeded events to the transactionlog.db.

  • Be aware that the nas option "On arrival" applies to the arrival of the message that triggered the AO. If your script modifies the alarm, that alarm is put back on the message BUS and "arrives" again - essentially creating a loop where every update to the message triggers another update.

  • The "On overdue age" option is vaguely named but translates to "Run once the alert is older than x." It fires only once because the alert is only ever older than the specified value once.

  • Small values of overdue age, e.g., 1-3 seconds, may yield unexpected results. Internally, the nas takes a timestamp, writes it to the local database, and then looks up everything that should run; if the commit takes longer than that interval, the run can be missed. In the scheme of things, 5 or more seconds is a safer choice than 1-4 seconds.

  • Also, note that the nas only runs 1 profile at a time so if you have a bunch of profiles firing, you could see some unexpected results as some get delayed because of the others that are running.

  • Do not use, or only use sparingly, the "On Interval" option.

  • Ensure that all of your rules are 100% unique so that you don’t have any redundant/competing rules or rules that are in conflict with one another.


 

Additional Information


Stabilizing the Messaging Environment

In some severe cases, where the number of open alarms in UIM has reached a very high number, e.g., 5000+ or much more, one or more of the following steps may be required to make the messaging environment more stable:

  1. Reduce alarms

    Adjust robot settings to reduce robot inactive alarms, and disable alarms for metrics that are not true KPIs and have no business reason/impact associated with them.
    Any/all alarms with high counts need to be closely examined. This is an ongoing effort until unnecessary alarms/alarm counts are at a reasonable (minimized) level.

  2. Adjust the Transaction log retention to very low values.

    This keeps the transactionlog.db and database.db small. If the transactionlog.db grows beyond roughly 300 MB, it can cause unexpected results with the nas.
    In one large global environment, we set the retention values to 1, 2, and 3 with a 1-hour administration interval, but 1, 5, and 15 may be enough. This depends on the size/scale of your environment.



    The NiS Bridge settings can be set lower than their defaults of 7, 30 and 90, to 7, 15 and 30. Again, keep the 1-hour Administration interval if possible.



  3. Offload nas workload (for example, nas AO profiles or preprocessing rules) as soon as possible to one or more nas'es on secondary hubs.

  4. Nas housekeeping

    Check whether nas housekeeping finishes; evidence of success or failure appears in the nas.log at loglevel 5 (use a logsize of 200000).

  5. Manually reduce the size of the backend nas tables.

    This can be done by deleting rows that are older than a given date.
    Once the tables are a reasonable size, e.g., <1M rows, housekeeping will normally complete, and administration should continue without issue.



  6. Increase monitoring intervals wherever possible so they are not creating frequent 'unserviceable' or 'redundant' alarms, or increasing alarm counts due to frequency.

    For example, probes like snmpcollector that are monitoring numerous devices every 5 minutes, cdm iostat disk monitoring, etc. will contribute to this problem.
    Note that for transaction log retention settings, the 'retention countdown' doesn’t start until the active alarm is cleared/acknowledged. Often alarms with extreme suppression counts have a time origin and a life that is well beyond this setting. In other words, for any given alarm, if the retention setting is 7 days, that 7 days doesn’t start until the active alarm is cleared.

NAS NIS Bridge Administration (old data housekeeping) fails

Additional comments:
Depending on the size of your environment, if you generate a lot of alarms over time, lower nas transaction log settings are best. Out of the box they are set too high for a larger environment.
 
Lower settings keep the database.db and transactionlog.db smaller so the nas can work more efficiently, as SQLite doesn't scale well when the local nas files become very large.
Also, with the default NiS Bridge settings of 7, 30, and 90, the backend nas tables can grow large enough to adversely affect UMP/USM performance as well.
 
For the local files/transaction log settings, it may help to set the retention to 3, 7, and 14 days, or 1, 5 and 15 as mentioned previously. This will keep the database.db and transactionlog.db smaller.
 
Make sure nas 'housekeeping' runs successfully; if it fails, alarm data keeps accumulating because the housekeeping process never finishes. Once you lower the settings and manually delete rows based on date, nas housekeeping can run without issues. Check the table row counts:
 
select count(*) from nas_transaction_summary
select count(*) from nas_alarms
select count(*) from nas_transaction_log
 
Normally, to work around the issue of housekeeping not completing, run a delete statement against the nas_alarms and nas_transaction_summary tables to cut down the rows based on date.
 
-- First, use a SELECT statement to check which rows would be removed based on date, rather than deleting all of the entries. Change the date below to suit your needs.

select * from nas_transaction_summary where created < '2023-05-01 00:00:00.000'

-- Then delete the entries you don't need that are older than that date. For example:

delete from nas_transaction_summary where created < '2023-05-01 00:00:00.000'

-- Verify the results:
select * from nas_transaction_summary

Example for nas transaction log:
 
         select * from nas_transaction_log where time < '2023-05-01 00:00:00.000' 
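Assuming the SELECT above returns only rows you no longer need, the corresponding DELETE follows the same pattern as the summary-table example, with 'time' as the date column:

         delete from nas_transaction_log where time < '2023-05-01 00:00:00.000'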
 
If you need to save a lot more alarms due to regulations or some other business or auditing purposes, e.g., up to 365 days worth, then you might consider exporting them to an external database but that is uncommon unless industry regulations/compliance requires it.
 

Why do high alarm counts adversely affect the nas probe functionality, scalability and/or performance?

For every active alarm there is one entry in each of the NAS_ALARMS and NAS_TRANSACTION_SUMMARY tables, but the NAS_TRANSACTION_LOG holds, at a minimum, one row for every recurrence of an active alarm that is suppressed. For example, if there is one active alarm with a suppression count of 1000, there are at least 1000 rows in the NAS_TRANSACTION_LOG table.
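As a rough way to see which alarms contribute the most rows, you can group the transaction log by alarm ID. This is a hedged sketch: it assumes NAS_TRANSACTION_LOG carries a nimid column identifying the originating alarm, so confirm the column name against your schema before running it:

   -- Top 20 alarm IDs by number of transaction-log rows (assumes a nimid column)
   select top (20) nimid, count(*) as row_count
   from NAS_TRANSACTION_LOG
   group by nimid
   order by row_count desc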

As a result, the local transactionlog.db also builds up, and nas performance degrades once that file becomes too large, e.g., greater than 300 MB. This also comes into play during nas housekeeping/cleanup, or with any query against the local transactionlog.db file; in other words, nas housekeeping may not complete.

Also, every time you open or close the nas GUI, the nas 'sync' occurs. A very large transaction log table affects/delays this sync action and may also fill up disk space.


Managing the nas Transaction Log table

nas Transaction log housekeeping entries look like this in the nas.log when the loglevel is set to 5:

   nas: Transaction-log database housekeeping used 27ms.
   nas: Transaction-log database housekeeping scheduled to Tue Jun 13 00:30, 2023

In the nas GUI 'Status' tab window, right-click in the empty space, then choose 'Advanced' and select 'Reorganize database'.

At the same time, have the nas.log open in IM and you should see entries like the following:

   nas: Nis-Bridge: Transaction-log administration succeeded deleting 5000 transaction entries older than 30 days. 
   nas: nisRun: finishing batch before NTL compression operations 
   nas: Nis-Bridge: Transaction-log administration succeeded compressing 5000 transaction entries older than 7 days. 
   nas: nisRun: finishing batch before NTS cleanup operations 
   nas: NiS-Bridge: Transaction-log administration used 19ms 

For added performance, you can try increasing the nis_trans_delete_size setting from 5000 to 10000, so that every 5 minutes up to 10,000 records are deleted from the nas Transaction Log (if they exist).

 

 

Then check the nas.log to see how many ms it used to complete the delete, e.g.,

   nas: NiS-Bridge: Transaction-log administration used 19ms

If the oldest 'time' in the nas_transaction_log table lines up with the nas transaction log retention setting, then the housekeeping is working as expected. You can check the oldest entry with a query such as:

   select top(1) time from nas_transaction_log order by time asc

 

The resultant row datetime should line up with the nas configuration setting for nas history, e.g., 30 days.

 



Monitoring Governance - Recommendations

  • Reduce alarms/alarm counts

If the local nas database.db contains a large number of alarms, e.g., > 3000, and the transactionlog.db is also large, e.g., GBs, with very high alarm counts for many alarms, one or more of the following efforts should be started as soon as possible to slowly but surely reduce alarm 'bloat' and allow the nas environment to function without any further random, intermittent, unexpected results/issues.

Document all alarms with high alarm counts; copy alarms with counts > 500 into Excel for later reference during the alarm reduction project.

The bottom line regarding extremely high alarm counts is that they usually exist and persist because no one knows how to resolve them, cares enough about the alarm to resolve it, or has the time to address it. Either way, there is no point in letting alarms persist and build up for weeks or months unless there is a solid plan in place to manage them.

  • Review NAS Architecture

    The nas architecture should be reviewed and in all cases where the load on the main nas can be reduced, that should be addressed as per this nas best practices KB Article.

  • High-Level Alarm Policy or Principles

    Any given alarm should have either a valid business reason, a technical reason, or both, and if it doesn't, it should be seriously considered for elimination.

  • Reassess and adjust thresholds

    Best practice is to enable baselining for a small number of KPIs (3-5) to identify the normal/expected behavior of each business/technical metric; after the baseline is developed, adjust (fine-tune) the threshold values to significantly reduce alarm frequency, especially where a threshold is set too low and causes ongoing breaches.
  • Reduce alarm frequency

    Reexamine the monitoring interval frequency and increase it if there isn't a good business reason to have it set to a low value. For example, CDM iostat monitoring every 1 or 5 minutes, website monitoring, url monitoring, CPU/Memory/Disk, device monitoring, vmware monitoring, and so on.

  • Administer defunct systems - For robot inactive alarms, implement a plan to effectively decommission any robots that need to be removed from the monitoring environment. Please refer to KB Articles on decommissioning robots.

  • Disk monitoring

    If any tablespaces (datafiles) being monitored are set to Autoextend, only set up monitoring for when they are close to reaching their maximum size as 'Autoextend' is common in Oracle DBs.

  • Reduce alarm 'noise'

    Eliminate any/all unnecessary alarm thresholds unless they serve some related business purpose or are required for business-critical resource monitoring.

  • CPU monitoring

    Decide upon valid and appropriate threshold values for each OS platform, e.g., 99% for 20 minutes/n samples on Windows versus 99% for 5 minutes on UNIX/Linux.

  • Physical Memory

    Alarm when 10% of available physical memory remains, or whatever makes the most sense based on your requirements for each application/OS/server type or device.

  • Process monitoring - don't monitor a process very frequently unless there is a good reason to do so. Perhaps the reason may be temporary, peak periods, high-revenue season, etc.

  • Live alarms

    Ideally, for a single customer, live alarms should be kept at what a UIM Administrator would consider a manageable level, e.g., fewer than 2,000. That will depend on the environment and the extensiveness of the monitoring footprint.

  • You can undertake efforts to reduce the overall load on the nas, but if no monitoring governance is undertaken and alarms are not resolved - and never will be - the alarm thresholds should be disabled; otherwise the alarms will simply continue to build up day after day. Generally, alarms are just 'noise' unless a plan of action is put in place to address the reason they occurred in the first place.
  • To start with, you could delete any alarms older than a week or two based on Origin Time. Then start doing monitoring governance on those alarms that no one is responding to - turn them off (disable the thresholds). The underlying root problem for poor or inconsistent nas performance is most commonly the total number of active alarms and alarms with very high counts, e.g., thousands, tens of thousands or hundreds of thousands.

nas administration and maintenance settings

Where are the alarms stored in UIM?

NAS nisqueue.db grows intermittently and alarms are delayed in OC (broadcom.com)