Optimizing DX UIM Operator Console - Performance troubleshooting guide

Article ID: 127497

Products

  • DX Unified Infrastructure Management (Nimsoft / UIM)
  • CA Unified Infrastructure Management On-Premise (Nimsoft / UIM)
  • CA Unified Infrastructure Management SaaS (Nimsoft / UIM)

Issue/Introduction

  • Users may experience slow performance in the DX UIM Operator Console (OC), particularly long 'Loading...' times and delays when displaying metrics. The slowness may be seen in one or more OC webapps, including PRD reports, List Views, and Metrics Viewer displays.

  • This article provides a concise checklist to help customers and support collect and analyze more complete data, so that the factors affecting OC response times can be identified and addressed.

  • Work closely with your Database Administrator (DBA) on most of the factors listed below, especially those concerning the database server, its maintenance, and administration.

  • The overarching question this article aims to address is: How can we analyze and optimize OC response times?

Environment

  • DX UIM 20.3 or higher
  • Operator Console (OC)

Cause

  • Guidance

Resolution

 


Analyzing Operator Console slowness

  • Try to pinpoint where the Operator Console slowness occurs. For example, is it during the 'Loading...' data phase, or only for specific views, PRD/List reports, or specific probe metrics?

    • How many seconds/minutes is the data presentation delayed?

    • Is the performance issue consistent, intermittent, or 'random'?

    • What are the circumstances under which the issue happens?

    • Can the issue be reproduced?


System Requirements and Configurations

  • Have the DX UIM, Operator Console (OC), and database server sizing and hardware requirements been followed?

  • What type/level of data storage is in use: Tier 0, Tier 1, or Tier 2? (Tier 1 is recommended.)

    • Tier 1 storage is appropriate for mission-critical data. You can use fast drives, all-flash arrays (AFA), and hybrid-flash storage for Tier 1.

    • This tier stores backup for mission-critical and business-critical data. Tier 1 data storage is designed for data that is highly time-sensitive and volatile, and that must be accessed as close to real time as possible.

  • Are SSD drives and RAID 5 or higher in use?


Log and Debugging Information

  • Examine the data_engine log at the time that corresponds to the slow performance.

    • Set loglevel to at least 3. Loglevel 5 provides more detailed debug output and is recommended.

    • Set the data_engine logsize parameter to at least 300000.

    • Have you checked the true memory consumption of wasp via its logs at loglevel 5?

      • It is best not to configure wasp with much more Java heap memory than it actually needs.

      • Check for 'OutOfMemory' exceptions and for how much memory is available after wasp startup. Use that as a guideline for setting the Java min and max memory, leaving at least 1-2 GB of 'breathing room.'


MS SQL Server UIM Databases

  • Make sure the latest cumulative Microsoft SQL Server update has been applied to the database server, e.g., CU22 as of Sept. 2023.

  • Has the database server been adjusted according to documented best practices for MS SQL Server for DX UIM?

  • What SQL Server driver is configured and being used in the data_engine?

  • Has the large tables query been run to confirm how many tables are large and what their row count is?

    Query to determine large tables:

    SELECT
     t.NAME AS TableName,
     s.Name AS SchemaName,
     p.rows AS RowCounts,
     SUM(a.total_pages) * 8 AS TotalSpaceKB,
     SUM(a.used_pages) * 8 AS UsedSpaceKB,
     (SUM(a.total_pages) - SUM(a.used_pages)) * 8 AS UnusedSpaceKB
    FROM
     sys.tables t
    INNER JOIN
     sys.indexes i ON t.OBJECT_ID = i.object_id
    INNER JOIN
     sys.partitions p ON i.object_id = p.OBJECT_ID AND i.index_id = p.index_id
    INNER JOIN
     sys.allocation_units a ON p.partition_id = a.container_id
    LEFT OUTER JOIN
     sys.schemas s ON t.schema_id = s.schema_id
    WHERE
     t.NAME NOT LIKE 'dt%'
     AND t.is_ms_shipped = 0
     AND i.OBJECT_ID > 255
    GROUP BY
     t.Name, s.Name, p.Rows
    ORDER BY
     RowCounts DESC

Recommendations

  • Database Type (Enterprise versus Standard)

    • Enterprise Edition is highly recommended. (The UIM data_engine ONLY supports table partitioning with Enterprise editions.)
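To confirm which edition and cumulative update the database server is actually running, a DBA could run something like the following. This is a minimal sketch that uses only the built-in SERVERPROPERTY function; the column aliases are illustrative, and ProductUpdateLevel may return NULL on builds that do not report it.

```sql
-- Report edition, version, and patch level of the connected SQL Server instance.
SELECT
    SERVERPROPERTY('Edition')            AS Edition,             -- e.g. Enterprise or Standard
    SERVERPROPERTY('ProductVersion')     AS ProductVersion,      -- full build number
    SERVERPROPERTY('ProductLevel')       AS ProductLevel,        -- e.g. RTM or SPn
    SERVERPROPERTY('ProductUpdateLevel') AS ProductUpdateLevel,  -- e.g. CUn; may be NULL
    SERVERPROPERTY('EngineEdition')      AS EngineEdition;       -- 2 = Standard, 3 = Enterprise-class
```

Comparing ProductUpdateLevel with the latest cumulative update published by Microsoft for that version shows whether the server is behind on updates.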


System Updates


Database Maintenance and Checks

  • Have you/your DBA checked the size of the Database Transaction Log file? Is it being managed?

    • A DBA should be maintaining the database if 'Full Recovery' is configured

  • Is there any blocking or locking on the database server at the time of the issue?

    • For MS SQL Server, please consult with your DBA on the use of the following suggestions:

  • For MS SQL Server, to list the transaction log space usage statistics, run the command DBCC SQLPERF(logspace) in SQL Server Management Studio (SSMS).

    • This lists transaction log space usage statistics for all databases. See the 'Log Space Used (%)' column in the results.

   DBCC SQLPERF(logspace)
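If the log space used is high, it may help to check which recovery model each database uses and why the log cannot be truncated. The sketch below relies on the standard sys.databases catalog view and the msdb backup history. It assumes the login can read msdb, and the column aliases are illustrative.

```sql
-- Recovery model and the current reason (if any) that log truncation is delayed.
SELECT
    name                AS DatabaseName,
    recovery_model_desc AS RecoveryModel,   -- FULL, BULK_LOGGED, or SIMPLE
    log_reuse_wait_desc AS LogReuseWait     -- e.g. LOG_BACKUP when a log backup is pending
FROM sys.databases
ORDER BY name;

-- Most recent transaction log backup for each database in Full Recovery.
SELECT
    d.name                     AS DatabaseName,
    MAX(b.backup_finish_date)  AS LastLogBackup   -- NULL means no log backup is recorded
FROM sys.databases d
LEFT JOIN msdb.dbo.backupset b
    ON b.database_name = d.name
   AND b.type = 'L'                               -- 'L' = transaction log backup
WHERE d.recovery_model_desc = 'FULL'
GROUP BY d.name
ORDER BY d.name;
```

A database in Full Recovery with no recent log backup is a common reason for a transaction log file that keeps growing.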

-- Find locked tables and queries causing any locking issues.
SELECT  L.request_session_id AS SPID,
        DB_NAME(L.resource_database_id) AS DatabaseName,
        O.Name AS LockedObjectName,
        P.object_id AS LockedObjectId,
        L.resource_type AS LockedResource,
        L.request_mode AS LockType,
        ST.text AS SqlStatementText,
        ES.login_name AS LoginName,
        ES.host_name AS HostName,
        TST.is_user_transaction AS IsUserTransaction,
        AT.name AS TransactionName,
        CN.auth_scheme AS AuthenticationMethod
FROM    sys.dm_tran_locks L
        JOIN sys.partitions P ON P.hobt_id = L.resource_associated_entity_id
        JOIN sys.objects O ON O.object_id = P.object_id
        JOIN sys.dm_exec_sessions ES ON ES.session_id = L.request_session_id
        JOIN sys.dm_tran_session_transactions TST ON ES.session_id = TST.session_id
        JOIN sys.dm_tran_active_transactions AT ON TST.transaction_id = AT.transaction_id
        JOIN sys.dm_exec_connections CN ON CN.session_id = ES.session_id
        CROSS APPLY sys.dm_exec_sql_text(CN.most_recent_sql_handle) AS ST
WHERE   L.resource_database_id = DB_ID()
ORDER BY L.request_session_id
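The query above lists the locks that are held. To see which sessions are actually waiting on another session at that moment, a complementary sketch using the sys.dm_exec_requests DMV might look like this. The column aliases are illustrative, and the result is empty when nothing is blocked.

```sql
-- List requests that are currently blocked, with the session blocking each one.
SELECT
    r.session_id            AS BlockedSpid,
    r.blocking_session_id   AS BlockingSpid,
    r.wait_type             AS WaitType,
    r.wait_time             AS WaitTimeMs,      -- time spent waiting, in milliseconds
    r.wait_resource         AS WaitResource,
    DB_NAME(r.database_id)  AS DatabaseName,
    t.text                  AS BlockedSql
FROM sys.dm_exec_requests r
CROSS APPLY sys.dm_exec_sql_text(r.sql_handle) AS t
WHERE r.blocking_session_id <> 0
ORDER BY r.wait_time DESC;
```

Running it while the OC slowness is being reproduced gives the most useful picture, since blocking is often short-lived.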

  • Use the script: sp_blocker_pss08 or SQL Trace/Profiler and the Blocked Process Report event class.

  • Check for fragmentation of key tables (CM_*, S_QOS_*, NAS_*)

    • A DAILY job should be run to defrag specific tables. For more information please refer to:

      DX UIM - data_engine indexing option - Index Maintenance

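      Specifically, the following statements may help defragment the indexes on tables commonly used by the Operator Console. Rebuilding these indexes often helps, but an offline rebuild can block access to the table while it runs, so consult your DBA on when to schedule them: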

      ALTER INDEX ALL ON CM_COMPUTER_SYSTEM REBUILD;
      ALTER INDEX ALL ON CM_COMPUTER_SYSTEM_ATTR REBUILD;
      ALTER INDEX ALL ON CM_DEVICE REBUILD;
      ALTER INDEX ALL ON CM_DEVICE_ATTRIBUTE REBUILD;
      ALTER INDEX ALL ON CM_CONFIGURATION_ITEM REBUILD;
      ALTER INDEX ALL ON CM_CONFIGURATION_ITEM_METRIC REBUILD;
      ALTER INDEX ALL ON CM_CONFIGURATION_ITEM_DEFINITION REBUILD;
      ALTER INDEX ALL ON CM_CONFIGURATION_ITEM_METRIC_DEFINITION REBUILD;
      ALTER INDEX ALL ON CM_NIMBUS_ROBOT REBUILD;
      ALTER INDEX ALL ON CM_COMPUTER_SYSTEM_ORIGIN REBUILD;
      ALTER INDEX ALL ON CM_CONFIGURATION_ITEM_ATTRIBUTE REBUILD;
      ALTER INDEX ALL ON CM_RELATIONSHIP_CI_CI REBUILD;
      ALTER INDEX ALL ON CM_RELATIONSHIP_CI_CS REBUILD;
      ALTER INDEX ALL ON CM_RELATIONSHIP_CS_CI REBUILD;
      ALTER INDEX ALL ON CM_DISCOVERY_NETWORK REBUILD;
      ALTER INDEX ALL ON S_QOS_DATA REBUILD;
      ALTER INDEX ALL ON NAS_TRANSACTION_SUMMARY REBUILD;
      ALTER INDEX ALL ON NAS_TRANSACTION_LOG REBUILD;
      ALTER INDEX ALL ON NAS_ALARMS REBUILD;
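Before rebuilding, it can be useful to measure how fragmented these indexes actually are. The following is a minimal sketch using the sys.dm_db_index_physical_stats function, run in the UIM database. The 30 percent and 1000-page cutoffs are assumptions chosen for illustration, not values from this article, and on a large database the scan itself may take some time.

```sql
-- Show fragmented indexes on the CM_*, S_QOS_* and NAS_* tables in the current database.
SELECT
    OBJECT_NAME(ips.object_id)        AS TableName,
    i.name                            AS IndexName,
    ips.index_type_desc               AS IndexType,
    ips.avg_fragmentation_in_percent  AS FragmentationPct,
    ips.page_count                    AS PageCount
FROM sys.dm_db_index_physical_stats(DB_ID(), NULL, NULL, NULL, 'LIMITED') AS ips
JOIN sys.indexes i
    ON i.object_id = ips.object_id
   AND i.index_id  = ips.index_id
WHERE (   OBJECT_NAME(ips.object_id) LIKE 'CM[_]%'        -- [_] matches a literal underscore
       OR OBJECT_NAME(ips.object_id) LIKE 'S[_]QOS[_]%'
       OR OBJECT_NAME(ips.object_id) LIKE 'NAS[_]%')
  AND ips.page_count > 1000                               -- assumed cutoff: skip very small indexes
  AND ips.avg_fragmentation_in_percent > 30               -- assumed threshold for a rebuild
ORDER BY ips.avg_fragmentation_in_percent DESC;
```

Indexes that appear in the result are candidates for the rebuild statements above.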

  • Has the DBA/VMware admin checked for SQL Server 'memory pressure' on the database server over time?

  • How many large tables (more than 50 million rows) exist?

  • Has the partitioning query been run to confirm that partitioning is working?
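As a rough cross-check of the last two items, the sketch below counts partitions and rows for each large table using the standard sys.tables and sys.partitions catalog views. It is not the partitioning query this article refers to, only an illustrative alternative, and the 50 million row threshold is taken from the checklist above.

```sql
-- For each table above the large-table threshold, show its row count and partition count.
SELECT
    SCHEMA_NAME(t.schema_id)  AS SchemaName,
    t.name                    AS TableName,
    COUNT(*)                  AS PartitionCount,   -- 1 means the table is not partitioned
    SUM(p.rows)               AS TotalRows
FROM sys.tables t
JOIN sys.partitions p
    ON p.object_id = t.object_id
WHERE p.index_id IN (0, 1)                         -- heap or clustered index only, so rows count once
  AND t.is_ms_shipped = 0
GROUP BY SCHEMA_NAME(t.schema_id), t.name
HAVING SUM(p.rows) > 50000000
ORDER BY TotalRows DESC;
```

A large table that reports a PartitionCount of 1 is not partitioned, which is worth raising with the DBA if partitioning is expected to be enabled.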


Resource and Hardware Monitoring

  • Check resource utilization trends on the database server and the OC machine: CPU, memory, disk I/O, and network speed/available bandwidth. Is a proxy involved?

  • Is enough memory dedicated to SQL Server, e.g., 64 GB RAM or more?
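To see how much memory SQL Server has been given and whether it appears to be under memory pressure, a DBA could start from standard DMVs such as the ones below. This is a sketch with illustrative column aliases. It does not define what counts as a healthy value, since that depends on the workload and should be judged as a trend over time.

```sql
-- Physical memory on the host as seen by SQL Server.
SELECT
    total_physical_memory_kb / 1024      AS TotalPhysicalMB,
    available_physical_memory_kb / 1024  AS AvailablePhysicalMB,
    system_memory_state_desc             AS MemoryState
FROM sys.dm_os_sys_memory;

-- Memory limits configured for the instance.
SELECT name, value_in_use
FROM sys.configurations
WHERE name IN ('min server memory (MB)', 'max server memory (MB)');

-- Page life expectancy: how long, in seconds, a page is expected to stay in the buffer pool.
SELECT object_name, counter_name, cntr_value AS Seconds
FROM sys.dm_os_performance_counters
WHERE counter_name = 'Page life expectancy'
  AND object_name LIKE '%Buffer Manager%';
```

A page life expectancy that stays low or drops sharply while OC is slow can point to memory pressure.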


Connectivity Checks

  • Is the database server on the same subnet as the Primary hub? This is essential for Primary hub <-> database server connectivity and performance.

  • Placing the database server on a different subnet than the Primary hub and OC is not recommended.

  • When you run a traceroute/tracert command from the Primary hub to the database server, is it only 1-2 hops away?


Operator Console (OC) Robot Checks

  • What is the OC Robot processor speed? Ideally, it should be 3 GHz or higher.

  • What is the number of virtual processors configured on the OC Robot VM?

    • Please ensure it is 4 virtual processors or higher.


Monitoring Practices

  • Have you confirmed that Nimsoft/UIM is completely excluded from antivirus (AV) blocking/scanning?

    • On Windows, check Application or System log events to make sure there is nothing being blocked.

    • Note that even Informational level log events may reveal a block, interference, lock or scan.

  • Have you checked the OC Robot wasp Java heap memory utilization in wasp.log against what is configured in wasp.cfg?

    • Have you compared the configured wasp memory with what is actually being used over time?

      • For example, using the processes probe, configure a profile for the wasp java process and chart its memory utilization in PRD.

  • Monitoring Governance considerations:

    • Is too much non-critical/unnecessary QOS data being collected, or are too many such alarms being generated?

    • Are monitoring/polling intervals too short (too frequent) for some QOS?

  • Have you tested the slowness using different browsers, after deleting the browser cache, closing all browsers, and starting a fresh browser session?

Additional Information