Troubleshooting Missing Data / Metrics in DX UIM


Article ID: 45476


Products

DX Unified Infrastructure Management (Nimsoft / UIM)
Unified Infrastructure Management for Mainframe

Issue/Introduction

Some of your widgets inside a custom dashboard, Metric Views, or other reports available in the Operator Console are missing data and metrics that you would expect to see.

This article provides steps for troubleshooting missing data in the DX UIM Operator Console (OC).

Environment

DX UIM 20.4.* / 23.4

Cause

Guidance

Resolution

1) Database

■ The first thing to do is to query the database directly to see whether the data exists. This can be accomplished via the SLM portlet or the SQL interface on your database server.
There are two tables that should be checked: S_QOS_DATA and the corresponding RN table(s). The first query checks the S_QOS_DATA table to ensure that we are receiving and processing QOS_DEFINITION messages.

SELECT * FROM S_QOS_DATA WHERE probe = '<probe name>';
If no data was returned, jump to TOPICS 2 and 3 below for troubleshooting hub queues and data_engine.


NOTE: Use probe = 'pollagent' if you are checking QOS data produced by the snmpcollector probe.

You will want to note the table_id and r_table fields from the above query for the second query:
SELECT * FROM <r_table> WHERE table_id = <table_id> ORDER BY sampletime DESC;
If there is no data from the last 24 hours, jump to TOPICS 2 and 3 below.
If there is data within the last 24 hours, go to TOPIC 4. NOTE: If there is no data for the last 24 hours, OC will not show metrics in the device views.
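For example, if the first query returned r_table = RN_QOS_DATA_0021 and table_id = 1234 (both values are hypothetical; substitute the ones returned in your environment), the second query would look like this:

SELECT TOP 10 * FROM RN_QOS_DATA_0021 WHERE table_id = 1234 ORDER BY sampletime DESC;

(TOP 10 is SQL Server syntax; on an Oracle backend, limit the rows with FETCH FIRST or ROWNUM instead.)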
 
 
■ The query below will also help identify whether data is in the DB. It returns all current QOS metrics being collected by/for a given machine.
 

SELECT
    def.source,
    def.probe,
    cb.met_description,
    def.target,
    def.qos,
    snap.samplevalue,
    snap.sampletime,
    def.r_table,
    def.table_id
FROM S_QOS_DATA AS def
JOIN cm_configuration_item_metric cm ON cm.ci_metric_id = def.ci_metric_id
JOIN cm_configuration_item_metric_definition cb ON cm.ci_metric_type = cb.met_type
JOIN S_QOS_SNAPSHOT AS snap ON snap.table_id = def.table_id
WHERE snap.sampletime > dateadd(hour, -1, getdate())
--AND def.qos LIKE '%QOS_CPU_USAGE%'
AND def.source LIKE '%servername%'
ORDER BY def.qos ASC

 
 
Query options:
 
-- Replace '%servername%' with the source name of the device you are investigating.
-- The time specification in the query can be modified to adjust the period to investigate. If the query returns a result, it will always be the latest sample gathered in that time frame.
    Examples: "dateadd(hour, -1, getdate())" will get the last hour
              "dateadd(day, -1, getdate())" will get the last day
              "dateadd(week, -2, getdate())" will return the last 2 weeks
-- Uncomment the line "--AND def.qos LIKE '%QOS_CPU_USAGE%'" to narrow the result down to a specific QOS metric, and replace QOS_CPU_USAGE with your QOS metric name.
   (To uncomment the line, remove the leading "--". While the "--" is present, the line is ignored; once removed, the condition is applied.)
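For example, to narrow the query above to QOS_CPU_USAGE for a device whose source is server01 (a hypothetical name) over the last day, the modified query would be:

SELECT def.source, def.qos, snap.samplevalue, snap.sampletime
FROM S_QOS_DATA AS def
JOIN cm_configuration_item_metric cm ON cm.ci_metric_id = def.ci_metric_id
JOIN cm_configuration_item_metric_definition cb ON cm.ci_metric_type = cb.met_type
JOIN S_QOS_SNAPSHOT AS snap ON snap.table_id = def.table_id
WHERE snap.sampletime > dateadd(day, -1, getdate())
AND def.qos LIKE '%QOS_CPU_USAGE%'
AND def.source LIKE '%server01%'
ORDER BY def.qos ASC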
   
 
 
 

2) Hub queues

The parent hub of the robot needs to have an “ATTACH” queue that, at minimum, listens for QOS_MESSAGE and QOS_DEFINITION messages.
The hub that retrieves data from that hub also needs a queue as stated above, unless that hub is the primary hub, in which case the data_engine probe creates its own listening queue.
If the hubs are not configured with these queues, then the queues need to be created along with the corresponding “GET” queues, and the probe needs to be restarted.
Here is the documentation on hub queues:
https://techdocs.broadcom.com/us/en/ca-enterprise-software/it-operations-management/ca-unified-infrastructure-management-probes/GA/monitoring/infrastructure-core-components/hub.html
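As a rough illustration only (queues are normally created through the hub GUI, and the queue names used here are examples), an ATTACH queue on the robot's parent hub and the matching GET queue on the upstream hub appear in hub.cfg along these lines:

   <queues>
      <qos_attach>
         active = yes
         type = attach
         subject = QOS_MESSAGE,QOS_DEFINITION
      </qos_attach>
   </queues>

and, on the hub that pulls from it:

   <queues>
      <qos_get>
         active = yes
         type = get
         remote_queue_name = qos_attach
         address = /<domain>/<source_hub>/<robot>/hub
      </qos_get>
   </queues>

Treat this as a sketch and verify the exact settings against the hub documentation linked above.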
 
 

3) data_engine

Provided that the queues are set up correctly, the next place to inspect is data_engine. data_engine has two jobs related to this topic: one is to prepare the schema by setting up the entries in S_QOS_DEFINITION and S_QOS_DATA (these are created from the QOS_DEFINITION messages the probe generates on startup), and the other is to insert the incoming QOS samples into the RN tables.

If you did not see entries in the S_QOS_DATA table in the query above, you will see errors in the data_engine log when you restart the problem probe. Set the data_engine log level to 3 to catch these. If no errors are found, deactivating the data_engine probe and activating it again can sometimes make a difference.

If you have already seen data in the S_QOS_DATA table, then QOS_DEFINITION messages are being processed and set up correctly. In that case, there may be a problem with the QOS definition that prevents the monitored data from being saved. We sometimes see issues when the definition has been set up with a 'hasmax' value but the probe is not sending data with a max value. Again, this will be logged in the data_engine log. The steps to fix this depend on the situation, and a support ticket is probably the best way to approach it.
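To see how a QOS definition was created (for example, whether it expects a max value), you can query S_QOS_DEFINITION directly. This is a sketch assuming the standard UIM schema, where the QOS name is stored in the name column; adjust if your schema differs:

SELECT * FROM S_QOS_DEFINITION WHERE name = 'QOS_CPU_USAGE';

Replace QOS_CPU_USAGE with the QOS affected in your environment and compare the definition (including the hasmax flag) with what the probe is actually sending.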

4) Operator Console (OC)

Usually, if the data is showing up in the database, it should also show up in the Performance Reports Designer (PRD), since PRD aligns very closely with the S_QOS_DATA table. It is a good idea to double-check a PRD to make sure it will graph your data; however, we most often see problems in OC.


If the issue is that the data is not visible in OC, there are three possible causes:
 
a) The device doesn't exist in the Inventory
If the device doesn't exist in inventory, then it could be a failure on the discovery_server's part. There are a few reasons why this might happen. Depending on the probe architecture, it could be a queue issue, or the discovery_server may be unable to contact the robot on which the probe is installed.
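As a quick check, you can confirm whether the device exists in inventory at all by searching CM_COMPUTER_SYSTEM (replace the example name/IP placeholders with your own values):

SELECT * FROM CM_COMPUTER_SYSTEM WHERE name LIKE '%servername%' OR ip = '<ip address>';

If this returns no rows, the device was never discovered, and the queue and discovery checks below apply.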

Probes that rely on discovery queues to publish inventory data are:

cisco_ucs
clarion
cm_data_import
discovery_agent
hyperv
ibmvm
icmp
mongodb_monitor
salesforce
snmpcollector
vmware
xenserver
 
and other storage probes in general.

In this case, it is necessary to ensure that there are discovery queues in place to pass the discovery messages up to your Primary hub. The parent hub of the robot needs to have an “ATTACH” queue that listens for probe_discovery messages.

The hub that retrieves data from that hub also needs a queue as stated above, unless that hub is the primary hub, in which case the discovery_server creates its own listening queue.
If the hubs are not configured with these queues, then the queues need to be created along with the corresponding “GET” queues, and the probe needs to be restarted.

Here is the documentation on hub queues:
Configure Queues and Tunnels (broadcom.com)
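The queue definition mirrors the QOS example sketched in TOPIC 2, only with a different subject; as a sketch (the queue name is an example):

   <probe_discovery_attach>
      active = yes
      type = attach
      subject = probe_discovery
   </probe_discovery_attach>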
 
If the problem probe is not one of the above, it could be a failure of discovery_server to contact the robot. If you restart discovery_server and watch the logs at level 5, you will see discovery_server reporting any problems contacting the robot. This is unusual but could be caused by a firewall blocking communications.


Necessary inventory data might not be getting saved into the UIM database because the device has been placed into excluded_devices.csv (in the discovery_server probe folder).
If the device is found in that file, discovery_server will not process inventory data for it.

Please see the section "Allow Rediscovery of Deleted devices" at the link below.

Remove Devices From the Inventory (broadcom.com)
 
 
b) There are correlation problems with devices in inventory, and the data is matched to an unexpected entry or attached to an unexpected device.
Many tables rely on JOIN statements to form a complete chain from CM_COMPUTER_SYSTEM to S_QOS_DATA. The following queries verify that this chain is complete.
 
Log back into the database to run some queries
 
SELECT * FROM S_QOS_DATA WHERE probe = '<probe name>';
Choose one of those results and copy the ci_metric_id value.  Then run the following query.
SELECT * FROM CM_CONFIGURATION_ITEM_METRIC WHERE ci_metric_id = '<ci_metric_id>';
If data is not returned, jump down to TOPIC c.
If data is returned, take the ci_id value from the returned record and run
SELECT * FROM CM_CONFIGURATION_ITEM WHERE ci_id = '<ci_id>';
Then take the dev_id from the returned record and run
SELECT * FROM CM_DEVICE WHERE dev_id = '<dev_id>';
Then take the cs_id from the returned record and run
SELECT * FROM CM_COMPUTER_SYSTEM WHERE cs_id = '<cs_id>';
 
This returns the entry in OC under which you will find the QOS data listed. Sometimes it is not the device you are expecting.
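The same chain can also be walked in one pass. This combined query uses exactly the joins described above, so it is useful for spotting which computer system a metric ended up attached to; the selected columns are illustrative:

SELECT cs.name, d.probe, d.qos, d.source, d.target, d.ci_metric_id
FROM S_QOS_DATA d
JOIN CM_CONFIGURATION_ITEM_METRIC cim ON cim.ci_metric_id = d.ci_metric_id
JOIN CM_CONFIGURATION_ITEM ci ON ci.ci_id = cim.ci_id
JOIN CM_DEVICE dev ON dev.dev_id = ci.dev_id
JOIN CM_COMPUTER_SYSTEM cs ON cs.cs_id = dev.cs_id
WHERE d.probe = '<probe name>';

Rows from S_QOS_DATA that do not appear in this result have a broken or missing link somewhere in the chain.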
 

 c) There is a ci_metric_id mismatch

ci_metric_id mismatches can be identified fairly quickly. The first step is to go to the robot, clear out the niscache folder, and restart the robot. This ensures that we don't have an old robot device ID, on which all metric IDs are ultimately based. This commonly happens on cloned VMs that already had a robot installed on them.
Then pull up DrNimbus and watch for any QOS_MESSAGE from the target probe. When you see a message from that probe, click on it and look for a field called met_id. You will need to manually type the met_id into the query below, as DrNimbus does not allow copy/paste.
 
SELECT * FROM S_QOS_DATA WHERE ci_metric_id = '<met_id>';
If this query doesn't return data, then you need to run
UPDATE S_QOS_DATA SET ci_metric_id = NULL WHERE probe = '<probe name>';
                  
Note for Oracle backend databases: make sure to run the COMMIT SQL command to permanently save the change.
 
Then restart data_engine and wait for the probe to send metrics again.  Check USM and see if your data shows up.
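Once the probe has sent metrics again, you can verify that new metric IDs were written by re-running a check such as:

SELECT probe, qos, target, ci_metric_id FROM S_QOS_DATA WHERE probe = '<probe name>';

Rows that still show a NULL ci_metric_id have not been re-linked yet.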
*** If your environment is UIM 8.4 SP2 or greater ***

Using Raw Configure mode, configure the key below for the data_engine probe.
It helps automatically fix mismatched ci_metric_ids.

Under the setup section, set:
 
   update_metric_id = yes
 
After confirming that update_metric_id is set to yes, make sure you also cold start the data_engine after running the UPDATE statement.
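For reference, once set through Raw Configure, the key lives in the setup section of the data_engine configuration file, roughly like this (sketch only; all other keys in the section are omitted):

   <setup>
      update_metric_id = yes
   </setup>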
 


If you still don’t have data showing up after resetting the ci_metric_id, then it’s time to examine the discovery_server side of things.
 
SELECT * FROM CM_CONFIGURATION_ITEM_METRIC WHERE ci_metric_id = '<met_id>';
 
If this query doesn't return data, then it's time to start checking the discovery_server logs for errors related to the robot that hosts that probe; this could be due to the issues discussed in TOPIC a.