Troubleshoot SLAs/SLOs in UIM

Article ID: 418482

Updated On:

Products

  • CA Unified Infrastructure Management On-Premise (Nimsoft / UIM)
  • CA Unified Infrastructure Management SaaS (Nimsoft / UIM)
  • DX Unified Infrastructure Management (Nimsoft / UIM)

Issue/Introduction

Guidance on how to troubleshoot SLAs/SLOs in UIM.

Environment

  • DX UIM - Any version
  • sla_engine

Resolution

1. Examine/verify SLA configuration

2. Examine/verify each SLO that makes up the SLA (check Source, Target, QOS) for changes/validity

3. Confirm that the QOS Data exists in the database and is currently being generated by any related probe(s).

4. Confirm the data is being successfully inserted into the database by the data_engine (data_engine.log), and the data is current.

5. Examine the data sample values in the SLM webapp (tabular view mode rather than graphic) to make sure there are no nulls.
    If there are occasional nulls, you can choose to ignore them via the SLA definition.

    - When a probe starts up, it should send a QOS_DEFINITION message for every QoS it can collect.
    - This results in a new entry being inserted into S_QOS_DEFINITION for the given QoS type.
    - The first time a metric that matches that QoS definition reaches the data_engine, the appropriate RN_QOS_DATA_XXXX (raw data) table
       is created and populated. Therefore, if any RN_QOS_DATA tables are missing, this indicates that either:

a. Data for a specific QoS metric is not being collected, or the data is queued.

b. The data_engine is having trouble inserting the data or creating the table.

In that case:

6.  Set the data_engine to loglevel 5 and logsize 50000, then restart the data_engine.

7.  Open Tools->DrNimbus.

8.  Restart the problematic monitoring probe, and check any other probes related to the defined SLOs.

9.  Watch DrNimbus using the message sniffer. You should see a QOS_DEFINITION message coming from the probe for the QoS you are looking for (as per the SLOs).

10. Keep watching DrNimbus for at least one polling interval. You should see QOS_MESSAGE messages coming from the same probe with the QoS data you are looking for in the SLOs.

11. Examine the data_engine log when these messages come in and look for any INSERT errors or other data errors/issues.

      The RN_QOS_DATA_#### numbers should match up with QOS_DEF_ID in the S_QOS_DEFINITION table.

      For example, RN_QOS_DATA_0149 should be associated with qos_def_id 149. You can use this to figure out which probes/metrics you need to focus on.

12. Make sure the data is making its way from the probe->Robot->Hub->NimBUS->Hub/Primary Hub->Database.

      Check all related logs at loglevel 5, logsize 50000, and check for communication issues as well as any data issues.
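
The database checks in steps 3 to 5 and step 11 can be sketched in T-SQL as shown below. This is a minimal sketch, not a definitive procedure. It assumes that S_QOS_DEFINITION has a name column, that S_QOS_DATA has qos, source, target, table_id and r_table columns, and that the raw tables have table_id, sampletime and samplevalue columns; verify these against your own schema. The QoS name, the raw table name and the table_id value are hypothetical placeholders.

```sql
-- Hypothetical QoS name; use the QoS referenced by the SLO under investigation.
DECLARE @qos AS varchar(255);
SET @qos = 'QOS_CPU_USAGE';

-- 1. Is the QoS defined, and which qos_def_id was assigned to it?
--    (Assumes a "name" column in S_QOS_DEFINITION.)
SELECT * FROM S_QOS_DEFINITION WHERE name = @qos;

-- 2. Which source/target series exist for that QoS, and which raw table holds them?
--    (Assumes qos, source, target, table_id and r_table columns in S_QOS_DATA.)
SELECT table_id, source, target, r_table
FROM S_QOS_DATA
WHERE qos = @qos;

-- 3. For one series from step 2, check that data is current and count nulls
--    over the last 24 hours. Replace the table name and table_id with the
--    values returned above; both are placeholders here.
SELECT COUNT(*) AS samples,
       SUM(CASE WHEN samplevalue IS NULL THEN 1 ELSE 0 END) AS null_samples,
       MAX(sampletime) AS latest_sample
FROM RN_QOS_DATA_0149
WHERE table_id = 12345
  AND sampletime >= DATEADD(day, -1, GETDATE());
```

If query 1 returns no row, the probe has probably not sent its QOS_DEFINITION message. If latest_sample is old, focus on the probe->hub->data_engine path described in step 12.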

Logs to examine:

 sla_engine.log
 _sla_engine.log
 <probe_name>.log
 controller.log
 hub.log
 data_engine.log
 _data_engine.log
 wasp.log
 operatorconsole.log


Helpful queries

select * from s_sla_definition
select * from s_slo_definition
select * from d_sla_jobs
select * from d_slo_compliance

Example query to examine SLO compliance values for a specific SLA job. You can find the
SLA ID and job ID values in the sla_engine logs.

DECLARE @job_id AS int;
DECLARE @sla_id AS int;
SET @job_id = 11042729;  -- replace with your job ID
SET @sla_id = 283;       -- replace with your SLA ID
SELECT AVG(percentage) AS pct FROM D_SLO_COMPLIANCE WHERE job_id = @job_id AND sla_id = @sla_id;

select * from S_SLA_DEFINITION;
select * from S_SLO_DEFINITION order by sla_id, slo_id;
select * from D_SLA_JOBS order by execute_date, sla_id;
select * from D_SLA_COMPLIANCE order by sla_id;
select * from D_SLO_COMPLIANCE order by sla_id, slo_id;
select * from D_QOS_COMPLIANCE order by sla_id, slo_id;
select * from H_SLA_COMPLIANCE order by sla_id;
select * from H_SLO_COMPLIANCE order by sla_id, slo_id;
select * from H_QOS_COMPLIANCE order by sla_id, slo_id, period_begin;
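
When the averaged compliance for a job looks wrong, it can help to break the figure down per SLO. The sketch below uses only the tables and columns that appear in the queries above (D_SLA_JOBS with sla_id and execute_date; D_SLO_COMPLIANCE with job_id, sla_id, slo_id and percentage). The ID values are placeholders, and whether averaging per SLO matches how sla_engine itself aggregates is an assumption.

```sql
DECLARE @job_id AS int;
DECLARE @sla_id AS int;
SET @job_id = 11042729;  -- placeholder: replace with your job ID
SET @sla_id = 283;       -- placeholder: replace with your SLA ID

-- Most recent jobs for the SLA, as an alternative to reading the IDs from the logs.
SELECT TOP 5 *
FROM D_SLA_JOBS
WHERE sla_id = @sla_id
ORDER BY execute_date DESC;

-- Compliance per SLO for the chosen job, lowest first, to show which SLO
-- is pulling the overall figure down.
SELECT slo_id, AVG(percentage) AS pct
FROM D_SLO_COMPLIANCE
WHERE job_id = @job_id AND sla_id = @sla_id
GROUP BY slo_id
ORDER BY pct;
```

An SLO with an unexpectedly low pct is the one whose Source, Target and QoS settings are worth re-checking first (step 2).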