Alarms generated for servers in active maintenance intermittently in UIM 20.4 or higher

Article ID: 240467

Products

  • DX Unified Infrastructure Management (Nimsoft / UIM)
  • CA Unified Infrastructure Management On-Premise (Nimsoft / UIM)
  • CA Unified Infrastructure Management SaaS (Nimsoft / UIM)

Issue/Introduction

  • Intermittently, we are receiving false alarms for devices that are in active maintenance mode.

  • Alarms from devices correctly added to an active maintenance schedule are still coming through.

  • The alarm dev_id is listed in the schedule for suppression, but still, we are receiving some alarms unexpectedly.

Environment

  • Release: 20.4.* or higher
  • Component: UIM NAS
  • maintenance_mode
  • ems
  • nas 20.4 or higher

Cause

One possible cause is that the maintenance_mode probe was not reachable when the issue occurred, so the NAS could not register with maintenance_mode.

This caused the NAS to discard the maintenance schedule.

Because the NAS collects maintenance schedules from the maintenance_mode probe at run time, alarms 'leaked' through for that period.

Note that the maintenance_mode probe may be unreachable because of network or database issues.

A database issue may show up in the maintenance_mode log as a registration failure:

Example: WARN / SQLServerException entries logged from Apr 12 00:30:16:588

The exception started at:
Apr 12 00:30:16:588 WARN  [attach_socket, com.nimsoft.monitor.probe.MaintenanceModeProbe] Failure registering to maintenance_mode. org.springframework.dao.DataAccessResourceFailureException: StatementCallback

The exception continued until:
Apr 12 00:52:03:871 WARN  [attach_socket, com.nimsoft.monitor.probe.MaintenanceModeProbe] Failure registering to maintenance_mode. org.springframework.dao.DataAccessResourceFailureException
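
To estimate how long registration was failing, and therefore the period in which alarms could have leaked, the log can be scanned for the failure message. The following is a minimal Python sketch, not a supported tool. It assumes every relevant line starts with a timestamp in the layout shown above and contains the text "Failure registering to maintenance_mode". The function names are hypothetical, and because the log lines carry no year, the caller has to supply one.

```python
import re
from datetime import datetime

# Text and timestamp layout are taken from the sample log lines above.
FAILURE_TEXT = "Failure registering to maintenance_mode"
TS_PATTERN = re.compile(r"^(\w{3} +\d{1,2} \d{2}:\d{2}:\d{2}:\d{3})")


def parse_ts(text, year):
    # Assumption: the log omits the year, so it is passed in by the caller.
    return datetime.strptime(f"{year} {text}", "%Y %b %d %H:%M:%S:%f")


def failure_window(lines, year):
    """Return (first, last) timestamps of registration failures, or None."""
    hits = []
    for line in lines:
        if FAILURE_TEXT not in line:
            continue
        match = TS_PATTERN.match(line.strip())
        if match:
            hits.append(parse_ts(match.group(1), year))
    if not hits:
        return None
    return min(hits), max(hits)


if __name__ == "__main__":
    sample = [
        "Apr 12 00:30:16:588 WARN  [attach_socket, com.nimsoft.monitor.probe.MaintenanceModeProbe] "
        "Failure registering to maintenance_mode. "
        "org.springframework.dao.DataAccessResourceFailureException: StatementCallback",
        "Apr 12 00:52:03:871 WARN  [attach_socket, com.nimsoft.monitor.probe.MaintenanceModeProbe] "
        "Failure registering to maintenance_mode. "
        "org.springframework.dao.DataAccessResourceFailureException",
    ]
    # 2023 is a placeholder year; the article does not state one.
    first, last = failure_window(sample, 2023)
    print("Registration failures from", first, "to", last, "duration", last - first)
```

Alarms that arrived for in-maintenance devices inside the reported window are the likely candidates for the leak described above.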

Resolution


A new key was introduced that may help work around this issue in a similar scenario.

The new parameter "maint_sched_discard" lets you decide whether the maintenance schedule is discarded.

You can set the value to yes or no. A value of no means that the maintenance schedule is retained.

The key is found under the nas setup section via Raw Configure.

The default is:

maint_sched_discard = yes

Reference: nas (Alarm Server) Release Notes

Set it to no:

   maint_sched_discard = no

With this setting, the maintenance schedules are not discarded if maintenance_mode becomes unreachable in a similar scenario.


Make the following adjustments to the nas and ems probes:

Open the nas probe in Raw Configure mode and set the following parameters under the <setup> section:

   maint_max_resp_time = 50
   
   registrationIntervalLookAheadMinutes = 60


Open the ems probe in Raw Configure mode and set the following parameter under the <setup> section:

   maintenance_mode_cmd_timeout = 300000
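
After editing, it can be useful to confirm that the keys actually ended up in the right section. The sketch below is an illustration in Python, assuming the probe configuration is plain text made of `<section>` ... `</section>` blocks with `key = value` lines, as in the fragments in this article. The helper names `parse_cfg` and `mismatches` are hypothetical, the sample text is made up for the demonstration, and real nas or ems configuration files contain many more sections than shown here.

```python
def parse_cfg(text):
    """Parse '<section> key = value </section>' text into {section_path: {key: value}}."""
    sections = {}
    stack = []  # currently open sections, outermost first
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("</") and line.endswith(">"):
            if stack:
                stack.pop()
        elif line.startswith("<") and line.endswith(">"):
            stack.append(line[1:-1].strip())
            sections.setdefault("/".join(stack), {})
        elif "=" in line:
            key, _, value = line.partition("=")
            sections.setdefault("/".join(stack), {})[key.strip()] = value.strip()
    return sections


def mismatches(sections, section, expected):
    """Return {key: (actual, expected)} for keys that are missing or differ."""
    actual = sections.get(section, {})
    return {
        key: (actual.get(key), value)
        for key, value in expected.items()
        if actual.get(key) != value
    }


# Values recommended in this article for the nas <setup> section.
NAS_SETUP = {
    "maint_sched_discard": "no",
    "maint_max_resp_time": "50",
    "registrationIntervalLookAheadMinutes": "60",
}

# Made-up sample standing in for the contents of a nas configuration.
SAMPLE = """
<setup>
   maint_sched_discard = yes
   maint_max_resp_time = 50
</setup>
"""

if __name__ == "__main__":
    for key, (found, wanted) in mismatches(parse_cfg(SAMPLE), "setup", NAS_SETUP).items():
        print(f"{key}: found {found!r}, article recommends {wanted!r}")
```

The same check can be applied to the ems setting by passing `{"maintenance_mode_cmd_timeout": "300000"}` as the expected values.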


Adjust the maintenance_mode probe's Java heap memory.

Change

<startup>
   options = -Xms512m -Xmx1024m
</startup>

to

<startup>
   options = -Xms1024m -Xmx2048m
</startup>
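
As a quick sanity check on the heap change, the `options` string can be read back and compared with the values above. This is a small Python sketch with a hypothetical helper name, assuming the options use the standard JVM `-Xms` and `-Xmx` flags with a k, m, or g suffix as in this article.

```python
import re

# Multipliers to convert each JVM size suffix to megabytes.
UNITS = {"k": 1 / 1024, "m": 1, "g": 1024}


def heap_mb(options):
    """Extract -Xms/-Xmx from a JVM options string and return the sizes in MB."""
    sizes = {}
    for flag, number, unit in re.findall(r"-(Xms|Xmx)(\d+)([kKmMgG])", options):
        sizes[flag] = int(number) * UNITS[unit.lower()]
    return sizes


if __name__ == "__main__":
    current = heap_mb("-Xms1024m -Xmx2048m")
    # The article recommends an initial heap of 1024 MB and a maximum of 2048 MB.
    if current.get("Xms", 0) >= 1024 and current.get("Xmx", 0) >= 2048:
        print("Heap settings meet the recommended values:", current)
    else:
        print("Heap settings are below the recommended values:", current)
```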

Additional Information

This change may also help resolve false robot_inactive alarms generated during maintenance schedules.

Lastly, on the primary hub, set the 'start_after' key in the <alarm_enrichment> section of controller.cfg so that alarm_enrichment starts after maintenance_mode:

<alarm_enrichment>
   description = Alarm Enrichment Server
   group = Infrastructure
   active = yes
   type = daemon
   command = <startup java>
   arguments = -Djava.library.path="../../../../lib" -Dfile.encoding=UTF-8 -jar ../lib/alarm_enrichment.jar
   config = alarm_enrichment.cfg
   logfile = alarm_enrichment.log
   workdir = probes/service/nas/alarm_enrichment
   start_after = maintenance_mode
   magic_key = r1kIzNSjeF8I6TNS0HoPGWc2+qakLUaVEQJmD+LLgbnp+a7zhCJO7IKGoggaDvieil2aMe7rHQUYgodJVJJKbg==
</alarm_enrichment>