Alarms generated for servers in active maintenance intermittently in UIM 20.4.x

Article ID: 240467

Products

DX Unified Infrastructure Management (Nimsoft / UIM)
CA Unified Infrastructure Management On-Premise (Nimsoft / UIM)
CA Unified Infrastructure Management SaaS (Nimsoft / UIM)

Issue/Introduction

We intermittently receive false alarms for devices that are in active maintenance mode.
Alarms from devices that were correctly added to an active maintenance schedule still come through.
The dev_id of the alarm is listed in the schedule for suppression, yet some alarms are still received unexpectedly.

Environment

Release: 20.4.* or higher
Component: UIM NAS
maintenance_mode

Cause

One possible cause is that the maintenance_mode probe was not reachable when the issue occurred, so the NAS was unable to register with maintenance_mode.

This caused the NAS to discard the maintenance schedule.

The NAS collects maintenance schedules from the maintenance_mode probe at run time, so alarms 'leaked' through for the period during which the probe was unreachable.

The maintenance_mode probe may have been unreachable because of network or database issues.

A database issue may show up in the maintenance_mode log as a registration failure:

Example:

Log entries from Apr 12 00:30:16:588 show WARN messages with an SQLServerException.

Exception started at:
Apr 12 00:30:16:588 WARN  [attach_socket, com.nimsoft.monitor.probe.MaintenanceModeProbe] Failure registering to maintenance_mode. org.springframework.dao.DataAccessResourceFailureException: StatementCallback

Exception continued until:
Apr 12 00:52:03:871 WARN  [attach_socket, com.nimsoft.monitor.probe.MaintenanceModeProbe] Failure registering to maintenance_mode. org.springframework.dao.DataAccessResourceFailureException
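
To estimate how long the registration failures lasted (and therefore the period in which alarms could have leaked), the log can be scanned for these WARN lines. The following is a minimal Python sketch, not a supported tool. The function name `registration_failure_window` is hypothetical, the timestamp layout is assumed to match the example lines above, and because the log lines carry no year, a placeholder year is supplied as a parameter.

```python
import re
from datetime import datetime

# Assumed line layout, taken from the example above:
#   Apr 12 00:30:16:588 WARN  [attach_socket, ...] Failure registering to maintenance_mode. ...
LINE_RE = re.compile(
    r"^(?P<ts>[A-Z][a-z]{2} +\d{1,2} \d{2}:\d{2}:\d{2}):(?P<ms>\d{3})\s+WARN\b.*"
    r"Failure registering to maintenance_mode"
)


def registration_failure_window(lines, year=2000):
    """Return (first, last, count) for registration-failure lines, or None.

    The log timestamps have no year, so `year` is only a placeholder that
    makes the values parseable. Month names are parsed with %b, which
    assumes an English locale.
    """
    stamps = []
    for line in lines:
        match = LINE_RE.match(line)
        if not match:
            continue  # not a registration-failure line
        parsed = datetime.strptime(f"{year} {match.group('ts')}", "%Y %b %d %H:%M:%S")
        # Milliseconds follow the seconds after a colon in this log format.
        stamps.append(parsed.replace(microsecond=int(match.group("ms")) * 1000))
    if not stamps:
        return None
    return min(stamps), max(stamps), len(stamps)


if __name__ == "__main__":
    import sys

    # Usage: python failure_window.py <path to maintenance_mode log>
    with open(sys.argv[1], encoding="utf-8", errors="replace") as handle:
        window = registration_failure_window(handle)
    if window is None:
        print("No registration failures found.")
    else:
        first, last, count = window
        print(f"{count} failures between {first} and {last} ({last - first})")
```

Alarms received for devices in maintenance can then be compared against the reported window to see whether they fall inside it.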

Resolution

A new key, maint_sched_discard, was introduced that may help work around this issue in similar scenarios.

The key lets you decide whether the NAS discards the maintenance schedule.

Specify the value as yes or no. A value of no means that the maintenance schedule is retained.

The key is located in the setup section of the nas probe and is set via Raw Configure.

The default is:

maint_sched_discard = yes

See the nas (Alarm Server) Release Notes.

Setting it to no:

maint_sched_discard = no

With this setting, the maintenance schedules are not discarded when maintenance_mode is unreachable.

Make the following adjustments to the nas and ems probes:

Open the nas probe in Raw Configure mode and set the following parameters under the <setup> section:

maint_max_resp_time = 50

registrationIntervalLookAheadMinutes = 60

Open the ems probe in Raw Configure mode and set the following parameter under the <setup> section:

maintenance_mode_cmd_timeout = 300000
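
The settings listed above can be checked against an exported probe configuration. The following is a minimal Python sketch under stated assumptions: the helper names `parse_section` and `mismatches` are hypothetical, and the configuration text is assumed to use `<setup>` ... `</setup>` section tags on their own lines with `key = value` lines inside. Keys in nested subsections are not distinguished from top-level keys, so treat the result as a hint rather than a definitive check.

```python
# Values taken from this article for the nas probe <setup> section.
RECOMMENDED_NAS = {
    "maint_sched_discard": "no",
    "maint_max_resp_time": "50",
    "registrationIntervalLookAheadMinutes": "60",
}

# Value taken from this article for the ems probe <setup> section.
RECOMMENDED_EMS = {
    "maintenance_mode_cmd_timeout": "300000",
}


def parse_section(cfg_text, section):
    """Collect key/value pairs found between <section> and </section>.

    Assumes each tag sits alone on its line and keys are written as
    `key = value`. Nested subsections are flattened into the result.
    """
    values = {}
    inside = False
    for raw in cfg_text.splitlines():
        line = raw.strip()
        if line == f"<{section}>":
            inside = True
        elif line == f"</{section}>":
            inside = False
        elif inside and "=" in line:
            key, _, value = line.partition("=")
            values[key.strip()] = value.strip()
    return values


def mismatches(actual, expected):
    """Return {key: (actual_value, expected_value)} for keys that differ.

    A missing key is reported with an actual value of None.
    """
    return {
        key: (actual.get(key), wanted)
        for key, wanted in expected.items()
        if actual.get(key) != wanted
    }
```

For example, `mismatches(parse_section(text, "setup"), RECOMMENDED_NAS)` returns an empty dictionary when all three nas keys already have the values shown above.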

Additional Information

The new parameter requires nas version 9.32 or later; nas 9.32HF1 is recommended.

This may also help resolve false robot_inactive alarms that are generated during maintenance schedules.

Note that informational and clear alarm messages are still displayed during a maintenance window.
