Alarms generated for servers in active maintenance intermittently in UIM 20.4.x

Article ID: 240467

Products

  • DX Unified Infrastructure Management (Nimsoft / UIM)
  • CA Unified Infrastructure Management On-Premise (Nimsoft / UIM)
  • CA Unified Infrastructure Management SaaS (Nimsoft / UIM)

Issue/Introduction

  • Intermittently, we are receiving alarms for devices that are in active maintenance mode.
  • Alarms from devices correctly added to an active maintenance schedule are still coming through.
  • The alarm's dev_id is listed in the schedule for suppression, but some alarms are still received unexpectedly.

Environment

  • Release: 20.4.*
  • Component: UIM NAS
  • maintenance_mode

Cause

One possible cause is that the maintenance_mode probe was unreachable when the issue occurred, so the NAS could not register with maintenance_mode.

When registration fails, the NAS discards the maintenance schedule.

Because the NAS collects maintenance schedules from the maintenance_mode probe at run time, alarms leak through for the period during which the probe is unreachable.

The maintenance_mode probe may be unreachable because of network problems or database issues.

A database issue may be indicated by registration failures in the maintenance_mode log.

Example of a maintenance_mode registration failure:

Log entries starting at Apr 12 00:30:16:588: WARN / SQLServerException

The exception started at:
Apr 12 00:30:16:588 WARN  [attach_socket, com.nimsoft.monitor.probe.MaintenanceModeProbe] Failure registering to maintenance_mode. org.springframework.dao.DataAccessResourceFailureException: StatementCallback

The exception continued until:
Apr 12 00:52:03:871 WARN  [attach_socket, com.nimsoft.monitor.probe.MaintenanceModeProbe] Failure registering to maintenance_mode. org.springframework.dao.DataAccessResourceFailureException
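
To find how long the NAS was unable to register, the log can be scanned for the first and last failure lines. The following is a minimal Python sketch, not a supported tool. It assumes every failure line follows the layout of the excerpt above (timestamp, WARN, then the "Failure registering to maintenance_mode" text). The function name failure_window and the INFO line in the sample are hypothetical.

```python
import re

# Matches registration-failure lines laid out like the excerpt above:
# "<Mon> <day> <HH:MM:SS:mmm> WARN ... Failure registering to maintenance_mode"
FAILURE = re.compile(
    r"^(?P<ts>[A-Z][a-z]{2} +\d{1,2} \d{2}:\d{2}:\d{2}:\d{3}) +WARN +.*"
    r"Failure registering to maintenance_mode"
)


def failure_window(lines):
    """Return (first_timestamp, last_timestamp, count) or None if no failures."""
    stamps = [m.group("ts") for line in lines if (m := FAILURE.match(line))]
    if not stamps:
        return None
    return stamps[0], stamps[-1], len(stamps)


# Sample input: the two lines from this article plus one hypothetical
# unrelated line to show that non-matching entries are ignored.
sample = [
    "Apr 12 00:30:16:588 WARN  [attach_socket, com.nimsoft.monitor.probe.MaintenanceModeProbe] "
    "Failure registering to maintenance_mode. "
    "org.springframework.dao.DataAccessResourceFailureException: StatementCallback",
    "Apr 12 00:41:00:000 INFO  [main, hypothetical] unrelated entry",
    "Apr 12 00:52:03:871 WARN  [attach_socket, com.nimsoft.monitor.probe.MaintenanceModeProbe] "
    "Failure registering to maintenance_mode. "
    "org.springframework.dao.DataAccessResourceFailureException",
]

print(failure_window(sample))
```

For a real log, pass an open file object instead of the sample list. Alarms that leaked between the first and last timestamp are consistent with the cause described here.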

Resolution

A new key was introduced that can work around this issue in similar scenarios.

The parameter "maint_sched_discard" lets you decide whether the NAS discards the maintenance schedule.

Valid values are yes and no. A value of no means that the maintenance schedule is retained.

The key is found under the setup section of the nas probe via raw configure.

The default is:

maint_sched_discard = yes

https://techdocs.broadcom.com/us/en/ca-enterprise-software/it-operations-management/ca-unified-infrastructure-management-probes/GA/monitoring/infrastructure-core-components/nas-alarm-server/nas-alarm-server-release-notes.html

To retain the schedule, set it to no:

maint_sched_discard = no

With this setting, the maintenance schedules are not discarded if maintenance_mode is unreachable in a similar scenario.
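
The effect of the key can be illustrated with a toy model. This Python sketch is not the NAS implementation. It only mirrors the behavior described in this article, and the function names, the set-of-dev_ids representation, and the dev_id values are all hypothetical.

```python
def effective_schedule(cached, reachable, fetched, maint_sched_discard="yes"):
    """Toy model of which schedule is applied after a registration attempt.

    cached  -- dev_ids from the last schedule that was collected successfully
    fetched -- dev_ids returned by maintenance_mode when it is reachable
    """
    if reachable:
        return set(fetched)   # a fresh schedule replaces the cached one
    if maint_sched_discard == "no":
        return set(cached)    # keep the last known schedule
    return set()              # discard: nothing is suppressed, alarms leak


def is_suppressed(dev_id, schedule):
    """An alarm is suppressed only if its dev_id is in the applied schedule."""
    return dev_id in schedule


cached = {"D-hypothetical-1", "D-hypothetical-2"}

# maintenance_mode unreachable, default setting: the schedule is lost.
leaky = effective_schedule(cached, reachable=False, fetched=set())
print(is_suppressed("D-hypothetical-1", leaky))

# maintenance_mode unreachable, maint_sched_discard = no: still suppressed.
kept = effective_schedule(cached, reachable=False, fetched=set(),
                          maint_sched_discard="no")
print(is_suppressed("D-hypothetical-1", kept))
```

In this model the first check prints False (the alarm leaks) and the second prints True (the alarm stays suppressed), which is the difference the key is meant to make.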

Make the following adjustments to the nas and ems probes:

Run raw configure on the nas probe and set the following under 'setup':

   maint_max_resp_time = 50
   registrationIntervalLookAheadMinutes = 60

Run raw configure on the ems probe and set the following under 'setup':

   maintenance_mode_cmd_timeout = 300000
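
When several hubs need the same changes, it can help to compare the values read from raw configure against the values in this article. The following Python sketch assumes the current values have already been copied into a dictionary by hand. It does not read probe configuration itself, and the function name pending_changes and the sample values are hypothetical.

```python
# Values recommended in this article, keyed by probe, all under 'setup'.
RECOMMENDED = {
    "nas": {
        "maint_sched_discard": "no",
        "maint_max_resp_time": "50",
        "registrationIntervalLookAheadMinutes": "60",
    },
    "ems": {
        "maintenance_mode_cmd_timeout": "300000",
    },
}


def pending_changes(current):
    """Return {probe: {key: (current_value, recommended_value)}} for differences.

    A key that is missing from 'current' is reported with a current value of None.
    """
    changes = {}
    for probe, keys in RECOMMENDED.items():
        have = current.get(probe, {})
        diff = {k: (have.get(k), v) for k, v in keys.items() if have.get(k) != v}
        if diff:
            changes[probe] = diff
    return changes


# Hypothetical values as they might be read from raw configure on one hub.
current = {
    "nas": {"maint_sched_discard": "yes", "maint_max_resp_time": "50"},
    "ems": {"maintenance_mode_cmd_timeout": "300000"},
}

for probe, diff in pending_changes(current).items():
    for key, (old, new) in diff.items():
        print(f"{probe}: set {key} = {new} (currently {old})")
```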

Additional Information

The new parameter requires NAS version 9.32 or later; nas 9.32HF1 is recommended.

This may also help resolve issues with robot_inactive alarms being generated and sent during maintenance schedules.

Note also that informational and clear alarm messages are still displayed during a maintenance window.