Alarms generated for Servers in Active Maintenance intermittently in UIM 20.4.x

Article ID: 240467

Products

DX Unified Infrastructure Management (Nimsoft / UIM)

Issue/Introduction

Alarms are intermittently received for devices that are in active maintenance mode.

The devices have been correctly added to an active maintenance schedule, yet some of their alarms still come through.

The dev_id on the alarm is listed in the schedule for suppression, but the alarm is not suppressed.

Environment

  • Release: 20.4.*
  • Component: UIM NAS

Cause

A possible cause is that the maintenance_mode probe was not reachable at the time the issue occurred, so the NAS was unable to register with maintenance_mode.

This caused the NAS to discard the maintenance schedule.

The NAS collects maintenance schedules from the maintenance_mode probe at run time, so alarms leaked through for the period during which the probe was unreachable.

The maintenance_mode probe can be unreachable because of network problems or database (DB) issues.

A DB issue may be indicated by registration failures in the maintenance_mode log:

maintenance_mode register failure:

Example: 

logs at Apr 12 00:30:16:588 WARN / SQLServerException

Exception started at:
Apr 12 00:30:16:588 WARN  [attach_socket, com.nimsoft.monitor.probe.MaintenanceModeProbe] Failure registering to maintenance_mode. org.springframework.dao.DataAccessResourceFailureException: StatementCallback

Exception continued until:
Apr 12 00:52:03:871 WARN  [attach_socket, com.nimsoft.monitor.probe.MaintenanceModeProbe] Failure registering to maintenance_mode. org.springframework.dao.DataAccessResourceFailureException
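
To estimate how long the NAS was unable to register, the first and last failure entries in the maintenance_mode log can be compared. The following is a minimal Python sketch, not a supported tool. It assumes the log lines have the timestamp layout shown in the example above; the function names and the placeholder year are illustrative only, because the log timestamp carries no year.

```python
import re
from datetime import datetime

# Matches lines shaped like the example in this article:
# Apr 12 00:30:16:588 WARN  [attach_socket, ...] Failure registering to maintenance_mode. ...
FAILURE = re.compile(
    r"^(?P<ts>[A-Z][a-z]{2}\s+\d{1,2}\s+\d{2}:\d{2}:\d{2}:\d{3})\s+WARN\s.*"
    r"Failure registering to maintenance_mode"
)


def parse_ts(text, year=2000):
    """Parse a log timestamp; the year is a placeholder (assumption)."""
    normalized = " ".join(text.split())
    return datetime.strptime(f"{year} {normalized}", "%Y %b %d %H:%M:%S:%f")


def failure_window(lines):
    """Return (first, last) registration-failure timestamps, or None."""
    stamps = []
    for line in lines:
        match = FAILURE.match(line)
        if match:
            stamps.append(parse_ts(match.group("ts")))
    if not stamps:
        return None
    return min(stamps), max(stamps)


# The two log lines quoted in this article, used as sample input.
SAMPLE = [
    "Apr 12 00:30:16:588 WARN  [attach_socket, com.nimsoft.monitor.probe.MaintenanceModeProbe] "
    "Failure registering to maintenance_mode. "
    "org.springframework.dao.DataAccessResourceFailureException: StatementCallback",
    "Apr 12 00:52:03:871 WARN  [attach_socket, com.nimsoft.monitor.probe.MaintenanceModeProbe] "
    "Failure registering to maintenance_mode. "
    "org.springframework.dao.DataAccessResourceFailureException",
]

window = failure_window(SAMPLE)
if window:
    first, last = window
    print("Registration failures from", first.time(), "to", last.time())
```

Alarms that arrived for devices in maintenance during the reported window are candidates for the leak described above.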

Resolution

A new configuration key was introduced that can work around this issue in a similar scenario.

The new parameter "maint_sched_discard" lets you decide whether the NAS discards the maintenance schedule.

You can specify the value as yes or no. A value of no means that the maintenance schedule is retained.

The key is found under the setup section of the nas probe via raw configure.

The default is:

maint_sched_discard = yes

https://techdocs.broadcom.com/us/en/ca-enterprise-software/it-operations-management/ca-unified-infrastructure-management-probes/GA/monitoring/infrastructure-core-components/nas-alarm-server/nas-alarm-server-release-notes.html

Setting it to no: 

maint_sched_discard = no

With this setting, the maintenance schedules are not discarded when maintenance_mode is unreachable in a similar scenario.
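
The effect of the two values can be illustrated with a toy model. This Python sketch is not the NAS implementation; the class and method names are hypothetical, and it only mirrors the behavior described in this article: with discard enabled, a failed registration drops the schedule, and with discard disabled the last known schedule is kept.

```python
from dataclasses import dataclass, field


@dataclass
class ToyAlarmServer:
    """Illustrative stand-in for the NAS; names and logic are assumptions."""

    maint_sched_discard: bool = True  # mirrors maint_sched_discard = yes
    devices_in_maintenance: set = field(default_factory=set)

    def refresh(self, reachable, schedule=()):
        """Simulate one attempt to fetch schedules from maintenance_mode."""
        if reachable:
            self.devices_in_maintenance = set(schedule)
        elif self.maint_sched_discard:
            # Registration failed and discard is enabled: schedule is dropped.
            self.devices_in_maintenance.clear()
        # Otherwise the last known schedule is retained unchanged.

    def accepts(self, dev_id):
        """Return True if an alarm from dev_id would NOT be suppressed."""
        return dev_id not in self.devices_in_maintenance


for discard in (True, False):
    server = ToyAlarmServer(maint_sched_discard=discard)
    server.refresh(reachable=True, schedule=["D1234"])   # schedule loaded
    server.refresh(reachable=False)                      # probe unreachable
    print("discard =", discard, "-> alarm leaks:", server.accepts("D1234"))
```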

Make the following adjustments to the nas and ems probes:

Run raw configure on the nas probe and set the following under 'setup':

   maint_max_resp_time = 50
   registrationIntervalLookAheadMinutes = 60

Run raw configure on the ems probe and set the following under 'setup':

   maintenance_mode_cmd_timeout = 300000
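
After the changes are applied, the values can be double-checked against a copy of each probe's configuration. The Python sketch below is a hedged example: it assumes the configuration text stores keys as "key = value" lines inside a <setup> ... </setup> section, and the helper names are hypothetical. The recommended values are the ones listed in this article.

```python
# Recommended values taken from this article, keyed by probe name.
RECOMMENDED = {
    "nas": {
        "maint_sched_discard": "no",
        "maint_max_resp_time": "50",
        "registrationIntervalLookAheadMinutes": "60",
    },
    "ems": {
        "maintenance_mode_cmd_timeout": "300000",
    },
}


def read_setup(cfg_text):
    """Return key/value pairs found directly inside the <setup> section."""
    values, stack = {}, []
    for raw in cfg_text.splitlines():
        line = raw.strip()
        if line.startswith("</"):
            if stack:
                stack.pop()  # leave the current section
        elif line.startswith("<") and line.endswith(">"):
            stack.append(line[1:-1].strip())  # enter a section
        elif stack == ["setup"] and "=" in line:
            key, _, value = line.partition("=")
            values[key.strip()] = value.strip()
    return values


def deviations(probe, cfg_text):
    """Map each recommended key to (expected, actual) where they differ."""
    actual = read_setup(cfg_text)
    return {
        key: (expected, actual.get(key))
        for key, expected in RECOMMENDED[probe].items()
        if actual.get(key) != expected
    }


# Sample text standing in for a nas configuration (assumed layout).
SAMPLE_NAS_CFG = """<setup>
   loglevel = 1
   maint_sched_discard = yes
   maint_max_resp_time = 50
</setup>
"""

for key, (expected, found) in deviations("nas", SAMPLE_NAS_CFG).items():
    print(f"{key}: expected {expected}, found {found}")
```

A key reported with a found value of None is missing from the setup section and would need to be added through raw configure.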

Additional Information

The new parameter requires NAS version 9.32 or later; nas 9.32HF1 is recommended.