Intermittently, we are receiving false alarms for devices that are in active maintenance mode.
Alarms from devices correctly added to an active maintenance schedule are still coming through.
The dev_id of the alarm is listed in the schedule for suppression, yet some alarms still arrive.
One possible cause is that the maintenance_mode probe was not reachable when the issue occurred, so the NAS was not able to register with maintenance_mode.
As a result, the NAS discarded the maintenance schedule.
Because the NAS collects maintenance schedules from the maintenance_mode probe at run time, alarms 'leaked' through for as long as the probe was unreachable.
Note that the maintenance_mode probe may be unreachable because of network or database (DB) issues.
A DB issue may show up in the maintenance_mode log as a registration failure:
Example:
logs at Apr 12 00:30:16:588 WARN / SQLServerException
Exception started at:
Apr 12 00:30:16:588 WARN [attach_socket, com.nimsoft.monitor.probe.MaintenanceModeProbe] Failure registering to maintenance_mode. org.springframework.dao.DataAccessResourceFailureException: StatementCallback
Exception continued until:
Apr 12 00:52:03:871 WARN [attach_socket, com.nimsoft.monitor.probe.MaintenanceModeProbe] Failure registering to maintenance_mode. org.springframework.dao.DataAccessResourceFailureException
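To estimate how long the registration failures lasted, the log can be scanned for the first and last occurrence of the message. The following is a minimal sketch, not a supported tool: it assumes the timestamp layout shown in the example above, and the function names and the `year` argument are hypothetical (the log lines themselves carry no year).

```python
import re
from datetime import datetime

# Timestamp prefix plus the registration-failure message from the example log.
FAILURE_RE = re.compile(
    r"^(?P<ts>[A-Z][a-z]{2} +\d{1,2} \d{2}:\d{2}:\d{2}:\d{3}) WARN .*"
    r"Failure registering to maintenance_mode"
)


def parse_ts(ts, year):
    """Parse a stamp such as 'Apr 12 00:30:16:588'; the year must be supplied."""
    return datetime.strptime(f"{year} {ts}", "%Y %b %d %H:%M:%S:%f")


def failure_window(lines, year):
    """Return (first, last, count) of registration failures, or None if none."""
    stamps = []
    for line in lines:
        match = FAILURE_RE.match(line)
        if match:
            stamps.append(parse_ts(match.group("ts"), year))
    if not stamps:
        return None
    return min(stamps), max(stamps), len(stamps)


# Shortened copies of the two log lines above, plus one unrelated line.
SAMPLE_LINES = [
    "Apr 12 00:30:16:588 WARN [attach_socket, com.nimsoft.monitor.probe."
    "MaintenanceModeProbe] Failure registering to maintenance_mode.",
    "Apr 12 00:41:00:000 INFO unrelated message",
    "Apr 12 00:52:03:871 WARN [attach_socket, com.nimsoft.monitor.probe."
    "MaintenanceModeProbe] Failure registering to maintenance_mode.",
]

window = failure_window(SAMPLE_LINES, year=2024)
```

Alarms that leaked through can then be compared against the returned window to confirm that they fall inside the outage.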
A new key was introduced that can work around this issue in similar scenarios.
The parameter "maint_sched_discard" lets you decide whether the NAS discards the maintenance schedule.
Valid values are yes and no. A value of no means that the maintenance schedule is retained.
The key is located in the setup section of the nas probe and is set via Raw Configure.
The default is:
maint_sched_discard = yes
Reference: nas (Alarm Server) Release Notes
To retain the schedules, set it to no:
maint_sched_discard = no
With this setting, the maintenance mode schedules are not discarded when maintenance_mode is unreachable.
Make the following adjustments to the nas and ems probes:
Open the nas probe in Raw Configure mode and set the following parameters under the <setup> section:
maint_max_resp_time = 50
registrationIntervalLookAheadMinutes = 60
Open the ems probe in Raw Configure mode and set the following parameter under the <setup> section:
maintenance_mode_cmd_timeout = 300000
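After making these changes, it can be useful to confirm that an exported copy of the configuration actually contains the intended values. The sketch below is an illustration only: it assumes the probe configuration uses the `<section>` ... `</section>` and `key = value` layout seen in this article, and `parse_cfg`, `setup_mismatches`, and `RECOMMENDED_NAS_SETUP` are hypothetical names, not part of any product API.

```python
import re

SECTION_OPEN = re.compile(r"^<([^/<>][^<>]*)>$")
SECTION_CLOSE = re.compile(r"^</([^<>]+)>$")

# Values suggested in this article for the nas <setup> section.
RECOMMENDED_NAS_SETUP = {
    "maint_sched_discard": "no",
    "maint_max_resp_time": "50",
    "registrationIntervalLookAheadMinutes": "60",
}


def parse_cfg(text):
    """Parse nested <section> blocks of 'key = value' lines into dicts."""
    root = {}
    stack = [root]
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if SECTION_CLOSE.match(line):
            if len(stack) > 1:
                stack.pop()
            continue
        opened = SECTION_OPEN.match(line)
        if opened:
            child = stack[-1].setdefault(opened.group(1), {})
            stack.append(child)
            continue
        key, sep, value = line.partition("=")
        if sep:
            stack[-1][key.strip()] = value.strip()
    return root


def setup_mismatches(cfg_text, recommended):
    """Return {key: (actual, wanted)} for <setup> keys that differ or are missing."""
    setup = parse_cfg(cfg_text).get("setup", {})
    return {
        key: (setup.get(key), wanted)
        for key, wanted in recommended.items()
        if setup.get(key) != wanted
    }
```

The same function can be pointed at a copy of the ems configuration by passing `{"maintenance_mode_cmd_timeout": "300000"}` as the `recommended` argument.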
Increase the Java heap memory of the maintenance_mode probe.
Change
<startup>
options = -Xms512m -Xmx1024m
</startup>
to
<startup>
options = -Xms1024m -Xmx2048m
</startup>
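If the same heap change has to be applied to several options strings, the substitution can be scripted. This is a minimal sketch under the assumption that the options value is a space-separated list of JVM flags as shown above; `set_heap` is a hypothetical helper and only rewrites the string, it does not touch any probe.

```python
import re


def set_heap(options, xms, xmx):
    """Return the options string with -Xms/-Xmx replaced, or appended if absent."""
    for flag, size in (("-Xms", xms), ("-Xmx", xmx)):
        pattern = re.compile(rf"{flag}\d+[kKmMgG]?")
        if pattern.search(options):
            options = pattern.sub(f"{flag}{size}", options)
        else:
            options = f"{options} {flag}{size}".strip()
    return options


new_options = set_heap("-Xms512m -Xmx1024m", "1024m", "2048m")
```

Other flags in the string are left untouched, so the helper can be used on options lines that carry additional JVM arguments.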
This change may also help with robot_inactive alarms that are raised as false alerts during maintenance schedules.
Lastly, on the primary hub, set the 'start_after' parameter to maintenance_mode in the alarm_enrichment section of controller.cfg:
<alarm_enrichment>
description = Alarm Enrichment Server
group = Infrastructure
active = yes
type = daemon
command = <startup java>
arguments = -Djava.library.path="../../../../lib" -Dfile.encoding=UTF-8 -jar ../lib/alarm_enrichment.jar
config = alarm_enrichment.cfg
logfile = alarm_enrichment.log
workdir = probes/service/nas/alarm_enrichment
start_after = maintenance_mode
magic_key = r1kIzNSjeF8I6TNS0HoPGWc2+qakLUaVEQJmD+LLgbnp+a7zhCJO7IKGoggaDvieil2aMe7rHQUYgodJVJJKbg==
</alarm_enrichment>
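The start_after edit can also be expressed as a small text transformation, which may help when reviewing the change on a backup copy of the file before applying it by hand. The sketch below assumes the section layout shown above; `ensure_start_after` is a hypothetical helper that works on a string and never writes to controller.cfg itself.

```python
def ensure_start_after(cfg_text, section="alarm_enrichment",
                       value="maintenance_mode"):
    """Return cfg_text with 'start_after = value' set inside the given section."""
    out = []
    in_section = False
    found = False
    for line in cfg_text.splitlines():
        stripped = line.strip()
        if stripped == f"<{section}>":
            in_section, found = True, False
        elif stripped == f"</{section}>" and in_section:
            if not found:
                # Key was missing: add it just before the closing tag.
                out.append(f"start_after = {value}")
            in_section = False
        elif in_section and stripped.split("=", 1)[0].strip() == "start_after":
            # Key exists: overwrite its value, keeping the line's indentation.
            indent = line[: len(line) - len(line.lstrip())]
            line = f"{indent}start_after = {value}"
            found = True
        out.append(line)
    return "\n".join(out)
```

Running the function twice yields the same text, so it is safe to use as a check: if the output equals the input, the parameter is already set as described.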