Smarts SAM: Notification Console receives large amounts of Card and Interface DOWN alarm notifications from Alcatel-Lucent devices all at the same time

Products

VMware

Issue/Introduction

Symptoms:

Smarts Service Assurance Manager (Smarts SAM) Notification Console receives large amounts of Card and Interface DOWN alarm notifications from Alcatel-Lucent devices all at the same time
All of the notifications from Alcatel-Lucent devices appear at the same time on the Smarts SAM Notification Console

All of the notifications from Alcatel-Lucent devices have the same Source IP-AM/PM Domain

All notifications appear at the same time from Alcatel-Lucent devices which are imported into IP-AM/PM from the Smarts Alcatel 5620 SAM Adapter (ASAM)

Smarts SAM Notification Console receives large amounts of Card and Interface DOWN alarm notifications from Alcatel-Lucent devices and all of the devices are already DOWN

A disconnect message similar to the following is found in the Smarts Alcatel 5620 SAM Adapter (ASAM) log file:

[June 21, 2012 12:23:21 AM GMT+01:00 +004ms] t@112 SM_SocketObserver-5 8575 InCharge-AM-PM Remote Accessor #1
CI-E-EWHILE-While executing function "readableNow"
CI-EFLOWID-For flow CI_FlowBufferedHead_U [observer for client 5 8575 InCharge-AM-PM Remote Accessor] HEAD|BUFFERED @0xffffffff0ad06580
. Read buffer, 0 bytes available of 2145
. ?3?2A2A2A2A2A2A2A2A 2A2A2A2A2A2A2A2A ^|2A00000000000000 000000002A2A2A2A
. Write buffer, 0 bytes written of 2048
. ?3?[^0000000100000040 4FE25B6900000009 534D5F5379737465 6D00000009534D2D
. ->CI_FlowAES_CBC_U [observer for client 5 8575 InCharge-AM-PM Remote Accessor] IN_FLOW|BLOCK @0xffffffff12ea8270
. ->CI_FlowTCP_U [observer for client 5 8575 InCharge-AM-PM Remote Accessor] IN_FLOW|PHYSICAL @0xfffffffdf4618760
. *:v4:31300 KS N/A, KR N/A
. Open fd=104, conn June 19, 2012 6:56:07 PM GMT+01:00, disc N/A,
. 192.168.0.98:31300 -> 192.168.0.104:45935, tmo 9344 09:18:01 N/S 1/1
CI-EWHILEREAD-After reading "0" bytes of "15" maximum
<SYS>-ECONNRESET-Connection reset by peer; in file "/work/blackcurrent/DMT-9.0.0.X/1330/smarts/clsapi/ci_flow.c" at line 2503

"<SYS>-ECONNRESET-Connection reset by peer" - indicated the connection was closed by the remote peer i.e the IP-AM-PM domain.

Environment

VMware Smart Assurance - SMARTS

Cause

Alcatel-Lucent devices in an IP-AM/PM topology get their Card and Interface Status attributes directly from the Smarts Alcatel 5620 SAM Adapter (ASAM). As a result, a disconnect in the remote repository accessor between the IP-AM/PM repository and the ASAM repository causes the Card/Interface Status attribute of ALL Alcatel-Lucent devices in the IP-AM/PM topology to be set to UNKNOWN. When the connection is restored, each Card/Interface Status attribute is restored to its original value. Note the following about this behavior:

If the Card/Interface had a status of UP before the disconnect, it will be set back to UP. This does not result in any notifications being generated.
If the Card/Interface had a status of DOWN before the disconnect, it will be set back to DOWN. This will result in a DOWN re-notify for every Card or Interface that was previously down.
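
The two cases above can be illustrated with a simplified model. The following Python sketch is not Smarts code: it assumes that a DOWN notification is raised whenever the Status attribute changes to DOWN from any other value, and the element names are hypothetical.

```python
UNKNOWN = "UNKNOWN"


def down_notifications(timeline):
    """Count the transitions into DOWN in an ordered list of Status values."""
    count = 0
    previous = None
    for status in timeline:
        if status == "DOWN" and previous != "DOWN":
            count += 1
        previous = status
    return count


def renotified_after_reconnect(status_before):
    """Return the elements that raise a fresh DOWN after one disconnect/reconnect."""
    renotified = []
    for name, status in sorted(status_before.items()):
        # During the disconnect the Status is UNKNOWN; on reconnect it
        # reverts to the value it had before.
        timeline = [status, UNKNOWN, status]
        # The first DOWN is the original alarm; any further one is a re-notify.
        if down_notifications(timeline) > 1:
            renotified.append(name)
    return renotified


# Hypothetical elements: only those that were already DOWN are re-notified.
before = {"Card-1": "UP", "Card-2": "DOWN", "IF-3": "DOWN"}
print(renotified_after_reconnect(before))
```

In this model an element that was UP never produces a DOWN transition, while an element that was DOWN produces a second one after the reconnect, which matches the bulk re-notify described above.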

The remote repository accessor makes two connections from IP-AM/PM to ASAM. One is a regular two-way command/response connection, and the other is a logical one-way connection used to push subscription alerts to IP-AM/PM. The "timeout" parameter applies to the two-way command/response connection, which is used to make certain remote API calls. Timeouts are required because one end of the connection needs to know whether the other end has gone into a deadlock or an unresponsive state. However, if the remote server is busy, for example with discovery, reconfigure, post-processing, or other operations that cause heavy contention for the repository lock, a higher timeout value is needed. Although the two channels are separate, an error (such as a timeout) on one channel causes both connections to be closed and reconnected, which is why increasing the timeout may be necessary. This requirement is more likely in an environment where multiple IP-AM/PM domains subscribe to a single ASAM domain. Note that the timeout value does not apply to the "logical" subscription channel: any status changes at ASAM are still sent to IP-AM/PM immediately.

Resolution

You may avoid the disconnects described above by increasing the remote repository accessor timeout value in the IP-AM/PM bootstrap (see Note statement). The default value for this timeout is 30 (seconds). For environments that encounter this issue and need to increase the accessor timeout value, a recommended value is 300. This is done as follows:

Open the bootstrap-am-pm.conf configuration file in sm_edit:

sm_edit /local/conf/icf/bootstrap-am-pm.conf
Find the MR_RemoteReposInterface section and change the "timeout" value to 300, as in the following:

MR_RemoteReposInterface::MR-RemoteReposInterface {
allow_on_demand_gets = TRUE
timeout = 300
observer_timeout = 600
debug = FALSE
}
Restart the IP-AM/PM to apply the change in the active environment. The accessor timeout value is only read on startup.
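
Before restarting, the edited value can be double-checked with a small script. The following Python sketch assumes that the MR_RemoteReposInterface section has the same layout as the excerpt above (one "name = value" setting per line inside braces); the function name accessor_timeout is illustrative and is not part of Smarts.

```python
import re


def accessor_timeout(conf_text):
    """Return the integer "timeout" of the MR_RemoteReposInterface section, or None."""
    # Assumption: the section looks like the excerpt above, with no nested braces.
    block = re.search(r"MR_RemoteReposInterface::[^{]*\{(?P<body>[^}]*)\}", conf_text)
    if block is None:
        return None
    # Anchor at the start of a line so that "observer_timeout" is not matched.
    value = re.search(r"^\s*timeout\s*=\s*(\d+)\s*$", block.group("body"), re.MULTILINE)
    return int(value.group(1)) if value else None


sample = """MR_RemoteReposInterface::MR-RemoteReposInterface {
allow_on_demand_gets = TRUE
timeout = 300
observer_timeout = 600
debug = FALSE
}
"""
print(accessor_timeout(sample))
```

To check the real file, read the contents of bootstrap-am-pm.conf into a string and pass it to accessor_timeout.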

Additional Information

To diagnose the Card/Interface Status "DOWN -> UNKNOWN -> DOWN -> Re-Notify" behavior, run the following sm_adapter property subscriptions against both the IP-AM/PM domain and the ASAM domain, and analyze the output:

Card:

1. sm_adapter -s <domain> -b <broker> --subscribe=Card::.*::.*/pae > <domain>-CardAlarmlist.log
2. sm_adapter -s <domain> -b <broker> --subscribeProp=Card::.*::Status > <domain>-CardStatuslist.log
3. sm_adapter -s <domain> -b <broker> --subscribeProp=<Instrumentation Class Name>::.*::StatusFromPoll > <domain>-StatusFromPoll.log
4. sm_adapter -s <domain> -b <broker> --subscribeProp=<Instrumentation Class Name>::.*::StatusIsCriticalActive > <domain>-StatusIsCriticalActive.log

   Note: The sm_adapter subscriptions to the "StatusFromPoll" and "StatusIsCriticalActive" attributes require the name of the Card instrumentation class.
   First find the instrumentation class of the Cards that are producing the false alarms.
   You can get this using:

dmctl -s <Domain Name:NGNIP-APM2> -b <broker> get Card::<any Card instance having issue>::InstrumentedBy
This returns { <Instrumentation Class Name>::<Instrumentation Class Instance name> }

Interface:

1. sm_adapter -s <domain> -b <broker> --subscribe=Interface::.*::.*/pae > <domain>-AlarmlistInterface.log
2. sm_adapter -s <domain> -b <broker> --subscribeProp=Interface::.*::Status > <domain>-StatusInterface.log
3. sm_adapter -s <domain> -b <broker> --subscribeProp=Interface::.*::OperStatus > <domain>-OperStatusInterface.log

Let these subscriptions run until an occurrence of the bulk Card/Interface DOWN alarms is observed.

Then compare the IP-AM/PM and ASAM subscription output for the devices that re-notified.

When this issue is the cause, the IP-AM/PM subscriptions show the Card/Interface Status changing from DOWN to UNKNOWN and then back to DOWN, whereas the ASAM subscriptions show the Card/Interface Status remaining DOWN throughout.
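
Once the Status changes have been extracted from the subscription logs, this comparison can be automated. The following Python sketch makes several assumptions: the changes have already been parsed into (element, status) pairs in arrival order (the sm_adapter output format itself is not parsed here), element names are the same in both domains, and all function and element names are hypothetical.

```python
from collections import defaultdict

PATTERN = ["DOWN", "UNKNOWN", "DOWN"]


def flapped_through_unknown(events):
    """Return the elements whose Status went DOWN -> UNKNOWN -> DOWN."""
    history = defaultdict(list)
    for element, status in events:
        # Keep only real changes: collapse consecutive duplicate values.
        if not history[element] or history[element][-1] != status:
            history[element].append(status)
    flagged = []
    for element, seq in history.items():
        if any(seq[i:i + 3] == PATTERN for i in range(len(seq) - 2)):
            flagged.append(element)
    return sorted(flagged)


def suspect_renotifies(ip_events, asam_events):
    """Elements that flapped in IP-AM/PM while ASAM reported only DOWN."""
    asam_seen = defaultdict(set)
    for element, status in asam_events:
        asam_seen[element].add(status)
    return [e for e in flapped_through_unknown(ip_events)
            if asam_seen.get(e) == {"DOWN"}]


# Hypothetical, hand-written events for two elements.
ip_events = [("Card-1", "DOWN"), ("IF-2", "UP"),
             ("Card-1", "UNKNOWN"), ("IF-2", "UNKNOWN"),
             ("Card-1", "DOWN"), ("IF-2", "UP")]
asam_events = [("Card-1", "DOWN"), ("IF-2", "UP")]
print(suspect_renotifies(ip_events, asam_events))
```

Elements returned by suspect_renotifies are candidates for re-notifies caused by an accessor disconnect rather than by a genuine status change on the device.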