Following a manual failover to a backup SAM lots of notifications CLEAR

Article ID: 315840

Products

VMware Smart Assurance Network Observability

Issue/Introduction

Use-Case:

  • Customers can invoke a manual failover using the ic-failover-server script.
  • This means that both Site A and B SAM domains are running and still processing notifications.  
  • The subscription between the two domains is still active and has not changed: the former Standby SAM is still subscribed to the former Active SAM. Because failover takes time to "Promote" the newly made Active domain and "Demote" the newly made Standby, notification changes continue to flow in the same direction as before the manual failover was invoked.

Symptoms:

  • Following a manual failover, all active alarms in the newly made Standby SAM clear, and these clears are also sent to the newly made Active SAM domain. Once Failover has set up the newly made Active SAM's subscriptions to the underlying domains and to its Standby counterpart, the Promotion and Demotion of the SAM domains are complete. The underlying domains still show the notifications as active, so they are reported to the Promoted SAM domain again; the audit log records a DXA Notify and the last change time is updated.
  • This results in what looks like a notification flood.  
  • Users that have third party ticketing agents that act upon any new notification may experience issues when performing a manual failover through the ic-failover-server script.
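
To gauge how many notifications were affected, the clear followed by notify pattern can be searched for in a SAM audit log. The following is a minimal sketch, not a supported tool. It assumes the audit line layout seen in the examples in the Cause section (notification name, class, instance, event, sequence number, user, action) and that none of those fields contain spaces. The function name find_clear_then_notify is hypothetical.

```python
import re

# Assumed layout, taken from the audit examples in this article:
# ... NOTIFICATION-<name> <class> <instance> <event> <seq> <user> <ACTION> <text>
AUDIT_RE = re.compile(
    r"(?P<name>NOTIFICATION-\S+)\s+"
    r"(?P<cls>\S+)\s+(?P<inst>\S+)\s+(?P<event>\S+)\s+"
    r"(?P<seq>\d+)\s+(?P<user>\S+)\s+(?P<action>[A-Z][A-Z-]*)\b"
)


def find_clear_then_notify(lines):
    """Return notification names that were cleared and later notified again."""
    last_action = {}   # notification name -> last CLEAR/NOTIFY action seen
    flapped = []       # names in the order the re-notify was first seen
    for line in lines:
        m = AUDIT_RE.search(line)
        if not m:
            continue   # not an audit line in the assumed layout
        name, action = m["name"], m["action"]
        if action == "NOTIFY" and last_action.get(name) == "CLEAR":
            if name not in flapped:
                flapped.append(name)
        # Other actions (SOURCE-NOTIFY, ACKNOWLEDGE, ARCHIVE) are ignored.
        if action in ("CLEAR", "NOTIFY"):
            last_action[name] = action
    return flapped
```

Passing the lines of an audit file that covers the failover window returns the notifications that were cleared and then re-notified, which is the set a ticketing agent would have treated as new.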

Environment

All Supported Smarts releases

Cause

  • The failover of the primary SAM server includes a Detach Sibling action.
    Detach Sibling is the action in which the sibling SAM domain is disabled in the active SAM server.
    Example:
failover_actions: <Date-Time>; Failover from 'SAM-Primary' to 'SAM-Secondary' successful.
......
detach_sibling: <Date-Time>; Detach Sibling begin.
detach_sibling: <Date-Time>; Domain 'SAM-Primary', was successfully disabled in server 'SAM-Secondary'.
......
detach_sibling: <Date-Time>; 'SAM-Secondary' was successfully disabled in server 'SAM-Primary'.
detach_sibling: <Date-Time>; Detach Sibling complete.
  • The primary SAM audit log shows these types of events:
<epoch> <Date-Time> NOTIFICATION-Interface_FaultMIB2_I-InterfaceFault_MIB2-IF-servername-/<index>_BackupActivated Interface_Fault_MIB2 I-Interface_Fault_MIB2-IF-servername/<index> BackupActivated 21 SYSTEM CLEAR <AMPM Domain>: Domain Server Deleted.
<epoch> <Date-Time> NOTIFICATION-Interface_FaultMIB2_I-InterfaceFault_MIB2-IF-servername-/<index>_BackupActivated Interface_Fault_MIB2 I-Interface_Fault_MIB2-IF-servername/<index> BackupActivated 22 DXA NOTIFY Server: SAM-Primary
  • In the notification audit log above, the failover disables the domains in the old active server, and the notifications from those domains are cleared. The message "SYSTEM CLEAR <AMPM Domain>: Domain Server Deleted." indicates this.
    Because the secondary SAM is still subscribed to this SAM, the clear is sent to the secondary SAM.
    Later, the new active SAM (1060vpap - SAM) notifies it again.
  • The secondary SAM audit log shows this:
<epoch> <Date-Time> NOTIFICATION-Interface_FaultMIB2_I-InterfaceFault_MIB2-IF-servername/233_BackupActivated Interface_Fault_MIB2 I-Interface_Fault_MIB2-IF-servername/<index> BackupActivated 38 DXA SOURCE-NOTIFY Server: SAM-Secondary
<epoch> <Date-Time> NOTIFICATION-Interface_FaultMIB2_I-InterfaceFault_MIB2-IF-servername/233_BackupActivated Interface_Fault_MIB2 I-Interface_Fault_MIB2-IF-servername/<index> BackupActivated 39 DXA CLEAR Server: SAM-Secondary
<epoch> <Date-Time> NOTIFICATION-Interface_FaultMIB2_I-InterfaceFault_MIB2-IF-servername/233_BackupActivated Interface_Fault_MIB2 I-Interface_Fault_MIB2-IF-servername/<index> BackupActivated 40 DXA NOTIFY Server: AMPM-Domain
<epoch> <Date-Time> NOTIFICATION-Interface_FaultMIB2_I-InterfaceFault_MIB2-IF-servername/233_BackupActivated Interface_Fault_MIB2 I-Interface_Fault_MIB2-IF-servername/<index> BackupActivated 41 DXA CLEAR Server: AMPM-Domain
<epoch> <Date-Time> NOTIFICATION-Interface_FaultMIB2_I-InterfaceFault_MIB2-IF-servername/233_BackupActivated Interface_Fault_MIB2 I-Interface_Fault_MIB2-IF-servername/<index> BackupActivated 42 SYSTEM ACKNOWLEDGE Auto-acknowledged
<epoch> <Date-Time> NOTIFICATION-Interface_FaultMIB2_I-InterfaceFault_MIB2-IF-servername/233_BackupActivated Interface_Fault_MIB2 I-Interface_Fault_MIB2-IF-servername/<index> BackupActivated 43 SYSTEM ARCHIVE Auto-archived
  • Once the failover is complete, the failover actions log and the failover manager log also report that the formerly active SAM has been "Demoted" and the newly active SAM has been "Promoted".
  • Review these two logs to confirm that the SAM failover has completed and that the subscriptions have been updated.
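
The log review can be partly automated. The sketch below scans failover actions log lines for the three messages shown in the Detach Sibling example above. It assumes the "component: timestamp; message" layout of that example. The exact wording of the "Promoted" and "Demoted" entries is not shown in this article, so the sketch does not try to match them. The function name summarize_failover is hypothetical.

```python
import re

# Assumed layout from the example above: "<component>: <Date-Time>; <message>"
LINE_RE = re.compile(r"^\s*(?P<component>\w+):\s*(?P<ts>[^;]*);\s*(?P<msg>.*)$")
FAILOVER_RE = re.compile(r"Failover from '([^']+)' to '([^']+)' successful")


def summarize_failover(lines):
    """Report which failover milestones appear in the given log lines."""
    summary = {"failover": None, "detach_begin": False, "detach_complete": False}
    for line in lines:
        m = LINE_RE.match(line)
        if not m:
            continue  # separator or unrelated line
        component, msg = m["component"], m["msg"]
        if component == "failover_actions":
            found = FAILOVER_RE.search(msg)
            if found:
                # (old active domain, new active domain)
                summary["failover"] = (found.group(1), found.group(2))
        elif component == "detach_sibling":
            if "Detach Sibling begin" in msg:
                summary["detach_begin"] = True
            elif "Detach Sibling complete" in msg:
                summary["detach_complete"] = True
    return summary
```

A result with "failover" set and "detach_complete" true indicates only that these messages were logged; the promotion, demotion, and subscription changes still need to be confirmed in the logs themselves.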

Resolution

  • Smarts is working as designed. The explanation is as follows:
    1. Both SAM domains are still running and processing notifications.
    2. The underlying domains are disabled in the old active SAM, which clears the alarms there. The old active SAM feeds these clears to the new active SAM until it is demoted and the subscription between the two sites for the SAM domain changes direction.
    3. The domains are enabled in the newly promoted active SAM, so it fetches alarms from its active underlying domains and the cleared alarms become active again. The audit log reports a DXA Notify and the last change time is updated.
  • In a forced (manual) failover, this clear and notify sequence therefore occurs in the new active SAM server. In a failover where the active SAM goes down, the clear does not happen, because the old active SAM is unavailable and no notifications flow from it to the new active SAM.
  • If this is impacting your third-party ticketing agent, the workaround is to mimic the failover process by shutting down the Active SAM domain service and allowing the Failover Manager time to fully process the failover and Promote the Standby to Active.
  • Should you need to fail over the broker as well, do this first, and make sure that the broker has properly failed over and the Standby broker is promoted to Active before proceeding with any domain shutdowns.
  • If you have a hierarchical SAM environment and want to take down all your active SAM domains for a maintenance task without the notification clear and renotify caused by a manual failover, it is recommended that you take down one SAM domain at a time and confirm that its Standby is promoted to Active before continuing with the next SAM domain.
  • Note: 
    • This can be a time-consuming process if there are multiple layers of SAM domains involved.
    • This procedure applies to the SAM domains only.
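
The one-domain-at-a-time procedure above can be sketched as a small loop. This is an illustration only: stop_active and standby_is_active are hypothetical callables that you would supply (for example, wrappers around your own service control and log checks); they are not Smarts commands or APIs. Any broker failover is assumed to have been completed beforehand, as described above.

```python
import time


def fail_over_one_at_a_time(domains, stop_active, standby_is_active,
                            timeout_s=600.0, poll_s=5.0,
                            sleep=time.sleep, clock=time.monotonic):
    """Stop each active SAM domain in turn, waiting for its Standby to be promoted.

    stop_active(domain) and standby_is_active(domain) are hypothetical,
    caller-supplied functions. Raises TimeoutError if a Standby is not
    promoted within timeout_s seconds, so later domains are left untouched.
    """
    completed = []
    for domain in domains:
        stop_active(domain)                    # shut down the Active SAM domain service
        deadline = clock() + timeout_s
        while not standby_is_active(domain):   # give the Failover Manager time to promote
            if clock() >= deadline:
                raise TimeoutError(f"Standby for {domain} was not promoted in time")
            sleep(poll_s)
        completed.append(domain)               # only now move on to the next domain
    return completed
```

Stopping at the first timeout is a deliberate choice in this sketch: it keeps the remaining SAM domains running until the stalled promotion has been investigated.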