Smarts IP: What is the difference between an Unresponsive alert and a Down alert in Smarts IP Manager?

Products

VMware

Issue/Introduction

What is the difference between an Unresponsive alert and a Down alert in Smarts IP Manager?

Environment

VMware Smart Assurance - SMARTS

Resolution

In Smarts IP, an Unresponsive alert represents an event, while a Down alert represents a problem. An event is evaluated from the attribute values presented, while a problem is evaluated by the correlation logic, which decides whether the problem should be active or not. Whether something represents a "problem" is decided based on the symptoms active at the time the correlator runs.

Additional Information

Unresponsive:
************
When a system is unresponsive, it is because it does not respond when the device is queried. The logic for deciding that a system is unresponsive varies with the value of the AccessMode attribute:

ICMPONLY: No response for ICMP.
SNMPONLY: No response for SNMP request.
ICMPSNMP: No response for ICMP and SNMP requests.

In all three cases, every IP address in the Smarts IP AddressList is attempted. If the device responds on any one of the IPs in the AddressList, IsUnresponsive is set to false.
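
The AccessMode rules above can be sketched in Python. This is a simplified illustration, not Smarts code: the function names and the shape of `probe_results` (one ICMP result and one SNMP result per address) are assumptions made for the example.

```python
from enum import Enum


class AccessMode(Enum):
    ICMPONLY = "ICMPONLY"
    SNMPONLY = "SNMPONLY"
    ICMPSNMP = "ICMPSNMP"


def ip_responds(access_mode, icmp_ok, snmp_ok):
    """Return True if one address counts as responsive under the access mode."""
    if access_mode is AccessMode.ICMPONLY:
        return icmp_ok
    if access_mode is AccessMode.SNMPONLY:
        return snmp_ok
    # ICMPSNMP: unresponsive only when both ICMP and SNMP get no response.
    return icmp_ok or snmp_ok


def device_is_unresponsive(access_mode, probe_results):
    """probe_results (hypothetical) maps each AddressList IP to (icmp_ok, snmp_ok).

    A response on any single IP is enough to make the device responsive.
    """
    return not any(
        ip_responds(access_mode, icmp_ok, snmp_ok)
        for icmp_ok, snmp_ok in probe_results.values()
    )
```

For example, with `{"10.0.0.1": (False, False), "10.0.0.2": (True, False)}` the device is responsive under ICMPONLY and ICMPSNMP, but unresponsive under SNMPONLY.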

DOWN:
*********
Down is a problem. Smarts IP shows a device as down based on the symptom set available at a specific instant. A symptom set consists of many events. Smarts uses a separate correlation algorithm to compute and conclude that a device is down. The computation is based on the symptom set available.
Consider an example where a device was completely down for 5 days. System Unresponsive and System Down notifications were generated as follows:

18-Aug-2012 12:59:49 GMT NOTIFICATION-Firewall _Unresponsive NOTIFY
18-Aug-2012 13:03:42 GMT NOTIFICATION-Firewall _Down NOTIFY
18-Aug-2012 13:07:16 GMT NOTIFICATION-Firewall _Down CLEAR
23-Aug-2012 13:52:04 GMT NOTIFICATION-Firewall _Unresponsive CLEAR

The System Unresponsive notification reflects the true outage duration. The System Down notification lasted less than 4 minutes (13:03:42 to 13:07:16). Why didn't the System Down persist for the same time period as the System Unresponsive?

Smarts reruns correlation at every unit interval of time, and each run uses only the set of symptoms present at that moment to calculate which systems are "DOWN". A later correlation run may therefore conclude that a relaying device has moved from "MightBeDown" to "DOWN", which clears the "DOWN" notification from the device that was reported "DOWN" earlier. In this example, the System "DOWN" notification lasted less than 4 minutes because, within those 4 minutes, some other relaying device may have been reporting "MightBeDown" and was taken into account in the new correlation calculation. Separately, a system is reported "UNRESPONSIVE" based on its attribute values, which is why that notification tracked the full outage.
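
The two durations in the example can be checked with a few lines of Python. This is only an arithmetic check on the timestamps listed above; the `duration` helper is a name introduced for the example, and `%b` assumes an English locale for the month abbreviation.

```python
from datetime import datetime

FMT = "%d-%b-%Y %H:%M:%S"  # e.g. 18-Aug-2012 12:59:49


def duration(start, end):
    """Return the time between a NOTIFY and its matching CLEAR."""
    return datetime.strptime(end, FMT) - datetime.strptime(start, FMT)


unresponsive = duration("18-Aug-2012 12:59:49", "23-Aug-2012 13:52:04")
down = duration("18-Aug-2012 13:03:42", "18-Aug-2012 13:07:16")

print(unresponsive)  # 5 days, 0:52:15
print(down)          # 0:03:34
```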

In addition, Smarts defines an Unresponsive alert and a Down alert using the following logic:

A Down event in Smarts IP is essentially calculated as follows:

if IsUnresponsive &&
(IsAnyNeighborRelayDeviceResponsive || HasNoPartition);
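
As a rough illustration, the expression above maps directly onto a boolean function. The Python below is a sketch for reasoning about the inputs, not the Smarts implementation; the parameter names simply mirror the attribute names in the expression.

```python
def is_down(is_unresponsive, is_any_neighbor_relay_device_responsive, has_no_partition):
    """Mirror of: IsUnresponsive && (IsAnyNeighborRelayDeviceResponsive || HasNoPartition)."""
    return is_unresponsive and (
        is_any_neighbor_relay_device_responsive or has_no_partition
    )
```

Under this reading, an unresponsive device for which both `is_any_neighbor_relay_device_responsive` and `has_no_partition` are false is not reported as Down.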

An Unresponsive event is determined based on the results of IsUnresponsive calculation, which is calculated as follows:

IsUnresponsive=TRUE - Indicates that the system is not responding to ICMP pings or SNMP polls.
IsUnresponsive=FALSE - Indicates that the system is responding to ICMP pings and/or SNMP polls.
The logic for determining the TRUE/FALSE condition of IsUnresponsive is as follows:

IsUnresponsive = (IsManaged &&
(IsEveryIPUnresponsive &&
IsEveryIPv6Unresponsive &&
(HasFaultInstrumentedSAP || AccessMode == SNMPONLY)) &&
(IsEveryServiceUnresponsive &&
(HasFaultInstrumentedService || AccessMode == ICMPONLY))
) else FALSE;
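
The same expression can be written as a Python function to make the grouping easier to follow. This is a sketch under two assumptions: the trailing `else FALSE` is read as "the result is FALSE whenever the condition does not hold" (for example, for an unmanaged system), and AccessMode is passed as a plain string. The function and parameter names are introduced for the example.

```python
def compute_is_unresponsive(
    is_managed,
    is_every_ip_unresponsive,
    is_every_ipv6_unresponsive,
    has_fault_instrumented_sap,
    is_every_service_unresponsive,
    has_fault_instrumented_service,
    access_mode,
):
    """Evaluate the IsUnresponsive expression from its input attributes."""
    if not is_managed:
        return False

    # IP side: every IPv4 and IPv6 address is unresponsive, and either the
    # access points are fault-instrumented or the mode is SNMPONLY.
    ip_side = (
        is_every_ip_unresponsive
        and is_every_ipv6_unresponsive
        and (has_fault_instrumented_sap or access_mode == "SNMPONLY")
    )

    # Service side: every service is unresponsive, and either the services
    # are fault-instrumented or the mode is ICMPONLY.
    service_side = is_every_service_unresponsive and (
        has_fault_instrumented_service or access_mode == "ICMPONLY"
    )

    return ip_side and service_side
```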

Some of the attributes that define IsUnresponsive are computed, but the first few can be seen directly in the console.

NOTE: For a down event to be computed, the device goes through the following transitions:

Device is:
1: Unresponsive (not mandatory)
2: MightBeDown
3: DOWN

Unresponsive ---> When all the IP addresses and the SNMP agent are unresponsive.

MightBeDown ---> When it is Unresponsive and has at least one responding neighbor (i.e. IsEveryNeighborUnresponsive is set to false).

DOWN ---> When it is MightBeDown, and the NeighboringSystemsMightBeDown symptom and other symptoms (connected neighbors' status) are checked to deduce DOWN.