Smarts IP: Alarm is reported on the wrong disk

Article ID: 332216

Updated On:

Products

VMware Smart Assurance

Issue/Introduction

Symptoms:


NOTE: This article and its explanations rely heavily on output from the open source utility SNMPWALK. To find out how to use this utility to probe the MIB data of an SNMP agent, see this publicly available wiki link.

Problem Description: In testing, a customer found that when they pull a disk from a server, SMARTS reports that the wrong disk was pulled. This article walks through the diagnosis of this symptom and explains how we determined that it is an SNMP agent issue, not a SMARTS issue.


Environment

VMware Smart Assurance - SMARTS

Cause

The following SNMPWALK output is for the table at OID .1.3.6.1.4.1.9.9.719.1.45.4.1.

Here is the normal output with both disks inserted:

.1.3.6.1.4.1.9.9.719.1.45.4.1.2.1: sys/rack-unit-1/board/storage-SAS-SLOT-3/pd-1
.1.3.6.1.4.1.9.9.719.1.45.4.1.2.2: sys/rack-unit-1/board/storage-SAS-SLOT-3/pd-2
.1.3.6.1.4.1.9.9.719.1.45.4.1.3.1: pd-1
.1.3.6.1.4.1.9.9.719.1.45.4.1.3.2: pd-2
.1.3.6.1.4.1.9.9.719.1.45.4.1.4.1: 0
.1.3.6.1.4.1.9.9.719.1.45.4.1.4.2: 0
.1.3.6.1.4.1.9.9.719.1.45.4.1.5.1: 1
.1.3.6.1.4.1.9.9.719.1.45.4.1.5.2: 1
.1.3.6.1.4.1.9.9.719.1.45.4.1.6.1: 1
.1.3.6.1.4.1.9.9.719.1.45.4.1.6.2: 2
.1.3.6.1.4.1.9.9.719.1.45.4.1.7.1: HDD
.1.3.6.1.4.1.9.9.719.1.45.4.1.7.2: HDD
.1.3.6.1.4.1.9.9.719.1.45.4.1.8.1: 512
.1.3.6.1.4.1.9.9.719.1.45.4.1.8.2: 512
.1.3.6.1.4.1.9.9.719.1.45.4.1.9.1: 1
.1.3.6.1.4.1.9.9.719.1.45.4.1.9.2: 1

Both disks report a status of 1:
.1.3.6.1.4.1.9.9.719.1.45.4.1.9.1: 1
.1.3.6.1.4.1.9.9.719.1.45.4.1.9.2: 1

During the rebuild, both disk 1 and disk 2 are still listed. Disk 2 is normal (status 1), while disk 1 reports status 3:
.1.3.6.1.4.1.9.9.719.1.45.4.1.2.1: sys/rack-unit-1/board/storage-SAS-SLOT-3/pd-1
.1.3.6.1.4.1.9.9.719.1.45.4.1.2.2: sys/rack-unit-1/board/storage-SAS-SLOT-3/pd-2
.1.3.6.1.4.1.9.9.719.1.45.4.1.3.1: pd-1
.1.3.6.1.4.1.9.9.719.1.45.4.1.3.2: pd-2
.1.3.6.1.4.1.9.9.719.1.45.4.1.4.1: 0
.1.3.6.1.4.1.9.9.719.1.45.4.1.4.2: 0
.1.3.6.1.4.1.9.9.719.1.45.4.1.5.1: 1
.1.3.6.1.4.1.9.9.719.1.45.4.1.5.2: 1
.1.3.6.1.4.1.9.9.719.1.45.4.1.6.1: 1
.1.3.6.1.4.1.9.9.719.1.45.4.1.6.2: 2
.1.3.6.1.4.1.9.9.719.1.45.4.1.7.1: HDD
.1.3.6.1.4.1.9.9.719.1.45.4.1.7.2: HDD
.1.3.6.1.4.1.9.9.719.1.45.4.1.8.1: 512
.1.3.6.1.4.1.9.9.719.1.45.4.1.8.2: 512
.1.3.6.1.4.1.9.9.719.1.45.4.1.9.1: 3
.1.3.6.1.4.1.9.9.719.1.45.4.1.9.2: 1

In the walk captured with disk 1 pulled (the "down" file), only disk 2 is listed. It reports normal status, but it now appears at index 1:

.1.3.6.1.4.1.9.9.719.1.45.4.1.2.1: sys/rack-unit-1/board/storage-SAS-SLOT-3/pd-2
.1.3.6.1.4.1.9.9.719.1.45.4.1.3.1: pd-2
.1.3.6.1.4.1.9.9.719.1.45.4.1.4.1: 0
.1.3.6.1.4.1.9.9.719.1.45.4.1.5.1: 1
.1.3.6.1.4.1.9.9.719.1.45.4.1.6.1: 2
.1.3.6.1.4.1.9.9.719.1.45.4.1.7.1: HDD
.1.3.6.1.4.1.9.9.719.1.45.4.1.8.1: 512
.1.3.6.1.4.1.9.9.719.1.45.4.1.9.1: 1
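
The walks above are column-oriented, which makes it hard to see at a glance which DN and status belong to which index. The following is a minimal Python sketch, not part of SMARTS or of any SNMP tool, that regroups saved walk text into one row per index. The function name rows_by_index and the short column labels are hypothetical, and the parser assumes the "OID: value" or "OID = value" line styles shown in this article.

```python
import re

# Table prefix and column numbers as they appear in the walks in this article.
TABLE_PREFIX = ".1.3.6.1.4.1.9.9.719.1.45.4.1"
COLUMN_NAMES = {2: "dn", 9: "operability"}  # other columns keep a generic name

# Accepts both "OID: value" and "OID = value" line styles.
LINE_RE = re.compile(r"^(\.[0-9.]+)\s*[:=]\s*(.*)$")


def rows_by_index(walk_text):
    """Group walk lines into {index: {column_name: value}}."""
    rows = {}
    for line in walk_text.splitlines():
        match = LINE_RE.match(line.strip())
        if not match:
            continue
        oid, value = match.groups()
        if not oid.startswith(TABLE_PREFIX + "."):
            continue
        parts = oid[len(TABLE_PREFIX) + 1:].split(".")
        if len(parts) != 2 or not all(p.isdigit() for p in parts):
            continue  # not a <column>.<index> cell of this table
        column, index = int(parts[0]), int(parts[1])
        name = COLUMN_NAMES.get(column, f"col{column}")
        rows.setdefault(index, {})[name] = value.strip()
    return rows


# Excerpts of the "normal" and "down" walks shown above.
normal_walk = """
.1.3.6.1.4.1.9.9.719.1.45.4.1.2.1: sys/rack-unit-1/board/storage-SAS-SLOT-3/pd-1
.1.3.6.1.4.1.9.9.719.1.45.4.1.2.2: sys/rack-unit-1/board/storage-SAS-SLOT-3/pd-2
.1.3.6.1.4.1.9.9.719.1.45.4.1.9.1: 1
.1.3.6.1.4.1.9.9.719.1.45.4.1.9.2: 1
"""
down_walk = """
.1.3.6.1.4.1.9.9.719.1.45.4.1.2.1: sys/rack-unit-1/board/storage-SAS-SLOT-3/pd-2
.1.3.6.1.4.1.9.9.719.1.45.4.1.9.1: 1
"""

normal_rows = rows_by_index(normal_walk)
down_rows = rows_by_index(down_walk)
for label, rows in (("normal", normal_rows), ("down", down_rows)):
    for index in sorted(rows):
        print(label, index, rows[index]["dn"], rows[index]["operability"])
```

Printed side by side, the rows show that index 1 carries pd-1 in the normal walk and pd-2 in the down walk, which is the shift analyzed in the Resolution section.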

Resolution

Analysis of the above is as follows:

1. The host in question has the following disks equipped:

sys/rack-unit-1/board/storage-SAS-SLOT-3/pd-1
sys/rack-unit-1/board/storage-SAS-SLOT-3/pd-2

2. Based on the device certification and interaction with the vendor, SMARTS uses the following OIDs to create the Disk instances:

cucsStorageLocalDiskPresence

{".1.3.6.1.4.1.9.9.719.1.45.4.1.10"}

cucsStorageLocalDiskDn

{".1.3.6.1.4.1.9.9.719.1.45.4.1.2"}

The OID "cucsStorageLocalDiskPresence" indicates whether the disk module is present.

Walking this OID (cucsStorageLocalDiskPresence) with SNMPWALK returns:

SNMP Walk MIB starting at .1.3.6.1.4.1.9.9.719.1.45.4.1.10
.1.3.6.1.4.1.9.9.719.1.45.4.1.10.1 = 10
.1.3.6.1.4.1.9.9.719.1.45.4.1.10.2 = 10

There are two indexes, 1 and 2, and both have the value 10, which means that the disk module is present.

Now let's see what the same indexes return for the OID "cucsStorageLocalDiskDn":

SNMP Walk MIB starting at .1.3.6.1.4.1.9.9.719.1.45.4.1.2
.1.3.6.1.4.1.9.9.719.1.45.4.1.2.1 = sys/rack-unit-1/board/storage-SAS-SLOT-3/pd-1
.1.3.6.1.4.1.9.9.719.1.45.4.1.2.2 = sys/rack-unit-1/board/storage-SAS-SLOT-3/pd-2

3. SMARTS creates the disk instances after reading the two MIB OIDs above.

4. For monitoring, the status of the disks is exposed through the following OID:

cucsStorageLocalDiskOperability = ".1.3.6.1.4.1.9.9.719.1.45.4.1.9"

Walking this OID in a normal situation returns the following:

SNMP Walk MIB starting at .1.3.6.1.4.1.9.9.719.1.45.4.1.9
.1.3.6.1.4.1.9.9.719.1.45.4.1.9.1 = 1
.1.3.6.1.4.1.9.9.719.1.45.4.1.9.2 = 1

The above output tells us that index 1 and index 2 each have a value of 1, which means that the disk is "operable" per the MIB definition.

5. In the DOWN scenario (i.e. when you pull disk 1), the same walk returns:

SNMP Walk MIB starting at .1.3.6.1.4.1.9.9.719.1.45.4.1.9
.1.3.6.1.4.1.9.9.719.1.45.4.1.9.1 = 1

NOTE: Index 2 has been removed from the MIB entirely. Only index 1 is present, so only index 1 is polled.

From discovery through monitoring, SMARTS uses a unique key (i.e. the index) to identify an instance, and both discovery and monitoring are based solely on this identity.
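
To make the effect of index-based identity concrete, here is a minimal Python sketch. It is a simplified model for illustration only and does not represent SMARTS internals. The names DiskInstance, discover and poll are hypothetical, and the report strings are assumptions. The values 10 (present) and 1 (operable) and the DNs are taken from the walks in this article.

```python
from dataclasses import dataclass

PRESENT = "10"   # presence value reported in this article for an equipped disk
OPERABLE = "1"   # operability value reported in this article for a healthy disk

PD1 = "sys/rack-unit-1/board/storage-SAS-SLOT-3/pd-1"
PD2 = "sys/rack-unit-1/board/storage-SAS-SLOT-3/pd-2"


@dataclass
class DiskInstance:
    index: int  # the only key used for later polling in this model
    dn: str     # name captured once, at discovery time


def discover(presence, dn):
    """Create one instance per index that is present and has a DN."""
    return {
        i: DiskInstance(i, dn[i])
        for i in sorted(presence)
        if presence[i] == PRESENT and i in dn
    }


def poll(instances, operability):
    """Look up each instance's status by its discovered index only."""
    report = {}
    for index, inst in instances.items():
        status = operability.get(index)
        if status is None:
            report[inst.dn] = "no data at index"
        elif status == OPERABLE:
            report[inst.dn] = "operable"
        else:
            report[inst.dn] = f"status {status}"
    return report


# Discovery with both disks inserted.
instances = discover({1: "10", 2: "10"}, {1: PD1, 2: PD2})

# Poll after disk 1 is pulled: the agent returns a single row, at index 1.
report = poll(instances, {1: "1"})
for name, state in report.items():
    print(name, "->", state)
```

In this model the pulled disk (pd-1) still looks operable because index 1 still answers, while the healthy disk (pd-2) is the one with no data, because its row at index 2 is gone.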

When the disk at index 1 is pulled, the general expectation is that the agent keeps that index and sets its status to one of "inoperable", "degraded", "poweredOff", "powerProblem", "removed", or "decomissioning", as directed by the MIB definition (cucsStorageLocalDiskOperability).

In other words, the device should report that the disk at index 1 has been removed. In the current case, however, index 2 has been removed completely and only information for index 1 is available.

All of the information for disk 2 has been shifted to index 1.

NOTE: The following walks were captured with disk 1 pulled out.

SNMP Walk MIB starting at .1.3.6.1.4.1.9.9.719.1.45.4.1.10 (cucsStorageLocalDiskPresence)
.1.3.6.1.4.1.9.9.719.1.45.4.1.10.1 = 10

SNMP Walk MIB starting at .1.3.6.1.4.1.9.9.719.1.45.4.1.2 (cucsStorageLocalDiskDn)
.1.3.6.1.4.1.9.9.719.1.45.4.1.2.1 = sys/rack-unit-1/board/storage-SAS-SLOT-3/pd-2

SNMP Walk MIB starting at .1.3.6.1.4.1.9.9.719.1.45.4.1.9 (cucsStorageLocalDiskOperability)
.1.3.6.1.4.1.9.9.719.1.45.4.1.9.1 = 1

As a result, SMARTS is monitoring index 1 and not index 2: the reference to index 2 has been removed from the MIB, and all of the information for disk 2 (pd-2) has been shifted to index 1. This is the cause of the problem.
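
When investigating a similar report, the shift can be confirmed by comparing the cucsStorageLocalDiskDn column from a baseline walk with the same column from the current walk. The following Python sketch is a manual diagnostic aid under that assumption, not a SMARTS feature. The function names classify_indexes and missing_dns and the result strings are hypothetical.

```python
PD1 = "sys/rack-unit-1/board/storage-SAS-SLOT-3/pd-1"
PD2 = "sys/rack-unit-1/board/storage-SAS-SLOT-3/pd-2"


def classify_indexes(baseline_dn, current_dn):
    """Compare the DN seen at each baseline index with the DN seen there now."""
    result = {}
    for index, old_dn in baseline_dn.items():
        new_dn = current_dn.get(index)
        if new_dn is None:
            result[index] = "row missing"
        elif new_dn != old_dn:
            result[index] = f"remapped: now {new_dn}"
        else:
            result[index] = "unchanged"
    return result


def missing_dns(baseline_dn, current_dn):
    """Return the DNs that no longer appear at any index."""
    return sorted(set(baseline_dn.values()) - set(current_dn.values()))


# cucsStorageLocalDiskDn values from the walks in this article.
baseline = {1: PD1, 2: PD2}   # both disks inserted
current = {1: PD2}            # disk 1 pulled

status = classify_indexes(baseline, current)
gone = missing_dns(baseline, current)
print(status)
print(gone)
```

For the walks in this article, index 1 is classified as remapped to pd-2, index 2 as missing, and the only DN that has disappeared is pd-1, which matches the disk that was physically pulled.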

Summary conclusion: This behavior needs to be clarified with the vendor. SMARTS is not able to detect this device abnormality, so request clarification from the vendor.