
Finding and replacing failed drives in JBOD enclosures


Article ID: 173756


Products

Security Analytics

Issue/Introduction

A disk failure may be reported in the web UI, indicated by a pink banner at the top of the page.  Locating and replacing the failed disks is simple using a few command-line utilities.

Cause

Hard disks wear over time and need to be replaced.

Resolution

To find a failed SAS-attached drive (including all internal drives) in the Dell MD1200/MD1400 and the Symantec J5300 external enclosures, run lsi-show as root from the command line.  The Logical Drive Information should report a Status of Optimal if everything is healthy; it may instead show Degraded, Failed, or even Offline.  If the Logical Drive is not Optimal, a drive has failed.

This example shows a healthy set of disk drives:

Logical Drive Information

Id   Size         Status       Stripe   Raid Level   Cache

0    20.006 TB    Optimal      64 KB    Primary-5    WriteBack, ReadAdaptive, Direct, Write Cache OK if Bad BBU

  by id: 22,21,13,16,15,14,19,18,17,20,23,24

  by es: 25:0,25:1,25:2,25:3,25:4,25:5,25:6,25:7,25:8,25:9,25:10,25:11
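As a quick check, the Status column in that output can be parsed with awk.  This is a sketch, not part of the appliance tooling: the heredoc stands in for live lsi-show output (the Degraded row is invented for illustration), and on a sensor you would pipe lsi-show into the same awk program.

```shell
# Sketch: flag any logical drive whose Status is not Optimal.
# Field layout assumed from the sample above: Id, Size, unit, Status, ...
# The Degraded row below is hypothetical; replace the heredoc with:
#   lsi-show | awk '...'
status_report=$(awk '
  /^Id[[:space:]]+Size/ { in_table = 1; next }
  in_table && /^[0-9]+[[:space:]]/ && $4 != "Optimal" {
    print "Logical drive " $1 ": " $4
  }' <<'EOF'
Id   Size         Status       Stripe   Raid Level   Cache
0    20.006 TB    Optimal      64 KB    Primary-5    WriteBack
1    20.006 TB    Degraded     64 KB    Primary-5    WriteBack
EOF
)
echo "$status_report"
```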

In the physical drive listing, a drive will show "Failed" if it is a hard failure.  Each healthy drive should show either "Online, Spun Up":

         0    32:0     Online, Spun Up   None     1.819 TB       SEAGATE ST2000NX0453    NSF1S460PX3G

or "Unconfigured(good), Spun Up":

         14   32:14    Unconfigured(good), Spun Up    None     931.512 GB     SEAGATE ST91000640SS    AS099XG4V7KX
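A sketch of filtering that physical drive listing for anything not in one of the two healthy states.  The Failed row, its model, and its serial number are invented for illustration; on an appliance you would feed the real listing into the same awk program.

```shell
# Sketch: print the enclosure:slot of any drive whose state is neither
# "Online, Spun Up" nor "Unconfigured(good), Spun Up".
# The 32:5 Failed row (model/serial included) is hypothetical sample data.
bad_drives=$(awk '/^[[:space:]]*[0-9]+[[:space:]]+[0-9]+:[0-9]+/ && !/Online, Spun Up/ && !/Unconfigured\(good\)/ { print $2 }' <<'EOF'
         0    32:0     Online, Spun Up   None     1.819 TB       SEAGATE ST2000NX0453    NSF1S460PX3G
         5    32:5     Failed            None     1.819 TB       SEAGATE EXAMPLE0000     EXAMPLESN001
        14   32:14    Unconfigured(good), Spun Up    None     931.512 GB     SEAGATE ST91000640SS    AS099XG4V7KX
EOF
)
echo "$bad_drives"
```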

A drive that is about to fail may be advertised as a "Predictive Failure".  To find this type of failure, run (as root) from the CLI:  grep HARDWARE /var/log/messages | grep Slot

A sample entry for a Predictive Failure would look like:

Oct 15 19:40:01 sensor_name_here disk_subsystem[28117]: snlog: sn="24:6e:96:23:fc:57" id="DS" m="23" c="6" event="DISK_STATUS" category="HARDWARE" ip="30.129.6.15" model="R730xd" msg="Adapter 0; seqNum: 0x00031be2; Time: Mon Oct 15 20:38:19 2018; Event Description: Predictive failure: PD 0a(e0x20/s10); Device ID: 10; Enclosure Index: 32; Slot Number: 10; "
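The enclosure index and slot number can be extracted from such an entry with sed.  This sketch runs against an abbreviated copy of the sample line above; on a sensor you would pipe the grep command shown earlier into the same sed expressions.

```shell
# Sketch: pull the enclosure and slot out of a Predictive Failure entry.
# $line is an abbreviated copy of the sample log message above.
line='Event Description: Predictive failure: PD 0a(e0x20/s10); Device ID: 10; Enclosure Index: 32; Slot Number: 10; '
enclosure=$(printf '%s' "$line" | sed -n 's/.*Enclosure Index: \([0-9]*\);.*/\1/p')
slot=$(printf '%s' "$line" | sed -n 's/.*Slot Number: \([0-9]*\);.*/\1/p')
echo "enclosure=$enclosure slot=$slot"
```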

Be sure to enable email or syslog notifications of hardware failures in the Web UI under Settings -> Communications.  Set the syslog and email servers for your site, then under the Advanced tab select the Hardware check boxes for Syslog or Email as appropriate for your environment.

To identify the enclosure and the slot the drive is in, start with the serial number of the enclosure.  The Dell enclosures have the asset tag on the right mounting tab.  The J5300s have a white sticker on the right mounting tab.

The Dell enclosures have a drive slot map on the right mounting tab.  The J5300 slots are numbered 1-12 in three rows of four, left to right and top to bottom: slot 1 is the top left, slot 2 is to its right, slot 5 is the first slot of the second row, and slot 12 is the bottom right.
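The row-and-column arithmetic for the J5300 layout just described can be sketched as:

```shell
# Sketch: convert a J5300 slot number (1-12) to its physical row and column,
# assuming three rows of four slots numbered left to right, top to bottom.
slot=5
row=$(( (slot - 1) / 4 + 1 ))
col=$(( (slot - 1) % 4 + 1 ))
echo "slot $slot -> row $row, column $col"
```

For example, slot 5 lands in the second row, first column, matching the description above.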

There is a command to light the locator LED on a drive: megacli -pdlocate -start|-stop -physdrv[E:S] -aX.  For example, to find the drive on adapter 2, enclosure 32, slot 3, run megacli -pdlocate -start -physdrv[32:3] -a2.  If the drive has failed completely, the LED may not light; a workaround is to light the LED of the drive next to it instead, for example megacli -pdlocate -start -physdrv[32:4] -a2.  After the drive has been swapped, stop the flashing light with megacli -pdlocate -stop -physdrv[32:3] -a2.
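A minimal wrapper around that locate command might look like the following sketch.  It only prints the command so it can be checked safely before touching hardware; the enclosure, slot, and adapter values are the example's, not necessarily yours.

```shell
# Sketch: build the megacli locate command from its parts.
# Prints the command instead of running it; drop `echo` on a real appliance.
locate_drive() {
  action=$1; enclosure=$2; slot=$3; adapter=$4
  echo "megacli -pdlocate -$action -physdrv[$enclosure:$slot] -a$adapter"
}
cmd=$(locate_drive start 32 3 2)
echo "$cmd"
```

To stop the LED after the swap, call the same function with "stop" as the first argument.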

There is a separate article for finding disk failures for attached storage arrays.  These would be the Dell MD3860, Netapp E5660, and Dell ME4/VA4 models.  The commands and drive locations will be different.  

Additional Information

For help with larger storage arrays, see: Disk drive failed in the Security Analytics storage array