Identify and replace a hard drive that has generated alarms for hard drive error and broken RAID on CPM

Article ID: 167783

Products

XOS

Issue/Introduction

How to: For a CPM, identify and replace a hard drive that has generated alarms for hard drive error and broken RAID.

When running "show module status cp<#>", you will see the status of the hard drive(s) and the RAID status.  In the example below, RAID is broken and the Second Hard Drive has a critical error:
 

CBS# show module status cp2
 
 Slot 14:
  Board Type                   CP8600
[truncated for brevity]        
  Hard Disk                    250(GB)             
  Second Hard Disk             250(GB)             
  Flash                        NA                  
  Hard Drive Error             None                
  Second Hard Drive Error      Critical Error <<<<<<<<<<<<<<<<<
  RAID Status                  Broken, RAID 1 <<<<<<<<<<<<<<<<<

Cause

Use this procedure to identify and replace a failed hard drive on the CPM that is generating the alarms.

Resolution


1.  First, check whether both drives are recognized.  You should see two SCSI Host entries, indicating that both hard drives are present and accessible.  (If only one entry is shown, reset the module and start again at step 1 to ensure both drives are seen.  If no hard drive error was generated, refer to KB# 4524 for Broken RAID-1 recovery, as that is the more suitable solution.)

Enter the "unix" prompt and look at the /proc/scsi/scsi file:

 

 

[[email protected] soporte_ebtel]# cat /proc/scsi/scsi

Attached devices:
Host: scsi0 Channel: 00 Id: 00 Lun: 00
Vendor: ATA Model: Hitachi HTE72505 Rev: PC4O
Type: Direct-Access ANSI SCSI revision: 05
Host: scsi3 Channel: 00 Id: 00 Lun: 00
Vendor: ATA Model: Hitachi HTE72505 Rev: PC4O
Type: Direct-Access ANSI SCSI revision: 05
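
The check in step 1 can also be done on saved output. The following is a minimal sketch, assuming the contents of /proc/scsi/scsi have been captured as text (for example, pasted from the session above); the function name `list_scsi_hosts` is hypothetical and is not part of XOS.

```python
import re

def list_scsi_hosts(scsi_text):
    """Return the host name (e.g. 'scsi0') of every 'Host:' entry."""
    return re.findall(r"^\s*Host:\s*(scsi\d+)", scsi_text, flags=re.MULTILINE)

# Output copied from the example in step 1.
SAMPLE_SCSI = """\
Attached devices:
Host: scsi0 Channel: 00 Id: 00 Lun: 00
Vendor: ATA Model: Hitachi HTE72505 Rev: PC4O
Type: Direct-Access ANSI SCSI revision: 05
Host: scsi3 Channel: 00 Id: 00 Lun: 00
Vendor: ATA Model: Hitachi HTE72505 Rev: PC4O
Type: Direct-Access ANSI SCSI revision: 05
"""

hosts = list_scsi_hosts(SAMPLE_SCSI)
if len(hosts) == 2:
    print("Both drives are visible:", ", ".join(hosts))
else:
    print("Expected 2 SCSI hosts, found", len(hosts))
```

If fewer than two hosts are reported, the guidance in step 1 applies: reset the module and repeat the check.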

 

 

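2.  Check the RAID status and identify the failed disk in the RAID set.  The output below shows a problem with SATA2 (sdb): its partitions are flagged (F), for example sdb1[2](F), and each array reports only one of two members up ([2/1] [U_]):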


[[email protected]]# cat /proc/mdstat

Personalities : [raid0] [raid1]
md1 : active raid1 sdb1[2](F) sda1[0]
104320 blocks [2/1] [U_]      <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<

md5 : active raid1 sdb5[2](F) sda5[0]
8008256 blocks [2/1] [U_]     <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<

md6 : active raid1 sdb6[2](F) sda6[0]
236002752 blocks [2/1] [U_]
   <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
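
Reading the /proc/mdstat output by eye works for three arrays, but the same rule can be expressed in code. This is a rough sketch, assuming the output has been saved as text: it reports every md array that has a member flagged (F) or an underscore in its status map. The function name `degraded_arrays` is hypothetical.

```python
import re

def degraded_arrays(mdstat_text):
    """Map each degraded md array to the list of members flagged (F)."""
    result = {}
    current = None
    for line in mdstat_text.splitlines():
        header = re.match(r"^(md\d+)\s*:\s*(.*)$", line.strip())
        if header:
            current = header.group(1)
            # Members look like "sdb1[2](F)"; (F) marks a failed member.
            failed = re.findall(r"(\w+)\[\d+\]\(F\)", header.group(2))
            if failed:
                result[current] = failed
            continue
        # Status line such as "[2/1] [U_]"; "_" marks a missing member.
        status = re.search(r"\[(\d+)/(\d+)\]\s+\[([U_]+)\]", line)
        if status and current and "_" in status.group(3):
            result.setdefault(current, [])
    return result

# Output copied from the example in step 2 (highlight arrows omitted).
SAMPLE_MDSTAT = """\
Personalities : [raid0] [raid1]
md1 : active raid1 sdb1[2](F) sda1[0]
104320 blocks [2/1] [U_]

md5 : active raid1 sdb5[2](F) sda5[0]
8008256 blocks [2/1] [U_]

md6 : active raid1 sdb6[2](F) sda6[0]
236002752 blocks [2/1] [U_]
"""

for array, failed in degraded_arrays(SAMPLE_MDSTAT).items():
    print(array, "is degraded; failed members:", ", ".join(failed) or "none listed")
```

In the example all three arrays list a partition of sdb as failed, which is why the following step examines the "sdb" disk.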

 

 

 

 

 

 

3.  Run smartctl on the "sdb" disk and record any errors.


[email protected]]# smartctl -a /dev/sdb

smartctl version 5.38 [x86_64-redhat-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

Short INQUIRY response, skip product id
A mandatory SMART command failed: exiting. To continue, add one or more '-T permissive' options

 

 

5.  Remove the affected CPM, insert a new blank disk in the same slot as the failed drive, and re-insert the CPM.

6. When the module boots up, run the following command from the Linux shell:

[email protected]]# /crossbeam/bin/xos-raid-add
The RAID reconstruction might take several hours.
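
Because the reconstruction can run for hours, it may help to track its progress. The sketch below assumes that the rebuild is a standard Linux md recovery and therefore appears in /proc/mdstat as a "recovery = N%" line; this article does not confirm that for xos-raid-add, so treat it as an assumption. The sample text is illustrative, not captured from a CPM, and the function name `rebuild_progress` is hypothetical.

```python
import re

def rebuild_progress(mdstat_text):
    """Return {array: percent} for arrays showing a recovery/resync line."""
    progress = {}
    current = None
    for line in mdstat_text.splitlines():
        header = re.match(r"^\s*(md\d+)\s*:", line)
        if header:
            current = header.group(1)
            continue
        match = re.search(r"(?:recovery|resync)\s*=\s*([\d.]+)%", line)
        if match and current:
            progress[current] = float(match.group(1))
    return progress

# Illustrative sample only (assumed format, not taken from a real CPM).
SAMPLE_REBUILD = """\
Personalities : [raid0] [raid1]
md6 : active raid1 sdb6[2] sda6[0]
236002752 blocks [2/1] [U_]
[==>..................]  recovery = 12.6% (29736448/236002752) finish=127.5min speed=26943K/sec
"""

for array, percent in rebuild_progress(SAMPLE_REBUILD).items():
    print(array, "rebuild at", percent, "percent")
```

An empty result means no array is currently rebuilding, which is expected both before the rebuild starts and after it completes.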

7.  Finally, run "show module status cp<#>" again to check the status of the hard drive(s) and the RAID status.  In the example below, the RAID status is Active and the Second Hard Drive error has been cleared:

 
POD15# show module status cp2
 
 Slot 14:
  Board Type                   CP8600
[truncated for brevity]         
  Hard Disk                    250(GB)             
  Second Hard Disk             250(GB)             
  Flash                        NA                  
  Hard Drive Error             None                
  Second Hard Drive Error      None           <<<<<<<<<<<<<<<<<
  RAID Status                  Active, RAID 1 <<<<<<<<<<<<<<<<<
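
The before and after comparison can be scripted against saved "show module status" output. This is a minimal sketch, assuming the field names and values appear exactly as in the examples in this article; the function names `parse_module_status` and `raid_is_healthy` are hypothetical and are not XOS commands.

```python
import re

def parse_module_status(text):
    """Split each 'Name    Value' line into a dictionary entry."""
    fields = {}
    for line in text.splitlines():
        # Drop the "<<<<" highlight arrows used in this article.
        line = line.split("<<<")[0].strip()
        parts = re.split(r"\s{2,}", line, maxsplit=1)
        if len(parts) == 2:
            fields[parts[0]] = parts[1].strip()
    return fields

def raid_is_healthy(fields):
    """True when neither drive reports an error and the RAID is Active."""
    return (fields.get("Hard Drive Error") == "None"
            and fields.get("Second Hard Drive Error") == "None"
            and fields.get("RAID Status", "").startswith("Active"))

# Excerpts copied from the examples in this article.
BROKEN = """\
  Hard Drive Error             None
  Second Hard Drive Error      Critical Error <<<<<<<<<<<<<<<<<
  RAID Status                  Broken, RAID 1 <<<<<<<<<<<<<<<<<
"""
FIXED = """\
  Hard Drive Error             None
  Second Hard Drive Error      None           <<<<<<<<<<<<<<<<<
  RAID Status                  Active, RAID 1 <<<<<<<<<<<<<<<<<
"""

print("before replacement:", raid_is_healthy(parse_module_status(BROKEN)))
print("after replacement:", raid_is_healthy(parse_module_status(FIXED)))
```

The parser splits on runs of two or more spaces, so values that contain single spaces, such as "Active, RAID 1", stay intact.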
 
 
