Identify and replace a hard drive that has generated alarms for hard drive error and broken RAID on APM


Article ID: 167777

Products

XOS

Issue/Introduction

How To: For an APM, identify and replace a hard drive that has generated alarms for hard drive error and broken RAID.

First, note that the vap-group configuration has RAID-1 configured:
vap-group ips xslinux_v5_64
  raid 1                                     <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<RAID-1 configured
  vap-count 2
  max-load-count 2
  ap-list ap7 ap8
  load-balance-vap-list 1 2 3 4 5 6 7 8 9 10
  ip-forwarding
  ip-flow-rule sf_lb
    action load-balance
    activate

 


Next, run "show module status ap<#>" to see the status of the hard drive(s) and the RAID status.  In the example below, the RAID is broken and the Second Hard Drive has a critical error:
  Board Type                   AP8650             
[truncated for brevity]     
  Hard Disk                    250(GB)            
  Second Hard Disk             250(GB)            
  Flash                        NA                 
  Hard Drive Error             None               
  Second Hard Drive Error      Critical Error <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
  RAID Status                  Broken, RAID 1 <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<

       
 

Reviewing /var/log/messages shows a failed SATA2 disk (SCSI ID 3 / sdb) below:
 
Oct 16 03:32:57 ips_2 kernel: ata4.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Oct 16 03:32:57 ips_2 kernel: ata4.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0
Oct 16 03:32:57 ips_2 kernel:          res 40/00:00:00:4f:c2/00:01:00:00:00/00 Emask 0x4 (timeout)
Oct 16 03:32:57 ips_2 kernel: ata4.00: status: { DRDY }
Oct 16 03:32:57 ips_2 kernel: ata4: hard resetting link
Oct 16 03:33:02 ips_2 kernel: ata4: link is slow to respond, please be patient (ready=0)
Oct 16 03:33:07 ips_2 kernel: ata4: COMRESET failed (errno=-16)
Oct 16 03:33:07 ips_2 kernel: ata4: hard resetting link
Oct 16 03:33:13 ips_2 kernel: ata4: link is slow to respond, please be patient (ready=0)
Oct 16 03:33:17 ips_2 kernel: ata4: COMRESET failed (errno=-16)
Oct 16 03:33:18 ips_2 kernel: ata4: hard resetting link
Oct 16 03:33:23 ips_2 kernel: ata4: link is slow to respond, please be patient (ready=0)
Oct 16 03:33:52 ips_2 kernel: ata4: COMRESET failed (errno=-16)
Oct 16 03:33:52 ips_2 kernel: ata4: limiting SATA link speed to 1.5 Gbps
Oct 16 03:33:52 ips_2 kernel: ata4: hard resetting link
Oct 16 03:33:57 ips_2 kernel: ata4: COMRESET failed (errno=-16)
Oct 16 03:33:57 ips_2 kernel: ata4: reset failed, giving up
Oct 16 03:33:57 ips_2 kernel: ata4.00: disabled
Oct 16 03:33:57 ips_2 kernel: sd 3:0:0:0: timing out command, waited 30s
Oct 16 03:33:57 ips_2 kernel: ata4: EH complete
Oct 16 03:33:57 ips_2 kernel: sd 3:0:0:0: SCSI error: return code = 0x00040000
Oct 16 03:33:57 ips_2 kernel: end_request: I/O error, dev sdb, sector 479893439
Oct 16 03:33:57 ips_2 kernel: raid1: Disk failure on sdb1, disabling device. 
Oct 16 03:33:57 ips_2 kernel:         Operation continuing on 1 devices
Oct 16 03:33:57 ips_2 kernel: RAID1 conf printout:
Oct 16 03:33:57 ips_2 kernel:  --- wd:1 rd:2
Oct 16 03:33:57 ips_2 kernel:  disk 0, wo:0, o:1, dev:sda1
Oct 16 03:33:57 ips_2 kernel:  disk 1, wo:1, o:0, dev:sdb1
Oct 16 03:33:57 ips_2 kernel: RAID1 conf printout:
Oct 16 03:33:57 ips_2 kernel:  --- wd:1 rd:2
Oct 16 03:33:57 ips_2 kernel:  disk 0, wo:0, o:1, dev:sda1
Oct 16 03:33:57 ips_2 /crossbeam/bin/cbs_md_check: MdActive = 3
Oct 16 03:33:57 ips_2 /crossbeam/bin/cbs_md_check: MdBroken = 1
Oct 16 03:35:31 POD22 cbshmonitord[4824]: [N] [POD22 1.2.1.20] Violation (s=1, alarm) occurred 3 times: module:10, item:2610 (H_ID_RAID_STATUS), time:"Wed Oct 16 03:34:26 2013", value: 33, norm:0-32, minor:0-48, major:0-48
Oct 16 03:35:31 POD22 cbsalarmlogrd: AlarmID 111401 | Wed Oct 16 03:35:31 2013 | minor | ap8 | raidStatusChange | RAID status change
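
If the module's /var/log/messages is long, the relevant disk and RAID events can be isolated with a simple filter. This is a minimal sketch using standard Linux grep (the exact message wording may vary between XOS releases):

ips_2 (lab01): ~# grep -iE 'raid|i/o error|disk failure' /var/log/messages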

Cause

A failed hard drive on the APM causes the hard drive error and broken RAID alarms.  The procedure below can be used to identify and replace the bad drive.

Resolution

As noted in /var/log/messages, "sdb" was generating the I/O errors, which caused the broken RAID-1 status for "sdb1":

Oct 16 03:33:57 ips_2 kernel: end_request: I/O error, dev sdb, sector 479893439
Oct 16 03:33:57 ips_2 kernel: raid1: Disk failure on sdb1, disabling device.



1.  First, verify that both drives are recognized.  You should see two SCSI Host entries, indicating that both hard drives are present and accessible.  (If only one entry is shown, the module will need to be reset; then repeat step 1 to ensure both drives are seen.)


ips_2 (lab01): ~# cat /proc/scsi/scsi                                                              

Attached devices:
Host: scsi0 Channel: 00 Id: 00 Lun: 00
  Vendor: ATA      Model: Hitachi HTE72322 Rev: FCDO
  Type:   Direct-Access                    ANSI SCSI revision: 05
Host: scsi3 Channel: 00 Id: 00 Lun: 00
  Vendor: ATA      Model: Hitachi HTE72322 Rev: FCDO
  Type:   Direct-Access                    ANSI SCSI revision: 05
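
As a quick sanity check, the number of attached SCSI host entries can also be counted directly. This is a minimal sketch using standard Linux tools rather than an XOS-specific command; two entries are expected when both drives are visible:

ips_2 (lab01): ~# grep -c '^Host:' /proc/scsi/scsi
2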

                
 

2.  You can also check RAID status and identify the missing disk in the RAID set.

ips_2 (POD22): ~# cat /proc/mdstat  
                                    
Personalities : [raid1]
md1 : active raid1 sdb1[2](F) sda1[0]
     239946688 blocks [2/1] [U_] <<<<<<<< indicates that the RAID-1 set is broken since the output is "U_" instead of "UU".
     
md5 : active raid1 sdb5[1] sda5[0]
      2104448 blocks [2/2] [UU]  <<<<<<<< indicates that the RAID-1 set is intact since the output is "UU".

     
md6 : active raid1 sdb6[1] sda6[0]
      2104448 blocks [2/2] [UU]  <<<<<<<< indicates that the RAID-1 set is intact since the output is "UU".
     
unused devices: <none>
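
If more detail is needed on the degraded array (for example, which member is flagged as faulty or removed), the standard mdadm utility can be queried directly, assuming it is available on the APM's Linux image:

ips_2 (POD22): ~# mdadm --detail /dev/md1

The output lists each member device together with its state (active sync, faulty, or removed).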


3.   Run the smartctl test on the "sdb" disk and record any errors.

ips_2 (lab01): ~# smartctl -a /dev/sdb
smartctl version 5.38 [x86_64-redhat-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

Short INQUIRY response, skip product id

A mandatory SMART command failed: exiting. To continue, add one or more '-T permissive' options.   
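
In this example smartctl exits because the failed drive no longer answers a mandatory SMART command. As the tool itself suggests, the check can be retried with the '-T permissive' option to collect whatever data the drive still returns; a drive that is completely unresponsive may still report nothing useful:

ips_2 (lab01): ~# smartctl -a -T permissive /dev/sdb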
 
 
4. Record the serial number of the failed disk, in this case SATA2 [HTE7220080623DP1C50DJGNW60T].

test_1 (POD22): ~# /crossbeam/bin/cbs_scsi_disk_info.pl                   
=====================================================================================================
H C I L     /dev/ 
O H D U     device
S A   N   
T N         Gen  Dev  Serial#              Vendor     Model#           Rev                            
-----------------------------------------------------------------------------------------------------
0:0:0:0     sg0  sda  HTE7220081201DP0C70DSGVXBEC ATA        Hitachi HTE72201 DCCO                        
3:0:0:0     sg1  sdb  HTE7220080623DP1C50DJGNW60T ATA        Hitachi HTE72201 DCCO      <<<<<<<<<<<<<<<<<<<<<<
======================================================================================
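
Before removing any hardware, it can also help to note the serial number of the surviving disk, so that the wrong drive is not pulled by mistake. A minimal sketch using standard smartctl against the healthy disk (the failed disk may not respond):

test_1 (POD22): ~# smartctl -i /dev/sda | grep -i serial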


5. Remove the affected APM, remove the faulty disk with the serial number recorded in the previous step, insert a new blank disk in the same slot as the failed drive, and re-insert the APM.
     
When a new disk is added to the APM, dmesg and /var/log/messages show the RAID reconstruction:

md: Autodetecting RAID arrays.
md: invalid raid superblock magic on sdb1
md: sdb1 has invalid sb, not importing!
md: autorun ...
md: considering sdb6 ...
md:  adding sdb6 ...
md: sdb5 has different UUID to sdb6
md: created md6
md: bind<sdb6>
md: running: <sdb6>
md: md6: raid array is not clean -- starting background reconstruction <<<<<<<<<<<<<<<<<<<<<
md: raid1 personality registered for level 1
raid1: raid set md6 active with 1 out of 2 mirrors
md: considering sdb5 ...
md:  adding sdb5 ...
md: created md5
md: bind<sdb5>
md: running: <sdb5>
md: md5: raid array is not clean -- starting background reconstruction
raid1: raid set md5 active with 1 out of 2 mirrors
md: ... autorun DONE.

Oct 16 12:10:09 test_1 kernel: md: md1: raid array is not clean -- starting background reconstruction
Oct 16 12:10:09 test_1 kernel: raid1: raid set md1 active with 2 out of 2 mirrors
Oct 16 12:10:10 test_1 kernel: md: md5: raid array is not clean -- starting background reconstruction
Oct 16 12:10:10 test_1 kernel: raid1: raid set md5 active with 2 out of 2 mirrors
Oct 16 12:10:12 test_1 kernel: md: md6: raid array is not clean -- starting background reconstruction
Oct 16 12:10:12 test_1 kernel: raid1: raid set md6 active with 2 out of 2 mirrors 
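
The RAID-1 rebuild runs in the background and can take some time on larger drives. Progress can be checked from the shell with the standard /proc/mdstat interface; while the resync is running, the affected md device shows a recovery progress line, and once it completes all arrays should report "[UU]":

test_1 (POD22): ~# cat /proc/mdstat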



Next, run "show module status ap<#>" again to see the status of the hard drive(s) and the RAID status.  In the example below, the RAID status is restored and the Second Hard Drive error has been cleared:

 
  Board Type                   AP8650             
 [truncated for brevity]    
  Hard Disk                    120(GB)            
  Second Hard Disk             120(GB)            
  Flash                        NA                 
  Hard Drive Error             None               
  Second Hard Drive Error      None               

  RAID Status                  Active, RAID 1     <<<<<<<<<<<<<<<<<<<<<<<<<<<<<
 
Screenshot of APM
 

 
