Workaround to prevent "RAID Status Broken, RAID 1" - (includes recovery procedure)

Article ID: 167845

Products

XOS

Issue/Introduction

Customers with CPM-8600 modules and APM-8600 modules with optional hard disks installed may experience a minor alarm due to a "Broken, RAID 1" status. An example appears below:

--------------------- show alarms minor ----------------------
Minor Alarms: 
  cp1 RAID Status                  Broken, RAID 1

Cause

This condition can be directly triggered by the "ZZsmartctl_longtest" or "ZZsmartctl_shorttest" hard drive diagnostic utilities. In XOS version 7.3.2 and earlier XOS versions, these utilities are executed automatically via daily and weekly cron jobs.

Resolution

In XOS V8.1.4 and later versions of XOS, the ZZsmartctl_longtest and ZZsmartctl_shorttest hard drive diagnostic self-tests are disabled by default.

On existing systems, disable the self-tests manually by removing the script links from the cron job directories. For example:

[[email protected] bin]# rm /etc/cron.daily/ZZsmartctl_shorttest
rm: remove `/etc/cron.daily/ZZsmartctl_shorttest'? y  (type “y” to delete)

[[email protected] bin]# rm /etc/cron.weekly/ZZsmartctl_longtest
rm: remove `/etc/cron.weekly/ZZsmartctl_longtest'? y  (type “y” to delete)
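
The two removals above can also be done non-interactively. The following is a minimal sketch, not a vendor-supplied tool: the function name `disable_smart_cron` and its optional root-prefix argument are hypothetical, added so the logic can be tried against a scratch directory before it is run on a module.

```shell
#!/bin/sh
# Remove the SMART self-test cron links named in this article, if present.
# The optional first argument is a hypothetical path prefix used only for
# trial runs against a scratch directory; it is not part of XOS.
disable_smart_cron() {
    root="${1:-}"
    for link in \
        "$root/etc/cron.daily/ZZsmartctl_shorttest" \
        "$root/etc/cron.weekly/ZZsmartctl_longtest"
    do
        # -L also catches a symbolic link whose target no longer exists.
        if [ -e "$link" ] || [ -L "$link" ]; then
            rm -f -- "$link" && echo "removed $link"
        else
            echo "not present: $link"
        fi
    done
}
```

Called with no argument as root, the function acts on the real /etc/cron.daily and /etc/cron.weekly directories. It is safe to run more than once, because a missing link is only reported.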

 

Broken RAID1 Status Recovery Procedure:

A) RAID recovery on the active CPM or any APM:
If it is possible to attempt recovery of the broken RAID 1 status, halt the CPM/APM and reseat it. After the CPM/APM boots, check /proc/scsi/scsi and confirm that two disks are recognized.

For example:
Before the recovery attempt on the CPM/APM (only one disk shown under "Attached devices:"):

[[email protected]CBS root]# cat /proc/scsi/scsi
Attached devices:
Host: scsi3 Channel: 00 Id: 00 Lun: 00
  Vendor: ATA      Model: HTE721010G9SA00  Rev: MCZO
  Type:   Direct-Access                    ANSI SCSI revision: 05


After reseating the CPM/APM (two disks shown under "Attached devices:"):

[[email protected]CBS root]# cat /proc/scsi/scsi
Attached devices:
Host: scsi0 Channel: 00 Id: 00 Lun: 00
  Vendor: ATA      Model: HTE721010G9SA00  Rev: MCZO
  Type:   Direct-Access                    ANSI SCSI revision: 05
Host: scsi3 Channel: 00 Id: 00 Lun: 00
  Vendor: ATA      Model: HTE721010G9SA00  Rev: MCZO
  Type:   Direct-Access                    ANSI SCSI revision: 05
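
The disk count can be checked without reading the listing by eye. This is a minimal sketch that assumes each disk appears with a `Type: Direct-Access` line, as in the examples above; the function name `count_scsi_disks` is hypothetical.

```shell
#!/bin/sh
# Print the number of direct-access devices listed in a /proc/scsi/scsi
# style file. Defaults to the live file; a saved copy can be passed instead.
count_scsi_disks() {
    file="${1:-/proc/scsi/scsi}"
    # grep -c prints the count of matching lines (0 when none match).
    grep -c 'Type:[[:space:]]*Direct-Access' "$file"
}
```

A guard such as `[ "$(count_scsi_disks)" -eq 2 ]` can then be used to confirm that both disks are visible before the RAID sync is started.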

If two disks are shown in the /proc/scsi/scsi file, use the /crossbeam/bin/xos-raid-add utility to start the RAID sync:

[[email protected]CBS root]# /crossbeam/bin/xos-raid-add

Further documentation related to RAID configuration can be found within the XOS Configuration Guide in "Appendix B RAID-Related Hard Drive Configuration and Repair".


B) RAID recovery on the standby CPM:
If the RAID recovery procedure must be performed on the standby CPM, complete the following steps before any of the steps described in section (A) above. The example below assumes that cp1 is the online CPM and cp2 is the standby CPM:

1. On cp1, enter the following command:

  CBS# show cp-redundancy

2. Verify that the output reports "Disk synchronization is 100% completed".

3. On cp1, enter the following command:

  CBS# configure cp-redundancy set cp2 offline

     This reboots cp2 and puts it into the offline state.


4. After cp2 has booted up, follow the RAID recovery procedure in section (A) above.

5. After the RAID recovery procedure has finished on cp2, log in to cp1 again and enter the following commands:

  CBS# configure cp-redundancy set cp2 election
  CBS# reload offline-cp

     This will cause cp2 to reboot again into the "standby" state. 

6. After cp2 has booted up, check the status of cp-redundancy on cp1 again using this command:

  CBS# show cp-redundancy
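
If the output of `show cp-redundancy` is captured to a text file (for example from a terminal session log), the synchronization check in steps 2 and 6 can be sketched as below. This assumes the output contains the exact phrase quoted in step 2; the function name `sync_complete` and the idea of a saved log file are illustrative only.

```shell
#!/bin/sh
# Succeed (exit status 0) when the captured "show cp-redundancy" output
# contains the completion phrase quoted in this article.
# $1 is the path to a saved copy of the command output (hypothetical).
sync_complete() {
    grep -qF 'Disk synchronization is 100% completed' "$1"
}
```

For example, `sync_complete session.log && echo "disks in sync"` prints the message only when the phrase is found in the saved output.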

Workaround

The recommended preventive action for existing systems is to manually disable the daily and weekly cron jobs.