Solution
Background
Hardware Gateway appliances are built with a two-disk RAID 1 array. Hard disk drives can fail, and this document helps an engineer or operator identify and isolate a faulted disk and replace it with a vendor-supplied replacement. If a disk failure is suspected on a hardware appliance, open a new case with CA Technologies API Management Support. This article serves as a reference for the work involved, but a Support case must still be opened.
Resolution
- One of the following messages will be present in the appliance logs. The device name in the message (sda in these examples) is the logical identifier of the failed device:
kernel: sda: Current: sense key: Hardware Error
kernel: sd 0:0:0:0: [sda] Sense Key : Hardware Error [current]
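When the logs are long, the device name can be pulled out of these messages with a small filter. The following is a minimal sketch and not part of the official procedure: the function name failed_devices is hypothetical, and it assumes the kernel messages follow one of the two formats shown above.

```shell
# Hypothetical helper: read log lines on standard input and print the
# unique sdX device names found in "Hardware Error" kernel messages.
# Assumes the two message formats shown in this article.
failed_devices() {
    grep 'Hardware Error' \
        | grep -oE '\[sd[a-z]+]|kernel: sd[a-z]+:' \
        | grep -oE 'sd[a-z]+' \
        | sort -u
}
```

It could be fed from whichever log file the appliance writes kernel messages to, for example failed_devices < /var/log/messages (the path is an assumption and may differ on a given appliance).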
- Use the logical identifier of the failed device to find the serial number of the impacted drive: smartctl -a /dev/sdX | grep -i serial
- Power off the Gateway appliance
- Remove the existing disks from the appliance
- Use the serial number from the previous step to identify the failed drive
- Install the new disk in place of the failed disk
- Power on the Gateway appliance
NOTE: The system may fail to start and instead place the operator in the GRUB shell.
If this happens, review the article on reloading the GRUB configuration in a hardware appliance Gateway for information on resolving the issue.
- Log in to the Gateway as the root user
- Ensure both hard disks are present: fdisk -l 2> /dev/null | sed '/Disk \/dev\/md/,+5d' | grep Disk
NOTE: Two identically sized devices should be displayed. The smaller device may be ignored. The output may appear as follows:
Disk /dev/sdb: 300.0 GB, 300000000000 bytes
Disk /dev/sdc: 3880 MB, 3880452096 bytes
Disk /dev/sdd: 300.0 GB, 300000000000 bytes
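Picking out the two matching devices by eye is easy to get wrong when several disks are listed. The sketch below is one possible way to automate it: the function name paired_disks is hypothetical, and it assumes the older fdisk output format shown above, where the byte count is the second-to-last field of each "Disk" line.

```shell
# Hypothetical helper: read "Disk ..." lines on standard input and print
# only the devices whose size in bytes is shared with another device.
# Assumes lines of the form: Disk /dev/sdb: 300.0 GB, 300000000000 bytes
paired_disks() {
    awk '
        $1 == "Disk" {
            dev = $2
            sub(/:$/, "", dev)          # strip the trailing colon
            bytes = $(NF - 1) ""        # force string so it is a stable key
            size[dev] = bytes
            count[bytes]++
        }
        END {
            for (d in size)
                if (count[size[d]] > 1)
                    print d
        }
    ' | sort
}
```

The output of the fdisk command from the step above can be piped straight into paired_disks.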
- Verify the logical identifier of the new disk: for i in `egrep -v 'md[0-9]|sd[a-z][0-9]' /proc/partitions | tail -3 | awk ' $3 > 3789504 ' | awk '{print $4}'`; do echo -ne "$i "; grep -c $i /proc/partitions ; done
NOTE: The new disk is the logical identifier followed by the lower count. The existing disk is the logical identifier followed by the higher count. Note which disk is new and which disk is existing. An example output is as follows:
sdb 6
sdd 1
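Because the later steps write to the new disk, it can help to make the new/existing decision explicit rather than reading it off the counts. The following is a minimal sketch under stated assumptions: the function name classify_disks is hypothetical, and it expects exactly two lines of "device count" input in the format shown above. It refuses to answer if the counts are equal, since the rule in the note cannot distinguish the disks in that case.

```shell
# Hypothetical helper: read two "<device> <count>" lines on standard input
# and report which device is new (lower count) and which is existing.
# Exits non-zero if there are not exactly two lines or the counts match.
classify_disks() {
    sort -k2,2n | awk '
        NR == 1 { new = $1; low = $2 }
        NR == 2 { existing = $1; high = $2 }
        END {
            if (NR != 2 || low == high)
                exit 1
            printf "new=%s existing=%s\n", new, existing
        }
    '
}
```

With the example output above as input, this prints new=sdd existing=sdb.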
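- Copy the partition table from the existing disk to the new disk. In this example, sdb is the existing disk and sdd is the new disk; substitute the identifiers noted in the previous step: sfdisk -d /dev/sdb | sfdisk /dev/sdd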
NOTE: The following messages should be printed to the console if the operation completes successfully:
Successfully wrote the new partition table
Re-reading the partition table ...
- Reassemble the RAID arrays for the new disk:
mdadm --manage /dev/md0 --add /dev/sdX1
mdadm --manage /dev/md2 --add /dev/sdX2
mdadm --manage /dev/md1 --add /dev/sdX5
NOTE: Replace /dev/sdX with the logical identifier of the new disk.
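To avoid typing the wrong partition number, the three commands can be generated from the new disk's identifier and reviewed before they are run. This is only a sketch: the function name print_add_commands is hypothetical, it prints the commands rather than executing them, and the array-to-partition mapping (md0 to partition 1, md2 to partition 2, md1 to partition 5) is copied from the commands above.

```shell
# Hypothetical helper: print (do not run) the mdadm commands for the
# given new disk, e.g. "print_add_commands sdd".
# The array/partition pairs are taken from this article.
print_add_commands() {
    disk=$1
    case $disk in
        sd[a-z]) ;;
        *) echo "usage: print_add_commands sdX" >&2; return 1 ;;
    esac
    while read -r array part; do
        echo "mdadm --manage /dev/$array --add /dev/$disk$part"
    done <<EOF
md0 1
md2 2
md1 5
EOF
}
```

The printed lines can then be checked against the steps above and run by hand as root.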
- Periodically check the synchronization of the RAID arrays: cat /proc/mdstat
NOTE: A set of RAID arrays currently undergoing synchronization may appear as follows:
md0 : active raid1 sdd1[0] sdb1[1]
104320 blocks [2/2] [UU]
md2 : active raid1 sdd2[2] sdb2[1]
5245120 blocks [2/1] [_U]
[====>................] recovery = 21.1% (1110528/5245120) finish=0.4min speed=158646K/sec
md1 : active raid1 sdd5[2] sdb5[1]
283418624 blocks [2/1] [_U]
resync=DELAYED
- Verify that the RAID arrays have finished synchronizing: cat /proc/mdstat
NOTE: A synchronized set of RAID arrays will appear as follows:
md0 : active raid1 sdd1[0] sdb1[1]
104320 blocks [2/2] [UU]
md2 : active raid1 sdd2[0] sdb2[1]
5245120 blocks [2/2] [UU]
md1 : active raid1 sdd5[2] sdb5[1]
283418624 blocks [2/2] [UU]
The process is complete once every RAID array reports the status shown above.
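The final check can also be scripted instead of being read by eye. The sketch below is an illustration, not a vendor tool: the function name raid_in_sync is hypothetical, and it assumes the mdstat layout shown in this article, where each array has a "blocks" line ending in a status such as [UU] or [_U] and an in-progress array also shows a recovery or resync line.

```shell
# Hypothetical helper: succeed only if every array in the given mdstat
# file (default /proc/mdstat) shows all members up and no recovery or
# resync is pending. Assumes the output layout shown in this article.
raid_in_sync() {
    awk '
        /recovery|resync/ { bad = 1 }          # rebuild running or delayed
        / blocks / && /\[[U_]+\]/ {
            arrays++
            if ($0 ~ /\[[U_]*_[U_]*\]/)        # "_" marks a missing member
                bad = 1
        }
        END { if (bad || arrays == 0) exit 1 }
    ' "${1:-/proc/mdstat}"
}
```

One possible use is a simple wait loop such as: until raid_in_sync; do sleep 60; done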