---
title: HasturRaidRecovery
---

## 2009-06-29

* [linux-raid thread][1]

### Log Analysis

* Controller(?) timed out and sdc3 ejected

        Jun 29 20:47:07 hastur kernel: ata11.00: failed to read SCR 1 (Emask=0x40)
        Jun 29 20:48:49 hastur kernel: INFO: task md3_raid5:3352 blocked for more than 120 seconds
        Jun 29 20:48:58 hastur kernel: ata11.02: hard resetting link
        Jun 29 20:48:58 hastur kernel: ata11.02: failed to read SCR 2 (Emask=0x40)
        Jun 29 20:48:58 hastur kernel: ata11.02: failed to read SCR 2 (Emask=0x40)
        Jun 29 20:48:58 hastur kernel: ata11.02: COMRESET failed (errno=-5)
        Jun 29 20:48:58 hastur kernel: ata11.02: failed to read SCR 0 (Emask=0x40)
        Jun 29 20:48:58 hastur kernel: ata11.02: reset failed, giving up
        Jun 29 20:48:58 hastur kernel: ata11.02: failed to recover link after 3 tries, disabling
        Jun 29 20:48:58 hastur kernel: ata11.02: disabled
        Jun 29 20:49:08 hastur kernel: sd 10:2:0:0: rejecting I/O to offline device
        Jun 29 20:49:08 hastur kernel: sd 10:2:0:0: rejecting I/O to offline device
        Jun 29 20:49:08 hastur kernel: ata11: EH complete
        Jun 29 20:49:08 hastur kernel: sd 10:2:0:0: rejecting I/O to offline device
        Jun 29 20:49:08 hastur kernel: raid5: Disk failure on sdc3, disabling device. Operation continuing on 5 devices
        Jun 29 20:49:11 hastur kernel: ata11.02: detaching (SCSI 10:2:0:0)

* Hot-removed sdd3 from array after enclosure alarm

        Jun 29 20:57:47 hastur kernel: ata11.03: disabled
        Jun 29 20:57:47 hastur kernel: sd 10:3:0:0: rejecting I/O to offline device
        Jun 29 20:57:47 hastur kernel: sd 10:3:0:0: rejecting I/O to offline device
        Jun 29 20:57:47 hastur kernel: sd 10:3:0:0: [sdd] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK,SUGGEST_OK
        Jun 29 20:57:47 hastur kernel: end_request: I/O error, dev sdd, sector 404675832
        Jun 29 20:57:47 hastur kernel: raid5:md3: read error not correctable (sector 298261272 on sdd3).
        Jun 29 20:57:47 hastur kernel: raid5: Disk failure on sdd3, disabling device. Operation continuing on 4 devices
        Jun 29 20:57:48 hastur kernel: ata11.03: detaching (SCSI 10:3:0:0)

* Marked array as readonly
* Shut down and removed system for maintenance.
* On reboot disks were renumbered.
* Ran non-destructive read/write badblocks test on all disks (ALL CLEAN)
* Attempted to re-add failed disks to array.
* Somehow managed to rewrite superblocks on disks I was attempting to re-add.

## data_offset

* Old version of mdadm created the original array with a data offset of 136 sectors into each component device.
* Versions since mdadm-2.6 support a new bitmap feature which moves the data offset to 272 sectors.
* Hex-editing the data offset and fixing the superblock checksum would be safe [according to Neil Brown][2].
* In the end, though, I compiled from source the version of mdadm that was used to create the original array.

### diff -u md3.sdb3.orig md3.sde3.new

    --- md3.sdb3.orig 2009-07-07 10:14:25.000000000 +0100
    +++ md3.sde3.new 2009-07-07 10:14:27.000000000 +0100
    @@ -2,26 +2,26 @@
      Magic : a92b4efc
      Version : 1.2
      Feature Map : 0x0
    - Array UUID : 2b7ca9c9:c9fa9a28:086e0f83:90cbef62
    + Array UUID : 679bc68c:aeb0464c:8f11e607:c8e58161
      Name : hastur:3 (local to host hastur)
    - Creation Time : Thu Oct 18 14:46:47 2007
    + Creation Time : Sun Jul 5 03:32:38 2009
      Raid Level : raid5
      Raid Devices : 6
    - Avail Dev Size : 870353369 (415.02 GiB 445.62 GB)
    + Avail Dev Size : 870353233 (415.02 GiB 445.62 GB)
      Array Size : 4351765760 (2075.08 GiB 2228.10 GB)
      Used Dev Size : 870353152 (415.02 GiB 445.62 GB)
    - Data Offset : 136 sectors
    + Data Offset : 272 sectors
      Super Offset : 8 sectors
      State : clean
    - Device UUID : c4983266:9ee820fd:106bbf9d:20a69333
    + Device UUID : 1b87acce:883de3fc:881f279e:e2b84a9b
    - Update Time : Mon Jun 29 21:05:55 2009
    - Checksum : 5a501eb5 - correct
    - Events : 320840
    + Update Time : Sun Jul 5 03:32:38 2009
    + Checksum : 94ba3ae4 - correct
    + Events : 0
      Layout : left-symmetric
      Chunk Size : 128K
    - Array Slot : 0 (failed, 1, 2, failed, failed, 4, 5)
    - Array State : _uu_uu 3 failed
    + Array Slot : 0 (0, 1, 2, 3, 4, 5)
    + Array State : Uuuuuu

## Loopback devices

* Created sparse loopback devices (first 50MB of each partition) to play with superblocks

        #!/bin/bash
        # bash rather than sh: {b..g} is a bash brace expansion
        BLOCKS_PER_DEV=$(sfdisk -s /dev/sdb3)
        for i in {b..g}
        do
            BLOCKS=$(sfdisk -s /dev/sd${i}3) # size in 1K blocks
            dd if=/dev/sd${i}3 of=isd${i}3 bs=512 count=102400 # copy first 50MB
            dd if=/dev/zero of=isd${i}3 bs=1k seek=$BLOCKS count=0 # extend sparsely to full size
            losetup -f isd${i}3
        done

## Permute

* Quick C++ to permute order of devices
* Output space-separated, one permutation per line

### permute-loop.cpp

* Permute [012345]

        #include <algorithm>
        #include <iterator>
        #include <vector>
        #include <iostream>
        using namespace std;

        int main(void)
        {
            vector<int> v;
            v.push_back(0);
            v.push_back(1);
            v.push_back(2);
            v.push_back(3);
            v.push_back(4);
            v.push_back(5);
            cout << "0 1 2 3 4 5" << endl; // initial
            while (next_permutation(v.begin(), v.end() ) )
            {
                // Loop until all permutations are generated.
                copy(v.begin(), v.end(), ostream_iterator<int>(cout, " "));
                cout << endl;
            }
            return 0;
        }

### permute-real.cpp

* Permute [bcdefg]

        #include <algorithm>
        #include <iterator>
        #include <vector>
        #include <iostream>
        using namespace std;

        int main(void)
        {
            vector<char> v;
            v.push_back('b');
            v.push_back('c');
            v.push_back('d');
            v.push_back('e');
            v.push_back('f');
            v.push_back('g');
            cout << "b c d e f g" << endl; // initial
            while (next_permutation(v.begin(), v.end() ) )
            {
                // Loop until all permutations are generated.
                copy(v.begin(), v.end(), ostream_iterator<char>(cout, " "));
                cout << endl;
            }
            return 0;
        }

### Compile

    g++ -o permute-loop permute-loop.cpp
    g++ -o permute-real permute-real.cpp

## Recovery script

    #!/bin/bash
    # bash rather than sh: uses &> and (( ))
    ECHO=                # set to echo to test
    MDADM=mdadm-2.5.6    # old version for old superblock data_offset size
    MD_DEV=md3
    CRYPT_DEV=crypt-md3
    # Assumes PASSWORD (the LUKS passphrase) is set in the environment;
    # the script never assigns it.

    ./permute-real | while read b c d e f g
    do
        echo /dev/sd${b}3 /dev/sd${c}3 /dev/sd${d}3 /dev/sd${e}3 /dev/sd${f}3 /dev/sd${g}3
        echo 'y' | $ECHO $MDADM -C --assume-clean -f -e 1.2 -l 5 -p ls -c 128 -n6 /dev/$MD_DEV /dev/sd${b}3 /dev/sd${c}3 /dev/sd${d}3 /dev/sd${e}3 /dev/sd${f}3 /dev/sd${g}3 &> /dev/null
        if (($? == 0))
        then
            sleep 0.3s
            $ECHO mdadm -o /dev/$MD_DEV
            if ($ECHO cryptsetup isLuks /dev/$MD_DEV )
            then
                echo -n " LUKS "
                echo "$PASSWORD" | if ($ECHO cryptsetup -T1 luksOpen /dev/$MD_DEV $CRYPT_DEV )
                then
                    echo -n " UNLOCKED "
                    if ( $ECHO mount -o ro /dev/mapper/$CRYPT_DEV mnt )
                    then
                        echo -n " MOUNTED "
                        $ECHO umount /dev/mapper/$CRYPT_DEV
                    fi
                    $ECHO cryptsetup luksClose $CRYPT_DEV
                fi
            fi
            sleep 0.3s
            $ECHO mdadm --stop /dev/$MD_DEV &> /dev/null
        fi
        echo ""
    done

## XFS Repair

* XFS wouldn't mount read-only if there were errors, so the script was inconclusive.
* Ran xfs_repair -n to determine which of the two probable configurations would need the fewest filesystem changes.
* Recreated correct configuration

        mdadm-2.5.6 -C --assume-clean -f -e 1.2 -l 5 -p ls -c 128 -n6 /dev/md3 /dev/sde3 /dev/sdd3 /dev/sdg3 /dev/sdf3 /dev/sdc3 /dev/sdb3

* Run mdadm check, speed limit

        echo -n check > /sys/block/md3/md/sync_action
        echo -n 10000 > /proc/sys/dev/raid/speed_limit_max

* Open, mount and unmount XFS, xfs_repair
  `xfs_repair /dev/mapper/crypt-md3`
* xfs\_repair reported that the log needed to be replayed by mounting and unmounting, then rerunning xfs\_repair

        mount /dev/mapper/crypt-md3 /mnt/md3
        umount /mnt/md3
        xfs_repair /dev/mapper/crypt-md3

* Final mount
  `mount /mnt/md3`

# Force Assemble?

* Recover array faster by forcing assemble: clear failed flag from enough disks to assemble

        # mdadm --assemble --force --scan /dev/md3
        mdadm: forcing event count in /dev/sdd3(2) from 5 upto 10
        mdadm: clearing FAULTY flag for device 3 in /dev/md3 for /dev/sdd3
        mdadm: /dev/md3 has been started with 5 drives (out of 6).

* Mark as readonly

        # mdadm -o /dev/md3

* How do we forcibly re-add a failed drive?

[1]: http://marc.info/?t=124696420800003&r=1&w=2
[2]: http://marc.info/?l=linux-raid&m=124710325903455&w=2

<!-- vim: filetype=markdown -->
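
## data_offset arithmetic (sketch)

A quick sanity check on the numbers in the `data_offset` section. The four values are copied from the superblock diff; the variable names are illustrative, and the 512-byte sector size is an assumption (it matches the `bs=512` used for the loopback images).

```shell
#!/bin/sh
# Values copied from the diff of md3.sdb3.orig and md3.sde3.new.
old_offset=136       # Data Offset in the original superblock (sectors)
new_offset=272       # Data Offset in the rewritten superblock (sectors)
old_avail=870353369  # Avail Dev Size before the rewrite (sectors)
new_avail=870353233  # Avail Dev Size after the rewrite (sectors)
sector_bytes=512     # assumption: 512-byte sectors

# How far the expected start of data moved, and how much space went with it.
shift_sectors=$((new_offset - old_offset))
lost_sectors=$((old_avail - new_avail))
shift_bytes=$((shift_sectors * sector_bytes))

echo "data start moved by $shift_sectors sectors ($shift_bytes bytes)"
echo "available size shrank by $lost_sectors sectors"
```

Both differences come out to 136 sectors, so the rewritten superblock expects the data to start 136 sectors later than where the original array put it. That is consistent with needing the old mdadm (or a hex-edited offset) when recreating the array.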
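
## Permute in shell (sketch)

The two C++ helpers in the Permute section differ only in their alphabet. As a possible alternative, the same orderings could be generated in plain shell without a compile step. `permute` is a hypothetical helper, not part of the original scripts, and it assumes the items contain no whitespace.

```shell
#!/bin/sh
# permute CHOSEN ITEM...
# Prints every ordering of ITEM..., one per line, space-separated.
# CHOSEN holds the items picked so far; pass "" at the top level.
permute() {
    chosen=$1
    shift
    if [ "$#" -eq 0 ]; then
        echo "$chosen"
        return
    fi
    for pick in "$@"; do
        # Everything except the current pick is still available.
        rest=
        for item in "$@"; do
            [ "$item" = "$pick" ] || rest="$rest $item"
        done
        # POSIX sh has no local variables, so recurse in a subshell
        # to keep this level's chosen/pick values intact.
        ( permute "${chosen:+$chosen }$pick" $rest )
    done
}
```

`permute "" b c d e f g` could then be piped into the `while read b c d e f g` loop in place of `./permute-real`. It forks a subshell per recursion step, which is fine for six devices but would be slow for much larger sets.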