February 23, 2013
Hi,

I've been running Unraid 5.0rc10 for a while now with no issues, until early last week. One morning, I noticed that one of my data drives had a red ball. Looking at the logs, I got confirmation that an issue occurred overnight when the mover started. See the log extract below:

Feb 18 03:40:17 Mediaserver kernel: ata2.00: exception Emask 0x10 SAct 0x0 SErr 0x400101 action 0x6 frozen
Feb 18 03:40:17 Mediaserver kernel: ata2.00: irq_stat 0x08000000, interface fatal error
Feb 18 03:40:17 Mediaserver kernel: ata2: SError: { RecovData UnrecovData Handshk }
Feb 18 03:40:17 Mediaserver kernel: ata2.00: failed command: WRITE DMA EXT
Feb 18 03:40:17 Mediaserver kernel: ata2.00: cmd 35/00:00:a0:a1:83/00:04:7c:00:00/e0 tag 0 dma 524288 out
Feb 18 03:40:17 Mediaserver kernel: res 50/00:00:47:00:cc/00:00:7c:00:00/e0 Emask 0x10 (ATA bus error)
Feb 18 03:40:17 Mediaserver kernel: ata2.00: status: { DRDY }
Feb 18 03:40:17 Mediaserver kernel: ata2: hard resetting link
Feb 18 03:40:17 Mediaserver kernel: ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
Feb 18 03:40:17 Mediaserver kernel: ata2.00: configured for UDMA/33
Feb 18 03:40:17 Mediaserver kernel: ata2: EH complete
Feb 18 03:40:17 Mediaserver kernel: ata2.00: exception Emask 0x10 SAct 0x0 SErr 0x400100 action 0x6 frozen
Feb 18 03:40:17 Mediaserver kernel: ata2.00: irq_stat 0x08000000, interface fatal error
Feb 18 03:40:17 Mediaserver kernel: ata2: SError: { UnrecovData Handshk }
Feb 18 03:40:17 Mediaserver kernel: ata2.00: failed command: WRITE DMA EXT
Feb 18 03:40:17 Mediaserver kernel: ata2.00: cmd 35/00:00:a0:a1:83/00:04:7c:00:00/e0 tag 0 dma 524288 out
Feb 18 03:40:17 Mediaserver kernel: res 50/00:02:00:00:00/00:00:00:00:00/a0 Emask 0x10 (ATA bus error)
Feb 18 03:40:17 Mediaserver kernel: ata2.00: status: { DRDY }

Followed later by this:

Feb 18 03:40:20 Mediaserver kernel: md: disk2 write error
Feb 18 03:40:20 Mediaserver kernel: handle_stripe write error: 2089002288/2, count: 1
Feb 18 03:40:20 Mediaserver kernel: md: disk2 write error
Feb 18 03:40:20 Mediaserver kernel: handle_stripe write error: 2089002296/2, count: 1
Feb 18 03:40:20 Mediaserver kernel: md: disk2 write error
Feb 18 03:40:20 Mediaserver kernel: handle_stripe write error: 2089002304/2, count: 1
Feb 18 03:40:20 Mediaserver kernel: md: disk2 write error
Feb 18 03:40:20 Mediaserver kernel: handle_stripe write error: 2089002312/2, count: 1
Feb 18 03:40:20 Mediaserver kernel: md: disk2 write error
Feb 18 03:40:20 Mediaserver kernel: handle_stripe write error: 2089002320/2, count: 1
Feb 18 03:40:20 Mediaserver kernel: md: disk2 write error
Feb 18 03:40:20 Mediaserver kernel: handle_stripe write error: 2089002328/2, count: 1
Feb 18 03:40:20 Mediaserver kernel: md: recovery thread has nothing to resync

So, I ran a SMART test on the drive; nothing showed up. The wiki suggested that the cause could be something other than the drive, so I opened the server and checked the cables. It seemed the SATA cable was not fully plugged in (I had to move the server a few days before, which could be why it came loose). Once I had plugged it back in properly, I decided to re-initialize the drive: stopped the array, removed the disk, started the array, stopped it again, re-assigned the drive, and let the parity rebuild run to completion.

Everything looked fine until I tried to access some of the content stored on that disk. Using XBMC (Frodo) on both my PC and an OpenELEC box, movie playback showed artefacts and stopped playing back after a few minutes (the movies were on the disk prior to the issue).

So, I stopped the array and ran a reiserfsck check:

reiserfsck --check started at Sat Feb 23 08:40:09 2013
###########
Replaying journal: Done.
Reiserfs journal '/dev/md2' in blocks [18..8211]: 0 transactions replayed
Checking internal tree..
block 91002061: The level of the node (11010) is not correct, (1) expected the problem in the internal node occured (91002061)
block 71424461: The level of the node (33321) is not correct, (1) expected the problem in the internal node occured (71424461)
block 92646861: The level of the node (11861) is not correct, (1) expected the problem in the internal node occured (92646861)
block 129990861: The level of the node (58975) is not correct, (1) expected the problem in the internal node occured (129990861)
block 68691661: The level of the node (48791) is not correct, (1) expected the problem in the internal node occured (68691661)
block 115648461: The level of the node (24388) is not correct, (1) expected the problem in the internal node occured
block 37180036: The level of the node (57260) is not correct, (1) expected the problem in the internal node occured
block 117741261: The level of the node (36482) is not correct, (1) expected the problem in the internal node occured
block 125421261: The level of the node (58497) is not correct, (1) expected the problem in the internal node occured
block 147162061: The level of the node (57552) is not correct, (1) expected the problem in the internal node occured
block 162368461: The level of the node (62602) is not correct, (1) expected the problem in the internal node occured (162368461)
block 166573261: The level of the node (35703) is not correct, (1) expected the problem in the internal node occured
block 175904461: The level of the node (8612) is not correct, (1) expected the problem in the internal node occured (175904461)
block 177978061: The level of the node (33819) is not correct, (1) expected the problem in the internal node occured (177978061)
block 179379661: The level of the node (16992) is not correct, (1) expected the problem in the internal node occured (179379661)
block 182861261: The level of the node (49115) is not correct, (1) expected the problem in the internal node occured
block 190566861: The level of the node (59600) is not correct, (1) expected the problem in the internal node occured (190566861)
block 194042061: The level of the node (6611) is not correct, (1) expected the problem in the internal node occured
block 113792461: The level of the node (60719) is not correct, (1) expected the problem in the internal node occured
finished
Comparing bitmaps..vpf-10640: The on-disk and the correct bitmaps differs.
Bad nodes were found, Semantic pass skipped
19 found corruptions can be fixed only when running with --rebuild-tree
###########
reiserfsck finished at Sat Feb 23 09:07:35 2013

Launched the rebuild-tree as indicated:

reiserfsck --rebuild-tree started at Sat Feb 23 09:15:26 2013
###########
Pass 0:
####### Pass 0 #######
Loading on-disk bitmap .. ok, 366373557 blocks marked used
Skipping 23115 blocks (super block, journal, bitmaps) 366350442 blocks will be read
0%.. left 177973105, 27759 /sec
sync_buffers: buffer list is corrupted
Aborted

Now the disk appears unformatted in Unraid:

root@Mediaserver:~# mount /dev/md2 /mnt/disk2
mount: /dev/md2: can't read superblock

I tend to believe that it might be a hard-disk-related issue... but any comment is welcome!

Thanks!
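For anyone reading along with a long syslog of the same kind, here is a minimal Python sketch for tallying the two message types quoted in the first post. It is only an illustration: the function name `summarize` and the two regular expressions are my own, written to match the exact wording of the lines shown above, and other kernel versions may word these messages differently.

```python
import re
from collections import Counter

# Patterns written against the exact lines quoted in the post (an assumption:
# other kernels may format these messages differently).
ATA_FAILED = re.compile(r"kernel: (ata\d+)\.\d+: failed command: (.+)$")
MD_WRITE_ERR = re.compile(r"kernel: md: (disk\d+) write error")


def summarize(lines):
    """Count failed ATA commands per (port, command) and md write errors per disk."""
    ata = Counter()
    md = Counter()
    for line in lines:
        m = ATA_FAILED.search(line)
        if m:
            ata[(m.group(1), m.group(2).strip())] += 1
            continue
        m = MD_WRITE_ERR.search(line)
        if m:
            md[m.group(1)] += 1
    return ata, md


# A few lines copied from the extract above.
sample = [
    "Feb 18 03:40:17 Mediaserver kernel: ata2.00: failed command: WRITE DMA EXT",
    "Feb 18 03:40:17 Mediaserver kernel: ata2: hard resetting link",
    "Feb 18 03:40:20 Mediaserver kernel: md: disk2 write error",
    "Feb 18 03:40:20 Mediaserver kernel: md: disk2 write error",
]
ata, md = summarize(sample)
print(dict(ata))
print(dict(md))
```

Seeing failed writes on one ATA port line up in time with md write errors on one disk is consistent with a link problem on that port, but the tally alone cannot tell a bad cable from a bad drive.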
February 23, 2013
I'd start back with a

reiserfsck --check /dev/md2

and work from there. Let it guide you as to how to proceed.

unRAID always shows "unformatted" when a drive does not mount. (Sadly, it should show "cannot be mounted" instead if it knows it used to have a file-system on it, but it does not.)

Joe L.
February 23, 2013 Author
Hi Joe, thanks for helping!

I ran a new reiserfsck --check, which aborted because the previous reiserfsck --rebuild-tree had been aborted. So I gave reiserfsck --rebuild-tree another chance; it seems to work now and has started to correct some files... It looks like there are still a few hours to go (it has already been running for 4h30), so I will look at it when I wake up tomorrow morning!
February 24, 2013 Author
So, the rebuild-tree completed and corrected some files. The following reiserfsck --check showed no errors.

However, I still had the same issues with the videos. Since I had a copy of one of them, I did a binary compare, which concluded "files are different". I replaced the one in Unraid with the older copy: no more issue.

So it seems the parity has the wrong information, since it rebuilt the drive as it is now. I don't know what could have affected files that were not being written when I got this issue...

Any suggestions about what could be done next?
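The binary compare mentioned above can be sketched in a few lines of Python. The post does not say which tool was used, so this is only an illustration; `first_difference` is a hypothetical helper, not part of Unraid or any tool named in the thread. It reports where two files first diverge, which can hint at whether the damage is a small region or the whole file.

```python
def first_difference(path_a, path_b, chunk_size=1 << 20):
    """Return the byte offset of the first difference, or None if the files match."""
    offset = 0
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            a = fa.read(chunk_size)
            b = fb.read(chunk_size)
            if a != b:
                # Locate the first differing byte inside this chunk.
                for i, (x, y) in enumerate(zip(a, b)):
                    if x != y:
                        return offset + i
                # No differing byte in the overlap: one file is a prefix of the other.
                return offset + min(len(a), len(b))
            if not a:
                # Both files ended at the same point with no difference.
                return None
            offset += len(a)
```

Called as `first_difference("/mnt/disk2/movie.mkv", "/backup/movie.mkv")` (both paths are made-up examples), it returns `None` for identical files and a byte offset otherwise.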
February 26, 2013 Author
Ran a parity check overnight... more than 20000 errors detected. Since I have several files impacted, I tend to think that the parity is correct, but not the data on the drive.

I got a new hard drive today and will install it in the tower as a replacement for the problematic drive; we will see what happens when Unraid rebuilds it.
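To make the parity reasoning in this thread concrete, here is a toy Python sketch. It assumes a single parity disk computed as the byte-wise XOR of all data disks, which is my understanding of how single parity works here; the three tiny "disks" and the name `xor_blocks` are invented for illustration. The point is that a rebuilt disk is simply whatever parity and the other disks imply, so a rebuild is only as good as the parity it starts from.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte strings together, byte by byte."""
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))


# Three made-up data disks of three bytes each.
disk1 = bytes([0x11, 0x22, 0x33])
disk2 = bytes([0xA0, 0xB0, 0xC0])
disk3 = bytes([0x05, 0x06, 0x07])

# Parity is the XOR of all data disks.
parity = xor_blocks([disk1, disk2, disk3])

# Rebuilding disk2 from good parity plus the other disks recovers it exactly.
rebuilt = xor_blocks([parity, disk1, disk3])
print(rebuilt == disk2)

# If one parity byte is wrong, the rebuilt disk is wrong at that same position
# and nothing in the rebuild itself flags it.
bad_parity = bytes([parity[0] ^ 0xFF]) + parity[1:]
rebuilt_bad = xor_blocks([bad_parity, disk1, disk3])
print(rebuilt_bad == disk2)
```

A parity check only reports that parity and data disagree at some position; it cannot say which side is wrong, which is why comparing against a known-good copy, as done earlier in the thread, is the more telling test.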