Corrupted disk - Failed rebuild-tree - General Support (V5 and Older)

February 23, 2013

Hi,

I've been running Unraid 5.0rc10 for a while now with no issues, until early last week.

One morning, I noticed that one of my data drives had a red ball. The logs confirmed that a problem occurred overnight, just as the mover started.

See extract of logs below:

Feb 18 03:40:17 Mediaserver kernel: ata2.00: exception Emask 0x10 SAct 0x0 SErr 0x400101 action 0x6 frozen
Feb 18 03:40:17 Mediaserver kernel: ata2.00: irq_stat 0x08000000, interface fatal error
Feb 18 03:40:17 Mediaserver kernel: ata2: SError: { RecovData UnrecovData Handshk }
Feb 18 03:40:17 Mediaserver kernel: ata2.00: failed command: WRITE DMA EXT
Feb 18 03:40:17 Mediaserver kernel: ata2.00: cmd 35/00:00:a0:a1:83/00:04:7c:00:00/e0 tag 0 dma 524288 out
Feb 18 03:40:17 Mediaserver kernel:          res 50/00:00:47:00:cc/00:00:7c:00:00/e0 Emask 0x10 (ATA bus error)
Feb 18 03:40:17 Mediaserver kernel: ata2.00: status: { DRDY }
Feb 18 03:40:17 Mediaserver kernel: ata2: hard resetting link
Feb 18 03:40:17 Mediaserver kernel: ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
Feb 18 03:40:17 Mediaserver kernel: ata2.00: configured for UDMA/33
Feb 18 03:40:17 Mediaserver kernel: ata2: EH complete
Feb 18 03:40:17 Mediaserver kernel: ata2.00: exception Emask 0x10 SAct 0x0 SErr 0x400100 action 0x6 frozen
Feb 18 03:40:17 Mediaserver kernel: ata2.00: irq_stat 0x08000000, interface fatal error
Feb 18 03:40:17 Mediaserver kernel: ata2: SError: { UnrecovData Handshk }
Feb 18 03:40:17 Mediaserver kernel: ata2.00: failed command: WRITE DMA EXT
Feb 18 03:40:17 Mediaserver kernel: ata2.00: cmd 35/00:00:a0:a1:83/00:04:7c:00:00/e0 tag 0 dma 524288 out
Feb 18 03:40:17 Mediaserver kernel:          res 50/00:02:00:00:00/00:00:00:00:00/a0 Emask 0x10 (ATA bus error)
Feb 18 03:40:17 Mediaserver kernel: ata2.00: status: { DRDY }

Followed later by this:

Feb 18 03:40:20 Mediaserver kernel: md: disk2 write error
Feb 18 03:40:20 Mediaserver kernel: handle_stripe write error: 2089002288/2, count: 1
Feb 18 03:40:20 Mediaserver kernel: md: disk2 write error
Feb 18 03:40:20 Mediaserver kernel: handle_stripe write error: 2089002296/2, count: 1
Feb 18 03:40:20 Mediaserver kernel: md: disk2 write error
Feb 18 03:40:20 Mediaserver kernel: handle_stripe write error: 2089002304/2, count: 1
Feb 18 03:40:20 Mediaserver kernel: md: disk2 write error
Feb 18 03:40:20 Mediaserver kernel: handle_stripe write error: 2089002312/2, count: 1
Feb 18 03:40:20 Mediaserver kernel: md: disk2 write error
Feb 18 03:40:20 Mediaserver kernel: handle_stripe write error: 2089002320/2, count: 1
Feb 18 03:40:20 Mediaserver kernel: md: disk2 write error
Feb 18 03:40:20 Mediaserver kernel: handle_stripe write error: 2089002328/2, count: 1
Feb 18 03:40:20 Mediaserver kernel: md: recovery thread has nothing to resync
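
For anyone wanting to tally these lines from a full syslog, here is a minimal Python sketch. The two regular expressions are derived only from the line shapes in the extract above. Reading the first number after "handle_stripe write error:" as a position on the disk is an assumption, and `summarize` is a hypothetical helper name, not part of Unraid.

```python
import re
from collections import Counter

# Patterns inferred from the log extract above; other kernel versions may differ.
MD_ERR = re.compile(r"kernel: md: (disk\d+) write error")
STRIPE = re.compile(r"handle_stripe write error: (\d+)/(\d+), count: (\d+)")

def summarize(lines):
    """Return (write-error count per disk, sorted list of reported positions)."""
    per_disk = Counter()
    positions = []
    for line in lines:
        m = MD_ERR.search(line)
        if m:
            per_disk[m.group(1)] += 1
            continue
        s = STRIPE.search(line)
        if s:
            # Assumption: the first number is a position on the disk.
            positions.append(int(s.group(1)))
    return per_disk, sorted(positions)

sample = [
    "Feb 18 03:40:20 Mediaserver kernel: md: disk2 write error",
    "Feb 18 03:40:20 Mediaserver kernel: handle_stripe write error: 2089002288/2, count: 1",
    "Feb 18 03:40:20 Mediaserver kernel: md: disk2 write error",
    "Feb 18 03:40:20 Mediaserver kernel: handle_stripe write error: 2089002296/2, count: 1",
]
per_disk, positions = summarize(sample)
print(dict(per_disk), positions)
```

A tight cluster of consecutive positions on a single disk, as here, fits one interrupted write burst rather than damage scattered across the surface.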

So, I ran a SMART test on the drive, and nothing showed up.

The wiki suggested that the cause could be something other than the drive itself, so I opened the server and checked the cables. The SATA cable did not seem to be fully plugged in (I had to move the server a few days before, which could be why it came loose).

Once I had plugged it back in properly, I decided to re-initialize the drive. I stopped the array, removed the disk, started the array, stopped it again, re-assigned the drive, and let the rebuild from parity run to completion.

Everything looked fine until I tried to access some of the content stored on that disk. Using XBMC (Frodo) on both my PC and an Openelec box, movie playback showed some artefacts and stopped playing back after a few minutes (the movies were on the disk prior to the issue).

So, I stopped the array and ran a reiserfsck check:

reiserfsck --check started at Sat Feb 23 08:40:09 2013
###########
Replaying journal: Done.
Reiserfs journal '/dev/md2' in blocks [18..8211]: 0 transactions replayed
Checking internal tree.. \/  2 (of  14//  4 (of 161\/170 (of 170/block 91002061: The level of the node (11010) is not correct, (1) expected
the problem in the internal node occured (91002061), / 21 (of 161// 84 (of 170/block 71424461: The level of the node (33321) is not correct, (1) expected
the problem in the internal node occured (71424461), / 32 (of 161\/ 27 (of 170-block 92646861: The level of the node (11861) is not correct, (1) expected
the problem in the internal node occured (92646861), / 43 (of 161//123 (of 149|block 129990861: The level of the node (58975) is not correct, (1) expected
the problem in the internal node occured (129990861),/ 47 (of 161|/ 52 (of 170|block 68691661: The level of the node (48791) is not correct, (1) expected
the problem in the internal node occured (68691661), /150 (of 161// 93 (of 170-block 115648461: The level of the node (24388) is not correct, (1) expected
the problem in the internal node occured/  3 (of  14-/114 (of 161|/ 56 (of 161|block 37180036: The level of the node (57260) is not correct, (1) expected
the problem in the internal node occured/  4 (of  14-/140 (of 148|/103 (of 170\block 117741261: The level of the node (36482) is not correct, (1) expected
the problem in the internal node occured/  5 (of  14|/161 (of 165// 24 (of 170/block 125421261: The level of the node (58497) is not correct, (1) expected
the problem in the internal node occured/  7 (of  14// 83 (of 170|/ 73 (of 170/block 147162061: The level of the node (57552) is not correct, (1) expected
the problem in the internal node occured/  9 (of  14|/114 (of 170|/ 50 (of 170-block 162368461: The level of the node (62602) is not correct, (1) expected
the problem in the internal node occured (162368461),/138 (of 170|/162 (of 170-block 166573261: The level of the node (35703) is not correct, (1) expected
the problem in the internal node occured/ 10 (of  14\/ 22 (of 170\/138 (of 170/block 175904461: The level of the node (8612) is not correct, (1) expected
the problem in the internal node occured (175904461),/ 35 (of 170-/ 40 (of 170-block 177978061: The level of the node (33819) is not correct, (1) expected
the problem in the internal node occured (177978061),/ 42 (of 170// 68 (of 170/block 179379661: The level of the node (16992) is not correct, (1) expected
the problem in the internal node occured (179379661),/140 (of 170// 48 (of 170/block 182861261: The level of the node (49115) is not correct, (1) expected
the problem in the internal node occured/ 11 (of  14|/ 17 (of 135// 22 (of 170\block 190566861: The level of the node (59600) is not correct, (1) expected
the problem in the internal node occured (190566861),/ 22 (of 135|/108 (of 170|block 194042061: The level of the node (6611) is not correct, (1) expected
the problem in the internal node occured/ 14 (of  14\/ 27 (of 136// 69 (of 170-block 113792461: The level of the node (60719) is not correct, (1) expected
the problem in the internal node occurefinished
Comparing bitmaps..vpf-10640: The on-disk and the correct bitmaps differs.
Bad nodes were found, Semantic pass skipped
19 found corruptions can be fixed only when running with --rebuild-tree
###########
reiserfsck finished at Sat Feb 23 09:07:35 2013
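
The closing lines of a report like this one are what decide the next step. As a rough sketch, the recommendation can be pulled out of a saved report as follows. The "--rebuild-tree" wording is taken from the output above; that a "--fix-fixable" recommendation follows the same sentence shape is an assumption, and `recommended_action` is a hypothetical helper, not part of reiserfsprogs.

```python
import re

def recommended_action(report):
    """Return (flag, corruption_count) suggested by a reiserfsck --check
    report, or (None, 0) if no recommendation line is found."""
    # Wording taken from the report above; the optional "only" is there
    # on the assumption that other recommendations use the same sentence.
    m = re.search(
        r"(\d+) found corruptions can be fixed (?:only )?when running with (--[\w-]+)",
        report,
    )
    if m:
        return m.group(2), int(m.group(1))
    return None, 0

tail = (
    "Bad nodes were found, Semantic pass skipped\n"
    "19 found corruptions can be fixed only when running with --rebuild-tree\n"
)
flag, count = recommended_action(tail)
print(flag, count)
```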

I launched the rebuild-tree as indicated:

reiserfsck --rebuild-tree started at Sat Feb 23 09:15:26 2013
###########

Pass 0:
####### Pass 0 #######
Loading on-disk bitmap .. ok, 366373557 blocks marked used
Skipping 23115 blocks (super block, journal, bitmaps) 366350442 blocks will be read
0%..                                                  left 177973105, 27759 /secsync_buffers: buffer list is corrupted
Aborted

Now the disk appears unformatted in Unraid:

root@Mediaserver:~# mount /dev/md2 /mnt/disk2
mount: /dev/md2: can't read superblock

I tend to believe that it might be a hard disk related issue... but any comment is welcome!

Thanks!


February 23, 2013

I'd start back with a

reiserfsck --check /dev/md2

and work from there.

Let it guide you as to how to proceed.

unRAID always shows "unformatted" when a drive does not mount. (Sadly, it should show "cannot be mounted" instead if it knows it used to have a file-system on it, but it does not.)

Joe L.


February 23, 2013

Author

Hi Joe,

thanks for helping!

I ran a new reiserfsck --check, but it aborted because the previous reiserfsck --rebuild-tree had not completed.

So, I gave reiserfsck --rebuild-tree another chance; it seems to work now and has started to correct some files. It looks like it still needs a few hours to complete (it has already been running for 4h30), so I will check on it when I wake up tomorrow morning!


February 24, 2013

Author

So,

the rebuild-tree completed and corrected some files. The reiserfsck --check that followed showed no errors. However, I still had the same issues with the videos.

Since I had a copy of one of them, I did a binary compare, which reported "files are different". I replaced the one on Unraid with my older copy, and the issue was gone.
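
For files too large to compare comfortably by hand, a chunked binary compare can be sketched in Python as below. It reports the offset of the first differing byte rather than just "different". The function name `first_difference` and the temporary demo files are made up for illustration; they are not the tool used in this thread.

```python
import os
import tempfile

def first_difference(path_a, path_b, chunk_size=1 << 20):
    """Return the byte offset of the first difference, or None if identical."""
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        offset = 0
        while True:
            a = fa.read(chunk_size)
            b = fb.read(chunk_size)
            if a != b:
                for i, (x, y) in enumerate(zip(a, b)):
                    if x != y:
                        return offset + i
                # Common prefix matches, so one file is simply shorter.
                return offset + min(len(a), len(b))
            if not a:
                return None  # both files ended without a mismatch
            offset += len(a)

# Demo on two throwaway files: a known-good copy and one with a flipped byte.
with tempfile.TemporaryDirectory() as tmp:
    good = os.path.join(tmp, "good.bin")
    bad = os.path.join(tmp, "bad.bin")
    payload = bytes(range(256)) * 8
    damaged = bytearray(payload)
    damaged[1000] ^= 0xFF
    with open(good, "wb") as f:
        f.write(payload)
    with open(bad, "wb") as f:
        f.write(damaged)
    same = first_difference(good, good)
    diff_at = first_difference(good, bad)

print(same, diff_at)
```

Knowing where the first difference sits can help tell a single damaged region from corruption spread through the whole file.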

So, it seems the parity holds the wrong information, since the drive was rebuilt from it into its current state. I don't know what could have affected files that were not being written when the issue occurred...

Any suggestions about what to do next?


February 26, 2013

Author

I ran a parity check overnight... more than 20000 errors detected. Since several files are affected, I tend to think that the parity is correct and the data on the drive is not.

I got a new hard drive today and will install it in the tower as a replacement for the problematic drive. We will see what happens when Unraid rebuilds it.
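
A toy illustration of why a parity check can count mismatches without saying which side is wrong. It assumes single parity computed as a bytewise XOR across the data disks, which is my understanding of how Unraid's parity works. The four-byte "disks" and the helper `xor_blocks` are invented for the example.

```python
from functools import reduce

def xor_blocks(blocks):
    """Bytewise XOR of equally sized blocks (assumed single-parity model)."""
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))

disk1 = bytes([0x11, 0x22, 0x33, 0x44])
disk2 = bytes([0xA0, 0xB0, 0xC0, 0xD0])
parity = xor_blocks([disk1, disk2])

# A rebuild recomputes the missing disk from parity plus the other disks.
rebuilt_disk2 = xor_blocks([parity, disk1])

# If one byte on disk2 changes without parity being updated, a check sees
# a mismatch at that position, but nothing marks data or parity as the culprit.
bad_disk2 = bytes([0xA0, 0xB1, 0xC0, 0xD0])
recomputed = xor_blocks([disk1, bad_disk2])
mismatches = [i for i, (p, q) in enumerate(zip(parity, recomputed)) if p != q]

print(rebuilt_disk2 == disk2, mismatches)
```

The same arithmetic cuts both ways: if parity itself is stale or damaged, a rebuild faithfully reproduces wrong data, so a known-good copy of a file is the only independent reference.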
