Two drive failures



As the subject suggests, I've not had the best of luck, and I think this illustrates the need, however rare, for at least two parity drives when you have a large number of drives.

 

Background: I was going to swap out the hardware on my system, as it was overkill and I'd given up on using unRAID/Xen on it as a dual-purpose file server/Windows desktop machine.

 

Prior to swapping out the mainboard/CPU/RAM to use in a dedicated desktop, I had started a parity check.  When I checked it the next day, it appeared the relatively new, still-empty 3TB drive had started giving read errors at the very end of the parity check and gone offline.

 

The drive had been run through pre-clear about two months earlier without any signs of trouble.  I had added it to the array but had not actually moved or added any files to it.  I tried checking cables and moving the drive to another 5-in-3 drive bay, but the drive would only make head-clicking noises as it tried to come online.

 

I decided to go ahead with the mainboard swap, since the drives weren't being moved.  Since the failed drive had nothing on it, I decided to do a New Config and simply rebuild parity.  Only 5 minutes into the rebuild, one of the other drives also started giving read errors.  :o  This is the same drive that had just gone through a 10-hour parity check the night before without any errors.  I checked cables, but unRAID kept indicating the drive was Unmountable.

 

I moved the drive to another bay, yet unRAID kept saying it was Unmountable, with a popup saying the drive was in an error state.  I moved it to yet another bay and got the same error.  This time there is no clicking from the drive; it seems to spin up, but unRAID believes it's basically unformatted.

 

Edit: I've been advised in a reply that I should be using /dev/sdd1

root@Crushinator:/home# reiserfsck --check /dev/sdd
reiserfsck 3.6.24

Will read-only check consistency of the filesystem on /dev/sdd
Will put log info to 'stdout'

Do you want to run this program?[N/Yes] (note need to type Yes if you do):Yes

reiserfs_open: the reiserfs superblock cannot be found on /dev/sdd.
Failed to open the filesystem.

If the partition table has not been changed, and the partition is
valid  and  it really  contains  a reiserfs  partition,  then the
superblock  is corrupted and you need to run this utility with
--rebuild-sb.
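For anyone hitting the same error: the superblock message above came from pointing reiserfsck at the whole disk (/dev/sdd) rather than the data partition (/dev/sdd1). A quick way to see which node actually holds the filesystem is to list the block devices — a sketch, assuming a typical Linux box with util-linux installed:

```shell
# List disks vs. partitions with their filesystems. On an unRAID data disk
# the filesystem lives on the first partition (e.g. sdd1), not the raw
# device (sdd), which is why no superblock was found on /dev/sdd above.
lsblk -o NAME,SIZE,TYPE,FSTYPE
```

The TYPE column distinguishes `disk` from `part`; fsck tools should be pointed at the `part` entry carrying the FSTYPE.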

 

It would appear I should be reading through the second page of http://lime-technology.com/forum/index.php?topic=39149.0, as he seems to have the same issue. (Edit: No, different issue.)  Just wanted to pass along my experience, and I welcome any suggestions.  I find it really unlucky for both drives to go like this, but it has me thinking that, in large (14-drive) arrays like this, I may have to abandon unRAID and move on to something with dual parity like SnapRAID, even though the beauty of unRAID is the web GUI.  :-\

 

...Donovan

 

Edited to tweak formatting and clarify slightly.


Thanks.  Good catch.

 

I forgot to mention I had run a self-test, which did confirm the drive failure:

 

Disk 9 attached to port: sdd
Num	Test Description	Status	Remaining	LifeTime(hours)	LBA of first error
1	Extended offline	Completed: read failure	30%	32353	290878036
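For reference, the self-test above can be kicked off and read back from the console with smartmontools — a hedged sketch, with DEV as a placeholder for the failing disk (guarded so it does nothing destructive where the tool or device isn't present):

```shell
# Start an extended offline self-test and later read back the result table
# (the "Extended offline / Completed: read failure" row quoted above).
DEV=${DEV:-/dev/sdd}   # placeholder: the failing disk in this thread
if command -v smartctl >/dev/null 2>&1 && [ -b "$DEV" ]; then
    smartctl -t long "$DEV"       # begin the extended self-test (runs for hours)
    smartctl -l selftest "$DEV"   # show the self-test log; rerun after it finishes
else
    echo "smartctl or $DEV unavailable; sketch only"
fi
```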

 

And running against the correct partition didn't appear to indicate anything wrong?

 

root@Crushinator:/home# reiserfsck --check /dev/sdd1
reiserfsck 3.6.24

Will read-only check consistency of the filesystem on /dev/sdd1
Will put log info to 'stdout'

Do you want to run this program?[N/Yes] (note need to type Yes if you do):Yes
###########
reiserfsck --check started at Sat Apr 11 07:48:10 2015
###########
Replaying journal: Trans replayed: mountid 213, transid 21044, desc 4723, len 1, commit 4725, next trans offset 4708
Trans replayed: mountid 213, transid 21045, desc 4726, len 1, commit 4728, next trans offset 4711
Trans replayed: mountid 213, transid 21046, desc 4729, len 1, commit 4731, next trans offset 4714
Trans replayed: mountid 213, transid 21047, desc 4732, len 1, commit 4734, next trans offset 4717
Trans replayed: mountid 213, transid 21048, desc 4735, len 1, commit 4737, next trans offset 4720
Trans replayed: mountid 213, transid 21049, desc 4738, len 1, commit 4740, next trans offset 4723
Replaying journal: Done.                                                        
Reiserfs journal '/dev/sdd1' in blocks [18..8211]: 6 transactions replayed
Checking internal tree.. finished                                
Comparing bitmaps..finished
Checking Semantic tree:
finished                                                                       
No corruptions found
There are on the filesystem:
Leaves 401619
Internal nodes 2412
Directories 109
Other files 9217
Data block pointers 404063516 (0 of them are zero)
Safe links 0
###########
reiserfsck finished at Sat Apr 11 07:59:08 2015
###########

 

 

Edit: I powered the server down and brought it back up but unRAID still doesn't like the drive.

 

Unmountable disk present:

Disk 9 • WDC_WD20EARS-00MVWB0_WD-WMAZ20195097 (sdd)		Format will create a file system in all Unmountable disks, discarding all data currently on those disks.

 

 

Edit 2: In checking the reiserfsck options, I see that it doesn't scan all sectors of the drive (duh, that would take an extremely long time), so perhaps the sectors that can't be read are not currently occupied by any files... Presumably unRAID checks every bit on the partition, not just the files themselves, when it does a parity check, and that's when it encountered the read error.

 

--scan-whole-partition, -S
This option causes --rebuild-tree to scan the whole partition, not only used space on the partition.
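That distinction — reiserfsck walking only allocated blocks while a parity check reads every sector — can be tested directly with a whole-surface read. A sketch using dd; a scratch file stands in for the disk so this is safe to run, but on a real system the input would be the raw device (e.g. /dev/sdd), which takes hours:

```shell
# Read every "sector" front to back and discard the data; any unreadable
# region makes dd fail, just as the parity check tripped over it.
disk=$(mktemp)                                   # stand-in for /dev/sdd
dd if=/dev/zero of="$disk" bs=1M count=4 2>/dev/null
if dd if="$disk" of=/dev/null bs=1M 2>/tmp/surface.log; then
    surface=ok
else
    surface=bad
fi
echo "surface read: $surface"                    # "bad" => pending/reallocated sectors likely
```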

 

 

Edit 3: I tried removing the drive from the array, starting the array, then stopping it, adding the drive back and restarting; still unmountable.

 

I picked up a WD Red 3TB in the hopes of being able to transfer what I can off of the failing drive so I can rebuild parity.  I just don't know how to access the drive in order to copy the files over from the console.  I'm running a pre-clear on the new drive in the meantime.

 

Thanks for any suggestions.

 

...Donovan


Should have realized it would be something this trivial:

 

mkdir /mnt/baddrive
mount /dev/sdd1 /mnt/baddrive
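One refinement worth considering when pulling data off a dying disk: mount it read-only so nothing gets written to it in the process. A sketch — DEV is the partition from this thread, MNT a scratch mountpoint, and the mount is guarded so the snippet is harmless on a machine without that device:

```shell
DEV=/dev/sdd1        # the failing disk's data partition (from this thread)
MNT=/tmp/baddrive    # scratch mountpoint for this sketch
mkdir -p "$MNT"
if [ -b "$DEV" ]; then
    mount -o ro "$DEV" "$MNT"   # read-only: a dying disk shouldn't take writes
else
    echo "no block device $DEV on this machine; sketch only"
fi
```

unRAID normally mounts array disks itself; this is only for manually reaching a disk the array refuses to mount.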

 

The replacement drive is still going through preclear_bjp, but presumably the next steps are to add the drive to the array and use the copy (cp) command to copy everything to the new drive.  After that, I'll build parity, unless that's required as part of adding the new drive.

 

I'll continue updating this post in case it is helpful to someone in the future.

 

...Donovan


I've run into an interesting bug in beta14b:

 

I tried adding the new 3TB drive in place of the 2TB drive that failed and was previously marked unmountable with read errors by unRAID.  Parity has been lost because another drive died the day before this.

 

The first time I tried this, it simply started the array and said it was unmountable, just the same as before. I had to call it a night at the time.

 

Now, when I stop the array, set the drive to none, start the array, stop it again, and add the 3TB drive, it indicates I'm doing a drive swap and I need to check a box to confirm this to bring the array online.  When I do, it still says the drive is unmountable, and below that it says the drive is unformatted.  When I check the box to confirm I want to format and click to do it, the screen refreshes and nothing changes: no formatting progress or anything else; the drive is still unmountable and still reported as unformatted.

 

I wanted to add the drive to the array so that when I copy files from the failing drive to it, they won't have to be wiped out in order to add the drive to the array later.

 

...Donovan


Hey,

 

I gave up on getting any responses, so after posting a defect report (with a syslog) to the best of my ability (http://lime-technology.com/forum/index.php?topic=39288), I continued onward:

 

I stopped the array and used New Config.  After re-assigning all of the drives to the same positions, I restarted the array.  It then reported that the drive was still unmountable, yet started a parity rebuild anyway.  I stopped the parity rebuild and used the format function.  This time it actually worked and started formatting the drive.

 

From there, I wanted to copy the files from the failing drive, since it seemed fully intact except for read errors in presumably unused areas of the drive:

 

cp -Rfv /mnt/baddrive/* /mnt/disk9/

 

I had my terminal (Xshell 5 Free) set to buffer 10,240 lines, so I right-clicked on the console and sent the output to Notepad, where a search for "error" didn't bring up any errors from the copy process.  I realize this doesn't rule out corrupted files the drive might not be aware of, but hey, I could have lost the entire drive!
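A simpler way to catch errors than scrolling a terminal buffer is to redirect cp's stderr to a file and inspect it afterward. A sketch with throwaway directories standing in for /mnt/baddrive and /mnt/disk9:

```shell
# Copy everything while logging only errors (cp -v goes to stdout,
# errors go to stderr, so the log stays empty on a clean copy).
src=$(mktemp -d); dst=$(mktemp -d)       # stand-ins for the real mounts
echo "movie data" > "$src/film.mkv"
cp -Rfv "$src"/* "$dst"/ 2> /tmp/cp_errors.log
if [ -s /tmp/cp_errors.log ]; then
    echo "read errors during copy:"
    cat /tmp/cp_errors.log
else
    echo "copy completed with no errors logged"
fi
```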

 

I had considered running rsync -c to checksum all of the files and see if they are the same on both drives, but I wasn't sure this was worth the effort: the files could already have been corrupted on the original drive, so a clean comparison wouldn't confirm they're perfect, only that the copy still matches the source.
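For anyone who does want that verification step, here's a sketch equivalent in spirit to `rsync -c`, using only coreutils: hash every file on the source and re-check the hashes against the copy. The directories are stand-ins for the failing disk's mount and the new disk:

```shell
# Hash the source tree, then verify the copy against those hashes.
old=$(mktemp -d); new=$(mktemp -d)       # stand-ins for source and copy
echo "payload" > "$old/file1"
cp "$old/file1" "$new/file1"
( cd "$old" && find . -type f -exec md5sum {} + ) > /tmp/source.md5
( cd "$new" && md5sum -c /tmp/source.md5 )   # prints "OK" per matching file
```

As noted above, a clean result only proves the two copies match, not that either is uncorrupted.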

 

I also considered finding a utility to run through all of the video files and do a basic confirmation that they're not corrupt.  A brief search shows there is no simple tool for this, video files being what they are, so after checking that a few files play back fine, I'm going to assume they're all fine until I eventually find out that a particular file is corrupt when I try to play it.  I'm fairly confident they're fine, though.
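One rough option for that kind of scan, if ffmpeg is available: decode each file to the null muxer and print only decode errors. A hedged sketch — VIDEODIR is a placeholder for a share on the rebuilt disk, and the whole thing is guarded so it degrades gracefully where ffmpeg or the directory is missing:

```shell
# Decode-and-discard each video; "-v error" keeps the output silent
# except for genuine decode problems, which are prefixed with the filename.
VIDEODIR=${VIDEODIR:-/mnt/disk9}
if command -v ffmpeg >/dev/null 2>&1 && [ -d "$VIDEODIR" ]; then
    find "$VIDEODIR" -type f -name '*.mkv' -print0 |
    while IFS= read -r -d '' f; do
        ffmpeg -v error -i "$f" -f null - 2>&1 | sed "s|^|$f: |"
    done
else
    echo "ffmpeg or $VIDEODIR unavailable; sketch only"
fi
scan=done
```

This catches files that won't decode cleanly, not subtle single-bit flips in frames that still decode.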

 

I really think unRAID needs to evolve to support dual parity and address bit rot.  For now I'll keep running it, since I've been on it for a few years, but the GUI is the only thing currently keeping me from leaving for SnapRAID/MHDDFS.  If I find something equivalent (nothing will be as great as the current GUI in unRAID), I will be severely tempted to give it a shot.

 

I hope the detail in these posts helps anyone else in a similar situation.

 

...Donovan


I really think unRAID needs to evolve to support dual parity and address bit rot.  For now, I'll keep running it, because I've been running it for a few years but the GUI is the only thing currently keeping me from leaving for SnapRAID/MHDDFS.  If I find something equivalent (nothing will be as great as the current GUI in unRAID), I will be severely tempted to give it a shot.

I agree that dual parity would be a great addition.

 

Note that BTRFS detects bitrot, so as long as the BTRFS support is stable enough, that could be the best way to get that issue addressed.

