--rebuild-tree failing Disk2 unmountable


cpthook


Forum, I need help!

 

I woke up yesterday morning to find reiserfs errors in my syslog pointing to md2, which was automatically switched to read-only.  So I rebooted, and now disk2 is unmountable.  S.M.A.R.T. reports no failures (which doesn't say much) and the disk is green-balled.  Per the wiki recommendations I have done the following:

 

1. Placed the array in maintenance mode and ran a file system check (reiserfsck --check /dev/md2).  The check found one error that required the --rebuild-tree command to fix.

 

2. --rebuild-tree failed to finish with a segfault; I re-ran it, but it fails at the exact same point.  Re-ran reiserfsck --check /dev/md2 and got: Bad root block 0.  (--rebuild-tree did not complete.)

 

3. Installed a new replacement disk; started the array with drive 2 unassigned; stopped the array, assigned the new disk to the slot, and got a blue ball.  unRAID stated that bringing the array online would begin rebuilding the disk.  So I started the array, but the drive is still unmountable, both during and after the rebuild.

 

4. Tried --rebuild-tree on the newly rebuilt drive, but it fails at the same point as the original drive.
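For reference, the check/repair sequence in steps 1 and 2 was roughly the following (run from the console with the array in maintenance mode; /dev/md2 is my disk2 slot, substitute your own):

```shell
# Read-only pass first: reports filesystem problems without writing to the disk
reiserfsck --check /dev/md2

# Only if --check explicitly says it is needed: rebuild the internal tree.
# This is the destructive step, and the one that kept segfaulting for me.
reiserfsck --rebuild-tree /dev/md2
```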

 

Considering that the file system recovery is not working and I want to salvage the data on the drive, what should I do next?

 

diagnostics attached

 

 

 

 

tower-diagnostics-20170116-0807.zip


Check the event log in the BIOS or with IPMIView and look for ECC errors.

 

IPMIView shows video controller errors, but I think these are logged each time I reboot my VM.  I have an NVIDIA card passed through to my desktop VM.  I see no errors relating to ECC memory.

 

67	2017/01/07 17:55:58	BIOS POST Progress #0x00	BIOS POST Progress	Error-Unrecoverable video controller failure. - Assertion
68	2017/01/07 23:22:46	BIOS POST Progress #0x00	BIOS POST Progress	Error-Unrecoverable video controller failure. - Assertion
69	2017/01/10 12:44:46	BIOS POST Progress #0x00	BIOS POST Progress	Error-Unrecoverable video controller failure. - Assertion
70	2017/01/11 12:36:58	BIOS POST Progress #0x00	BIOS POST Progress	Error-Unrecoverable video controller failure. - Assertion
71	2017/01/11 18:19:26	BIOS POST Progress #0x00	BIOS POST Progress	Error-Unrecoverable video controller failure. - Assertion
72	2017/01/12 23:35:59	BIOS POST Progress #0x00	BIOS POST Progress	Error-Unrecoverable video controller failure. - Assertion
73	2017/01/12 23:45:25	BIOS POST Progress #0x00	BIOS POST Progress	Error-Unrecoverable video controller failure. - Assertion

 

It also can't hurt to remove that DIMM and try reiserfsck again.

 

I'll give it a shot and report back!

 


So...  I powered off the machine, re-seated the RAM, installed the old drive 2, powered the machine back on, and tried --rebuild-tree for the fourth time, with the results below.  I'm at a crossroads now, because I have been an unRAID user since 2011 and have never encountered an issue this grave.

 

I'm in maintenance mode now, awaiting further advice, because I only know two other options at this point:

 

-Format the emulated drive and rebuild parity, although I don't know the consequences beyond loss of data.

 

-Trust parity with a new configuration, although I know that is dangerous because parity may contain the reiserfs errors from the emulated drive 2.

 

That being said, I see no options with any positive outcomes aside from possibly salvaging the rest of my data.  Are there any other options?

 

 

 

 

Will rebuild the filesystem (/dev/sdd1) tree
Will put log info to 'stdout'

Do you want to run this program?[N/Yes] (note need to type Yes if you do):Yes
Replaying journal: Done.
Reiserfs journal '/dev/sdd1' in blocks [18..8211]: 0 transactions replayed
###########
reiserfsck --rebuild-tree started at Tue Jan 17 18:43:57 2017
###########

Pass 0:
####### Pass 0 #######
Loading on-disk bitmap .. ok, 371236265 blocks marked used
Skipping 30567 blocks (super block, journal, bitmaps) 371205698 blocks will be read
0%....20%..block 137989396: The number of items (63998) is incorrect, should be (1) - corrected
block 137989396: The free space (65024) is incorrect, should be (4047) - corrected
pass0: vpf-10110: block 137989396, item (0): Unknown item type found [4127064311 16121344 0xfe00f9fe  (15)] - deleted
Segmentation fault


At this point I would try a couple of things:

 

-Downgrade to v6.2.4; it uses the earlier version of reiserfsprogs.

-Run reiserfsck on the old disk in another computer to completely rule out any hardware issue.

 

SOLVED!!!!

 

BINGO!!!!! 

 

johnny-black..... that did it, at least I hope!  80% done and still going.  Lucky for me, I hadn't upgraded my 2nd rig to 6.3.x yet.  I wanted to test for a while on my 1st one, and I'm glad I did, since you referenced something called reiserfsprogs.  What is that?  I have an idea, but I want to know for sure.  Ran --rebuild-tree on the 2nd rig and it's going strong.

 

Three questions:

 

1.  What does this say about my rig (reference sig rig #1 for specs)?  I've ordered a new power supply, an RM650x, since this happened.  I had a cheap CX600-series unit installed from when I originally built the rig back in the day.  I've upgraded since, due to Docker and VM hosting.

 

2.  What does this say about my rig's compatibility with the beta versions of unRAID?  Should I downgrade?

 

3.  What will be the best practice going forward IF the drive is mountable when the process completes?  Will unRAID recognize the rebuilt drive and allow the array to start like nothing ever happened, or are there special steps I should follow?

 

 

 


Well, since it's running on v6.2 on another server, we can't be sure whether what helped was the earlier reiserfsprogs or the different hardware.  Either way, I would recommend converting any remaining reiser disks to XFS; reiser has terrible performance with almost-full disks and hasn't been actively maintained since its author went to jail.
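As a sketch of what that conversion usually looks like (disk numbers and paths here are examples only, and you need a second disk with enough free space to hold a full copy of the data):

```shell
# Example only: disk3 is assumed to be an empty, XFS-formatted disk.
# Copy everything from the reiser disk to the XFS disk, preserving
# permissions, partial-progress display, and extended attributes:
rsync -avPX /mnt/disk2/ /mnt/disk3/

# After verifying the copy, the old reiser disk can be reformatted
# as XFS (via the GUI) and reused for the next disk's conversion.
```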

 

1.  What does this say about my rig (reference sig rig #1 for specs)?  I've ordered a new power supply, an RM650x, since this happened.  I had a cheap CX600-series unit installed from when I originally built the rig back in the day.  I've upgraded since, due to Docker and VM hosting.

 

I'd say it's early for any conclusions.  If you get any more issues in the near future, then it may be better to downgrade.  Regarding the PSU: although the CX line is not very well liked here, I've always had good experiences with them, and 600 W is enough for your config.

 

2.  What does this say about my rig's compatibility with the beta versions of unRAID?  Should I downgrade?

 

Again, it's early for any conclusions.

 

3.  What will be the best practice going forward IF the drive is mountable when the process completes?  Will unRAID recognize the rebuilt drive and allow the array to start like nothing ever happened, or are there special steps I should follow?

 

If the repair finishes, unRAID should start like nothing happened, but parity will be out of sync; you'll need to run a correcting check.

 

 


If the repair finishes, unRAID should start like nothing happened, but parity will be out of sync; you'll need to run a correcting check.

 

Repair finished, syncing finished, and now the old drive is mountable, so I can access files again.  But since I removed the old disk while working on the file system restore, the drive slot is currently marked "Disk2 unassigned".  When I assign the old disk, I get a blue ball asking me to rebuild/parity-sync.  I don't think I should do that, correct?

 

Also, could you help me with the --rebuild-tree output I've posted below:

 

Pass 3 (semantic):
####### Pass 3 #########
/Movies/Family Movies/Janai's Movies/Baby Einstein/BABY_NOAH_US/title00.mkv vpf-10680: The file [626 627] has the wrong block count
/Movies/Hi-DefFlix/RIGHTEOUS KILL/Righteous Kill (2008).mkv vpf-10680: The file [1310 1312] has the wrong block count
Flushing..finished
Files found: 89721
        Directories found: 45919
        Broken (of files/symlinks/others): 2
Pass 3a (looking for lost dir/files):
####### Pass 3a (lost+found pass) #########
Looking for lost directories:
Flushing..finished
Pass 4 - finished done 107548, 698 /sec
        Deleted unreachable items 643
Flushing..finished
Syncing..finished
###########
reiserfsck finished at Wed Jan 18 07:01:06 2017

 

The lost+found directory has no files in it, but I'm concerned about the files listed in the output.  Should I assume they are fixed?


If you started the array with that disk unassigned, you'll need to do a new config: select retain all assignments, check the "Parity is already valid" box before starting the array, and then run a correcting parity check.

 

Files should be OK.  If you have checksums, use them; if not, try playing the movies while skipping through a few chapters.
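If you don't keep checksums yet, creating them is cheap; a minimal example with md5sum (paths are illustrative, and any hashing tool works the same way):

```shell
# Demo in a scratch directory; point this at your real media folders.
mkdir -p /tmp/cksum_demo
cd /tmp/cksum_demo
printf 'sample movie data' > title00.mkv

# Record checksums once, while the files are known-good
md5sum *.mkv > checksums.md5

# After any repair, verify; prints "title00.mkv: OK" on success
md5sum -c checksums.md5
```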


If you started the array with that disk unassigned, you'll need to do a new config: select retain all assignments, check the "Parity is already valid" box before starting the array, and then run a correcting parity check.

 

Ok...  did the new config, retained assignments, marked parity valid, and fired things up.  The array mounted just fine (including the old disk2) and a correcting parity check is running; however, almost 10,000 sync errors were corrected within the first 30 minutes.  Syslog has stopped logging events, but the "corrected" events that are logged are in green.


That's normal.  The only thing is that if there are many more sync errors, they can significantly slow down the check, and it may then be faster to do a full sync; but if it's running at or close to normal speed, let the check finish.

 

Ok... home from work.  The parity sync has been running 7½ hours with 5½ to go at an average speed of 65.5 MB/s, which is about 10 MB/s below normal.  Parity has corrected 18,610 errors so far.  I'm still trying to investigate the cause of the --rebuild-tree failures.  Can you have a look at the items I've highlighted in the attached screenshot?  The first shows the point of failure when running the routine on my first rig; the second shows the exact same spot in the routine, which the 2nd rig had no issues with.  Is this pointing to memory in any way on rig #1?

 

rebuild_tree_inves.PNG.819b91cf2d02c4f59e7b33253731bd26.PNG


I can't really say by looking at that, but I really doubt it's bad RAM; you have ECC RAM, and there's no evidence of any memory-related issues in the board's event log.

 

So it's either the newer reiserfs utilities included with v6.3 (reiserfsprogs v3.6.25; unRAID v6.2 uses v3.6.24) or some hardware issue with the main server.  It may be just an incompatibility, not necessarily a malfunction.

 

You should still have one disk with the file system corruption; if so, upgrade the other server to v6.3 and run reiserfsck on it.  That will tell us where the problem is/was.

