mason Posted September 14, 2018 (edited)

Hey there, I'm currently in the process of expanding my Unraid server. I have a 20-disk (18x4TB + 2x2TB) Unraid server which is pretty much maxed out. So I bought 2x12TB, ran a parity check, and replaced the parity drive with one of the 12TB disks; that rebuild went fine. Then I swapped disk1 for the second 12TB, and now it's throwing errors on disk11 while rebuilding. The rebuild is still running but doesn't seem to be writing to disk1 anymore.

So I stopped the rebuild, checked the cables and connections, and tried again. Same errors... and disk1 now shows as unformatted, which I think was different on the first try. What to do, and how do I prevent data loss? I backed up the config and the flash drive prior to the disk1 rebuild. I was also in the preread stage of preclearing the old 4TB disk1, but stopped it. Is it possible to revert to my old configuration with the original disk1 so I can rescue disk11? I'm a little lost here... what are my options?

Edited September 16, 2018 by mason: solved
JorgeB Posted September 14, 2018

Disk11 dropped offline, possibly the typical SASLP issue. Reboot or power cycle to get it back online and post new diagnostics.
mason Posted September 14, 2018 (edited)

Thanks for the reply, Johnnie. I canceled the rebuild and rebooted; the server started rebuilding again with disk11 online... like the first two times. Looks like disk11 has pending sectors. What is the typical SASLP error you mentioned? The server grew for 10 years on me with this setup and I never had problems with the controller.

Okay, I found something in the wiki about dropped-drive issues on the SASLP with v6... I guess I need to throw even more bucks at my server to replace them.

Edited September 16, 2018 by mason: diag removed
JorgeB Posted September 14, 2018

47 minutes ago, mason said:
What is the typical SASLP error you mentioned? The server grew for 10 years on me with this setup and I never had problems with the controller.

There are a lot of users with the SASLP (and SAS2LP) and dropped disks. It doesn't matter that it always worked: the problem appears to be worse with the latest releases, and any hardware or software issue can trigger it. In this case, though, it's a failing disk, so not the controller's fault.
JorgeB Posted September 14, 2018

Now for the current problem: do you still have the original disk1 untouched, and is it OK? Is the server data unchanged since replacing it?
mason Posted September 14, 2018

Yes, I have the original disk1, which I intended to preclear. The script was already running, but I was only a few percent into the preread phase when I canceled it (to my understanding it should be untouched). And I have a static setup, so nothing was written to the server.
JorgeB Posted September 14, 2018

Then you can try this:

- Replace disk1 with the original disk, and replace disk11 with a new disk (keep the old disk11 untouched).
- Tools -> New Config -> Retain current configuration: All -> Apply
- Assign any missing disk(s), including old disk1 and new disk11.
- Important: after checking the assignments, leave the browser on that page, the "Main" page.
- Open an SSH session or use the console and type: mdcmd set invalidslot 11 29
- Back in the GUI, without refreshing the page, just start the array. Do not check the "parity is already valid" box. Disk11 will start rebuilding. The disk should mount immediately, but if it's unmountable, don't format; wait for the rebuild to finish and then run a filesystem check. It's reiserfs, so even if it's unmountable it should be fixable, as long as parity is valid.

After this is solved I would still recommend replacing those SASLP with LSI controllers, and would definitely recommend adding a second parity disk to an array of that size.
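If you want to double-check the console step before running it, a small dry-run helper like this (purely hypothetical, not an Unraid tool) only builds and prints the command so it can be reviewed before pasting it into the console; the slot numbers come from the procedure above:

```shell
# Hypothetical helper: prints the exact console command for review.
# 11 = slot of the disk to rebuild (disk11); 29 is the second number
# used by the procedure above. Nothing is executed here.
invalidslot_cmd() {
    printf 'mdcmd set invalidslot %s %s\n' "$1" "$2"
}
```

Running `invalidslot_cmd 11 29` prints the line to paste into the Unraid console.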
mason Posted September 14, 2018

Thanks a lot for your feedback, Johnnie! Sounds like a reasonable route to go... will see what I can organize and report back. Currently the rebuild on disk1 has been running for 2 hours; it failed within the first half hour the last two times. What would happen if the rebuild finishes successfully? I'm confused by the unmountable part.
JorgeB Posted September 14, 2018

If disk11 drops offline again you can cancel, since it will just be rebuilding garbage. If there are just a few read errors and it continues, you can let it finish so you have more options, though at least part of the rebuilt disk will be corrupt.
mason Posted September 14, 2018 (edited)

Okay, let's see how far it gets... after that I guess I can compare the contents of the old disk1 against the rebuilt one. Then I might be able to replace disk11 without buying another HDD I won't need... (since I'd like to reduce the number of disks anyway with the 12TB route). Thanks a lot for your support!

Edited September 14, 2018 by mason: added thanks
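That comparison can be done read-only. A minimal sketch, assuming both copies are mounted somewhere (the paths in the comment are placeholders, e.g. the old disk via Unassigned Devices):

```shell
# compare_disks OLD NEW: list files that differ between the two trees or
# exist on only one side. diff -rq never writes anything and prints
# nothing when the trees match. Example (placeholder mount points):
#   compare_disks /mnt/disks/old_disk1 /mnt/disk1
compare_disks() {
    diff -rq "$1" "$2"
}
```

Note that diff exits non-zero when differences are found, so in scripts the call usually needs an explicit `|| true`.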
JorgeB Posted September 14, 2018

14 minutes ago, mason said:
Okay, let's see how far it gets... after that I guess I can compare the contents of the old disk1 against the rebuilt one. Then I might be able to replace disk11 without buying another HDD I won't need...

Yes, but keep in mind that if there are read errors on disk11 during the rebuild of disk1, even if it finishes, there will be some corruption on the rebuilt disk, which means more corruption if you then replace and rebuild disk11. You can still do it, but then try to copy every file you can from the old disk11: every file successfully copied can be assumed OK, and files that can't be copied should be replaced on the rebuilt disk.
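The copy-and-track step can be sketched like this (hypothetical helper; the paths are placeholders and the failure list is just a plain text file listing everything that could not be read, i.e. the files to replace on the rebuilt disk):

```shell
# salvage_copy SRC DST FAILLOG: copy every readable file from a failing
# disk into DST, recording files that could not be copied in FAILLOG so
# they can later be restored from the rebuilt disk instead.
salvage_copy() {
    src=$1; dst=$2; faillog=$3
    : > "$faillog"
    (cd "$src" && find . -type f) | while IFS= read -r f; do
        mkdir -p "$dst/$(dirname "$f")"
        # a file that fails to copy (e.g. read error) goes into the log
        cp "$src/$f" "$dst/$f" 2>/dev/null || echo "$f" >> "$faillog"
    done
}
```

On a real salvage you may prefer rsync for restartability; this sketch only illustrates the bookkeeping idea.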
mason Posted September 14, 2018

Good point, taken. But with the additional storage on disk1 I might then have some empty space to shuffle stuff around.
mason Posted September 15, 2018 (edited)

The rebuild just finished fine, without a single read error. Unfortunately, after a reboot the server is still showing disk1 as "unmountable, no filesystem", like in the screenshot above. Could this be an issue caused by the two failed rebuilds?

Edited September 16, 2018 by mason: diag removed
JorgeB Posted September 15, 2018

Check filesystem on disk1: https://wiki.unraid.net/Check_Disk_Filesystems#Checking_and_fixing_drives_in_the_webGui
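For reference, a command-line sketch of the same check (the webGUI route from the wiki link is the recommended way). It runs the read-only check against the md device with the array started in maintenance mode; the guard and function name here are illustrative, not Unraid-provided:

```shell
# check_reiserfs DEV: run a read-only reiserfs check on an array device,
# e.g. /dev/md1 for disk1, with the array in maintenance mode. Repair
# options (--rebuild-sb, --rebuild-tree) are separate and should only be
# run when the check output asks for them.
check_reiserfs() {
    dev=$1
    if [ ! -b "$dev" ]; then
        echo "not a block device: $dev" >&2
        return 1
    fi
    reiserfsck --check "$dev"
}
```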
mason Posted September 16, 2018

Thanks, I did that, but for my understanding: why does this happen if the rebuild starts properly from the beginning and finishes without errors? Below is the output of the check. It told me to repair the superblock; then I ran another check, which told me to rebuild-tree. I did that too and it came up with a handful of damaged files, no loss there. So far the array looks fine. Can I be confident to go on from here? I'm still a little estranged by the whole thing. Guess for any further steps I'll wait for the replacement LSI controllers... and might switch to xfs in the long run. Thanks for your help so far, Johnnie.

rebuild-sb: wrong block count occured (2929721331), fixed (2929721328)
rebuild-sb: wrong bitmap number occured (1), fixed (0) (really 89408)
Reiserfs super block in block 16 on 0x901 of format 3.6 with standard journal
Count of blocks on the device: 2929721328
Number of bitmaps: 0 (really uses 89408)
Blocksize: 4096
Free blocks (count of blocks - used [journal, bitmaps, data, reserved] blocks): 1955597332
Root block: 27832367
Filesystem is NOT clean
Tree height: 5
Hash function used to sort names: "r5"
Objectid map size 790, max 972
Journal parameters:
    Device [0x0]
    Magic [0x3036d1ad]
    Size 8193 blocks (including 1 for journal header) (first block 18)
    Max transaction length 1024 blocks
    Max batch size 900 blocks
    Max commit age 30
Blocks reserved by journal: 0
Fs state field: 0x1: some corruptions exist.
sb_version: 2
inode generation number: 39385
UUID: c4f3b705-32f7-4d13-b095-ccb010fe7975
LABEL:
Set flags in SB: ATTRIBUTES CLEAN
Mount count: 632
Maximum mount count: Disabled. Run fsck.reiserfs(8) or use tunefs.reiserfs(8) to enable.
Last fsck run: Never with a version that supports this feature.
Check interval in days: Disabled. Run fsck.reiserfs(8) or use tunefs.reiserfs(8) to enable.

########### reiserfsck --check started at Sun Sep 16 07:15:45 2018 ###########
Replaying journal: Done.
Reiserfs journal '/dev/md1' in blocks [18..8211]: 0 transactions replayed
Checking internal tree.. finished
Comparing bitmaps..Bad nodes were found, Semantic pass skipped
1 found corruptions can be fixed only when running with --rebuild-tree
########### reiserfsck finished at Sun Sep 16 08:25:10 2018 ###########
Zero bit found in on-disk bitmap after the last valid bit.
block 8211: The number of items (8) is incorrect, should be (7)
the problem in the internal node occured (8211), whole subtree is skipped
vpf-10640: The on-disk and the correct bitmaps differs.

Will rebuild the filesystem (/dev/md1) tree
Will put log info to 'stdout'
Do you want to run this program?[N/Yes] (note need to type Yes if you do):Yes
Replaying journal: Done.
Reiserfs journal '/dev/md1' in blocks [18..8211]: 0 transactions replayed
Zero bit found in on-disk bitmap after the last valid bit. Fixed.
########### reiserfsck --rebuild-tree started at Sun Sep 16 09:01:46 2018 ###########
Pass 0:
####### Pass 0 #######
Loading on-disk bitmap .. ok, 974123999 blocks marked used
Skipping 97618 blocks (super block, journal, bitmaps) 974026381 blocks will be read
0%
block 8211: The number of items (8) is incorrect, should be (7) - corrected
block 8211: The free space (64) is incorrect, should be (1256) - corrected
....20%....40%....60%....80%....100% left 0, 54896 /sec
12542 directory entries were hashed with "r5" hash.
"r5" hash is selected
Flushing..finished
Read blocks (but not data blocks) 974026381
    Leaves among those 964911
    - corrected leaves 1
    pointers in indirect items to wrong area 2 (zeroed)
    Objectids found 12558
Pass 1 (will try to insert 964911 leaves):
####### Pass 1 #######
Looking for allocable blocks .. finished
0%....20%....40%....60%....80%....100% left 0, 115 /sec
Flushing..finished
964911 leaves read
    964818 inserted
    93 not inserted
####### Pass 2 #######
Pass 2:
0%....20%....40%....60%....80%....100% left 0, 0 /sec
Flushing..finished
    Leaves inserted item by item 93
Pass 3 (semantic):
####### Pass 3 #########
[..]: The file [2 1723] has the wrong block count in the StatData (2290624) - corrected to (2290608)
[..]: The directory [2 1720] has the wrong block count in the StatData (6) - corrected to (3)
vpf-10650: The directory [2 1720] has the wrong size in the StatData (2624) - corrected to (1504)
Flushing..finished
    Files found: 12165
    Directories found: 379
Pass 3a (looking for lost dir/files):
####### Pass 3a (lost+found pass) #########
Looking for lost directories:
Looking for lost files: left 0, 0 /sec
Flushing..finished
    Objects without names 13
    Empty lost dirs removed 2
    Files linked to /lost+found 13
Pass 4 - finished done 644856, 77 /sec
    Deleted unreachable items 2
Flushing..finished
Syncing..finished
########### reiserfsck finished at Sun Sep 16 18:45:38 2018 ###########
JorgeB Posted September 16, 2018

The problem usually happens before the rebuild, when Unraid switches to the emulated disk. Most times the emulated disk can pick up right after the disk gets disabled, but sometimes a little data is lost, and a couple of bits are enough: the result is, for example, a corrupt file if a disk gets disabled during a write, or some filesystem corruption if metadata is not perfectly updated.

You should convert away from reiserfs, not because of this, but because it's a dead filesystem with performance issues on fuller disks.
mason Posted September 16, 2018

Great explanation, thank you. Yeah, I would have loved to move to xfs, but since I only ever replaced disks via rebuilds I never had the chance. Will definitely start converting once the initial problem is sorted and I have more confidence. Any more detail on the performance issue with fuller disks? I've had hiccups for ages, since my disks are always 99% full. Sometimes browsing SMB takes 30 seconds while more than one file operation is going on...
JorgeB Posted September 16, 2018

It's mostly noticeable when you start to write to a share with mostly full reiserfs disks: it can take several seconds before the write starts, in extreme cases even causing a Windows file copy to time out.
mason Posted September 16, 2018

Exactly what I am seeing, good to know where it comes from. Again, thank you for your excellent support.