mason Posted September 14, 2018 (edited)

Hey there, I'm currently in the process of expanding my Unraid server. I have a 20-disk (18x4TB + 2x2TB) Unraid server which is pretty much maxed out. So I bought 2x12TB, ran a parity check, and replaced the parity drive with one of the 12TB disks; that rebuild went fine. Then I swapped disk1 for the second 12TB, and now it's throwing errors on disk11 while rebuilding. The rebuild is still running but doesn't seem to be writing to disk1 anymore.

So I stopped the rebuild, checked the cables and connections, and tried again. Same errors... and disk1 now shows as unformatted, which I think was different on the first try. What to do, and how do I prevent data loss? I backed up the config and the flash drive prior to the disk1 rebuild. I was also in the preread stage of preclearing the old 4TB disk1, but stopped it. Is it possible to revert to my old configuration with the original disk1 so I can rescue disk11? I'm a little lost here... what are my options?

Edited September 16, 2018 by mason: solved
JorgeB Posted September 14, 2018

Disk11 dropped offline, possibly the typical SASLP issue. Reboot or power cycle to get it back online and post new diagnostics.
mason Posted September 14, 2018 (edited)

Thanks for the reply, Johnnie. I canceled the rebuild and rebooted; the server started rebuilding again with disk11 online... like the first two times. Looks like disk11 has pending sectors. What is the typical SASLP error you mentioned? The server grew for 10 years on me with this setup and I never had problems with the controller.

Okay, I found something in the wiki about dropped-drive issues on the SASLP with v6... I guess I need to throw even more bucks at my server to replace them.

Edited September 16, 2018 by mason: diag removed
JorgeB Posted September 14, 2018

47 minutes ago, mason said:
What is the typical SASLP error you mentioned? The server grew for 10 years on me with this setup and I never had problems with the controller.

There are a lot of users with the SASLP (and SAS2LP) and dropped disks. It doesn't matter that it always worked: the problem appears to be worse with the latest releases, and any hardware or software issue can trigger it. In this case, though, it's a failing disk, so not the controller's fault.
JorgeB Posted September 14, 2018

Now for the current problem: do you still have the original disk1 untouched, and is it OK? Is the server data unchanged since replacing it?
mason Posted September 14, 2018

Yes, I have the original disk1, which I intended to preclear. The script was already running, but I was only a few percent into the preread phase when I canceled it (to my understanding it should be untouched). And I have a static setup, so nothing was written to the server.
JorgeB Posted September 14, 2018

Then you can try this:

- Replace disk1 with the original disk, and replace disk11 with a new disk (keep the old disk11 untouched).
- Tools -> New Config -> Retain current configuration: All -> Apply
- Assign any missing disk(s), including old disk1 and new disk11.
- Important: after checking the assignments, leave the browser on that page, the "Main" page.
- Open an SSH session or use the console and type: mdcmd set invalidslot 11 29
- Back in the GUI, without refreshing the page, just start the array. Do not check the "parity is already valid" box. Disk11 will start rebuilding. The disk should mount immediately, but if it's unmountable, don't format; wait for the rebuild to finish and then run a filesystem check. It's reiserfs, so even if it's unmountable it should be fixable, as long as parity is valid.

After this is solved I would still recommend replacing those SASLP with LSI controllers, and would definitely recommend adding a second parity disk to an array of that size.
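If you want to double-check the console step before running it, a small dry-run helper like this (purely hypothetical, not an Unraid tool) only builds and prints the command so it can be reviewed before pasting it into the console; the slot numbers come from the procedure above:

```shell
# Hypothetical helper: prints the exact console command for review.
# 11 = slot of the disk to rebuild (disk11); 29 is the second number
# used by the procedure above. Nothing is executed here.
invalidslot_cmd() {
    printf 'mdcmd set invalidslot %s %s\n' "$1" "$2"
}
```

Running `invalidslot_cmd 11 29` prints the line to paste into the Unraid console.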
mason Posted September 14, 2018

Thanks a lot for your feedback, Johnnie! Sounds like a reasonable route to go... will see what I can organize and report back. Currently the rebuild on disk1 has been running for 2 hours; it failed within the first half hour the last two times. What would happen if the rebuild finishes successfully? I'm confused by the unmountable part.
JorgeB Posted September 14, 2018

If disk11 drops offline again you can cancel, since it will just be rebuilding garbage. If there are just a few read errors and it continues, you can let it finish so you have more options, though at least part of the rebuilt disk will be corrupt.
mason Posted September 14, 2018 (edited)

Okay, let's see how far it gets... after that I guess I can compare the contents of the old disk1 against the rebuilt one. Then I might be able to replace disk11 without buying another HDD I won't need... (since I'd like to reduce the number of disks anyway with the 12TB route). Thanks a lot for your support!

Edited September 14, 2018 by mason: added thanks
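That comparison can be done read-only. A minimal sketch, assuming both copies are mounted somewhere (the paths in the comment are placeholders, e.g. the old disk via Unassigned Devices):

```shell
# compare_disks OLD NEW: list files that differ between the two trees or
# exist on only one side. diff -rq never writes anything and prints
# nothing when the trees match. Example (placeholder mount points):
#   compare_disks /mnt/disks/old_disk1 /mnt/disk1
compare_disks() {
    diff -rq "$1" "$2"
}
```

Note that diff exits non-zero when differences are found, so in scripts the call usually needs an explicit `|| true`.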
JorgeB Posted September 14, 2018

14 minutes ago, mason said:
Okay, let's see how far it gets... after that I guess I can compare the contents of the old disk1 against the rebuilt one. Then I might be able to replace disk11 without buying another HDD I won't need...

Yes, but keep in mind that if there are read errors on disk11 during the rebuild of disk1, even if it finishes, there will be some corruption on the rebuilt disk, which means more corruption if you then replace and rebuild disk11. You can still do it, but then try to copy every file you can from the old disk11: every file successfully copied can be assumed OK, and files that can't be copied should be replaced on the rebuilt disk.
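The copy-and-track step can be sketched like this (hypothetical helper; the paths are placeholders and the failure list is just a plain text file listing everything that could not be read, i.e. the files to replace on the rebuilt disk):

```shell
# salvage_copy SRC DST FAILLOG: copy every readable file from a failing
# disk into DST, recording files that could not be copied in FAILLOG so
# they can later be restored from the rebuilt disk instead.
salvage_copy() {
    src=$1; dst=$2; faillog=$3
    : > "$faillog"
    (cd "$src" && find . -type f) | while IFS= read -r f; do
        mkdir -p "$dst/$(dirname "$f")"
        # a file that fails to copy (e.g. read error) goes into the log
        cp "$src/$f" "$dst/$f" 2>/dev/null || echo "$f" >> "$faillog"
    done
}
```

On a real salvage you may prefer rsync for restartability; this sketch only illustrates the bookkeeping idea.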
mason Posted September 14, 2018

Good point, taken. But with the additional storage on disk1 I might then have some empty space to shuffle stuff around.
mason Posted September 15, 2018 (edited)

The rebuild just finished fine, without a single read error. Unfortunately, after a reboot the server is still showing disk1 as "unmountable, no filesystem", like in the screenshot above. Could this be an issue caused by the two failed rebuilds?

Edited September 16, 2018 by mason: diag removed
JorgeB Posted September 15, 2018

Check filesystem on disk1: https://wiki.unraid.net/Check_Disk_Filesystems#Checking_and_fixing_drives_in_the_webGui
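For reference, a command-line sketch of the same check (the webGUI route from the wiki link is the recommended way). It runs the read-only check against the md device with the array started in maintenance mode; the guard and function name here are illustrative, not Unraid-provided:

```shell
# check_reiserfs DEV: run a read-only reiserfs check on an array device,
# e.g. /dev/md1 for disk1, with the array in maintenance mode. Repair
# options (--rebuild-sb, --rebuild-tree) are separate and should only be
# run when the check output asks for them.
check_reiserfs() {
    dev=$1
    if [ ! -b "$dev" ]; then
        echo "not a block device: $dev" >&2
        return 1
    fi
    reiserfsck --check "$dev"
}
```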
mason Posted September 16, 2018

Thanks, I did that, but for my understanding: why does this happen if the rebuild starts properly from the beginning and finishes without errors? Below is the output of the check. It told me to repair the superblock; then I ran another check, which told me to rebuild-tree. I did that too and it came up with a handful of damaged files, no loss there. So far the array looks fine. Can I be confident to go on from here? I'm still a little estranged by the whole thing. Guess for any further steps I'll wait for the replacement LSI controllers... and might switch to xfs in the long run. Thanks for your help so far, Johnnie.

rebuild-sb: wrong block count occured (2929721331), fixed (2929721328)
rebuild-sb: wrong bitmap number occured (1), fixed (0) (really 89408)
Reiserfs super block in block 16 on 0x901 of format 3.6 with standard journal
Count of blocks on the device: 2929721328
Number of bitmaps: 0 (really uses 89408)
Blocksize: 4096
Free blocks (count of blocks - used [journal, bitmaps, data, reserved] blocks): 1955597332
Root block: 27832367
Filesystem is NOT clean
Tree height: 5
Hash function used to sort names: "r5"
Objectid map size 790, max 972
Journal parameters:
    Device [0x0]
    Magic [0x3036d1ad]
    Size 8193 blocks (including 1 for journal header) (first block 18)
    Max transaction length 1024 blocks
    Max batch size 900 blocks
    Max commit age 30
Blocks reserved by journal: 0
Fs state field: 0x1: some corruptions exist.
sb_version: 2
inode generation number: 39385
UUID: c4f3b705-32f7-4d13-b095-ccb010fe7975
LABEL:
Set flags in SB: ATTRIBUTES CLEAN
Mount count: 632
Maximum mount count: Disabled. Run fsck.reiserfs(8) or use tunefs.reiserfs(8) to enable.
Last fsck run: Never with a version that supports this feature.
Check interval in days: Disabled. Run fsck.reiserfs(8) or use tunefs.reiserfs(8) to enable.

########### reiserfsck --check started at Sun Sep 16 07:15:45 2018 ###########
Replaying journal: Done.
Reiserfs journal '/dev/md1' in blocks [18..8211]: 0 transactions replayed
Checking internal tree.. finished
Comparing bitmaps..Bad nodes were found, Semantic pass skipped
1 found corruptions can be fixed only when running with --rebuild-tree
########### reiserfsck finished at Sun Sep 16 08:25:10 2018 ###########
Zero bit found in on-disk bitmap after the last valid bit.
block 8211: The number of items (8) is incorrect, should be (7)
the problem in the internal node occured (8211), whole subtree is skipped
vpf-10640: The on-disk and the correct bitmaps differs.

Will rebuild the filesystem (/dev/md1) tree
Will put log info to 'stdout'
Do you want to run this program?[N/Yes] (note need to type Yes if you do):Yes
Replaying journal: Done.
Reiserfs journal '/dev/md1' in blocks [18..8211]: 0 transactions replayed
Zero bit found in on-disk bitmap after the last valid bit. Fixed.
########### reiserfsck --rebuild-tree started at Sun Sep 16 09:01:46 2018 ###########
Pass 0:
####### Pass 0 #######
Loading on-disk bitmap .. ok, 974123999 blocks marked used
Skipping 97618 blocks (super block, journal, bitmaps) 974026381 blocks will be read
0%
block 8211: The number of items (8) is incorrect, should be (7) - corrected
block 8211: The free space (64) is incorrect, should be (1256) - corrected
....20%....40%....60%....80%....100% left 0, 54896 /sec
12542 directory entries were hashed with "r5" hash.
"r5" hash is selected
Flushing..finished
Read blocks (but not data blocks) 974026381
    Leaves among those 964911
    - corrected leaves 1
    pointers in indirect items to wrong area 2 (zeroed)
    Objectids found 12558
Pass 1 (will try to insert 964911 leaves):
####### Pass 1 #######
Looking for allocable blocks .. finished
0%....20%....40%....60%....80%....100% left 0, 115 /sec
Flushing..finished
964911 leaves read
    964818 inserted
    93 not inserted
####### Pass 2 #######
Pass 2:
0%....20%....40%....60%....80%....100% left 0, 0 /sec
Flushing..finished
    Leaves inserted item by item 93
Pass 3 (semantic):
####### Pass 3 #########
[..]: The file [2 1723] has the wrong block count in the StatData (2290624) - corrected to (2290608)
[..]: The directory [2 1720] has the wrong block count in the StatData (6) - corrected to (3)
vpf-10650: The directory [2 1720] has the wrong size in the StatData (2624) - corrected to (1504)
Flushing..finished
    Files found: 12165
    Directories found: 379
Pass 3a (looking for lost dir/files):
####### Pass 3a (lost+found pass) #########
Looking for lost directories:
Looking for lost files: left 0, 0 /sec
Flushing..finished
    Objects without names 13
    Empty lost dirs removed 2
    Files linked to /lost+found 13
Pass 4 - finished done 644856, 77 /sec
    Deleted unreachable items 2
Flushing..finished
Syncing..finished
########### reiserfsck finished at Sun Sep 16 18:45:38 2018 ###########
JorgeB Posted September 16, 2018

The problem usually happens before the rebuild, when Unraid switches to the emulated disk. Most times the emulated disk can pick up right after the disk gets disabled, but sometimes a little data is lost, and a couple of bits are enough: the result is, for example, a corrupt file if a disk gets disabled during a write, or some filesystem corruption if metadata is not perfectly updated.

You should convert away from reiserfs, not because of this, but because it's a dead filesystem with performance issues on fuller disks.
mason Posted September 16, 2018

Great explanation, thank you. Yeah, I would have loved to move to xfs, but since I only ever replaced disks via rebuilds I never had the chance. Will definitely start converting once the initial problem is sorted and I have more confidence. Any more detail on the performance issue with fuller disks? I've had hiccups for ages, since my disks are always 99% full. Sometimes browsing SMB takes 30 seconds while more than one file operation is going on...
JorgeB Posted September 16, 2018

It's mostly noticeable when you start to write to a share with mostly full reiserfs disks: it can take several seconds before the write starts, in extreme cases even causing a Windows file copy to time out.
mason Posted September 16, 2018

Exactly what I am seeing, good to know where it comes from. Again, thank you for your excellent support.