Bizarro Posted July 30, 2016

Hi guys, I've just upgraded my parity drive from 3TB to 4TB, and around 1/3 of the way through the generation of the parity, one of the data drives died. It still shows in Unraid and I can see all the files, but it gets a ton of read errors and I can't copy anything off it. I then tried to pop my original 3TB parity drive back in, but it says the replacement drive needs to be the same size, so I can't use it. I haven't written much to the array since swapping out parity drives.

Is there a way to do the following:
1. Install old 3TB parity drive (all other disks are 3TB or less)
2. Remove dying data drive
3. Power up the Unraid box
4. Get it to "trust" the old parity drive and let me start the array in a "missing disk" scenario, so I can recover the 1.5TB of data using parity information?

After I have recovered the data, I plan to pop the 4TB parity drive back in, and use the old 3TB parity drive to replace the dying data disk.

I had a look here: http://lime-technology.com/wiki/index.php/Make_unRAID_Trust_the_Parity_Drive,_Avoid_Rebuilding_Parity_Unnecessarily
But it doesn't really state that it should be used in this case. Running one of the very last Unraid 5 versions.
trurl Posted July 30, 2016

"... I haven't written much to the array since swapping out parity drives..."

If you have written anything at all to the array since swapping parity drives then the old parity is invalid and cannot be used to rebuild a data disk.

It is possible that the drive hasn't really died, but instead you have a connection problem caused when you swapped the parity. Does the drive actually have a redball? Post SMART for the drive and a syslog.
Bizarro Posted July 30, 2016 Author

The drive hasn't redballed, but it gets a ton of read errors during parity rebuild. It's taken all week to get to 50%, and it was running at 200kBytes/sec when I powered down the NAS.

I currently have the dying drive in an external USB enclosure, attempting to get data off using some Linux reader and some other tools. It's also erroring out when attached to the PC.

I acknowledge that some of the parity info would not be correct, but the majority should be. I'd assume the last dozen files might be damaged due to this, but the rest of the parity info would still be OK?

I don't have the drive in the NAS right now so I can't post logs. Just wondering if there is a way to fire up the array with the drive missing and the old parity drive installed.
Bizarro Posted July 30, 2016 Author

OK, I have popped in the old parity drive, temporarily installed the dying drive, and followed the instructions on the link above, i.e.:
1. Take screenshot of current drive arrangement
2. Telnet in and type: initconfig
3. Go to website and reassign all the drives
4. Tick "trust array"
5. In the telnet prompt: mdcmd set invalidslot 99
6. Start array
7. Close webpage
8. Yank dying drive out
9. Copy data off the "virtual" disk14 (the dying drive)

Recovered data appears to be intact so far.
172pilot Posted July 30, 2016

This is a really scary scenario to me, one that I hope I never have to deal with, but "just in case" I'd love to know whether, in the end, your process worked for you, and whether you'd do something different given the same situation. Please post back once you're done!

Unraid is so awesome that I hate to criticize in any way, but it would be really cool to be able to either have a "raid 6" type double parity, or to be able to configure a new parity disk while the original parity disk is still online and usable...
JorgeB Posted July 30, 2016

"Unraid is so awesome that I hate to criticize in any way, but it would be really cool to be able to either have a "raid 6" type double parity, or to be able to configure a new parity disk while the original parity disk is still online and usable..."

V6.2 supports dual parity.
Bizarro Posted July 30, 2016 Author

It copied successfully for a few hours and I managed to rescue a stack of files I was unable to before. I walked back in and now the share for disk 14 is showing as blank. Maybe it hit some out-of-sync data on the parity drive and crapped itself.

Rebooting the box now to see if I can start the array with the missing disk and recover the remaining data.

I've stuck with this version of Unraid and Plex forever, but the double-parity option really appeals to me now!
JorgeB Posted July 30, 2016

"Maybe it hit some out-of-sync data on the parity drive and crapped itself."

Since parity wasn't completely synced, it's normal for that disk to have file system issues.
Bizarro Posted July 30, 2016 Author

Rebooting with disk14 unplugged results in disk14 showing as "not installed" rather than missing. I can't browse to \\nas\disk14, and can't browse disk14 via the web interface either.

Might reboot again and yank the drive after the array starts again. Not sure how to get the array to present the "virtual" disk14.
Bizarro Posted July 30, 2016 Author

Rebooting just gives the choice of wiping disk14 or rebuilding it. Might pop to the computer shop tomorrow and grab a 3TB drive and take the rebuild option. The array just doesn't seem to want to present a "virtual" disk14, unless anyone knows a trick?
itimpi Posted July 30, 2016

"Rebooting just gives the choice of wiping disk14 or rebuilding it. Might pop to the computer shop tomorrow and grab a 3TB drive and take the rebuild option. The array just doesn't seem to want to present a "virtual" disk14 unless anyone knows a trick?"

You might want to post a screenshot to make it clearer what the current state is.

Sounds like the 'virtual' disk may have some file system level corruption that will need fixing. Rebuilding a disk with file system level corruption at the virtual level just gives a physical disk with the same corruption.
trurl Posted July 30, 2016

Not really relevant now, but just thought I would comment on this bit:

"... I'd assume the last dozen files might be damaged due to this but the rest of the parity info would still be OK?"

If you knew for certain that the files written since the old parity was in sync were only on the damaged disk, then what you say would be true. Any files written to other disks, though, would have an unknown effect on the rebuilt disk, since the bits on those disks would also be involved in reconstructing the data on the damaged disk, and there is no telling where on the damaged disk those out-of-sync bits would be.
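To illustrate the point above, here is a minimal sketch of single-parity reconstruction, assuming plain byte-wise XOR parity across the data disks. The disk contents, sizes, and names (`disk1`, `disk2`, `disk3`, `xor_blocks`) are made up for the example and are not taken from this array. It shows that a write to a different disk after the parity went stale corrupts the rebuilt disk at the same offsets, regardless of which files live there.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte strings together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# Three toy "data disks" of 8 bytes each (hypothetical contents).
disk1 = bytes([0x11] * 8)
disk2 = bytes([0x22] * 8)
disk3 = bytes([0x33] * 8)  # plays the role of the failing disk

# Parity as it was when the old parity drive was last in sync.
old_parity = xor_blocks([disk1, disk2, disk3])

# Later, something is written to disk2 (offsets 2 and 3) without
# the old parity drive being updated.
disk2_now = bytearray(disk2)
disk2_now[2:4] = b"\xaa\xbb"
disk2_now = bytes(disk2_now)

# Rebuild disk3 from the stale parity plus the current other disks.
rebuilt = xor_blocks([old_parity, disk1, disk2_now])

# Offsets where the rebuilt disk differs from the real disk3.
bad_offsets = [i for i, (a, b) in enumerate(zip(rebuilt, disk3)) if a != b]
print(bad_offsets)
```

In this toy model the damaged offsets on the rebuilt disk are exactly the offsets that were overwritten on the other disk, which is why the location of the damage cannot be predicted from which files were written.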
Bizarro Posted July 31, 2016 Author

Hi guys, I understand a few files will be unrecoverable, probably 50GB or so that were written before the parity drive was upgraded. But I'd assume the parity for the remaining 1.4TB of sectors should still be accurate.

I just can't seem to see a way to get the "virtual" parity drive to show. Maybe I'll need to do the "trust my array" procedure a second time?

Here's a few screenshots:

Array with dying disk not installed: [screenshot]
Array Start dialog box with disk not installed - doing so doesn't give me a virtual disk 14: [screenshot]
Array with dying disk (14) present: [screenshot]
Array Start dialog box with options presented. Notice I can't just start it as is: [screenshot]
SMART test results: [screenshot]

A quick snippet of the disk test log retrieved when I click "collect":

ATA Error Count: 42265 (device log contains only the most recent five errors)
    CR = Command Register [HEX]
    FR = Features Register [HEX]
    SC = Sector Count Register [HEX]
    SN = Sector Number Register [HEX]
    CL = Cylinder Low Register [HEX]
    CH = Cylinder High Register [HEX]
    DH = Device/Head Register [HEX]
    DC = Device Command Register [HEX]
    ER = Error register [HEX]
    ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 42265 occurred at disk power-on lifetime: 22116 hours (921 days + 12 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 08 ff ff ff ef 00      00:13:14.151  READ DMA EXT
  25 00 08 ff ff ff ef 00      00:13:14.144  READ DMA EXT
  25 00 08 ff ff ff ef 00      00:13:14.127  READ DMA EXT
  ef 10 02 00 00 00 a0 00      00:13:14.126  SET FEATURES [Enable SATA feature]
  27 00 00 00 00 00 e0 00      00:13:14.126  READ NATIVE MAX ADDRESS EXT [OBS-ACS-3]

Error 42264 occurred at disk power-on lifetime: 22116 hours (921 days + 12 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 08 ff ff ff ef 00      00:13:11.508  READ DMA EXT
  ef 10 02 00 00 00 a0 00      00:13:11.507  SET FEATURES [Enable SATA feature]
  27 00 00 00 00 00 e0 00      00:13:11.507  READ NATIVE MAX ADDRESS EXT [OBS-ACS-3]
  ec 00 00 00 00 00 a0 00      00:13:11.506  IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 00      00:13:11.506  SET FEATURES [set transfer mode]

Error 42263 occurred at disk power-on lifetime: 22116 hours (921 days + 12 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 08 ff ff ff ef 00      00:13:08.764  READ DMA EXT
  ca 00 08 c0 00 00 e0 00      00:13:08.764  WRITE DMA
  c8 00 38 a8 0c 00 e0 00      00:13:08.726  READ DMA
  c8 00 08 a0 0c 00 e0 00      00:13:08.726  READ DMA
  c8 00 08 98 0c 00 e0 00      00:13:08.726  READ DMA

Error 42262 occurred at disk power-on lifetime: 22115 hours (921 days + 11 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 03 08 ff ff ff ef 00      02:31:59.657  READ DMA EXT
  25 03 08 ff ff ff ef 00      02:31:59.060  READ DMA EXT
  25 03 08 ff ff ff ef 00      02:31:56.426  READ DMA EXT
  25 03 08 ff ff ff ef 00      02:31:53.798  READ DMA EXT
  25 03 08 ff ff ff ef 00      02:31:53.206  READ DMA EXT

Error 42261 occurred at disk power-on lifetime: 22115 hours (921 days + 11 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 03 08 ff ff ff ef 00      02:31:56.426  READ DMA EXT
  25 03 08 ff ff ff ef 00      02:31:53.798  READ DMA EXT
  25 03 08 ff ff ff ef 00      02:31:53.206  READ DMA EXT
  25 03 08 ff ff ff ef 00      02:31:52.609  READ DMA EXT
  25 03 08 ff ff ff ef 00      02:31:52.017  READ DMA EXT
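For anyone reading the register dump above, the following is a rough sketch of how the "40 51 00 ff ff ff 0f" row maps to "UNC at LBA = 0x0fffffff". The bit tables are a partial subset of the ATA error and status register bits written from memory, and the helper names (`decode`, `lba28`) are hypothetical, so treat this as an illustration rather than a reference decoder.

```python
# Partial bit tables for the ATA error (ER) and status (ST) registers.
# Only the commonly seen bits are listed; this subset is an assumption
# made for the example, not a complete table.
ERROR_BITS = {0x80: "ICRC", 0x40: "UNC", 0x10: "IDNF", 0x04: "ABRT"}
STATUS_BITS = {0x80: "BSY", 0x40: "DRDY", 0x20: "DF", 0x08: "DRQ", 0x01: "ERR"}

def decode(value, names):
    """Return the names of all bits from `names` that are set in `value`."""
    return [name for bit, name in sorted(names.items(), reverse=True) if value & bit]

def lba28(sn, cl, ch, dh):
    """Assemble a 28-bit LBA from the SN, CL, CH and low nibble of DH."""
    return ((dh & 0x0F) << 24) | (ch << 16) | (cl << 8) | sn

# Register row copied from the log above: ER ST SC SN CL CH DH
row = "40 51 00 ff ff ff 0f"
er, st, sc, sn, cl, ch, dh = (int(x, 16) for x in row.split())

error_flags = decode(er, ERROR_BITS)
status_flags = decode(st, STATUS_BITS)
lba = lba28(sn, cl, ch, dh)

print(error_flags, status_flags, hex(lba), lba)
```

Note that 0x0fffffff is the largest value a 28-bit LBA can hold. Since the failing commands are 48-bit READ DMA EXT reads, this is likely a saturated placeholder in the summary error log rather than the real failing sector, so the five entries do not necessarily point at a single bad block.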
Bizarro Posted July 31, 2016 Author

OK, I have just run the trust my array procedure, but without "mdcmd set invalidslot 99". Oops, what does this do?

Either way I am recovering more data. Going to run down the road now and try to source another drive, if I can find a PC shop open on a Sunday.
Bizarro Posted July 31, 2016 Author

Lots of these in the log as I do the recovery, I'm guessing due to the incomplete parity. Looks like it has frozen so I'll have to reboot and continue the process.

Jul 31 10:54:06 NAS kernel: REISERFS warning: reiserfs-5090 is_tree_node: node level 0 does not match to the expected one 1
Jul 31 10:54:06 NAS kernel: REISERFS error (device md14): vs-5150 search_by_key: invalid format found in block 346236761. Fsck?
Jul 31 10:54:06 NAS kernel: REISERFS warning: reiserfs-5090 is_tree_node: node level 0 does not match to the expected one 1
Jul 31 10:54:06 NAS kernel: REISERFS error (device md14): vs-5150 search_by_key: invalid format found in block 346236761. Fsck?
Jul 31 10:54:06 NAS kernel: REISERFS warning: reiserfs-5090 is_tree_node: node level 0 does not match to the expected one 1
Jul 31 10:54:06 NAS kernel: REISERFS error (device md14): vs-5150 search_by_key: invalid format found in block 346236761. Fsck?
Jul 31 10:54:06 NAS kernel: REISERFS warning: reiserfs-5090 is_tree_node: node level 0 does not match to the expected one 1
Jul 31 10:54:06 NAS kernel: REISERFS error (device md14): vs-5150 search_by_key: invalid format found in block 346236761. Fsck?
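The messages above repeat the same device and block number, which is easier to see when they are tallied. Below is a small sketch that counts REISERFS error lines per (device, block) pair. The regular expression is written only against the lines quoted in this post, and the `LOG` sample is a shortened copy of them, so it is an assumption that other ReiserFS messages follow the same shape.

```python
import re
from collections import Counter

# Shortened sample of the syslog lines quoted in the post above.
LOG = """\
Jul 31 10:54:06 NAS kernel: REISERFS warning: reiserfs-5090 is_tree_node: node level 0 does not match to the expected one 1
Jul 31 10:54:06 NAS kernel: REISERFS error (device md14): vs-5150 search_by_key: invalid format found in block 346236761. Fsck?
Jul 31 10:54:06 NAS kernel: REISERFS warning: reiserfs-5090 is_tree_node: node level 0 does not match to the expected one 1
Jul 31 10:54:06 NAS kernel: REISERFS error (device md14): vs-5150 search_by_key: invalid format found in block 346236761. Fsck?
"""

# Capture the device name and the block number from each error line.
PATTERN = re.compile(r"REISERFS error \(device (\w+)\): .*? in block (\d+)")

matches = (PATTERN.search(line) for line in LOG.splitlines())
counts = Counter(m.groups() for m in matches if m)

for (device, block), n in counts.items():
    print(f"{device}: block {block} reported {n} time(s)")
```

A single block reported many times within the same second suggests one damaged tree node being hit repeatedly, rather than damage spread across the whole file system.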
Bizarro Posted July 31, 2016 Author

Looks like my motherboard or CPU has crapped itself just to add to the party. It boots, but at a glacial pace (I'm talking 1 hr just to get to the unraid boot screen), almost like the CPU is running at 1MHz or something. Even with all SATA, PCI and excess RAM sticks removed it does the same thing.

Tried a spare motherboard out of an old Acer i3 PC, but it didn't seem to support USB boot.

Any recommendations for a motherboard with lots of (6+) SATA ports?

<insert photo of me tearing hair out here>
bonienl Posted July 31, 2016

Most modern motherboards have 6 or more SATA ports, so you have a lot of choice. I would recommend moving to unRAID 6 when you start using new hardware.
Bizarro Posted July 31, 2016 Author

I've pulled the CPU off and replaced the i3 540 with an i3 550 off the spare motherboard. I noticed a bent CPU socket pin from when I originally built the box years ago, and a missing CPU socket pin next to it on the motherboard. I straightened out the bent pin and it seems to be running at normal speed again, even with the other pin missing.

Might still investigate a new motherboard though.
Bizarro Posted August 1, 2016 Author

OK, I managed to get my hands on an old Intel S3420GP with a Xeon X3440 CPU on it.
http://www.intel.com/content/www/us/en/support/server-products/server-boards/single-socket-server-boards/intel-server-board-s3420gp-family.html

All up and running, and I have installed a new disk to replace the faulty one. Clicked rebuild data. It is rebuilding the data but I also have a choice to format the disk? I also can't see disk14 in the list of shares I can connect to. Should I just wait it out? What happens if it does a data rebuild when the disk isn't formatted first?

Also looks like all the newer data was copied to disk15 and disk16, so I'm hoping the parity disk swap doesn't affect what is on those.

I managed to get 1.4TB out of the 1.5TB of data back.
RobJ Posted August 2, 2016

"Lots of these in the log as I do the recovery, I'm guessing due to the incomplete parity. Looks like it has frozen so I'll have to reboot and continue the process.

Jul 31 10:54:06 NAS kernel: REISERFS warning: reiserfs-5090 is_tree_node: node level 0 does not match to the expected one 1
Jul 31 10:54:06 NAS kernel: REISERFS error (device md14): vs-5150 search_by_key: invalid format found in block 346236761. Fsck?
Jul 31 10:54:06 NAS kernel: REISERFS warning: reiserfs-5090 is_tree_node: node level 0 does not match to the expected one 1
Jul 31 10:54:06 NAS kernel: REISERFS error (device md14): vs-5150 search_by_key: invalid format found in block 346236761. Fsck?
Jul 31 10:54:06 NAS kernel: REISERFS warning: reiserfs-5090 is_tree_node: node level 0 does not match to the expected one 1
Jul 31 10:54:06 NAS kernel: REISERFS error (device md14): vs-5150 search_by_key: invalid format found in block 346236761. Fsck?
Jul 31 10:54:06 NAS kernel: REISERFS warning: reiserfs-5090 is_tree_node: node level 0 does not match to the expected one 1
Jul 31 10:54:06 NAS kernel: REISERFS error (device md14): vs-5150 search_by_key: invalid format found in block 346236761. Fsck?"

These are file system errors, not disk errors or parity errors, nothing to do with parity. Once you have access to Disk 14 again, see Check Disk File systems.

"All up and running, have installed new disk to replace the faulty one. Clicked rebuild data. It is rebuilding the data but I also have a choice to format the disk? I also can't see disk14 in the list of shares I can connect to. Should I just wait it out? What happens if it does a data rebuild when the disk isn't formatted first?"

You don't want to format Disk 14; that would empty the disk and update the parity accordingly. Rebuilding a drive rewrites all of the bits on the drive, overwriting whatever was on the drive before.
Bizarro Posted August 2, 2016 Author

Even though Disk14 has been replaced? Doesn't it need to format the disk before it can restore the data to it via parity?

I let it run for a few hours and it got to around 700GB in, but there was still no disk14 share, and it still wanted me to format it. I hadn't checked your reply, so I cancelled the rebuild, hit Format, and then clicked Rebuild again.

It is rebuilding again now, but the disk14 share is showing as blank. The disk also shows as blank in the list of disks, see below.

Will the data reappear on that disk when the Rebuild completes? Or have I just nuked all the data out of parity? It seems to be writing something to the disk, but who knows what?

I have a copy of most of the data just in case, but I was hoping this would restore the data more nicely.
trurl Posted August 2, 2016

As RobJ said:

"You don't want to format Disk 14, that would empty the disk, and update the parity accordingly. Rebuilding a drive rewrites all of the bits on the drive, overwriting whatever was on the drive before."

The word format means "Write an empty filesystem to this disk". That is what it has always meant on every operating system you have ever used. unRAID treats this write operation just like it does every other write operation, by updating parity. So now parity agrees that you have an empty filesystem, and now you are rebuilding an empty filesystem.

Format is NEVER part of the rebuild process.
Bizarro Posted August 2, 2016 Author

Damn! How do I reverse/stop this? Or have I already lost that parity information?

It's not the end of the world as I recovered most of the data via parity before I popped this new replacement disk in.

Why does it say "Data-Rebuild" when it isn't rebuilding anything?
trurl Posted August 2, 2016

"Damn! How do I reverse/stop this? Or have I already lost that parity information? It's not the end of the world as I recovered most of the data via parity before I popped this new replacement disk in. Why does it say "Data-Rebuild" when it isn't rebuilding anything?"

It is rebuilding the disk. The disk is just a bunch of bits. It will rebuild the entire disk. Unfortunately, the bits on the disk now represent an empty filesystem. The parity information now agrees that the disk has an empty filesystem on it, and that is what it is rebuilding.

An empty filesystem is not an empty disk. The only thing that sort of corresponds to the idea of an empty disk is a cleared disk. A cleared disk, such as can be created with preclear, is a disk full of zeros. A filesystem, even an empty filesystem, has the filesystem metadata that represents folders, where files are stored, etc. In the case of a newly formatted disk, which is the empty filesystem you wrote when you told it to format, you basically have the metadata that represents the top level folder, with no files or folders in it.

However, the empty filesystem metadata is a pretty small amount of the bits on the disk. All of the other bits that used to be there are still there (and also in the parity array), so if you let the rebuild complete all of those other bits will still be there. It might be possible to recover some of the files from those other bits using filesystem repair tools.

Since you say you have backups, it might be easier and more reliable to just restore from the backups. If that is what you want to do, I would say go ahead and let the rebuild complete, so the filesystem will be empty and it will also be consistent with parity. Then you can just copy everything from your backups.
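The explanation above can be sketched with a toy model, assuming byte-wise XOR parity and a 16-byte "disk" whose first 4 bytes stand in for filesystem metadata. The names (`other`, `lost`, `EMPT`) and sizes are invented for the illustration; a real format writes metadata to more than one region of the disk.

```python
def xor(a, b):
    """Byte-wise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))

# Toy array: `other` stands for all the healthy data disks XORed together,
# `lost` is the disk being emulated (4 bytes of metadata + 12 bytes of files).
other = bytes(range(16))
lost = b"HDR!" + b"user data..."
parity = xor(other, lost)

# With the disk missing, its contents are emulated from parity and the others.
emulated = xor(parity, other)

# "Format" writes an empty filesystem header to the emulated disk.
# Like any other write, it is reflected in parity.
formatted = b"EMPT" + emulated[4:]
parity = xor(other, formatted)

# A rebuild now reproduces exactly what parity describes.
rebuilt = xor(parity, other)
print(rebuilt)
```

In this model the rebuilt disk starts with the empty-filesystem header, while the remaining bytes are unchanged. That matches the point made in the post: the old file contents are still present as raw bits, but nothing in the new metadata refers to them, so the disk looks blank.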
Bizarro Posted August 2, 2016 Author

Sounds like a plan, thanks heaps for your help.

Worst case, I'll plug the old dying drive into a caddy, attach it to a PC and use some tools to recover the remaining 100GB from it, assuming that data is not on bad sectors.

Eventually I need to replace all the remaining Seagate ST3000DM001 drives as these have a nightmarish failure rate!
https://www.backblaze.com/blog/hard-drive-reliability-q4-2015/