Caennanu Posted June 22, 2022

G'day all,

Before I start my story: at the time of writing I'm a little salty, so excuse the negative undertone.

And a disclaimer: in previous posts I mentioned having Machine Check Events from ECC memory. This was caused by a defective memory module, which has been replaced. The system had been running fine for 2 - 3 weeks since replacing the module. While replacing the module I also replaced 3 Intertech fans (1200rpm PWM) with Noctua NF-F12 industrial PPC 3000rpm fans.

The question I'm looking to get answered: where did I go wrong, and how can I prevent this in the future?

Last Monday I noticed that 2 of the drives in my 5-disk array with 2 parity had failed. All WD Red 4TB drives. Which I found a bit odd. So the first thing I did was check whether the parity had failed. It didn't seem to have. The disks were being emulated. Phew, parity saved my ass....... or so I thought. Data was showing on the network and, albeit a bit slowly, I was able to access it. So I ordered 2 new drives and started troubleshooting.

Checked all cables and connections. All was fine. VMs and Dockers were running from the cache disk, so at least that hadn't failed. The CCTV recordings, in a 2nd mirrored pool, had no issues. All WD disks were attached to the onboard SAS controller of my ASRock Rack EPYCD8-2T. The Purple disks for my CCTV were also connected to the same controller, but on the 2nd port / breakout cable.

So we wait for the replacement disks: 2x 4TB Seagate drives. Figured I would take a different brand, making them easier to recognize.

Disks come in, I turn the server off, replace the supposedly failed disks and start the machine up. The array started offline, as requested, so I could add the 2 new disks to the array and start rebuilding. Rebuilding starts, data is still available through emulation. And then... errors galore. Every bit read is being errored. Ohh shhhh....t, something is wrong.
I let it run for a bit longer, figuring maybe it's a parity thing, but the rebuild speed drops from 80 MB/s to the KB/s range..... Alright, this isn't right. Something else must be wrong. At this moment I also noticed that the data that was on 1 disk (my collection of movies I gathered over the years) was no longer showing. So I hastily made a backup of my most important files to a USB disk, just to be safe. So these are safe.

Turn off the machine and check all the connections once again. Nothing seemed out of place, but something must be wrong. Since all affected disks were in an HDD tray caddy (for hot swap purposes), I figured maybe something was wrong with that thing. I removed the caddy, placed all the drives in a regular HDD tray, connected them directly to the power and data cables and booted the machine back up.

I restarted the parity build by swapping the new disks per slot and formatting them. After about 2 hours, I reach the 500GB mark where previously all the errors started appearing. No errors this time. The parity rebuild continues, all seems fine.

After about 5 hours, something happens. The one data disk that wasn't malfunctioning earlier now all of a sudden has errors and is being emulated. The parity build process changes in speed, from 80 MB/s on average to 1.2 GB/s, 2.3 GB/s, 4.6 GB/s and even topping out around 8 GB/s. Now we all know SATA drives can be fast. But no way spinning disks on SATA are THIS fast. But the rebuild completes shortly after that.

Since I had a hunch, I checked the network drive... Even though it is accessible, only the shares on the cache are still present. All while the parity disks are up, as well as the 2 new disks, and then the disabled 3rd disk..... So at this point, I basically figured I had lost all my data.

So . . . we reboot. Cause . . . I haven't tried turning it off and on again, right? Now Unraid wants the same disk in positions 1 and 3? Daf...q? When trying to force the correct disk, it gives me the 'wrong' message.
Turn the machine off. I hooked all the drives up to my HBA, which so far only uses 1 breakout for the SSDs that hold my VMs and aren't showing any issues. And turned it back on, hoping this would solve the issue. But the same thing happened.... crap....

Well, then there is only 1 thing left to do, right? New config... So we do a new config. Assign all drives where they belong. Start the array, and start rebuilding....

And this is where I am right now. No disks are being emulated. Only the shares on my cache are showing, and Unraid is clearing disk 2 and showing little to no data on disks 1 and 3. But I still have a little bit of hope.... According to the header, I'm still using about 34% of my array, which was the state before all of this happened. So now we wait for disk 2 to clear and hopefully rebuild data from the parity, but I don't think I'll be that lucky.

So . . . will I get my data back, any data? Where did I go wrong in troubleshooting? What should I have done: maybe a different order, different troubleshooting steps? Cause if I don't learn now, I'll never be able to trust this setup again.
PeteAsking Posted June 22, 2022

As an aside: since each disk is independent, to try to get data back you can just plug a "failed disk" into another computer and see if any of the data on it is readable. That may help you get some data back.

Caennanu (Author) Posted June 22, 2022

Just now, PeteAsking said:
As an aside: since each disk is independent, to try to get data back you can just plug a "failed disk" into another computer and see if any of the data on it is readable. That may help you get some data back.

I only have Windows machines though. Don't think they read XFS?

PeteAsking Posted June 22, 2022

I mean you should download a live CD, even just an Ubuntu live CD. If you boot off that, it's irrelevant what the OS on the other PC is.
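As a rough sketch of what "see if the data is readable" could look like from a live Linux session: the snippet below only builds and prints a read-only mount command, it does not run it. The device name `/dev/sdX1` and the mount point `/mnt/recovery` are placeholders, not values from this thread; the `ro` and `norecovery` options are standard XFS mount options that avoid writing to the disk and skip journal replay.

```python
import shlex


def readonly_xfs_mount_command(partition: str, mount_point: str) -> list[str]:
    """Build (but do not run) a read-only mount command for an XFS partition."""
    # "ro" mounts read-only; "norecovery" additionally skips XFS log replay,
    # which would otherwise modify the filesystem on the suspect disk.
    return ["mount", "-t", "xfs", "-o", "ro,norecovery", partition, mount_point]


# Placeholder device and mount point (assumptions for illustration only).
cmd = readonly_xfs_mount_command("/dev/sdX1", "/mnt/recovery")
print("sudo " + shlex.join(cmd))
```

Printing the command first, rather than executing it, leaves room to double-check the device name with something like `lsblk` before touching a disk that may hold the only copy of the data.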
Caennanu (Author) Posted June 22, 2022

Just now, PeteAsking said:
I mean you should download a live CD, even just an Ubuntu live CD. If you boot off that, it's irrelevant what the OS on the other PC is.

Alright, I'll give that a shot with the 2 failed drives I have lying about. Nothing should have happened to them other than what I described above. Maybe I get lucky.

PeteAsking Posted June 22, 2022

Also, when you choose New Config I think it won't delete data on a disk, so if a disk had data on it, that data will still be available. So you are probably not in as bad a state as you might believe.

PeteAsking Posted June 22, 2022

Just now, Caennanu said:
Alright, I'll give that a shot with the 2 failed drives I have lying about. Nothing should have happened to them other than what I described above. Maybe I get lucky.

No worries, I believe you should be OK, so don't stress too much.

Caennanu (Author) Posted June 22, 2022

Just now, PeteAsking said:
Also, when you choose New Config I think it won't delete data on a disk, so if a disk had data on it, that data will still be available. So you are probably not in as bad a state as you might believe.

That is what I thought. But why is it clearing the disk now then? Clearing the disk sounds to me like a format thing?

Caennanu (Author) Posted June 22, 2022

Just now, PeteAsking said:
No worries, I believe you should be OK, so don't stress too much.

Yeah.... Tell the other half that, who's missing all of her oh-so-important pictures of the sky, and shoes, and whatnot.

PeteAsking Posted June 22, 2022

Just now, Caennanu said:
That is what I thought. But why is it clearing the disk now then? Clearing the disk sounds to me like a format thing?

I must admit I forget the exact options when doing a New Config. It could be there is a checkbox to tell it not to format, or perhaps it's just clearing the new disks; this I am less sure about. Hopefully someone more experienced chimes in, but I do think you shouldn't give up on your data just yet.

PeteAsking Posted June 22, 2022

Also, people here are helpful, so if you take it slow, let people ask the questions they need to ask and don't rush, then if the data is available somehow you will probably get it back. It sounds super scary, but staying calm gives you the best chance of success, and it sounds like there are certainly some things to try before giving up hope.

Caennanu (Author) Posted June 22, 2022

4 minutes ago, PeteAsking said:
Also, people here are helpful, so if you take it slow, let people ask the questions they need to ask and don't rush, then if the data is available somehow you will probably get it back. It sounds super scary, but staying calm gives you the best chance of success, and it sounds like there are certainly some things to try before giving up hope.

Hope is all I've got. I'll try not to respond too quickly ;0
JonathanM Posted June 22, 2022

1 hour ago, Caennanu said:
swapping the new disks per slot and formatting them

Formatting is NEVER part of a recovery. The emulation includes the file system, so when you format, you tell Unraid to empty the filesystem.

Hopefully you can read all the individual disks and recover your data. It doesn't really sound like you had real disk failures, more a failure to read or write.

Posting diagnostics may shed some light on what's currently available, but since you've rebooted multiple times since this started, the original errors that started this are lost.
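The point about emulation can be illustrated with a toy model. This is a minimal sketch assuming a single XOR parity and 8-byte "disks"; it is not Unraid's actual implementation (which also supports a second parity), and all names here are made up for illustration. It shows that a write to an emulated disk is stored by updating parity, so a later rebuild faithfully reproduces the formatted (empty) contents rather than the old data.

```python
from functools import reduce


def xor_blocks(*blocks: bytes) -> bytes:
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Three toy data disks and their single XOR parity.
disk1 = b"movies.."
disk2 = b"pictures"  # this disk is about to drop out of the array
disk3 = b"backups."
parity = xor_blocks(disk1, disk2, disk3)

# disk2 is disabled: its contents are emulated from parity + surviving disks.
emulated_disk2 = xor_blocks(parity, disk1, disk3)

# "Formatting" the emulated disk is a write; with the disk absent, the write
# is recorded by updating parity (old contents out, new contents in).
formatted = bytes(len(disk2))  # stand-in for an empty filesystem
parity = xor_blocks(parity, emulated_disk2, formatted)

# Rebuilding onto a replacement disk now yields the formatted contents.
rebuilt_disk2 = xor_blocks(parity, disk1, disk3)
print(emulated_disk2, rebuilt_disk2)
```

In this model the original `pictures` data survives only on the physical disk that was pulled, which is why reading the old disks directly is the recovery path suggested above.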
Caennanu (Author) Posted June 22, 2022

Just now, JonathanM said:
Formatting is NEVER part of a recovery. The emulation includes the file system, so when you format, you tell Unraid to empty the filesystem. Hopefully you can read all the individual disks and recover your data. It doesn't really sound like you had real disk failures, more a failure to read or write. Posting diagnostics may shed some light on what's currently available, but since you've rebooted multiple times since this started, the original errors that started this are lost.

Aye, that's what I thought. But the disks were unmountable (forgot to mention, sorry, it's been a hectic day), thus I had to format them. Would there have been another option when they became unmountable because of a missing file system?

Yeah, in terms of recovery: I'm currently using recovery software on the original failed disks. It seems to be working, but the scan is still in progress (for the next 7 hours...).

And that is what I'm currently thinking, or at least hoping for: that it was simply an I/O issue, and that I can recover most of it.

itimpi Posted June 22, 2022

24 minutes ago, Caennanu said:
Would there have been another option when they became unmountable because of a missing file system?

Yes! It is covered here in the online documentation, accessible via the 'Manual' link at the bottom of the GUI.

ChatNoir Posted June 23, 2022

10 hours ago, Caennanu said:
Where did i go wrong and how can i prevent this in the future?

If it seems strange and you are unsure of what to do:
- breathe
- do nothing
- ask on the forums (and attach your diagnostics)

Caennanu (Author) Posted June 23, 2022 (edited)

8 hours ago, itimpi said:
Yes! It is covered here in the online documentation, accessible via the 'Manual' link at the bottom of the GUI.

Thank you. If I had followed the advice of ChatNoir, I probably would have found that.... And good timing at that. The disk that had no issues so far and was clearing yesterday has just finished . . . and, you guessed it, it is unmountable.

Edited June 23, 2022 by Caennanu

Caennanu (Author) Posted June 23, 2022 (edited)

So, I'm trying to follow the guide, @itimpi, but I run into an issue at step 4: I do not have the 'check filesystem status' button? (I know the entire array was done in XFS, so I'm following that thread.) And when taking the array out of maintenance mode, I get the following

Edited June 23, 2022 by Caennanu

Caennanu (Author) Posted June 23, 2022

Running the manual check now.

JorgeB Posted June 23, 2022

Please post current diagnostics.
JorgeB Posted June 23, 2022

11 hours ago, Caennanu said:
The parity build process changes in speed, from 80 MB/s on average to 1.2 GB/s, 2.3 GB/s, 4.6 GB/s and even topping out around 8 GB/s.

FYI, this will happen if Unraid is beyond the array's redundancy. E.g., if you have two parity drives and already have two disabled disks, and you then get errors on another disk, Unraid has no way of continuing the rebuild. So, to avoid corrupting the disk (if it's a previous disk that is being rebuilt on top of), it just skips ahead while there are multiple disk errors. Hence the speed: it's just skipping, not actually writing anything.
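Why a rebuild cannot continue past the array's redundancy can be seen in a toy model. This sketch assumes a single XOR parity and one-byte "disks" for brevity; with two parity drives the same argument applies once three disks are unreadable. It is an illustration only, not Unraid's code: with one disk missing the contents are determined exactly, but with two missing only their XOR is known, so many different pairs fit equally well.

```python
from functools import reduce


def xor_blocks(*blocks: bytes) -> bytes:
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# One-byte toy disks and their single XOR parity.
disk1, disk2, disk3 = b"\x0f", b"\x33", b"\x55"
parity = xor_blocks(disk1, disk2, disk3)

# One disk missing (within redundancy): the contents are fully determined.
recovered_disk2 = xor_blocks(parity, disk1, disk3)

# Two disks missing (beyond redundancy): only disk2 XOR disk3 is known.
combined = xor_blocks(parity, disk1)

# Every guess for disk2 gives a disk3 that is consistent with parity,
# so there is nothing to single out the true pair.
candidates = [
    (bytes([guess]), xor_blocks(combined, bytes([guess]))) for guess in range(256)
]
print(len(candidates), (disk2, disk3) in candidates)
```

Since any of those candidate pairs would satisfy parity, writing one of them onto a disk being rebuilt would just be writing a guess, which fits the explanation above that skipping is the safer behaviour.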
Caennanu (Author) Posted June 23, 2022 (edited)

7 minutes ago, JorgeB said:
FYI, this will happen if Unraid is beyond the array's redundancy. E.g., if you have two parity drives and already have two disabled disks, and you then get errors on another disk, Unraid has no way of continuing the rebuild. So, to avoid corrupting the disk (if it's a previous disk that is being rebuilt on top of), it just skips ahead while there are multiple disk errors. Hence the speed: it's just skipping, not actually writing anything.

Alright, that makes sense. Diagnostics coming soon.

---

Diagnostics added: bigboii-diagnostics-20220623-0902.zip

Edited June 23, 2022 by Caennanu (Added Diagnostics)
JorgeB Posted June 23, 2022

Disk2 was added to the array as a new disk and it was cleared. Clearing writes zeros to the disk, overwriting any data.
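To make "writes zeros" concrete, here is a small sketch of how one might check whether a region of a disk image contains nothing but zero bytes. It runs against temporary files rather than a real device; the helper name `is_zeroed` and the file contents are assumptions for illustration, and nothing here is an Unraid tool.

```python
import os
import tempfile


def is_zeroed(path: str, length: int, chunk_size: int = 1024 * 1024) -> bool:
    """Return True if the first `length` bytes readable from `path` are all zero."""
    remaining = length
    with open(path, "rb") as handle:
        while remaining > 0:
            chunk = handle.read(min(chunk_size, remaining))
            if not chunk:  # reached end of file before `length` bytes
                break
            if chunk.count(0) != len(chunk):
                return False
            remaining -= len(chunk)
    return True


# Demo on two small temporary "disk images" (not real devices).
with tempfile.TemporaryDirectory() as workdir:
    cleared = os.path.join(workdir, "cleared.img")
    used = os.path.join(workdir, "used.img")
    with open(cleared, "wb") as handle:
        handle.write(bytes(4096))
    with open(used, "wb") as handle:
        handle.write(bytes(2048) + b"XFSB" + bytes(2044))
    cleared_result = is_zeroed(cleared, 4096)
    used_result = is_zeroed(used, 4096)

print(cleared_result, used_result)
```

A region that reads back as all zeros has nothing left for recovery software to find, which is why a cleared disk is a much worse starting point than a disk that merely became unmountable.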
Caennanu (Author) Posted June 23, 2022

4 minutes ago, JorgeB said:
Disk2 was added to the array as a new disk and it was cleared. Clearing writes zeros to the disk, overwriting any data.

So . . . data recovery time. FML.

Caennanu (Author) Posted June 24, 2022

Does anyone have any recommendations for recovery software I can use, either on Ubuntu live-booted from USB or on Windows, where I hook up the drives via an HDD dock?

So far, I've been able to recover a decent number of files. But there's no folder structure present at all; it's all individual files. Also, the software seems to read inside my VM image files, instead of giving me the image file itself?

Since I had a spare key for EaseUS lying around, I'm using that, installed on a Windows laptop that I can run 24/7 for the time being. But I had the same 'issue' with Hetman Linux Recovery and Recoverit. (I only did the scans with the latter 2, since . . . well, limitations on free software.)