N7Gabe Posted June 30 (edited)

Setup is 15 disks ranging from 1-20TB with 1 parity drive. All disks are connected via a Broadcom LSI 9305-16i HBA, with Mini-SAS to 4x SATA cables.

Four drives play a role in this:
- 4TB_FAIL_1: original (~10 year old) drive in disk5
- 4TB_FAIL_2: original (~10 year old) drive in disk4
- 20TB_NEW_1: new drive in disk5
- 20TB_NEW_2: new drive in disk4

Events leading to here:

- disk5 (4TB_FAIL_1) got disabled from the array a couple of times. I noticed that one of the drives (disk5) had no temperature reading, but Unraid said it was active and healthy. The same disk was also present in the unassigned devices list, but with a new drive/device identifier (sdX). Another thing I noticed was that real-time system readings like CPU usage broke: everything showed 0, with some values showing huge numbers in e notation or infinity. Unraid sent zero notifications about this. Array management was unresponsive, so I stopped everything else and shut down the system via the web interface. After starting the system, Unraid now showed the drive as disabled... (red X). I removed it from the array and ran every SMART test: 0 errors. I mounted it with UD and spot-checked the data on it; everything seemed intact, so I made a full backup of it on one of the bigger drives in the array. Then I put it back in the same place (disk5) and Unraid rebuilt it without any issues. This repeated every couple of weeks, 3-4 times in total.
- disk5 (4TB_FAIL_1) started producing sector and read errors. The last disabling went like every other, but after system start it showed that sector errors had started to creep up. This time I did not put it back and ordered two 20TB drives, one to replace this one and another for even more storage or a second parity.
- Disks arrived. I put 20TB_NEW_1 physically in place of 4TB_FAIL_1 and started rebuilding. I noticed that during the rebuild, disk4 (4TB_FAIL_2) was now shown as active but without a temperature reading, and on the UD list.
Basically everything was the same as with the disk5 disablings before. However, when I noticed this the rebuild was already past 4TB, so I thought the drive got disabled after everything had been read from it for the rebuild. The rebuild seemed to finish without errors. I checked disk5 and everything seemed intact; 20TB_NEW_1 is also in good health. disk4's content was emulated, but everything seemed intact as well. I spent a lot of time checking whether all the data was good and found no errors. 4TB_FAIL_2 died completely out of the blue, from 0 SMART errors. I tried connecting it to another PC to rule out an HBA/cable issue: no change, drive confirmed dead.
- Put 20TB_NEW_2 physically in place of 4TB_FAIL_2 and started rebuilding. Around ~1.5TB into the disk4 rebuild, disk5 got disabled from the array (similar to the above: green dot, active, but no temp, listed in UD with a new ID). It filled the 128MiB of logs (according to System in Dashboard) with a lot of read errors for disk5. Disk4 was now spun down. The rebuild percentage was stuck, but the process did not stop, and again there were no notifications. I tried to pause/cancel the rebuild, but as usual it was unresponsive... I noticed, however, that the rest of the disks were still being read as if the rebuild continued (but with 0 writes to the now spun-down disk4, of course). It was late and I had rough days coming up, so I just left it there. ~2 days later I came back and saw there was no read/write activity and array management was still unresponsive, so I shut down the system via the web interface.

Now:

I turned the system back on, and Unraid sent a notification that the rebuild was successful... I started the array: disk5 has a red X, drive disabled, content emulated. Disk4 has only the ~1.5TB of data, and the emulated disk5 now only has ~1.5TB as well. Logs are spewing xfs_metadata corruption, but with no identifier (unless md5p1 is an identifier?). The array shows that the drives have 4TB of content like the old ones had, but in reality that's not true. User shares are gone as well.
(???) Both disk4 and disk5 were SMART tested and have 0 errors.

Another thing I noticed: now, after drives are spun down, Unraid spins them back up within a minute (???), regardless of whether the array is started. Disk settings still has the 15 min spin down delay as before.

In theory, if the data on disk5 (20TB_NEW_1) is intact and this rebuild didn't mess up parity, disk4 should still be salvageable. I just need to tell Unraid somehow that disk5 is OK and disk4 needs a rebuild. However, I don't know how to do that. If disk5 is really gone, maybe even 4TB_FAIL_1 can be put back to save something even with sector errors, if it can survive one more rebuild.

I haven't removed the disabled disk5 from the array to check its contents, because I'm now at a point where I'm not sure which steps might cause data loss.

Questions:
1. Is it safe to remove the disabled disk5 from the array, to see if it can be mounted with UD and check if the data is intact? By safe I mean that it doesn't hurt my chances of restoring the disk4 content.
2. What is the best course of action to save this situation and restore everything?
3. Is it possible that there is an HBA/cable issue for the slot of disk5? At first I thought 4TB_FAIL_1 was disabled because it was beginning to fail, but it is suspicious that a brand new drive in the same place produced the same symptoms.
4. Why does Unraid always shit itself when it disables a drive from the array? It is very frustrating that the core functionality of the product fails at its most critical moment...

Thanks for taking the time to read through this wall of text!

Edited July 1 by N7Gabe: Grammar fixes.
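One symptom in the post above is a disk that stays "active" in the array while also reappearing in the unassigned devices list under a new sdX name. A minimal sketch of how that could be spotted from the command line, assuming a Linux system that exposes persistent names under /dev/disk/by-id (the function names here are made up for illustration and are not part of Unraid):

```python
import os


def map_ids_to_devices(by_id_dir="/dev/disk/by-id"):
    """Return {persistent_id: kernel_name}, e.g. {'ata-MODEL_SERIAL': 'sdf'}."""
    mapping = {}
    for name in sorted(os.listdir(by_id_dir)):
        # Skip partition links such as 'ata-MODEL_SERIAL-part1'.
        if "-part" in name:
            continue
        # Each entry is a symlink to the current kernel device node.
        target = os.path.realpath(os.path.join(by_id_dir, name))
        mapping[name] = os.path.basename(target)
    return mapping


def changed_devices(before, after):
    """Return {id: (old_name, new_name)} for disks whose sdX name changed."""
    return {
        disk_id: (before[disk_id], after[disk_id])
        for disk_id in before
        if disk_id in after and before[disk_id] != after[disk_id]
    }
```

Taking one snapshot with `map_ids_to_devices()` while everything is healthy and comparing it with a later one through `changed_devices()` would flag a drive that dropped off the bus and re-enumerated, which is what the temperature-less disk5 appears to have done.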
N7Gabe Posted July 1 (Author)

Here are the diagnostics, right after boot and array start.

foundation-diagnostics-20240701-1818.zip
JorgeB Posted July 1

Those diags don't show the original errors, which is what I mostly wanted to see. For the current status, check the filesystem on the emulated disk5; run it without -n. And don't re-assign disk5 for now, but yes, you can unassign it and mount it with UD, but only with the array stopped.
N7Gabe Posted July 1 (Author)

Ran the check on disk5 without -n and it repaired a lot of stuff; I guess those were the things missing due to the broken rebuild process. I also ran it with the -n flag on disk4 and it is even worse, but I guess that is expected for the same reason. I also ran it on a couple of other drives at random and they have no errors. Not sure what I gained by repairing the emulated disk5, which is missing more than half of its content.

If I unassign disk5's drive, then I won't be able to put it back in the array, even for maintenance, without re-assigning, right? Is there anything we might/can do with it while it is in the array in a disabled state, or shall I remove it and mount it with UD to check if all is well?
itimpi Posted July 1

1 minute ago, N7Gabe said:
    right? Is there anything we might/can do with it while it is in the array in disabled state, or shall I remove it and mount with UD to check if all is well?

If it is disabled then Unraid is ignoring it, so the fact that it is assigned is not really relevant. At this point it is best to see if UD can mount it; that may give a better result than trying to repair the emulated drive.
N7Gabe Posted July 1 (Author)

Unassigned the drive under disk5 and mounted it with UD; looks like everything is there. Created a share, and I can access and read the content without errors. Ran xfs_repair -n on it and there were no errors either. SMART is good as well. Not sure what else to check. Not sure why the drive got disabled during the rebuild. What's next?
JorgeB Posted July 2

If the actual disk5 looks better than the emulated one, you can do a new config with it and sync parity instead.
N7Gabe Posted July 2 (Author)

Disk5 dropped during the disk4 rebuild, so disk4 has filesystem issues and missing data. Because of this, the emulated disk5 had filesystem issues and missing data too. Hopefully parity was not touched. So: disk4 garbage, emulated disk5 garbage, actual unassigned disk5 good, parity hopefully good.

The objective is to recover disk4. The only way I see to do this is to rebuild disk4 with disk5 (reassigned without a rebuild?) and with the help of the current parity. Not sure if a new config is good for this. I've read the docs and it's not very clear to me, but it seems that it would rebuild the parity, so it's no good for me, as I need parity to rebuild disk4 with the help of disk5.
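The reasoning in this post can be illustrated with a toy model. With a single parity drive, parity is the XOR of all data disks, so any one missing disk can be recomputed from parity plus every other disk. The sketch below assumes three tiny made-up "disks" and models the dropped disk5 as returning zeros, which is a simplification for illustration and not a claim about what Unraid actually read during the failed rebuild:

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, position by position."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Toy array: three data disks holding four bytes each (made-up contents).
disk3 = b"\x11\x22\x33\x44"
disk4 = b"\xaa\xbb\xcc\xdd"
disk5 = b"\x0f\x1e\x2d\x3c"
parity = xor_blocks([disk3, disk4, disk5])

# Healthy rebuild of disk4: parity XOR every other data disk.
rebuilt_ok = xor_blocks([parity, disk3, disk5])

# disk5 drops out mid-rebuild; its reads are modeled as zeros (assumption).
rebuilt_bad = xor_blocks([parity, disk3, bytes(4)])

print(rebuilt_ok == disk4)   # True
print(rebuilt_bad == disk4)  # False
```

The same arithmetic explains why the emulated disk5 also looked broken: emulating disk5 needs a correct disk4, and disk4 was only partly rebuilt. As long as parity itself was never rewritten, the real disk5 plus parity still determine disk4, which is the idea behind reassigning disk5 without a rebuild.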
JorgeB Posted July 2

1 hour ago, N7Gabe said:
    Disk5 dropped during disk4 rebuild, so disk4 has filesystem issues and missing data.

If this happened you should have canceled the rebuild immediately. Now it may be too late to recover disk4, but you can still try. Note that it will only work if parity is still valid for that config:

- Tools -> New Config -> Retain current configuration: All -> Apply
- Check all assignments and assign any missing disk(s) if needed
- IMPORTANT - Check both "parity is already valid" and "maintenance mode" and start the array (note that the GUI will still show that data on parity disk(s) will be overwritten; this is normal as it doesn't account for the checkbox, but it won't be as long as it's checked)
- Stop array
- Unassign disk4
- Start array (in normal mode now) and post new diags.
N7Gabe Posted July 2 (Author)

30 minutes ago, JorgeB said:
    If this happened you should have canceled the rebuild immediately

As I detailed in the OP, I tried, but Unraid was not responsive to that, and unfortunately I did not see it happen in real time.

32 minutes ago, JorgeB said:
    -Check all assignments and assign any missing disk(s) if needed

This basically means assigning disk5 back, right? And keeping disk4 in the array in this step?

Also, as I mentioned in the OP, my user shares are gone. Is there any chance of restoring them before making a new config, or will I have to recreate them?
JorgeB Posted July 2

31 minutes ago, N7Gabe said:
    This means assign disk5 back basically, right? and keep disk4 in the array in this step?

Initially yes, then you unassign disk4.

31 minutes ago, N7Gabe said:
    Also as I mentioned in op, my user shares are gone. Is there any chance of restoring them before making a new config, or will I have to recreate them?

Shares would come back if there are no issues after doing the procedure above.
N7Gabe Posted July 2 (Author, edited)

Done with the steps. A minor difference was that in New Config the option was called "preserve current assignments" and not "retain current configurations". I also had to check a "Yes I'm sure" box when starting the array with disk4 unassigned.

I can't browse the missing emulated disk4; not sure if that's intended or if it's a sign that the data is gone. What's next, reassign disk4 and hope it rebuilds?

Edit: user shares are back.

foundation-diagnostics-20240702-1352.zip

Edited July 2 by N7Gabe
JorgeB Posted July 2

Check filesystem for the emulated disk4, run it without -n.
N7Gabe Posted July 2 (Author)

Restarted the array in maintenance mode. The GUI cannot see a filesystem on the emulated disk4, but I can see a /dev/md4p1. I ran xfs_repair -n on it to check; is this the one I should run it on without the -n? Attached the results of the command:

xfs_repair__n_dev_md4p1.txt
JorgeB Posted July 2

11 minutes ago, N7Gabe said:
    GUI cannot see a filesystem on emulated disk4

Forgot to mention that you need to set the filesystem to xfs first, but you can also do it from the CLI. And yes, run it without -n, and if it asks for -L, use it.
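The three xfs_repair variants mentioned in this thread (check only with -n, repair without -n, and repair with -L when the tool asks for it) can be summarized in a small sketch. The helper name `xfs_repair_cmd` is made up for illustration; /dev/md4p1 is the device named in the thread, and the array is assumed to be started in maintenance mode as described above:

```python
import subprocess


def xfs_repair_cmd(device, dry_run=True, zero_log=False):
    """Build the argument list for an xfs_repair call on `device`."""
    if dry_run and zero_log:
        raise ValueError("-L only makes sense for a real repair, not with -n")
    cmd = ["xfs_repair"]
    if dry_run:
        cmd.append("-n")  # no-modify mode: report problems, change nothing
    elif zero_log:
        cmd.append("-L")  # force log zeroing; recent metadata changes may be lost
    cmd.append(device)
    return cmd


def run(cmd):
    """Run the command and return the CompletedProcess with captured output."""
    return subprocess.run(cmd, capture_output=True, text=True)


# Check only, then a real repair, then a repair with log zeroing.
print(xfs_repair_cmd("/dev/md4p1"))
print(xfs_repair_cmd("/dev/md4p1", dry_run=False))
print(xfs_repair_cmd("/dev/md4p1", dry_run=False, zero_log=True))
```

Keeping the dry run as the default mirrors the order used in the thread: look first with -n, and only add -L after a plain repair refuses to proceed because of a dirty log.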
N7Gabe Posted July 2 (Author)

It asked for -L, ran it. It logged a lot but it's done. Next is to restart the array normally and see what happens? Do I need to set the fs before starting the array?
JorgeB Posted July 2

1 hour ago, N7Gabe said:
    Do I need to set the fs before starting the array?

No, just start normally, then check contents for the emulated disk4, to see if it's better than before.
N7Gabe Posted July 2 (Author)

Now I can browse it. There is only a lost+found folder, but the size seems roughly right. I started digging inside: a lot of stuff is there and seemingly intact, but some is not (it looks like more data is intact than what was rebuilt on disk4 originally when disk5 dropped). I guess this is the best that can be done. I'll have to reorganize the folders manually and maybe rename files that lost their names; how will I know their names?

Now, which one is safer: backing up this folder somewhere else in the array first while it is emulated, or just reassigning disk4 so the lost+found folder gets rebuilt onto the drive assigned to disk4? I assume backup is safer as I only have to write ~4TB instead of rebuilding that to a 20TB drive.
JorgeB Posted July 2

9 minutes ago, N7Gabe said:
    I assume backup is safer as I only have to write ~4TB instead of rebuilding that to a 20TB drive.

Probably.
N7Gabe Posted July 3 (Author)

Disk5 got disabled again while copying the emulated disk4 content to disk12. I'm starting to suspect that it is related to its physical location, either the data or the power cables. Maybe the previous disk5 disablings and the sector errors are not related, and these diagnostics will shed some light on it.

So these are the steps I'm thinking of taking:
- Stop Array, check disk5 SMART, filesystem and content with UD (I expect the same as before, everything good and intact)
- Shutdown, switch power or data cables to start ruling them out
- Start machine, Tools -> New Config -> Preserve Current Assignments: All -> Apply (so disk5 doesn't get rebuilt when the array starts and can be used to emulate disk4)
- Start Array with "Parity is Valid" and "Maintenance mode"
- Check disk4 filesystem, repair if needed (though it should be OK this time)
- Restart Array normally, check what was copied, continue with the rest

Is this correct? Thanks, I appreciate the time you've spent helping me out.

foundation-diagnostics-20240703-0933.zip
JorgeB Posted July 3

It's not reported as a disk problem, so it's most likely a power/connection issue. You can try different cables or a different slot; also make sure no excessive splitters are in use.
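One common way to look for a connection problem rather than a disk problem is SMART attribute 199 (UDMA_CRC_Error_Count) on SATA drives, which counts transfer errors on the link and typically rises with bad cables or connectors. The sketch below assumes a report produced by something like `smartctl --json -A /dev/sdX` (smartmontools 7.0 or newer) and the `ata_smart_attributes` table layout of that JSON output; no SMART values were posted in this thread, so any numbers fed to it would be your own:

```python
import json


def parse_report(text):
    """Parse the JSON text printed by smartctl into a dict."""
    return json.loads(text)


def crc_error_count(report):
    """Return the raw value of attribute 199, or None if it is absent."""
    table = report.get("ata_smart_attributes", {}).get("table", [])
    for attribute in table:
        if attribute.get("id") == 199:
            return attribute.get("raw", {}).get("value")
    return None


def crc_increased(before, after):
    """True if the CRC error counter grew between two reports."""
    old, new = crc_error_count(before), crc_error_count(after)
    if old is None or new is None:
        return False  # attribute missing, nothing can be concluded
    return new > old
```

Comparing a report taken before a large copy with one taken after the disk drops would show whether the counter moved. A counter that stays flat does not rule out a power problem, so swapping cables or the slot as suggested above is still the more direct test.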
N7Gabe Posted July 3 (Author)

Could you please confirm if the steps I outlined are the correct course of action?
JorgeB Posted July 3

Forgot to mention: disk5 wasn't disabled, because there's already a disabled disk, so you just need to check/replace cables and power back up.