0rca Posted September 15, 2022 Hi all, two nights ago, just before going to bed, I realized that my array was offline and that one parity disk and one data disk had been disabled at the same time. I was tired, so I shut down the server and decided to look into it the next morning. I now know that was a mistake, because shutting down wiped the syslog. I have checked everything I could since then, but all I can tell is that both drives in question are fine and passed the SMART extended self-test. My question now is: what is the safest way to restore these drives? Do I unassign the data disk first and then rebuild it onto itself? Do I start with the parity disk, since it is the faster drive and the rebuild will be quicker? Or am I thinking about this wrong and my situation calls for a completely different approach? I'll attach the diagnostics and hope they are sufficient even though the relevant syslog has been overwritten. Thanks in advance for any help. Cheers, Michael Attachments: deathstar-diagnostics-20220914-0949.zip, deathstar-smart-20220915-0944.zip
JorgeB Posted September 15, 2022 Disks look OK, and since the emulated disk is mounting, and assuming contents look correct, you can rebuild on top: https://wiki.unraid.net/Manual/Storage_Management#Rebuilding_a_drive_onto_itself You can do both at the same time.
0rca Posted September 15, 2022 Damn, that was fast. Thanks Jorge, I'll do that. I just had to check with you guys first, because this is my first rebuild on Unraid and I wanted to make sure. Thanks for the blazingly fast answer.
JorgeB Posted September 15, 2022 You're welcome, and if it happens again, don't forget to save the diags before rebooting.
0rca Posted September 16, 2022 I am not sure whether it is preferred to create a new topic or to post in this one for continuity. But I need help again, this time for real. I had Unraid rebuild the array, and after 12 hours everything was back to normal. Yay! This morning I updated to 6.11.0-rc5 and rebooted (saving the diagnostics beforehand, just in case). Everything worked normally for a while, and then suddenly at 11:00 am I got read errors on ALL 16 disks. I was out of the office, so I only saw this now. Attached are both diagnostic files: the one from this morning, after the update and before the reboot, and one I took just now before taking the array offline. I hope there's no reason to panic and would appreciate any help. Cheers, Michael Attachments: deathstar-diagnostics-20220916-0912.zip, deathstar-diagnostics-20220916-1813.zip
JorgeB Posted September 16, 2022
Sep 16 11:02:30 Deathstar kernel: hpsa 0000:0d:00.0: handle_ioaccel_mode2_error: device is gone!
Problem with the RAID controller. Reboot/power cycle to see if it comes back, and if it does, post new diags after array start.
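For anyone wanting to check their own syslog for the same controller dropout, here is a minimal Python sketch. It assumes the message always has the shape of the single line quoted above; the function name `find_controller_dropouts` and the second sample line are made up for illustration, and other hpsa messages are not covered.

```python
import re

# Pattern modeled on the one syslog line quoted in this thread (an assumption:
# other kernels or drivers may word this message differently).
HPSA_GONE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+[\d:]{8})\s+(?P<host>\S+)\s+kernel:\s+"
    r"hpsa\s+(?P<pci>[0-9a-f:.]+):\s+.*device is gone"
)


def find_controller_dropouts(lines):
    """Return (timestamp, pci_address) for every 'device is gone' line."""
    hits = []
    for line in lines:
        match = HPSA_GONE.search(line)
        if match:
            hits.append((match.group("stamp"), match.group("pci")))
    return hits


sample = [
    # Real line from the thread:
    "Sep 16 11:02:30 Deathstar kernel: hpsa 0000:0d:00.0: "
    "handle_ioaccel_mode2_error: device is gone!",
    # Hypothetical unrelated line, only here to show it is ignored:
    "Sep 16 11:02:31 Deathstar kernel: some unrelated message",
]

print(find_controller_dropouts(sample))
```

To run it against a real log, pass an open file instead of `sample`, for example the syslog text file extracted from a diagnostics zip.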
0rca Posted September 16, 2022 Hi Jorge, thanks for helping me again. I rebooted and it came back up. Diags enclosed. Attachment: deathstar-diagnostics-20220916-1846.zip
mathomas3 Posted September 16, 2022 That's odd, 0rca... I also had a parity and a data disk drop from the array overnight... I rebuilt both of them with some spare disks that I have... Both disks' SMART reports look OK... I'm not going to say that this might be a bug unless more reports come up... but the timing between your failure and mine is odd. Attachments: tower-diagnostics-20220916-1203.zip, tower-diagnostics-20220915-0800.zip
Solution JorgeB Posted September 16, 2022 59 minutes ago, 0rca said: "I rebooted and it came back up. Diags enclosed." Everything looks fine; hopefully it was a one-time thing. If it happens again, I suggest going back to the last known good release to see if it's driver/kernel related.
0rca Posted September 16, 2022 Thanks Jorge. I'll do that. It could also be a failing HBA (HP Smart Array H240) though, right? I might get another one just in case; it's good to have a spare ready anyway.
JorgeB Posted September 16, 2022 Just now, 0rca said: "It could also be a failing HBA (HP Smart Array H240) though, right?" It could. It could also be overheating. Also check that it's well seated, or try it in a different PCIe slot.
0rca Posted September 16, 2022 Thanks, will check all that.
trurl Posted September 16, 2022 2 hours ago, mathomas3 said: "timing between your failure and mine is odd" Timing not that odd. Every single day on this forum people have disconnected disks due to hardware problems, often bad connections.
mathomas3 Posted September 16, 2022 3 minutes ago, trurl said: "Timing not that odd. Every single day on this forum people have disconnected disks due to hardware problems, often bad connections." We had some shifty power that day... hoping that was all that it was... Ordered a much larger UPS the same day.
0rca Posted September 17, 2022 Just FYI, it happened again today: all disks showed errors and the HBA was gone. I caught it just in time, went to the basement and measured temps. The HBA heatsink showed close to 70 degrees Celsius, so the die temperature would be even higher. Not good. I've added some active cooling to the HBA and booted back up. This time Parity 1 and Disk 4 were disabled. I am now rebuilding, hoping that my cooling is now sufficient. I have two questions, to better understand the situation. First: is it normal that two disks (1 parity and 1 data) get disabled, because at the moment the HBA crashes there is bound to be one data drive and one parity drive with I/O, or am I simply lucky that it is just two and it could easily be more? Second: theoretically, what would happen in the latter case? Assuming that only a few bytes might actually be wrong, would there be a way to restore the rest of the data, or would I have to copy the data from each drive somewhere else to save it? Having been a RAID user for decades, the whole Unraid concept is still new to me...
JorgeB Posted September 18, 2022 17 hours ago, 0rca said: "Is it normal, that two disks (1 parity and 1 data) are disabled, because in that specific moment, when the HBA crashes there's bound to always be one data drive with I/O and one parity or am I simply lucky that it is just two, but could easily be more?" It can disable any of the disks, parity or data; which one(s) is luck of the draw, but it won't disable more disks than there are parity drives.
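As background for why the number of disabled disks is tied to the number of parity drives, here is a minimal Python sketch of plain XOR parity. It is only an illustration under stated assumptions: a single parity disk computed as the bytewise XOR of the data disks, with tiny made-up disk contents. It is not Unraid's actual code, and the math behind a second parity drive is different and not shown.

```python
from functools import reduce


def xor_blocks(blocks):
    """Bytewise XOR of equally sized byte strings."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Three hypothetical data disks, four bytes each (made-up contents).
data = [b"\x01\x02\x03\x04", b"\x10\x20\x30\x40", b"\xaa\xbb\xcc\xdd"]

# The parity disk holds the XOR of all data disks.
parity = xor_blocks(data)

# Pretend the second data disk is disabled. XORing the parity with the
# surviving data disks reproduces its contents, which is what emulation
# and rebuild rely on in this simplified model.
rebuilt = xor_blocks([parity, data[0], data[2]])

print(rebuilt == data[1])
```

With only one XOR parity, a second missing disk leaves one equation with two unknowns, so it cannot be reconstructed. That is consistent with the limit described above: the array can only emulate as many disks as it has parity drives.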