Jenfil Posted March 17

Hi everyone,

It's been a month now that I have been trying to do the most basic thing: get my array set up. I haven't even reached creating shares because of this, and I'm getting very disappointed in the process. Sometimes it's the parity disk that doesn't activate, other times it's a drive that gets unassigned. It's simply horrible. I've attached screenshots and the diagnostics file.

So far I have:
- reset the config twice, with reboots in between.
- run SMART tests multiple times; sometimes all passed, sometimes a test got stuck at 10% and I had to reboot.
- run preclear and got it to complete successfully on almost all drives, but it also crashes midway, and at 40h per drive I gave up.
- formatted, rebooted, rebuilt the array, rebooted, started the array, never with any luck.

I have nothing on my disks yet, so I'm not losing anything except a lot of time, but this is getting very frustrating. The problem is that with the SMART tests failing to run, I can't even tell if I have a bad disk. The disks came from a QNAP that was running fine, so I wasn't expecting so much trouble.

The next thing I can think of is removing the disks and running a SMART test on another computer?

unraid-diagnostics-20240317-0932.zip
Frank1940 Posted March 17

I would start by testing the RAM. (There is a boot option to check non-ECC RAM in Unraid's boot menu.)
ConnerVT Posted March 17

My gut feeling is that it is related to your LSI SAS3008 controller, but I'll let the experts with these see if they agree and offer advice.
JorgeB Posted March 17

Lots of these:

Mar 17 06:23:44 Unraid kernel: sd 4:0:1:0: Power-on or device reset occurred

These usually mean a power/connection issue; check/replace cables/PSU.
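To see how widespread messages like the one quoted above are, one option is to count them per device in the syslog from the diagnostics zip. The following is a minimal sketch, not an Unraid tool: the function name `count_resets` and the sample lines other than the first are made up for illustration, and the regular expression assumes the exact message format shown in the post.

```python
import re
from collections import Counter

# Matches kernel lines of the form quoted in the thread, e.g.
#   Mar 17 06:23:44 Unraid kernel: sd 4:0:1:0: Power-on or device reset occurred
RESET_RE = re.compile(
    r"kernel: sd (\d+:\d+:\d+:\d+): Power-on or device reset occurred"
)


def count_resets(lines):
    """Return a Counter mapping SCSI address (H:C:T:L) to reset-message count."""
    counts = Counter()
    for line in lines:
        match = RESET_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Hypothetical sample: only the first line is taken from the thread.
SAMPLE = [
    "Mar 17 06:23:44 Unraid kernel: sd 4:0:1:0: Power-on or device reset occurred",
    "Mar 17 06:24:10 Unraid kernel: sd 4:0:1:0: Power-on or device reset occurred",
    "Mar 17 06:24:12 Unraid kernel: sd 4:0:3:0: Power-on or device reset occurred",
    "Mar 17 06:25:00 Unraid kernel: some unrelated message",
]

if __name__ == "__main__":
    for address, count in count_resets(SAMPLE).most_common():
        print(f"{address}: {count} reset message(s)")
```

If the resets cluster on one address, that points at a single slot, cable, or disk; if they are spread across many addresses, something shared (backplane, HBA cable, power) becomes the more plausible suspect.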
Jenfil (Author) Posted March 17

1 hour ago, JorgeB said:
Lots of these: Mar 17 06:23:44 Unraid kernel: sd 4:0:1:0: Power-on or device reset occurred. These usually mean a power/connection issue, check/replace cables/PSU.

That's my manual intervention when everything got stuck during a preclear. The PSUs are fine.

2 hours ago, ConnerVT said:
Gut feeling it is related to your LSI SAS3008 controller. But I'll let the experts with these see if they agree and offer advice.

iDRAC says everything is fine disk-wise. Any idea what to check?

5 hours ago, Frank1940 said:
I would start by test RAM. (There is a boot option to check non-ECC RAM in Unraid's Boot Menu.)

Not sure how my RAM could be linked to this. I did indeed skip the memory check, having 128 GB of RAM and fearing 5 days of checks while still figuring out how to get things running.

I started hot-swapping disks to run SMART tests separately on another computer and identified 2 with read errors, but the other 4 are fine. Unraid didn't have an issue with the 2 errored disks. The parity disk and the 9VV didn't raise a SMART error...
JorgeB Posted March 17

9 minutes ago, Jenfil said:
That's my manual intervention when everything got stuck during a pre-clear, PSUs are fine

There are a lot of these, over at least 30 minutes, including just before the disk errors. Were you intervening during all that time? And what exactly do you mean by intervening? Were you unplugging cables with the server on while reading from/writing to the disks? That will cause errors for sure.
Jenfil (Author) Posted March 18

2 hours ago, JorgeB said:
There are a lot of these, during at least 30 minutes, including just before the disk errors, were you intervening during all that time? And what exactly do mean by intervening? Were you unplugging cables with the server on while reading/writing to the disks? That will cause errors for sure.

Apologies, I just noticed those entries. No, that wasn't me. I've checked the server logs on Lifecycle, and they don't show anything of the sort. The last hard reset was on March 5th, and there are 2 redundant power supplies, so there's no way anything shut down without me doing something. Looks like you found something else that's pretty odd.
JorgeB Posted March 18

Could be just a SATA cable issue, since both disks are showing UDMA CRC errors, especially parity.
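The UDMA CRC error count mentioned above is SMART attribute 199 on most SATA drives. Its raw value is cumulative and does not reset, so the useful check is whether it keeps increasing after a cable, bay, or backplane change. Below is a rough sketch that pulls the raw value out of the attribute table printed by `smartctl -A /dev/sdX`. The helper name `udma_crc_count` and the sample rows and their values are invented for illustration, and the sketch assumes the standard 10-column attribute table with a plain integer raw value.

```python
def udma_crc_count(smartctl_attributes):
    """Return the raw value of SMART attribute 199, or None if it is absent.

    `smartctl_attributes` is the text printed by `smartctl -A`.
    """
    for line in smartctl_attributes.splitlines():
        fields = line.split()
        # Attribute rows have 10 columns: ID, name, flag, value, worst,
        # thresh, type, updated, when_failed, raw value.
        if len(fields) >= 10 and fields[0] == "199":
            return int(fields[9])
    return None


# Hypothetical excerpts captured before and after swapping a cable.
BEFORE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  9 Power_On_Hours          0x0032   094   094   000    Old_age   Always       -       4521
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       37
"""
AFTER = BEFORE.replace("       37", "       41")

if __name__ == "__main__":
    before, after = udma_crc_count(BEFORE), udma_crc_count(AFTER)
    if before is not None and after is not None and after > before:
        print(f"CRC errors still increasing: {before} -> {after}")
    else:
        print("No new CRC errors recorded")
```

A count that stays flat after the change suggests the old errors came from the replaced link; a count that keeps climbing means the faulty part of the path is still in place.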
Jenfil (Author) Posted March 20

On 3/18/2024 at 5:33 AM, JorgeB said:
Could be just a SATA cable issue, since both disks are sowing UDMA CRC errors, especially parity.

Switched the drives to other bays, no difference. I did a SMART test on all drives separately; the 74CC threw a read error, nothing on the others.

Started over with a new config, and the result is completely different and completely random, look at the screenshot. I can't format disks 1 and 4, and disk 2 keeps erroring out.
Jenfil (Author) Posted March 20

Ha yes, more logic, started a new config, just repeated the array
Jenfil (Author) Posted March 21

Here you go. Thanks for the help with this, I'm at a complete loss.

unraid-diagnostics-20240320-2137.zip
trurl Posted March 21

Have you tried reseating the controller? Is it overheating?
JorgeB Posted March 21

Still a lot of errors with the devices. In my experience this is a bad SATA or power connection to the disk(s). It could also be a PSU or controller issue, but that is much less likely. Is this a server with trays, or are the disks connected directly with miniSAS cables?
Jenfil (Author) Posted March 22

13 hours ago, JorgeB said:
Still a lot of errors with the devices, in my experience this is bad SATA or power connection to the disk(s), could also be a PSU or controller issue, but much less likely, is this a server with trays or are the disks connected directly with minSAS cables?

Yes, it's a Dell R730xd with 12 trays. A backplane issue, you think?

trurl said:
Have you tried reseating controller? Is it overheating?

Temps don't go above 27C.
JorgeB Posted March 22

8 hours ago, Jenfil said:
a backplane issue you tihnk?

Could be, or a problem with the cables that go from the HBA to the backplane.
Jenfil (Author) Posted March 22

Oh man, that doesn't sound good. Before I start ordering parts, is it worth trying another distro like TrueNAS or something, just to rule out any software issue?
JorgeB Posted March 22

I don't see how those errors would be an OS issue, but you can try TrueNAS. If you do, look at the logs and zpool status: a redundant ZFS pool can recover more easily, so you may miss errors that the Unraid array would report.
Jenfil (Author) Posted March 29

Well, I tried with TrueNAS, which gave me SMART errors on 2 disks, so I got rid of those. Then I swapped the backplane board and I don't have the power errors anymore. Now all that's left are those I/O errors, so I'll try the PERC cable once it arrives. There is light at the end of the tunnel!