hernandito Posted December 18, 2023 (edited January 4 by hernandito)

Hi Team,

I am at my wit's end. Over the past month, a good chunk of my drives have been failing one after another. Some drives are new, some are old. I know this is not typical, and something else must be wrong.

Typically one or two of the drives get the red X. On another Unix machine I have somehow been able to get them working again, either with xfs_repair or by mounting and re-mounting. I have shuffled my data around a lot onto spare new drives. I have also managed to shrink my array by copying data from old, smaller drives onto new drives.

Thinking this could be a symptom of a bad power supply, I replaced it with a more powerful one (1800W). No luck.

This past weekend, I managed to complete a new configuration with a complete parity rebuild. This morning I woke up to another red X. I powered off the server, took the drive out, put it in a separate PC and, using PartedMagic, tried to mount it with no luck. I ran xfs_repair. It started by attempting to find the secondary superblock, where you get all the sequential dots. This has never fixed anything in the past, so I canceled it.

I put the drive back in unRAID, and the X is still there; the disk shows as "Not Installed". But it shows up in Unassigned Devices, and I am able to mount it and read its contents.

This is what has been happening for weeks. I copy the data to another drive and try to rebuild everything again, and it keeps happening.

Any thoughts on what could be wrong? I am attaching two diagnostics: one is from this morning when I woke up, and the second is from right now, where the failed drive is mounted but still shows as "Not Installed".

I really need some help.

Thank you!

H.

tower-diagnostics-20231218-0606.zip
tower-diagnostics-20231218-0927.zip
mathomas3 Posted December 18, 2023

I haven't looked at the diags, but if you are using SATA drives, that could be the problem... I did run them when I first started, but after having a good number of those drives fail I moved on to using SAS.

My typical SATA drive failures happened every two months, running 8 drives at the time. Since moving over to SAS I might have one drive fail a year using renewed 8TB drives, and that's running 20 drives.

Something to consider.
hernandito (Author) Posted December 18, 2023

7 minutes ago, mathomas3 said: I haven't looked at the diags, but if you are using SATA drives, that could be the problem... I did run them when I first started, but after having a good number of those drives fail I moved on to using SAS. My typical SATA drive failures happened every two months, running 8 drives at the time. Since moving over to SAS I might have one drive fail a year using renewed 8TB drives, and that's running 20 drives. Something to consider.

Thanks @mathomas3. I don't think this is related to SATA vs SAS, and I don't think the drives are failing. Making the switch is too expensive, and I have been using SATA drives for over 10 years on my server. I am retiring some Hitachi 4TB drives even though they are running smoothly.
Lolight Posted December 18, 2023

1 minute ago, hernandito said: I don't think this is related to SATA vs SAS...

Yeah, it's an outdated but still surprisingly common misconception that an HDD's build quality, and ultimately its reliability, can be judged by its communication interface.
trurl Posted December 18, 2023

Looks like you must have rebooted before getting diagnostics, so we can't see anything about why the drive was disabled. Do you have diagnostics or at least a syslog from before the reboot?
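For anyone following along: if a syslog from before the reboot does exist, it can be skimmed for disk-related messages with a short script. The sketch below is only an illustration. The keyword list is an assumption made up for this example, not something taken from the attached diagnostics, and real kernel messages may be worded differently, so adjust the patterns to what your own log contains.

```python
import re

# Hypothetical keywords for this example; they are assumptions, not an
# authoritative list of what a failing disk or cable will log.
SUSPECT_PATTERNS = [
    r"I/O error",
    r"hard resetting link",
    r"read error",
    r"write error",
]


def find_suspect_lines(lines, patterns=SUSPECT_PATTERNS):
    """Return (line_number, text) pairs for lines matching any pattern.

    `lines` can be any iterable of strings, such as an open file.
    Matching is case-insensitive and line numbers start at 1.
    """
    combined = re.compile("|".join(patterns), re.IGNORECASE)
    hits = []
    for number, line in enumerate(lines, start=1):
        if combined.search(line):
            hits.append((number, line.rstrip("\n")))
    return hits
```

To use it, open the saved syslog as a text file (for example with `open(path, errors="replace")`, where `path` is wherever you saved the log) and pass the file object to `find_suspect_lines`. The line numbers make it easier to quote the relevant part of the log in a reply.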
mathomas3 Posted December 18, 2023

3 minutes ago, Lolight said: Yeah, it's an outdated but still a surprisingly common mis-conception that the HDDs build quality and ultimately reliability can be judged by its communication interface.

I would disagree. I have stacks of then-new 3, 5, and 6TB green/red/black drives that failed regularly. Since using SAS (retired data center) drives, my failure rate has dropped by a lot, and maybe that's the thing. SAS drives are intended for data centers, and thus the build quality is better? Any way you put it, it's just my observations...

Back to OP. Are you using a SATA controller card, given the number of drives that you are using?
itimpi Posted December 18, 2023 (Solution)

Do you use power splitters to get the power to the drives? I have found these can often cause problems.
Lolight Posted December 18, 2023

4 minutes ago, mathomas3 said: SAS drives are intended for data centers and thus the build quality is better? Anyway you put it, it's just my observations...

I really don't want to get into an argument, since it would lead to this thread being hijacked by an unrelated topic. I'd suggest creating another thread if you'd like to discuss it.
hernandito (Author) Posted December 18, 2023

6 minutes ago, itimpi said: Do you use power splitters to get the power to the drives? I have found these can often cause problems.

I do...! Let me try adding more cables to my modular power supply. Please stay tuned.
hernandito (Author) Posted December 18, 2023

Hi Guys,

I have a Supermicro XDP1 motherboard (dual Xeon CPUs). The board comes with three of the SAS connectors with the 4 SATA cables coming out. I also have a couple of Dell LSI controllers that have been flashed to IT mode. I have tried switching the SATA data cables around between the on-board connectors and the adapter cards, with the same results.

21 minutes ago, trurl said: Looks like you must have rebooted before getting diagnostics so we can't see anything about why the drive was disabled. Do you have diagnostics or at least a syslog from before reboot?

I did upload the diagnostics before my reboot. I think the server rebooted itself overnight, because this morning the one drive had the X and it was doing an unscheduled parity check.

How can I replicate this? Rebuild the one drive and wait?

Thank you guys!

H.
mathomas3 Posted December 18, 2023

11 minutes ago, hernandito said: I do...! Let me try adding more to my modular power supply. Please stay tuned.

Try this first, and if you have another drive go down, I would look into the PSU being the issue.

@itimpi Good question... I was going straight to the PSU; I hadn't thought to ask that.
hernandito (Author) Posted December 18, 2023

Just now, mathomas3 said: Try this first and if you have another drive go down, I would look into the PSU being the issue. @itimpi Good question... I was going straight to the PSU, I hadnt thought to ask that

I did switch the PSU when this started happening, but it did not fix it. Going to work on the cabling.
itimpi Posted December 18, 2023

8 minutes ago, hernandito said: I did switch PSU when this started happening... but it did not fix... Going to work on the cabling.

If you need to use power splitters, then I have found that Molex-to-SATA splitters are more reliable than SATA-to-SATA ones.
hernandito (Author) Posted December 18, 2023

I swapped the PSU back to my original, a 1600W Platinum-rated unit from EVGS (or is it SVGA). I am now using ONLY the SATA power connectors that came with the PSU. I have two SSDs that I did split off a Molex.

Rebuilding the now-missing drive. Wish me luck.
hernandito (Author) Posted January 4

It has been a little over 2 weeks and everything is running smoothly.

Thank you @itimpi and @mathomas3 for the solution!