Drives Keep Failing - SOLVED


Solved by itimpi


Hi Team,

 

I am at my wit's end.

 

Over the past month, a good chunk of my drives have been failing one after another. Some drives are new, some are old. I know this is not typical, and something else is wrong.

 

Typically, one or two drives get the red X. On another Unix machine, I have somehow been able to get them working again, either by running xfs_repair or by mounting and remounting. I have shuffled my data around so much into spare new drives. I have also managed to shrink my array by copying data from old, smaller drives onto new ones.
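(For anyone curious, the recovery attempts on the other machine look roughly like this; /dev/sdX1 is just a placeholder for whatever the drive's data partition shows up as over there:)

    # read-only check first, to see what xfs_repair thinks is wrong
    xfs_repair -n /dev/sdX1
    # actual repair pass
    xfs_repair /dev/sdX1
    # if it complains about a dirty log, mounting and unmounting once replays the log
    mkdir -p /mnt/test
    mount /dev/sdX1 /mnt/test
    umount /mnt/test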

 

Thinking this could be a symptom of a bad power supply, I replaced it with a more powerful one (1800W). No luck. This past weekend, I managed to complete a new configuration with a full parity rebuild.

 

This morning I woke up to another red X. I powered off the server, took the drive out, put it in a separate PC, and tried to mount it using PartedMagic, with no luck. I ran xfs_repair; it started with the attempt to find the secondary superblock, where you get all the sequential dots. That has never fixed anything in the past, so I canceled. I put the drive back in unRAID, and the X is still there; the disk shows as "Not Installed". But it does show up in Unassigned Devices, and I am able to mount it and read its contents. This is what has been happening for weeks: I copy the data to another drive and try to rebuild everything again, and it keeps happening.
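(When a red X shows up I have started checking the drive itself before touching anything, roughly like this; sdX is a placeholder for whatever the affected disk is:)

    # SMART health summary and full attribute dump
    smartctl -H /dev/sdX
    smartctl -a /dev/sdX
    # scan the tail of the syslog for link resets or errors on that disk
    grep -iE 'sdX|ata[0-9]+|error' /var/log/syslog | tail -n 50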

 

Any thoughts on what could be wrong? I am attaching two diagnostics: one from this morning when I woke up, and a second from right now, where the failed drive is mounted but still shows as "Not Installed".
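(Both zips were grabbed the usual way; if I remember right, the same thing can be done from the terminal and the zip ends up on the flash drive:)

    # unRAID's built-in diagnostics collector; the zip should land in /boot/logs
    diagnostics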

 

I really need some help.

 

Thank you!

 

H.

 

tower-diagnostics-20231218-0606.zip tower-diagnostics-20231218-0927.zip


I haven't looked at the diags, but if you are using SATA drives, that could be the problem... I did run them when I first started, but after having a good number of those drives fail, I moved on to using SAS...

 

My typical SATA drive failures happened every two months, running 8 drives at the time.

 

Since moving over to SAS, I might have one drive fail a year using renewed 8TB drives, and that's running 20 drives.

 

Something to consider

7 minutes ago, mathomas3 said:

I haven't looked at the diags, but if you are using SATA drives, that could be the problem... I did run them when I first started, but after having a good number of those drives fail, I moved on to using SAS...

 

My typical SATA drive failures happened every two months, running 8 drives at the time.

 

Since moving over to SAS, I might have one drive fail a year using renewed 8TB drives, and that's running 20 drives.

 

Something to consider

Thanks @mathomas3. I don't think this is related to SATA vs. SAS... and I don't think the drives are actually failing. Making the switch would be too expensive, and I have been using SATA drives on my server for over 10 years. I am retiring some Hitachi 4TB drives even though they are running smoothly.

3 minutes ago, Lolight said:

Yeah, it's an outdated but still surprisingly common misconception that an HDD's build quality, and ultimately its reliability, can be judged by its communication interface.

I would disagree. I have stacks of then-new 3, 5, and 6TB green/red/black drives that failed regularly... Since using SAS (retired data center) drives, my failure rate has dropped by a lot, and maybe that's the thing: SAS drives are intended for data centers, and thus the build quality is better? Any way you put it, it's just my observation...

 

Back to the OP: are you using a SATA controller card, given the number of drives you are running?

4 minutes ago, mathomas3 said:

SAS drives are intended for data centers, and thus the build quality is better? Any way you put it, it's just my observation...

I really don't want to get into an argument, since it will lead to this thread being hijacked by an unrelated topic...

I'd suggest creating another thread if you'd like to discuss it.


Hi Guys,

 

I have a Supermicro XDP1 motherboard (dual Xeon CPUs). The board comes with three onboard SAS connectors, each with a breakout to four SATA cables.

[Image: onboard SAS connectors with SATA breakout cables]

I also have a couple of Dell LSI controllers that have been flashed to IT mode. I have tried switching the SATA data cables around between the onboard ports and the adapter cards, with the same results.
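(To keep track of which drive sits on which controller while I swap cables, I have been using something like this; the exact output obviously depends on the controllers:)

    # show which controller/port each disk hangs off
    ls -l /dev/disk/by-path/
    # list model and serial numbers so drives can be matched up after a swap
    lsblk -o NAME,SIZE,MODEL,SERIAL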

 

21 minutes ago, trurl said:

Looks like you must have rebooted before getting diagnostics so we can't see anything about why the drive was disabled. Do you have diagnostics or at least a syslog from before reboot?

I did upload the diagnostics from before my reboot. I think the server rebooted itself overnight, because this morning the one drive had the X and it was running an unscheduled parity check. How can I replicate this? Rebuild the one drive and wait?
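In the meantime I will try to make sure the log survives the next event. My rough plan (assuming /boot/logs is the right spot on the flash drive) is:

    # copy the live syslog to the flash drive before any reboot so it survives
    cp /var/log/syslog /boot/logs/syslog-$(date +%Y%m%d-%H%M).txt
    # or just watch it live while waiting for the next failure
    tail -f /var/log/syslog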

 

Thank you guys!

 

H.

Just now, mathomas3 said:

Try this first, and if you have another drive go down, I would look into the PSU being the issue. @itimpi Good question... I was going straight to the PSU; I hadn't thought to ask that :)

 

I did switch the PSU when this started happening, but it did not fix it... Going to work on the cabling.
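While I am at it, I will keep an eye on the CRC counters, since I understand those usually point at cables or backplanes rather than the drives themselves. Something along these lines (rough loop over /dev/sda through /dev/sdz):

    # UDMA CRC error counts; rising numbers tend to mean cabling, not a dying drive
    for d in /dev/sd?; do
        echo "== $d"
        smartctl -A "$d" | grep -i crc
    done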

  • 3 weeks later...
  • hernandito changed the title to Drives Keep Failing - SOLVED
