Four Drive Failures in 30 days

LumpyCustard · April 30, 2022

Hi,

I'm pulling my hair out trying to figure out if something is wrong or if my luck is complete garbage at the moment.

In the span of 30 days i've had 4 disks fail in my array. Every single failure has been attributed to write errors.

So far i've had a mixture of HGST, Toshiba, Seagate, and Western Digital drives fail.

I have used this opportunity to replace each disk with brand new 12TB drives. The array rebuilds fine and everything works as normal then all of a sudden i'm met with another error. Tonight i was browsing community apps, with the server doing nothing strenuous, and an alert popped up.

The last time this happened i turned off my server, reseated all cables, added the drive back in, and it instantly failed.

The drives that have failed are connected to different HBA's using different SFF-8087 to SATA cables. This isn't the result of a single HBA.

My server is connected to a UPS, and hasn't experienced any improper shutdowns or hardware failures (aside from HDD's) for over a year.

I am at a loss here and having spent $1,720AUD on HDD's in the last 30 days i want to figure this out before i relent and buy another drive.

I have attached my diagnostics. Any advice or input would be sincerely appreciated.

System specs:

CPU: Ryzen 3600X

HBA: Dell Perc H310 (these cards were salvaged from dell servers and flashed myself. the HBA's are also modded with 40mm fans to keep them cool)

Mobo: Gigabyte B450 AORUS ELITE

RAM: 32GB DDR4 (non ECC)

Disks: See attached screenshot for drive list.

PSU: 850w seasonic

If anyone can provide any assistance, clarity, or advice, it would be sincerely appreciated.

devoraid-diagnostics-20220501-0105.zip

Edited April 30, 2022 by LumpyCustard

itimpi · April 30, 2022

Are you sure you do not have any power related issues?

Frank1940 · April 30, 2022

1 hour ago, LumpyCustard said:

I am at a loss here and having spent $1,720AUD on HDD's in the last 30 days i want to figure this out before i relent and buy another drive.

If you still have any of these drives, I would like to suggest that you test them using either Preclear (Docker or plugin on your Unraid system) or the Hard Drive manufacturer's Diagnostic Tool on your PC. There are a lot of reason why hard disks end up being 'disabled' and a high percentage of those causes are not the result of defective/bad disks.

LumpyCustard · April 30, 2022

28 minutes ago, itimpi said:

Are you sure you do not have any power related issues?

Do you suppose that I might have too many drives on a single SATA power rail?

If that were the case wouldn’t the server become very unstable during the multiple 24+ hour rebuilds that have taken place 4 times this month?

8 minutes ago, Frank1940 said:

If you still have any of these drives, I would like to suggest that you test them using either Preclear (Docker or plugin on your Unraid system) or the Hard Drive manufacturer's Diagnostic Tool on your PC. There are a lot of reason why hard disks end up being 'disabled' and a high percentage of those causes are not the result of defective/bad disks.

Is it acceptable to test these drives using a powered USB-C HDD caddy plugged into my server via unassigned devices?

My Define 7XL is completely full and I have no means of mounting these HDD’s for testing.

Edited April 30, 2022 by LumpyCustard

trurl · April 30, 2022

So you have been buying new drives when it wasn't clear there was any actual drive problem? Did any disks have SMART warnings on the Dashboard page?

A disk gets disabled when a write to it fails for ANY reason. This is because the failed write makes it out-of-sync with the array, since parity is still updated. After a disk is disabled, it is emulated by the rest of the array from the parity calculation and the disk itself isn't used. Writes are emulated by updating parity. The initial failed write and any subsequent writes to the emulated disk can be recovered by rebuilding.

57 minutes ago, LumpyCustard said:

Do you suppose that I might have too many drives on a single SATA power rail?

How many? Any splitters? Bad connections, whether power or SATA, are by far a more common reason for disabled disks than actual disk problems.

58 minutes ago, LumpyCustard said:

test these drives using a powered USB-C HDD caddy plugged into my server via unassigned devices?

Depends on the specific hardware. Some won't pass SMART information, which is the first thing you should look at before even running a test.

Do you have another computer you could test them in?

Frank1940 · April 30, 2022

2 hours ago, LumpyCustard said:

If that were the case wouldn’t the server become very unstable during the multiple 24+ hour rebuilds that have taken place 4 times this month?

Actually, it is the current surge that is required when the drives start up that is the problem. The peak current can be as high as three amperes on the +12V buss. The running current once the drive has reached steady state is a fraction of one ampere. Problem is sometimes the IxR drop power supply connectors if splitters are used

LumpyCustard · May 1, 2022

8 hours ago, trurl said:

So you have been buying new drives when it wasn't clear there was any actual drive problem? Did any disks have SMART warnings on the Dashboard page?

Shut down server.

Reseat cables.

Boot up server.

Assign disk into array.

Start rebuild.

Rebuild fails in 10 seconds with write errors.

So then i buy a new disk, remove the old one, insert the new one (using the same power and sata cable that the bad disk used) and successfully rebuild the array and the server runs for a few more weeks without issue... hence why i just accept the fact that the disk has gone bad.

8 hours ago, trurl said:

How many? Any splitters? Bad connections, whether power or SATA, are by far a more common reason for disabled disks than actual disk problems.

14 disks on two power rails using Silverstone SATA splitters: https://www.mwave.com.au/product/silverstone-4in1-sata-power-connectors-sstcp06-ab54614

8 hours ago, trurl said:

Depends on the specific hardware. Some won't pass SMART information, which is the first thing you should look at before even running a test.

Do you have another computer you could test them in?

Would it be worthwhile to boot my server into a PE environment like Hirens and run diagnostics on the disk while it's plugged into my current server?

5 hours ago, Frank1940 said:

Actually, it is the current surge that is required when the drives start up that is the problem. The peak current can be as high as three amperes on the +12V buss. The running current once the drive has reached steady state is a fraction of one ampere. Problem is sometimes the IxR drop power supply connectors if splitters are used

That makes sense, i have read about the initial power surge associated with spinning up so many disks at once.

Why would this issue rear its head after the server has been successfully turned on and the array started?

You can see in the screenshot in my first post that the server had been up for 3 days 16 hours.

I am in the process of re-cabling my server right now to try and split up the disks across as many power rails as possible.

Edited May 1, 2022 by LumpyCustard

LumpyCustard · May 1, 2022

Alright i've disconnected all HDD's, connected the "bad" HDD and booted up into Hirens to check SMART info. Unfortunately i can't upload screenshots since the server is running in a PE environment so please excuse the pictures.

I've run a short self test and it passed, and all SMART info seems to be fine, aside from it just being an old drive with a lot of power on hours.

Edited May 1, 2022 by LumpyCustard

LumpyCustard · May 1, 2022

Alright after testing for a while and the drive passing all tests, as well as testing the exact sectors that unraid flagged, i've decided to replace the PSU in my server.

I'm getting a Corsair 750m with 14x SATA connectors as opposed to my current PSU that has 8 that are being split over 16 drives.

I'm headed to the computer shop now and will install and test by tonight to rule this issue out once and for all.

JonathanM · May 1, 2022

Make sure you remove and replace ALL the power cables. Modular power supplies can have cables that look identical but are deadly to equipment if used with the wrong PSU box.

LumpyCustard · May 1, 2022

Yes absolutely, can't mix and match modular cables unless you want a fire. The Corsair PSU comes with 14 connectors out of the box.

LumpyCustard · May 1, 2022

$280, a new PSU and CPU cooler later i've reinstalled everything.

I've connected 4 or 5 HDD's per cable without the need for splitters except for the last 4 HDD's which i've used a splitter to connect 2 SSD's. The power draw of the SSD's should be negligible.

Server is rebuilding using the "failed" disk now, will report back with results.

Edited May 1, 2022 by LumpyCustard

LumpyCustard · May 2, 2022

Final update, after rebuilding overnight the "failed" disk has passed and the server is now functional.

As mentioned above i figured that if there were power issues (too much draw), the problem would show itself at startup when all the disks attempt to spin up, rather than manifesting in drive failures a week after the server had been booted.

Regardless i'll monitor and report back. Thanks for the input.

Edited May 2, 2022 by LumpyCustard

Four Drive Failures in 30 days

Recommended Posts

LumpyCustard

Link to comment

itimpi

Link to comment

Frank1940

Link to comment

LumpyCustard

Link to comment

trurl

Link to comment

Frank1940

Link to comment

LumpyCustard

Link to comment

LumpyCustard

Link to comment

LumpyCustard

Link to comment

JonathanM

Link to comment

LumpyCustard

Link to comment

LumpyCustard

Link to comment

LumpyCustard

Link to comment

Join the conversation