Hard disks keep failing!! <SOLVED>


Hi All

 

I hope someone can help with this...

 

My faithful Unraid Server (4.7 Server Plus) has been working for a fair while now with few problems.

 

The other day I found one hard disk was showing as failed (2Tb Hitachi). No problem - I ordered a new one, and while waiting for it to arrive I removed the failed disk from the array under Devices and then re-added it. Unraid immediately spotted it, gave it a blue spot and tried to rebuild the array onto it. As it had shown as failed, I stopped the rebuild and fitted the new disk when it arrived. I ran a pre-clear on it, installed it in the array and let it rebuild.

Then a few days ago the new disk started showing as failed - it has a red spot against the drive. I noticed because I tried to access a file and it was missing. When I went into the Unraid GUI I could see a red spot against my new disk, and another disk was showing a fault (I can't remember what it showed). After rebooting, the array came back online with only the one failed disk and my data was accessible again.

This was a couple of weeks ago, but I have done nothing yet as I can't believe a new disk could fail so quickly, and 2Tb disks are not cheap enough to keep buying on a whim.

This new drive (5 weeks old) is showing as failed in slot 2, which is the same slot the original disk was in - do you think this could be a software problem?

 

I know it could be a genuine failure - but find it hard to believe.

 

Maybe I should do a pre-clear and rebuild into it again and see what happens?

 

Any thoughts gratefully received.

 

Thanks

Link to comment

A drive could fail at any time. It could be entirely coincidental. However, I would try switching data cables prior to the next rebuild. If it still shows as failed, switch power cables. If it fails again, switch data ports. This is all assuming the disk passes a long preclear run and SMART test.
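
As a rough sketch of how a SMART report can help separate a genuinely bad disk from a cabling problem: the Python below parses the attribute table printed by `smartctl -A /dev/sdX` and sorts a few nonzero raw counters into "media" (sector problems on the disk itself) and "link" (CRC errors, which usually point at the cable or connector). The attribute names are the ones smartctl normally prints; which ones to watch and the function names are my own assumptions, so treat this as illustrative only.

```python
# Attributes worth a look, grouped by what a nonzero raw value tends to suggest.
# This grouping is an assumption for illustration, not an official rule.
WATCHED = {
    "Reallocated_Sector_Ct": "media",
    "Current_Pending_Sector": "media",
    "Offline_Uncorrectable": "media",
    "UDMA_CRC_Error_Count": "link",
}


def parse_smart_attributes(report):
    """Return {attribute_name: raw_value} from `smartctl -A` text output."""
    attrs = {}
    for line in report.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        try:
            attrs[fields[1]] = int(fields[9])
        except ValueError:
            continue  # raw value is not a plain integer, skip it
    return attrs


def triage(report):
    """List watched attributes with a nonzero raw value, grouped by kind."""
    attrs = parse_smart_attributes(report)
    findings = {"media": [], "link": []}
    for name, kind in WATCHED.items():
        if attrs.get(name, 0) > 0:
            findings[kind].append(name)
    return findings
```

If only the "link" list is populated, swapping the data cable or port first (as suggested above) is the cheaper experiment; entries under "media" make a real disk fault more likely.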

Link to comment

What sort of hardware? Are you using backplanes? Could be a bad one.

 

Could also be a bad or loose SATA/power cable. Definitely check that.

Also, how many drives? An underpowered or failing PSU can cause odd disk drop-outs as well...

 

It's not impossible that 2 drives failed one after another, but not super likely.

 

How many pre-clear cycles? I've had a drive pass the first, but fail the second... Others have also seen this happen. I personally run pre clear three times for every new drive...

 

You might want to post SMART reports for the drives, as well as a syslog. It would help significantly. I'm not very proficient at reading them, but it would help one of the gurus identify the issue... :)

Link to comment

Thanks for the pointers - from a hardware point of view I am using a 5-bay IcyBox IB-555SSK, SAS/SATA II hard drive caddy which was bought brand new when I built the server about 2 years ago. The PSU was bought specifically for the project and is a Corsair CMPSU-450VX (450W) model which has worked brilliantly.

 

I did recently have a mainboard fail and have just replaced it with a Gigabyte P35C-DSR3 (in the last few weeks - I'm not sure if that was before the original disk failure).

The disks in the array at the moment are:

 

2Tb Parity Disk - Seagate ST32000542AS

Disk 1 - 1Tb Samsung - HD105SI

Disk 2 - 2Tb Western Digital WD20EARX - This is the slot that has failed twice quite quickly

Disk 3 - 1Tb Western Digital WD10EARS

Disk 4 - 1Tb Western Digital WD10EARS

Disk 5 - 2Tb Western Digital WD20EARS

 

I think I will shut it down, check that the cables haven't worked loose, run a 3-cycle pre-clear on disk 2, then re-introduce it to the array and see how it goes from there.

 

Any more help appreciated.

 

Thanks

Link to comment

Thanks Mr Hexen - I am not sure what HPA is, but as soon as the pre-clear is done I'll have a hunt in the BIOS to ensure it's turned off.

 

Seanant - originally the first disk to "fail" was my Hitachi 0F12117 which is only a year old... this I put down to a failure and replaced with the Western Digital WD20EARX. This one is obviously in slot 2 as it replaced the failed Hitachi... and I am just sceptical as it is showing as failed within a few weeks of being installed.

 

Having Googled HPA & Unraid I think Mr Hexen may be on to something - enough so that I am wondering whether I would be better off changing my board again for another make?

 

Thanks again

Link to comment

Thanks dgaschk - but I don't really want to have to replace the CPU and maybe the RAM, as I am using a socket 775 CPU.

I have just bought (on eBay) the Abit AB9 Pro mobo, as this seems to be well tested and I don't want the hassle and risk of the Gigabyte board screwing my data up. So hopefully in a few days (the preclear may have finished by then!) I will be able to swap the mobo out and see if things improve :)

Link to comment

You should be able to see if HPA is at work. First of all, you can see if it's enabled in the BIOS settings. Secondly, if it's active, one or more of your disks would be reporting a slightly smaller size than it should, because the HPA (Host Protected Area) hides a small region at the end of the disk.

 

But it sounds like you've already ordered another board, so it's probably moot at this point.

Link to comment

Hi Mr Hexen

 

I gotta say I love the Gigabyte board, but I'm not too sure how to turn off HPA. I have read a little and there seem to be many different pointers on how to disable it. If it is just a setting to turn off in the BIOS then I may stick with it and keep the AB9 as a spare. But I have also read that you have to reset partitions on the HDDs etc., which sounds like a burden.

 

The preclear finished on my "failed" 2Tb disk last night - 3 cycles all clear - so I reintroduced it to the array and it has rebuilt and I have all green lights - so I am sure what you are saying about HPA is correct :) Thanks!

 

If you can let me know how to disable the HPA I would be grateful, as I would like to keep the Gigabyte board if possible. Also, now I know about this, I guess I probably have another 2Tb disk which can be precleared and re-inserted in the array to increase my space :D

Thanks again!

Link to comment

Couple of points. If you do have HPA on one of your drives, you'll need to get rid of it, whether you stick with the Gigabyte board or not. If you switch mobos, it'll still be there. Also, have you confirmed you have it, as per an earlier post? (One drive will be listed in the GUI as a tiny bit smaller than other drives of the same size.)
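
To make that size check concrete, here is a minimal Python sketch. It assumes you have copied each drive's slot, nominal capacity and reported size out of the GUI by hand; the function name and the tuple layout are hypothetical, not anything Unraid provides.

```python
from collections import defaultdict


def find_short_drives(drives):
    """Flag drives that report less space than their same-capacity peers.

    drives: iterable of (slot, capacity_class, reported_size) tuples,
            e.g. ("disk2", "2Tb", 1953514522).
    Returns a list of (slot, capacity_class, shortfall) for each drive that
    is smaller than the largest drive in its capacity class.
    """
    by_class = defaultdict(list)
    for slot, capacity_class, size in drives:
        by_class[capacity_class].append((slot, size))

    suspects = []
    for capacity_class, entries in by_class.items():
        largest = max(size for _, size in entries)
        for slot, size in entries:
            if size < largest:
                suspects.append((slot, capacity_class, largest - size))
    return suspects
```

An empty result only means the drives agree with each other. A capacity class with a single drive has nothing to be compared against, so this check says nothing about it.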

 

Also, if you don't need the space, you might want to run a 3-cycle preclear on the extra drive and just keep it aside to swap in down the road when one of your drives fails (as they always eventually do). Either run it as a warm spare (plugged in, but not assigned), or leave it unattached and only hook it up when necessary. There are some great posts on the pros and cons of warm spares. Just do a quick forum search :).

Link to comment

Thanks DoeBoye

 

I'll check when I get home - but I did read something like this and I checked... all my 1Tb disks show identical sizes and all my 2Tb disks show identical sizes... so hopefully this will mean that there is no HPA partition (or suchlike) on any of the disks... is there any way to be sure?

 

I'll log in to the BIOS tonight too and see if I can find any settings relating to HPA which I can disable - anyone knowing where to find these please do let me know where they are on this board.

 

Also - great idea to pre-clear a 2Tb disk and hold it as a warm spare... especially as pre-clear takes 74 hours (ish) on a 2Tb disk!

 

Thanks again

Link to comment

Correct! If all your drives of a given capacity report the same size, you do not have HPA on any of them. My guess is it was probably already off in your BIOS by default :). I'm 99.99% sure that if HPA is enabled, your mobo writes the data to one of the drives the first time the system boots. As they are all the same size, no data was written at boot, therefore HPA is off for sure :).
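
For a more direct check than comparing sizes, `hdparm -N /dev/sdX` on a Linux system prints the visible and native sector counts for a drive. The Python sketch below only interprets that output. It assumes the usual `max sectors = visible/native` line format, so check it against what your own hdparm version prints. The function name is hypothetical.

```python
import re


def hpa_hidden_sectors(hdparm_output):
    """Return the number of sectors hidden by an HPA, or None if unparseable.

    Expects the text printed by `hdparm -N /dev/sdX`, which normally contains
    a line such as:  max sectors   = <visible>/<native>, HPA is enabled
    """
    match = re.search(r"max sectors\s*=\s*(\d+)/(\d+)", hdparm_output)
    if match is None:
        return None
    visible, native = int(match.group(1)), int(match.group(2))
    # A nonzero difference means part of the disk is hidden from the OS.
    return native - visible
```

A result of 0 means the whole disk is visible. Anything above 0 is the number of sectors being held back from the operating system.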

 

I can't claim the idea as mine, but it is certainly good practice to have a spare, warm or cold. It makes disaster recovery much less painful :).

Link to comment

HPA would not cause your drives to report as failed. I believe mr-hexen was merely letting you know that it could also be an issue with your system :).

A drive that reports as failed but is actually fine is usually suffering from a communication problem between the drive and Unraid. This is most commonly caused by loose or bad connections, underpowered or faulty power supplies, or bad backplanes or SATA ports. I'm not saying that's an exhaustive list, but it probably covers the majority of the possibilities.

Link to comment

OK - I have had a look in the BIOS and can't find any reference to HPA (unless it's under another term), and all the disks are reporting in Unraid as:

 

1Tb = 976,762,552

2Tb = 1,953,514,522

 

So it looks like no HPA here - but I did notice a setting in the BIOS for RAID, which is set to Disabled, and there is an AHCI option. I haven't changed this, but I remember reading a post which said that it should be set to AHCI - any thoughts?

 

Thanks

Link to comment

Thanks Jonathan / Mr Hexen... I had already spotted BIOS to HDD option which was disabled by default - so thanks for that.

 

On Kenokas advice, I enabled AHCI and promptly lost 4 drives from the array - under devices they were still listed, but unassigned.

 

Not wanting to lose data, I disabled it again to be safe, and the drives re-appeared in the array. If they are listed under devices and I re-assign them to the correct slots in the array, will Unraid try to do a rebuild/format and lose my data, or will it work OK?

 

Thanks

Link to comment

Thanks dgaschk

 

I was thinking about it on the way to work - as the disks are showing as unassigned rather than showing the disk as assigned and not recognised, I am sure that if I re-assign them to the same slots it should come back online and see everything. Once I assign them, if I do a manual parity check will it assume that the data disks are OK and rebuild the parity drive... or do I need to tell it to do this - obviously I am worried that it may assume that the data on the parity drive is correct and overwrite some data.

 

With the recent problems I have had a few parity checks done - the latest being yesterday, when it reported 4 errors.

I know I am getting off track slightly, but can anyone explain this (or tell me not to worry)? On a few occasions the unit has become unresponsive (not often, thankfully), meaning that the only way to reset it is to force a power-off from the power button. After such an event the unit starts a parity check, and sometimes (yesterday being one) it reports some errors after the check - last night it was 4 errors. Once it has identified these errors, what happens - does it assume that the parity drive is right and "correct" the data, or does it assume the data is right and "correct" the parity information? Or maybe it does nothing and waits for the user to tell it what to do from a command line? On the occasions when I have had errors it is typically quite a low number. I think the largest I have had is in the region of 100ish, several years ago, and so I have always ignored it!

 

Thanks again.

Link to comment
Quote: "I am worried that it may assume that the data on the parity drive is correct and overwrite some data. [...] Once it has identified these errors, what happens - does it assume that the parity drive is right and "correct" the data, or does it assume the data is right and "correct" the parity information?"
It is my understanding that the ONLY time parity takes precedence is when a write operation to a data drive has failed. In all other instances parity is changed to reflect what is on the data drives. There is a way to force that to change with the command line setinvalidslot or something like that, but that is almost never used.
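
To illustrate the "data wins" behaviour described above, here is a simplified Python model of single-disk XOR parity. It is not Unraid's actual code, and the function names are made up for the example. It only shows the idea: parity is the XOR of the data disks, a correcting check recomputes it from the data, and a missing disk can be rebuilt from parity plus the survivors.

```python
from functools import reduce


def compute_parity(blocks):
    """XOR equally sized blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def check_and_correct(data_blocks, parity):
    """Recompute parity from the data disks and count mismatching bytes.

    The data is treated as authoritative, so the returned parity is always
    the freshly computed one; the stored parity is only used for counting.
    """
    expected = compute_parity(data_blocks)
    errors = sum(1 for e, p in zip(expected, parity) if e != p)
    return expected, errors


def rebuild_missing(surviving_blocks, parity):
    """Reconstruct the one missing data block from the survivors and parity."""
    return compute_parity(list(surviving_blocks) + [parity])
```

In this model a hard power-off that leaves parity slightly stale shows up as a handful of mismatches on the next check, which are fixed by rewriting parity rather than by touching the data disks.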
Link to comment
