Tell it to me straight... How badly did I "F" my array?



As of right now the last drive has just finished rebuilding. I stopped the array and rebooted again to see whether any more drives would magically come up as dead, but everything came up OK. I also ran the "mover", which had been crashing my system. Everything seems OK now and I am not seeing any new errors on any of the disks, so I am thinking of running a parity check overnight. Should I give the system another night to see if it is stable before running another parity check?

 

Thanks everyone for all your help with this.

 

Drabert

 

Link to comment

If your system has more than one failed drive, then your data is likely already "hosed" to some extent. Note that if you have more than one bad drive, a rebuild will NOT be successful. I'm surprised that UnRAID didn't abort the rebuild ... although the status page you posted showing the rebuild did not indicate any other failed drives.
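To illustrate why a second bad drive defeats the rebuild, here is a minimal sketch that assumes a single parity disk computed as a bytewise XOR across the data disks. The disk contents and the `xor_blocks` helper are made up for the example; this is not UnRAID's actual code.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, one byte column at a time."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# Three hypothetical data disks, each holding one 4-byte block.
data = [b"\x01\x02\x03\x04", b"\x10\x20\x30\x40", b"\xaa\xbb\xcc\xdd"]
parity = xor_blocks(data)

# One disk lost (disk index 1): parity XOR the survivors recovers it exactly.
rebuilt = xor_blocks([parity, data[0], data[2]])
assert rebuilt == data[1]

# Two disks lost (indexes 0 and 1): all that can be recovered is the XOR of
# the two missing blocks, which does not determine either block on its own.
combined = xor_blocks([parity, data[2]])
assert combined == xor_blocks([data[0], data[1]])
```

With one unknown the equation has a unique solution; with two unknowns it does not, so a rebuild that runs while a second drive is returning bad data would write garbage to the rebuilt disk.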

 

I suspect that you may very well have a loose cable, or (as Joe L implied) perhaps a power issue during spinup [although I'd think a 750w PSU would not have any problem -- I have 14 drives with a 650w PSU that is more than adequate].

 

Since your array now shows everything is okay, one thing I'd do BEFORE you do any additional stressful activities (like a parity check or another rebuild) is to ensure you have current backups of your data.    This can, of course, take a very long time (days) if you haven't been maintaining current backups, but it can save you a LOT of headaches if something goes haywire on the server.

 

Link to comment

At this point I am thinking that either it does not like the Hitachi drives or it has something to do with the strip size when I changed the host out and subsequently the RAID card as well. If all the data is gone, it's not the end of the world. It's all media that I can put back on after the host is "normal" again. As of right now I have swapped out another drive and it should finish rebuilding in an hour or so. After that I will also replace disk #5 and hope that my disk errors go away.

 

The last parity check completed successfully, but like you are saying, if the data is bad, then I just have bad parity :)

 

Will keep you posted and hopefully will eventually find the source.

 

Drabert

Link to comment

Since you haven't mentioned it, and since there are a lot of errors in the text of the logfile, have you run a memory test since you saw the errors? My guess is that the drives and RAID card are physically fine, but your memory is not.

So, if you haven't done that yet, I'd first run a memory test for 24 hours to rule out the memory.

Link to comment

I did not try a memory test, but I started running one this morning since the system started showing errors again when my wife tried to play something from one of the shares. I was able to stop the array, and when I stopped it, it showed that all the drives were "Missing" and somehow they were all only 500GB drives instead of 2TB. I will let the memory test run today to see if it finds any errors. If it doesn't, maybe I should find another RAID card that I can test with to see if that is causing the issue. I might also try swapping out the last few Hitachi drives, since the syslog file showed that disk1 (Hitachi) was the one having the problem while the array was having the issue.

 

Attached is a new log file from this morning.

 

Drabert

syslog-2013-01-26.zip

Link to comment

If memory is bad, or incorrectly configured, all other tests are meaningless.

 

Most motherboards set the memory voltage, clock speed and timing correctly.  Some do not.

 

It might be that your memory is fine, but the motherboard is not configured correctly for your specific make/model of RAM strips.

 

Perform a memory test, preferably overnight.  It will display the memory configuration.  Confirm setup is what your memory manufacturer recommends.

 

Joe L.

Link to comment

Just as an FYI. I have seen 2 different cases posted here of high wattage Antec power supplies causing random sectors pending reallocation on drives. Changing the supply fixed the issues. So, don't count out the supplies just because the specifications are good. This sounds like an old server box that was built to run a bunch of drives so they shouldn't be the issue but you never know, especially with older stuff.

 

At any rate, have you ever connected any of these "failed" drives to another system and done read-write testing on them to see if they have actually failed or not?

Link to comment

I'm pretty sure it has nothing to do with the power supplies. This is not a DIY case, it's an R515 from Dell with dual power supplies. I ran the memory check and had to stop it at 93%, but I will run it again tonight. There were no errors to report, but it's taking a while to get through the check due to the amount of memory in this host. Maybe I should pull some DIMMs out since unRaid does not need 128GB for any reason, I'm just too lazy to pull them out.

 

The only things I have changed in this host are the drives and the RAID card, everything else is stock from Dell. I will replace drive 1 since that is the one that failed this morning, and I will probably keep going until there are no Hitachi drives left. So far it does not seem to have affected the data. Every file I have tried to access after the drive replacements has been OK, but obviously I cannot check every file by hand.
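Checking every file by hand is not practical, but a script can at least read each one from start to finish and report the ones that fail. A rough sketch, assuming Python is available on a machine that can see the shares (the `find_unreadable` name is made up for this example). It only catches files the system refuses to read; it will not notice data that reads back fine but is silently wrong, which would need checksums taken while the data was known to be good.

```python
import os

def find_unreadable(root, chunk_size=1024 * 1024):
    """Read every file under root in full; return (path, error) for failures."""
    failures = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                with open(path, "rb") as handle:
                    # Read in chunks so large media files do not fill memory.
                    while handle.read(chunk_size):
                        pass
            except OSError as exc:
                failures.append((path, str(exc)))
    return failures
```

Calling something like `find_unreadable("/mnt/user")` (the path is an assumption, substitute wherever the shares are mounted) returns a list of the files that could not be read. Keep in mind that this reads every disk end to end, so it is a stressful activity in the same way a parity check is.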

 

Can anyone recommend a RAID card with two SAS cable connections? That could be the other cause of my problems, since I only have one of the backplane cables in use at this time because the RAID card I am using only has one slot. I will try disabling the second connection tomorrow to see if that resolves anything.

 

Thanks guys

 

Drabert

Link to comment
