
HDDs dropping off the system...



Yes, the randomness of the issues does tend to indicate unstable power.

 

So now you can decide whether to use the hot-swaps or the CoolerMasters  :)

 

If you're really ambitious, you can mount the drives in the hot swaps; do a parity check & note how high the temps get after a few hours;  then repeat the process with the CoolerMaster cages.    There's NO doubt the CoolerMasters will keep the drives cooler -- but just how much depends on a variety of factors.
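One way to make that comparison concrete is to record each drive's temperature during both parity checks and compare the peaks. The following is a minimal sketch, assuming the readings come from `smartctl -A` output (attribute 194, Temperature_Celsius). The sample row and the readings are made up for illustration, and the helper names are hypothetical.

```python
def parse_temperature(smartctl_output):
    """Return the raw Temperature_Celsius value from `smartctl -A` text, or None."""
    for line in smartctl_output.splitlines():
        fields = line.split()
        # Attribute rows: ID NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW...
        if len(fields) >= 10 and fields[0] == "194":
            return int(fields[9])
    return None


def peak_by_cage(readings):
    """Map each cage name to the highest temperature seen in its readings."""
    return {cage: max(temps) for cage, temps in readings.items()}


# Illustrative row in the usual smartctl layout (not taken from this thread).
sample = ("194 Temperature_Celsius     0x0022   038   045   000    "
          "Old_age   Always       -       38 (0 20 0 0 0)")
print(parse_temperature(sample))

# Hypothetical readings in degrees C, sampled over a parity check in each cage.
readings = {"hot-swap": [34, 39, 42], "CoolerMaster": [31, 34, 36]}
print(peak_by_cage(readings))
```

Sampling every few minutes over the whole parity check matters more than the exact tooling, since the peak usually shows up a few hours in.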

 

But it does seem very likely that power stability has been the issue all along ... although the 48-hr test you ran didn't show any signs of that.    Note that another possibility is SUPPLY instability ... i.e. variation in your line voltage that's causing fluctuations in the PSU output.  But if that's the case, that will certainly be resolved by a UPS with AVR.

 

Link to comment

Note that another possibility is SUPPLY instability ... i.e. variation in your line voltage that's causing fluctuations in the PSU output.  But if that's the case, that will certainly be resolved by a UPS with AVR.

 

How likely is this? I did wire the plug up myself... Heh. I guess I could test pretty easily by taking it to a different outlet. That would be pretty embarrassing. I'll check it with a multimeter but seems like it's kinda hard to mess up.

Link to comment

I wasn't referring to the plug (although if poorly wired it's possible) ... I was referring to the local utility company's stability -- some utilities are subject to a lot of power fluctuations, brownouts, etc. 

 

Whew! I think with the area I'm in, it shouldn't be an issue. I'd be pretty darned blown away (and eventually out a ton of money) if that was, in fact, the problem! :)

Link to comment

I would say that 850W (70A) would be way overkill for 12, mainly green, drives, but a Seasonic X is an excellent choice.

 

I don't disagree! Mostly I'm upgrading to a known-quality PSU manufacturer and/or eliminating the PSU from the troubleshooting process. Worst case scenario, I send it back when I find out that's not the issue. Best case scenario, I'm also future-proofing. ;)
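The "850W (70A)" figure can be sanity-checked with some quick arithmetic. This is a rough sketch, assuming each drive draws about 2A on the 12V rail at spin-up. That per-drive number is a ballpark assumption, not a figure from this thread, so check the drive datasheets before relying on it.

```python
RAIL_VOLTS = 12.0


def rail_amps(watts, volts=RAIL_VOLTS):
    """Current available if the full wattage were delivered on one rail."""
    return watts / volts


def spinup_amps(drive_count, amps_per_drive=2.0):
    """Worst case with every drive spinning up at once (amps_per_drive is assumed)."""
    return drive_count * amps_per_drive


available = rail_amps(850)
needed = spinup_amps(12)
print(f"12V capacity ~{available:.1f}A, 12-drive spin-up ~{needed:.1f}A")
print(f"headroom ~{available - needed:.1f}A before motherboard, CPU, and fans")
```

Under that assumption the drives need roughly a third of what the supply can deliver, which is consistent with calling it overkill.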

Link to comment

Well... parity rebuild completed but disk10 had a few read errors during the process. After the rebuild completed (see syslog below) I got a write error and then bam, disk dropped out. Disk10 is an older disk so (and I'm probably being naive here) it's possible THAT disk is actually having problems. I'm going to try to run a few smart tests against it to see what's up and I'll paste those in here once I have them.

 

Syslog is here: http://pastebin.com/1t736wQA

 

 

(Edit: As an aside, I should probably move my VM datastore off of this array until it's stable. The process of shutting down 8 VMs, working on the array, refreshing VM status, and starting them back up again just adds to my headaches! ...not to mention when I forget to stop the VMs and then ESX freaks out. Ha!)

Link to comment

Here is the SMART status report: http://pastebin.com/MKGgTZya

 

I guess the following items concern me (but I know nothing, at the moment, about SMART status reports):

  1 Raw_Read_Error_Rate     0x000f   113   097   006    Pre-fail  Always       -       103272373
...
  7 Seek_Error_Rate         0x000f   085   060   030    Pre-fail  Always       -       383254921

 

Thanks for the AWESOME community support! You guys rock.

 

 

EDIT: Short offline SMART Self-test "Completed without error".  Running long test now. Will report back in an hour...

Link to comment

I haven't looked at the SMART report yet, but with those numbers on those attributes I'm guessing it's a Seagate drive. If it is, those numbers don't mean anything. I have a couple of Seagate drives and their raw values are extremely high compared to my others. From what I have read, manufacturers have different methods and tolerances for when these attributes log errors.

 

 

Edit - Just looked at the SMART report and it is indeed a Seagate drive. When I get a chance I'll look at my Seagate SMART reports and see if they follow. I know the two you posted follow mine, but I am unsure of this one (195 Hardware_ECC_Recovered  0x001a  080  069  000    Old_age  Always      -      242506871). Perhaps someone will chime in to verify that this is also ok before I get time to look at mine.
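To make the point about the Seagate numbers concrete: smartctl prints a normalized VALUE and a THRESH for each attribute, and an attribute only counts as failed when VALUE is at or below THRESH. For Raw_Read_Error_Rate and Seek_Error_Rate on Seagate drives, the raw number is commonly read as a 48-bit field with the error count in the upper 16 bits and the operation count in the lower 32 bits. That split is a widely repeated community convention, not something confirmed in this thread, so treat the sketch below as an assumption. The function names are hypothetical.

```python
def parse_attribute(line):
    """Pull the fields used below out of one smartctl attribute row."""
    f = line.split()
    return {"id": int(f[0]), "name": f[1], "value": int(f[3]),
            "worst": int(f[4]), "thresh": int(f[5]), "raw": int(f[9])}


def is_failing(attr):
    """An attribute has failed when its normalized value is at or below the threshold."""
    return attr["value"] <= attr["thresh"]


def split_seagate_raw(raw):
    """Assumed layout: upper 16 bits = errors, lower 32 bits = operations."""
    return raw >> 32, raw & 0xFFFFFFFF


# The three rows quoted in this thread.
rows = [
    "1 Raw_Read_Error_Rate     0x000f   113   097   006    Pre-fail  Always       -       103272373",
    "7 Seek_Error_Rate         0x000f   085   060   030    Pre-fail  Always       -       383254921",
    "195 Hardware_ECC_Recovered  0x001a  080  069  000    Old_age  Always      -      242506871",
]

for row in rows:
    attr = parse_attribute(row)
    print(attr["name"], "failing" if is_failing(attr) else "ok")
    if attr["id"] in (1, 7):  # only apply the assumed split to these two attributes
        errors, operations = split_seagate_raw(attr["raw"])
        print("  assumed errors:", errors, "operations:", operations)
```

All three normalized values sit above their thresholds, and under the assumed layout both raw values are below 2^32, so the error portion is zero.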

Link to comment

I haven't looked at the SMART report yet, but with those numbers on those attributes I'm guessing it's a Seagate drive. If it is, those numbers don't mean anything. I have a couple of Seagate drives and their raw values are extremely high compared to my others. From what I have read, manufacturers have different methods and tolerances for when these attributes log errors.

 

Thanks! Yes, it is a Seagate.

Link to comment

Well... Extended offline SMART Self-test completed without error so I guess that's really crappy news. Drive seems like it's okay... I guess I'm out of ideas. A couple of days ago I switched my BIOS' SATA Mode Selection from IDE to AHCI but that clearly didn't make any difference... so what now? Swap out motherboard and CPU and get new PCIe HDD controllers? Ugh. Thanks in advance.

-Dan

Link to comment

Very frustrating results.    Given the number of random issues you've had, it's very difficult to point a finger at any one cause.    And testing, of course, is a real PITA ... not because it's difficult -- takes just a few minutes of YOUR time -- but because it takes MANY hours of elapsed time while the system runs.

 

There are two things I can think of to try now:

 

(a)  On the off chance that this is actually a real disk error, you could try rebuilding without Disk #10 in the array.

 

(b)  Boot to MemTest and let it run for at least 24 hrs ... preferably even longer.  Since your system has unbuffered RAM, a rare bit-error would NOT be corrected, and could cause a variety of issues.    I have an HTPC that had random drop-outs perhaps once/month - ended up being a memory failure, but it passed MemTest for hours before it finally encountered an error.  Changed the RAM -- and that PC has been running nonstop for over 2 years since.

 

Link to comment

Thanks Gary... I'm sure I'll eventually get to that (memtest) point. I'm definitely willing to try anything but I'm skeptical of it being a memory issue because those sticks of memory ran in a separate box for 2 years prior to being brought over to my unRAID setup. Anything is possible though! ;)

 

After running several SMART tests on the (previously) failed disk, the array is back up. Currently, I'm running a parity check and no errors so far. Of course, this means nothing at this point in the game but I'm curious if it will complete without any errors. If that happens, I'll start using it heavily again and see if the problem re-surfaces with a different disk. Then I'll go the memtest route. After that, I'll look at a new motherboard (off of the approved list). Next, new controllers (also off of the approved list). Finally, if I'm still having issues, I'll roof test the whole blasted thing. :)

Link to comment

Since your system has unbuffered RAM, a rare bit-error would NOT be corrected, and could cause a variety of issues.
The term you are looking for is ECC, not buffered vs unbuffered. Typically ECC is buffered, and standard RAM is unbuffered, but it's not the buffered part that provides the error correcting.

http://en.wikipedia.org/wiki/ECC_memory#Registered_memory

 

I wasn't "looking for it" ... but did forget to note "unbuffered, non-ECC".  It's true, of course, that ECC is far more prevalent in registered/buffered modules, but there's no technical reason for that.  Buffering and ECC address two different issues:  bus loading and error correction.  I've used ECC modules whenever possible since the mid-1970's  :)

Link to comment

If your server was in IDE Emulation mode then a failure on one channel of a master/slave pair could lock up the other too.  Leave the BIOS set to AHCI mode.  It will make analysis of any failure easier.

 

So it will be done! Thanks for the heads up.

 

Unrelated: I have noticed that my system takes about 2-3x the time to boot since switching to AHCI mode... but that could be attributed to something else. If I ever get this sucker stable I'll look into that... as you can imagine, boot time isn't my top priority at the moment! Haha.

Link to comment

It's not uncommon for the AHCI discovery process to take longer than legacy IDE mode.    As long as it's still booting, you're fine  :)    [I've seen a few systems that will simply hang when switched to AHCI ... some older SATA drives simply aren't compatible with certain controllers in AHCI mode.]

 

Link to comment

Well, now it gets interesting... Disk 10 (sdj) failed (again) as soon as I tried to make changes to the files that live on that drive... so this is promising to me that it's the same disk that failed. That makes me happy. IIRC, this drive has a slightly mangled SATA data connector... luckily, I have a spare that I can swap out when I get home tonight. More later!!

 

(Makes me wish I had kept my hot-swap bays in there though... Oh, and I always love it when troubleshooting and you find out that it's actually multiple issues. Just compounds the required effort!  ;D)

Link to comment

Yes, multiple issues are a HUGE PITA !!

 

"... a slightly mangled SATA data connector ..."  is certainly a good reason to replace it !!  :)

 

You MAY just be on the edge of actually having this all resolved  8)

 

If you replace the connector, and still get a Disk 10 failure, then I'd either replace Disk 10, or rebuild the array without it.

 

 

Link to comment

...  Makes me wish I had kept my hot-swap bays in there though...

 

Well, you could always go back  :) :)

... however, before doing so (if you're really so inclined), I'd do the experiment I noted in post #50

 

Once (if?) it all gets fixed I think I am going to go back. Mostly just because I can return these CoolerMasters but would have to go through the trouble of selling the Thermaltakes on eBay. Call me lazy! I'll definitely get drive temp numbers though and hopefully write up a review on the Thermaltakes.

 

Oh, and I misspoke earlier, the SATA connection that was mangled was on the drive side so I went ahead and replaced it with another drive I have... it's pre-clearing at the moment. SMART status report shows that this drive has 38972 Power_On_Hours! I think I'll be shopping for a replacement soon. :)

Link to comment
