[Solved] Unraid Horror Story (not Unraid's fault!)



I have a serious issue...

 

I have an Unraid server running 5.0 RC11 with 5 WD Green 3TB drives and 4 WD Green 2TB drives. Having nearly maxed out the case I was using, I purchased a Norco RPC-4224 and decided to step up the power supply to a Corsair AX860, so that as I continue to grow I know I will have the power to support it.

 

Yesterday I cleanly shut down Unraid and moved my hardware over to the Norco case. As I first powered up I heard a pop and noticed a funky smell, which usually tells me a power supply has died (even though this one was brand new). However, the system still booted up fine, so I ignored it for the moment.

 

Since the Norco backplanes use SFF-8087 to SFF-8087 cabling, I had bought a second SuperMicro AOC-SAS2LP-MV8 controller, which I installed as I rebuilt, planning to skip the motherboard SATA connections entirely. I also added an extra WD Green 3TB drive for data and a WD Black 1TB drive for cache.

 

As I powered up I was only seeing some of the drives (6 of 11, or 5 of 11; it would vary). After several hours of testing different drives in different bays (thinking it was a backplane issue), I started trying to mount the drives via external USB on a Windows PC to make sure I could see the drives in question. In Windows the drives were listed as Disk 1 "Unknown" - Not initialized, which is odd. I know I can't read the Unraid data, but the drives should still show up with the proper size.

 

I ended up abandoning the Norco case and putting everything back into my old case, but continued to have issues. Of the original 9 Unraid drives I could only see 4 when starting up the system (via BIOS & SAS2LP), and via Unraid. I started to freak out, realizing I may have lost 3x 2TB drives and 2x 3TB drives of data (of course my parity drive was fine, though I would have happily traded it in for a drive with actual data).

 

Having resigned myself to the loss of 2/3 of my data, I moved everything back to the Norco case. Unfortunately the Corsair AX860 wouldn't power up at all, so I had to put in my old power supply for now. Amazingly, 2 of the faulty 2TB drives are showing up again in Unraid, but I still have 3 failed drives, which means I am still somewhat screwed; however, I am only missing 8TB of data instead of 12TB (all drives except the newest are 100% full).

 

My best guess is that the power supply "pop" toasted the drives in question, although the Norco backplane they were connected to is fine (I can attach other drives to the bays without issue). It also only hit some drives, even though all were connected at the time, which is very weird.

 

Since I have 3 failed/missing drives, there is no way to bring Unraid back online as it stands. I know there is a way to wipe out the drive configuration/parity and build a new Unraid array with my 6 working drives, but I want to make sure I follow a proper procedure to ensure I can access the data on my existing drives.

 

Sorry for the long-winded rant, but it's been a stressful 2 days, and while I've resigned myself to a lot of data loss, I want to make sure I don't do anything to make it worse.

 

Can someone please advise how best to bring up my server at this point while maintaining my data?

 

Thanks

Bill

Link to comment

I would first 100% confirm that the drives you think are dead *are* actually dead.  Perhaps use either a method of reading reiserfs in Windows, or boot up off a Linux Live CD and see what you can see.
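
For example, once booted from the live CD, something like this (run as root; the device name is just an example) will show whether the drives enumerate at all:

    fdisk -l                  # lists every disk the kernel can see, with sizes and partition tables
    dmesg | grep -i sata      # shows whether the SATA links came up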

 

Beyond that, you will need a Joe L. or similar to chime in and assist.  Good luck, and sorry to hear about your issues.  That sucks!

Link to comment

The fact that the drives in question don't show up during boot via the BIOS or the SATA controller (there are just gaps for the ports they are plugged into) makes me believe there is nothing else I can do. However, I am open to any and all suggestions if anyone can provide ideas on how to bring things back.

 

I was thinking I may be able to try and mount them as external USB drives again, and maybe run some SMART tests, but I don't know if it would even do me any good. I am hoping someone much more knowledgeable than I am (like Joe L.) can offer some thoughts.

 

Link to comment

The reason I suggested the above is I thought that perhaps your SATA controller might have been what was 'fried'.  The fact that they show up in Windows at all supports this potential theory.

 

That's why I'd take your experiment with having the drives attached to Windows via USB that little bit further to see if you can read them with some sort of reiserfs reading utility, like:

 

YAReG (http://yareg.akucom.de/) plus rfstool (http://p-nand-q.com/download/rfstool.html)

 

Either that or an Ubuntu LiveCD or similar, or perhaps give this a go: http://polishlinux.org/linux/ext3-reiserfs-xfs-in-windows-thanks-to-colinux/

 

Can't hurt, while you wait for one of the gurus to chime in :)

 

Link to comment

Thanks for the suggestion... I will do so.

 

I did do pretty extensive testing (I have 2 SuperMicro SAS2LP-MV8 controllers, each with 2 SFF-8087 cables). I tried swapping out cables, controllers and backplanes (on the Norco 4224). However, I was very surprised when I got 2 drives back on the last boot.

 

I likely won't get to this until tomorrow afternoon, but I think I will leave things offline until I can dig deeper.

 

I appreciate you chiming in.

Link to comment

It's interesting that the power supply still worked after the "pop" you heard -- which I assume it did, since you indicated the system still booted at that point.

 

Are you CERTAIN that the motherboard was mounted correctly in the case and there were no shorts due to missing and/or extra standoffs?    What you may have heard was some component on the motherboard shorting -- perhaps the SATA controller or a PCIe bus controller chip.    If that was the case, it'd be normal to have SATA drives not responding.

 

Depending on just what you have available for testing purposes, I'd do one of the following ...

 

(a)  First, I'd use a different motherboard and see if the system boots okay to an UnRAID flash drive with NO drives attached.    If so, shut it down;  attach a few drives to the motherboard SATA ports (perhaps 3 drives); and boot to the UnRAID flash again.    Don't try to start the array -- just see if it shows the drives.    If so, shut down; and repeat the process with the next group of drives.  If you do this and see ALL of the drives, then the drives themselves are likely good.

 

(b)  Same as (a), but simply use a different spare system.    If you have a system to use for this, you can quickly test the drives themselves. (again, connecting a few at a time -- how many depends on how many motherboard SATA ports you have)

 

IF the results of whichever method you use above show that the drives themselves are good, then you need to isolate just what's wrong.    Since things didn't work when you moved it all back to your old case, then IF the issue was a short, it likely fried the motherboard -- but it could have also damaged the PCIe bus and blown your AOC-SAS2LP-MV8(s).  That's why you should use onboard SATA ports on a "known-good" system to check the drives.

 

Hopefully you'll find that the drives themselves are good -- in which case as long as you don't write to any of them, you'll be able to start the array in its previous configuration once you get the defective component(s) replaced.    If you have a bad drive (or drives), then (as you know) you'll lose the data on the bad drives;  but you can recreate a new array with the good drives and a new parity drive (which can be the same one you were using if it's still good).    The one key question is do you know WHICH drive was the parity drive?

 

Link to comment

I agree the "pop" was strange. Usually when I've heard it before it's the power supply, but the power supply is dead afterwards, whereas in this instance it still worked (though I did notice the internal-facing fan was no longer working). After moving my motherboard back to the old case for testing, and then back to the Norco case, the system wouldn't even power up anymore with the Corsair (the light on the motherboard would come on, but that was it). I put my old power supply in the Norco case and it booted up right away, so I am leaning towards it being a power supply issue, but it is worth testing with another motherboard (which I thankfully do have).

 

I am somewhat confident the AOC-SAS2LP-MV8s are okay, as during my testing I was swapping SFF-8087 cables, SAS cards and ports to rule those out as the issue (as well as moving the drives around the bays). I think I tried all the different combinations to confirm the cards and cables were good, but I am the first to admit it wasn't completely scientific, and I may have missed something. I also tried all the drives with the onboard SATA on the existing motherboard and had the same drives come up missing, which is why I was thinking it was a drive issue. However, I was really surprised when 2 of the faulty drives showed up on my last boot with the Norco case.

 

I had also tried a few of the faulty drives in an external USB enclosure. Unraid drives that were good would show up in Disk Management in Windows as partitioned with the correct size (though obviously unreadable without reiserfs), but the drives in question would not display the size and reported that they were uninitialized. If I tried to initialize them, Windows popped up a message that the drive was not ready.

 

Thankfully I've read these forums enough in the past that I printed a screenshot of my drive assignments prior to moving hardware so I definitely know which is my parity drive.

 

I will try your suggestions with the new motherboard and a few drives at a time and see what happens. I should have thought of that myself. I will try after work today and report back.

Link to comment

A device like this is a real life saver in times like this,

StarTech SAT3510BU2E Aluminum 3.5" Black eSATA USB Trayless SATA External Hard Drive Enclosure

http://www.newegg.com/Product/Product.aspx?Item=N82E16817707169

 

Or one of those eSATA/USB docks.

Do you hear the drive spinning? If you bring it up on a Linux system, can you do an fdisk -l to see the partition?

 

I would not initialize a drive in Windows; who knows what that will do.

I would boot from a live distro, or even boot unRAID from another flash key without any configuration data.

At least from there you can log in and do an fdisk -l or mount the drive.
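
Roughly, from the console (the device name and mount point are just examples; mount read-only so nothing gets written to the disk):

    fdisk -l /dev/sdb                               # does the kernel see the drive and its partition?
    mkdir -p /mnt/rescue
    mount -t reiserfs -o ro /dev/sdb1 /mnt/rescue   # read-only reiserfs mount
    ls /mnt/rescue                                  # your data, hopefully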

 

I've even done this with laptops and two USB docks.

Heh, I built a special recovery station just for this since some of my drives were underwater.

 

While you don't have to go that far, I would certainly boot up unRAID, look at the drive, and do a smartctl -a and review it.
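
Something like this, assuming the drive shows up as /dev/sdb (adjust to suit):

    smartctl -a /dev/sdb          # identity, SMART attributes and the drive's error log
    smartctl -t short /dev/sdb    # kick off a short self-test, then re-run -a a few minutes later

If smartctl can't even read the identify data, the drive electronics aren't answering, which tells you something by itself.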

 

If it's not detected at all in the system, I would smell the drive electronics (yeah, really) and see if you can spot any signs of electrical damage.

 

Certainly does sound like something was sent through your system and hit the drives.

Link to comment

So I have managed to run a series of tests...

 

I started with another motherboard/cpu/ram that I had lying around and did the following:

 

- Confirmed Unraid starts successfully with no drives attached

- Attached 2-3 drives directly to the on-board SATA on the motherboard and checked the following:

    - Is the drive visible in the BIOS

    - Is the drive visible in Unraid GUI

    - Is the drive visible via: dmesg|grep SATA|grep link

    - Is the drive visible via: fdisk -l

 

The 3 drives I had considered "failed" did not show up in any of the above tests (I am guessing if they don't show in the BIOS the rest of the tests were redundant, but I wanted to confirm).

 

I also attached each disk to an external USB drive bay on a Windows 7 machine. Disk Management saw each drive come online, but it showed as "Disk 1: Unknown - Not initialized".

 

Since they are all Western Digital drives, I installed the WD Data LifeGuard Diagnostics utility, and while it did see a drive attached to USB, it was not able to give me the model number, serial number, capacity or SMART status. What is odd is that I ran the various diagnostics and they all passed.

 

I also tried the "sniff test" WeeboTech suggested, and I do smell a faint trace of something, but since the "pop" happened Saturday afternoon, over 48 hours ago, I am going to guess this test is less conclusive than if I had been able to check just after the event (though it likely would have been masked by the power supply smell).

 

I am still baffled that the power supply pop travelled through the Norco RPC-4224 and hit drives across 3 backplanes, but did not affect the backplanes themselves, or all of the drives attached. I am very happy it didn't kill all 11 drives, but it's odd that it killed 4 (3 with data, and 1 new one that was to be my cache drive).

 

Unless anyone else has some suggestions on other tests to run I am back to assuming my 3 data drives are dead.

 

Provided the smart Unraid guys are in agreement, can someone provide me a recommended path to get Unraid up and running again even though I am down the 3 data drives?

Link to comment

Since they are all Western Digital drives, I installed the WD Data LifeGuard Diagnostics utility, and while it did see a drive attached to USB, it was not able to give me the model number, serial number, capacity or SMART status. What is odd is that I ran the various diagnostics and they all passed.

 

A last-ditch effort would be to attach the drive to a SATA port on your Windows machine (if you want to go that far).

I dunno how you could run drive diagnostics on the USB connection if the drive is dead.

Does it spin?

 

Then try running the WD diagnostics. I would only do one drive at a time.

 

For all you know, one of the drives could have been inserted into the Norco a little skewed and shorted something, or a screw fell, or there really is an issue with the power supply. Again, the sniff test on the PSU is of paramount importance.

 

Something fried somewhere. Get a bright light (LEDs are brighter) and a magnifying glass on all the PCBs. Look at the caps to see if any are bulging or leaking.  Normally the PSU would have shut down if there was a short, but instead something else gave.

 

One time, at band camp... Oh no wait, that's another story....

One time my Pentium PSU fried almost every component. I replaced the motherboard with a friend by my side.

In jest I said, "Now if you see sparks, jump back." I was joking... but don't you know it, I turned on the power and all the diodes lit up, sparked and burnt out right in front of her. She freaked... so did I. I later found 12V had leaked into the 5V line.

 

In any case, I would attempt the WD diagnostics from a SATA port.  It makes no sense that they ran successfully via a USB port but the drive does not work on a SATA port. I would only try one drive at a time.

 

If all else fails you can bring up unRAID and do an initconfig via the console to re-initialize the super block and reassign the drives that do have data.

Link to comment
In any case, I would attempt the WD diagnostics from a SATA port.  It makes no sense that they ran successfully via a USB port but the drive does not work on a SATA port. I would only try one drive at a time.
WD Diag will run a generic drive test that doesn't really do much if it can't correctly identify the drive or read the SMART info, but detects that a drive is at the other end of the cable. I see it often if the SATA or IDE controller is not supported, and WD Diag shows it as SCSI, RAID, or USB. Connecting it to a controller that WD Diag supports will allow the full tests to run.
Link to comment

I agree you've confirmed that the 3 drives are indeed dead.

 

All you need to do now is reinitialize the array.    I'd do this WITHOUT parity at first.  Just install all of the known-good drives; then boot to UnRAID.    Now either Telnet into the server, or (if you have a keyboard/monitor) just use the console.  Log in as root and type "initconfig" (without the quotes).    This will initialize the configuration.    You can then assign all the drives as data drives -- and you should be good to go.    Now shut down; add the parity drive;  reboot;  assign the drive as parity -- and let the system build parity (a long process).
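
From memory, the console side is just the following (it should ask for confirmation before resetting anything, but double-check before you answer yes):

    initconfig      # resets super.dat, i.e. the saved drive assignments and parity state

After that, refresh the web GUI; it will treat the array as a new configuration, ready for you to reassign the drives.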

 

When that's done, you can add more data drives (I'd preclear them first);  and then copy all of the missing data to the array from your backups.

 

Conceptually it's pretty simple -- it's just that some of the steps take a long time  :)

 

I'm surprised the PSU zapped the drives and didn't damage the motherboard.    I suspect you had a short somewhere in the Norco backplane that shorted that specific bus on the PSU -- and the drives that were connected to that specific power feed are the ones that got "zapped".   

Link to comment

 

[image: Dr. McCoy - "It's dead, Jim"]

 

 

Actually, chances are the platters are good if you can hear the drive spinning up, so you might be able to swap the electronics or send it off to a data recovery center if the data is that important to you.  Now if the drives were spinning and in use when the pop occurred, chances are the spike would have disrupted the magnetic format.  In any case, reiserfs is pretty resilient, and we've seen people partially overwrite a hard drive yet still be able to recover some of the data.

Link to comment

Thanks for the replies, guys. The data on all 3 drives is movies, so provided I can figure out what I lost, I can likely get most of it back with some time. I may continue to play with the drives, but since 2 motherboards and 2 SATA cards cannot read the drives in my Unraid server, I am going to assume that even if I get Windows to recognize the drives I am not likely to easily get them back into Unraid.

 

I am going to initconfig the Unraid server to get back up and see what is left, and then once I've completely given up on the drives I will RMA them (thankfully they are all under the 3-year warranty) and then start rebuilding data.

 

If I can magically get the drives recognized in Windows I am assuming I can use reiserfs to copy from Windows to my Unraid server, correct?
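
I am picturing something like this from a Linux box, in case Windows doesn't pan out (the device, server and share names are made up):

    mount -t reiserfs -o ro /dev/sdb1 /mnt/old        # read-only mount of the recovered drive
    mount -t cifs //tower/Movies /mnt/tower -o guest  # the unRAID share over the network
    cp -a /mnt/old/. /mnt/tower/                      # copy everything across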

 

f451 - That is a disconcerting article. What is odd is that in my scenario it took out 2x 3TB drives, 3x 2TB drives and 1x 1TB drive. Thankfully 2 of the 2TB drives came back (I have no clue how), and the 1TB drive was a new cache drive. I've also plugged other drives into the same backplane ports where the failed drives were originally (including 3TB drives), and it all appears to be working fine again.

 

I've worked with computers for 20 years both professionally and personally, and I think this is likely the strangest issue I've faced.

 

The article from f451 makes me second-guess the Norco case, but I know there are a number of Unraid enthusiasts who use them successfully, including Rajahal, Johnm and others. I have to assume they are using 3TB drives by now, and there would be lots of warnings here about the case, when instead it's actually recommended for large builds.

 

I think I will just stick with the case and push my luck. :)

 

Link to comment

So even though they appear to be working you think I may still face issues? Since I still have the case open I will pull them out and see if I can see any physical damage.

 

To be honest, I don't know a lot about electricity, so power supplies are not my strength. Do you think the Corsair AX860 may have been a mistake (other than it blowing up)? I wanted something with sufficient power to cover all 24 drives when they are up and running (i.e. during a parity check/build), and looking through the different Corsair families I saw that AX was their top of the line. I figured the extra money would be well spent, but maybe not. I have a Corsair CX500 in the case now and it's happy, but that is only with the 11 drives.

Link to comment

Just check the boards, what you experienced, and what the other post leads to sounds too close for comfort.

 

I've used Corsair PSUs without issue before.

 

You heard a noise and smelled something burning. Something is wrong and you should check all points of contact just to be sure.  While parts can be replaced, data isn't always as easy to replace. If you go with this combo, make sure your critical data is backed up and copied offsite.

 

Link to comment

I checked all 3 backplanes that were in use, and don't see/smell anything. All ports that had failed drives now have valid drives and Unraid is back up and running (finally!).

 

I still need to poke around with the defective drives, but I am happy to be back online.

 

Do you have any thoughts on whether I should step down the power supply from an 860 to something lower to reduce the risk of future issues?

Link to comment

The AX860 is an excellent PSU, so I don't think that was the issue.  I think it's far more likely that SOMETHING was shorted ... and most likely it had to do with the connections to the backplane, as if it was on the motherboard I'd have expected something to be damaged there.    As for the high-power bus contributing to the issue ... it very well could have caused more damage due to its high current capability (so the short could have dissipated more power);  but in general there are more pros than cons with single rail designs.

 

Get a replacement for the Corsair, and you should be just fine.    Just be SURE that the modular connections are in the correct slots;  are tight; and are firmly seated in the power connections for the backplane.

 

Link to comment

I realise that this is wandering a little off-topic, but it may help someone...

 

I note that the author in the linked report blames the MOSFET devices in this article.  I don't have any direct experience of the Norco product, but I have in the past designed power switching circuits for hard disk drives to allow hot-swapping, and I doubt that the real problem here is the MOSFETs.  It seems much more likely that one or both of two other possibilities applies.  Either the circuit is able to switch on the +5V and +12V rails to the drives without ensuring that there is an adequate return path to ground, effectively pulling the drive's ground positive and forcing current from the +12V rail onto the +5V rail, which could readily kill drives and potentially anything else on the same +5V rail.  Or the circuit fails to switch the MOSFETs on sufficiently hard (or switches them with too long a transition from off to on), so that the peak power dissipation rating of the MOSFETs is exceeded and they fail, possibly going short circuit and then failing to protect the drive in the event of any disconnection of the power ground.  (Very roughly, a MOSFET that lingers half-on while passing a 12V, 2A load momentarily dissipates on the order of (12 x 2)/4 = 6W, far more than the small packages typically used for this kind of switching are rated for.)  Either way, it looks like poor design rather than poor components.  (Sadly, that is why most electronics fails in my experience.)

 

I would be very wary of trusting any drives in the chassis in this particular instance (in this thread).  If something went wrong once, it could happen again. 

Link to comment
