Massive Hardware Failure - Supermicro


Dr. Ew


After months of design and testing, then implementation and more testing, I finally finished my AI, Deep Learning, & AR lab. It's a complex system, and most of it goes beyond what is needed to explain my current situation and hardware failure.

 

One unRAID server's storage system is failing (or has failed). It's a Supermicro 8028 (Server nAR4) with two connected 12-bay enclosures (two 6027tt units that have been converted to serve solely as storage enclosures: nARstore4a and nARstore4b). The server itself contains eight 1 TB SSDs. 4a contains the unRAID data disks, and 4b contains a RAID 10 array that serves as a very fast unassigned drive. Storage Enclosures are not. The 6027tt units came with the 8027hd backplane. I removed these backplanes and replaced them with sas2el1 backplanes. For a month, everything worked just as it should.

 

There is nothing wrong with the 8028 server or its internal storage. Both storage enclosures are failing or have failed (I'm still trying to determine which).

 

Last week the unRAID system came up a few times with missing drives, and then was normal for a few days. This happened with both enclosures. A few times, as unRAID was loading, I saw text scrolling past very quickly; all I could make out was 'error' because it went by too fast to read. Yesterday, both backplanes went fully offline.
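One way to catch messages that scroll by too fast to read is to save the boot or system log to a text file and filter it afterward. Below is a minimal Python sketch of that idea. The function name `error_lines`, the search pattern, and the sample log lines are all made up for illustration; they are not real output from this server.

```python
import re

def error_lines(log_text, pattern=r"error|fail"):
    """Return the lines of log_text that match pattern, ignoring case."""
    rx = re.compile(pattern, re.IGNORECASE)
    return [line for line in log_text.splitlines() if rx.search(line)]

# Hypothetical sample text, standing in for a saved log file.
sample = """\
sd 1:0:0:0: attached
phy 4: link ERROR detected
backplane: power ok
bbu: status Failed
"""

# Print only the lines that mention an error or failure.
for line in error_lines(sample):
    print(line)
```

To use it on a real log, read the saved file into a string and pass that in place of `sample`; the pattern can be widened to whatever keywords turn out to matter.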

 

I have taken nearly everything out of the equation and still can't figure out what's up. It's very strange that both enclosures would fail at the same time.

 

Here is what it looks like:

Server nAR4 -> nARstore4a -> el1 backplane -> IronWolf HDDs -> LSI 9286 (12 drive R10)

                    -> nARstore4b -> el1 backplane -> IronWolf HDD's unRAID Data -> Areca HBA

 

After going completely offline, it came back, missing several drives. I fiddled around with it, took the trays out, put them back in, the array came back, then went away again. 

 

Occasionally a few hard drives will spin up, and sometimes they all spin up, but ultimately both enclosures end up offline.

 

I tested on a separate server and tried a different HBA, a different controller, different cables, and a different PSU, and I verified the drives are okay. I've tried every failure point, even swapping the backplanes for new ones.

 

Something weird is going on, and nothing makes sense. I tried an HP HBA and another LSI RAID card. The BBU on both controllers is showing as failed, even though all of its metrics look fine.

 

My conclusion is these are the only logical explanations:

1) Some unknown cause killed both HBAs and both controllers, along with their batteries. Very improbable, but I have a new Supermicro HBA on the way to check this.

 

2) A BIOS setting that is replicated on both Server nAR4 and the other servers I tested with to rule out the server itself. Doubtful, but I'll try resetting the BIOS a second time.

 

Those are my only two possible explanations at this point. Everything else has been tested. How it went from working fine to this type of failure is insane.

 

Any Suggestions?


I should also mention some of the error messages I received.

-BBU Failure

-PHY Error on some disks, then fine, error on all disks, then fine. 

-Diagnostic System Error - Backplane

-Backplane power error (even with new PSU, other PSU's)

-Several other strange errors I need to document.

 

One final, very peculiar thing worth mentioning: the el1 backplane has two mini-SAS ports, one to the server and one for daisy-chaining. I plugged a 6-bay SSD enclosure into the daisy-chain port and installed 3 drives, and these all show up fine. That confuses me even further.

  • 2 weeks later...

Turns out it was just the battery, apparently. The battery itself was fine; it just needed to be reset. The automatic retraining was going to take a month, so I ran it manually, and everything zipped back into working order.

 

I don't know why it caused that huge of a problem, but it's fixed now.
