
Lots of Drive issues after upgrading hardware


fonzie


I have been having some issues with my unRAID build these past few months and I've been pulling my hair out trying to resolve them. I've troubleshooted for months now and finally realized that I haven't used the greatest help possible, which is all you experts on this forum. So let me give you a little backstory.

 

I had been running my original unRAID build for a few years with absolutely no problems, but it was time for an upgrade because I wanted to use all the nice new features that unRAID 6 brought to the table (VMs, dockers, etc.).

 

My OLD original underpowered hardware:

 

Motherboard: ASUS M4A785-M

CPU: AMD Sempron 2.8GHz

RAM: 2GB Corsair XMS 675MHz (2x1GB sticks)

SATA ADAPTER: Supermicro AOC-SAS2LP-MV8

SATA ADAPTER: SATA2 Serial ATA II PCI-Express (Silicon Image SIL3132)

PSU: Corsair 650W

Total of 10 drives (including parity and cache)

 

I upgraded quite a few things in my box, namely the motherboard, ram, cpu, power supply, added two video cards and a new case.  I guess the only original things I kept were the drives and the expansion cards and of course the flash drive.  Shortly after, I started experiencing problems with drives.  The first time it happened 2 drives "failed" on me so I thought I had data loss and replaced them.  It happened again soon after and I realized the chances of that many hard drives failing were very unlikely.  So I ran preclears on my two original "failed" drives on a separate computer and they passed... which confirmed my suspicion that it was a problem with some piece of hardware in my new unRAID build and not the drives themselves. I began to systematically test each hardware component to rule out issues.

 

-tested the PSU and it is working properly and does have enough juice to supply power to my entire rig (feel free to cross check this as I may be wrong)

-swapped out the two SAS to SAS 36-Pin cables and that seems to be working fine as well

-swapped out and tested both the SAS RAID controller card and the SATA2 RAID controller card with my buddy who has two identical cards, and that doesn't seem to be the problem. I also moved them to different slots on the motherboard

-changed some bios settings on the SAS controller to the same settings that my buddy has on his (he owns the same card and has no issues)

 

*****One thing to mention is that I sometimes get an "Error PD device not ready" message when booting up unRAID, and I must press a key on the keyboard to continue booting.  I noticed that the boots where this happens are usually the ones where I end up with missing or disabled drives.  This didn't use to happen with my old unRAID setup.  Curiously enough, my friend sometimes gets the same "Error PD device not ready" message, but he does not have any drive issues at all (he does have a different motherboard and CPU though).

 

*****The errors have occurred on different drive trays...so they are not isolated to the same slot every time.

 

At this point, I'm thinking it may be a motherboard issue (maybe the PCI slots cannot supply the full capacity to all the drives I have plus the 2 GPU cards I have added), or maybe the backplanes on my new NORCO case are faulty? Could it be a RAM issue, or the Norco reverse breakout cables?

 

Here's my setup for reference.  Maybe some keen eyes can find something that I overlooked or am not aware of:

 

M/B: Gigabyte Technology Co., Ltd. - 990FXA-UD3

http://www.newegg.com/Product/Product.aspx?Item=N82E16813128514

 

CPU: AMD FX-8350 Eight-Core @ 4000

http://www.newegg.com/Product/Product.aspx?Item=N82E16819113284&cm_re=amd_8350-_-19-113-284-_-Product

 

RAM: 16384 MB (max. installable capacity 32 GB)

http://www.newegg.com/Product/Product.aspx?Item=N82E16820148540

 

GPU1: SAPPHIRE Radeon HD 4830 DirectX 10.1 100265L 512MB 256-Bit GDDR3 PCI Express 2.0 x16

http://www.newegg.com/Product/Product.aspx?Item=N82E16814102822

 

GPU2: EVGA 02G-P4-3658-KR GeForce GTX 650 Ti BOOST SuperClocked 2GB 192-bit GDDR5 PCI Express 3.0

http://www.newegg.com/Product/Product.aspx?Item=N82E16814130910

 

SATA ADAPTER: Supermicro AOC-SAS2LP-MV8 Add-on Card, 8-Channel SAS/SATA Adapter with 600MB/s per Channel

http://www.amazon.com/Supermicro-AOC-SAS2LP-MV8-8-Channel-Adapter-Channel/dp/B005B0Z2I4/ref=sr_1_10?ie=UTF8&qid=1451497676&sr=8-10&keywords=sas+card

 

SATA ADAPTER: SATA2 Serial ATA II PCI-Express RAID Controller Card (Silicon Image SIL3132)

http://www.monoprice.com/product?p_id=2530

 

CABLES: 2x Norco C-SFF8087-4S Discrete to SFF-8087 Reverse Breakout Cable

http://www.amazon.com/Norco-C-SFF8087-4S-Discrete-SFF-8087-Breakout/dp/B002MK7F0Y/ref=sr_1_1?ie=UTF8&qid=1451505475&sr=8-1&keywords=reverse+breakout+cable+norco

 

CABLES: 2x 1m 30AWG Internal Mini SAS 36-Pin SFF-8087 Male to Mini SAS 36-Pin SFF-8087 Male Cable

http://www.amazon.com/gp/product/B008VLHOR2?psc=1&redirect=true&ref_=oh_aui_detailpage_o00_s01

 

PSU: Corsair RM750

http://www.newegg.com/Product/Product.aspx?Item=N82E16817139055&cm_re=corsair_rm750-_-17-139-055-_-Product

 

Case: NORCO 4224

http://www.amazon.com/NORCO-Mount-Hot-Swappable-Server-RPC-4224/dp/B00BQY3916/ref=sr_1_1?s=pc&ie=UTF8&qid=1451507089&sr=1-1&keywords=norco+4224

 

Total of 12 Drives (including Parity and Cache)

 

0ztQYI8.jpg

link to full size image: http://i.imgur.com/YrWzxaH.jpg

 

I just got a missing drive error this morning when I started doing a parity check so I shut it down and came here for help.

 

I can supply additional information for each additional drive in my array if that would be helpful in determining my problem.  Just let me know. thanks.

Link to comment

I'm not sure.  I've had lots of drive failures in different drives, so I would just go to tools-->new config and take note of the drive order.  restart it and set up the drives in the same order again.

 

What I'll do right now is change the two SAS cables from the top two back planes to the bottom ones and see if there is an issue.  I'll get back to you

Link to comment

uIyBF8v.jpg

 

That is the drive that failed.

 

 

I just swapped the cables that were connected to the backplane of the Norco. I moved the top ones, which were connected to the SATA2 Serial ATA II PCI-Express card, and connected them where the SAS2LP cables were. I have restarted twice and haven't seen the "Error PD Device Not Ready" notification.

 

I'm going to set a new config so that I can have all my drives functioning again...and then I will wait for another drive to give me an error.  Once it does, I will do as you suggested and post a complete zip from the diagnostics page.

Link to comment

Okay, so I just had another drive failure.  I went to tools-->diagnostics and saved the zip file.  My concern is that the zip contains passwords and information from some of my dockers.  I want to post the zip as soon as possible so can someone please tell me if it is safe to do so, or which files I should exclude from the zip.  thanks.

Link to comment

After doing more research, I think my Gigabyte motherboard might be causing an issue.  It has dual BIOS and I'm pretty sure it creates an HPA (Host Protected Area) on a drive.  I've had it running like this for a few months now, so I don't know how much damage I've done to all my hard drives with those hidden partitions.

 

I'm going to swap out the new Gigabyte motherboard with my original ASUS motherboard that gave me no errors.

 

If that turns out to be the problem, how do I reverse the damage that has been done to my hard drives by the HPA?  For example, how do I narrow down which drives were affected?

Link to comment

Yep. This came out in my syslog:

 

Dec 30 13:15:35 media kernel: sas: Enter sas_scsi_recover_host busy: 0 failed: 0
Dec 30 13:15:35 media kernel: sas: ata11: end_device-1:0: dev error handler
Dec 30 13:15:35 media kernel: sas: ata12: end_device-1:1: dev error handler
Dec 30 13:15:35 media kernel: sas: ata13: end_device-1:2: dev error handler
Dec 30 13:15:35 media kernel: sas: ata14: end_device-1:3: dev error handler
Dec 30 13:15:35 media kernel: sas: ata15: end_device-1:4: dev error handler
Dec 30 13:15:35 media kernel: ata15.00: HPA detected: current 625140335, native 625142448
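For anyone wanting to check their own syslog the same way, here is a rough Python sketch that picks out the "HPA detected" lines and works out how much space is hidden on each port. The function name `find_hpa` and the 512-byte sector size are my own assumptions, not something taken from the diagnostics, and the sample lines are copied from the log above.

```python
import re

# Matches kernel lines such as:
#   ata15.00: HPA detected: current 625140335, native 625142448
HPA_RE = re.compile(r"(ata\d+\.\d+): HPA detected: current (\d+), native (\d+)")

SECTOR_BYTES = 512  # assumption: 512-byte logical sectors


def find_hpa(lines):
    """Return one dict per 'HPA detected' line found in the given syslog lines."""
    found = []
    for line in lines:
        m = HPA_RE.search(line)
        if m:
            current, native = int(m.group(2)), int(m.group(3))
            hidden = native - current  # sectors the OS cannot currently see
            found.append({
                "port": m.group(1),
                "current": current,
                "native": native,
                "hidden_sectors": hidden,
                "hidden_bytes": hidden * SECTOR_BYTES,
            })
    return found


# Two lines copied from the syslog excerpt above.
sample = [
    "Dec 30 13:15:35 media kernel: sas: ata15: end_device-1:4: dev error handler",
    "Dec 30 13:15:35 media kernel: ata15.00: HPA detected: current 625140335, native 625142448",
]

for hit in find_hpa(sample):
    print(hit["port"], hit["hidden_sectors"], "sectors hidden,",
          hit["hidden_bytes"], "bytes")
```

On this excerpt only ata15.00 is reported, with 2113 hidden sectors, which is about 1 MB if the sectors are 512 bytes. Running it over a full syslog should list every port where the kernel saw an HPA.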

 

 

So what steps do I take now to rectify the problem?  Obviously I will be swapping out the motherboard.  But I want to preserve all my data and avoid any corruption if possible.

Link to comment

All the dev error handler lines are just informational. They are not errors.  The only issue (and it's not a real issue unless the drive in question is your parity drive) is HPA.  Plenty of threads around here on how to remove it so that you get that hidden storage back (going by the log line, 2,113 sectors, or about 1 MB at 512 bytes per sector).
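To narrow down which drives are affected, the usual tool is `hdparm -N /dev/sdX`, which reports the visible and native sector counts for one drive. Below is a small Python sketch that parses that report. The output format in the regex is what I believe hdparm prints, so treat it as an assumption and compare it against your own output. The sample text is constructed from the numbers in the syslog above, not captured from a real run, and `/dev/sdX` is a placeholder.

```python
import re

# Assumed shape of the relevant line in `hdparm -N /dev/sdX` output:
#   max sectors   = 625140335/625142448, HPA is enabled
MAX_RE = re.compile(r"max sectors\s*=\s*(\d+)/(\d+), HPA is (enabled|disabled)")


def parse_hdparm_n(text):
    """Parse `hdparm -N` output; return None if the expected line is missing."""
    m = MAX_RE.search(text)
    if m is None:
        return None
    visible, native = int(m.group(1)), int(m.group(2))
    return {
        "visible": visible,
        "native": native,
        "hidden_sectors": native - visible,
        "hpa_enabled": m.group(3) == "enabled",
    }


# Constructed sample (hypothetical device name, numbers from the syslog line).
sample_output = """
/dev/sdX:
 max sectors   = 625140335/625142448, HPA is enabled
"""

print(parse_hdparm_n(sample_output))
```

A drive with no HPA should show the same number on both sides of the slash. The removal threads mentioned above also use hdparm's `-N` option to restore the native size, but read those threads and the man page before writing anything to a drive, especially one that is part of the array.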

Link to comment

Archived

This topic is now archived and is closed to further replies.
