Moving disks to a new server build


Recommended Posts

I'm planning on building a new server that is quieter and more power efficient than my existing one.  The only parts I'm going to keep from my existing UnRAID build are the disks (8x HDD, 1x SSD) and perhaps the Syba PCIe x1 to SATA x4 cards if the new motherboard doesn't have enough ports.

 

Is there anything special I need to do when I move the disks over to make sure UnRAID correctly maps which SATA port is which disk (data, parity, cache) in the correct order?  Or will it figure that out on boot regardless of which port each disk is connected to?

 

Also, the motherboard I pick probably won't have 9x SATA ports for all of my disks.  I'll re-use my Syba PCIe/SATA cards for that.  I'm sure I'd want to put the SSD cache on any of the fastest SATA ports (gen3) provided by the motherboard chipset, and I assume I'd want to also put the 2x parity disks on higher bandwidth SATA ports as well, only putting the data disks on the slower SATA expansion cards.  The parity disks will see more read/write traffic than any of the individual data drives, right?

Edited by cowboytronic
Link to comment
3 hours ago, cowboytronic said:

I'm planning on building a new server that is quieter and more power efficient than my existing one.  The only parts I'm going to keep from my existing UnRAID build are the disks (8x HDD, 1x SSD) and perhaps the Syba PCIe x1 to SATA x4 cards if the new motherboard doesn't have enough ports.

 

Is there anything special I need to do when I move the disks over to make sure UnRAID correctly maps which SATA port is which disk (data, parity, cache) in the correct order?  Or will it figure that out on boot regardless of which port each disk is connected to?

 

Also, the motherboard I pick probably won't have 9x SATA ports for all of my disks.  I'll re-use my Syba PCIe/SATA cards for that.  I'm sure I'd want to put the SSD cache on any of the fastest SATA ports (gen3) provided by the motherboard chipset, and I assume I'd want to also put the 2x parity disks on higher bandwidth SATA ports as well, only putting the data disks on the slower SATA expansion cards.  The parity disks will see more read/write traffic than any of the individual data drives, right?

 

This should go pretty easily. unRAID reinstalls itself on every boot, so there is nothing to uninstall or change. unRAID recognizes disks by their serial numbers, not by which port they're on, so your USB flash drive should just work as-is if you can coerce your new motherboard to boot from it.
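If you want a record of which serial lives in which slot before you tear things down, something like this quick sketch will list them. Nothing unRAID-specific about it; it just reads /dev/disk/by-id, which unRAID exposes like any other Linux box:

```
#!/usr/bin/env python3
# Rough sketch: list whole-disk identifiers from /dev/disk/by-id so you can
# note down model/serial per drive before moving them. Nothing unRAID-specific;
# it just reads what the Linux kernel already exposes.
import os

by_id = "/dev/disk/by-id"
for name in sorted(os.listdir(by_id)):
    if "-part" in name:          # skip partition entries, keep whole disks
        continue
    target = os.path.realpath(os.path.join(by_id, name))
    print(f"{name} -> {target}") # e.g. ata-ST3000DM001-1CH166_Z1F4XXXX -> /dev/sdc
```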

 

I would not want to use that PCIe x1 SATA card for more than 1 (possibly 2) disks. With 1 disk you are getting about 210 MB/s, which is pretty good. With 2 disks in parallel you'd be maxing out at ~105 MB/s - not stellar. With all 4 ports in use it would be more like 52 MB/s per disk, which is altogether awful. Instead I might suggest buying an LSI SAS9201-8i; you can get one on eBay for about $50. 8 fast ports. Optimally you'd want a PCIe 2.0+ x8 slot, but an x16 or x8 slot with only 4 lanes active (it behaves like an x4 slot) would still be fine for 8 spinning drives (~200 MB/s per drive).
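For what it's worth, the math behind those numbers is just the card's shared bandwidth divided by however many disks are busy at once (as they are during a parity check). A trivial back-of-the-envelope sketch, using the ~210 MB/s figure above rather than any datasheet value:

```
# Back-of-the-envelope only: a PCIe x1 card gives all its ports one fixed pool
# of bandwidth, split across however many disks are busy at the same time.
# 210 MB/s is the single-disk figure quoted above, not a datasheet number.
def per_disk_throughput(link_mb_s: float, active_disks: int) -> float:
    return link_mb_s / active_disks

for n in (1, 2, 4):
    print(f"{n} disk(s): ~{per_disk_throughput(210, n):.0f} MB/s each")
# 1 disk(s): ~210 MB/s each
# 2 disk(s): ~105 MB/s each
# 4 disk(s): ~52 MB/s each
```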

 

Good luck!

 

You'll want to steer clear of HBAs that feature Marvell chips. They have been shown to have problems with virtualization and can cause corruption. If you have one and want to make it work, there are steps you can try, but if you're buying something new I'd avoid them. I upgraded mine preventatively - I didn't want anything with a known chance of corrupting data installed. Even a motherboard with more than 6 SATA ports may be adding the extra ports with an add-on controller chip, like a Marvell, so you'd want to double-check. If the motherboard has such a chip, you can disable it or simply not use it; it doesn't make the motherboard bad. My motherboard has a 2-port Marvell chip that I disabled.

 

Consider drive cages for your build. The SuperMicro CSE-M35T-1B is excellent. Cages help you avoid cabling issues, which are the most insidious and common problem we see here.

Link to comment
  • 2 weeks later...

Thanks for all the great feedback.  You got me to do a bunch of research on SATA cards and I realized my Syba PCIe cards have the Marvell chipsets that are notorious for creating silent corruption.

 

I just bought a Dell H310 to cross-flash to LSI 9211-8i IT firmware, plus two SFF-8087-to-SATA breakout cables, all for ~$50 on eBay.  What a bargain!

 

My read/write speed appeared to choke at around 26 MB/s.  I think having two of those Syba cards with 4 drives on each may be the bottleneck that causes my server to freeze randomly from time to time, and it may also be why Plex buffers sometimes.  I'll see once I get the new card and move the drives onto it.
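In case it's useful to anyone else, here's a rough way to compare sequential read speed before and after the swap. It's only a sketch (hdparm -t does much the same thing), /dev/sdX is a placeholder for the actual disk, and it should be run read-only while the array is otherwise idle:

```
#!/usr/bin/env python3
# Rough sequential-read check - a sketch, not a proper benchmark.
# DEVICE is a placeholder; point it at the disk you want to test and run it
# read-only while the array is otherwise idle.
import time

DEVICE = "/dev/sdX"        # placeholder
CHUNK = 1024 * 1024        # 1 MiB per read
TOTAL = 1024 * CHUNK       # read 1 GiB in total

read_bytes = 0
start = time.monotonic()
with open(DEVICE, "rb", buffering=0) as dev:
    while read_bytes < TOTAL:
        data = dev.read(CHUNK)
        if not data:
            break
        read_bytes += len(data)
elapsed = time.monotonic() - start
print(f"read {read_bytes / 1e6:.0f} MB in {elapsed:.1f} s "
      f"= {read_bytes / 1e6 / elapsed:.0f} MB/s")
```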

Link to comment
On 22/07/2017 at 10:02 PM, bjp999 said:

 

You'll want to steer clear of HBAs that feature Marvell chips. They have been shown to have problems with virtualization and can cause corruption.

 

Well, no.  SOME Marvell chips give problems in SOME circumstances.  unRAID does not burst into flames, eat all your data, or kill your cat if you plug in a Marvell chipset controller.

Link to comment
4 hours ago, HellDiverUK said:

 

Well, no.  SOME Marvell chips give problems in SOME circumstances.  unRAID does not burst into flames, eat all your data, or kill your cat if you plug in a Marvell chipset controller.

 

Really? We're getting several reports of this a week.

It doesn't have to eat all your data - the possibility of subtle, near-impossible-to-find corruption is enough. The cat is safe, at least.

Link to comment
2 hours ago, bjp999 said:

 

Really? We're getting several reports of this a week.

It doesn't have to eat all your data - the possibility of subtle, near-impossible-to-find corruption is enough. The cat is safe, at least.

 

Data corruption can happen on any OS, using any controller.  A bad cable can royally f**k things up, and the OS is oblivious.

 

That's one of the reasons backups are important.

Edited by HellDiverUK
Link to comment
1 minute ago, HellDiverUK said:

 

Data corruption can happen on any OS, using any controller.  A bad cable can royally f**k things up, and the OS is oblivious.

 

That's one of the reasons backups are important.

 

If you invest in drive cages and do a solid job cabling them in place, the chance of that type of corruption is virtually eliminated.

 

We are talking about something very different here. This is a known incompatible hardware component.

 

I am not trying to cast an overly wide net, but we do not really have a definitive test for which systems are susceptible. @RobJ's initial post on the subject is also a good source of info, though I'm not sure it's 100% definitive. @Squid is working on this and seems to be making some progress on testing with his Marvell controller, which he says is working perfectly.

 

But at this point, I don't know, and can only advise that users with these questionable controllers inform themselves of the potential risks.

 

An LSI replacement controller for <$50 seems a worthwhile investment to avoid the risk IMHO.

Link to comment

Indeed, I have two Marvell controllers (one is the Supermicro SAS2LP-MV8, the other an el-cheapo 92?0) and neither gives any trouble.  Nor does the 4-port Marvell that's soldered onto my Supermicro X10SBA board.

 

I think the problem is more the firmware revision on the Marvell controller, rather than the chipset in general.

Link to comment
7 hours ago, bjp999 said:

Do you have VT-d enabled in your BIOS? Created any VMs?

I've confirmed I can get 5 recurring parity errors with 2 particular drives installed on both a SAS2LP and a StarTech PEXSAT32 (88SE9128).

 

This has been done on 2 separate motherboards (MSI AM1M and Asus A88X-Pro).

 

VT-d enabled (and used) on the A88X-Pro without those drives installed has always been a rock-solid experience (LibreELEC VM).  I'm not going to run a production server in a state where I believe it will probably fail.

 

Prior to the SAS2LP being installed in the secondary server, my primary server (A88X-Pro/USB 3.1) used it - never with those 2 particular drives installed - with zero recurring parity errors, zero dropped drives, zero corruption, and VT-d in use for my primary Windows workstation.

 

Edited by Squid
Link to comment
12 hours ago, HellDiverUK said:

I think the problem is more the firmware revision on the Marvell controller, rather than the chipset in general.

My investigation thus far reveals that the primary culprit is the firmware on the drive (more to the point, the ATA interface specification that the particular drive/firmware combination adheres to).

Link to comment
12 minutes ago, bjp999 said:

It would be interesting to know if those two drives, connected to a motherboard or a non-Marvell (e.g., LSI) HBA, perform properly.

Well, yeah, they do.  That's why I never once believed anyone who had parity check slowdowns on a SAS2LP (those drives weren't connected to the HBA when that was an issue), and I never get the 5 recurring errors.  My servers are all production servers.

 

Anecdotal evidence shows that it's a small subset of users who have issues with Marvell controllers. I've now at least proved that a certain combination of hardware causes at least one of the issues (the 5 parity errors), and by implication a second (corruption). If you haven't perused the other thread, that hardware combination is limited right now to original-batch ST3000DM001s; later ST3000DM001s do not suffer the same problems. Hopefully, with my educated guess that ATA8-ACS as the interface type on 3TB+ drives on a Marvell controller is what causes the problems, people perusing diagnostics will have another option to suggest to a user as a possible solution.
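If anyone wants to check which ATA spec their drives report, smartctl shows it on the "ATA Version is:" line. A minimal sketch along those lines (assumes smartmontools is installed; the device list is only an example):

```
#!/usr/bin/env python3
# Sketch: print the ATA specification each drive reports, to spot ATA8-ACS
# drives sitting behind a Marvell controller. Assumes smartmontools is
# installed; the device list is only an example - substitute your own.
import subprocess

devices = ["/dev/sdb", "/dev/sdc"]   # example placeholders

for dev in devices:
    out = subprocess.run(["smartctl", "-i", dev],
                         capture_output=True, text=True).stdout
    for line in out.splitlines():
        if line.startswith("ATA Version is:"):
            print(f"{dev}: {line.split(':', 1)[1].strip()}")
            break
```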

 

IMHO, telling users to pick up a used LSI HBA doesn't exactly inspire confidence...  It's why I bought new SAS2LPs: brand new, they were the same price as a used LSI with identical performance.  I did recently pick up a used H200, but that was simply because it was $20 and I had a BR10i still in use that needed to be replaced.

Edited by Squid
Link to comment

So it sounds like, at least, you have proof that the Marvell controller did not properly/fully implement the ATA interface. Either that, or there was some ambiguity and everyone else implemented it in a way that was compatible with those drives except Marvell.

 

I'm with Johnnie. Used LSI works for me. LSI components are what you tend to find in production servers, and I have only heard of one person who got a DOA card. Every other used one I remember hearing about worked fine.

Link to comment
5 minutes ago, bjp999 said:

So it sounds like, at least, you have proof that the Marvell controller did not properly/fully implement the ATA interface. Either that, or there was some ambiguity and everyone else implemented it in a way that was compatible with those drives except Marvell.

 

I'm with Johnnie. Used LSI works for me. LSI components are what you tend to find in production servers, and I have only heard of one person who got a DOA card. Every other used one I remember hearing about worked fine.

But the real issue is that until the fix for SAS2LP and ATA8-ACS drives causing parity check slowdowns was made by LT, none of the "Marvell issues" happened.  Anecdotal: of course.  Suspicious: very.

Link to comment
26 minutes ago, Squid said:

But the real issue is that until the fix for SAS2LP and ATA8-ACS drives causing parity check slowdowns was made by LT, none of the "Marvell issues" happened.

 

That should be easy to test. AFAIK the only change was the introduction of the md_sync_thresh tunable; before that it was always equal to md_sync_window - 1, so set yours to that value and see if you still get the sync errors.

Link to comment
Just now, johnnie.black said:

 

That should be easy to test. AFAIK the only change was the introduction of the md_sync_thresh tunable; before that it was always equal to md_sync_window - 1, so set yours to that value and see if you still get the sync errors.

That's the only code change?  jonp implied more took place so that nr_requests could stay at 128 instead of users having to drop it down to 8.

 

 

Link to comment
8 hours ago, HellDiverUK said:

To be fair, anyone still using the ST3000DM001 deserves everything they get...Marvell controller or not.

 

Well, I still have three running out of six, all above 35,000 hours.  I had another one 'act up' about a week ago in the 30,000+ hour range.  I replaced it, rebuilt the array, then ran a preclear cycle on that disk and it passed without any issues - not sure what the problem really was.  There is no doubt that these disks had reliability issues.  Backblaze data may be a bit misleading for us unRAID users, as their disks spin 24/7 and their drive temperatures run on the high side of the spec limit.  (Just look at pictures of their servers and the racks they put them in.)  To be fair, in my situation the failure rate is much higher (2 out of 6) than for the Hitachi 1TB drives (1 out of 7), which have more hours on them.

 

I don't believe anyone should just be ripping them out because of their somewhat higher failure rate.  At this point, probably most of the failures have already happened and the remaining drives may run for a long time.  Remember that hard drives WILL fail eventually and the only real question is when. 

Edited by Frank1940
Link to comment

I have never replaced an ST3000DM001, ever...  In my experience they are far more reliable than a WD30EFRX.

 

I have 10 of those drives in service, with an average of 25,000 power-on hours; the highest is at 46,000.

 

The big trouble with Backblaze is:

 

  • They utilize drives in a RAID environment that aren't designed for one (isn't that why Red drives exist?).
  • Nowhere do they ever define what actually counts as a failure, which makes all of their statistics useless.

 

Link to comment

Let's not forget the class action lawsuit brought against Seagate for low drive reliability! ;) 

 

I am not picking on any drive in particular, but looking at drive reliability is certainly something everyone should do when making purchasing decisions. So many factors affect longevity, including heat, temperature differential (between spun-down and in-use temperature), high vs. low usage, percentage of time spun down, and "luck of the draw".

 

I would not buy a 3TB drive today because they are expensive per TB compared to larger drive options. And if I did, neither Seagate nor WD would get my vote.

 

I have personally found HGST drives to be very reliable and they are my first choice.

 

But while I like the HGSTs best, price also enters the equation. So I have all three brands in my servers.

Link to comment

Seagate's biggest problem is a PR problem.  They messed up with the 7200.11s when they tossed enterprise firmware into a consumer drive (the precursor to the "NAS" drives), and then everyone was so surprised when the drives acted exactly the way that firmware was supposed to and disabled themselves when a TLER event happened.

 

Personally, I don't expect any drive to last forever, and given that the servers collectively run 30 drives, I reasonably expect one to need to be replaced every year.  Over the last 2 years I consider myself lucky, as even my ST2000DMs are hitting 7 years of service with zero issues, and I still haven't upgraded out a 1TB Seagate that's been problem-free for 8 years.

 

YMMV though.  I consider the whole Seagate vs. WD vs. HGST debate akin to GM vs. Ford vs. Chrysler.  I would never purchase anything but a GM, as everything else is crap, but many people hold the reverse opinion.

Link to comment
