SM X10SL07 vs X9SCM-IIF (evolves into E3C224-4L and ECC discussion)



I enjoy these little debates. :)

 

I suppose it depends on whether or not $40 is a significant amount of money to you.  If you're at the point where $40 makes a difference in your ability to build the server with ECC, I'd suggest you can't afford to build it in the first place.

 

I think I've been insulted. The question isn't whether a person can afford the $40. The question is whether the person receives $40 of value relative to other uses of that $40. I would only consider spending this $40 after investing in everything else in my list below. And I still think that, if the worst were to happen, the amount of data loss I'd encounter would likely cost < $40 to recover from. This $40 does NOT protect you from copying bad data to the array. It does not mean that you will have no data loss. It just means that data loss is less likely. And how much more reliable is your array? 1/100%? 1/25%? 1/4%? 1/2%? 1%? 10%? I'm way on the left side of this scale.

 

Further, your first suggestion doesn't improve reliability -- it reduces it !!  ANY electrical connection adds to the risk of failure.  A hot-swap cage may make it easier to swap drives; but it adds another connection to the mix => you still have a cable connecting to a SATA connector on the back of the cage, plus the additional connection from the cage to the drive.  Electrically it's actually more reliable NOT to use the cage.  (Although I agree it adds the risk of "bumping" other cables if/when you need to replace a drive.)

 

I've used lots of hot-swap cages over the years -- I'm happy to take that small added electrical risk; but certainly realize that it IS an added point of failure.

 

I think you are just being contrary here. You want to argue about the word "reliable"? You agree you are willing to take the "minuscule risk". The advantage is huge when it comes time to swap a disk. Admit defeat on this one.

 

Most of your other suggestions should ALWAYS be done, regardless of what kind of memory you buy.  Locking SATA cables;  good cooling;  and a UPS are all MANDATORY items if you care about your data; and buying high quality drives is certainly a good idea.  And of course if you're using VMs, an SSD is excellent for hosting them -- and you should certainly buy a CPU with enough "horsepower" to do whatever tasks you plan to ask it to do.

 

Mandatory for you, recommended by me. Tomato, tomato. Either way, a better investment than ECC memory.

 

As for buying your computer pre-built => that's entirely a matter of choice -- I agree if you aren't experienced and/or have ANY reservations about putting one together it's a good idea to let a professional do it for you.

 

As you and I know, stop by General Support and you'll find a collection of "thought they were up to the task" system builders who would have been better off leaving the driving to Greenleaf. People should not confuse a 20-, 12-, or even 8-drive unRAID build with building a workstation with an SSD, one spinner, and a souped-up video card.

 

One other note:  You indicated "... For $40-$80 (cost of ECC over conventional for 16G and 32G memory)"  ==>  the latter number implies you're considering installing 32GB of memory (e.g. four 8GB modules).  Installing 4 modules of unbuffered RAM significantly increases the likelihood of memory errors -- the bus loading with that many loads results in very degraded signaling waveforms ... making errors far more likely.  I NEVER install 4 modules on an unbuffered board unless they're single-sided modules (in which case 4 modules present the same load as 2 double-sided modules -- but 8GB modules are going to be double-sided).

 

Four modules will work ... but that's an even stronger reason to use ECC modules !!

 

I have 32G in my workstation (4x8G identical G.Skill RAM). Works flawlessly. I sleep easy. ;)

Link to comment

4 modules generally work just fine -- but they are absolutely NOT nearly as reliable as 2 modules when you're using unbuffered RAM.  It's also likely that your BIOS doesn't run the 4 modules as aggressively as it would 2 modules -- it likely adds a cycle to the latency and/or runs them at a lower speed.  If you have an oscilloscope, take a look at the waveform on your memory bus with 2 modules installed; then again with 4 installed.  There's a reason virtually all commercial servers use buffered RAM [with ECC of course  :) ]

 

As for the extra connection decreasing reliability when you use hot-swap cages -- that's simply a basic fact.  EVERY connection adds some electrical resistance and reduces overall system reliability.  I readily concur that it's very small -- as I noted, I've used hot-swap cages for years -- but it does NOT "increase reliability" (which is what you claimed).  It does increase convenience ... but that's certainly not the same as reliability.  And as for "value" (which you focused on) ... consider:  I spent ~ $440 for the 4 5-in-3 cages in my main media server.  In the past 5 years, I've replaced 2 drives.    So from a "value" perspective I spent $220/drive for the convenience of not having to spend 5 minutes to pop off the side and replace a drive internally  :)

 

Didn't mean to insult you over $40 => my point is simply that this is a trivial cost, and it doesn't seem to make any sense to use less reliable memory for that difference.    I can't imagine that $40 matters for anyone building a reasonably equipped server (which was my point).  Otherwise why bother buying a server grade motherboard with ECC support?

 

 

 

Link to comment

I get your points.

 

I guess my point is that added protection is not the only consideration. Once you get to a sufficiently high level, other factors come into play. So if the chances of your array being stolen over a ten-year period are 0.2%, the chances of your house burning down are 0.1%, and the chances of having a memory error with non-ECC memory is 0.0001%, is it better to change your locks, invest in an alarm system, or add ECC memory?
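That trade-off is easy to put into rough numbers. A minimal sketch (Python), using the hypothetical probabilities above plus made-up recovery costs -- none of these are measured figures, just an illustration of weighing each risk by likelihood and cost:

```python
# Back-of-the-envelope expected-loss comparison. All probabilities and
# dollar figures are hypothetical (the ten-year chances quoted above,
# plus assumed recovery costs) -- substitute your own estimates.

risks = {
    # name: (chance over ten years, rough cost to recover, in dollars)
    "array stolen (change locks?)": (0.002,    2000),
    "house fire (alarm system?)":   (0.001,    2000),
    "non-ECC memory corruption":    (0.000001, 40),
}

for name, (prob, cost) in risks.items():
    print(f"{name:32s} expected loss over 10 years: ${prob * cost:.4f}")

# The marginal $40 is best spent wherever the expected loss -- and the
# reduction that $40 actually buys -- is largest.
```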

 

I will quibble on the reliability topic because I am pretty passionate about easy drive removal. I will concede that adding an additional connection may ever so slightly decrease the reliability of the server UNTIL you go to do a disk replacement. Then you are subjecting yourself to a geometrically higher risk of having a problem that can lead to data loss.  I include this type of scenario in my concept of reliability.

Link to comment

Interesting debate. I can tell you from experience, memory errors can be silent, and they can be deadly if the corruption lands in a memory area that is used to update the superblock.  I've seen good memory go bad silently, destroying a filesystem.

 

For a small system you do not care about, it's no big deal: restore/reload/reconfigure and you're done. You do have backups, right?

For an archival storage array you care about, it's penny wise and hour foolish if you can actually afford it.

 

I think it matters more when you start getting into huge amounts of ram. More data is cached, more data has the potential for bitrot in memory. (lol).

 

Yes, memtest can detect an error, but only if you are running it. If you are silently corrupting your data or, better yet, metadata, you have absolutely no warning until the kernel notices something or crashes.

 

While memory has gotten better these days, the potential for errors still exists.

 

When I had a large workstation system with FBDIMM, I was amazed at how many memory errors were reported.

The memory was quality, recommended RAM on a Supermicro board.  It wasn't bad. It passed every memtest I threw at it. However, it was also up 24x7x365 for years on end.

Our Sun servers warn us when there is a problem, thus protecting us before we crash or destroy something important.

Link to comment


I'll probably get burned sooner or later, but by the time all these arguments for ECC came out I had already committed.

 

I believe that monthly parity checks would highlight memory errors (I'd see parity locations randomly flipping back and forth on consecutive checks). MD5 checks would also detect memory issues if they randomly started mismatching and then came back clean on re-verification.
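For anyone who wants to try that MD5 approach, here's a minimal sketch (the share path and manifest location are hypothetical). The idea is that a file which mismatches once but verifies clean on a second pass points at a transient RAM/bus error rather than on-disk corruption:

```python
import hashlib
import os

ARRAY_ROOT = "/mnt/user/archive"      # hypothetical share to protect
MANIFEST = "/boot/md5-manifest.txt"   # hypothetical manifest location

def md5_of(path, chunk_size=1 << 20):
    """Stream a file through MD5 so large media files don't need to fit in RAM."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def build_manifest():
    """Record 'hash  path' for every file under ARRAY_ROOT."""
    with open(MANIFEST, "w") as out:
        for dirpath, _, filenames in os.walk(ARRAY_ROOT):
            for name in filenames:
                path = os.path.join(dirpath, name)
                out.write(f"{md5_of(path)}  {path}\n")

def verify_manifest():
    """Report mismatches; re-run on any hit -- a mismatch that passes the
    second time suggests a transient memory error, not bad data on disk."""
    with open(MANIFEST) as f:
        for line in f:
            recorded, path = line.rstrip("\n").split("  ", 1)
            if md5_of(path) != recorded:
                print(f"MISMATCH: {path}")

if __name__ == "__main__":
    if not os.path.exists(MANIFEST):
        build_manifest()
    else:
        verify_manifest()
```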

 

Generally I find memory in my computer works pretty well, its the memory in my head that needs the ECC parity protection. ;)

 

 

Link to comment

... by the time all these arguments for ECC came out I had already committed.

 

?? !!!

 

Then why did you even ask for opinions -- TWO DAYS ago ??

 

Last decision. What are advantages of ECC over regular?

 

... when you also indicated you had not yet decided  ...

 

I am considering it.

 

Sure sounds like you're not at all "committed" => you're easily within the return window for anything you ordered in the last two days !!

 

Like WeeboTech, I've seen MANY instances over the years of random memory bit errors causing corruption that wasn't at all obvious until well after they had occurred -- when something was accessed that had unexplained errors in it; when a computation wasn't correct but wasn't noticed until later; etc.  In fact, I'd say most errors do NOT cause system crashes -- a bit error in data being processed isn't likely to result in a crash; only an error in an executable instruction is.  Systems that correct and log errors will, as WeeboTech noted, show just how often these random events occur.

An IEEE study about a decade ago concluded there was a 96% chance of getting a random bit flip error every 3 days due to cosmic rays, but that ECC memory reduced the probability to one chance in six billion.  This was based on atmospheric neutrons at ground level -- higher elevations would have slightly higher error rates.  If you have access to IEEE documents, the study was called "Single event upset at ground level" by E. Normand (latest revision is from 2002).
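Just to put that quoted figure into rough numbers -- this only restates the "96% chance every 3 days" claim above as a rate, treating flips as a Poisson process; it is not a re-derivation of the study itself:

```python
import math

# Back out the flip rate implied by "96% chance of at least one flip
# every 3 days" (the figure quoted above), assuming a Poisson process.
p_three_days = 0.96
rate_per_day = -math.log(1 - p_three_days) / 3

print(f"Implied flip rate: ~{rate_per_day:.2f} per day")
print(f"Expected flips per year of 24x7 uptime: ~{rate_per_day * 365:.0f}")

# The same study's quoted residual chance with ECC:
print(f"With ECC: about 1 in {6_000_000_000:,}")
```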

 

I agree that for small systems it's not a big deal as long as you're fully backed up -- as WeeboTech noted, "... just restore/reload/reconfigure and you're done."    But as I said several times, I simply don't understand why anyone would spring for a server grade board with ECC support and then NOT use ECC memory.

 

Link to comment

I asked the question early in the morning and the serious arguments didn't happen for 5 hours. I needed the memory, so I ordered it an hour or so after the question was asked that morning.

 

I do have another system that could use a memory upgrade from 4GB and does not take ECC. I would consider moving the desktop memory there and buying ECC in the future (hence my "considering it" comment).

 

Still feeling pretty ambivalent. It's hard to get excited about paying 1/3 more for something when all of your personal experience points to the less expensive option. But I hear you guys, know you are passionate about this topic, and appreciate the examples and the concern for my array's health. I'll probably stick with this for now and think about jumping to ECC when I go from this 2x8G to 4x8G RAM.

Link to comment

Understand -- although it sounds like you weren't seriously seeking advice with that timing.

 

One final note:  You indicated you were comfortable just running MemTest to check your memory.  Note that the majority of correctable bit errors don't happen with BAD memory modules ... they're simply a consequence of dynamic RAM technology, caused by atmospheric disturbances that flip a bit on a module, or by noise on the memory bus that results in the same thing (the latter is even more likely when you have more than 2 modules installed).

MemTest will find bad modules, but isn't likely to even notice these random bit errors that only occur once every few days.  MemTest would only report an error if it happened to be testing the specific memory block where the error occurred at the time ... and it would in fact pass that same block on subsequent passes, so you'd likely write it off as a "fluke".

When a module actually fails, it really doesn't matter if you have ECC or not -- but that type of failure will almost certainly cause a crash, so the system will simply go down, with no undetected errors introduced into your data.  It's the random single-bit errors -- and these DO happen, whether you like to acknowledge it or not -- that ECC protects against, and they are the reason it's a good idea to use ECC modules.

 

 

Link to comment

If you remember, the nature of my question was not which should I buy. The question was about features that required ECC memory. I had read that some file systems (like ZFS) required it. I knew unRAID did not. I was always planning to buy non-ECC memory. Perhaps the initial question was a bit broad, but I did clarify this in my next response.

 

I thank you again for your insights and opinions. I learned a lot about ECC memory, and I'm hopeful the community learned a few things too, including this valuable lesson: NEVER announce you are putting non-ECC memory on a server MB. Lie, brothers, lie!!

 

Peace.

Link to comment

bjp998 > LOL!!!! Now that was funny.

 

Garycase, that's some interesting information about ECC.

 

FWIW, when I had a 4GB server I did not use ECC memory.

 

When I went to 8GB it was cost-effective enough for ECC, and it trapped some memory errors.

On my workstation, there were plenty of notifications that made me feel it was worth the expenditure to not worry about corruption.

 

My feeling is, when we install larger amounts of memory in a server-class machine that is up constantly, it's more important to consider the expenditure.  My time is valuable.  The time I spend fixing something means time I cannot spend consulting and making money.

 

md5sum and parity checks will tell you something is wrong, but not where. Until you reboot with memtest.

 

If it's critical filesystem metadata, it may be really hard to recover.  I mention this part because the chance of the data being cached is higher as the amount of RAM goes up.

 

While Garycase is very passionate about it, I'm ambivalent also.

In my case, I'll do what will save me time, or a headache, in the future.

Link to comment

Even workstations can benefit from ECC.  I strongly suspect that MANY of the unexplained crashes on workstations (which folks usually attribute to "another Windows anomaly") are in fact due to a random bit error that wouldn't have happened with ECC RAM.    There are two things that really bug me about desktop boards and chipsets => (1) they almost never support ECC (some AMD boards do); and (2) they never (at least I'm not aware of any) support buffered memory.    I'd use both in a heartbeat if they were available.

 

I'm not "passionate" about using ECC ... most of my systems do not -- but only because it's not an option.  I am passionate about using it when you have a motherboard that supports it.    I've gone for rides with a friend who collects classic cars in some of his old cars that don't have seat belts (I love his 1910 Buick)  ... but whenever I'm in a car that has them, I absolutely buckle up !!

 

Link to comment

md5sum and parity checks will tell you something is wrong, but not where. Until you reboot with memtest.

 

Rebooting with memtest likely won't tell you either, unless you've actually had a module fail.  Random bit errors from cosmic rays, sunspots, etc. -- the most common causes of atmospheric-induced errors -- aren't going to be caught by memtest even if they've caused corruption, since they've already happened and the memory isn't actually bad.

 

The way to identify where the issue is when this happens depends on how much information you have on the corruption.  If you have md5's, then you'll know which file is corrupted - so just replace it from your backups.  If it's just a parity check error and you don't have CRCs/MD5s, then you'll need to do a complete comparison against your backups to isolate the issue.

 

Link to comment

Cosmic rays :o .  LOL! That cracks me up. ;D

FWIW, it has some merit. Since I'm very sensitive to energy and non-corporeal energy beings, I find this plausible.

In my case, I had always thought it was a heat issue.

So whenever I could, I would spend on heat spreaders and, if possible, ECC.

Link to comment

The Boeing study showed that the majority of random bit flip errors are due to atmospheric disturbances.  Actual memory failures (i.e. a module that actually fails) are more likely due to heat-related issues.  But those aren't what you're really protecting against with ECC -- an actual module failure will almost certainly result in a system crash ... so you'll clearly notice it;  confirm it with Memtest; and replace the module.    It's the random single bit errors that you won't notice that ECC protects against.

 

You're not paying extra to protect against cosmic rays -- you're paying extra to protect your data from random bit errors.    The cause is almost irrelevant ... it's something that happens - and you either want to protect against it or not.    As WeeboTech has already noted, when you look at the logs from ECC modules, you'll see that these errors DO happen with alarming frequency ... and you'll simply never know that when you use non-correcting modules.  These are NOT something Memtest will find for you.
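If you're curious what those logged corrections look like on a Linux box, the kernel's EDAC subsystem exposes per-memory-controller error counters under sysfs. A small sketch for reading them -- it assumes an EDAC driver is loaded for your memory controller; if the glob matches nothing, the counters simply aren't available on that system:

```python
import glob
import os

def read_counter(path):
    """Return the counter value, or 'n/a' if the attribute is missing."""
    try:
        with open(path) as f:
            return f.read().strip()
    except OSError:
        return "n/a"

# ce_count = corrected errors (what ECC silently fixed),
# ue_count = uncorrected errors (the ones that would have bitten you).
controllers = sorted(glob.glob("/sys/devices/system/edac/mc/mc*"))
if not controllers:
    print("No EDAC memory controllers found (no ECC error reporting here).")
for mc in controllers:
    ce = read_counter(os.path.join(mc, "ce_count"))
    ue = read_counter(os.path.join(mc, "ue_count"))
    print(f"{os.path.basename(mc)}: corrected={ce} uncorrected={ue}")
```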

 

On my workstation, there were plenty of notifications that made me feel it was worth the expenditure to not worry about corruption.

 

Clearly for a typical home user it's no big deal if some data is corrupted -- as long as the important data is backed up, you just need to spend the time identifying the corrupted data and then restoring it from backups.  My point all along hasn't been that you should only buy systems that support ECC memory (although I do that) -- it's that if you DO buy a server-class board with ECC support, it seems penny wise/pound foolish not to use that capability.  As WeeboTech noted:

 

I'll do what will save me time, or a headache, in the future.

 

... and certainly avoiding the corruption from random errors seems to fit that criteria rather well  :)

Link to comment
