Non-ECC Ram and File structure corruption


DoeBoye

Posted

Hello All!

 

So I am in the process of converting all my rfs disks over to xfs, and twice now I have come across file structure corruption on a disk that needed a rebuild-tree to fix it. This is after all disks were checked and cleared with reiserfsck.

 

I'm assuming I probably have a bad cable or backplane, and will be looking into that tonight. That said, garycase's enthusiastic championing (;D) of the whole ECC RAM argument has finally won me over! Especially since I've been looking for a legitimate excuse to upgrade what is otherwise a fairly solid system ;).

 

Now my question is: could this corruption also be caused by the massive data transfers combined with the use of 4 sticks of non-ECC RAM?

 

ALSO, I asked a question last week, and I received no response so I thought I'd try again:

 

Using the GUI to do the reiserfsck, it asked me to type 'Yes' before beginning the tree rebuild and fix fixable. Where would I type 'Yes'? I tried just about everything. I ended up just going through putty to do it.

 

Thanks!

Posted

I'll add two thoughts ...

 

=> As you know, I'm a big believer in using error-correcting RAM in servers, as this adds yet another layer of fault tolerance and effectively eliminates sporadic RAM bit errors. Whether it's worth a complete system upgrade just to get this is, of course, another question => but WHEN you do an upgrade I'd certainly spend the extra $$ to have an ECC-based setup.

 

=> Using 4 unbuffered RAM modules significantly increases the likelihood of random bit errors. I NEVER do this unless it's on a system with ECC. You might try reducing your RAM to 2 modules [either cutting your total RAM in half, or using larger modules to get the same total with just 2 sticks, depending on what you have now; how large a module your system supports; and whether you want to buy additional modules to do this]. If you're not familiar with the effects of bus loading, watch item #10 here: http://www.xlrq.com/stacks/corsair/153707/index.html

 

 

Posted

Thanks for the response! Very interesting slideshow. I always knew using 4 DIMMs could be problematic, but my system has been rock solid for so long that it never seemed to be a problem...

 

Now, as for the type of errors that the RAM could be creating: could this be causing my corrupted file system, or would the errors present themselves in a different fashion (broken/unreadable files, etc.)?

Posted

Bad memory can cause all sorts of issues, not limited to any specific type. A bit that should be a 1 but is read as a 0, or vice versa, will wreak havoc with anything in a PC.
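
To make the point above concrete, here is a minimal Python sketch of what a single flipped bit can do. The helper names (`flip_bit`, `differing_bits`) and the sample values are made up for illustration; nothing here reflects how ReiserFS or XFS actually lay out their metadata.

```python
import hashlib

def flip_bit(data: bytes, bit_index: int) -> bytes:
    """Return a copy of data with exactly one bit inverted."""
    byte_index, offset = divmod(bit_index, 8)
    corrupted = bytearray(data)
    corrupted[byte_index] ^= 1 << offset
    return bytes(corrupted)

def differing_bits(a: bytes, b: bytes) -> int:
    """Count the bit positions in which two equal-length buffers differ."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))

# Hypothetical 4-byte metadata field holding the value 1024
# (think of it as a block number or a length).
field = (1024).to_bytes(4, "little")
damaged_field = flip_bit(field, 10)   # bit 10 is the only bit set in 1024
value_after = int.from_bytes(damaged_field, "little")

# Hypothetical file contents with one bit flipped in the first byte.
payload = b"some file contents"
damaged_payload = flip_bit(payload, 0)
same_checksum = (
    hashlib.sha256(payload).digest() == hashlib.sha256(damaged_payload).digest()
)

print(value_after, differing_bits(payload, damaged_payload), same_checksum)
```

If the flipped bit lands in file data, you get one damaged file; if it lands in something like the hypothetical field above, the value the file system relies on is simply wrong, which is the sort of thing a structure check would trip over.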
Posted

... Could this be causing my corrupted file system, or would the errors present themselves in a different fashion (broken/unreadable files etc)?

 

Yes, of course it could be causing this. As Jonathan noted, an incorrect bit can cause any number of things => a total system crash; data corruption; display corruption; etc. Modern memory systems are VERY reliable ... but they still have a notable incidence of errors -- 3,751 per DIMM per year according to a recent large-scale study by Google ... but the vast majority of these are correctable => i.e. if you had ECC they'd be eliminated. The rate is notably higher on systems with more than 2 unbuffered modules installed ... which makes it even more important to have ECC.
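
As a rough illustration of what "correctable" means, the sketch below uses a toy Hamming(7,4) code: 4 data bits are stored with 3 check bits, and any single flipped bit can be located and repaired on read. This is a stand-in chosen for brevity, not the code real ECC DIMMs use (those typically protect 64 data bits with 8 check bits), and the function names `encode` and `decode` are my own.

```python
def encode(data):
    """Encode 4 data bits as a 7-bit Hamming codeword (positions 1..7)."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4   # covers positions 3, 5, 7
    p2 = d1 ^ d3 ^ d4   # covers positions 3, 6, 7
    p4 = d2 ^ d3 ^ d4   # covers positions 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]

def decode(codeword):
    """Return (data_bits, syndrome); a non-zero syndrome is the repaired position."""
    c = list(codeword)  # work on a copy so the caller's list is untouched
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s4
    if syndrome:
        c[syndrome - 1] ^= 1   # flip the bad bit back
    return [c[2], c[4], c[5], c[6]], syndrome

stored = encode([1, 0, 1, 1])
stored[5] ^= 1                      # simulate a single-bit memory error
recovered, position = decode(stored)
print(recovered, position)
```

Without the check bits, the read would just return the wrong data with no indication anything happened, which is the non-ECC situation being discussed.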

 

The best memory systems use buffered modules with ECC => this effectively eliminates bus loading AND provides error correction. To do that you'd need to use an E5 series Xeon and supporting motherboard, and buy the more expensive buffered modules. For most UnRAID setups, a server-class motherboard that supports ECC, along with a supporting CPU, is fine as long as you don't need more than 32GB of RAM (i.e. four 8GB modules) ... and for best reliability you could restrict it to 16GB (two 8GB modules).

 
