System Stability Testing - How do you test and what do you consider "stable"?


Interstellar

Recommended Posts

Right,

 

An interesting question popped into my head recently (and a quick search didn't show any major discussion on the subject):

 

What do you consider "stable"?

 

For example, I've had computers that were unstable at stock settings, i.e. with the motherboard on auto settings.

 

I've also had chips that will do 1GHz above stock and not throw an error for 24 hours.

 

So my questions are:

 

1. How long do you run your stability tests before you consider it stable?

2. What do you test with? (Memtest86+ for memory, obviously, but CPU/Northbridge/Memory Controller?)

3. What temperatures do you test at?

 

 

For example, I have a 3.2GHz X2 Black Edition, which I'm currently testing at 3.2GHz at 1.3V (+0.025V above stock) with all four cores unlocked.

I've also tested at 3.4GHz at 1.3V with just two cores unlocked, which passed 24 hours of Prime95 Small FFTs with no worries at all.

 

It's hovering around 52C and has passed three hours of Prime95 In-Place FFTs (Max heat, etc) so far (aiming for 24 hours or maybe even 48!).

 

So how do you guys determine what defines "stable"?

 

In my mind it's simply a case of the probability that the CPU or hardware will throw an error. This probability increases with:

1. Decreased voltage at constant frequency.

2. Increased frequency at constant voltage.

3. Increased temperature at constant voltage and frequency.

 

Given everything else is equal...
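
Purely to illustrate the idea (my own back-of-envelope notation, nothing rigorous), you could write the chance of seeing at least one error during a test of length t as something like:

```latex
P_{\text{err}}(t) = 1 - e^{-\lambda(V,\,f,\,T)\,t},
\qquad
\frac{\partial\lambda}{\partial V} < 0,\quad
\frac{\partial\lambda}{\partial f} > 0,\quad
\frac{\partial\lambda}{\partial T} > 0
```

where lambda is the underlying error rate for a given voltage, frequency and temperature. "Stable" then just means you've tested long enough to convince yourself that rate is low enough for what the machine does.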

 

So it's certain that every single computer will, at some point in its life, throw at least one error, but how do you determine the limits?

Link to comment

I almost always buy workstation- or server-class hardware and always run everything at stock speeds/voltages, because to me overclocking can only make things less stable. Now, my less stable might be another person's stable enough.

For my classification of stable (talking servers here, because desktop applications throw a wrench into things), I expect my server to be online 100% of the time for 3-4 years with no failures.

Link to comment

I almost always buy workstation- or server-class hardware and always run everything at stock speeds/voltages, because to me overclocking can only make things less stable. Now, my less stable might be another person's stable enough.

For my classification of stable (talking servers here, because desktop applications throw a wrench into things), I expect my server to be online 100% of the time for 3-4 years with no failures.

 

So how did you confirm that your hardware is all working OK? Do you just assume it is?

Just because it's server-class hardware doesn't mean you couldn't have got a duff one.

 

 

Link to comment

My generally accepted time-frame for a fairly solid stability test is 8-24 hours; if time is tight, around 1-2 hours. If you start a test before you go to bed and it still hasn't crashed when you get up in the morning, it probably isn't going to. At that point you either run a different test or change your settings around and go again.

 

I've seen memory tests fail on the 150th pass (~3 days), and drives pass manufacturer tests yet show a noticeable performance issue when benchmarked, such as a giant dip in read/write speed. When problems like this come up it's often faster to start replacing parts until things start working again, then work your way back.

 

It does make you wonder, though: how long is "too long" to run a memory test? At some point a random read/write error becomes likely... even RAM that would be considered perfectly fine would fail a memory test, given enough time.

 

Some testing stuff:

 

  • CPU: StressCPU, Prime95, IntelBurnTest
  • CPU Temperature: VERY dependent on what CPU you have; generally, though, ~40C or less at idle and <60C under load is ideal
  • Memory: Memtest86+, Windows memory test
  • HDD: CHKDSK, manufacturer's tests (SeaTools, WDDiag, HUtil (crap)) & HDTune, full zeros/long copies, md5sum
  • Motherboard: Reset BIOS to defaults and/or flash BIOS. Otherwise process of elimination
  • PSU: A dedicated PSU tester and/or multimeter, alternatively another (working) PSU

 

Other than that, stick to brand-name parts; e.g. a Thermaltake/Corsair 450W PSU is much better than a generic one.

 

 

Link to comment

I personally do everything I think my machine should be doing. In other words, in the case of an unRAID server:

 

I copied files to it as I would normally do. Heck, I even copied, moved and deleted on multiple machines at the same time.

When I was testing its ability to serve files, I fired up 4 machines, connected to it, and watched 4 different 720p video files at the same time.

 

I did this over the course of a few days and watched its temps. I ran multiple parity checks. I purposely disconnected a drive to simulate a failed drive and had it rebuild.

 

After 4 or so days I decided it had passed, in my mind, every possible problem/situation I would need it to handle, and stamped it done.

 

Did I overclock or underclock it? NO, because that was neither my intention nor a design need.

Link to comment

I am all for testing hardware stability, but this just seems like overkill... especially for an unRAID system.

 

For unRAID, a 24-hour memtest, a parity build and a couple of checks, and running 3-4 preclears at the same time should be enough.  Whenever testing a customer's build I usually load it as full as possible with drives and test starting and stopping the array, rebooting over and over, etc., to verify that the PSU can handle a full load.  Nearly all drives I test with are older 7200RPM drives, while most people seem to be moving towards the "Green" drives now.

Link to comment

I am all for testing hardware stability, but this just seems like overkill... especially for an unRAID system.

 

For unRAID, a 24-hour memtest, a parity build and a couple of checks, and running 3-4 preclears at the same time should be enough.  Whenever testing a customer's build I usually load it as full as possible with drives and test starting and stopping the array, rebooting over and over, etc., to verify that the PSU can handle a full load.  Nearly all drives I test with are older 7200RPM drives, while most people seem to be moving towards the "Green" drives now.

 

But that doesn't put the CPU/Northbridge/Memory Controller under any load whatsoever? Memtest does NOT fully test a memory controller either...

 

Unless you hammer your hardware at 90% of the CPU's Tcase limit, it could be unstable. [And to be honest you should test at just under the Tcase limit.]

 

Given how easy it is to run Prime95 under Linux, if you're sending systems off to customers, surely spending 24 hours running Blend at 90% of the CPU's thermal limit is the minimum you should do?
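
To show what I mean, here's a rough sketch of the sort of thing I run (assuming the Linux mprime binary is on the PATH and already configured from a first interactive run, and that /sys/class/thermal exposes the CPU temperature - both are assumptions you'd adjust for your own board):

```python
import subprocess
import time
from pathlib import Path

# Rough sketch: run the Prime95/mprime torture test for a fixed period and log
# the hottest thermal zone once a minute. The binary name, sysfs path and the
# 24-hour window are assumptions - adjust for your own hardware.
HOURS = 24

def max_temp_c():
    temps = []
    for zone in Path("/sys/class/thermal").glob("thermal_zone*/temp"):
        try:
            temps.append(int(zone.read_text().strip()) / 1000.0)
        except (OSError, ValueError):
            pass
    return max(temps) if temps else float("nan")

proc = subprocess.Popen(["mprime", "-t"])          # -t = torture test
deadline = time.time() + HOURS * 3600
try:
    while time.time() < deadline:
        if proc.poll() is not None:                # exited early: go look at the worker output
            print("mprime stopped before the deadline - check its results output")
            break
        print(f"{time.strftime('%H:%M:%S')}  hottest zone: {max_temp_c():.1f} C")
        time.sleep(60)
finally:
    if proc.poll() is None:
        proc.terminate()
```

If it runs the full window with no workers stopping and temperatures where you expect them, that's one data point towards "stable" at that voltage/frequency/temperature.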

 

I tested mine at a constant 55C under load (62C limit for my chip, I believe), then to be doubly sure I've upped the under-load fan speeds so that even if the room doubles in temperature (20C -> 40C) the CPU will still be at 55C or lower.

 

In addition to that I will be testing for 24 hours at each C-state (which I've customised for low power usage - C3 @ 800mV, not 1000mV!) so I know my CPU is 24-hours-or-more stable at every state, frequency, voltage and temperature :/

Link to comment

we don't overclock/underclock/change anything like that on our unRAID systems.

 

Memtest will sufficiently test memory for our needs, and our file copy and verify operations exercise constant disk activity along with network activity.

I believe in testing, but not to the extent you are going to for a server.  For a desktop gaming machine I might do a slightly different suite of testing, but probably would not go to the extent you are.

Your testing would add a week or more on top of what we already do.

Link to comment

I am all for testing hardware stability, but this just seems like overkill... especially for an unRAID system.

 

For unRAID, a 24-hour memtest, a parity build and a couple of checks, and running 3-4 preclears at the same time should be enough.  Whenever testing a customer's build I usually load it as full as possible with drives and test starting and stopping the array, rebooting over and over, etc., to verify that the PSU can handle a full load.  Nearly all drives I test with are older 7200RPM drives, while most people seem to be moving towards the "Green" drives now.

 

I plan to use badblocks on all my new drives from now on.  4 passes of specific bit patterns should suffice to exercise the drive enough.

 

You would be surprised at how many drives will time out while attempting to error-correct. It's one way to weed out questionable drives.  The badblocks test writes 4 patterns:

 

0xaa  10101010

0x55  01010101

0xff  11111111

0x00  00000000

 

I plan to add a custom test that will create a file on the filesystem over the LAN with each specific bit pattern to fill a drive, then compare a calculated md5sum against what is read back.
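
Roughly what I have in mind, as a sketch (the share path and file size are placeholders, and in practice the file needs to be bigger than the RAM on both ends so the read-back actually comes off the disk rather than a cache):

```python
import hashlib

# Sketch of the custom test described above: fill files on the server's share
# with each badblocks-style bit pattern, then re-read them and compare md5sums.
# SHARE and SIZE_GB are placeholders for the real mount point and test size.
SHARE = "/mnt/server/testshare"
SIZE_GB = 4
CHUNK = 1024 * 1024  # 1 MiB
PATTERNS = {"aa": 0xAA, "55": 0x55, "ff": 0xFF, "00": 0x00}

def write_pattern(path, byte_value, size_gb):
    """Write size_gb GiB of a repeating byte and return the md5 of what was sent."""
    chunk = bytes([byte_value]) * CHUNK
    md5 = hashlib.md5()
    with open(path, "wb") as f:
        for _ in range(size_gb * 1024):
            f.write(chunk)
            md5.update(chunk)
    return md5.hexdigest()

def read_back(path):
    """Re-read the file and return the md5 of what actually came back."""
    md5 = hashlib.md5()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(CHUNK), b""):
            md5.update(block)
    return md5.hexdigest()

for name, value in PATTERNS.items():
    path = f"{SHARE}/pattern_{name}.bin"
    sent = write_pattern(path, value, SIZE_GB)
    got = read_back(path)
    print(f"0x{name}: {'OK' if sent == got else 'MISMATCH - suspect drive, controller or RAM'}")
```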

Link to comment

I am all for testing hardware stability, but this just seems like overkill... especially for an unRAID system.

 

For unRAID, a 24-hour memtest, a parity build and a couple of checks, and running 3-4 preclears at the same time should be enough.  Whenever testing a customer's build I usually load it as full as possible with drives and test starting and stopping the array, rebooting over and over, etc., to verify that the PSU can handle a full load.  Nearly all drives I test with are older 7200RPM drives, while most people seem to be moving towards the "Green" drives now.

 

But that doesn't put the CPU/Northbridge/Memory Controller under any load whatsoever? Memtest does NOT fully test a memory controller either...

Unless you hammer your hardware at 90% of the CPU's Tcase limit, it could be unstable. [And to be honest you should test at just under the Tcase limit.]

Given how easy it is to run Prime95 under Linux, if you're sending systems off to customers, surely spending 24 hours running Blend at 90% of the CPU's thermal limit is the minimum you should do?

I tested mine at a constant 55C under load (62C limit for my chip, I believe), then to be doubly sure I've upped the under-load fan speeds so that even if the room doubles in temperature (20C -> 40C) the CPU will still be at 55C or lower.

In addition to that I will be testing for 24 hours at each C-state (which I've customised for low power usage - C3 @ 800mV, not 1000mV!) so I know my CPU is 24-hours-or-more stable at every state, frequency, voltage and temperature :/

 

I might agree that if I were building a lot of machines I would spend a day or so running Prime95 while doing other tests. It does tax the machine in unique ways and I've seen it catch questionable builds.

Link to comment

we don't overclock/underclock/change anything like that on our unRAID systems.

 

Memtest will sufficiently test memory for our needs, and our file copy and verify operations exercise constant disk activity along with network activity.

I believe in testing, but not to the extent you are going to for a server.  For a desktop gaming machine I might do a slightly different suite of testing, but probably would not go to the extent you are.

Your testing would add a week or more on top of what we already do.

 

It's irrelevant whether they are at stock, overclocked or anything else.

 

I've had systems unstable at load at stock settings.

 

Memtest86+ tests the sticks individually, which hardly puts any load on the memory controller.

 

My tests on my setup have shown that:

1. Memtest86+ is stable for 24 hours with 4 sticks at auto settings.

2. The CPU is stable for 72 hours of Prime95 and for Linpack set to 1GB of RAM for 50 passes.

3. It is not stable when you run Linpack using 7.2GB in Windows (i.e. maximum CPU, Northbridge and memory I/O load), which clearly shows it is the memory controller that has issues with 4 sticks.

 

Bumping the CPU's NB voltage up to 1.2V (instead of 1.175V), or reducing the timings/frequency, solves this problem.
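
If anyone wants to reproduce that kind of combined CPU + memory-controller load without LinX, a crude stand-in (my own approximation, not Linpack itself) is to loop large double-precision matrix multiplies that don't fit in cache and spot-check the result each pass:

```python
import numpy as np

# Crude Linpack-style load: repeated large float64 matrix multiplies with a
# running spot-check. N=8192 gives roughly a 1.5GB working set (three 512MB
# matrices) - size it to chew up most of your RAM if you want to stress the
# memory controller harder.
N = 8192
PASSES = 50

rng = np.random.default_rng(0)
a = rng.random((N, N))
b = rng.random((N, N))
reference = None

for i in range(PASSES):
    c = a @ b                                  # hammers the FPU and memory bandwidth
    checksum = float(c[::257, ::257].sum())    # cheap spot-check of part of the result
    if reference is None:
        reference = checksum
    elif checksum != reference:
        # Same inputs should give the same answer every pass on the same box;
        # if your BLAS reorders reductions between runs, compare with a tolerance.
        print(f"pass {i}: checksum changed - possible hardware error")
        break
    print(f"pass {i}: ok")
```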

 

Auto mode does *not* imply stability, I'm afraid. :P

Link to comment

Thanks for the feedback, everyone.  While we haven't seen any defective CPUs to date, that certainly doesn't mean that it can't happen.  We'll consider adding prime95 into our testing suite.  We currently burn in each server we sell for approximately two weeks before considering it ready to ship.  However, most of our tests focus on the SATA controllers (both on the motherboard and PCIe cards), network controller, power supply, and RAM, as these are the most common areas of issues and hardware incompatibility in unRAID servers.

 

Weebo, we're definitely considering adding badblocks into our testing suite as well, although I'm curious to hear your response to Joe's concern about badblocks only testing the drive's cache.  Thoughts?

 

Concorde Rules: Keep in mind that most unRAID servers are designed for stability and reliability as a NAS and nothing else.  Most servers won't see above 10% CPU usage in their entire lifespan, and the RAM cache is often bypassed as large media files are copied to and from the server (remember that a server with 1 GB of RAM or 8 GB of RAM will perform identically when a 40 GB BluRay ISO is being written to the array).  Essentially, the Northbridge, CPU, and RAM are vastly underutilized in most cases (case in point - how many other modern systems do you know that can run well on just 512 MB of RAM?).  It is absolutely feasible that an unRAID server with a CPU that is unstable at 90%+ utilization or a memory controller that has issues with 4 sticks of RAM can live a long and happy life without the user ever having a single problem.  It is the Southbridge, SATA controllers, and NIC that are under load and need to be extensively tested.  While I'm all for running tests when it is trivial to do so, you also have to keep in mind your return on investment (in terms of time and labor).

Link to comment

Concorde Rules: Keep in mind that most unRAID servers are designed for stability and reliability as a NAS and nothing else.  Most servers won't see above 10% CPU usage in their entire lifespan, and the RAM cache is often bypassed as large media files are copied to and from the server (remember that a server with 1 GB of RAM or 8 GB of RAM will perform identically when a 40 GB BluRay ISO is being written to the array).  Essentially, the Northbridge, CPU, and RAM are vastly underutilized in most cases (case in point - how many other modern systems do you know that can run well on just 512 MB of RAM?).  It is absolutely feasible that an unRAID server with a CPU that is unstable at 90%+ utilization or a memory controller that has issues with 4 sticks of RAM can live a long and happy life without the user ever having a single problem.  It is the Southbridge, SATA controllers, and NIC that are under load and need to be extensively tested.  While I'm all for running tests when it is trivial to do so, you also have to keep in mind your return on investment (in terms of time and labor).

 

From my observations, even with my CPU at 3.2GHz, shfs uses >50% of a core, and if you're really hammering it with multiple connections it hits 80% of the total load of a dual core.

 

Unless my server is broken, it certainly doesn't sit at 10% or less all its life. And besides, this is a what-if situation: one day the client might hammer the CPU...

 

If your built servers are on-site for two weeks, why can't you spend a day or so running Prime95 in Blend mode at the CPU's max operating temp? (I remove the exhaust fan from my setup, which makes sure no cool air is passing over anything, forcing everything to work at a higher temp.)

Really hammer it, then, if you're also doing I/O tasks while all cores are fully loaded :P

Link to comment

Weebo, we're definitely considering adding badblocks into our testing suite as well, although I'm curious to hear your response to Joe's concern about badblocks only testing the drive's cache.  Thoughts?

 

I have not seen that comment, but I'll have to search for it. If you know where it is, please post a link.

Link to comment

If your built servers are on-site for two weeks, why can't you spend a day or so running Prime95 in Blend mode at the CPU's max operating temp? (I remove the exhaust fan from my setup, which makes sure no cool air is passing over anything, forcing everything to work at a higher temp.)

Really hammer it, then, if you're also doing I/O tasks while all cores are fully loaded :P

 

I think a Prime95 test alongside the disk stress test should be good enough. I wouldn't turn off fans so the inside gets hot.

I've never done that and do not feel the need to, nor would I. It's not a real-world case.

If a fan dies, the fan should be replaced.

From what I've seen, Intel chips these days are not as prone to high-stress failure as they were many years ago.

Where I do see issues is in heatsink/paste/thermal handling. I have a couple of boards where the BIOS will start to beep when I stress the machine heavily. On those I think the heatsink compound is poor or aged. That's a pretty valid point for testing with a CPU-intensive application, as is memory bandwidth exercising.

Link to comment

If your built servers are on-site for two weeks, why can't you spend a day or so running Prime95 in Blend mode at the CPU's max operating temp? (I remove the exhaust fan from my setup, which makes sure no cool air is passing over anything, forcing everything to work at a higher temp.)

Really hammer it, then, if you're also doing I/O tasks while all cores are fully loaded :P

 

I think a Prime95 test alongside the disk stress test should be good enough. I wouldn't turn off fans so the inside gets hot.

I've never done that and do not feel the need to, nor would I. It's not a real-world case.

If a fan dies, the fan should be replaced.

From what I've seen, Intel chips these days are not as prone to high-stress failure as they were many years ago.

Where I do see issues is in heatsink/paste/thermal handling. I have a couple of boards where the BIOS will start to beep when I stress the machine heavily. On those I think the heatsink compound is poor or aged. That's a pretty valid point for testing with a CPU-intensive application, as is memory bandwidth exercising.

 

Well, if you have the shutdown limit set below the CPU's upper limit then I suppose you're covered.

I just feel that for such an important machine, if I know it can survive operating at the utter limit for 72 hours, then when it gets there by accident it will be OK.

Link to comment

If your built servers are on-site for two weeks, why can't you spend a day or so running Prime95 in Blend mode at the CPU's max operating temp? (I remove the exhaust fan from my setup, which makes sure no cool air is passing over anything, forcing everything to work at a higher temp.)

Really hammer it, then, if you're also doing I/O tasks while all cores are fully loaded :P

 

I think a Prime95 test alongside the disk stress test should be good enough. I wouldn't turn off fans so the inside gets hot.

I've never done that and do not feel the need to, nor would I. It's not a real-world case.

If a fan dies, the fan should be replaced.

From what I've seen, Intel chips these days are not as prone to high-stress failure as they were many years ago.

Where I do see issues is in heatsink/paste/thermal handling. I have a couple of boards where the BIOS will start to beep when I stress the machine heavily. On those I think the heatsink compound is poor or aged. That's a pretty valid point for testing with a CPU-intensive application, as is memory bandwidth exercising.

 

Well, if you have the shutdown limit set below the CPU's upper limit then I suppose you're covered.

I just feel that for such an important machine, if I know it can survive operating at the utter limit for 72 hours, then when it gets there by accident it will be OK.

 

It makes sense and warrants discussion.  If I were building machines for sale, I might do that.

Ideally I would build some form of alarm system into the machine, or do it via software monitoring.
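
Even something this crude would cover the software-monitoring side (a sketch; the 70C limit, poll interval and shutdown action are placeholders you'd tune per machine, and shutting down needs root):

```python
import subprocess
import time
from pathlib import Path

# Sketch of a software temperature alarm: poll the thermal zones and only act
# on sustained overheating, not a single spike. The threshold, interval and
# the action taken are all placeholders.
LIMIT_C = 70.0
POLL_SECONDS = 30
STRIKES_BEFORE_ACTION = 4

def hottest_zone_c():
    temps = []
    for zone in Path("/sys/class/thermal").glob("thermal_zone*/temp"):
        try:
            temps.append(int(zone.read_text().strip()) / 1000.0)
        except (OSError, ValueError):
            pass
    return max(temps) if temps else 0.0

strikes = 0
while True:
    temp = hottest_zone_c()
    strikes = strikes + 1 if temp > LIMIT_C else 0
    if strikes >= STRIKES_BEFORE_ACTION:
        print(f"ALARM: {temp:.1f} C sustained over {STRIKES_BEFORE_ACTION} polls - shutting down")
        subprocess.run(["shutdown", "-h", "now"])   # or send an email/notification instead
        break
    time.sleep(POLL_SECONDS)
```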

Link to comment

I almost always buy workstation- or server-class hardware and always run everything at stock speeds/voltages, because to me overclocking can only make things less stable. Now, my less stable might be another person's stable enough.

For my classification of stable (talking servers here, because desktop applications throw a wrench into things), I expect my server to be online 100% of the time for 3-4 years with no failures.

So how did you confirm that your hardware is all working OK? Do you just assume it is?

Just because it's server-class hardware doesn't mean you couldn't have got a duff one.

 

 

 

I assume the CPU and motherboard work if they boot.  I don't know if Supermicro etc have better QA but I have yet to have a problem.  I typically buy open-box specials too...

 

As for memory, most of them use ECC RAM.  If a stick is bad it gets flagged as such and is either kept in use and corrected, or disabled from use entirely.  I.e., if one is bad it tells me; no BSOD here.  Sometimes I run a 1-2 pass memtest, sometimes I don't.

 

I've never had bad RAM bring one of my machines down (I have had a couple of bad sticks, since I buy almost 100% used off of eBay) and never had bad RAM bring down one of the servers I deal with at work (currently 60+ servers and 8+TB of RAM).  ECC is nice in this regard.

 

As the saying goes, you get what you pay for.

 

 

 

Link to comment