System Stability Testing - How do you test and what do you consider "stable"?


Interstellar

Recommended Posts

Right,

 

An interesting question popped into my head recently (and a quick search didn't show any major discussion on the subject):

 

What do you consider "stable"?

 

For example, I've had computers that were unstable at stock settings, i.e. with the motherboard on auto settings.

 

I've also had chips that will do 1GHz above stock and not throw an error for 24 hours.

 

So my questions are:

 

1. How long do you run your stability tests before you consider it stable?

2. What do you test with? (Memtest86+ for memory, obviously, but CPU/Northbridge/Memory Controller?)

3. What temperatures do you test at?

 

 

For example, I have a 3.2GHz X2 Black Edition, which I'm currently testing at 3.2GHz at 1.3V (+0.025V above stock) with all four cores unlocked.

I've also tested at 3.4GHz at 1.3V with just two cores unlocked, which passed 24 hours of Prime95 Small FFTs with no worries at all.

 

It's hovering around 52C and has passed three hours of Prime95 In-Place FFTs (Max heat, etc) so far (aiming for 24 hours or maybe even 48!).

 

So how do you guys determine what defines "stable"?

 

In my mind it's simply a case of the probability that the CPU or hardware will throw an error. This probability increases with:

1. Decreased voltage at constant frequency.

2. Increased frequency at constant voltage.

3. Increased temperature at constant voltage and frequency.

 

Given everything else is equal...
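
Purely to illustrate the idea (my own back-of-envelope notation, nothing rigorous), you could write the chance of seeing at least one error during a test of length t as something like:

```latex
P_{\text{err}}(t) = 1 - e^{-\lambda(V,\,f,\,T)\,t},
\qquad
\frac{\partial\lambda}{\partial V} < 0,\quad
\frac{\partial\lambda}{\partial f} > 0,\quad
\frac{\partial\lambda}{\partial T} > 0
```

where lambda is the underlying error rate for a given voltage, frequency and temperature. "Stable" then just means you've tested long enough to convince yourself that rate is low enough for what the machine does.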

 

So it's certain that every single computer will, at some point in its life, throw at least one error, but how do you determine the limits?

Link to comment

I almost always buy workstation- or server-class hardware and always run everything at stock speeds/voltages, because to me overclocking can only make things less stable. Now, my less stable might be another person's stable enough.

For my classification of stable (talking servers here, because desktop applications throw a wrench into things), I expect my server to be online 100% of the time for 3-4 years with no failures.

Link to comment

I almost always buy workstation- or server-class hardware and always run everything at stock speeds/voltages, because to me overclocking can only make things less stable. Now, my less stable might be another person's stable enough.

For my classification of stable (talking servers here, because desktop applications throw a wrench into things), I expect my server to be online 100% of the time for 3-4 years with no failures.

 

So how did you confirm that your hardware is all working OK? Do you just assume it is?

Just because it's server-class hardware doesn't mean you couldn't have got a duff one.

 

 

Link to comment

My generally accepted time-frame for a fairly solid stability test is 8-24 hours; if time is tight, around 1-2 hours. If you start a test before you go to bed and it still hasn't crashed when you get up in the morning, it probably isn't going to. At that point you either run a different test or change your settings around and go again.

 

I've seen memory tests fail on the 150th pass (~3 days), and drives pass manufacturer tests yet show a noticeable performance issue when benchmarked, such as a giant dip in read/write speed. When problems like this come up it's often faster to start replacing parts until things start working again, then work your way back.

 

It does make you wonder, though: how long is "too long" to run a memory test? At some point a random read/write error becomes likely... even RAM that would be considered perfectly fine would fail a memory test, given enough time.

 

Some testing stuff:

 

  • CPU: StressCPU, Prime95, IntelBurnTest
  • CPU Temperature: VERY dependent on what CPU you have; generally, though, ~40C or less at idle and <60C under load is ideal
  • Memory: Memtest86+, Windows memory test
  • HDD: CHKDSK, manufacturer's tests (SeaTools, WDDiag, HUtil (crap)) & HDTune, full zeros/long copies, md5sum
  • Motherboard: Reset BIOS to defaults and/or flash BIOS. Otherwise process of elimination
  • PSU: A dedicated PSU tester and/or multimeter, alternatively another (working) PSU

 

Other than that, stick to brand-name parts; e.g. a Thermaltake/Corsair 450W PSU is much better than a generic one.

 

 

Link to comment

I personally do everything I think my machine should be doing. In other words, in the case of an unRAID server:

 

I copied files to it as I would normally do. Heck, I even copied, moved and deleted on multiple machines at the same time.

When I was testing its ability to serve files, I fired up 4 machines, connected to it, and watched 4 different 720p video files at the same time.

 

I did this over the course of a few days and watched its temps. I ran multiple parity checks. I purposely disconnected a drive to simulate a failed drive and had it rebuild.

 

After 4 or so days I decided it had passed, in my mind, every possible problem/situation I would need it to handle, and stamped it done.

 

Did I overclock or underclock it? NO, because that was neither my intention nor a design need.

Link to comment

I am all for testing hardware stability, but this just seems like overkill... especially for an unRAID system.

 

For unRAID, a 24-hour memtest, a parity build and a couple of checks, and running 3-4 preclears at the same time should be enough.  Whenever testing a customer's build I usually load it as full as possible with drives and test starting and stopping the array, rebooting over and over, etc., to verify that the PSU can handle a full load.  Nearly all drives I test with are older 7200RPM drives, while most people seem to be moving towards the "Green" drives now.

Link to comment

I am all for testing hardware stability, but this just seems like overkill... especially for an unRAID system.

 

For unRAID, a 24-hour memtest, a parity build and a couple of checks, and running 3-4 preclears at the same time should be enough.  Whenever testing a customer's build I usually load it as full as possible with drives and test starting and stopping the array, rebooting over and over, etc., to verify that the PSU can handle a full load.  Nearly all drives I test with are older 7200RPM drives, while most people seem to be moving towards the "Green" drives now.

 

But that doesn't put the CPU/Northbridge/Memory Controller under any load whatsoever? Memtest does NOT fully test a memory controller either...

 

Unless you hammer your hardware at 90% of the CPU's Tcase limit, it could be unstable. [And to be honest you should test at just under the Tcase limit.]

 

Given how easy it is to run Prime95 under Linux, if you're sending systems off to customers, surely spending 24 hours running Blend at 90% of the CPU's thermal limit is the minimum you should do?
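
To show what I mean, here's a rough sketch of the sort of thing I run (assuming the Linux mprime binary is on the PATH and already configured from a first interactive run, and that /sys/class/thermal exposes the CPU temperature - both are assumptions you'd adjust for your own board):

```python
import subprocess
import time
from pathlib import Path

# Rough sketch: run the Prime95/mprime torture test for a fixed period and log
# the hottest thermal zone once a minute. The binary name, sysfs path and the
# 24-hour window are assumptions - adjust for your own hardware.
HOURS = 24

def max_temp_c():
    temps = []
    for zone in Path("/sys/class/thermal").glob("thermal_zone*/temp"):
        try:
            temps.append(int(zone.read_text().strip()) / 1000.0)
        except (OSError, ValueError):
            pass
    return max(temps) if temps else float("nan")

proc = subprocess.Popen(["mprime", "-t"])          # -t = torture test
deadline = time.time() + HOURS * 3600
try:
    while time.time() < deadline:
        if proc.poll() is not None:                # exited early: go look at the worker output
            print("mprime stopped before the deadline - check its results output")
            break
        print(f"{time.strftime('%H:%M:%S')}  hottest zone: {max_temp_c():.1f} C")
        time.sleep(60)
finally:
    if proc.poll() is None:
        proc.terminate()
```

If it runs the full window with no workers stopping and temperatures where you expect them, that's one data point towards "stable" at that voltage/frequency/temperature.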

 

I tested mine at a constant 55C under load (62C limit for my chip, I believe), then to be doubly sure I've upped the under-load fan speeds so that even if the room doubles in temperature (20C -> 40C) the CPU will still be at 55C or lower.

 

In addition to that I will be testing for 24 hours at each C-state (which I've customised for low power usage - C3 @ 800mV, not 1000mV!) so I know my CPU is 24-hours-or-more stable at every state, frequency, voltage and temperature :/

Link to comment

we don't overclock/underclock/change anything like that on our unRAID systems.

 

Memtest will sufficiently test memory for our needs, and our file copy and verify operations exercise constant disk activity along with network activity.

I believe in testing, but not to the extent you are going to for a server.  For a desktop gaming machine I might do a slightly different suite of testing, but probably would not go to the extent you are.

Your testing would add a week or more on top of what we already do.

Link to comment

I am all for testing hardware stability, but this just seems like overkill... especially for an unRAID system.

 

For unRAID, a 24-hour memtest, a parity build and a couple of checks, and running 3-4 preclears at the same time should be enough.  Whenever testing a customer's build I usually load it as full as possible with drives and test starting and stopping the array, rebooting over and over, etc., to verify that the PSU can handle a full load.  Nearly all drives I test with are older 7200RPM drives, while most people seem to be moving towards the "Green" drives now.

 

I plan to use badblocks on all my new drives from now on.  4 passes of specific bit patterns should suffice to exercise the drive enough.

 

You would be surprised at how many drives will time out while attempting to error-correct. It's one way to weed out questionable drives.  The badblocks test writes 4 patterns:

 

0xaa  10101010

0x55  01010101

0xff  11111111

0x00  00000000

 

I plan to add a custom test that will create a file on the filesystem over the LAN with each specific bit pattern to fill a drive, then compare a calculated md5sum against what is read back.
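
Roughly what I have in mind, as a sketch (the share path and file size are placeholders, and in practice the file needs to be bigger than the RAM on both ends so the read-back actually comes off the disk rather than a cache):

```python
import hashlib

# Sketch of the custom test described above: fill files on the server's share
# with each badblocks-style bit pattern, then re-read them and compare md5sums.
# SHARE and SIZE_GB are placeholders for the real mount point and test size.
SHARE = "/mnt/server/testshare"
SIZE_GB = 4
CHUNK = 1024 * 1024  # 1 MiB
PATTERNS = {"aa": 0xAA, "55": 0x55, "ff": 0xFF, "00": 0x00}

def write_pattern(path, byte_value, size_gb):
    """Write size_gb GiB of a repeating byte and return the md5 of what was sent."""
    chunk = bytes([byte_value]) * CHUNK
    md5 = hashlib.md5()
    with open(path, "wb") as f:
        for _ in range(size_gb * 1024):
            f.write(chunk)
            md5.update(chunk)
    return md5.hexdigest()

def read_back(path):
    """Re-read the file and return the md5 of what actually came back."""
    md5 = hashlib.md5()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(CHUNK), b""):
            md5.update(block)
    return md5.hexdigest()

for name, value in PATTERNS.items():
    path = f"{SHARE}/pattern_{name}.bin"
    sent = write_pattern(path, value, SIZE_GB)
    got = read_back(path)
    print(f"0x{name}: {'OK' if sent == got else 'MISMATCH - suspect drive, controller or RAM'}")
```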

Link to comment

I am all for testing hardware stability, but this just seems like overkill... especially for an unRAID system.

 

For unRAID, a 24-hour memtest, a parity build and a couple of checks, and running 3-4 preclears at the same time should be enough.  Whenever testing a customer's build I usually load it as full as possible with drives and test starting and stopping the array, rebooting over and over, etc., to verify that the PSU can handle a full load.  Nearly all drives I test with are older 7200RPM drives, while most people seem to be moving towards the "Green" drives now.

 

But that doesn't put the CPU/Northbridge/Memory Controller under any load whatsoever? Memtest does NOT fully test a memory controller either...

Unless you hammer your hardware at 90% of the CPU's Tcase limit, it could be unstable. [And to be honest you should test at just under the Tcase limit.]

Given how easy it is to run Prime95 under Linux, if you're sending systems off to customers, surely spending 24 hours running Blend at 90% of the CPU's thermal limit is the minimum you should do?

I tested mine at a constant 55C under load (62C limit for my chip, I believe), then to be doubly sure I've upped the under-load fan speeds so that even if the room doubles in temperature (20C -> 40C) the CPU will still be at 55C or lower.

In addition to that I will be testing for 24 hours at each C-state (which I've customised for low power usage - C3 @ 800mV, not 1000mV!) so I know my CPU is 24-hours-or-more stable at every state, frequency, voltage and temperature :/

 

I might agree that if I were building a lot of machines I would spend a day or so running Prime95 while doing other tests. It does tax the machine in unique ways and I've seen it catch questionable builds.

Link to comment

we don't overclock/underclock/change anything like that on our unRAID systems.

 

Memtest will sufficiently test memory for our needs, and our file copy and verify operations exercise constant disk activity along with network activity.

I believe in testing, but not to the extent you are going to for a server.  For a desktop gaming machine I might do a slightly different suite of testing, but probably would not go to the extent you are.

Your testing would add a week or more on top of what we already do.

 

It's irrelevant whether they are at stock, overclocked or anything else.

 

I've had systems unstable at load at stock settings.

 

Memtest86+ tests the sticks individually, which hardly puts any load on the memory controller.

 

My tests on my setup have shown that:

1. Memtest86+ is stable for 24 hours with 4 sticks at auto settings.

2. The CPU is stable for 72 hours of Prime95 and for Linpack set to 1GB of RAM for 50 passes.

3. It is not stable when you run Linpack using 7.2GB in Windows (i.e. maximum CPU, Northbridge and memory I/O load), which clearly shows it is the memory controller that has issues with 4 sticks.

 

Bumping the CPU's NB voltage up to 1.2V (instead of 1.175V), or reducing the timings/frequency, solves this problem.
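
If anyone wants to reproduce that kind of combined CPU + memory-controller load without LinX, a crude stand-in (my own approximation, not Linpack itself) is to loop large double-precision matrix multiplies that don't fit in cache and spot-check the result each pass:

```python
import numpy as np

# Crude Linpack-style load: repeated large float64 matrix multiplies with a
# running spot-check. N=8192 gives roughly a 1.5GB working set (three 512MB
# matrices) - size it to chew up most of your RAM if you want to stress the
# memory controller harder.
N = 8192
PASSES = 50

rng = np.random.default_rng(0)
a = rng.random((N, N))
b = rng.random((N, N))
reference = None

for i in range(PASSES):
    c = a @ b                                  # hammers the FPU and memory bandwidth
    checksum = float(c[::257, ::257].sum())    # cheap spot-check of part of the result
    if reference is None:
        reference = checksum
    elif checksum != reference:
        # Same inputs should give the same answer every pass on the same box;
        # if your BLAS reorders reductions between runs, compare with a tolerance.
        print(f"pass {i}: checksum changed - possible hardware error")
        break
    print(f"pass {i}: ok")
```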

 

Auto mode does *not* imply stability, I'm afraid. :P

Link to comment

Thanks for the feedback, everyone.  While we haven't seen any defective CPUs to date, that certainly doesn't mean that it can't happen.  We'll consider adding prime95 into our testing suite.  We currently burn in each server we sell for approximately two weeks before considering it ready to ship.  However, most of our tests focus on the SATA controllers (both on the motherboard and PCIe cards), network controller, power supply, and RAM, as these are the most common areas of issues and hardware incompatibility in unRAID servers.

 

Weebo, we're definitely considering adding badblocks into our testing suite as well, although I'm curious to hear your response to Joe's concern about badblocks only testing the drive's cache.  Thoughts?

 

Concorde Rules: Keep in mind that most unRAID servers are designed for stability and reliability as a NAS and nothing else.  Most servers won't see above 10% CPU usage in their entire lifespan, and the RAM cache is often bypassed as large media files are copied to and from the server (remember that a server with 1 GB of RAM or 8 GB of RAM will perform identically when a 40 GB BluRay ISO is being written to the array).  Essentially, the Northbridge, CPU, and RAM are vastly underutilized in most cases (case in point - how many other modern systems do you know that can run well on just 512 MB of RAM?).  It is absolutely feasible that an unRAID server with a CPU that is unstable at 90%+ utilization or a memory controller that has issues with 4 sticks of RAM can live a long and happy life without the user ever having a single problem.  It is the Southbridge, SATA controllers, and NIC that are under load and need to be extensively tested.  While I'm all for running tests when it is trivial to do so, you also have to keep in mind your return on investment (in terms of time and labor).

Link to comment

Concorde Rules: Keep in mind that most unRAID servers are designed for stability and reliability as a NAS and nothing else.  Most servers won't see above 10% CPU usage in their entire lifespan, and the RAM cache is often bypassed as large media files are copied to and from the server (remember that a server with 1 GB of RAM or 8 GB of RAM will perform identically when a 40 GB BluRay ISO is being written to the array).  Essentially, the Northbridge, CPU, and RAM are vastly underutilized in most cases (case in point - how many other modern systems do you know that can run well on just 512 MB of RAM?).  It is absolutely feasible that an unRAID server with a CPU that is unstable at 90%+ utilization or a memory controller that has issues with 4 sticks of RAM can live a long and happy life without the user ever having a single problem.  It is the Southbridge, SATA controllers, and NIC that are under load and need to be extensively tested.  While I'm all for running tests when it is trivial to do so, you also have to keep in mind your return on investment (in terms of time and labor).

 

From my observations, even with my CPU at 3.2GHz, shfs uses >50% of a core, and if you're really hammering it with multiple connections it hits 80% of the total load of a dual core.

 

Unless my server is broken, it certainly doesn't sit at 10% or less all its life. And besides, this is a what-if situation: one day the client might hammer the CPU...

 

If your built servers are on-site for two weeks, why can't you spend a day or so running Prime95 in Blend mode at the CPU's max operating temp? (I remove the exhaust fan from my setup, which makes sure no cool air is passing over anything, forcing everything to work at a higher temp.)

Really hammer it, then, if you're also doing I/O tasks while all cores are fully loaded :P

Link to comment

Weebo, we're definitely considering adding badblocks into our testing suite as well, although I'm curious to hear your response to Joe's concern about badblocks only testing the drive's cache.  Thoughts?

 

I have not seen that comment, but I'll have to search for it. If you know where it is, please post a link.

Link to comment

If your built servers are on-site for two weeks, why can't you spend a day or so running Prime95 in Blend mode at the CPU's max operating temp? (I remove the exhaust fan from my setup, which makes sure no cool air is passing over anything, forcing everything to work at a higher temp.)

Really hammer it, then, if you're also doing I/O tasks while all cores are fully loaded :P

 

I think a Prime95 test alongside the disk stress test should be good enough. I wouldn't turn off fans so the inside gets hot.

I've never done that and do not feel the need to, nor would I. It's not a real-world case.

If a fan dies, the fan should be replaced.

From what I've seen, Intel chips these days are not as prone to high-stress failure as they were many years ago.

Where I do see issues is in heatsink/paste/thermal handling. I have a couple of boards where the BIOS will start to beep when I stress the machine heavily. On those I think the heatsink compound is poor or aged. That's a pretty valid point for testing with a CPU-intensive application, as is memory bandwidth exercising.

Link to comment

If your built servers are on-site for two weeks, why can't you spend a day or so running Prime95 in Blend mode at the CPU's max operating temp? (I remove the exhaust fan from my setup, which makes sure no cool air is passing over anything, forcing everything to work at a higher temp.)

Really hammer it, then, if you're also doing I/O tasks while all cores are fully loaded :P

 

I think a Prime95 test alongside the disk stress test should be good enough. I wouldn't turn off fans so the inside gets hot.

I've never done that and do not feel the need to, nor would I. It's not a real-world case.

If a fan dies, the fan should be replaced.

From what I've seen, Intel chips these days are not as prone to high-stress failure as they were many years ago.

Where I do see issues is in heatsink/paste/thermal handling. I have a couple of boards where the BIOS will start to beep when I stress the machine heavily. On those I think the heatsink compound is poor or aged. That's a pretty valid point for testing with a CPU-intensive application, as is memory bandwidth exercising.

 

Well, if you have the shutdown limit set below the CPU's upper limit then I suppose you're covered.

I just feel that for such an important machine, if I know it can survive operating at the utter limit for 72 hours, then when it gets there by accident it will be OK.

Link to comment

If your built servers are on-site for two weeks, why can't you spend a day or so running Prime95 in Blend mode at the CPU's max operating temp? (I remove the exhaust fan from my setup, which makes sure no cool air is passing over anything, forcing everything to work at a higher temp.)

Really hammer it, then, if you're also doing I/O tasks while all cores are fully loaded :P

 

I think a Prime95 test alongside the disk stress test should be good enough. I wouldn't turn off fans so the inside gets hot.

I've never done that and do not feel the need to, nor would I. It's not a real-world case.

If a fan dies, the fan should be replaced.

From what I've seen, Intel chips these days are not as prone to high-stress failure as they were many years ago.

Where I do see issues is in heatsink/paste/thermal handling. I have a couple of boards where the BIOS will start to beep when I stress the machine heavily. On those I think the heatsink compound is poor or aged. That's a pretty valid point for testing with a CPU-intensive application, as is memory bandwidth exercising.

 

Well, if you have the shutdown limit set below the CPU's upper limit then I suppose you're covered.

I just feel that for such an important machine, if I know it can survive operating at the utter limit for 72 hours, then when it gets there by accident it will be OK.

 

It makes sense and warrants discussion.  If I were building machines for sale, I might do that.

Ideally I would build some form of alarm system into the machine, or do it via software monitoring.
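
Even something this crude would cover the software-monitoring side (a sketch; the 70C limit, poll interval and shutdown action are placeholders you'd tune per machine, and shutting down needs root):

```python
import subprocess
import time
from pathlib import Path

# Sketch of a software temperature alarm: poll the thermal zones and only act
# on sustained overheating, not a single spike. The threshold, interval and
# the action taken are all placeholders.
LIMIT_C = 70.0
POLL_SECONDS = 30
STRIKES_BEFORE_ACTION = 4

def hottest_zone_c():
    temps = []
    for zone in Path("/sys/class/thermal").glob("thermal_zone*/temp"):
        try:
            temps.append(int(zone.read_text().strip()) / 1000.0)
        except (OSError, ValueError):
            pass
    return max(temps) if temps else 0.0

strikes = 0
while True:
    temp = hottest_zone_c()
    strikes = strikes + 1 if temp > LIMIT_C else 0
    if strikes >= STRIKES_BEFORE_ACTION:
        print(f"ALARM: {temp:.1f} C sustained over {STRIKES_BEFORE_ACTION} polls - shutting down")
        subprocess.run(["shutdown", "-h", "now"])   # or send an email/notification instead
        break
    time.sleep(POLL_SECONDS)
```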

Link to comment

I almost always buy workstation- or server-class hardware and always run everything at stock speeds/voltages, because to me overclocking can only make things less stable. Now, my less stable might be another person's stable enough.

For my classification of stable (talking servers here, because desktop applications throw a wrench into things), I expect my server to be online 100% of the time for 3-4 years with no failures.

So how did you confirm that your hardware is all working OK? Do you just assume it is?

Just because it's server-class hardware doesn't mean you couldn't have got a duff one.

 

 

 

I assume the CPU and motherboard work if they boot.  I don't know if Supermicro etc have better QA but I have yet to have a problem.  I typically buy open-box specials too...

 

As for memory, most of them use ECC RAM.  If a stick is bad it gets flagged as such and is either kept in use and corrected, or disabled from use entirely.  I.e., if one is bad it tells me; no BSOD here.  Sometimes I run a 1-2 pass memtest, sometimes I don't.

 

I've never had bad RAM bring one of my machines down (I have had a couple of bad sticks, since I buy almost 100% used off of eBay) and never had bad RAM bring down one of the servers I deal with at work (currently 60+ servers and 8+TB of RAM).  ECC is nice in this regard.

 

As the saying goes, you get what you pay for.

 

 

 

Link to comment