
preclear vs extended SMART test


pyrater

Recommended Posts

I believe it has to do with the fact that a preclear is a more rigorous 'burn in' of the drive, and if the drive is going to fail, it is more likely to do so sooner rather than later. The preclear rigorously reads, writes, then wipes the drive, whereas a SMART test is nowhere near as rigorous. That is my opinion; I am not stating that I know this for a fact.


Preclear does a complete read, a complete write, and another complete read. This is designed to give the SMART system the chance to notice any weak areas on the disk that require reallocation of sectors. I cannot say exactly what the extended SMART test does, but it takes far less time and I have to believe it is not as thorough. There are probably other burn-in programs that are equal to or perhaps better than preclear, but it is a pretty good test IMO.
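If you want to kick off the extended test yourself, a minimal sketch using smartmontools' smartctl could look like the following (assumes smartmontools is installed and root access; /dev/sdb is just a placeholder device name):

```python
# Minimal sketch: start an extended (long) SMART self-test with smartctl and
# read the self-test log back afterwards. Assumes smartmontools is installed
# and root access; /dev/sdb is a placeholder device name.
import subprocess

DEVICE = "/dev/sdb"  # placeholder -- point this at the disk you actually mean

# Kick off the extended self-test; the drive runs it internally and it can
# take many hours on a large disk.
subprocess.run(["smartctl", "-t", "long", DEVICE], check=True)

# Later, check the outcome; the most recent log entry should read
# "Completed without error" if the full surface read passed.
result = subprocess.run(["smartctl", "-l", "selftest", DEVICE],
                        capture_output=True, text=True, check=True)
print(result.stdout)
```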

12 hours ago, johnnie.black said:

The extended test is a complete surface read.

I would assume the extended SMART test is the same test the factory does before shipping (but they then clear all SMART counters, logs, etc. so as not to leave any trace of the hours the disk was running during testing).

 

But since the drive has suffered unknown amounts of vibration, shock, temperature swings, etc. during the full route from factory to end user, it definitely doesn't hurt to test-write to every sector and then let the extended test verify that all sectors are still trustworthy.
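As a rough illustration of the "test-write to every sector" idea, a destructive zero-fill pass over a raw device could look roughly like this sketch (this is not the actual preclear script, which does much more; /dev/sdX is a placeholder):

```python
# DESTRUCTIVE sketch of the "write every sector" idea: stream zeros over the
# whole raw device so the firmware gets a chance to detect and remap weak
# sectors on write. This wipes the disk; the real preclear script also does
# read passes before and after and compares SMART counters.
import os

DEVICE = "/dev/sdX"      # placeholder -- never point this at a disk with data
CHUNK = 4 * 1024 * 1024  # write in 4 MiB chunks
zeros = bytes(CHUNK)

with open(DEVICE, "r+b", buffering=0) as disk:
    size = disk.seek(0, os.SEEK_END)  # a block device reports its size via seek
    disk.seek(0)
    written = 0
    while written < size:
        chunk = zeros[: min(CHUNK, size - written)]
        written += disk.write(chunk)
    os.fsync(disk.fileno())

print(f"wrote {written} bytes")
```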

On 2/4/2018 at 4:47 PM, pwm said:

I would assume the extended SMART test is the same test the factory does before shipping (but they then clear all SMART counters, logs, etc. so as not to leave any trace of the hours the disk was running during testing).

 

But since the drive has suffered unknown amounts of vibration, shock, temperature swings, etc. during the full route from factory to end user, it definitely doesn't hurt to test-write to every sector and then let the extended test verify that all sectors are still trustworthy.

 

I very much doubt they are doing a full surface scan on every disk. Maybe a statistical sampling of disks gets checked out, but testing every one would slow down production too much.

5 hours ago, SSD said:

 

I very much doubt they are doing a full surface scan on every disk. Maybe a statistical sampling of disks gets checked out, but testing every one would slow down production too much.

They need to perform a full surface scan.

 

It is not possible to create fault-free surfaces. They need to identify bad sectors and set up an initial list of remapped sectors, because all writes are performed without read-back. So writes depend on trust - one of the trusted parameters is that the sector surface has already been screened for physical faults. And that screening is performed with the drive's own heads. When you buy a disk and see that the remapped count is 0, that just means there has not been any new remapping after the drive left the factory.
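For reference, the remapped count discussed here is SMART attribute 5 (Reallocated_Sector_Ct). A minimal sketch for pulling it, along with the pending and offline-uncorrectable counters, out of smartctl output might look like this (assumes smartmontools is installed; /dev/sdb is a placeholder device name):

```python
# Minimal sketch: pull the reallocation-related counters out of "smartctl -A"
# output. Attribute 5 (Reallocated_Sector_Ct) is the "remapped count" above;
# 197/198 (pending / offline-uncorrectable) are worth watching alongside it.
import subprocess

DEVICE = "/dev/sdb"            # placeholder device name
WATCHED = {"5", "197", "198"}  # attribute IDs to report

output = subprocess.run(["smartctl", "-A", DEVICE],
                        capture_output=True, text=True, check=True).stdout

for line in output.splitlines():
    fields = line.split()
    # Attribute rows start with the numeric ID; the raw value is the last column.
    if fields and fields[0] in WATCHED:
        print(f"{fields[1]}: raw value {fields[-1]}")
```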

 

There are several very lengthy steps required when manufacturing disks that just can't be avoided. Even before the surface testing, it's often the drives themselves that spend a long time writing down the servo information that allows the final HDD to find the individual tracks and sectors.

 

Remember that the drives can perform these tests by themselves. And in the factory you don't need a SATA controller for the testing, so it's possible to test a huge number of devices concurrently.

2 hours ago, pwm said:

They need to perform a full surface scan.

 

It is not possible to create fault-free surfaces. They need to identify bad sectors and set up an initial list of remapped sectors, because all writes are performed without read-back. So writes depend on trust - one of the trusted parameters is that the sector surface has already been screened for physical faults. And that screening is performed with the drive's own heads. When you buy a disk and see that the remapped count is 0, that just means there has not been any new remapping after the drive left the factory.

 

There are several very lengthy steps required when manufacturing disks that just can't be avoided. Even before the surface testing, it's often the drives themselves that spend a long time writing down the servo information that allows the final HDD to find the individual tracks and sectors.

 

Remember that the drives can perform these tests by themselves. And in the factory you don't need a SATA controller for the testing, so it's possible to test a huge number of devices concurrently.

 

You sound like you have first-hand knowledge. Do you work in this field?

 

There is an interesting piece on Tom's Hardware about WD's drive R&D and manufacturing. The link below takes you to the first slide; you will need to go to slide 38, called "Behold Excalibur!", manually.

 

http://www.tomshardware.com/picturestory/525-western-digital-tour.html#s38

 

They reference a machine called Excalibur that does disk testing on every drive, but they don't really explain what it does. They do say each machine can test 5,000 drives in parallel, and they call it long-term testing.

 

A little math shows this is actually possible: 30 Excalibur machines (a number I came up with) x 365 (each bay can do roughly one drive per day, or 365 per year) x 5,000 bays = about 54 million drives, which is roughly the number WD makes in a year. 30 such machines is a reasonable number.
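Spelled out (all of these inputs are the guesses above, not published WD figures):

```python
# Back-of-envelope check of the throughput figures above. All inputs are the
# poster's guesses, not published WD numbers.
machines = 30                   # assumed number of Excalibur machines
bays_per_machine = 5000         # stated capacity per machine
drives_per_bay_per_year = 365   # roughly one drive per bay per day

annual_capacity = machines * bays_per_machine * drives_per_bay_per_year
print(f"{annual_capacity:,} drives per year")  # 54,750,000 -- roughly 54 million
```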

 

Pretty amazing scale!

 

But you'd think that if this process is routinely identifying and internally remapping sectors on every disk, you'd see harmless remapped sectors popping up here and there on drives in the wild. But we NEVER see this. It is overwhelmingly likely that if you see one, you are going to see more and more with every subsequent disk scan (e.g., parity check). If Excalibur is doing this exact same type of remapping operation, yet we very rarely see a drive with bad sectors when we test them, while drives in the wild, after a remapping, virtually always develop more - this doesn't compute for me. It led me to the conclusion that drives were pretty much perfect off the assembly line. Any explanation?

1 minute ago, SSD said:

 

You sound like you have first-hand knowledge. Do you work in this field?

 

There is an interesting piece on Tom's Hardware about WD's drive R&D and manufacturing. Here is a link to the most relevant slide from that material:

 

http://www.tomshardware.com/picturestory/525-western-digital-tour.html#s38

 

They reference a machine called Excalibur that does disk testing on every drive, but they don't really explain what it does. They do say each machine can test 5,000 drives in parallel, and they call it long-term testing.

 

A little math shows this is actually possible: 30 Excalibur machines (a number I came up with) x 365 (each bay can do roughly one drive per day, or 365 per year) x 5,000 bays = about 54 million drives, which is roughly the number WD makes in a year. 30 such machines is a reasonable number.

 

Pretty amazing scale!

 

But you'd think that if this process is routinely identifying and internally remapping sectors on every disk, you'd see harmless remapped sectors popping up here and there on drives in the wild. But we NEVER see this. It is overwhelmingly likely that if you see one, you are going to see more and more with every subsequent disk scan (e.g., parity check). If Excalibur is doing this exact same type of remapping operation, yet we very rarely see a drive with bad sectors when we get them, while drives in the wild, after a remapping, virtually always develop more - this doesn't compute for me. It led me to the conclusion that drives were pretty much perfect off the assembly line. Any explanation?

No, I'm not involved in HDD manufacturing. But I have developed factory test equipment and software for a number of customers and products, and helped a number of other customers design their firmware so that as much functionality as possible can be tested without tying up expensive external test equipment. That's also why most electronics has a number of extra connector pins, or patterns of gold-plated PCB pads, easily accessible.

 

Lots of the equipment I have worked with also requires support for advanced self-tests over the full product lifetime. Infrastructure equipment is often installed and in use for 10 years or more, and it isn't practical to send out a technician unless something really is wrong. So quite often the electronics gets developed with internal loopback support to allow in-system tests of critical subsystems.

 

The HDD is quite an interesting concept - from a mechanical perspective it shouldn't have been possible to do what is actually done. There are a billion jokes about what cars would be like if they had progressed as far as integrated circuits, and it isn't really fair to compare mechanics and electronics. But what if the mechanics of cars had progressed as much as the mechanics of an HDD?


@SSD Remember that the drive does not have to read all the data perfectly off the drive every time. The actual data written to the drive includes substantial error checking and correction information for every sector. I have also seen some indications that even if this first layer doesn't work, there is a higher layer of data recovery for blocks of data as a fallback. Only when all of these fail do you (the user) have any indication that there is a problem.

 

Something to do when you have a few hours to spare is to search and see if you can find what the raw read error rate is (once in a while, you can 'find' this number) and calculate how few TB of data you have to read before you get the first read error. I was shocked the time I did it. (Assuming I didn't make a stupid mistake in my math.) The exact error detecting, correcting, and recovery algorithms are a trade secret for each manufacturer, but you should be able to find at least some of the mathematical theory behind them with a bit of searching.
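A quick version of that calculation, using the non-recoverable read error rate that consumer drive datasheets typically quote (about one unreadable sector per 10^14 bits read; the pre-ECC "raw" error rate is far higher and generally not published, so check your own drive's datasheet):

```python
# Rough illustration of the calculation suggested above: how much data can you
# read, on average, before hitting one unrecoverable read error, given a
# typical consumer-drive spec of 1 error per 1e14 bits read.
bits_per_error = 1e14
terabytes_per_error = bits_per_error / 8 / 1e12   # bits -> bytes -> TB
print(f"~{terabytes_per_error:.1f} TB read per unrecoverable error")  # ~12.5 TB
```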

1 hour ago, Frank1940 said:

@SSD Remember that the drive does not have to read all the data perfectly off the drive every time. The actual data written to the drive includes substantial error checking and correction information for every sector. I have also seen some indications that even if this first layer doesn't work, there is a higher layer of data recovery for blocks of data as a fallback. Only when all of these fail do you (the user) have any indication that there is a problem.

 

Something to do when you have a few hours to spare is to search and see if you can find what the raw read error rate is (once in a while, you can 'find' this number) and calculate how few TB of data you have to read before you get the first read error. I was shocked the time I did it. (Assuming I didn't make a stupid mistake in my math.) The exact error detecting, correcting, and recovery algorithms are a trade secret for each manufacturer, but you should be able to find at least some of the mathematical theory behind them with a bit of searching.

 

Yes, I understand that the media works in conjunction with the redundancy. I'm not sure I realized just how much. But normal is normal, and this normal doesn't trigger sector reallocations.

 

But my question was specifically about the testing done before a drive is completed by the manufacturer. If it is reallocating sectors, why is it that those are stable and don't generate more, whereas every reallocated sector I've seen here in the past 8 years has been quickly followed by more, and reallocated sectors are the leading predictor of drive failure?

5 hours ago, SSD said:

 

Yes, I understand that the media works in conjunction with the redundancy. I'm not sure I realized just how much. But normal is normal, and this normal doesn't trigger sector reallocations.

 

But my question was specifically about the testing done before a drive is completed by the manufacturer. If it is reallocating sectors, why is it that those are stable and don't generate more, whereas every reallocated sector I've seen here in the past 8 years has been quickly followed by more, and reallocated sectors are the leading predictor of drive failure?

Since the drive is analog, there aren't just two simple states, "perfect" and "broken".

 

The drive will consider a sector "good enough" and leave it to statistics what the safety margins are. But since not every single parameter is known to 100%, the safety margin for some specific sector may end up lower than expected. So a few random sector failures may crop up with time. And sometimes it might be external causes (power, vibration, ...) that make the drive falsely mark a sector as bad.

 

But you can be unlucky and have a head crash down on the surface. Or a small dust particle can end up on the surface or head instead of being caught in the clean room or in the filters in the drive. Any number of things can result in a single head losing performance, or in friction grinding off a couple of atom layers from the surface. Or lubrication can pool up (which was the reason for the IBM DeathStar failures), making some part of the surface harder to read and/or write. Wear in the head assembly may potentially affect the alignment. The drive performs a partially "blind" alignment when writing - it uses the read head to listen to the servo tracks while seeking, and then uses an internal lookup table for how much additional offset it needs to align the write head over the intended track. And this offset varies for different parts of the surface.

 

When a drive develops many defects, there is often a distinct locality to the failing sectors. You might manage to recover 99% of the files but then have a few files that each suffer from multiple broken sectors.

 

Maybe we should see it a bit like when a car gets hit by a stone. If the stone hits hard enough, it will damage the rust protection under the paint and, with time, you will get a larger and larger failure as the metal rusts. If the stone didn't hit hard enough, then you just get a tiny nick in the clear coat and nothing more will happen.

 

It's expensive to do the final parts of the factory test on assembled drives, so the factory will pre-screen the platters for failed sputtering. If the factory test has to remap some sectors, it will be because of extremely small defects that will not affect the rest of the platter or the operation of the heads. When you see a drive where the reallocated count starts to tick up, something has happened to the drive. Unless we know we abused the drive, we will never know exactly what went wrong - just that the ticking counters mean we don't merely have some spurious sector that was a tiny bit outside the margin, but that something serious has happened and potentially a very large number of additional sectors will fail.

 

Only the drive manufacturers can tell exactly how catastrophic it would be if a single read head touched the surface and lost 5% or 10% of its signal strength. But when the margins are small, even small accidents can have a big fallout.


Archived

This topic is now archived and is closed to further replies.
