A good, but bad disk... badblocks alone can never be trusted.



Badblocks' equivalent is the -b option, and badblocks seems to ignore bytes that don't constitute a full block.

 

My prior comment may have been in haste; I believe an incomplete final block (of -b size) is ignored.

The -c option is how many -b-sized blocks to read at once.

 

badblocks will calculate how many -b-sized blocks can be read from the total size if the last block is not supplied, with the remainder discarded if the size is not evenly divisible.

If there is a partial block at the end, it probably will ignore it.

Hence we do not see the expected results for the extra byte at the end.

 

What are the exact conditions of your test with a hard drive? That's the telltale.

We can use Joe.L's example from the original post:

 

That's not an accurate example. The numbers there are actually divisible by the block size.

It's based on complete blocks of -b value.

 

I don't think the user of badblocks should burden himself with such calculations. I would expect that every single byte of the disk will be tested, or at least that a warning be given otherwise.

 

Every block should be read unless indivisible values are supplied on the command line or the program is not used as expected.

 

This is why I wanted to see your exact test that dealt with the drive itself.

Something is flawed here.

 

Every single 'block' of the disk should be read, unless incorrect values are provided via -c and -b.

Calculating -b and -c incorrectly will surely provide incorrect results.

 

The -b option is for a single block read, and -c is how many -b's to read. If a read comes up short, an error is printed.

I will re-confirm this since I kept seeing these weird errors printed out when I was trying different values.

 

I believe badblocks calculates from 0 to n blocks based on -b; if there is an extra byte or two after the end, they are not read.  I believe the total size must be evenly divisible by -b; anything else would cause a problem.  Normally these values are calculated automatically, so I would expect it to operate normally under regular conditions.

 

If you were to fill a hard drive with nulls, then try to add 4 more bytes at the end, you would fail.

 

Disks normally are in a set number of blocks. You don't usually have an incomplete block on a disk. That's why the testing criteria is important. If there's a bug that's reproducible, then it can be repaired.

 

I've never seen a disk with n blocks plus half a block, unless something funky is going on with an HPA.

 

2,000,398,934,016 is evenly divisible by 1024.

 

I would really like to go through this more so I can compare your test against the source code.

 

I can see exactly why adding 4 bytes to the end of a flat file comes up with what looks like a program bug, but in reality it's not working with complete disk blocks, so it miscalculates how many blocks to read.  I'm not so sure this is a bug, since it's actually expected to work with disk blocks of a fixed size.  Once I locate the math for automatically calculating last_block, it should be apparent.

 

My tests using minute values with -b 1 -c 1 show that data at the end of the file is picked up as invalid. I think the issue lies in the even divisibility of -b into the total size of whatever you are testing. Any remainder is discarded, since it's expecting to work with fixed blocks.  I'll have to review it further; another emergency is taking precedence.


Badblock's equivalent is the -b option, and badblocks seem to ignore bytes that don't constitute a full block.

 

My prior comment may have been in haste, I believe an incomplete -c (block size) is ignored.

-b option is how many -c's to read in one instance.

 

You've mixed up the -b and the -c.  It's actually the other way around: -b is the block-size, and -c is the number of such blocks to be tested at a time.  That led you to completely misinterpret my previous post.  The numbers there still hold: 2000398934016 is not evenly divisible by 65536.  So any garbage you have in the last 24576 bytes of that disk will not be detected by the command Joe.L used in the OP.
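The arithmetic behind those figures can be double-checked with plain shell integer division (a quick sketch; the sizes are the ones quoted in this thread):

```shell
# Sanity-check: a 2,000,398,934,016-byte disk tested with -b 65536
SIZE=2000398934016
BS=65536
LAST_BLOCK=$(( SIZE / BS ))   # integer division truncates the partial block
LEFTOVER=$(( SIZE % BS ))     # bytes that no full block covers
echo "$LAST_BLOCK full blocks, $LEFTOVER bytes left over"
# -> 30523665 full blocks, 24576 bytes left over
```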

 

 


 

You are right, I flipped -b and -c.

I was dealing with another situation. 

Power loss at office with the phone ringing constantly.

I'm going to re-edit my prior post so as not to confuse the issue.

The root cause I'm finding is that using badblocks on a flat file whose size is not evenly divisible by the block size causes it to skip the last, incomplete block.

 

This should not happen on a disk, since the disk is probed for its size, which is expected to already be divisible by the block size unless the user supplied something different.

 


root@unRAID:/tmp# badblocks
Usage: badblocks [-b block_size] [-i input_file] [-o output_file] [-svwnf]
       [-c blocks_at_once] [-d delay_factor_between_reads] [-e max_bad_blocks]
       [-p num_passes] [-t test_pattern [-t test_pattern [...]]]
       device [last_block [first_block]]

 

 


        case 'c':
                blocks_at_once = parse_uint(optarg, "blocks at once");
                break;

        case 'b':
                block_size = parse_uint(optarg, "block size");
                break;
...
        bb_count = test_func(dev, last_block, block_size,
                             first_block, blocks_at_once);
...
        set_o_direct(dev, buffer, try * block_size,
                     ((ext2_loff_t) current_block) * block_size);
...
        got = read (dev, buffer, try * block_size);
...

Normally last_block is determined from the device size (which should be divisible by block_size).

     device_name = argv[optind++];
        if (optind > argc - 1) {
                errcode = ext2fs_get_device_size2(device_name,
                                                 block_size,
                                                 &last_block);
                if (errcode == EXT2_ET_UNIMPLEMENTED) {
                        com_err(program_name, 0,
                                _("Couldn't determine device size; you "
                                  "must specify\nthe size manually\n"));
                        exit(1);
                }
                if (errcode) {
                        com_err(program_name, errcode,
                                _("while trying to determine device size"));
                        exit(1);
                }
        } else {
                errno = 0;
                last_block = parse_uint(argv[optind], _("last block"));
                last_block++;
                optind++;
        }
        if (optind <= argc-1) {
                errno = 0;
                first_block = parse_uint(argv[optind], _("first block"));
        } else first_block = 0;


 

2000398934016 is evenly divisible by -b 1024.

 

The total size needs to be evenly divisible by the block_size, not by the count.

 

It reads in chunks of block size * count, up until last_block.

last_block is calculated from block_size and the total size of the file being tested, with the remainder dropped.  That is why partial blocks at the end are not read if the values are incorrect.

 


root@unRAID:/tmp# dd if=/dev/zero of=zeros bs=1024 count=1024
1024+0 records in
1024+0 records out
1048576 bytes (1.0 MB) copied, 0.0040629 s, 258 MB/s

root@unRAID:/tmp# ls -l zeros
-rw-rw-rw- 1 root root 1048576 Aug 14 18:15 zeros

root@unRAID:/tmp# badblocks -b 1024 -c 50000 -t0x00 -v zeros   
Checking blocks 0 to 1023
Checking for bad blocks in read-only mode
Testing with pattern 0x00: done                                
Pass completed, 0 bad blocks found.

root@unRAID:/tmp# echo "testing" >> zeros
root@unRAID:/tmp# ls -l zeros
-rw-rw-rw- 1 root root 1048584 Aug 14 18:18 zeros


root@unRAID:/tmp# badblocks -b 1024 -c 50000 -t0x00 -v zeros 
Checking blocks 0 to 1023
Checking for bad blocks in read-only mode
Testing with pattern 0x00: done                                
Pass completed, 0 bad blocks found.

NOT DETECTED SINCE LAST BLOCK IS STILL 1023
1048584/1024 = 1024.0078125

Add a full block. 

root@unRAID:/tmp# dd if=/dev/zero bs=1024 count=1 >> zeros 
1+0 records in
1+0 records out
1024 bytes (1.0 kB) copied, 3.8096e-05 s, 26.9 MB/s

root@unRAID:/tmp# badblocks -b 1024 -c 50000 -t0x00 -v zeros 
Checking blocks 0 to 1024
Checking for bad blocks in read-only mode
Testing with pattern 0x00: 1024
done                                
Pass completed, 1 bad blocks found.

INCORRECT VALUE IS NOW DETECTED. Notice how it calculated the block range differently this time.

last_block seems to be based on int(total_size/block_size);
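That formula reproduces the three "Checking blocks" headers above from the file sizes alone (a sketch using shell arithmetic; the sizes are the three sizes 'zeros' took in the transcript):

```shell
BS=1024
for SIZE in 1048576 1048584 1049608; do
    LAST=$(( SIZE / BS ))                 # int(total_size/block_size)
    echo "size $SIZE -> Checking blocks 0 to $(( LAST - 1 ))"
done
# size 1048584 still reports 0 to 1023 -- the 8 appended bytes fall outside any full block
```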

 

 

The program wasn't designed to work with an uneven (total_size/block_size).

It's clear. It's designed to work with fixed disk blocks.

 

 

If you can supply the test case where the disk was written to with invalid values and the program failed to detect it I would like to see it.  If I can reproduce it, I can fix it.

 

 

From what I'm seeing in my tests, the start is detected, the middle is detected, and the end is detected up to and including the last block, i.e. int(total_size/block_size).  This is a known factor. It's unexpected to have a remainder of a partial block on a physical disk.


2000398934016 is evenly divisible by -b 1024.

You're still confusing things.  The OP used "badblocks -b 65536"

 

If you can supply the test case where the disk was written to with invalid values and the program failed to detect it I would like to see it.  If I can reproduce it, I can fix it.

There's your test case, in the first post of this thread:  If you use the exact same command that the OP did, then badblocks won't touch whatever's in the last 24576 bytes of that disk.

 

 

The full badblocks command I used was:

badblocks -c 1024 -b 65536 -vsw -o /boot/badblocks_out_sdl.txt  /dev/sdl

The smart report looks like this:

User Capacity:    2,000,398,934,016 bytes

 

 


It's incorrect usage.

It will ignore anything not evenly divisible.

This is what I've been stating.

 

 

It's a known issue and has been for years.

I remember this being the case from many years ago.

The problem is the program doesn't report it; it's expecting you to know what you're doing when you redefine what would have been retrieved from the file system call.

 

I remember the warnings from years ago on the format command too.

 

Redo the test with them switched around and it should work correctly.


I wrote all zeros to a disk, then intentionally wrote a few non-zero bytes here and there.  To my dismay, a read-only badblocks pattern test for zeros passed with flying colors:

Checking for bad blocks in read-only mode
Testing with pattern 0x00: done
Pass completed, 0 bad blocks found. (0/0/0 errors)

Yet, I could hexdump the disk and see the non-zero bytes.

 

This is the test that needs review. Joe L's usage will have issues.

My own tests using those conditions reveal issues.

It's known that ending blocks are not read if you supply values that do not make sense for the geometry of a disk and then let it auto-calculate the block count from an uneven total.

 

It's the here and there that's the issue.

 

If you don't tell badblocks the first and last block, it calculates the last_block based on a block_size.

If you give it a block size that does not evenly divide the total size, it will not work correctly.

 

If badblocks is able to use what it detects from the ext2fs call, do these conditions come up?

 

Fact is, it tells us what blocks it is reading with Joe's command line.

None of us caught that it missed a block.

 

With normal usage I don't touch the block size unless it divides the disk evenly.

In every invocation I've done, I let it calculate the first and last block from what it finds on the file system.

 

I have used -c to increase the count; my other tests show that works.

 

Disk /dev/md3: 3000.6 GB, 3000592928768 bytes

root@unRAID:/tmp# badblocks -vs /dev/md3
Checking blocks 0 to 2930266531

root@unRAID:/tmp# badblocks -vs -c 1024 /dev/md3         
Checking blocks 0 to 2930266531

root@unRAID:/tmp# badblocks -vs -b 1024 -c 65536 /dev/md3
Checking blocks 0 to 2930266531

root@unRAID:/tmp# badblocks -vs -c 1024 -b 4096 /dev/md3
Checking blocks 0 to 732566632

root@unRAID:/tmp# badblocks -vs -c 1024 -b 65536 /dev/md3     
Checking blocks 0 to 45785413

This clearly is not going to work, as the trailing partial block has now been discarded due to truncation.
3000592928768 / 65536 = 45785414.5625
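Each of the block ranges above follows from the same integer division (a sketch; the size is /dev/md3's byte count from the fdisk line above):

```shell
SIZE=3000592928768      # /dev/md3 from the fdisk output above
for BS in 1024 4096 65536; do
    echo "-b $BS -> Checking blocks 0 to $(( SIZE / BS - 1 )) ($(( SIZE % BS )) bytes dropped)"
done
# only -b 65536 leaves a remainder (36864 bytes) on this disk
```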


It's the here and there that's the issue.

The "here and there" was a poor choice of words, I apologize.  How I discovered this issue was: I wanted to "preclear" the disk with a single write-pass of badblocks with a zero pattern.  Then I hexdumped the disk for peace of mind, and I saw the garbage at the end.  Trying to reproduce the thing, I zeroed out the whole disk with dd and just wrote a few non-zero bytes near the end.  The single read-pass of badblocks passed that, and that freaked me out, and I reported it here.  So the non-zero bytes weren't really "here and there"; they were mostly near the end of the disk.  Now I understand what's happened.  Still, that's not how I would expect badblocks to work.  I guess I'll always have to make careful calculations myself before invoking badblocks.  That's certainly not intuitive behavior.

 

 


It works correctly if you let it get the information from the file system call.

 

Joe's example, albeit for speed, overrode what the program would have retrieved from the file system call.

At the very least you've brought up a real potential issue for people who use the program and alter the block size.  Thanks for your patience with this one. I was in fireman mode today.

 

It's not that it did a short read. It's that the calculated values say to read only from 0 to 'here', where 'here' is calculated incorrectly if an uneven block size is supplied.

 

I suppose if we were ever to use badblocks to replace the dd part, we would print a warning if the calculated block count has a remainder.
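Such a warning would only take a few lines. A sketch (check_remainder is a hypothetical helper, not part of badblocks; the sizes are the ones from this thread):

```shell
# Hypothetical warning a wrapper (or badblocks itself) could emit before testing.
check_remainder() {
    local size=$1 bs=$2
    local rem=$(( size % bs ))
    if [ "$rem" -ne 0 ]; then
        echo "warning: last $rem bytes fall outside any full $bs-byte block"
    fi
}

check_remainder 2000398934016 65536   # Joe L's disk with -b 65536: warns
check_remainder 2000398934016 1024    # default -b 1024: silent
```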

 

However, it's designed to let you pick your own values if you want to go over a specific set of blocks on your disk.

For example, I remember reading a few pages years back about using badblocks to read from/to problematic blocks with -p num_passes, i.e. to force a reallocation.

 

Even in my attempt to recover about a year or two back, it was important to supply specific values to read and re-read blocks in specific areas to help recover a failing disk.


It works correctly if you let it get the information from the file system call.

 

Joe's example, albeit for speed, overrode what the program would have retrieved from the file system call.

No, badblocks doesn't do any calculations about what the -b and -c values should be.  If you don't supply -b and -c, it just uses default values of 1024 and 64 respectively, which coincidentally work for (most?) disks.  The only calculation, apparently, is the total number of bytes integer-divided by the -b block size to get the number of blocks it will work on.

 

Now that we know that badblocks cuts off along the -b block size line, I just need to do some more investigation to see whether it also cuts off along the -c blocks-at-a-time line.  If that turns out to be true, then there's also a potential for garbage resulting from improper -c values.

 

 


 

It calculates "last_block" based on block size and what is returned by the device/filesystem calls.

 

typedef __u64 blk64_t;


        if (optind > argc - 1) {
                errcode = ext2fs_get_device_size2(device_name,
                                                 block_size,
                                                 &last_block);
                if (errcode == EXT2_ET_UNIMPLEMENTED) {
                        com_err(program_name, 0,
                                _("Couldn't determine device size; you "
                                  "must specify\nthe size manually\n"));


in another library
/*
 * Returns the number of blocks in a partition
 */
errcode_t ext2fs_get_device_size2(const char *file, int blocksize,
             blk64_t *retblocks)
{

 

Examples here: http://libext2fs-wii.googlecode.com/svn-history/r20/trunk/source/getsize.c

 

The -c should not be an issue: in do_read, the read amount is block_size * try (the count).

It will return whatever was read, up to and including the max if it was available.

Just like before.


Now that we know that badblocks cuts off along the -b block size line, I just need to do some more investigation to see whether it also cuts off along the -c blocks-at-a-time line.

It turned out that weird -c blocks-at-a-time values don't cause badblocks to ignore any blocks, even when the last batch is smaller than the -c value.  So we need not worry about -c.  The only thing we need to ensure is that the total byte count of the disk|partition|file is evenly divisible by the -b block size; then every single byte of it will be tested.
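One way to satisfy that requirement automatically is to start from the block size you'd like and halve it until it divides the size evenly (a sketch; since sector sizes are powers of two, the loop always bottoms out at or above 512):

```shell
SIZE=2000398934016   # disk size in bytes (the 2 TB figure from this thread)
BS=65536             # preferred block size
while [ $(( SIZE % BS )) -ne 0 ]; do
    BS=$(( BS / 2 ))     # halve until every byte is covered by a full block
done
echo "largest workable -b: $BS"
# -> 8192 for this disk (the 24576-byte remainder is 3 * 8192)
```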

 

Once I had verified that to my satisfaction, I did some speed tests on a couple of disks.  I noticed that bumping the default -b block size of 1024 up to values higher than 4096 doesn't bring any additional speed improvement.  The command I liked best in the end was:

badblocks  -vsw  -b 4096  -c 1024  -d 1  /dev/sdX

While the above was running, I had a separate background task doing a read from a random place on the disk every few seconds.  That's a preclear setup I am happy with. :)

 

I am glad we sorted this out.  Thanks Weebo!

 

 

That is too bad!  I had very high hopes for badblocks.  See, I was investigating alternate ways to do a preclear before Google brought me to this thread.  The thing that started me was the sum command preclear uses for the post-read, which is a very, very slow way to do that.  I might as well capture and parse the output of a `hexdump /dev/sdX | head` and do the job five times faster.  In the end, I narrowed my choices down to cmp and badblocks.  I especially liked badblocks for its -d option (read delay factor), which could make the system more responsive during a long-running preclear.  It's too bad I can't rely on badblocks to go through every single byte on the disk; that's a show-stopper.

You do realize that Joe L is making preclear do random seeks in addition to the compares, to stress the disks out more?  That is one reason the post-read is slow compared to the pre-read and the write.
You do realize that I do realize that?  For homework, patch that script to do the `sum` with -s, and watch it do the job twice as fast, which is still nowhere near as fast as other methods, especially on older CPUs or on ULV CPUs.

 


Just making sure.  Not disputing that it could be faster; it just didn't seem like what you wanted would stress-test the disk.  It seemed like it was just going to read from start to finish and compare, without the random seeks, but I'll take your word for it.

Is the whole disk read with random seeks?

 

I thought there was a test with tons of random read seeks, but that it was just to shake the heads around.

I.e., it did not actually compare data across the whole drive.

Am I mistaken?

 

In any case, badblocks can do the random seeks: you give it a starting block and a last block.

 

So if a program had a table, it could read every block by telling badblocks the starting and ending block, then check the return value and mark those blocks as read.

Once all blocks have been read the test is over.

 


In any case, badblocks can do the random seeks.

No need for that, really.  The random seeks are just to add some extra stress to the heads while badblocks is doing the real job of writing/verifying.  The random seeks are not speed-critical, and they can be done from a simple Bash script.  What I do is, my "preclear" script spawns a simple child shell that does random reads from the disk every second or so.  Then the preclear script invokes the badblocks command, and once badblocks is done, the script just kills the child shell.  There's no need for anything more complicated than that.
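A minimal sketch of that arrangement, with a small flat file standing in for the disk and a sleep standing in for the badblocks run (names, sizes, and paths here are illustrative only):

```shell
#!/bin/bash
# Sketch of the preclear arrangement described above.
# TARGET would be the disk device in real use; a temp file stands in here.
TARGET=$(mktemp)
dd if=/dev/zero of="$TARGET" bs=4096 count=256 2>/dev/null
BLOCKS=$(( $(stat -c %s "$TARGET") / 4096 ))

( while :; do                                  # child: one random read per second
      OFF=$(( $(od -An -N2 -tu2 /dev/urandom) % BLOCKS ))
      dd if="$TARGET" of=/dev/null bs=4096 count=1 skip="$OFF" 2>/dev/null
      sleep 1
  done ) &
SEEKER=$!

sleep 2    # the real script runs: badblocks -vsw -b 4096 -c 1024 -d 1 "$TARGET"

kill "$SEEKER" 2>/dev/null                     # badblocks done; stop the seeker
rm -f "$TARGET"
```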

 

 


Ah!  Now that sounds like a plan.  I understand now why it is better.

 

What I like about Joe's test is that it really shakes (rattles and rolls) the drive for a few moments.

 

I remember seeing the 'status' come up and hearing the drive go crazy for a short while.

While it concerned me, it also gave me a level of confidence in the drive.

 

I think it's a worthy test.


What I like about Joe's test is it really shakes, (rattles and rolls) the drive

 

I think it's a worthy test.

Yes, it is.  I am talking about exactly the same shakes, rattles, and rolls as Joe's script does.  Only I don't really need 2266 lines of Bash code to accomplish that, if I can do it in about 50 lines. :)

 

 


After three years, one of my Hitachi 5K3000s showed some write errors, so I replaced it with a new drive.

Now I'm letting it do a cycle with the preclear script.

 

What do you think: if it passes that cycle, would it be okay to use that HDD somewhere where less important data is handled (outside of my unRAID)?


Without data (a smart report), it's just a guess.


Okay here is the result of the preclear run:

 

Disk: /dev/sdb
smartctl 6.2 2013-07-26 r3841 [i686-linux-3.9.11p-unRAID] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Hitachi Deskstar 5K3000
Device Model:     Hitachi HDS5C3020ALA632
Serial Number:    XXXXXXXXXXXXXXXXXX
LU WWN Device Id: 5 000cca 369d50055
Firmware Version: ML6OA5C0
User Capacity:    2,000,398,934,016 bytes [2.00 TB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    5940 rpm
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS T13/1699-D revision 4
SATA Version is:  SATA 2.6, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is:    Tue Jul 14 22:47:03 2015 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x84)	Offline data collection activity
				was suspended by an interrupting command from host.
				Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0)	The previous self-test routine completed
				without error or no self-test has ever 
				been run.
Total time to complete Offline 
data collection: 		(24664) seconds.
Offline data collection
capabilities: 			 (0x5b) SMART execute Offline immediate.
				Auto Offline data collection on/off support.
				Suspend Offline collection upon new
				command.
				Offline surface scan supported.
				Self-test supported.
				No Conveyance Self-test supported.
				Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
				power-saving mode.
				Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
				General Purpose Logging supported.
Short self-test routine 
recommended polling time: 	 (   1) minutes.
Extended self-test routine
recommended polling time: 	 ( 411) minutes.
SCT capabilities: 	       (0x003d)	SCT Status supported.
				SCT Error Recovery Control supported.
				SCT Feature Control supported.
				SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   100   100   016    Pre-fail  Always       -       0
  2 Throughput_Performance  0x0005   134   134   054    Pre-fail  Offline      -       103
  3 Spin_Up_Time            0x0007   161   161   024    Pre-fail  Always       -       356 (Average 325)
  4 Start_Stop_Count        0x0012   098   098   000    Old_age   Always       -       9528
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000b   100   100   067    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0005   146   146   020    Pre-fail  Offline      -       29
  9 Power_On_Hours          0x0012   099   099   000    Old_age   Always       -       12112
10 Spin_Retry_Count        0x0013   100   100   060    Pre-fail  Always       -       0
12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       3386
192 Power-Off_Retract_Count 0x0032   093   093   000    Old_age   Always       -       9533
193 Load_Cycle_Count        0x0012   093   093   000    Old_age   Always       -       9533
194 Temperature_Celsius     0x0002   146   146   000    Old_age   Always       -       41 (Min/Max 15/43)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       21

SMART Error Log Version: 1
ATA Error Count: 21 (device log contains only the most recent five errors)
CR = Command Register [HEX]
FR = Features Register [HEX]
SC = Sector Count Register [HEX]
SN = Sector Number Register [HEX]
CL = Cylinder Low Register [HEX]
CH = Cylinder High Register [HEX]
DH = Device/Head Register [HEX]
DC = Device Command Register [HEX]
ER = Error register [HEX]
ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 21 occurred at disk power-on lifetime: 12026 hours (501 days + 2 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 51 01 7e cc 00 00  Error: ICRC, ABRT at LBA = 0x0000cc7e = 52350

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  ca 00 08 77 cc 00 e0 00      05:53:28.285  WRITE DMA
  ef 10 02 00 00 00 a0 00      05:53:28.285  SET FEATURES [Enable SATA feature]
  27 00 00 00 00 00 e0 00      05:53:28.285  READ NATIVE MAX ADDRESS EXT [OBS-ACS-3]
  ec 00 00 00 00 00 a0 00      05:53:28.284  IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 00      05:53:28.284  SET FEATURES [set transfer mode]

Error 20 occurred at disk power-on lifetime: 12026 hours (501 days + 2 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 51 01 7e cc 00 00  Error: ICRC, ABRT at LBA = 0x0000cc7e = 52350

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  ca 00 08 77 cc 00 e0 00      05:53:27.970  WRITE DMA
  ef 10 02 00 00 00 a0 00      05:53:27.970  SET FEATURES [Enable SATA feature]
  27 00 00 00 00 00 e0 00      05:53:27.970  READ NATIVE MAX ADDRESS EXT [OBS-ACS-3]
  ec 00 00 00 00 00 a0 00      05:53:27.969  IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 00      05:53:27.969  SET FEATURES [set transfer mode]

Error 19 occurred at disk power-on lifetime: 12026 hours (501 days + 2 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 51 01 7e cc 00 00  Error: ICRC, ABRT at LBA = 0x0000cc7e = 52350

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  ca 00 08 77 cc 00 e0 00      05:53:27.655  WRITE DMA
  ef 10 02 00 00 00 a0 00      05:53:27.655  SET FEATURES [Enable SATA feature]
  27 00 00 00 00 00 e0 00      05:53:27.655  READ NATIVE MAX ADDRESS EXT [OBS-ACS-3]
  ec 00 00 00 00 00 a0 00      05:53:27.654  IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 00      05:53:27.654  SET FEATURES [set transfer mode]

Error 18 occurred at disk power-on lifetime: 12026 hours (501 days + 2 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 51 01 7e cc 00 00  Error: ICRC, ABRT at LBA = 0x0000cc7e = 52350

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  ca 00 08 77 cc 00 e0 00      05:53:27.340  WRITE DMA
  c8 00 08 7f cc 00 e0 00      05:53:27.340  READ DMA
  c8 00 08 77 cc 00 e0 00      05:53:18.370  READ DMA
  e5 00 00 00 00 00 40 00      05:52:38.165  CHECK POWER MODE
  e5 00 00 00 00 00 40 00      05:51:38.527  CHECK POWER MODE

Error 17 occurred at disk power-on lifetime: 11618 hours (484 days + 2 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 51 01 66 dd 00 00  Error: ICRC, ABRT at LBA = 0x0000dd66 = 56678

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  ca 00 08 5f dd 00 e0 00      02:48:16.360  WRITE DMA
  c8 00 08 5f dd 00 e0 00      02:48:16.352  READ DMA
  ea 00 00 00 00 00 a0 00      02:48:16.341  FLUSH CACHE EXT
  ca 00 08 6f dd 00 e0 00      02:48:16.341  WRITE DMA
  ca 00 08 67 dd 00 e0 00      02:48:16.341  WRITE DMA

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Aborted by host               90%     12046         -
# 2  Short offline       Completed without error       00%     12046         -

SMART Selective self-test log data structure revision number 1
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.


I can't see anything obvious. Maybe it was just a faulty cable?


Over 12000 hours of operation, and all I can see wrong are 21 corrupted packets!  Nothing wrong with the drive.  The CRC errors would not have caused write errors either; they are retried until successful.  About 86 operational hours ago there was a cluster of 4 CRC errors, which is odd: possibly strong interference near the cable, or possibly some really flaky power.  Perhaps there were power issues then, and they caused the write errors, but at the moment that looks like a fluke.  No reason at all you should not trust this drive.  The cable may not be perfect, but it certainly isn't terrible, with only 21 errors in over 12000 hours.
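For what it's worth, the "about 86 operational hours ago" figure comes straight from the SMART data above: Power_On_Hours (attribute 9) minus the power-on lifetime stamp on the most recent error cluster. A trivial check, with both values copied from the log above (nothing here queries the drive):

```shell
# Attribute 9 (Power_On_Hours) from the SMART attribute table above
power_on_hours=12112
# Power-on lifetime stamp of errors 18-21 in the error log above
last_error_hours=12026
# Hours of operation since the CRC error cluster
echo $(( power_on_hours - last_error_hours ))
```
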


Yep, the UDMA_CRC_Error_Count is far higher on my parity drive.

 

root@HTMS:~# smartctl --all /dev/sdb | grep 199
199 UDMA_CRC_Error_Count    0x003e   200   199   000    Old_age   Always       -       77
root@HTMS:~# smartctl --all /dev/sdc | grep 199                                                                                                 
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
root@HTMS:~# smartctl --all /dev/sdd | grep 199
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       0
root@HTMS:~# smartctl --all /dev/sde | grep 199
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       4
root@HTMS:~# smartctl --all /dev/sdf | grep 199
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       2
root@HTMS:~# smartctl --all /dev/sdg | grep 199
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       2
root@HTMS:~# smartctl --all /dev/sdh | grep 199
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       0

 

Maybe the cable is not the best?

