abernardi Posted May 27, 2010

Hi, This is really a continuation of a thread I had in the software section: http://lime-technology.com/forum/index.php?topic=6491.0 but it looks like it's becoming more of a hardware issue now. There are partial syslogs there if you want to look at them. I have been building a new unRAID system. I moved the bulk of my data over to two used 1.5TB drives and one new 2TB drive, and had started building the parity drive on another new 2TB drive when I ran into trouble. I was getting massive write errors on disk 2, one of the used 1.5TB drives. I had mistakenly assumed these drives were in good shape. The 1.5TB drives had been in my iMac system as external media drives for a while without any noticeable problems. I ran the S.M.A.R.T. tests on them and thought they came out fine, but as Joe L. pointed out, I really don't know what I'm looking at; I was just noticing that they had "PASSED" a few times. So I did a little self-educating, and with my limited knowledge I'm now thinking that my disk 2 AND disk 3 are in serious trouble and might need to be scrapped. I'm attaching the tests on the two bad disks and disk 1, which looks good to me, but again, I'm new at this. Could some of you look at them and tell me what you're seeing? Is there any way to repair them? I'm also including a "reiserfsck" check I did on disk 1 and disk 2 in case that might help. In the meantime, I'm trying to rescue what files I can from the affected disks, and then depending on what you guys advise, I'll go from there.

Terminal_Saved_Output_2010-5-26.txt smart-d2-2010-5-25.txt smart-d3-2010-5-25.txt smart-d1-2010-5-25.txt
Joe L. Posted May 27, 2010

Disk2 has 1965 unreadable sectors that the disk has marked for re-allocation when they are next written. The only reason the disk is not currently failing is that the drive will first try to write and re-read those sectors before performing the re-allocation. If not successful, it will re-allocate the sectors with the contents being written. (It can't do it until you write to those sectors, since it has no idea what is in them; it can't read them.)

Disk2: 197 Current_Pending_Sector 0x0012 053 053 000 Old_age Always - 1965

Disk3 looks OK, but not great. It has 9000 hours on it and is apparently showing some signs of mechanical wear. It has had to re-try spinning up 29 times. The failure threshold is 97, and the current "normalized" value is 100. I'd keep an eye on it. Basically, this error occurs if the disk is unable to spin up to speed. It could be power-supply related, but usually it is worn bearings on the motor.

Disk3: 10 Spin_Retry_Count 0x0013 100 100 097 Pre-fail Always - 29

Disk1 looks in good shape. I see nothing in the smart test results that looks like a problem. Joe L.
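The attribute lines Joe quotes follow smartctl's standard attribute-table layout: ID, name, flags, normalized value, worst, threshold, type, update mode, when-failed, raw value. As a minimal, hypothetical Python sketch of how such a line can be pulled apart (this is illustrative only, not part of smartmontools or any unRAID tooling):

```python
# Hypothetical sketch: split one line of `smartctl -A` attribute output
# into its fields. Column order follows smartctl's standard table:
# ID# NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE.
def parse_smart_attribute(line):
    parts = line.split()
    return {
        "id": int(parts[0]),
        "name": parts[1],
        "value": int(parts[3]),    # normalized value (what SMART judges)
        "worst": int(parts[4]),
        "thresh": int(parts[5]),
        "raw": int(parts[9]),      # vendor-specific raw count
    }

# The Current_Pending_Sector line from Joe's reply:
pending = parse_smart_attribute(
    "197 Current_Pending_Sector 0x0012 053 053 000 Old_age Always - 1965"
)
print(pending["raw"])   # 1965 sectors awaiting re-allocation
```

The raw value (1965 here) is the count that matters for pending sectors; for most other attributes only the normalized value is interpretable by anyone outside the manufacturer.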
abernardi Posted May 27, 2010 (Author)

So, are you saying that if I reformat Disk2, it can be usable because it will just re-allocate the data being written to those 1965 bad sectors? Also, I thought the smart report on Disk3 had some problems:

1 Raw_Read_Error_Rate 0x000f 118 099 006 Pre-fail Always - 190227883

and

7 Seek_Error_Rate 0x000f 071 066 030 Pre-fail Always - 14511579

Isn't that bad? And finally, as you advised, I started the long smart test last night on Disks 2 & 3: smartctl -d ata -t long /dev/sdc and smartctl -d ata -t long /dev/sda. I ran them simultaneously; I'm assuming that's alright (hope it is). Anyway, it estimated it would be finished around 3am this morning, but when I looked at the console, there was no report of finishing. Should there be? And should I have added something like >/boot/smart.txt to get the report? Or should I just run the short report now? Again, Joe, thanks for the input. Adam
BRiT Posted May 27, 2010

You should run an extensive round of preclearing on both of those drives to see if the error rates increase. If they do, consider both of them dead.
Joe L. Posted May 27, 2010

So, are you saying that if I reformat Disk2, it can be usable because it will just re-allocate the data being written to those 1965 bad sectors? Also, I thought the smart report on Disk3 had some problems: 1 Raw_Read_Error_Rate 0x000f 118 099 006 Pre-fail Always - 190227883 and 7 Seek_Error_Rate 0x000f 071 066 030 Pre-fail Always - 14511579 Isn't that bad?

The Raw Read Error Rate is fine. The current value is 118; the failure threshold is 6. The large number at the end of the line only has meaning to the manufacturer, and they don't tell us how to interpret it. They only tell us how to interpret the "normalized" value (118 in this case). The Seek Error Rate worst case was 66; the failure threshold is 30. It indicates there is an issue seeking, probably also an indication of a small amount of wear. It is not yet considered a failure by the SMART firmware, and that is why it is still reporting the drive as PASSED. As far as spare sectors go, most drives only have a few thousand spares. You are gonna get mighty close to using them all up. I'd consider that drive a perfect candidate for an RMA.

And finally, as you advised I started the long smart test last night on Disks 2 & 3: smartctl -d ata -t long /dev/sdc and smartctl -d ata -t long /dev/sda. I ran them simultaneously, I'm assuming that's alright (hope it is). Anyway, it estimated it would be finished around 3am this morning, but when I looked at the console, there was no report of finishing. Should there be? And should I have added something like >/boot/smart.txt to get the report? Or should I just run the short report now? Again, Joe, thanks for the input. Adam

No, there is no direct output for the long tests. You'll need to just get another "status" report for each (after waiting 4 or 5 hours) and the test results will be near the bottom of each. They call them "offline" tests. So... type smartctl -d ata -a /dev/sdc and smartctl -d ata -a /dev/sda to see the results of the "long" test. If it encountered unreadable sectors, it may have even aborted. It will say so in the smart report output if it did. Joe L.
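Joe's pass/fail rule (the SMART firmware flags an attribute only when its normalized value falls to or below the manufacturer's threshold; the raw number is vendor-specific) can be sketched as a tiny, hypothetical Python helper:

```python
# Hypothetical sketch of the SMART pass/fail convention Joe describes:
# an attribute counts as failing only when its normalized value has
# dropped to or below the manufacturer's threshold. The raw value is
# ignored here, since only the manufacturer knows how to read it.
def attribute_failing(value, thresh):
    return value <= thresh

# Disk3 Raw_Read_Error_Rate: normalized 118 vs threshold 6 -> still passing
print(attribute_failing(118, 6))    # False
# Disk3 Seek_Error_Rate worst case: 66 vs threshold 30 -> still passing
print(attribute_failing(66, 30))    # False
```

This is why the huge raw numbers Adam was worried about (190227883 and 14511579) do not, by themselves, mean the drive has failed SMART.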
abernardi Posted May 28, 2010 (Author)

Hi Joe, I've attached the new smart reports. It looks like Disk 2 has an even higher number of unreadable sectors now, at 2718. That's not good. Luckily, I've gotten all the files I really needed off of it and have backups for the rest, tho these are mostly DVDs, which means I'll have to rip them again. PITA. You had mentioned that Disk 3 was not looking great because of the high spin retry count. It could have been the power supply, I don't know. Up until a few days ago, it was a Seagate FreeAgent external drive, about a year old if I remember correctly. It was in one of those external cases that doesn't have an on/off switch, so the thing would spin up whether I needed it or not. Concerning the "normalized" value of 100 and the failure value of 97, what do these numbers represent exactly? Or is that even important? The idea is that the drive fails if the "normalized" value drops under 97, is that right? Additionally, 9000 hours seems like way more than I could have possibly put on it. Even if I was running it 24/7 for a year, that would be 8760 hours. Maybe I've had it longer than I thought. I'll have to check that out. I don't know how to interpret the results of the "long" test. I did start and stop the test several times as I wasn't sure I had typed in the right characters. Is that what the report is saying? These unreadable sectors on Disk 2, does this mean they are physically shot, or just might contain corrupted data? I mean, if I precleared that disk and wrote all 0s to it as BRiT suggests, could that bring back the sectors? Sorry for the fundamental questions, I've not gone this deeply into this stuff before. Adam

smart-2010-5-27-d3.txt smart-2010-5-27-d2.txt
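Adam's back-of-envelope check on the power-on-hours figure is easy to reproduce. A small arithmetic sketch in Python (the 9,091-hour figure is the lifetime reading from the drive's self-test log, quoted in the next reply):

```python
# Sanity-check a drive's power-on hours against calendar time, as Adam
# does: a drive spinning 24/7 accumulates 8760 hours per year.
HOURS_PER_YEAR = 365 * 24        # 8760
power_on_hours = 9091            # lifetime hours from the self-test log
years = power_on_hours / HOURS_PER_YEAR
print(round(years, 2))           # ~1.04 years of continuous spinning
```

So 9,000-odd hours corresponds to just over a year of nonstop operation, which fits an always-spinning external enclosure that is a bit older than its owner remembers.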
Joe L. Posted May 28, 2010

Hi Joe, I've attached the new smart reports. It looks like Disk 2 has an even higher number of unreadable sectors now at 2718. That's not good. Luckily, I've gotten all the files I really needed off of it and have backups for the rest, tho these are mostly DVDs, which means I'll have to rip them again. PITA.

Good that you discovered this before deleting the files from your original disks. Sorry you'll need to re-rip some DVDs, but perhaps you don't. The bad sectors could have been in areas of the disk not used. (It is just impossible to tell, unfortunately, unless you have the original files to compare with.)

You had mentioned that Disk 3 was not looking great because of the high spin retry count. It could have been the power supply, I don't know. Up until a few days ago, it was a Seagate FreeAgent external drive, about a year old if I remember correctly. It was in one of those external cases that doesn't have an on/off switch, so the thing would spin up whether I needed it or not.

It did not have a fan in it either, apparently, since at one time in the past it failed SMART with a high temperature error (see where it says "in_the_past" in the WHEN_FAILED column). It could have been a power supply that did not let it spin up fast enough. 9000 hours is not a lot of hours for a drive, so it should not be showing too much wear. It did successfully pass the long test: # 1 Extended offline Completed without error 00% 9091 -

Concerning the "normalized" value as 100 and the failure value being 97, what do these numbers represent exactly?

Only the manufacturer knows how it maps the failures to the normalized values, or even how the initial values are set, or the thresholds. If you assume that the initial starting value for spin-ups is 100, since many of your counters for that drive start there, then it has not budged at all toward the failure threshold of 97.
It might be that it takes 30 failures to drop from 100 to 99, or 3000 failures to drop from 100 to 99. We have no way of knowing unless we analyze a lot of smart reports of failing drives.

Or is that even important?

No, we don't care about how they map raw counts to normalized values.

The idea is that the drive fails if the "normalized" value drops under 97, is that right?

As far as "smart" is concerned, it will have failed, but that does not mean it stops working. It just reached the manufacturer's arbitrary threshold where they consider the retry rate to be excessive. The disk could work for years longer, just re-trying the spin-up more frequently than usual.

Additionally, 9000 hours seems like way more than I could have possibly put on it. Even if I was running it 24/7 for a year, that would be 8760 hours. Maybe I've had it longer than I thought. I'll have to check that out.

I've never seen that number incorrect. I'll bet it is older than you think. (Most have a manufacture date on their case; take a look for it.)

I don't know how to interpret the results of the "long" test. I did start and stop the test several times as I wasn't sure I had typed in the right characters. Is that what the report is saying?

It looks like the first two attempts at running the "long" test were aborted... Perhaps you did not have the spin-down timer disabled yet.

These unreadable sectors on Disk 2, does this mean they are physically shot, or just might contain corrupted data?

The disk drive already tried many times to read them. It gave up. Each sector has its own checksum stored with it on the disk as part of its own internal housekeeping. The disk was unable to read the sector and have it match the expected checksum, so it considers the sector un-readable. I have no idea if you can keep trying to re-read a sector marked for re-allocation. It really depends on the disk manufacturer. If I were them, I'd try again, as there is really nothing to lose.
But before it marked it as un-readable, it already tried many times (again, only the manufacturer knows how many times, and those are apparently trade secrets, because they aren't telling anybody how they operate internally).

I mean, if I precleared that disk and wrote all 0s to it as BRiT suggests, could that bring back the sectors?

When the zeros are written to the drive (actually, when anything is next written to the sectors marked for re-allocation), it will first attempt to write to the defective sectors and then re-read what it has written. If it can successfully re-read what it had just written, then it will assume the sector was not properly written initially and that the current "write" worked, and it will not re-allocate the sector. Obviously, the old contents of the sector are over-written by the new values being written. In effect, it might be able to keep from re-allocating the sector, but the data in it is gone, replaced by whatever is being written to the drive. If the attempt to write to the sector marked for re-allocation fails, then an alternate sector from the pool of spare sectors is used instead, and the disk will use that sector whenever it is asked to get the original one. Again, the data in the original sector is gone... The newly allocated sector will have whatever is currently being written. As I said earlier, most large disks today have several thousand spare sectors. It is expected that many will be re-allocated before the "smart" report shows a failure. In reality, even a few failed sectors could be terrible; it all depends on where they are on the disk. You might not even notice one in the middle of a movie, other than a glitch on the screen. Looking at the "long" report results for disk2, it was unable to complete because of read failures. The drive is a strong candidate for an RMA.
SMART Self-test log structure revision number 1
Num  Test_Description    Status                   Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure  90%        825              881859876
# 2  Extended offline    Completed: read failure  90%        825              881859876
# 3  Extended offline    Completed: read failure  90%        825              881859880
# 4  Extended offline    Completed: read failure  90%        825              881859877

You can try the preclear_disk.sh script on disk2 to see if it will be able to successfully re-allocate all the sectors on the disk. Can't hurt once you get all your data off of it. I'd run it once, and then look at its output, then run it again to see that the sectors marked for re-allocation are not still increasing.

Sorry for the fundamental questions, I've not gone this deeply into this stuff before. Adam

Most people have not either. You are in good company. You can read all about SMART reporting here: http://en.wikipedia.org/wiki/S.M.A.R.T. Joe L.
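Scanning a self-test log like the one above for read failures can be sketched in Python. This is a hypothetical helper (not a real smartmontools utility), assuming smartctl's usual column layout in which the LBA of the first error is the last field on the line:

```python
# Hypothetical sketch: scan smartctl self-test log lines and collect the
# first-error LBA of every test that ended in a read failure. Assumes
# smartctl's usual layout, where LBA_of_first_error is the last field.
def failed_selftests(log_lines):
    failures = []
    for line in log_lines:
        if "read failure" in line:
            lba = int(line.split()[-1])
            failures.append(lba)
    return failures

log = [
    "# 1  Extended offline    Completed: read failure  90%  825  881859876",
    "# 2  Extended offline    Completed: read failure  90%  825  881859876",
]
print(failed_selftests(log))   # [881859876, 881859876]
```

Repeated failures at nearly identical LBAs, as in disk2's log, point at a fixed region of bad media rather than transient read errors, which supports Joe's RMA recommendation.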
abernardi Posted May 28, 2010 (Author)

Excellent feedback Joe, thank you.

Good that you discovered this before deleting the files from your original disks. Sorry you'll need to re-rip some DVDs, but perhaps you don't. The bad sectors could have been in areas of the disk not used. (It is just impossible to tell unfortunately unless you have the original files to compare with.)

There really is no way to check the integrity of these movie files, is there? The files appear intact, but, as you say, if there's a bad sector involved I could get a glitch. I wonder, if there is a glitch, whether it could damage the sound playback equipment, let's say, by sending some kind of pop or audio hit... It's late, but I might start that preclear_disk.sh script tonight. I actually spent the evening installing an iStar 5-in-3 cage and moving some of the drives over to it. That's a new one for me too. I had to figure out how to get it into the case, because the case had rails built in for 5 1/4" devices. So I bent them back and the thing went right in. Man, that fan is loud! But everything seems to be up and running. Thanks again for the education. Adam
abernardi Posted May 28, 2010 (Author)

Yeah Joe, it isn't looking good for Disk 2. I started the preclear and it was reading nicely at 75-80 MB/s until about halfway through, when it suddenly dropped to zero at 3:15:15 into it. It gave a few more reports, then another report about 2 hours later with zero again. I guess I should stop it, yes?