Parity check errors repeatedly occurring

October 25, 2008

Hello,

My first post here... about a problem

I've assembled a machine to run Unraid with 3x 1 TB Samsung disks (2x data + 1 parity). Normally there are no particular issues, disk temperatures are below 40 degrees Celsius; I'm satisfied with both reading and writing speeds.

Now the problem is that after I run a parity check, there are 300 - 500 errors, as well as the same number of writes to disk 2. That means, I assume, unraid detects parity errors for these blocks and writes corrected values to disk 2. All of these errors occur at the start of the disk. The entire parity check lasts about 2 hours; I've run it several times, and all errors occur within the first few minutes.

And... no matter how many times I run parity check, same happens again. Errors plus writes (corrections?) to disk 2.

I would be grateful for any advice what to do... I guess there is nothing wrong with SATA cables because almost entire disk 2 checks out well... and errors are probably not caused by bad blocks on disk, because these sectors would be isolated once and not reused. Then... why do errors repeatedly occur, and why only on start of disk 2?

Syslog attached.

October 26, 2008

You might need to reboot to get the sectors reallocated. I eventually got a 750 GB Samsung drive to stop showing errors by running the short SMART check and rebooting each time to get the sector it stopped at reallocated. I only had 16 errors in the full sync though.

October 26, 2008

You might want to run the smartctl short test and then post the smartctl output. It will give insight into the disk's health. You might just have a disk with a bunch of unreadable sectors at the beginning of the disk. The SMART report will tell a lot about what is happening.

Instructions are in the wiki here: http://lime-technology.com/wiki/index.php?title=Troubleshooting#Hard_drive_failures

Joe L.

October 26, 2008

If you want my opinion bin the drive and get another. Cost vs. risk this is a no brainer to me.

October 26, 2008

Author

Thanks for the replies so far.

Rebooting doesn't help; the errors always happen again.

Attached is the smartctl output for all 3 drives (the problematic one is smartc-suspicious.txt). I'm not too dumb, but since I'm not familiar with smartctl, I don't know how to interpret these results.

Getting a new disk is a perfectly valid option as well; however, I would first like to be reasonably sure it is the disk that is faulty (and not the file system, the SATA controller on the MB, ...).

A few additional, not so helpful remarks: transfer speeds for the suspicious disk are the same as for the healthy one, temperatures are similar as well and not too high, there are no strange noises, and the power supply is a 450W Seasonic behind a 1500 VA/865 W APC UPS.

October 26, 2008

You definitely have a problem with the "suspicious" drive.

If you look at the attribute called "Reallocated_Sector_Ct", you'll see that 42 sectors have been reallocated. This is not a healthy sign.

By way of background, drives have a built-in supply of spare sectors. If the drive detects that a sector has gone bad, it can take the bad sector out of service and map in a spare one. It is not so unusual to see a drive with 1 or even 2 such reallocations, although the norm by far is to have 0 reallocated sectors. A large number of reallocations is a sign that the drive is failing or defective. 42 is a hefty number for sector reallocations.

Your other 2 drives each have 0 reallocated sectors.
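For anyone who would rather not read the attribute table by eye, here is a minimal Python sketch that pulls the raw values out of smartctl-style attribute lines. It assumes the column layout shown in the smartctl excerpt quoted later in this thread, and the sample lines are copied from that excerpt; the names `parse_attributes`, `worrying` and `WATCHED` are made up for this illustration and are not part of smartctl.

```python
import re

# Three attribute lines copied from the smartctl excerpt later in this thread.
SAMPLE = """\
196 Reallocated_Event_Count 0x0032   095   095   000    Old_age   Always       -       212
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       2
198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       1
"""

# ID, name, flag, value, worst, threshold, type, updated, when_failed, raw value.
ATTR_LINE = re.compile(
    r"^\s*(\d+)\s+(\S+)\s+0x[0-9a-fA-F]+\s+(\d+)\s+(\d+)\s+(\d+)"
    r"\s+\S+\s+\S+\s+\S+\s+(\d+)"
)

# Attributes discussed in this thread; for all of them a raw value of 0 is the healthy case.
WATCHED = (
    "Reallocated_Sector_Ct",
    "Reallocated_Event_Count",
    "Current_Pending_Sector",
    "Offline_Uncorrectable",
)


def parse_attributes(report):
    """Return {attribute_name: raw_value} for every attribute line found."""
    attrs = {}
    for line in report.splitlines():
        match = ATTR_LINE.match(line)
        if match:
            attrs[match.group(2)] = int(match.group(6))
    return attrs


def worrying(attrs):
    """Return the watched attributes whose raw value is above zero."""
    return {name: attrs[name] for name in WATCHED if attrs.get(name, 0) > 0}


attrs = parse_attributes(SAMPLE)
print(worrying(attrs))
```

Saving the parsed values from each report makes it easy to see whether the counts are still climbing from one check to the next.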

October 26, 2008

The only errors I can see are media errors on your Disk 2, near the beginning (about 2 and a half minutes in), everything else is fine. The errors are accompanied by an indication of UNC, which I understand to mean 'Uncorrectable', but I don't know how significant that is here. I do not believe you should conclude that Disk 2 has failed yet, but it does have failing sectors. As suggested, run the SMART short or long test, and post another SMART report for it. After that and one more parity check (which will probably have more parity errors), you should be OK, but let us know if errors continue. I would also plan on monitoring it every month, compare the SMART reports, looking for continuing degradation. 42 reallocated sectors are not good, but acceptable if that number does not increase any further.

That means, I assume, unraid detects parity errors for these blocks and writes corrected values to disk 2.

Parity checks and syncs never write to data disks, they only correct the parity info on the parity drive.

You need to correct your time on this server, the logs are currently saying April 10, 2007.

October 26, 2008

Author

The only errors I can see are media errors on your Disk 2, near the beginning (about 2 and a half minutes in), everything else is fine. The errors are accompanied by an indication of UNC, which I understand to mean 'Uncorrectable', but I don't know how significant that is here. I do not believe you should conclude that Disk 2 has failed yet, but it does have failing sectors. As suggested, run the SMART short or long test, and post another SMART report for it. After that and one more parity check (which will probably have more parity errors), you should be OK, but let us know if errors continue. I would also plan on monitoring it every month, compare the SMART reports, looking for continuing degradation. 42 reallocated sectors are not good, but acceptable if that number does not increase any further.

OK, I've started long smartctl test, will post results within few days.

That means, I assume, unraid detects parity errors for these blocks and writes corrected values to disk 2.

Parity checks and syncs never write to data disks, they only correct the parity info on the parity drive.

Hmm... After I started parity check, "Writes" (and "Errors") counter on Main page was incremented for disk 2. "Writes" for parity disk remained 0. Perhaps I interpreted that too literally...

You need to correct your time on this server, the logs are currently saying April 10, 2007.

Corrected now using telnet, for some reason whatever I wrote on Settings page and "Apply" didn't work.

October 26, 2008

When the "read" errors occurred on disk2, the unRAID driver reconstructed what it thought should be at the sector being read, using parity and the other data drives. It then wrote that correct data back to disk2. It never needed to subsequently "write" to parity to correct it, since it would match at that point. It is only because of the "read" errors that you are seeing "writes", in the low-level driver's attempt to fix the data. In fact, it is doing exactly that... as the subsequent "write" then lets the "SMART" code in the disk reallocate the sector. The disk knows it reported a read error, and it knows a subsequent "write" to that same sector is a perfect opportunity to reallocate it and have the correct data in it.

(At least this is how I think it is working... most of us do not have read errors on our disks...)

Oh yes, the "long" test usually takes a few hours... not a few days. If you do a smartctl command to get the status it will let you know if it is done.

Joe L.

October 27, 2008

The drive appears to be pretty new. I think I'd try to return it and get a new one.

October 27, 2008

The Samsung RMA center in NJ had a 2-day turnaround after receiving the DOA drives I had. Just FYI.

Not bad considering WD took 8 days earlier this month to turn around one drive for me. WD sent back a newer revision of the drive with half the platters, which was very nice though!

October 27, 2008

I always do a cross ship. It costs nothing except a temporary charge on your credit card.

In addition, you get their packing material which sort of guarantees that they cannot complain about inadequate packing.

October 27, 2008

Additional thoughts on this issue from a technical perspective ...

DigiMagic had some bad sectors, but they were found in the context of running a parity check. This is the best situation. What happens (or should happen according to my understanding) is:

1. A read error occurs (e.g., disk 2, sector 1000). The drive tries and retries and uses other features at its disposal (e.g., recalibration) to try to read the data.

2. If not successful, the drive marks the sector as "bad", meaning that it needs to be remapped, and informs the caller (in this case unRAID) that the read failed.

3. unRAID notices the read error, and increments the "error" count on the drive (shown in far right column in "Main" tab of web site). It likely also increments the sync error count.

4. unRAID reads the corresponding sector 1000s from the other disks (including parity)

5. unRAID reconstructs what it should have read on disk 2 sector 1000

6. unRAID writes the recomputed data back to disk 2 sector 1000

7. The drive notices that a write has occurred on a sector requiring remapping. Instead of just writing back to the sector it already knows is bad, it performs the sector remap first, and then writes the reconstructed data to a shiny new sector

8. unRAID goes on about its merry way checking subsequent sectors ...

The net effect is that the sector error is corrected and no data is lost.
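Steps 4 to 6 can be illustrated with a small Python sketch. It assumes the parity sector is the byte-wise XOR of the matching sectors on all data disks, which is the usual single-parity scheme; the thread itself does not spell out the formula, and the 4-byte "sectors" and the `xor_blocks` helper are invented purely for the example.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte position by byte position."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Toy 4-byte "sectors" standing in for sector 1000 on each data disk.
disk1 = bytes([0x12, 0x34, 0x56, 0x78])
disk2 = bytes([0xDE, 0xAD, 0xBE, 0xEF])

# Parity as it would have been written while both data disks were readable.
parity = xor_blocks([disk1, disk2])

# Disk 2 now returns a read error for this sector, so its contents are
# recomputed from every other disk plus parity (steps 4 and 5).
rebuilt = xor_blocks([disk1, parity])

# Step 6 would write `rebuilt` back to disk 2, sector 1000.
print(rebuilt == disk2)
```

Because XOR is its own inverse, the same helper computes parity and recovers a missing block, which is why one unreadable sector per stripe can be rebuilt without loss.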

A bad sector is likely something that each of us has encountered at one time or another, so one would believe that this type of behavior might be relatively common over the life of an array. However, I have never seen a user post a scenario that sounded like a bad sector was found during a parity check, and have therefore never seen an actual situation where I believe this has been proven to work in practice.

In looking through the smartctl output, the drive has 42 sectors reallocated. There are no "errors" on the drive (errors are sometimes indicative of bad cabling, but can also indicate very serious drive problems). In looking through the log there are (if I counted right), 320 sector errors. If remapping occurs in larger "chunks", then perhaps these numbers are somewhat closer than they seem (if chunk size is 8, 42 * 8 = 336, not too far off from 320. One can rationalize this difference.) I'm just not sure, and it may vary by drive.

So the question remains: if DigiMagic reruns a parity check, will it be 100% clean? If it is, then unRAID and the drive did exactly what they were supposed to do. But DigiMagic says he ran multiple parity checks and the errors keep repeating. (This particular syslog only contains a single parity check.) If this is true, it would be interesting to see which sectors fail on a subsequent (or prior) run. If it is the same sectors, the above procedure may not be working as expected (or something more insidious is happening, like the spare sectors themselves being bad). If the sectors are different, it would tend to indicate that the drive is self-destructing.

I noticed one rather curious thing in looking through the log, concerning the order of the bad sector numbers. I would have expected that they would steadily climb from 0 to some very large number. However, I see chunks of errors starting at these sector numbers: 277778xx, 277781xx, 277768xx, 277791xx. The third chunk of errors is out of order. It should have occurred BEFORE the first set of errors if they were being reported consecutively. Seems odd ...
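A quick way to spot this kind of ordering anomaly in a longer log is sketched below in Python. Only the leading digits of each chunk are used, since the last two digits are elided in the post above; `out_of_order` is a hypothetical helper written for this illustration.

```python
# Leading six digits of each error chunk, in the order they appear in the log.
chunk_prefixes = [277778, 277781, 277768, 277791]


def out_of_order(values):
    """Return (position, value) for every entry smaller than its predecessor."""
    return [
        (index, current)
        for index, (previous, current) in enumerate(zip(values, values[1:]), start=1)
        if current < previous
    ]


print(out_of_order(chunk_prefixes))
```

Positions are zero-based, so the entry at position 2 is the third chunk, the one flagged above as appearing after sectors with higher numbers.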

October 27, 2008

Author

Oh yes, the "long" test usually takes a few hours... not a few days. If you do a smartctl command to get the status it will let you know if it is done.

Yeah but I have to sleep and go to work and take care of eight cats...

A few short things I managed to try: I ran the long and short smartctl tests, then ran 20-25 parity checks (just a few minutes each, to pass over the critical part). I'm still always getting errors.

Most often there are 448 errors, but it varies between ~ 300 and ~ 500.

The reallocated sector count continuously grows and was about 212 last time I checked.

Errors always happen at 1.4% of parity check.

I don't get it... I would understand there are X bad sectors, they are discovered once and isolated and end of story... but why do I constantly get new and new errors, on new sectors? By the way, I did press "Reset statistics" button always when repeating tests.

SMART overall health status is "passed" and unraid shows green circle... so I think disk would probably pass any test they would do in store and refuse to replace it until it dies completely, or at least SMART reports the equivalent of "almost dead".

My theory is that there is some bug in disk's firmware or logic chips, triggered by sequential accessing of sectors and particularly those corresponding to position of 1.4%... (I work in R&D of industrial LCD panels, so I know that bugs in firmware do happen... occasionally)

I'll post logs in a day or two... and search for updated firmware from Samsung, or some low-level disk check utility...

October 28, 2008

Reallocated sector count continuously grows and was about 212 last time I checked.

That's cause enough for an RMA right there....

October 28, 2008

If you want my opinion bin the drive and get another. Cost vs. risk this is a no brainer to me.

Abandon that disk. RMA or bin it.

October 28, 2008

If the long test reports an LBA error and keeps on reporting it, then the drive needs to be RMA'ed.

If the reallocated sectors continuously grow, the drive will run out of spare sectors and the SMART status will change to FAILED.

I would like to see the SMART log for a long test.

I would suggest booting from the manufacturer's diagnostic floppy and running their tests.

It's possible for a bad spot to develop on the disk and grow over time. If the format has problems and the heads cannot track, subsequent neighbor sectors can be affected. I've seen this happen. If reallocated sectors just continue to grow, RMA the drive.

When they stop, scrub the drive.

You can scrub the drive with the badblocks program in write mode (this will destroy the data, so archive your data first).

You can also scrub it with shred, forcing 0's all over the drive.

Another tool is DBAN, which is a bootable floppy for erasing a drive, but it can write patterns all over the drive, thus forcing a reallocation.

I'll go back to this: boot from the vendor's diagnostic disk, and if it reports error codes, use them for the RMA process.

October 29, 2008

Author

Attached smartctl log after long test (if I did it right: -t long to activate the test and -l selftest to print results).

I'll test the drive over weekend with Samsung's dedicated utility and get it replaced afterwards. Or just get a new drive if for some reason even Samsung says it passes...

October 29, 2008

This drive is headed for failure; see some of the segments I cut and pasted.

Notice the LONG test shows a read error.

Notice how the drive says it passed the health test yet there are still conditions present for failure.

This means that it's possible to get past them over time. What needs to happen is a write to every sector with a tool.

dd will work; badblocks in destructive mode will work too.

If you have data on this drive move it off.

Notice Current_Pending_Sector is > 0.

This means a sector cannot be read. It will be reallocated when written to. This is the reason to write to every sector.

If you could calculate the sector and write to it, it would be remapped.
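As a rough illustration of writing to one specific sector, the Python sketch below only builds and prints a dd command line; it does not run it. It assumes 512-byte logical sectors, uses the LBA reported in the self-test log below, and `/dev/sdX` is a placeholder device name. `zero_sector_command` is a hypothetical helper. Running such a command overwrites whatever was stored in that sector, and writing to the raw device bypasses the array, so parity would no longer match afterwards.

```python
import shlex

SECTOR_SIZE = 512  # bytes; assumed logical sector size for this drive


def zero_sector_command(device, lba, sector_size=SECTOR_SIZE):
    """Build (but do not run) a dd command that writes zeros to one sector."""
    return [
        "dd",
        "if=/dev/zero",
        f"of={device}",        # placeholder device, replace with care
        f"bs={sector_size}",   # one block equals one sector
        "count=1",             # write exactly one block
        f"seek={lba}",         # skip this many blocks on the output device
        "oflag=direct",        # GNU dd: bypass the page cache
    ]


command = zero_sector_command("/dev/sdX", 27776974)
print(shlex.join(command))
```

Printing the command first gives a chance to double-check the device name and sector number before anything destructive is typed into a shell.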

SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00)	Offline data collection activity
				was never started.
				Auto Offline Data Collection: Disabled.
Self-test execution status:      ( 121)	The previous self-test completed having
				the read element of the test failed.

196 Reallocated_Event_Count 0x0032   095   095   000    Old_age   Always       -       212
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       2
198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       1

SMART Self-test log structure revision number 0
Warning: ATA Specification requires self-test log structure revision number = 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure       90%       109         27776974
# 2  Short offline       Aborted by host               20%       106         -
# 3  Short offline       Aborted by host               20%       106         -
# 4  Short offline       Aborted by host               20%       106         -
# 5  Short offline       Aborted by host               20%       106         -
# 6  Extended offline    Aborted by host               90%        89         -
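The LBA of the first error can be cross-checked against the point where the parity check errors show up. The Python sketch below assumes the drive exposes 1,953,525,168 sectors of 512 bytes, a typical figure for 1 TB drives that is not stated anywhere in this thread, and that the parity check walks the disk linearly from the first sector.

```python
# Assumed total sector count for a typical 1 TB drive (not taken from the thread).
TOTAL_SECTORS = 1_953_525_168

# LBA_of_first_error from the extended self-test entry in the log above.
first_error_lba = 27_776_974

# Position of the failing sector as a percentage of the whole disk.
percent = 100 * first_error_lba / TOTAL_SECTORS
print(f"{percent:.2f}% of the way into the disk")
```

Under those assumptions the result lands at roughly 1.4%, which agrees with the point where the parity check errors were reported and suggests the self-test and the parity check are hitting the same region.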

Now if you can hold off on RMA'ing the drive for a few days, I would like to provide a Nagios plugin to test the drive's health and see if it comes up with a warning. I mention this because I want to include it as part of a Nagios or embedded shell test suite for drive health.

It won't write to the drive; it will just test it (very short) and then report output on stdout, which I'll need you to capture.

I don't have any bad drives, so I can't test it out. LOL!

You game?

October 30, 2008

Your last SMART report on sdc used smartctl v5.36, the earlier one used v5.38. Although it is still easy to see the increasingly bad numbers, it would be better to always use v5.38. If you turn these in to a Samsung rep, send a copy of the earlier and later ones, so they can confirm for themselves the rapid degradation.

October 30, 2008

Please try this smart tool. It's from the nagios suite.

I'm curious if it will flag the drive in a warning condition or not.

Usage examples below

root@Atlas /usr/libexec #./check_ide_smart -d /dev/sda
Id=  1, Status=15 {PreFailure , OnLine }, Value=117, Threshold=  6, Passed
Id=  3, Status= 3 {PreFailure , OnLine }, Value= 92, Threshold=  0, Passed
Id=  4, Status=50 {Advisory    , OnLine }, Value=100, Threshold= 20, Passed
Id=  5, Status=51 {PreFailure , OnLine }, Value=100, Threshold= 36, Passed
Id=  7, Status=15 {PreFailure , OnLine }, Value= 58, Threshold= 30, Passed
Id=  9, Status=50 {Advisory    , OnLine }, Value= 97, Threshold=  0, Passed
Id= 10, Status=19 {PreFailure , OnLine }, Value=100, Threshold= 97, Passed
Id= 12, Status=50 {Advisory    , OnLine }, Value=100, Threshold= 20, Passed
Id=184, Status=50 {Advisory    , OnLine }, Value=100, Threshold= 99, Passed
Id=187, Status=50 {Advisory    , OnLine }, Value=100, Threshold=  0, Passed
Id=188, Status=50 {Advisory    , OnLine }, Value=100, Threshold=  0, Passed
Id=189, Status=58 {Advisory    , OnLine }, Value= 99, Threshold=  0, Passed
Id=190, Status=34 {Advisory    , OnLine }, Value= 68, Threshold= 45, Passed
Id=194, Status=34 {Advisory    , OnLine }, Value= 32, Threshold=  0, Passed
Id=195, Status=26 {Advisory    , OnLine }, Value= 51, Threshold=  0, Passed
Id=197, Status=18 {Advisory    , OnLine }, Value=100, Threshold=  0, Passed
Id=198, Status=16 {Advisory    , OffLine}, Value=100, Threshold=  0, Passed
Id=199, Status=62 {Advisory    , OnLine }, Value=200, Threshold=  0, Passed
OffLineStatus=130 {Completed}, AutoOffLine=Yes, OffLineTimeout=10 minutes
OffLineCapability=123 {Immediate Auto SuspendOnCmd}
SmartRevision=10, CheckSum=12, SmartCapability=3 {SaveOnStandBy AutoSave}
root@Atlas /usr/libexec #./check_ide_smart -d /dev/sda -q

October 31, 2008

Author

I apologize for first asking for help, then being not too responsive in last few days, but I've got some very bad news at home.

It is no problem to keep the disk for several more weeks, I am not particularly pressed to replace it. I'll try the nagios tool and report back.

I can also simply keep the disk for experiments and add a new one, if that can be of some help - I planned anyway to buy registered version of unraid, even if staying at just 3 disks. (Edit: bought it.)

smartctl 5.36 must have been picked somewhere from unraid USB stick... I've put 5.38 into root, but when doing later report, forgot to enter full path to new executable /boot/smartctl.

October 31, 2008

No problem, no rush. Let's see if the Nagios plugin detects the conditions we want to be alerted to.

Take care.

November 3, 2008

Author

Please try this smart tool. It's from the nagios suite.

I'm curious if it will flag the drive in a warning condition or not.

Report attached.

"-q" option, I guess, returns a number of failed tests as a program return code, but I cannot recall how to capture that into a file. Nothing is printed on console (stdout).
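On capturing a return code into a file: in a shell the exit status of the last command is available as `$?`, and the same thing can be done from Python as sketched below. The helper name `run_and_log` and the log file name are made up for this example, and a tiny stand-in command that exits with status 2 is used so the sketch runs anywhere; on the server the command list would be something like `["./check_ide_smart", "-d", "/dev/sda", "-q"]`, as in the usage example earlier in the thread.

```python
import os
import subprocess
import sys
import tempfile


def run_and_log(command, log_path):
    """Run a command, then save its exit code and any stdout text to a file."""
    result = subprocess.run(command, capture_output=True, text=True)
    with open(log_path, "w") as log:
        log.write(f"command: {' '.join(command)}\n")
        log.write(f"exit code: {result.returncode}\n")
        log.write(result.stdout)
    return result.returncode


# Stand-in for the real plugin: a child process that prints nothing and exits with 2.
demo_command = [sys.executable, "-c", "import sys; sys.exit(2)"]
demo_log = os.path.join(tempfile.gettempdir(), "check_ide_smart_demo.log")

demo_code = run_and_log(demo_command, demo_log)
print(demo_code)
```

The log file then holds the exit code even when the tool prints nothing on stdout, which is the situation described above.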

Let me know if you want me to do more tests, otherwise I'll try your recommendation to write to every sector of the drive.

November 3, 2008

Let me know if you want me to do more tests, otherwise I'll try your recommendation to write to every sector of the drive.

Thanks, that was informative, although not in the way I was hoping.

In the meantime, if it were my drive, I might do a badblocks test on the whole drive to force writing to every sector.

I'm doing this now with some Seagate 1TB drives (it's taking forever), but I wanted to use the hard drives for a while before actually putting data on them.

On mine I did a badblocks -w /dev/sd? where ? = drive letter.

Note the -w is a destructive write. You can also do a -n if you do not want it to be destructive.

In my case I wanted to overwrite everything on purpose, erasing what was once there.

This takes a long time. You may want to add the -s parameter to show progress. I wish I had, LOL!

Man page here: http://linux.die.net/man/8/badblocks
