One parity error at every check



This is the second time that I have seen this (one and only) parity error.

 

It must have been around two months ago when it appeared for the first time.

At the very beginning of the parity check I get one parity error.

 

The most recent (monthly) parity check also threw this error.

It seems that the parity error is not actually getting fixed - or what?

 

How can I find out which drive is causing it, and why it isn't corrected permanently?

I will look for the syslog and attach it when I'm back home.

 

 

 

Link to comment
  • 2 weeks later...

Here is the syslog.

There is no pointer to any kind of error in it, AFAIK.

 

May  1 21:40:14 Tuerke kernel: md: correcting parity, sector=65680
.
.
.
.
May  2 12:05:32 Tuerke kernel: md: sync done. time=51919sec
May  2 12:05:32 Tuerke kernel: md: recovery thread sync completion status: 0

 

But this sector 65680 is incorrect at every parity check.

 

syslog-20140503-031845_correcting_parity.txt

Link to comment

Probably a memory error - a very subtle one. 

 

If a memory error causes a block to show up as a parity error, unRAID will incorrectly update parity. It makes sense that the very next check will find this mistake (presumably the memory error will not occur twice at the same place) and put things right again. Make sense?
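To illustrate with the XOR parity rule (a toy example, not unRAID code - two hypothetical single-byte data "drives"):

# parity is the XOR of the data blocks
d1=$((0xA5)); d2=$((0x3C))
good=$(( d1 ^ d2 ))            # 0x99 - the parity that belongs on disk
seen=$(( d1 ^ 0x01 ))          # d1 as mis-read through a one-bit RAM error
bad=$(( seen ^ d2 ))           # 0x98 - the "correction" the faulty check writes
printf 'good=%#x bad=%#x\n' $good $bad
# the next, clean check recomputes 0x99, sees 0x98 on the parity disk,
# and reports exactly one sync error there - which it then fixes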

Link to comment

Which drive?

Data or parity?

 

As far as I understand the unRAID philosophy, the data drives must be OK;

otherwise there would be a read error.

In that case unRAID would rewrite the information on that drive (reconstructed from parity).

Hopefully that isn't reported as a "parity error" as well.

 

It seems I have a permanent parity error at one and the same location.

Obviously all drives are readable, but the information at that location is either changing or not being written permanently to the parity drive.

 

Should I run an extended SMART self-test (smartctl -t long) on the array drives?
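For each drive that would be something like this, I suppose (/dev/sdb standing in for each array member):

smartctl -t long /dev/sdb    # start the extended self-test in the background
smartctl -a /dev/sdb         # check later: self-test status and error log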

 

 

Link to comment

I had a recurring error: 128 sync errors at each parity check, at the same location, involving parity and (at the time) drive 10.  I replaced parity and drive 10 - still the same 128 errors, but on a different array drive this time.  So I updated my RES2SV240 firmware and my M1015 firmware, and no more sync errors so far.  I'm sure you have your firmware updated, since I used your contribution for a Perc H310 last night, but you might try updating the controller and the SAS expander if you use one.

Link to comment

Makes as much sense as anything. I do not believe this is an unRAID issue or defect. It seems hardware/firmware related. It also seems very odd that it would occur at a single specific location.

 

If you are technically oriented you could convert the block number to a disk offset and use the 'dd' command to read the contents of each disk at that location repeatedly, looking for variability in the returned results. I can't think of any other way to diagnose further.
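A sketch of what that could look like, assuming the reported sector is a 512-byte offset from the start of the disk (if it is relative to the partition, add the partition's start sector first - fdisk -lu shows it):

# read a 1 MiB window at the suspect sector 50 times and hash it;
# /dev/sdX is a placeholder - repeat this for every drive in the array
for i in $(seq 1 50); do
    dd if=/dev/sdX bs=512 skip=65680 count=2048 2>/dev/null | md5sum
done
# all 50 hashes should be identical; an odd one out points at the flaky drive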

Link to comment

As bjp999 suggests, there might be an issue with a drive not always returning the same data each time a block is read.

 

I had a battle with this that is reported here:

 

http://lime-technology.com/forum/index.php?topic=11515.msg109840#msg109840

 

though this sort of thing appears to be very rare, so it might not be your case at all.  If this is the cause, you have to test all the drives by reading the blocks in the region where the parity error is reported.  You do this many times, and if one of those drives has this fault, the read will occasionally return different data (even though the drives are not being written to).

 

Regards,

 

Stephen

 

Link to comment
  • 2 weeks later...

Here I am again - did the monthly parity check and guess what...

Jun  1 23:54:49 Tuerke kernel: md: recovery thread woken up ...
Jun  1 23:54:49 Tuerke kernel: md: recovery thread checking parity...
Jun  1 23:54:49 Tuerke kernel: md: using 2560k window, over a total of 3907018532 blocks.
Jun  1 23:54:50 Tuerke kernel: md: correcting parity, sector=65680
.
.
.

 

:-\

 

Link to comment

Here I am again - did the monthly parity check and guess what...

Jun  1 23:54:50 Tuerke kernel: md: correcting parity, sector=65680

 

:-\

It's not entirely clear from your previous posts, so I would like to revisit the correcting vs. non-correcting question again. I assume, since you state it's the monthly check, that it is non-correcting, despite what the log says. Have you manually started and run to completion a parity check, being absolutely sure that it is a correcting check? I would run two manually started correcting checks in a row so they show up in the same syslog, and then post it.
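(For reference: judging by the "mdcmd (51): check CORRECT" lines in your syslog, the check can also be started from the console, something like

mdcmd check CORRECT      # correcting check
mdcmd check NOCORRECT    # read-only check

which makes it easy to be sure which flavour actually ran.)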
Link to comment

Sorry for being unclear.

It is a correcting check all the time.

I started it for a few seconds just now and the error is there again.

Since it's at the very beginning I won't have to wait long for it.

 

Jun  3 00:38:36 Tuerke kernel: mdcmd (96): check CORRECT
Jun  3 00:38:36 Tuerke kernel: md: recovery thread woken up ...
Jun  3 00:38:36 Tuerke kernel: md: recovery thread checking parity...
Jun  3 00:38:37 Tuerke kernel: md: using 2560k window, over a total of 3907018532 blocks.
Jun  3 00:38:38 Tuerke auto_s3_sleep: HDD activity detected. Active HDDs: 12
Jun  3 00:38:38 Tuerke kernel: md: correcting parity, sector=65680

 

Link to comment
  • 4 weeks later...

Hey guys,

I've been reading through your posts, and it seems this kind of error is an example of why a correcting parity check is not always recommended - although this kind of error seems to be very rare.

Now I'm determined to find the drive that is causing/delivering the flipping bit.

Based on your posted snippets, I put together a script that computes the md5 checksums for my drives.
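(Roughly along these lines - the device list and window size are placeholders, and the offset is exactly what I still need to figure out, see my question below:)

START=0      # the offset (in 512-byte blocks) to read at - still to be determined
COUNT=2048   # a 1 MiB window
for dev in /dev/sd[b-m]; do
    sum=$(dd if=$dev bs=512 skip=$START count=$COUNT 2>/dev/null | md5sum | cut -d' ' -f1)
    echo "$dev $sum"
done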

 

The only thing I need to know now is: where on the drives do I have to search?

Of course, since the error pops up very soon after I press the button, it has to be somewhere between 0 and ???, but I would like to understand what I'm doing here.

While reading up on sectors and blocks I ended up totally confused.

 

dd needs a block location.

By default, dd uses 512-byte blocks.

block size × number of blocks = drive size

 

From the syslog:

May  1 21:39:22 Tuerke kernel: scsi 4:0:0:0: Direct-Access     ATA      WDC WD30EZRX-00M 80.0 PQ: 0 ANSI: 5
May  1 21:39:22 Tuerke kernel: sd 4:0:0:0: Attached scsi generic sg1 type 0
May  1 21:39:22 Tuerke kernel: sd 4:0:0:0: [sdb] 5860533168 512-byte logical blocks: (3.00 TB/2.72 TiB)
May  1 21:39:22 Tuerke kernel: sd 4:0:0:0: [sdb] 4096-byte physical blocks

This is a 3 TB drive.

According to the equation I get 5860533168 blocks × 512 bytes = 3000592982016 bytes; those 512-byte block numbers are logical block addresses (LBAs).

 

OK so far - understood = easy!

 

Now back to my error:

May  1 21:40:12 Tuerke kernel: mdcmd (51): check CORRECT
May  1 21:40:12 Tuerke kernel: md: recovery thread woken up ...
May  1 21:40:12 Tuerke kernel: md: recovery thread checking parity...
May  1 21:40:12 Tuerke kernel: md: using 2560k window, over a total of 3907018532 blocks.
May  1 21:40:14 Tuerke kernel: md: correcting parity, sector=65680

Wait, these 3907018532 blocks are only 2 TB. The parity disk is 4 TB - shouldn't it be roughly 7814037064 blocks?
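(Unless the md driver counts in 1 KiB blocks - classic Linux md reports sizes in 1 KiB units - then the numbers do fit:

3907018532 blocks x 1024 bytes = 4000786976768 bytes ≈ 4 TB
3907018532 blocks x 512 bytes  = 2000393488384 bytes ≈ 2 TB  <- my assumption above

I'm not sure which unit unRAID's md driver uses here, though.)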

 

The next line refers to sector=65680 (of the parity disk, I suppose).

As I've read, the term "sector" comes from CHS addressing.

Is sector 65680 the same as LBA 65680?

That would put it at 65680 × 512 = 33628160 bytes into the disk.

 

And finally, on top of all this comes the file system block size - very nice, more confusion.

While most drives physically have 512-byte blocks (there are newer drives with real 4k physical blocks),

the file system can define its own block size.

That means we get a different number of blocks for the same drive.

 

How does this interfere with my task?

May  1 21:39:34 Tuerke kernel: REISERFS (device md1): journal params: device md1, size 8192, journal first block 18, max trans len 1024, max batch 900, max commit age 30, max trans age 30

Size 8192 - is this supposed to be the block size?  :o
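(One way to query the actual block size of a mounted file system - /mnt/disk1 as an example:

stat -f /mnt/disk1    # the "Block size:" line is the file system block size

but I'm not sure whether the journal "size 8192" above is counted in those blocks.)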

 

May  1 21:40:12 Tuerke kernel: md: using 2560k window, over a total of 3907018532 blocks.

What block size corresponds to this number of blocks?  :o

 

I'm very keen on seeing your answers. I'm lost.

 

And finally:

May  1 21:39:33 Tuerke kernel: unraid: allocating 71544K for 1536 stripes (11 disks)
May  1 21:39:33 Tuerke kernel: md1: running, size: 245117344 blocks
May  1 21:39:33 Tuerke kernel: md2: running, size: 488386552 blocks
May  1 21:39:33 Tuerke kernel: md3: running, size: 1465138552 blocks
May  1 21:39:33 Tuerke kernel: md4: running, size: 1465138552 blocks
May  1 21:39:33 Tuerke kernel: md5: running, size: 1465138552 blocks
May  1 21:39:33 Tuerke kernel: md6: running, size: 390711352 blocks
May  1 21:39:33 Tuerke kernel: md7: running, size: 2930266532 blocks
May  1 21:39:33 Tuerke kernel: md8: running, size: 2930266532 blocks
May  1 21:39:33 Tuerke kernel: md9: running, size: 3907018532 blocks
May  1 21:39:33 Tuerke kernel: md10: running, size: 245117344 blocks

 

Man, I can tell you that I'm really lost!  ???

Link to comment

I had this problem once, back when I used unRAID 4.7.  Every parity check would result in a correction.  I ran SMART checks until I was blue in the face.  Ran memory checks for hours.  Even did file system checks, which would find errors and correct them.  Not sure if you have run those yet to see your results.  This all happened about two years ago.  Since I had ruled out the drives, I upgraded my hardware.  I had a Sempron 140 in a consumer mobo with cheap controller cards (a bunch of two-port ones).  I'm still using all the same drives today, and I have not had a problem since I upgraded to my Supermicro mobo.  Knocking on wood :)

 

 

Link to comment
