parity errors, possibly the same ones recurring from test to test



I've updated to beta15, and I'm still seeing 5 parity errors on every parity check.

 

How do I figure out what these 5 errors are, and either fix them, or tell unRAID to stop telling me about them?  I don't think unRAID should keep reporting the same errors constantly, should it?

Link to comment

What you're seeing is consistent with doing a non-correcting check -- in that case the system SHOULD keep telling you about the errors.

 

But since you've been doing CORRECTING checks, something else is at play here ... and it's not at all clear what that might be.

 

Unfortunately you indicated you don't have MD5's of your data; and I gather you also don't have good backups.  So you really can't tell just which files are having issues.

 

I'd do the following:

 

(1)  Generate MD5's for ALL of your files [This will take a LONG time] -- see the command-line sketch after this list.

 

(2)  Re-generate your parity disk ... either via a New Config; or by: Stopping the array; unassigning the parity drive (so no parity is assigned); Starting the array (so parity shows as missing); then Stopping the array and re-assigning the parity drive; and Starting the array one more time.

 

(3)  After parity is regenerated, do a parity check.  If you get errors this time, run a validation for all of the MD5's and see what has changed.
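For step (1), if you'd rather work from the unRAID console than from Windows, here's a minimal sketch using the standard find and md5sum utilities (the /boot/md5 output location and the per-disk file names are just assumptions on my part):

# Create one "hash  path" line per file on disk1; repeat for each data disk.
mkdir -p /boot/md5
cd /mnt/disk1
find . -type f -exec md5sum {} + > /boot/md5/disk1.md5

Writing the checksum files to the flash drive (/boot) keeps them off the array itself, so a parity problem can't silently corrupt the very checksums you'll use to diagnose it.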

 

Link to comment

Thanks for the advice.  I wonder if unRAID just doesn't see/recognize the correcting bit, so I just started a non-correcting parity check; when that finishes, I'll run another correcting check and see if it's any better.

 

Hopefully someone from LT can/will chime in about this before I need to start with your suggestions.  However, if that doesn't happen, can you suggest how to generate the MD5 data for my array?

Link to comment

It does "seem" like the parity check is simply not doing any corrections ... so you continuously get the same 5 errors.    But clearly you've tried this -- and if you followed my suggestions quite a bit earlier already tried toggling the bit (although letting it run to completion may have a different impact).

 

As for generating MD5's => I do this from Windows using the excellent Corz checksum utility.    There are a few folks on the forum who have Linux-based utilities to do this -- they may chime in to give you that option, although I think Corz works just fine for me.    http://corz.org/windows/software/checksum/

Link to comment

The intentionally non-correcting parity check showed 5 errors (no surprise).  Then running a correcting parity check still shows 5 errors.

 

Tom or jonp; can you give any advice or suggestions here?

 

Thanks

Are you saying that multiple correcting parity checks in a row all show the same errors? The statement above makes it sound like you didn't do another check after the correcting one to see if the errors had been corrected successfully.
Link to comment

The intentionally non-correcting parity check showed 5 errors (no surprise).  Then running a correcting parity check still shows 5 errors.

This would be expected, as the first run did not do any corrections; the second one would find the same errors but this time should correct them.

 

What is unexpected is any errors showing up in parity checks after doing the correcting parity check.

Link to comment

There are 2 parity checks in the log.

 

The first was non-correcting.

May  5 22:13:04 media kernel: mdcmd (122): spindown 10
May  5 22:42:08 media kernel: mdcmd (123): check NOCORRECT
May  5 22:42:08 media kernel: md: recovery thread woken up ...
May  5 22:42:08 media kernel: md: recovery thread checking parity...
May  5 22:42:08 media kernel: md: using 2048k window, over a total of 3907018532 blocks.
May  6 01:30:01 media logger: mover started
...
May  6 01:31:58 media logger: mover finished
May  6 05:16:50 media kernel: md: parity incorrect, sector=3519069768
May  6 05:16:50 media kernel: md: parity incorrect, sector=3519069776
May  6 05:16:50 media kernel: md: parity incorrect, sector=3519069784
May  6 05:16:50 media kernel: md: parity incorrect, sector=3519069792
May  6 05:16:50 media kernel: md: parity incorrect, sector=3519069800
May  6 06:43:44 media kernel: mdcmd (124): spindown 7
May  6 10:54:10 media kernel: mdcmd (125): spindown 2
May  6 10:54:10 media kernel: mdcmd (126): spindown 3
May  6 10:54:11 media kernel: mdcmd (127): spindown 4
May  6 10:54:11 media kernel: mdcmd (128): spindown 6
May  6 10:54:12 media kernel: mdcmd (129): spindown 8
May  6 10:54:12 media kernel: mdcmd (130): spindown 9
May  6 12:32:10 media kernel: mdcmd (131): spindown 9
May  6 12:32:12 media kernel: mdcmd (132): spindown 6
May  6 13:09:54 media kernel: md: sync done. time=52066sec
May  6 13:09:54 media kernel: md: recovery thread sync completion status: 0
May  6 13:39:55 media kernel: mdcmd (133): spindown 5

May  9 18:38:05 media kernel: mdcmd (300): spindown 3
May  9 20:15:43 media kernel: mdcmd (301): check CORRECT
May  9 20:15:43 media kernel: md: recovery thread woken up ...
May  9 20:15:43 media kernel: md: recovery thread checking parity...
May  9 20:15:43 media kernel: md: using 2048k window, over a total of 3907018532 blocks.
May  9 20:55:27 media emhttp: read_line: client closed the connection
May  9 20:55:29 media emhttp: read_line: client closed the connection
May 10 01:30:01 media logger: mover started
...
May 10 01:30:03 media logger: mover finished
May 10 02:52:45 media kernel: md: correcting parity, sector=3519069768
May 10 02:52:45 media kernel: md: correcting parity, sector=3519069776
May 10 02:52:45 media kernel: md: correcting parity, sector=3519069784
May 10 02:52:45 media kernel: md: correcting parity, sector=3519069792
May 10 02:52:45 media kernel: md: correcting parity, sector=3519069800
May 10 04:19:16 media kernel: mdcmd (302): spindown 7
May 10 04:40:01 media apcupsd[3614]: apcupsd exiting, signal 15
May 10 04:40:01 media apcupsd[3614]: apcupsd shutdown succeeded
May 10 04:40:04 media apcupsd[20808]: apcupsd 3.14.13 (02 February 2015) slackware startup succeeded
May 10 04:40:04 media apcupsd[20808]: NIS server startup succeeded
May 10 08:26:35 media kernel: mdcmd (303): spindown 2
May 10 08:26:36 media kernel: mdcmd (304): spindown 3
May 10 08:26:36 media kernel: mdcmd (305): spindown 4
May 10 08:26:37 media kernel: mdcmd (306): spindown 6
May 10 08:26:37 media kernel: mdcmd (307): spindown 8
May 10 08:49:20 media kernel: mdcmd (308): spindown 9
May 10 09:18:31 media shfs/user: shfs_rmdir: rmdir: /mnt/cache/video/Racing/Formula 1/_UNPACK_formula1 (39) Directory not empty
May 10 09:18:31 media shfs/user: shfs_rmdir: rmdir: /mnt/cache/video/Racing/Formula 1/_UNPACK_formula1 (39) Directory not empty
May 10 09:18:31 media shfs/user: shfs_rmdir: rmdir: /mnt/cache/video/Racing/Formula 1/formula1 (39) Directory not empty
May 10 09:26:45 media kernel: mdcmd (309): spindown 8
May 10 09:26:54 media kernel: mdcmd (310): spindown 3
May 10 09:28:41 media kernel: mdcmd (311): spindown 6
May 10 09:41:54 media kernel: mdcmd (312): spindown 9
May 10 09:46:39 media kernel: mdcmd (313): spindown 2
May 10 10:43:53 media kernel: md: sync done. time=52089sec
May 10 10:43:53 media kernel: md: recovery thread sync completion status: 0
May 10 10:53:30 media kernel: mdcmd (314): spindown 3

And the second was correcting. It found the same sectors that the previous non-correcting check did.

 

So there is actually no syslog evidence that a subsequent parity check would still find parity errors. I guess you will have to do another so we can see if they are still there and are on the same sectors.
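When the next check finishes, something like this will pull just the parity lines out of the syslog, so the sector numbers from each run can be compared side by side:

grep -E "parity incorrect|correcting parity" /var/log/syslog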

Link to comment

Wow, this is bizarre.  I have the exact same problem, but what's even more amazing is that my 5 errors were at the same exact 5 sectors!  I thought I would have found the bad drive while upgrading from ReiserFS to XFS and upgrading my parity drive from a 4TB to a 5TB, but now I have 5 sectors incorrect in a different location constantly. During my FS upgrade, I changed two drives at a time and did parity checks, with the same errors each time. This is not a bad drive problem at all, but some kind of error caused by the hardware or software.

 

Before:

Apr 4 18:28:39 unRAID kernel: md: correcting parity, sector=3519069768

Apr 4 18:28:39 unRAID kernel: md: correcting parity, sector=3519069776

Apr 4 18:28:39 unRAID kernel: md: correcting parity, sector=3519069784

Apr 4 18:28:39 unRAID kernel: md: correcting parity, sector=3519069792

Apr 4 18:28:39 unRAID kernel: md: correcting parity, sector=3519069800

 

Now:

Apr 7 07:57:11 unRAID kernel: md: correcting parity, sector=1177606472

Apr 7 07:57:11 unRAID kernel: md: correcting parity, sector=1177606480

Apr 7 07:57:11 unRAID kernel: md: correcting parity, sector=1177606488

Apr 7 07:57:11 unRAID kernel: md: correcting parity, sector=1177606496

Apr 7 07:57:11 unRAID kernel: md: correcting parity, sector=1177606504

 

 

I have done at least 10 parity checks with the first set of bad sectors and about 5 with the second.  I hope we can get to the bottom of this.

Link to comment

Wow, this is bizarre.  I have the exact same problem, but what's even more amazing is that my 5 errors were at the same exact 5 sectors! [...] I have done at least 10 parity checks with the first set of bad sectors and about 5 with the second.

So we have another report with the same number of sectors and the same sector numbers.

 

I have 4TB parity also, and never have parity errors. I am on unRAID 6b15 currently, but have had no parity errors going back a few years, including most of that time on v5. Most of that time was ReiserFS (of course), but currently all my array drives are XFS. If it is software related, then it seems like it must be software that is tied to specific hardware. So we might as well call it hardware related and compare our hardware.

 

I have 6 array drives including parity. All of these are WD Red 4 or 3 TB. I have 4 of those drives on ASUS H87-I mobo, and 2 on Rosewill RC-218. (2 mobo ports are SSD cache pool so not relevant, other 2 RC-218 ports currently unused.)

 

What about your hardware? How does it compare?

 

JustinChase, is your sig up to date re: hardware?

Link to comment

 

I have done at least 10 parity checks with the first set of bad sectors and about 5 with the second.  I hope we can get to the bottom of this.

 

Could you expand on this statement?  Did you first get the first set of bad sectors ten times in a row while running a CORRECTING parity check, and then the second set of bad sectors five times in a row while running the CORRECTING parity check?  If this was the case, did you change the hardware configuration between the two sets?  Or was it something else?  What did you do to attempt to resolve the situation, since you apparently knew (or felt) that something was not right?

 

Link to comment

Frank1940,

 

Every time I ran a correcting parity check -- about 10 times over the last year -- I got the same invalid sectors. I even changed parity drives; the location of the 5 changed, but it was still constant with every check. Earlier this year I upgraded to unRAID 6 and XFS, so I had to swap each drive out after copying its data to a newly XFS-formatted drive, running a parity calc and check each time to see if that drive was the bad one, but after doing all 10 drives the problem still persists.  I also replaced the RAM just in case.

 

My system consists of an Intel DZ75ML-45K motherboard with an i5-2320 CPU, 8GB of RAM, a Dell Perc 310 flashed to LSI firmware in IT mode, 10 data drives in a 3TB and 4TB Western Digital and Hitachi mix, a 5TB Toshiba parity drive, and a SanDisk 480GB SSD cache drive. I'm now running the latest unRAID 6 beta.

 

I just ran a parity check with the same 5 errors and am running it again to see what happens. It takes about an hour and a half for the errors to show up.

 

Link to comment

Justin -- did you generate MD5's for all your data?

 

I know it's time consuming, but it's the only way (short of a complete set of compares with a set of backups) to isolate exactly what files are being impacted.

 

Even if you currently have some "bad" files, generating a complete set of MD5's will at least provide a way to isolate what changes with future checks.

 

There are a variety of ways to do this -- I use the excellent Windows-based Corz Checksum utility => Just "point" it to one of your disks; right-click; and choose "Create Checksums" ... for a full multi-TB disk it will take a LONG time (a day or so) -- repeat for each disk until you have a complete set generated.

 

Then run a parity check and see if you get errors (I assume you will) ... and then you just "point" to each disk and choose "Verify Checksums".    Again, this will take a long time (less than the creation did) => but at least you'll know which files on which disk changed.
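For anyone doing this from the unRAID console instead of Corz, the verify step has a direct md5sum counterpart (same placeholder paths as the generation sketch earlier in the thread):

cd /mnt/disk1
# Prints only the files whose checksums no longer match:
md5sum --quiet -c /boot/md5/disk1.md5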

 

If you repeat this process a couple times, you may be able to isolate where the errors are occurring.  With any luck, they're always on the same disk ... in which case you can simply replace that disk to resolve it.

 

I agree, however, that this is a STRANGE issue => it absolutely "feels" like the system isn't doing a correcting check.

 

Link to comment

Interesting that this was a weird memory issue in the discussion from 3 years ago -- I suppose it's likely that may be the same thing here.    Strange that it would always result in the same # of errors, however.

 

Justin => certainly wouldn't hurt to just try a different set of RAM modules and see if the issue "goes away"  :)

Link to comment

I don't think this is a memory problem. There's almost no way that JustinChase and I could have the exact same 5 errors in the same locations. My 5 errors changed location when I changed to a 5TB parity drive, and have been consistent with every check since. I'm going to install a 4TB and see if my errors change back.

Link to comment

Lots of good information, everyone -- thanks.  I'm working 14-hour days right now, so I don't have much time to deal with this, but as suggested, I'm running one more correcting parity check to confirm I still see 5 errors.

 

I'm shocked that someone has the same errors in the same sectors; that can't be a coincidence.

 

I believe my signature is current.  I posted details on all my drives a few months ago, and I'll see if I can find a good list of all drive details to add here also. (added screenshot of all drives, not including cache)

 

I'll keep checking this thread, and once the latest parity check is complete, I'll start creating MD5 checksums of my disks.  It can't hurt, and may help with other issues in the future.  However, I'm not convinced it will help with this issue.  It seems like unRAID is just reporting the same errors over and over, not that I have 5 errors moving around on my disks/files.  Meaning: the MD5 checksums are unlikely to change if it's an unRAID issue, as I suspect.

 

But I will start generating them, as more good information is usually a very good thing.

 

Maybe I need to PM Tom about this issue?

[attached screenshot: drives.png]

Link to comment

Interesting that this was a weird memory issue in the discussion from 3 years ago -- I suppose it's likely that may be the same thing here.    Strange that it would always result in the same # of errors, however.

 

Justin => certainly wouldn't hurt to just try a different set of RAM modules and see if the issue "goes away"  :)

 

Actually, reading to the end, he had a bent pin on his CPU causing his issues, not a memory issue as he first suspected.

Link to comment

Interesting that this was a weird memory issue in the discussion from 3 years ago [...]

Actually, reading to the end, he had a bent pin on his CPU causing his issues, not a memory issue as he first suspected.

 

Actually, I just noticed that -- I was re-reading the thread in more detail before pointing you to it again ... obviously I skimmed it a bit too quickly the first time  :)

 

In any event, this is definitely a STRANGE issue.

 

MD5's should, however, help identify WHERE the errors are.  An MD5 verification won't succeed if any bits have changed in a file ... so if there are truly errors, it should identify them.  UNLESS the errors are always on the parity drive (certainly possible) => in which case your data is all fine, but your ability to do a good rebuild of a failed disk is compromised.

 

 

Link to comment

Did anyone else notice that the difference between consecutive sector numbers was 8 in ALL the cases?  Of course, everyone has already observed that the number of failures is always 5. 
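For what it's worth, the spacing of 8 is easy to account for if you assume the kernel's standard 512-byte sector units:

3519069776 - 3519069768 = 8 sectors
8 sectors x 512 bytes   = 4096 bytes = one 4 KiB block
5 blocks  x 4 KiB       = 20 KiB, all contiguous

So each log line is really one bad 4 KiB block, and the five of them sit back to back -- which looks much more like a single corrupted 20 KiB stretch than five independent bad sectors.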

 

Reading this thread makes me feel I have just stepped into a story from the Twilight Zone...  Three cases in which the indicators of failure are virtually identical, even though the apparent cause -- bad sectors on a hard disk -- should be a random event, not a repeatable symptom.  Wally even replaced his parity drive, and while the sectors involved changed, the rest of the details are identical!

Link to comment
