JustinChase Posted May 5, 2015
I've updated to beta15, and I'm still seeing 5 parity errors on every parity check. How do I figure out what these 5 errors are, and either fix them or tell unRAID to stop telling me about them? I don't think unRAID should keep reporting the same errors constantly, should it?
garycase Posted May 6, 2015
What you're seeing is consistent with doing a non-correcting check -- in that case the system SHOULD keep telling you about the errors. But since you've been doing CORRECTING checks, something else is at play here ... and it's not at all clear what that might be.
Unfortunately you indicated you don't have MD5s of your data; and I gather you also don't have good backups. So you really can't tell just which files are having issues.
I'd do the following:
(1) Generate MD5s for ALL of your files [This will take a LONG time].
(2) Re-generate your parity disk ... either via a New Config; or by Stopping the array; unassigning the parity drive (so no parity is assigned); Starting the array (so parity shows as missing); then Stopping the array and re-assigning the parity drive; and then Starting the array one more time.
(3) After parity is regenerated, do a parity check. If you get errors this time, run a validation for all of the MD5s and see what has changed.
JustinChase Posted May 6, 2015
Thanks for the advice. I wonder if unRAID just doesn't see/recognize the correcting bit, so I just started a non-correcting parity check; when it's finished, I'll run another correcting check and see if it's any better.
Hopefully someone from LT can/will chime in about this before I need to start with your suggestions. However, if that doesn't happen, can you suggest how to generate the MD5 data for my array?
garycase Posted May 6, 2015
It does "seem" like the parity check is simply not doing any corrections ... so you continuously get the same 5 errors. But clearly you've tried this -- and, if you followed my suggestions from quite a bit earlier, you have already tried toggling the bit (although letting it run to completion may have a different impact).
As for generating MD5s => I do this from Windows using the excellent Corz checksum utility. There are a few folks on the forum who have Linux-based utilities to do this -- they may chime in to give you that option, although Corz works just fine for me.
http://corz.org/windows/software/checksum/
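For anyone who would rather generate the checksums on the server itself, here is a minimal sketch in Python. It is not one of the forum utilities mentioned above, just an illustration using the standard `hashlib` module. The function names and the example paths are assumptions made for this sketch; the manifest uses one `digest  relative/path` line per file, the same layout `md5sum` prints.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of one file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_manifest(root, manifest_path):
    """Hash every regular file under root and write one line per file.

    Returns the number of files hashed. Keep manifest_path outside root,
    otherwise the manifest would be hashed while it is being written.
    """
    root = Path(root)
    count = 0
    with open(manifest_path, "w", encoding="utf-8", errors="surrogateescape") as out:
        for path in sorted(root.rglob("*")):
            if path.is_file() and not path.is_symlink():
                relative = path.relative_to(root).as_posix()
                out.write(f"{md5_of_file(path)}  {relative}\n")
                count += 1
    return count


# Hypothetical usage (both paths are assumptions, adjust to your own layout):
# write_manifest("/mnt/disk1", "/boot/md5/disk1.md5")
```

Run it once per data disk. As with the Windows tool, expect a full multi-TB disk to take a long time, since every byte has to be read.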
JustinChase Posted May 13, 2015
The intentionally non-correcting parity check showed 5 errors (no surprise). Then running a correcting parity check still shows 5 errors.
Tom or jonp, can you give any advice or suggestions here? Thanks
JonathanM Posted May 13, 2015
> The intentionally non-correcting parity check showed 5 errors (no surprise). Then running a correcting parity check still shows 5 errors. Tom or jonp; can you give any advice or suggestions here? Thanks
Are you saying that multiple correcting parity checks in a row all show the same errors? The statement above makes it sound like you didn't do another check after the correcting one to see if the errors had been corrected successfully.
trurl Posted May 13, 2015
Did I overlook it, or is there no syslog in this thread? Doesn't a correcting parity check get logged differently than a non-correcting one?
itimpi Posted May 13, 2015
> The intentionally non-correcting parity check showed 5 errors (no surprise). Then running a correcting parity check still shows 5 errors.
This would be expected, as the first run did not do any corrections and the second one would find the same errors but this time should correct them. What would be unexpected is any errors showing up in parity checks run after the correcting parity check.
JustinChase Posted May 13, 2015
> Did I overlook it, or is there no syslog in this thread? Doesn't a correcting parity check get logged differently than a non-correcting one?
Sorry, I've attached it now: syslog.zip
trurl Posted May 13, 2015
There are 2 parity checks in the log. The first was non-correcting.
May 5 22:13:04 media kernel: mdcmd (122): spindown 10
May 5 22:42:08 media kernel: mdcmd (123): check NOCORRECT
May 5 22:42:08 media kernel: md: recovery thread woken up ...
May 5 22:42:08 media kernel: md: recovery thread checking parity...
May 5 22:42:08 media kernel: md: using 2048k window, over a total of 3907018532 blocks.
May 6 01:30:01 media logger: mover started ...
May 6 01:31:58 media logger: mover finished
May 6 05:16:50 media kernel: md: parity incorrect, sector=3519069768
May 6 05:16:50 media kernel: md: parity incorrect, sector=3519069776
May 6 05:16:50 media kernel: md: parity incorrect, sector=3519069784
May 6 05:16:50 media kernel: md: parity incorrect, sector=3519069792
May 6 05:16:50 media kernel: md: parity incorrect, sector=3519069800
May 6 06:43:44 media kernel: mdcmd (124): spindown 7
May 6 10:54:10 media kernel: mdcmd (125): spindown 2
May 6 10:54:10 media kernel: mdcmd (126): spindown 3
May 6 10:54:11 media kernel: mdcmd (127): spindown 4
May 6 10:54:11 media kernel: mdcmd (128): spindown 6
May 6 10:54:12 media kernel: mdcmd (129): spindown 8
May 6 10:54:12 media kernel: mdcmd (130): spindown 9
May 6 12:32:10 media kernel: mdcmd (131): spindown 9
May 6 12:32:12 media kernel: mdcmd (132): spindown 6
May 6 13:09:54 media kernel: md: sync done. time=52066sec
May 6 13:09:54 media kernel: md: recovery thread sync completion status: 0
May 6 13:39:55 media kernel: mdcmd (133): spindown 5
May 9 18:38:05 media kernel: mdcmd (300): spindown 3
May 9 20:15:43 media kernel: mdcmd (301): check CORRECT
May 9 20:15:43 media kernel: md: recovery thread woken up ...
May 9 20:15:43 media kernel: md: recovery thread checking parity...
May 9 20:15:43 media kernel: md: using 2048k window, over a total of 3907018532 blocks.
May 9 20:55:27 media emhttp: read_line: client closed the connection
May 9 20:55:29 media emhttp: read_line: client closed the connection
May 10 01:30:01 media logger: mover started ...
May 10 01:30:03 media logger: mover finished
May 10 02:52:45 media kernel: md: correcting parity, sector=3519069768
May 10 02:52:45 media kernel: md: correcting parity, sector=3519069776
May 10 02:52:45 media kernel: md: correcting parity, sector=3519069784
May 10 02:52:45 media kernel: md: correcting parity, sector=3519069792
May 10 02:52:45 media kernel: md: correcting parity, sector=3519069800
May 10 04:19:16 media kernel: mdcmd (302): spindown 7
May 10 04:40:01 media apcupsd[3614]: apcupsd exiting, signal 15
May 10 04:40:01 media apcupsd[3614]: apcupsd shutdown succeeded
May 10 04:40:04 media apcupsd[20808]: apcupsd 3.14.13 (02 February 2015) slackware startup succeeded
May 10 04:40:04 media apcupsd[20808]: NIS server startup succeeded
May 10 08:26:35 media kernel: mdcmd (303): spindown 2
May 10 08:26:36 media kernel: mdcmd (304): spindown 3
May 10 08:26:36 media kernel: mdcmd (305): spindown 4
May 10 08:26:37 media kernel: mdcmd (306): spindown 6
May 10 08:26:37 media kernel: mdcmd (307): spindown 8
May 10 08:49:20 media kernel: mdcmd (308): spindown 9
May 10 09:18:31 media shfs/user: shfs_rmdir: rmdir: /mnt/cache/video/Racing/Formula 1/_UNPACK_formula1 (39) Directory not empty
May 10 09:18:31 media shfs/user: shfs_rmdir: rmdir: /mnt/cache/video/Racing/Formula 1/_UNPACK_formula1 (39) Directory not empty
May 10 09:18:31 media shfs/user: shfs_rmdir: rmdir: /mnt/cache/video/Racing/Formula 1/formula1 (39) Directory not empty
May 10 09:26:45 media kernel: mdcmd (309): spindown 8
May 10 09:26:54 media kernel: mdcmd (310): spindown 3
May 10 09:28:41 media kernel: mdcmd (311): spindown 6
May 10 09:41:54 media kernel: mdcmd (312): spindown 9
May 10 09:46:39 media kernel: mdcmd (313): spindown 2
May 10 10:43:53 media kernel: md: sync done. time=52089sec
May 10 10:43:53 media kernel: md: recovery thread sync completion status: 0
May 10 10:53:30 media kernel: mdcmd (314): spindown 3
And the second was correcting. It found the same sectors that the previous non-correcting check did. So there is actually no syslog evidence that a subsequent parity check would still find parity errors. I guess you will have to do another so we can see if they are still there and are on the same sectors.
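Comparing checks by eye gets tedious once a syslog holds several of them. The sketch below pulls the reported sectors out of a syslog and groups them by the `check CORRECT` / `check NOCORRECT` command that started each run. The message formats are taken from the log posted above; the function and variable names are illustrative, and other unRAID versions may word these lines differently.

```python
import re

# Patterns based on the syslog lines posted in this thread.
CHECK_START = re.compile(r"mdcmd \(\d+\): check (CORRECT|NOCORRECT)")
PARITY_EVENT = re.compile(r"md: (?:parity incorrect|correcting parity), sector=(\d+)")


def sectors_per_check(log_lines):
    """Return [(mode, [sector, ...]), ...] with one entry per parity check."""
    checks = []
    for line in log_lines:
        start = CHECK_START.search(line)
        if start:
            checks.append((start.group(1), []))
            continue
        event = PARITY_EVENT.search(line)
        if event and checks:
            checks[-1][1].append(int(event.group(1)))
    return checks


# A few lines copied from the posted syslog, enough to exercise the parser.
sample = [
    "May 5 22:42:08 media kernel: mdcmd (123): check NOCORRECT",
    "May 6 05:16:50 media kernel: md: parity incorrect, sector=3519069768",
    "May 6 05:16:50 media kernel: md: parity incorrect, sector=3519069776",
    "May 9 20:15:43 media kernel: mdcmd (301): check CORRECT",
    "May 10 02:52:45 media kernel: md: correcting parity, sector=3519069768",
    "May 10 02:52:45 media kernel: md: correcting parity, sector=3519069776",
]

checks = sectors_per_check(sample)
for mode, sectors in checks:
    print(mode, sectors)

# True when the two most recent checks reported exactly the same sectors.
same_sectors = len(checks) >= 2 and set(checks[-1][1]) == set(checks[-2][1])
print("last two checks hit the same sectors:", same_sectors)
```

To run it against a real log, pass an open file instead of `sample`, for example `sectors_per_check(open("syslog"))`. Two consecutive CORRECT entries with identical sector lists would be the evidence the thread is still missing.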
Wally Posted May 13, 2015
Wow, this is bizarre. I have the exact same problem, but what's even more amazing is that my 5 errors were at the same exact 5 sectors! I thought I would have found the bad drive while upgrading from ReiserFS to XFS and upgrading my parity drive from a 4TB to a 5TB, but now I have 5 sectors incorrect in a different location constantly. During my FS upgrade, I changed two drives at a time and did parity checks with the same errors each time. This is not a bad drive problem at all but some kind of error caused by the hardware or software.
Before:
Apr 4 18:28:39 unRAID kernel: md: correcting parity, sector=3519069768
Apr 4 18:28:39 unRAID kernel: md: correcting parity, sector=3519069776
Apr 4 18:28:39 unRAID kernel: md: correcting parity, sector=3519069784
Apr 4 18:28:39 unRAID kernel: md: correcting parity, sector=3519069792
Apr 4 18:28:39 unRAID kernel: md: correcting parity, sector=3519069800
Now:
Apr 7 07:57:11 unRAID kernel: md: correcting parity, sector=1177606472
Apr 7 07:57:11 unRAID kernel: md: correcting parity, sector=1177606480
Apr 7 07:57:11 unRAID kernel: md: correcting parity, sector=1177606488
Apr 7 07:57:11 unRAID kernel: md: correcting parity, sector=1177606496
Apr 7 07:57:11 unRAID kernel: md: correcting parity, sector=1177606504
I have done at least 10 parity checks with the first set of bad sectors and about 5 with the second. I hope we can get to the bottom of this.
trurl Posted May 13, 2015
> Wow, this is bizarre. I have the exact same problem, but what's even more amazing is that my 5 errors were at the same exact 5 sectors! I thought I would have found the bad drive while upgrading from ReiserFS to XFS and upgrading my parity drive from a 4TB to a 5TB, but now I have 5 sectors incorrect in a different location constantly. During my FS upgrade, I changed two drives at a time and did parity checks with the same errors each time. This is not a bad drive problem at all but some kind of error caused by the hardware or software.
> Before:
> Apr 4 18:28:39 unRAID kernel: md: correcting parity, sector=3519069768
> Apr 4 18:28:39 unRAID kernel: md: correcting parity, sector=3519069776
> Apr 4 18:28:39 unRAID kernel: md: correcting parity, sector=3519069784
> Apr 4 18:28:39 unRAID kernel: md: correcting parity, sector=3519069792
> Apr 4 18:28:39 unRAID kernel: md: correcting parity, sector=3519069800
> Now:
> Apr 7 07:57:11 unRAID kernel: md: correcting parity, sector=1177606472
> Apr 7 07:57:11 unRAID kernel: md: correcting parity, sector=1177606480
> Apr 7 07:57:11 unRAID kernel: md: correcting parity, sector=1177606488
> Apr 7 07:57:11 unRAID kernel: md: correcting parity, sector=1177606496
> Apr 7 07:57:11 unRAID kernel: md: correcting parity, sector=1177606504
> I have done at least 10 parity checks with the first set of bad sectors and about 5 with the second. I hope we can get to the bottom of this.
So we have another report with the same number of sectors and the same sector numbers.
I have 4TB parity also, and never have parity errors. I am on unRAID 6b15 currently, but have also had no parity errors going back a few years, most of that time on v5. Most of that time was ReiserFS (of course), but currently all my array drives are XFS.
If it is software related, then it seems like it must be software that is related to specific hardware. So we might as well call it hardware related, and compare our hardware.
I have 6 array drives including parity. All of these are WD Red 4 or 3 TB. I have 4 of those drives on an ASUS H87-I mobo, and 2 on a Rosewill RC-218. (2 mobo ports are an SSD cache pool so not relevant; the other 2 RC-218 ports are currently unused.)
What about your hardware? How does it compare? JustinChase, is your sig up to date re: hardware?
Frank1940 Posted May 13, 2015
> I have done at least 10 parity check with the first set of bad sectors and about 5 with the second. I hope we can get to the bottom of this.
Could you expand on this statement? Did you first get the first set of bad sectors ten times in a row while running a CORRECTING parity check, and then the second set of bad sectors five times in a row while running a CORRECTING parity check? If this was the case, did you change the hardware configuration between the two sets? Or was it something else? What did you do to attempt to resolve the situation, since you apparently knew (or felt) that something was not right?
Wally Posted May 13, 2015
Frank1940,
Every time I ran a correcting parity check, for about the last 10 times in the last year, I got the same invalid sectors. I even changed parity drives, and the location of the 5 changed but was still constant with every check. Earlier this year I upgraded to unRAID 6 and XFS. I had to swap each drive out after copying its data to a newly XFS-formatted drive, then ran a parity calc and check to see if that drive was the bad one, but after doing all 10 drives the problem still persists. I also replaced the RAM just in case.
My system consists of an Intel DZ75ML-45K motherboard with an i5-2320 CPU, 8GB of RAM, a Dell Perc 310 flashed to LSI firmware in IT mode, 10 data drives in a 3TB and 4TB Western Digital and Hitachi mix, a 5TB Toshiba parity drive and a SanDisk 480GB SSD cache drive. I'm now running the latest unRAID 6 beta.
I just ran a parity check with the same 5 errors and am running it again to see what happens. It takes about an hour and a half for the errors to show up.
garycase Posted May 13, 2015
Justin -- did you generate MD5s for all your data? I know it's time consuming, but it's the only way (short of a complete set of compares with a set of backups) to isolate exactly what files are being impacted. Even if you currently have some "bad" files, generating a complete set of MD5s will at least provide a way to isolate what is changing with future checks.
There are a variety of ways to do this -- I use the excellent Windows-based Corz Checksum utility => Just "point" it to one of your disks; right-click; and choose "Create Checksums" ... for a full multi-TB disk it will take a LONG time (a day or so) -- repeat for each disk until you have a complete set generated.
Then run a parity check and see if you get errors (I assume you will) ... and then you just "point" to each disk and choose "Verify Checksums". Again, this will take a long time (less than the creation did) => but at least you'll know which files on which disk changed.
If you repeat this process a couple of times, you may be able to isolate where the errors are occurring. With any luck, they're always on the same disk ... in which case you can simply replace that disk to resolve it.
I agree, however, that this is a STRANGE issue => it absolutely "feels" like the system isn't doing a correcting check.
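The "Verify Checksums" step described above can be sketched in Python as well. This is not how the Corz utility works internally; it is only an illustration of the idea. It assumes a manifest file with one `digest  relative/path` line per file (the layout `md5sum` prints), and the function names and example paths are hypothetical.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of one file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        while chunk := handle.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()


def verify_manifest(root, manifest_path):
    """Re-hash every file listed in the manifest.

    Returns (changed, missing): relative paths whose digest no longer
    matches, and relative paths that are no longer present under root.
    """
    root = Path(root)
    changed, missing = [], []
    with open(manifest_path, encoding="utf-8", errors="surrogateescape") as lines:
        for line in lines:
            line = line.rstrip("\n")
            if not line:
                continue
            expected, _, relative = line.partition("  ")
            target = root / relative
            if not target.is_file():
                missing.append(relative)
            elif md5_of_file(target) != expected:
                changed.append(relative)
    return changed, missing


# Hypothetical usage (both paths are assumptions):
# changed, missing = verify_manifest("/mnt/disk1", "/boot/md5/disk1.md5")
```

An empty `changed` list on every disk after a check that still reports 5 errors would point away from the data disks and toward the parity disk or the check itself.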
dgaschk Posted May 14, 2015
See here: http://lime-technology.com/forum/index.php?topic=19936.msg179806#msg179806
trurl Posted May 14, 2015
> See here: http://lime-technology.com/forum/index.php?topic=19936.msg179806#msg179806
Learn something every day around here. Had no idea this issue was already a topic in the wiki.
garycase Posted May 14, 2015
Interesting that this was a weird memory issue in the discussion from 3 years ago -- I suppose it's likely that may be the same thing here. Strange that it would always result in the same # of errors, however.
Justin => certainly wouldn't hurt to just try a different set of RAM modules and see if the issue "goes away".
Wally Posted May 14, 2015
I don't think this is a memory problem. There's almost no way that JustinChase and I could have the exact same 5 errors in the same locations. My 5 errors changed location and are consistent after I changed to a 5TB parity drive. I'm going to install a 4TB and see if my errors change back.
JustinChase Posted May 14, 2015
Lots of good information, everyone; thanks.
I'm working 14 hour days right now, so I don't have much time to deal with this, but as suggested, I'm running one more correcting parity check to confirm I still see 5 errors.
I'm shocked that someone has the same errors in the same sectors; that can't be a coincidence.
I believe my signature is current. I posted details on all my drives a few months ago, and I'll see if I can find a good list of all drive details to add here also. (added screenshot of all drives, not including cache)
I'll keep checking this thread, and once the latest parity check is complete, I'll start creating MD5 checksums of my disks. It can't hurt, and may help with other issues in the future.
However, I'm not convinced it will help with this issue. It seems like unRAID is just reporting the same errors over and over, not that I have 5 errors moving around on my disks/files. Meaning: the MD5 checksums are unlikely to change if it's an unRAID issue, as I suspect. But I will start generating them, as more good information is usually a very good thing.
Maybe I need to PM Tom about this issue?
JustinChase Posted May 14, 2015
> Interesting that this was a weird memory issue in the discussion from 3 years ago -- I suppose it's likely that may be the same thing here. Strange that it would always result in the same # of errors, however. Justin => certainly wouldn't hurt to just try a different set of RAM modules and see if the issue "goes away"
Actually, reading to the end, he had a bent pin on his CPU causing his issues, not a memory issue as he first suspected.
garycase Posted May 14, 2015
> > Interesting that this was a weird memory issue in the discussion from 3 years ago -- I suppose it's likely that may be the same thing here. Strange that it would always result in the same # of errors, however. Justin => certainly wouldn't hurt to just try a different set of RAM modules and see if the issue "goes away"
> Actually, reading to the end, he had a bent pin on his CPU causing his issues, not a memory issue as he first suspected.
Actually I just noticed that -- I was re-reading the thread in more detail before posting again and pointing you to it again ... obviously I skimmed it a bit too quickly the first time.
In any event, this is definitely a STRANGE issue. MD5s should, however, help identify WHERE the errors are. An MD5 verification won't succeed if any bits have been changed in a file ... so if there are truly errors in the data, it should identify them. UNLESS the errors are always on the parity drive (certainly possible) => in which case your data is all fine, but your ability to do a good rebuild of a failed disk is compromised.
Frank1940 Posted May 14, 2015
Did anyone else notice that the difference between consecutive sector numbers was 8 in ALL the cases? Of course, everyone has already observed that the number of failures is always 5.
Reading this thread makes me feel I have just stepped into a story from the Twilight Zone... Three cases in which the indicators of a failure are virtually identical and repeatable, when the apparent cause (bad sectors on a hard disk) should be a random event. Wally even replaced his parity drive, and while the sectors involved changed, the rest of the details are identical!
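The spacing observation above is easy to check with a little arithmetic. The sketch below assumes the logged sector numbers count 512-byte units; the thread does not confirm this, so treat it as an assumption. Under that assumption a step of 8 sectors is 4096 bytes, so the five log lines would describe one contiguous 20 KiB region rather than five scattered errors. `SECTOR_BYTES` and `describe_run` are illustrative names.

```python
SECTOR_BYTES = 512  # assumption: logged sector numbers count 512-byte units


def describe_run(sectors):
    """Summarise a run of logged parity-error sectors."""
    gaps = [later - earlier for earlier, later in zip(sectors, sectors[1:])]
    return {
        "gaps": gaps,
        "evenly_spaced": len(set(gaps)) == 1,
        "bytes_per_entry": gaps[0] * SECTOR_BYTES,
        "start_offset_bytes": sectors[0] * SECTOR_BYTES,
        "region_bytes": len(sectors) * gaps[0] * SECTOR_BYTES,
    }


# Sector numbers copied from the two sets posted in this thread.
first_set = [3519069768, 3519069776, 3519069784, 3519069792, 3519069800]
second_set = [1177606472, 1177606480, 1177606488, 1177606496, 1177606504]

for name, sectors in (("first", first_set), ("second", second_set)):
    print(name, describe_run(sectors))
```

If the 512-byte assumption holds, both sets describe a single 20480-byte block, which is one more reason to compare hardware and configuration rather than hunt for five independent bad sectors.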
JustinChase Posted May 15, 2015
I came home tonight to find the server had shut off; not sure why. When I started it back up, it did not start a parity check, so it must have shut down safely, somehow. I started a new correcting parity check. We'll see how it goes.
JustinChase Posted May 16, 2015
> I came home tonight to find the server had shut off; not sure why. When I started it back up, it did not start a parity check, so it must have shut down safely, somehow. I started a new correcting parity check. We'll see how it goes.
5 errors