Fireball3 Posted May 12, 2014 This is the second time I have seen this (one and only) parity error. It first appeared about two months ago. At the very beginning of the parity check I get one parity error, and the most recent (monthly) parity check threw it again. It seems the parity error is not getting fixed. How can I find out which drive is causing it, and why isn't it corrected permanently? I will look for the syslog and attach it when I'm back home.
Frank1940 Posted May 12, 2014 Is this a 'Correcting Parity Check' or a 'Non-correcting Parity Check'?
Fireball3 Posted May 13, 2014 (Author) It is the correcting check. Had no time to collect the log yesterday; I will do so at the next opportunity.
Fireball3 Posted May 22, 2014 (Author) Here is the syslog. There is no pointer to any kind of error afaik.

May 1 21:40:14 Tuerke kernel: md: correcting parity, sector=65680
...
May 2 12:05:32 Tuerke kernel: md: sync done. time=51919sec
May 2 12:05:32 Tuerke kernel: md: recovery thread sync completion status: 0

But this sector 65680 is incorrect at every parity check.

syslog-20140503-031845_correcting_parity.txt
SSD Posted May 22, 2014 Probably a memory error, a very subtle one. If a memory error causes a block to show up as a parity error, unRaid will incorrectly update parity. It makes sense that the very next check would find this mistake (presumably the memory error will not occur twice at the same place) and put things right again. Make sense?
Fireball3 Posted May 22, 2014 (Author)

"presumably the memory error will not occur twice at the same place"

Probably not, and after the second correcting check the error shouldn't reappear at the same place, but it does!
SSD Posted May 22, 2014 Did you check your drives' SMART attributes? I think you should try again to confirm it is reproducible.
Fireball3 Posted May 22, 2014 (Author) Which drive? Data or parity? As far as I understand the unRaid philosophy, the data drives must be OK, otherwise there would be a read error. In that case unRaid would write the information to that drive again (recalculated from parity). Hopefully that is not called a "parity error" as well. It seems I have a permanent parity error in one and the same location. Obviously all drives are readable, but either the information in that place is changing or it is not written permanently to the parity drive. Should I run an extended smartctl -t on the array drives?
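For reference, checking the attributes that most often correlate with flaky sectors could be sketched like this. `check_smart_attrs` is a made-up helper, `/dev/sdb` is only an example device, and the column layout is the standard smartmontools attribute table (attribute name in field 2, raw value in the last field):

```shell
# Hypothetical helper: flag the SMART attributes most often tied to bad or
# unstable sectors. Reads `smartctl -A` text on stdin and prints any of the
# watched attributes whose raw value (last column) is non-zero; exits 0 if
# anything suspect was found, 1 otherwise.
check_smart_attrs() {
  awk '$2 ~ /^(Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable|UDMA_CRC_Error_Count)$/ &&
       $NF+0 > 0 { print; found=1 }
       END { exit !found }'
}

# Example usage on one drive (sdb is a placeholder; repeat per array drive
# and the parity drive):
#   smartctl -A /dev/sdb | check_smart_attrs && echo "sdb: suspect attributes"
# The extended self-test you mention is started with:
#   smartctl -t long /dev/sdb
```

A zero raw value for all four attributes does not prove a drive healthy, but a non-zero Current_Pending_Sector at a fixed location would fit a recurring error like this one.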
BobPhoenix Posted May 22, 2014 I had a recurring error: 128 sync errors at each parity check, same location on parity and, at the time, on drive 10. I replaced parity and drive 10, and still got the same 128 errors, but on a different array drive this time. So I updated my RES2SV240 firmware and my M1015 firmware, and no more sync errors so far. I'm sure your firmware is up to date, since I used your contribution for a Perc H310 last night, but you might try updating your controller, and your SAS expander if you use one.
SSD Posted May 22, 2014 Makes as much sense as anything. I do not believe this is an unRaid issue or defect; it seems hardware or firmware related. It is very odd that it occurs at a single specific location. If you are technically oriented, you could convert the block number to a disk offset and use the 'dd' command to read the contents of each disk at that location repeatedly, looking for variability in the returned results. I can't think of any other way to diagnose further.
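A rough sketch of that dd approach (the function name is made up, and the sector number would be the 65680 from the syslog). One caveat, stated as an assumption: the md sector is counted from the start of the unRAID data partition, so point this at the partition device (e.g. /dev/sdb1) rather than the whole disk, or add the partition's starting offset first:

```shell
# Read the same 512-byte-addressed region several times and reduce each
# pass to an md5 checksum. dd's default unit matches 512-byte logical
# sectors, so skip= can take a sector number directly.
read_stability() {
  dev=$1; sector=$2; count=${3:-256}   # default 256 sectors = 128 KiB
  for pass in 1 2 3 4 5; do
    dd if="$dev" bs=512 skip="$sector" count="$count" 2>/dev/null | md5sum
  done | sort -u
}

# e.g. read_stability /dev/sdb1 65680
# Exactly one output line means the reads were stable; more than one line
# means the drive returned different data for the same sectors.
```

Run it against every array drive plus parity while the array is idle; a drive that occasionally returns different data for the same region is the prime suspect.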
vca Posted May 22, 2014 As bjp999 suggests, there might be an issue with a drive not always returning the same data each time a block is read. I had a battle with this that is reported here: http://lime-technology.com/forum/index.php?topic=11515.msg109840#msg109840 though this sort of thing appears to be very rare, so it might not be your case at all. If this is the cause, you have to test all the drives by reading the blocks in the region where the parity error is reported. You do this many times, and if one of the drives has this error, the reads will occasionally return different data (even though the drives are not being written to). Regards, Stephen
Fireball3 Posted June 2, 2014 (Author) Here I am again - did the monthly parity check and guess what...

Jun 1 23:54:49 Tuerke kernel: md: recovery thread woken up ...
Jun 1 23:54:49 Tuerke kernel: md: recovery thread checking parity...
Jun 1 23:54:49 Tuerke kernel: md: using 2560k window, over a total of 3907018532 blocks.
Jun 1 23:54:50 Tuerke kernel: md: correcting parity, sector=65680
...
JonathanM Posted June 2, 2014

"Here I am again - did the monthly parity check and guess what...
Jun 1 23:54:50 Tuerke kernel: md: correcting parity, sector=65680"

It's not entirely clear from your previous posts, so I would like to revisit correcting vs. non-correcting again. Since you state it's the monthly check, I assume it is non-correcting, despite what the log says. Have you manually started and run to completion a parity check, being absolutely sure that it is a correcting check? I would run two manually started correcting checks in a row so they show up in the same syslog, and then post it.
Fireball3 Posted June 2, 2014 (Author) Sorry for being unclear. It is a correcting check every time. I started one for a few seconds just now and the error is there again. Since it's at the very beginning, I don't have to wait long for it.

Jun 3 00:38:36 Tuerke kernel: mdcmd (96): check CORRECT
Jun 3 00:38:36 Tuerke kernel: md: recovery thread woken up ...
Jun 3 00:38:36 Tuerke kernel: md: recovery thread checking parity...
Jun 3 00:38:37 Tuerke kernel: md: using 2560k window, over a total of 3907018532 blocks.
Jun 3 00:38:38 Tuerke auto_s3_sleep: HDD activity detected. Active HDDs: 12
Jun 3 00:38:38 Tuerke kernel: md: correcting parity, sector=65680
lionelhutz Posted June 3, 2014 I threw together a Wiki page for this. Some others added good info and links to other good test scripts based on the idea. It should help you troubleshoot this issue. http://lime-technology.com/wiki/index.php/FAQ#How_To_Troubleshoot_Recurring_Parity_Errors
Fireball3 Posted June 3, 2014 (Author) Thank you for the link to the wiki. In my log I see the sector as the location of the fault, but the wiki refers to blocks. How can I reconcile the two?
Fireball3 Posted June 27, 2014 (Author) Hey guys, I've been reading through your posts, and it seems this kind of error is an example of why a correcting parity check is not always recommended - although this error seems to be very rare. Now I'm determined to find the drive that is causing/delivering the flipping bit. Based on your posted snippets, I put together a script that computes md5 checksums for my drives. The only thing I need to know now is where on the drives I have to search. Of course, since the error pops up very soon after I press the button, it has to be in the range from 0 to some small offset, but I would like to understand what I'm doing here. While reading up on sectors and blocks I ended up totally confused. dd needs a block location, and dd by default uses 512-byte blocks:

block length x number of blocks = drive size

Syslog:

May 1 21:39:22 Tuerke kernel: scsi 4:0:0:0: Direct-Access ATA WDC WD30EZRX-00M 80.0 PQ: 0 ANSI: 5
May 1 21:39:22 Tuerke kernel: sd 4:0:0:0: Attached scsi generic sg1 type 0
May 1 21:39:22 Tuerke kernel: sd 4:0:0:0: [sdb] 5860533168 512-byte logical blocks: (3.00 TB/2.72 TiB)
May 1 21:39:22 Tuerke kernel: sd 4:0:0:0: [sdb] 4096-byte physical blocks

This is a 3TB drive. According to the equation I get 3000592982016 bytes (512-byte blocks), where the block index is the logical block address (LBA). OK so far - understood, easy! Now back to my error:

May 1 21:40:12 Tuerke kernel: mdcmd (51): check CORRECT
May 1 21:40:12 Tuerke kernel: md: recovery thread woken up ...
May 1 21:40:12 Tuerke kernel: md: recovery thread checking parity...
May 1 21:40:12 Tuerke kernel: md: using 2560k window, over a total of 3907018532 blocks.
May 1 21:40:14 Tuerke kernel: md: correcting parity, sector=65680

Wait, these 3907018532 blocks are only 2TB. The parity disk is 4TB - shouldn't it be about 7814037064 blocks? The next line refers to sector=65680 (of the parity disk, I suppose). As I read up, the term "sectors" comes from CHS addressing. Is sector 65680 the same as LBA 65680? That would put it at byte 33628160 on the disk. And on top of all this comes the file system block size - very nice, more confusion. While most drives physically have 512-byte blocks (there are new drives with real 4k physical blocks), the file system can define its own block size. That means we get a different number of blocks for the same drive. How does this interfere with my task?

May 1 21:39:34 Tuerke kernel: REISERFS (device md1): journal params: device md1, size 8192, journal first block 18, max trans len 1024, max batch 900, max commit age 30, max trans age 30

Size 8192 - is this supposed to be the block size?

May 1 21:40:12 Tuerke kernel: md: using 2560k window, over a total of 3907018532 blocks.

What block size does this number of blocks refer to? I'm very keen on seeing your answers - I'm lost. And finally:

May 1 21:39:33 Tuerke kernel: unraid: allocating 71544K for 1536 stripes (11 disks)
May 1 21:39:33 Tuerke kernel: md1: running, size: 245117344 blocks
May 1 21:39:33 Tuerke kernel: md2: running, size: 488386552 blocks
May 1 21:39:33 Tuerke kernel: md3: running, size: 1465138552 blocks
May 1 21:39:33 Tuerke kernel: md4: running, size: 1465138552 blocks
May 1 21:39:33 Tuerke kernel: md5: running, size: 1465138552 blocks
May 1 21:39:33 Tuerke kernel: md6: running, size: 390711352 blocks
May 1 21:39:33 Tuerke kernel: md7: running, size: 2930266532 blocks
May 1 21:39:33 Tuerke kernel: md8: running, size: 2930266532 blocks
May 1 21:39:33 Tuerke kernel: md9: running, size: 3907018532 blocks
May 1 21:39:33 Tuerke kernel: md10: running, size: 245117344 blocks

Man, I can tell you that I'm really lost!
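For what it's worth, the numbers do reconcile under the usual Linux md convention, which this thread's kernel messages appear to follow (this is an assumption, not something the syslog states): md "sector" values are 512-byte units, while the "blocks" in the "md: using ... over a total of N blocks" line are 1 KiB units. A quick sanity check of both conversions:

```shell
# md sector (assumed 512-byte units) -> byte offset on the array device
sector=65680
echo "byte offset: $((sector * 512))"      # 33628160, about 32 MiB in

# md "blocks" (assumed 1 KiB units) -> total size of the check window
blocks=3907018532
echo "window: $((blocks * 1024)) bytes"    # 4000786976768, about 4.0 TB
```

Under that reading, 3907018532 blocks match the 4TB parity disk after all, and the per-device lines like "md1: running, size: 245117344 blocks" would likewise be 1 KiB units (about 250 GB). The ReiserFS "size 8192" is a journal parameter, not the block size, and it doesn't enter into locating the sector.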
luvmich Posted June 28, 2014 Had this problem once, back when I used unRaid 4.7. Every parity check would result in a correction. I ran SMART checks until I was blue in the face, ran memory checks for hours, and even did file system checks, which would find errors and correct them. Not sure if you have run those yet to see your results. This all happened about two years ago. Since I had ruled out the drives, I upgraded my hardware. I had a Sempron 140 in a consumer mobo with cheap controller cards (a bunch of two-port ones). I'm still using all the same drives today, and I have not had a problem since I upgraded to my Supermicro mobo. Knocking on wood :)