March 18, 2011
I think today is the third time I have experienced this; the first two times I assumed it was a coincidence. What I did:
1. Ran a parity check, which finished without errors
2. Powered down unRAID
3. Physically swapped two disks
4. Booted unRAID
5. Swapped the disks in the array configuration on the device settings tab
6. Started the array
7. Started a parity check
I immediately got a single parity error at the very beginning, then no more. As mentioned, this seems to be reproducible, at least for me. I don't swap drives physically every day, but it would be good to know the reason. Is there an explanation for this, or does it sound like a potential bug?
March 18, 2011 Not a bug with unRAID, but probably something with your hardware. Have you run Memtest on the system? It is available as a boot option when you first start the machine. If you have not run Memtest, I suggest running it at least overnight.
March 18, 2011 Author I can't see how this could be a memory issue. It only happens when I physically swap two disks that are already members of the array. Say I swap disk5 with disk9 physically, then adjust the unRAID settings to match. Without a disk swap I can run as many parity checks as I want with 0 errors. Normally I run a parity check every 3-4 weeks. Even with months-long uptime (running VMware with 2 VMs, CrashPlan, and HandBrake transcoding in the background) I never get a parity error. Everything works perfectly. This only happens when I swap two disks. Do you think it's realistic to call this a memory issue? I am not 100% sure, because it was months ago that I swapped multiple disks at the same time, but if I remember correctly I had multiple (maybe 3 or 4) parity errors that time. Maybe one per disk? Anyhow, this is not a big issue and I can live with it. I just thought it was worth a discussion.
March 21, 2011 Author Since you asked so nicely, your wish is my command. Please see attached: no errors during an overnight Memtest.
March 21, 2011 This is a difficult problem. unRAID reconstructs entire disks when they fail entirely or are upgraded. It's difficult to determine the cause of a parity check error: it could be any of the array drives with equal probability. Please post SMART reports for the array drives and the entire syslog, zipped if needed.
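To illustrate why a parity error cannot be pinned on one drive, here is a minimal sketch of single XOR parity. The disk contents, block size, and helper names (`xor_parity`, `flip_first_bit`) are made up for the example; this is not unRAID's actual driver code, only the general idea.

```python
from functools import reduce


def xor_parity(blocks):
    """XOR equal-length byte blocks together, position by position."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def flip_first_bit(block):
    """Return a copy of the block with the lowest bit of byte 0 flipped."""
    return bytes([block[0] ^ 0x01]) + block[1:]


# Three hypothetical data disks, one 4-byte block each (made-up contents).
data = [b"\x01\x02\x03\x04", b"\x10\x20\x30\x40", b"\xaa\xbb\xcc\xdd"]
parity = xor_parity(data)

# Rebuilding a missing disk: XOR the parity with the surviving disks.
rebuilt = xor_parity([parity, data[0], data[2]])
assert rebuilt == data[1]

# Corrupt each data disk in turn and XOR everything, parity included.
# A healthy stripe XORs to all zeros; the result here is the same
# non-zero pattern no matter which disk was corrupted.
mismatches = []
for i in range(len(data)):
    corrupted = data[:i] + [flip_first_bit(data[i])] + data[i + 1:]
    mismatches.append(xor_parity(corrupted + [parity]))
```

Every entry in `mismatches` is identical, so the check reports that the stripe is inconsistent but not which disk is responsible. That is why SMART reports and the syslog are needed to narrow it down.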
March 21, 2011 Author I really do appreciate all the help, but I would like to understand what we're after here. I have been using unRAID for almost 3 years now. I know how it works, and I am not a total Linux noob either. I would like to point out again that I don't get any parity check errors unless I swap disks between physical SATA ports. Is it a valid conclusion that, when 10 parity checks in a row are all good and swapping two disks then produces a single parity error, we should look for the problem in the hardware?
March 21, 2011 Silly question - but are you running a correcting parity check? The default was changed to "NOCORRECT" at some point, not sure when. If yes, the syslog will list the location of a parity sync error. Post the locations.
March 21, 2011 Author Yes, I am running a correcting parity check. (Right, I did not mention that I am using 4.7. I think the NOCORRECT default was introduced in one of the 5.0 betas.) Here is the log for the parity check:
Mar 18 11:59:58 Tower kernel: mdcmd (26): check CORRECT
Mar 18 11:59:58 Tower kernel: md: recovery thread woken up ...
Mar 18 11:59:58 Tower kernel: md: recovery thread checking parity...
Mar 18 11:59:58 Tower kernel: md: using 1152k window, over a total of 1953514552 blocks.
Mar 18 11:59:58 Tower kernel: md: parity incorrect: 29360
Mar 18 18:40:52 Tower kernel: mdcmd (27): spindown 3
Mar 18 18:40:53 Tower kernel: mdcmd (28): spindown 8
Mar 18 18:41:44 Tower kernel: mdcmd (29): spindown 10
Mar 18 18:53:49 Tower kernel: mdcmd (30): spindown 1
Mar 18 19:34:58 Tower kernel: mdcmd (31): spindown 10
Mar 18 20:32:48 Tower kernel: md: sync done. time=30765sec rate=63497K/sec
Mar 18 20:32:48 Tower kernel: md: recovery thread sync completion status: 0
Looking at the last line, I have a question: what does the zero mean there? Isn't it the number of parity errors that were corrected? In that case it should be 1, shouldn't it?
March 21, 2011 Run another parity check until you see the sync error (you say it happens near the beginning). You don't have to wait for the parity check to complete; you can cancel it after you see the sync error. Then post the row that looks like this:
Mar 18 11:59:58 Tower kernel: md: parity incorrect: 29360
(Update: Do this several times and post the results. Let's see which parity blocks are being affected, and whether it is always the same block or always in the same general region.) The "0" on the last line is just the return code from the parity check. It didn't crash or anything. You might get a different status code when you cancel it. Either way, it is not significant to what we are looking at.
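To compare the affected blocks across several checks, something like the following could pull the block numbers out of saved syslog text. It is only a sketch: the helper names (`parity_error_blocks`, `summarize`) are invented here, and the pattern assumes the "md: parity incorrect: N" line format shown in the log posted in this thread.

```python
import re
from collections import Counter

# Assumed line format, taken from the log in this thread:
#   "... kernel: md: parity incorrect: 29360"
PARITY_RE = re.compile(r"md: parity incorrect: (\d+)")


def parity_error_blocks(syslog_text):
    """Return the block numbers of all parity errors found in one syslog."""
    return [int(m.group(1)) for m in PARITY_RE.finditer(syslog_text)]


def summarize(runs):
    """Count, per block number, how many of the given syslogs report it."""
    counts = Counter()
    for text in runs:
        counts.update(set(parity_error_blocks(text)))
    return counts


# Excerpt of the syslog posted above, used as sample input.
sample = """\
Mar 18 11:59:58 Tower kernel: md: recovery thread checking parity...
Mar 18 11:59:58 Tower kernel: md: using 1152k window, over a total of 1953514552 blocks.
Mar 18 11:59:58 Tower kernel: md: parity incorrect: 29360
Mar 18 20:32:48 Tower kernel: md: sync done. time=30765sec rate=63497K/sec
"""

blocks = parity_error_blocks(sample)
summary = summarize([sample])
```

Passing one syslog per parity check to `summarize` shows at a glance whether the same block number turns up after every swap or whether the location moves around.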
March 21, 2011 Author Hmmm... I know my English is not perfect, but what I am trying to say is that I will never see the parity error again unless I physically swap disks again. I already ran a parity check right after the error quoted above occurred, and it completed without a single error.
Edit: by "swap" I mean swapping two disks that are already in the array. Say, before the swap:
- disk1 connected to SATA port A
- disk2 connected to SATA port B
After the swap:
- disk1 connected to SATA port B
- disk2 connected to SATA port A
March 22, 2011 This should not happen. Swapping disks around does not cause writes to the disks (the power is off, right?). I have moved drives between controllers dozens of times and never got a sync error from it.