February 2, 2013
I am running unRAID 4.7 and have the monthly parity check package installed. As recommended, I'm using the NOCORRECT parameter. Last night it ran the monthly check and reported 3 errors. After satisfying myself that there weren't any problems with my data disks, I followed the instructions on the Package Manager page and started a second parity check using the Check button on the main unRAID menu page, to run a correcting parity check and fix the errors. It appeared to correct the original parity errors, but it reported 1 parity error this time, at a different location according to the syslog (attached).
I was confused about why it was reporting 1 error, and didn't know whether it was telling me that it had found 1 new error and corrected it or not, so I started a third parity check to confirm that all was now well. But midway through this third check it is reporting the same parity error at the same location as in the second parity check. This third check is still ongoing.
I'm not sure how to proceed to re-establish a clean, parity-error-free array. Can someone provide guidance? Thank you.
syslog-20130201.txt
February 3, 2013 Author
"Examine all SMART reports for current_pending_sector RAW_VALUE greater than 0."
dgashk,
That was the first thing I did, and there were no problems with either Reallocated Event Count or Current Pending Sector. The raw counts on all disks were zero. The problem corrected itself on the fourth parity check, but it is the sequence of events that I don't understand. I'll present an edited syslog to demonstrate what is weird to me.
On February 1st at midnight the system begins its monthly parity check (NOCORRECT) as usual.
Feb 1 00:00:01 Tower kernel: mdcmd (45): check NOCORRECT (unRAID engine)
Feb 1 00:00:01 Tower kernel: (Routine)
Feb 1 00:00:01 Tower kernel: md: recovery thread woken up ... (unRAID engine)
Feb 1 00:00:01 Tower kernel: md: recovery thread checking parity... (unRAID engine)
Feb 1 00:00:01 Tower kernel: md: using 1152k window, over a total of 1953514552 blocks. (unRAID engine)
Feb 1 01:04:44 Tower kernel: md: parity incorrect: 796939200 (Errors)
Feb 1 02:41:33 Tower kernel: md: parity incorrect: 1742433096 (Errors)
Feb 1 02:41:33 Tower kernel: md: parity incorrect: 1742433800 (Errors)
Feb 1 07:00:00 Tower kernel: md: sync done. time=25201sec rate=77517K/sec (unRAID engine)
Feb 1 07:00:00 Tower kernel: md: recovery thread sync completion status: 0 (unRAID engine)
At 7 AM that parity check finishes, having found three errors. After checking all the disks' SMART reports I start a second parity check (this time using the Check button on the main screen).
Feb 1 08:22:02 Tower kernel: mdcmd (57): check CORRECT (unRAID engine)
Feb 1 08:22:02 Tower kernel: md: recovery thread woken up ... (unRAID engine)
Feb 1 08:22:02 Tower kernel: md: recovery thread checking parity... (unRAID engine)
Feb 1 08:22:02 Tower kernel: md: using 1152k window, over a total of 1953514552 blocks. (unRAID engine)
Feb 1 10:35:57 Tower kernel: md: parity incorrect: 1592017448 (Errors)
Feb 1 15:08:06 Tower kernel: md: sync done. time=24365sec rate=80177K/sec (unRAID engine)
Feb 1 15:08:06 Tower kernel: md: recovery thread sync completion status: 0 (unRAID engine)
This parity check apparently corrects the three original errors as it should, but finds one new error, which the main screen reports as "1 error". I was unsure whether that meant it had found one new error and corrected it, or that one new error remained. I assumed it had corrected it, but to be sure I started a third parity check, expecting it to report no errors this time.
Feb 1 15:59:03 Tower kernel: mdcmd (58): check CORRECT (unRAID engine)
Feb 1 15:59:03 Tower kernel: md: recovery thread woken up ... (unRAID engine)
Feb 1 15:59:03 Tower kernel: md: recovery thread checking parity... (unRAID engine)
Feb 1 15:59:03 Tower kernel: md: using 1152k window, over a total of 1953514552 blocks. (unRAID engine)
Feb 1 18:12:56 Tower kernel: md: parity incorrect: 1592017448 (Errors)
Feb 1 22:45:35 Tower kernel: md: sync done. time=24395sec rate=80078K/sec (unRAID engine)
Feb 1 22:45:35 Tower kernel: md: recovery thread sync completion status: 0 (unRAID engine)
Note that this third check reported the SAME parity error at the SAME location as the second check, and reported that it had completed the check having found one error. This was the point at which I was really confused and posted the original message. Being unable to figure out what to do, I decided there was nothing to lose by doing a fourth parity check.
Feb 1 23:18:48 Tower kernel: mdcmd (59): check CORRECT (unRAID engine)
Feb 1 23:18:48 Tower kernel: md: recovery thread woken up ... (unRAID engine)
Feb 1 23:18:48 Tower kernel: md: recovery thread checking parity... (unRAID engine)
Feb 1 23:18:48 Tower kernel: md: using 1152k window, over a total of 1953514552 blocks. (unRAID engine)
Feb 2 06:05:03 Tower kernel: md: sync done. time=24376sec rate=80140K/sec (unRAID engine)
Feb 2 06:05:03 Tower kernel: md: recovery thread sync completion status: 0 (unRAID engine)
This fourth time the parity check completes without finding any errors. I'm unsure whether the third check corrected the error after reporting it or whether the fourth check corrected it. In any event the problem is now corrected, but I am at a loss as to why this sequence of events happened. At the very least I should not have had two successive parity checks with the CORRECT option both report a parity error at the same location. It is a mystery to me, and I would love an explanation if someone can provide one.
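Editorial note: the comparison done by eye in the post above (which addresses were reported in which run, and whether any address repeats) can be sketched as a small shell helper. This is only an illustration under stated assumptions: the function names are hypothetical, and the patterns assume the message wording quoted in this thread ("mdcmd (N): check MODE" and "md: parity incorrect: ADDRESS").

```shell
#!/bin/sh
# Hypothetical helpers; they only read a syslog file and never touch the array.

# parity_summary SYSLOG: list each parity-check run with its reported addresses.
parity_summary() {
    awk '
        /mdcmd .*check / {
            run++
            # The word after "check" is the mode (CORRECT or NOCORRECT).
            for (i = 1; i < NF; i++) if ($i == "check") mode = $(i + 1)
            printf "run %d (%s)\n", run, mode
        }
        /md: parity incorrect:/ {
            for (i = 1; i < NF; i++) if ($i == "incorrect:") printf "  %s\n", $(i + 1)
        }
    ' "$1"
}

# repeated_addresses SYSLOG: print addresses reported in more than one run.
repeated_addresses() {
    awk '
        /mdcmd .*check / { run++ }
        /md: parity incorrect:/ {
            for (i = 1; i < NF; i++)
                if ($i == "incorrect:") {
                    addr = $(i + 1)
                    if (!seen[addr, run]++) count[addr]++
                }
        }
        END { for (a in count) if (count[a] > 1) print a }
    ' "$1" | sort -n
}
```

Run against the log excerpts above, `repeated_addresses` would be expected to list only 1592017448, the address reported by both the second and third checks. The syslog path to pass in (for example /var/log/syslog) is an assumption about the system.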
February 4, 2013
My best guess...
The first NOCORRECT parity check found three addresses with errors. These are caused by EITHER flaky memory on the MB, or flaky electronics/memory in one of your disk drives. If you had run a second NOCORRECT check, it probably would have given different addresses.
The next CORRECT check found only a single "bad" address (again, I suspect flaky memory or a flaky disk). This time parity was changed to make it match the bad data it had read. It did not find the three original addresses in error.
The next CORRECT check read everything perfectly. However, since you had updated 1592017448 in the prior CORRECT check to reflect the "random" bad data you had read, this CORRECT check had to change it again, this time to put it back to its original value so that it really is "correct". It did not find the original three addresses in error.
The third CORRECT check found nothing wrong, so there was nothing to correct (this time).
You have either a flaky disk, or flaky memory, or a power supply with a noisy supply voltage that affects the electronics/memory in one of the disk drives, or a flaky disk controller port. You will continue to see random parity errors at random addresses. (Use the NOCORRECT check; that way you don't update parity. If you see the exact same blocks every time, then you can use the CORRECT option.)
This class of error is VERY difficult to locate. Two of your checks returned a random error. The error rate is very tiny, 4 blocks out of 1953514552, but you'll never be able to trust the array to return your most precious file until you resolve this. (The odds are in your favor, but you do not want to play the odds.)
About the only thing you can probably rule out is the "mouse".
I'd start with a memory test, preferably overnight. It is the easiest to test and the most likely to be the issue. Memory must be configured with the proper voltage, clock speed, and timing. Most BIOSes get it right; some do not, and you have to set the parameters for your SPECIFIC MAKE/MODEL of RAM strips. If you do get errors, check the BIOS settings first. Your memory might need slightly higher voltage, or less aggressive timing.
Joe L.
February 5, 2013 Author
Joe,
Thanks for the detailed response, though this is a disturbing analysis. I've started a memory test. The first pass was fine, but I'll let it run overnight. I've never made ANY adjustments to timing, overclocking or anything like that. I'm far too ignorant of these issues to have ever tried it. I just used approved memory for my motherboard and accepted all the BIOS defaults. Of course that doesn't mean something isn't wrong with the memory; I only want to state that I never fiddled with any of that.
The only other thing I can conceivably highlight is that one of my disks reports a "multi-zone error rate" with a raw value of 2. I couldn't find much useful info on this attribute, but it is not a new report. That disk has reported that value for some time. I don't know if that provides any additional insight to you, but it is the only other thing that is reporting anything out of the ordinary.
Fortunately there is no mission-critical data on the server. It's all just a big media server, but it's stuff I value nonetheless, and the prospect of getting random errors each month during parity checks is not particularly attractive. I'll report back tomorrow after the memory test runs overnight. Thanks.
February 5, 2013
A couple of years ago I had a similar-sounding problem. See my posts in this topic: http://lime-technology.com/forum/index.php?topic=11515.0
If this problem does not correct itself (or if you find that it is not bad SATA or power cables, a weak power supply, bad RAM or a bad SATA port on the motherboard), and you continue to see a sequence of correcting parity checks running OK a few times and then fixing the same block twice in a row, then you might well be seeing the same issue I was. If so, the problem is with one of your disks. The difficulty is that the block number in the error message identifies a position in the array, not a block on one particular disk. So you now need to test that block (or more likely some range of blocks) on each of your disks to determine which disk is misbehaving.
To do this you will need to use the potentially dangerous "dd" command and the md5sum command; take a look at this posting: http://lime-technology.com/forum/index.php?topic=10364.msg98580#msg98580
The idea is to use "dd" to read the raw data from the same region (the one containing the error block) on each of your drives, record the md5sum (hash) of the data, and then rerun the command a number of times to see if you ever get a different md5sum for one of the drives. This should never happen, but in my case one of my seven drives would occasionally return a different value, even though the drive had not been written to.
Good luck,
Stephen
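Editorial note: the repeated-read test described above can be sketched roughly as follows. This is a minimal illustration, not the procedure from the linked posting. The function name is hypothetical, and the offsets are plain parameters in 512-byte units, because this thread does not state how the number in a "parity incorrect" line maps to an offset on each disk. dd is only ever given if= (never of=), so the sketch reads and never writes.

```shell
#!/bin/sh
# hash_region DEVICE SKIP COUNT [PASSES]   (hypothetical helper)
# Reads COUNT 512-byte blocks starting SKIP blocks into DEVICE, PASSES times
# (default 3), and reports whether the md5 of the data ever changes.
hash_region() {
    hr_dev=$1; hr_skip=$2; hr_count=$3; hr_passes=${4:-3}
    hr_first=""; hr_i=1
    while [ "$hr_i" -le "$hr_passes" ]; do
        # Read-only: raw data goes straight into md5sum, nothing is stored.
        hr_sum=$(dd if="$hr_dev" bs=512 skip="$hr_skip" count="$hr_count" 2>/dev/null | md5sum | cut -d' ' -f1)
        echo "$hr_dev pass $hr_i: $hr_sum"
        if [ -z "$hr_first" ]; then
            hr_first=$hr_sum
        elif [ "$hr_sum" != "$hr_first" ]; then
            echo "$hr_dev: MISMATCH on pass $hr_i"
            return 1
        fi
        hr_i=$((hr_i + 1))
    done
    echo "$hr_dev: consistent over $hr_passes passes"
}
```

One caveat worth keeping in mind: when the region is small, repeated reads may be served from the kernel's cache rather than from the disk, which would hide an intermittent fault. Reading a region larger than RAM, or dropping caches between passes, are possible ways around that.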
February 5, 2013
You could generate hashes for all your files using md5sum and then check them a few times. If there is some memory error, you might find non-matching hashes in random files. If you keep getting errors on the same disk, you might isolate a cable or controller error. Of course, you might also spend many hours generating hashes and checking them and find nothing, and file hashes won't help if the problem is with the parity disk.
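Editorial note: a minimal sketch of the file-hash approach suggested above, assuming md5sum and find are available. The function names are hypothetical. The hash list should be given as an absolute path and stored outside the directory being hashed, so that it is not included in its own list.

```shell
#!/bin/sh
# make_hashes DIR HASHFILE: record an md5 for every regular file under DIR.
make_hashes() {
    ( cd "$1" && find . -type f -exec md5sum {} + ) > "$2"
}

# verify_hashes DIR HASHFILE: re-read every file and print only the failures.
# Returns 1 if any file no longer matches its recorded hash.
verify_hashes() {
    vh_bad=$( cd "$1" && md5sum -c "$2" 2>/dev/null | grep -v ': OK$' || true )
    if [ -n "$vh_bad" ]; then
        echo "$vh_bad"
        return 1
    fi
    echo "all files match"
}
```

Running `verify_hashes` several times against the same list is the point of the exercise: with an intermittent fault, a file might fail on one pass and match on the next.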
February 6, 2013 Author
Well, the bad news is I let memtest run overnight and it showed no errors. Bad news because that would have been a very simple fix: new RAM.
vca, Chris Pollard,
I'm gonna have to do some learning to even understand what you guys are talking about. I know next to nothing about low-level disk issues or md5, and am a rank novice when it comes to the Linux command line.
It seems to me that the first thing I want to do is run a few parity checks with the NOCORRECT option to see if this problem is going to repeat and, if so, how often. My question: how do I tell unRAID to run a parity check with the NOCORRECT option set? The monthly parity check does that, but only when it runs, and the Check button on the main screen of unMENU runs a check with the CORRECT option set. Do I just run it from a command line, and if so, what are the command and its parameters?
February 6, 2013
"Do I just run it from a command line, and if so, what is the command and the parameters?"
From the command line:
/root/mdcmd check NOCORRECT
It must be capitalized as shown.
February 6, 2013
Keep a log of the results of your parity check runs, especially the lines from the log file like:
Feb 1 01:04:44 Tower kernel: md: parity incorrect: 796939200 (Errors)
Feb 1 02:41:33 Tower kernel: md: parity incorrect: 1742433096 (Errors)
Feb 1 02:41:33 Tower kernel: md: parity incorrect: 1742433800 (Errors)
as you may need these block numbers to configure the "dd" commands if you decide to do that sort of testing later.
Regards,
Stephen
February 6, 2013
"from the command line: /root/mdcmd check NOCORRECT"
Is there a means of limiting this "check" to a subset of the array (a range of sectors, blocks, stripes, etc.)?
February 6, 2013 Author
"Keep a log of the results of your parity check runs, especially the lines from the log file like:
Feb 1 01:04:44 Tower kernel: md: parity incorrect: 796939200 (Errors)
Feb 1 02:41:33 Tower kernel: md: parity incorrect: 1742433096 (Errors)
Feb 1 02:41:33 Tower kernel: md: parity incorrect: 1742433800 (Errors)
as you may need these block numbers to configure the 'dd' commands if you decide to do that sort of testing later. Regards, Stephen"
Thanks. I'll make sure I do that.
As for what has happened since my last post: I ran two NOCORRECT parity checks. The first one reported a single parity error at a location unrelated to any of the prior errors.
Feb 5 18:44:43 Tower kernel: md: parity incorrect: 1201804360 (Errors)
I then ran a second parity check. This time it reported no errors. So, as you have all indicated, I'm clearly in murky territory.
Trying to think about this, I'm drawn to the fact that I added a data disk during January. Thus, the February 1st monthly parity check was the first one that exhibited this problem and also the first one that included the new data disk. That's enough of a coincidence to focus on that drive as a first step. I'm running a Norco 4224, and this drive is on a Supermicro SASLP-MV8 card. It's also true that the slot this drive is in was unused before, so it could also be the Norco tray connector, the SAS cable or the Supermicro card (though both the cable and the card have supported other drives before this month).
Looking for some feedback here, but, particularly since I have a hot spare ready to go, a sensible strategy might be to replace this drive with the hot spare, rebuild the array and run a few new parity checks. Make sense?
Assuming it does, what is the easiest way to do this? Do I just stop the array, unassign the current drive, assign the hot spare to the same device slot, restart the array and start the rebuild process, or is there some step I'm missing? Will wait for feedback before starting this process. Thanks.
February 7, 2013
"from the command line: /root/mdcmd check NOCORRECT
Is there a means for limiting this 'check' to a subset (range of sectors / blocks / stripes, etc) of the array?"
Not that I know of, though if your problem is near the start you can just stop the parity check by hand when it has gone far enough.
Stephen
February 7, 2013
"I ran two NOCORRECT parity checks. The first one reported a single parity error at a location unrelated to any of the prior errors.
Feb 5 18:44:43 Tower kernel: md: parity incorrect: 1201804360 (Errors)
I then ran a second parity check. This time it reported no errors. So as you have all indicated, I'm clearly in murky territory. Trying to think about this, I'm drawn to the fact that I added a data disk during January. Thus, the February 1st monthly parity check was the first one that exhibited this problem and was the first one that included the new data disk. That's enough of a coincidence to focus on that drive as a first step. I'm running a Norco 4224 and this drive is on a SASLP-MV8 Supermicro card. It's also true that the slot this drive is in was unused before, so it could also be the Norco tray connector, the SAS cable or the Supermicro card (though both the cable and the card have supported other drives before this month.) Looking for some feedback here, but, particularly since I have a hot spare ready to go, a sensible strategy might be to replace this drive with the hot spare, rebuild the array and run a few new parity checks. Make sense?"
Yes, that sounds like a reasonable approach. When I was going through this grief, once I had identified the disk in question, that is exactly what I did, and after a few parity checks without further issues I had a good idea that I had solved the problem. Then I did some more testing of the bad drive outside of the array to further confirm the hypothesis.
"Assuming it does, what is the easiest way to do this? Do I just stop the array, unassign the current drive, assign the hot spare to the same device slot, restart the array and start the rebuild process, or is there some step I'm missing? Will wait for feedback before starting this process. Thanks."
That sounds correct, but I'm just going from memory. Check the wiki; the procedure is in there.
Regards,
Stephen
February 7, 2013 Author
"Yes that sounds like a reasonable approach, when I was going through this grief, once I had identified the disk in question that is exactly what I did and after a few parity checks without further issues I had a good idea that I had solved the issue. Then I did some more testing of the bad drive outside of the array to further confirm the hypothesis."
That's what I was thinking. If this DOES solve the problem, I'll have the problem disk outside the array, where it can be thoroughly checked out on my PC, an environment I'm much more comfortable operating in. I'll report back and probably seek advice on how to check out the disk (what tools to use, etc.), assuming this shows any indication of solving the problem. Hard to know with intermittent problems, of course. Thanks again.
February 9, 2013 Author
As discussed, I replaced the data disk that was added to the array during January and rebuilt the array with a hot spare, on the hunch that maybe that disk was the source of the apparently intermittent parity error I'm getting. The rebuild was successful, but it doesn't seem to have solved the problem. After the rebuild finished I ran a NOCORRECT parity check. It found a single error.
Feb 7 20:51:29 Tower kernel: mdcmd (28): check NOCORRECT (unRAID engine)
Feb 7 20:51:29 Tower kernel: (Routine)
Feb 7 20:51:29 Tower kernel: md: recovery thread woken up ... (unRAID engine)
Feb 7 20:51:29 Tower kernel: md: recovery thread checking parity... (unRAID engine)
Feb 7 20:51:29 Tower kernel: md: using 1152k window, over a total of 1953514552 blocks. (unRAID engine)
Feb 8 00:45:54 Tower kernel: md: parity incorrect: 2728833160 (Errors)
Feb 8 03:29:58 Tower kernel: md: sync done. time=23911sec rate=81699K/sec (unRAID engine)
Feb 8 03:29:58 Tower kernel: md: recovery thread sync completion status: 0 (unRAID engine)
I then ran a second NOCORRECT parity check, and it completed successfully without error.
So I'm out of easy options. I've already fiddled with the cables, cards, etc. (before I rebuilt the array) and found no obvious problems. I can start replacing cables, cards or the power supply, but that could get expensive, and I think Stephen's suggestions regarding ways to check individual disks make sense at this point. If I understand any of this correctly, I should probably first do a number of additional parity checks to see if I can narrow down the offending blocks that might be causing the problem, and then run scripts based on the md5sum and dd commands.
I plan to do a number of parity checks over the next week (maybe one each night) to see if I can narrow this down, and then proceed to use these tests to see if I can find what's going on.
Stephen, would it be alright with you if I PM you regarding the results I get from these parity checks and to develop a strategy for testing the disks in the array? Frankly, given my very limited Linux skills, I'm not at all confident that I won't screw things up without guidance. Thanks.
March 2, 2013 Author
I wanted to report back to the forum regarding the problem that started this thread: non-repeating, intermittent parity errors. (The details of the problem are in the earlier posts of this thread.)
To keep this short: I had a series of non-repeating parity errors, which Joe L. soon informed me were due to some sort of intermittent problem that would be very hard to diagnose. After trying a memory check overnight and replacing and rebuilding a recently added drive (both of which yielded nothing), I decided to take the advice of Stephen (vca) and start using the dd command to investigate the problem. I asked him at that time if we could take the conversation offline via PMs so I could get some guidance on how best to go about it.
Sparing all the details and detours along the way, I wanted to report that the problem has been addressed, and to share what was discovered about both the problem and, at least potentially, the use of the dd command.
First, the problem turned out to be another case of a SIL3132 card intermittently malfunctioning. (This thread has much more on potential problems with this chip/card: http://lime-technology.com/forum/index.php?topic=21052.0 ) In my case it was a 2-port PCI-e x1 card purchased from Monoprice. Only one of the ports on the card was in use. It was eventually diagnosed as the problem by running three successive dd commands of the following structure (as suggested by Stephen):
dd if=/dev/sdj | md5sum -b >> sdj.log
Each run returned a checksum, and in each case the checksum was different. After I moved the drive attached to this port to a slot that wasn't controlled by the SIL3132 card, three successive NOCORRECT parity checks completed with no reported errors. So I feel relatively confident in reporting that the SIL3132 is the source of the problem.
Of course there was a lot of back and forth and long waits involved before coming to this determination. And that leads to the other "finding" that is worth reporting back to the forum. The dd command on an entire 2 TB disk seems to take about 7 hours (at least on my system). Doing that sequentially (and at least twice, in order to get two checksums to compare) on each of 11 drives is a long, long process. And when a dd command is executed you do not get the prompt back until the command completes, so you cannot (or at least I - with my very limited Linux skills - could not) just run multiple dd commands at once.
Stephen suggested using screen to run several dd commands at once. I ran three screen sessions simultaneously, with one dd command, each on a different disk, in each screen instance. That way I could complete 3 disks in one seven-hour session. When one dd command completed I would enter the next dd command in that screen. After getting a single checksum for each disk, I started a second run-through in order to get comparisons. I discovered that ALL of the disks were returning inconsistent checksums. It was clear that the dd commands were somehow stepping on each other. How and why that was happening is way above my level to suggest, but, at least in my case, running multiple dd commands simultaneously yielded garbage, which was a surprise, and may be instructive for the next person who has to deal with such intermittent problems.
I then proceeded to run dd commands sequentially, one at a time. I ran two consecutively on each drive, then moved on to another drive. Luckily, the drive attached to the SIL3132 card was the fourth one I tried, which is where the problem was discovered and corrected.
Thanks to Stephen (vca) for helping me through this process.
Harry
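Editorial note: the one-at-a-time procedure described above could be scripted roughly as below, so that a single session works through every drive without anyone retyping commands. This is a sketch under assumptions: the function names and the log format are hypothetical, and the device names in the usage line are placeholders. It reads devices only and runs strictly one dd at a time, following the author's experience that simultaneous runs gave inconsistent results on his system.

```shell
#!/bin/sh
# hash_disks_sequentially LOGFILE PASSES DEVICE...   (hypothetical helper)
# Hashes each device in full PASSES times, never more than one read at once,
# and appends "DATE TIME DEVICE pass N MD5" lines to LOGFILE.
hash_disks_sequentially() {
    hd_log=$1; hd_passes=$2; shift 2
    for hd_dev in "$@"; do
        hd_i=1
        while [ "$hd_i" -le "$hd_passes" ]; do
            # Read-only: dd has no of=, the data is piped into md5sum.
            hd_sum=$(dd if="$hd_dev" bs=1M 2>/dev/null | md5sum -b | cut -d' ' -f1)
            echo "$(date '+%Y-%m-%d %H:%M:%S') $hd_dev pass $hd_i $hd_sum" >> "$hd_log"
            hd_i=$((hd_i + 1))
        done
    done
}

# report_inconsistent LOGFILE: list devices whose passes disagree.
report_inconsistent() {
    awk '
        !($3 in first)  { first[$3] = $6; next }   # first hash seen for a device
        $6 != first[$3] { bad[$3] = 1 }            # a later pass differs
        END { for (d in bad) print d }
    ' "$1" | sort
}
```

A call might look like `hash_disks_sequentially /boot/ddhash.log 2 /dev/sdj /dev/sdk`, started inside a single screen session so it survives a dropped connection. At roughly 7 hours per 2 TB pass, as reported above, two passes over 11 drives would take on the order of 150 hours.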