January 13, 2009 My drive started a parity sync due to a hang earlier tonight (I was playing with cpufrequtil). I'm at 50% and have 3386 parity errors. They are going up at about 1/sec, sure and steady. I haven't sync'd since I was running 4.3.3 about 3 weeks ago. I see nothing else going on. Nothing in /var/log/syslog. Is this normal?
January 13, 2009 3 weeks ago you ran a parity check and there were no sync errors? Or three weeks ago you built parity for the first time and you have never run a parity check? (Please answer, this is an important question to allow us to help). I would check out the data on EACH disk as best you can. If the data looks good then it is likely that parity was never right. This happens sometimes with partially compatible hardware. But if you are having problems with one disk or another, it may be that a disk has gone bad and is spewing bad data back to unRAID. THIS IS VERY UNLIKELY. If it is happening, though, parity is getting broken rather than fixed by the parity check. I'm not sure stopping it at this late date would do you any good (the damage would already have largely been done). Tom has agreed to put a non-destructive parity check into the next build. This would allow you to check parity without parity updates. With this, if you noticed a bunch of sync errors and a disk was apparently bad, you could replace and rebuild it, and your parity would not have gotten compromised in the process. As I said, check out all your disks. If all appears well, let it finish. Then run another parity check. If you are continuing to get parity errors you will need to look into some boot options, because your motherboard / disk controllers are not working 100% with Slackware. If your next parity check is clean, I can't really explain what might have happened, but apparently your array has recovered.
January 13, 2009 Author 3 weeks ago you ran a parity check and there were no sync errors? Or three weeks ago you built parity for the first time and you have never run a parity check? (Please answer, this is an important question to allow us to help).

It was the first complete parity check that I ran after I got both new 1TB drives installed on my new array running 4.3.3 with bubbaRAID. Since then I have loaded it up with files (mostly movies and music). I have played with running 4.4, 4.5b1, and 4.4.2 w/ bubbaRAID and my own kernel mods. Last night was the first time I had re-run a parity check since the first (essentially empty) one.

I would check out the data on EACH disk as best you can. If the data looks good then it is likely that parity was never right. This happens sometimes with partially compatible hardware.

I will try. Since it is mostly movies, music and some iso files, it is sort of difficult.

But if you are having problems with one disk or another, it may be that a disk has gone bad and is spewing bad data back to unRAID. THIS IS VERY UNLIKELY. If it is happening, though, parity is getting broken rather than fixed by the parity check. I'm not sure stopping it at this late date would do you any good (the damage would already have largely been done). As I said, check out all your disks. If all appears well, let it finish. Then run another parity check. If you are continuing to get parity errors you will need to look into some boot options, because your motherboard / disk controllers are not working 100% with Slackware. If your next parity check is clean, I can't really explain what might have happened, but apparently your array has recovered.

OK. It completed, and this morning I see it finished with 5,992 writes to the parity disk (the writes number was only a few higher than the sync err number) but there are 0 errors logged. That makes me feel a little better. Glad to hear about the non-destructive sync. That sounds like a good idea. Paul
January 13, 2009 OK. It completed, and this morning I see it finished with 5,992 writes to the parity disk (the writes number was only a few higher than the sync err number) but there are 0 errors logged. That makes me feel a little better. Glad to hear about the non-destructive sync. That sounds like a good idea. Paul

Now, run another parity check, or even two in sequence. If they come up with zero parity errors you are probably OK. If they show ANY parity errors, suspect your memory (voltages and/or timing) or motherboard. Joe L.
January 13, 2009 The parity build is recorded by unRAID as a successful parity check (zero errors), but in truth it is not a check at all. It is a writing of parity. In order to know your parity is good, you have to READ parity. (I wish unRAID did NOT show a parity build as a successful parity check!) It sounds like this was your first parity check, at least the first with your current suite of drives.

This is a very basic statement, but the key to unRAID working is that it is able to read from and write to the disks in a repeatable and accurate way. If unRAID writes a 7, but the disk stores a 4; or if unRAID reads a 3 when the disk really has a 9 - you have a huge problem. Problems like this are rare, but bad or misconfigured memory, immature motherboard (chipset) drivers, BIOS issues, bad / loose cables, and failing drives can all cause this type of thing to a greater or lesser extent. (This is not an exhaustive list; bad PSUs, power cables, electrical interference, a broken motherboard, etc. can also cause similar symptoms.)

With any of these types of problems, you can run into a situation where the parity build APPEARS to work, but when you run parity checks you get sync errors every time. (Parity checks are supposed to FIX sync errors by updating parity as they go, so even if you get some sync errors on one run, you should not get any sync errors the next time.) And the root cause has got to be determined and addressed.

What seems odd to me, but yet I have seen several users swear up and down that this is happening, is that writes to the data disks work correctly, yet writes to the parity disk get corrupted somehow. This seems to most often happen with newer motherboards with less mature drivers. Several users had problems where adding one more disk caused problems (e.g., everything worked great with 3 disks, but adding a 4th caused parity to become unstable). I don't know all the "why" answers, but I tend to steer users towards proven compatible motherboards and disk controllers where this type of weirdness doesn't happen.

I am not saying you have any of these problems, just that you could. It is important that you are able to run a parity check and get zero errors. And that you are able to run another parity check and get zero errors. And that 2 days later you are able to run a parity check and get zero errors. And then a week later. And a month after that. Until you get to a point that unRAID demonstrates that its parity mechanism is working with your system, you are not protected. I would advise anyone reading this to re-run parity checks if they get sync errors. Even just one. Although a power outage or crashed server will cause sync errors, you should not be getting ANY if the server is being brought up and down smoothly. None. Nada.

I'd suggest you run another parity check. If you get zero sync errors then you can begin to rest more easily that you are not having compatibility issues, but I'd still recommend running 2-3 more over the next several weeks to convince yourself that all is well. If you are continuing to get sync errors, post back. RobJ has created a wiki page with some boot options that have solved these types of issues, which may help if the problem is due to an immature driver.

Now, run another parity check, or even two in sequence. If they come up with zero parity errors you are probably OK. If they show ANY parity errors, suspect your memory (voltages and/or timing) or motherboard.

Great minds think alike!
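For readers who want to see the build-versus-check distinction concretely, here is a minimal Python sketch. It assumes single-parity protection computed as a byte-wise XOR across the data disks, which is the usual scheme for this kind of array but is not spelled out in this thread. The function names, the tiny four-byte "disks", and the `correct` flag (standing in for the non-destructive check mentioned above) are all made up for illustration and are not unRAID's actual code.

```python
from functools import reduce


def xor_parity(blocks):
    """Return the byte-wise XOR of equally sized data blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def parity_check(data_disks, stored_parity, correct=True):
    """Compare stored parity with parity recomputed from the data disks.

    Returns (sync_errors, parity_after_check). With correct=True the stored
    parity is replaced wherever it disagrees (a correcting check); with
    correct=False mismatches are only counted (a non-destructive check).
    """
    expected = xor_parity(data_disks)
    sync_errors = sum(1 for e, p in zip(expected, stored_parity) if e != p)
    return sync_errors, (expected if correct else stored_parity)


# Hypothetical four-byte "disks", purely for illustration.
disks = [bytes([7, 1, 4, 9]), bytes([3, 3, 8, 2])]

# The "build": parity is written, but nothing is read back or verified.
parity = xor_parity(disks)

# Simulate one parity byte that did not land on the disk as written.
damaged = bytearray(parity)
damaged[2] ^= 0xFF
stored = bytes(damaged)

first_errors, stored = parity_check(disks, stored, correct=True)
second_errors, stored = parity_check(disks, stored, correct=True)
print(first_errors, second_errors)  # prints "1 0"
```

On healthy hardware the second check comes back with zero sync errors because the first one repaired parity. If every run keeps reporting errors, the reads or writes themselves are not repeatable, which is the situation described above.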
January 13, 2009 Just a point of reference... I've had my unRAID server since October 2005. It has been on-line in my home since then. I've had zero parity errors that I remember that I could not attribute to an unexpected power loss. (In the beginning I did not have a UPS.) I've had 1 "read" error on a data disk. I do run monthly parity checks on my server to find and correct any marginal sectors on my disks. It is very rare to have a sector re-allocated, but they do happen. I have one older 250 Gig drive that has 100 reallocated sectors. That number has not changed in years. I suspect they were all in one spot on the disk and once re-allocated, the disk was fine. The only way to learn how your unRAID system is doing is to monitor it. The only way you will ever know about disk problems is to periodically test them. The easiest "test" of the drive, to read all its sectors, is a "parity check". I'll be very happy once the parity "check" is available as a non-destructive test that alerts us to issues. Then we will be able to test further before "correcting" the error. That way, faulty memory chips, or voltages, or timings, do not corrupt our existing parity. Joe L.
January 13, 2009 Just a point of reference... I've had my unRAID server since October 2005. It has been on-line in my home since then. I've had zero parity errors that I remember that I could not attribute to an unexpected power loss. (In the beginning I did not have a UPS.) I've had 1 "read" error on a data disk. I do run monthly parity checks on my server to find and correct any marginal sectors on my disks. It is very rare to have a sector re-allocated, but they do happen. I have one older 250 Gig drive that has 100 reallocated sectors. That number has not changed in years. I suspect they were all in one spot on the disk and once re-allocated, the disk was fine. The only way to learn how your unRAID system is doing is to monitor it. The only way you will ever know about disk problems is to periodically test them. The easiest "test" of the drive, to read all its sectors, is a "parity check". I'll be very happy once the parity "check" is available as a non-destructive test that alerts us to issues. Then we will be able to test further before "correcting" the error. That way, faulty memory chips, or voltages, or timings, do not corrupt our existing parity.

My experience has not been quite as good, but almost. I have had a few power outage situations and had resulting sync errors (I now have a UPS). After getting a significant number of sync errors, twice I had a few sync errors on the next parity check as well. But on subsequent parity checks, sync errors went to zero and stayed there. (All of this was with the 4.2.x version.) I also had a server crash. The problem was traced to a bad SATA connection with the motherboard (I needed to use a locking cable to get a secure connection to that port). Once the cable problem was fixed, I got 33 sync errors on the automatic parity check. I have run 2 additional parity checks and gotten 0 sync errors each time.
January 13, 2009 Author OK, this is starting to bother me now. I have started to re-run the parity check in 4.4.2 w/ bubbaRAID, plain 4.4.2 and even back to 4.3.3. Each time I get the same thing. It counts approx 1 parity error/sec. This is the SAME hardware that ran a 0 error check on these same drives a few weeks ago. edit: Now I get what bjp999 was saying. Maybe I got 0 errors because it was the FIRST check (and maybe b/c it was on empty disks). I streamed some movies off of the drive and they seem to play fine. Same with my music files. There could be data errors, but I cannot find or confirm them right now. I have a hard time believing there is anything marginal about the drivers for this setup. It is a 5 year old mobo that has been running Windows and Linux just fine. (Having said that, I know that any H/W can go bad at any time for most any reason.) The drives are brand new 1TB Seagates running in SATA I (1.5 Gb/s) mode on a Silicon Optics SATA I 2-port controller. I am willing to post logs, try other options, or whatever is recommended. Something changed and I don't get it. Paul
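Playing a few movies is a weak test of whether reads are repeatable. One rough way to look for unstable reads is to hash every file more than once and flag any file whose hash changes between passes. The Python sketch below assumes Python is available on the server, or on a client reading the share over the network, which is not a given on a stock unRAID 4.x install. The function names are made up for this example. Treat it as a sanity check only: the operating system may serve the second read from RAM cache for files smaller than free memory, so a clean result on small files proves little, while a mismatch on any file is a strong hint of a hardware problem.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in 1 MiB chunks so large media files are never held in RAM."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def unstable_files(root, passes=2):
    """Return the files under root whose hash differs between read passes."""
    suspects = []
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            hashes = {sha256_of(path) for _ in range(passes)}
            if len(hashes) > 1:  # same file, different bytes on re-read
                suspects.append(path)
    return suspects
```

A call such as `unstable_files("/mnt/disk1")` would scan one data disk; that mount path is an assumption about the server layout, so substitute whatever path the disk is actually mounted at.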
January 13, 2009 I suggest you VIGOROUSLY test your memory... overnight, with the latest version of MEMTEST86+.
January 13, 2009 I suggest running a memory test (available as an option on the unRAID boot menu) to rule out memory problems. (Some high-end memory requires a voltage boost to run properly. Double check your memory is properly configured.) Some users have reported removing all but one stick of RAM has helped in situations like this. It is worth a try. Double check your cabling (SATA and power connections). Try a fresh SATA cable to your parity disk. If none of this helps ...
- Post your hardware configuration (motherboard, memory, add-on controllers (if any), drives, backplanes / docks, etc.).
- Boot 4.3.3 fresh, then start a parity check. Let it run for a few minutes and log a number of parity errors. Then you can stop the parity check. (4.3.3 logs more info about sync errors than 4.4.2.) Capture and post this syslog.
- Post smartctl logs on each of your drives.
Follow the Troubleshooting link in my sig for instructions.
January 13, 2009 OK, this is starting to bother me now. I have started to re-run the parity check in 4.4.2 w/ bubbaRAID, plain 4.4.2 and even back to 4.3.3. Each time I get the same thing. It counts approx 1 parity error/sec. This is the SAME hardware that ran a 0 error check on these same drives a few weeks ago. I streamed some movies off of the drive and they seem to play fine. Same with my music files. There could be data errors, but I cannot find or confirm them right now. I have a hard time believing there is anything marginal about the drivers for this setup. It is a 5 year old mobo that has been running Windows and Linux just fine. (Having said that, I know that any H/W can go bad at any time for most any reason.) The drives are brand new 1TB Seagates running in SATA I (1.5 Gb/s) mode on a Silicon Optics SATA I 2-port controller. I am willing to post logs, try other options, or whatever is recommended. Something changed and I don't get it. Paul

The version of Linux you use would almost never affect the ability to read data off drives and compare bits to determine parity. What might affect parity:
- Memory. Are you using the exact same memory timings and memory voltages as several weeks ago? Is the case temperature (and memory temperature) exactly the same? Are you using the voltages and timing suggested by the memory manufacturer? Premium memory usually needs higher voltages. This is highly likely to be your issue.
- Noise on power supply lines. Have you added any hardware, an additional drive perhaps?
- Noise on data cables. Have any cables moved, or shifted position since when the array behaved itself?
- CPU voltage. Has it been changed?
- Power supply voltage. How well regulated is the supply you used?
- Noise pickup on motherboard. Did you mount it with ALL the screws to all the stand-offs as specified in the MB manual?
A copy of the syslog might help, if there are errors showing up in it. It is unlikely to be a hard-disk, but it could be I suppose.
The data has lots of checksums as it is moved to the drive controller card, so that is least likely. If your music plays correctly, and the videos play correctly, it is a problem reading the disks and calculating parity. This same problem would prevent successful recovery from a failed drive, so please do not ignore it. Check memory voltages AND timing first. Joe L.
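To illustrate why bad parity would prevent a successful recovery, here is a small Python sketch of rebuilding a missing disk. It assumes plain XOR parity across the data disks, which this thread does not spell out. The disk contents and function names are invented for the example.

```python
def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)


def rebuild_missing(surviving_disks, parity):
    """Reconstruct the one missing disk from the survivors plus parity."""
    return xor_blocks(list(surviving_disks) + [parity])


# Hypothetical contents; both "disks" must be the same length.
disk1 = b"movie-data"
disk2 = b"music-data"
good_parity = xor_blocks([disk1, disk2])

# disk2 fails: with correct parity it comes back byte for byte.
rebuilt_ok = rebuild_missing([disk1], good_parity)

# With a single flipped bit in parity, the rebuilt disk silently differs.
bad_parity = bytearray(good_parity)
bad_parity[0] ^= 0x01
rebuilt_bad = rebuild_missing([disk1], bytes(bad_parity))

print(rebuilt_ok == disk2, rebuilt_bad == disk2)  # prints "True False"
```

The rebuild has no way to know the parity was wrong, so every unresolved sync error is a spot where a rebuilt disk could come back with wrong data.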
January 13, 2009 Author Thanks for all of the feedback. I am running memtest86 v2.01 on the system as we speak. For what it's worth, I used to tweak this system to higher settings when running Windows, and even then it ran just fine. Now I am back to the BIOS defaults. I am using the SPD settings from the memory (8-3-3-3 on DDR440 Crucial memory). I use all of the standoffs and I have reasonably thorough PC build practices. Since building and testing I have moved it behind my desk. The temps are the same or better than early on, and nothing that I consider out of range. I know that temps can approach the quasi-religious type of discussion, so I won't list those at the moment. I will be happy to go through and check the cable routing and swap out the SATA cables. The power supply was new for this build as I kept my high end PSU for my primary rig. It should be an acceptable 450W PSU from TigerDirect. Otherwise the system is as simple as I could get it. A low power nVidia AGP graphics card and the 2 HDDs are the only other things in the case. I will follow the steps as listed. I'm going to let the memtest hit 100%, and then reboot back to collect info and logs. I will run memtest again overnight. FWIW when I ran tail -f on the syslog I was not seeing anything logged during the sync errors. However, I never did that on 4.3.3 (which has the better logging). Cheers, Paul P.S. Now to go make up with the wife as my frustration with this project has bled out in ways it shouldn't have.
January 13, 2009 P.S. Now to go make up with the wife as my frustration with this project has bled out in ways it shouldn't have. Good idea... It is probably "*way*" more expensive to get a new wife than a new motherboard. ;)
January 13, 2009 Author It is probably "*way*" more expensive to get a new wife than a new motherboard. ;) I can assert that theory as true. Once was enough. So, I have 1 pass with 0 errors and 60% through another one, still no memory errors. I really don't think it is the memory in this rig. Stopping memtest to collect some logs and such. Booting into 4.3.3.
January 13, 2009 Author Hmm... I fat-fingered my previous post while trying to attach a file... Attached is the syslog from a 4.3.3 boot and then starting a parity check. It confirms the pattern I see of ~1 err/sec during the check. I ran smartctl but this is what I get:

root@Floater:~# smartctl -i /dev/sdb
smartctl version 5.36 [i486-slackware-linux-gnu] Copyright (C) 2002-6 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

Device: ATA ST31000340AS Version: SD15
Serial number: 6QJ056N3
Device type: disk
Local Time is: Tue Jan 13 13:20:44 2009 GMT+5
Device does not support SMART

edit: BTW, I do get temps reported on the Disk Status section. Wouldn't that be a SMART param? I find that hard to believe on a brand new Seagate 1TB drive. I would think SMART reporting would work via this SATA controller, but maybe something is fishy. I know SMART used to work on PATA based drives on this mobo. Going to deconstruct and put in new SATA cables to see where that gets me....
January 13, 2009 Author Here is the output of smartctl for my 2 drives. This is run from 4.3.3 unRAID.
January 13, 2009 The FAQ entry for this is here: http://lime-technology.com/wiki/index.php?title=FAQ#Why_am_I_getting_repeated_parity_errors.3F As to temps not showing, the FAQ entry is here: http://lime-technology.com/wiki/index.php?title=FAQ#Why_is_a_temp_not_showing_for_a_drive.3F
January 13, 2009 Author Thanks RobJ. Running reiserfsck on /dev/md1 now. I guess it will take a while since it was ~70% full (or 700G) of data. Will post the results.
January 13, 2009 Author OK, so no corruptions found during a reiserfsck. I guess I'm on to swapping out SATA cables. Maybe I'll light some candles and bring in some chickens while I'm at it...
January 13, 2009 I had forgotten your syslog, so just took a look at it, and did note one big negative - it is either an nForce2 or nForce3 board. There is nothing otherwise obvious that is wrong with your system. Drive and cable problems would have produced errors in the syslog. But nForce2, nForce3, and nForce4 boards are notorious for data corruption issues related to the early nForce chipsets. It does appear to be one of the repeating parity error cases, which have been the hardest to resolve. Once you have eliminated all other possibilities, I would have to make the motherboard itself a strong suspect, with no cure but to replace it. I'm sorry. Too many others (including me) have wasted too many hours trying to get these boards to work reliably. The data corruption issues seem much more likely to occur with simultaneous drive access, not single drive use, which is probably why most Windows users and gamers have had no problems with nForce boards. They do have very good performance and features.
January 13, 2009 Author grumble grumble...forum horking my posts while attaching files...grumble. I'll write it again... Here are 2 more recent logs after I re-cabled my HDDs with something other than the red SATA cables that come with mobos. I got them swapped at first, so both drives were red-balled. Swapped the cables back, it booted fine. Now my drives are sda and sdb, where before they were sdb and sdc for some reason. Now the info in syslog looks quite a bit different than before. I also see a few dubious numbers in the smartctl output, but OTOH I never ran this when I first installed the drives so I don't know what to compare against. Thanks for the support.
January 13, 2009 Hmmm ... your old logs look fine, your new ones look like a cabling problem. An unRAID parity check is a pretty stressful event on the computer. I had a slightly flaky cable that would work fine for normal drive access and running smart reports, but when I ran a parity check I immediately got errors in the syslog. The round IDE cables similarly work fine in Windows but create problems in unRAID servers. Just because a motherboard is good for Windows doesn't make it good for unRAID. RobJ has a ton of experience reading these logs and with the nForce-based motherboards. I fear he is likely right and that this motherboard is not a good choice for unRAID. I'd recommend sticking with one of the standard unRAID motherboards:
- SuperMicro C2SEE (used by LimeTech in new preconfigured servers - Joe L. has this MB on a new server he is building)
- ASUS P5B VM DO (no longer made, so hard to find. Used by LimeTech in preconfigured servers until very recently - I have this MB)
Be careful about reading the motherboard forums or hardware compatibility wiki. Having one user able to set up a 4 drive array that builds parity does not mean that it is REALLY fully compatible. I can't overstate this enough. Get a standard motherboard and you'll have a much better unRAID experience IMO.
January 13, 2009 The only problem I see in the SMART reports is the large increase in the UDMA_CRC_Error_Count on one drive, and that usually means a bad cable. Makes sense since that is what you were swapping. The cable issues produced numerous and repeated errors in the syslog, which severely impacted parity check performance. Edit: Brian beat me! I concur with what he said.
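Spotting a jump like the UDMA_CRC_Error_Count increase is easier with a baseline to compare against. The Python sketch below diffs the raw values in two saved attribute tables of the kind `smartctl -A` prints. The two sample reports are invented placeholders in that column layout, not the reports attached to this thread, and the function names are hypothetical.

```python
def parse_smart_attributes(report):
    """Map attribute name -> raw value from smartctl-style attribute text."""
    attributes = {}
    for line in report.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns;
        # the header row starts with "ID#" and is skipped by the isdigit test.
        if len(fields) >= 10 and fields[0].isdigit():
            attributes[fields[1]] = fields[9]
    return attributes


def changed_attributes(before, after):
    """Return {name: (old_raw, new_raw)} for attributes whose raw value changed."""
    old = parse_smart_attributes(before)
    new = parse_smart_attributes(after)
    return {
        name: (old[name], new[name])
        for name in new
        if name in old and old[name] != new[name]
    }


# Invented sample data for illustration only.
BEFORE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
"""
AFTER = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       412
"""

print(changed_attributes(BEFORE, AFTER))
```

Raw values are kept as strings because some drives pack several numbers into one raw field. Saving a report when drives are first installed gives you something to compare against later.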
January 13, 2009 Author Alas, I have accepted this fact. All my work (and it probably was a lot of work because of this board) to build a server based on my old H/W has gotten me to the Newegg website. Based on this post: http://lime-technology.com/forum/index.php?topic=2642.0 That seems like a pretty safe bet. The cost is reasonable and multiple people have proven it out. Hey, it gives me an excuse to upgrade the memory on my main windoze rig, since there is a special on some fast OCZ RAM sticks. I'll take my 800's and put them in the unRAID box and then go for the 1066's in my gamer machine. Yeah, this has been a trying experience, but I have learned a ton of stuff in the process. The support here has been great. I'll keep limping along with my nForce2 board for a few days, and then slap the drives onto the new board later this week. Is there any reason to believe that I will lose my data by shifting the drives to a new mobo?
January 13, 2009 Author The only problem I see in the SMART reports is the large increase in the UDMA_CRC_Error_Count on one drive, and that usually means a bad cable. Makes sense since that is what you were swapping. The cable issues produced numerous and repeated errors in the syslog, which severely impacted parity check performance. Edit: Brian beat me! I concur with what he said.

So there's a logic problem. By trying to disprove a theory I created the problem I was trying to disprove. Does that prove that it wasn't previously a problem? End of story: I switched back to my original cables... (and I ordered a new mobo/CPU from Newegg)