April 12, 2009 (I'm running 4.5.b4... my apologies if there's a better spot for beta questions) I ran a parity check after moving all the hardware, including drives, into a new case, squirreling them in the back of the equipment closet, then digging them back out to switch the ATA drives around so they matched unRAID's expectations. I had moved many gigs of ripped .flac files onto them over the past week with no obvious problems and no parity errors; I've been running parity checks practically nightly since I'm using an nForce2 board and wanted to stay on top of any issues that might cause. I moved a 5GB+ .mp4 over this afternoon and ran a parity check again. When I refreshed the page with 0.8% of the check complete, there were 635 sync errors. It's done now and no more errors were found. Basically, I'd like to know the correct amount of panic I should be experiencing. Given my understanding of sync errors, "not much" seems appropriate, but I thought I'd ask for more learned opinions. Is there anything else I should do other than run a parity check again and/or move some more large files around? Thanks! pastebin of syslog
April 12, 2009 There are a couple of things of note in your v4.5-beta4 syslog, but they are not necessarily a problem. One is that there are NO sync errors reported, although you reported 635. They usually all appear in the syslog, and I can't explain why they aren't there, unless Tom turned off their logging. The other is a comment on the setting of NCQ queue_depth of all drives to 1. All 3 of your drives were apparently set to 1, even though one is an IDE drive (hdb) that cannot support NCQ at all (so this is probably harmless), and the other 2 are on controllers that do not support NCQ, so their queue_depth was 0, but now is set to 1. It seems simplistic to say that 0 is not the same as 1, but there really may be a difference, and it would be good to receive some assurance that setting a queue_depth to 1 when the kernel would not set it higher than 0 is actually OK and harmless. Getting even 1 parity error is unusual, so some response is expected, although perhaps not as far as 'panic'. I would run another parity check to see if you get the same 635, which would indicate that the system is re-correcting those 'sync errors', changing them back to the previous values. Was this the first parity check since loading v4.5-beta4?
April 12, 2009 Author Was this the first parity check since loading v4.5-beta4?
Yes, it was. This is what //Dingo:8080/unraid_main shows (among the other usual things): Check will start a Parity-Check. (Last checked on 4/11/2009 9:41:15 PM, finding 635 errors.) And although I thought I had started a new parity check last night, apparently I forgot in the frenzy to get ready for the Easter Bunny. So, it's running now. Thanks for the advice!
April 12, 2009 Author The follow-up parity check found three errors, although again there doesn't seem to be anything in the logs. The errors came early on, although the check was 5.0% done before I got back to reload the page. pastebin of syslog
April 12, 2009 Based on what you have said, I would not trust your motherboard to keep accurate parity (also based on others' experience with the early nForce chipsets' inability to perform consistent parity calculations). No "subsequent" parity check should have errors unless you powered down without stopping the array and did not give it the chance it needed to stop cleanly. The fact that you have one good parity check every once in a while is simply not good enough for me. I'd look for a replacement motherboard that uses the same CPU and memory. It will be cheap insurance in the long run. Joe L.
September 11, 2009 Is Joe's comment in relation to the SATA headers? Or something else on the motherboard? E.g., if you were to simply add a good SATA card, would that fix the problem, rather than replacing the whole motherboard?
September 11, 2009 Is Joe's comment in relation to the SATA headers? Or something else on the motherboard? E.g., if you were to simply add a good SATA card, would that fix the problem, rather than replacing the whole motherboard?
Good possibility it might work with a different SATA controller, and if it did it would be less work than changing out the entire motherboard... but you would need to test by performing a few parity checks in a row... Joe L.
September 11, 2009 Cool -- thanks. I'm still learning all this stuff, but I'm indeed getting parity errors every time I run the sync. It's about 105 errors per sync, and even immediately after running a sync, if I run it again, I see more errors... So I'm guessing that MIGHT mean my controller... but am I right in that assumption? How many parity errors would it take to really be a problem? Obviously 0 would be best... but is 100 going to affect my ability to restore a failed drive? Thanks!! Robbie
September 11, 2009 If parity errors are showing up on every sync then you CAN'T trust the data that is on the parity drive at all. If you were to try to restore using the parity drive at this point you would lose some of the files, as the info on the parity drive may be incorrect/invalid. I assume the errors are showing up on the parity drive itself and not one of the data drives. On the web GUI there is a column that lists errors for each drive; just check to make sure it is the parity one that the errors are coming from.
September 11, 2009 Thanks for the tip! I'm in the middle of a Parity Check right now, and it shows 2 sync errors so far, but all three of my drives have "0" in the Errors column. Might be just because the check is still happening? Regardless, if it ends up being my Parity drive... do I replace it? Do I move it to a different controller? What's the best course of action in that case? Is it easy enough to move all drives onto a different controller without losing my array? Thanks so much!!
September 11, 2009 Cool -- thanks. I'm still learning all this stuff, but I'm indeed getting Parity errors every time I run the sync. It's about 105 errors per sync, and even immediately after running a sync, if I run it again, I see more errors... So I'm guessing that MIGHT mean my controller... but am I right in that assumption? How many parity errors would it take to really be a problem? Obviously 0 would be best... but is 100 going to affect my ability to restore a failed drive? Thanks!! Robbie
Seeing any errors is bad. There should never be any unless the array was improperly shut down, nor are any ever expected. As far as the 100 errors... Think of randomly changing 100 of the bits on your replacement disk to a value other than what they originally held. It may have a drastic effect if it lands in a piece of code in a much-loved program... or it corrupts a zip file checksum and prevents it from being unzipped... or it might be somewhere in the file-system, causing a file-system corruption that prevents it from being mounted... or in unused space... and have no visible effect at all. I've seen one sync error... ever.... on my array, and it was before I had it on a UPS and it lost power in a power failure. My array has been in use 24/7 since October 2005. I've run monthly parity checks to make sure everything is running properly... plus many more as I tested various features... Motherboards are cheap compared to your time... If you suspect the MB, and have ruled out the other hardware, plan on a replacement MB. Are you stopping the array prior to powering down? If not, that might be the cause of the parity errors... The fact that the parity errors show up at the start of the parity sync process points to their involvement in the housekeeping portion of the reiser file-system. They could corrupt an attempt to restore any of the other file-systems... or they could be as simple as a superblock time-stamp... impossible to know.
If you are stopping the array, then powering down, and still seeing parity errors, there is something wrong with the hardware... Early nForce chipsets had problems. This has shown itself several times as we helped diagnose random parity errors. Their owners ended up replacing the MB. Joe L.
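To make the "100 changed bits" point concrete, here is a minimal sketch of a single-parity rebuild. It assumes parity is a plain byte-wise XOR across the data drives; the function names and the toy four-byte "drives" are invented for illustration and are not unRAID code.

```python
from functools import reduce

def xor_blocks(blocks):
    """Byte-wise XOR of equally sized byte strings."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

def rebuild(parity, surviving):
    """Rebuild one missing drive from parity plus all surviving drives."""
    return xor_blocks([parity, *surviving])

# Two toy "data drives" and the parity computed from them.
disk1 = b"\x10\x20\x30\x40"
disk2 = b"\xaa\xbb\xcc\xdd"
parity = xor_blocks([disk1, disk2])

# With correct parity, a lost disk2 comes back exactly.
assert rebuild(parity, [disk1]) == disk2

# Flip one bit in the stored parity (a stand-in for one uncorrected sync error).
bad_parity = bytes([parity[0] ^ 0x01]) + parity[1:]
rebuilt = rebuild(bad_parity, [disk1])

# Every wrong parity bit becomes a wrong bit at the same offset on the rebuilt disk.
wrong_bytes = sum(a != b for a, b in zip(rebuilt, disk2))
print(wrong_bytes)  # 1
```

Under that XOR assumption, each bad parity bit lands at the same offset on the rebuilt drive, so whether it matters depends entirely on what happens to live at that offset.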
September 11, 2009 The sync errors mean that it found a discrepancy between the parity info and what it calculated this time through the disks. When these keep happening and you have not changed any data, then something is probably wrong with your setup. We have seen memory timing be the cause (check the BIOS to make sure the board got it right), and bad controllers on the board itself, among a few other things. The Errors column for the drives is for Read and Write errors, so the fact that nothing is popping up there is good. It seems like your drives are OK, but running a SMART long test on the drives would probably be a good thing to do, to make sure they are good and to get a baseline of how the drives are. Yes, you can move the array to a new motherboard entirely and not have a problem restarting the array. Just take a screen shot of what drives are assigned to what positions in the array and assign them back to the same spot when you set the array back up on a new motherboard/controller. The one you want to make sure to get in the correct position is the parity drive; if you assign a data drive to the parity slot you will wipe out all data that is on the data drive.
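A rough sketch of what a parity check is doing, assuming byte-wise XOR parity: recompute parity from the data drives and count the positions where it disagrees with what is stored. The function name and the one-byte granularity are assumptions made for illustration; the real check works on much larger units.

```python
def count_sync_errors(stored_parity, data_drives):
    """Count positions where stored parity disagrees with freshly computed parity."""
    errors = 0
    for position, stored in enumerate(stored_parity):
        computed = 0
        for drive in data_drives:
            computed ^= drive[position]
        if computed != stored:
            errors += 1
    return errors

# Three toy "data drives" and the parity that matches them.
drives = [b"\x01\x02\x03\x04", b"\x10\x20\x30\x40", b"\xff\x00\xff\x00"]
good_parity = bytes(a ^ b ^ c for a, b, c in zip(*drives))
print(count_sync_errors(good_parity, drives))  # 0

# Damage two bytes of the stored parity and check again.
stale_parity = bytearray(good_parity)
stale_parity[1] ^= 0x80
stale_parity[3] ^= 0x01
print(count_sync_errors(bytes(stale_parity), drives))  # 2
```

Note that nothing in the comparison says which side is wrong. A mismatch only shows that the drives and the parity disagree, which is why repeated errors on unchanged data point at the hardware path rather than at any one disk.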
September 11, 2009 Thanks for the tip! I'm in the middle of a Parity Check right now, and it shows 2 sync errors so far, but all three of my drives have "0" in the Errors column.
The "errors" column associated with specific drives shows "read" errors. If a read error occurs, the unRAID software will reconstruct what it intended to read from the drive by using parity and the remaining data drives and then also write the same data back to the drive that failed. If it was actually a bad sector on the disk, the SMART firmware on the disk should then re-write and/or re-allocate that sector. By the time you see the "read" error, it has been handled. These are bad, but may occur over time as a drive is unable to read a sector. I've seen about 1 or 2 of these in 4 years. A "write" error to a drive will take it out of service. It will show up with a "red" indicator on the management console.
Might be just because the check is still happening?
Not related... different type of error.
Regardless, if it ends up being my Parity drive... do I replace it? Do I move it to a different controller? What's the best course of action in that case? Is it easy enough to move all drives onto a different controller without losing my array?
Very easy to move it. Just stop the array, power down, physically move the drive, power up and start the array. If you moved the drive to a new controller it might need to be assigned on the devices page first; then start the array. unRAID makes it easy to migrate to new equipment/controllers. Best to take a screen-shot of the Device assignment page so you know how to re-assign the drives when you put them on new hardware. As long as you do not accidentally assign a data drive as the parity drive, you'll be fine. As long as all the drives are in place, there is a special "trust my parity" process described in the wiki. In your case, you will probably want to let it completely check parity... and then check it again. (The first pass may fix the errors, then the second not find any.)
Thanks so much!!
You are welcome. One last thing. If the hardware is not recognized, it might not let you start the array unless you save a new configuration using the button labeled "restore". That "restore" button has NOTHING to do with data restoration, but instead deletes the existing configuration superblock file, creates a new one based on the currently assigned and working drives, and then starts a new parity calc based on the new configuration. If you have a failed drive, DO NOT USE THE RESTORE BUTTON!!!! as it will erase any knowledge of the data on the failed drive. Always use the "Start" button to start the array, even when replacing a failed drive, or upgrading a drive... (There is an exception to this rule, but in general, unless you are removing a drive from the array and NOT replacing it immediately, pretend the "restore" button does not exist.) The exception I mentioned is if you replaced your hardware, and have ALL the same data drives installed as when you last calculated parity, and the array will not start because they are on different hardware controllers. You can then use the "Trust-my-parity" procedure as described in the wiki. Joe L.
September 11, 2009 but running a smart long test on the drives would probably be a good thing to do, to make sure they are good and get a baseline of how the drives are.
Good advice, but the version of unRAID Robbie is running is missing the support library needed for the smartctl program, as described here: http://lime-technology.com/forum/index.php?topic=2817.msg23548#msg23548 You'll need to install the support library before you can run smartctl. The base unRAID software does not use smartctl, so the missing library has no effect other than on your ability to run the SMART reports on your drives. Instructions on how to run the SMART tests are here: http://lime-technology.com/wiki/index.php/Troubleshooting
September 11, 2009 As far as the 100 errors... Think of randomly changing 100 of the bits on your replacement disk to a value other than what they originally held. It may have a drastic effect if it lands in a piece of code in a much-loved program... or it corrupts a zip file checksum and prevents it from being unzipped... or it might be somewhere in the file-system, causing a file-system corruption that prevents it from being mounted... or in unused space... and have no visible effect at all.
This is the Achilles heel of unRAID IMO. When you do a drive reconstruction there is no way to verify that everything is perfectly restored at the end. With something obvious like this, where the hardware is broken or incompatible, every parity check will produce sync errors, indicating that a future reconstruction will not perfectly restore your data. But parity can be fouled by crashes that surround true disk failures too. After a nasty failure and ensuing drive rebuild, it is not that uncommon to have someone ask if it rebuilt perfectly - and the answer is an unsatisfying "probably". Without access to a backup to compare with, a checksum of some type (e.g., md5), or a par2 set, knowing is not possible. One thing we veterans do is run the monthly parity check to rule out parity errors creeping into the system, and use a UPS to help keep clean power going to the server and avoid crashes after a power blip. It would be smart if someone created a script to run a periodic md5 calculation on each drive, with an ability to compare that with the current data on a disk. Although it would not help recover damaged files, at least it would identify files that have been corrupted.
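A minimal sketch of the kind of md5 script suggested above, in Python. This is not an existing unRAID add-on; the function names are made up for illustration, and any path you pass in (for example a mounted data disk) is your own choice.

```python
import hashlib
import os

def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the MD5 hex digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def build_manifest(root):
    """Map each file's path (relative to root) to its MD5 hex digest."""
    manifest = {}
    for folder, _subdirs, files in os.walk(root):
        for name in files:
            full = os.path.join(folder, name)
            manifest[os.path.relpath(full, root)] = md5_of_file(full)
    return manifest

def compare_manifests(old, new):
    """Return (changed, missing, added) lists of relative paths."""
    changed = sorted(p for p in old if p in new and old[p] != new[p])
    missing = sorted(p for p in old if p not in new)
    added = sorted(p for p in new if p not in old)
    return changed, missing, added
```

The idea would be to build a manifest while the array is known to be healthy, save it somewhere off the array, and compare it against a fresh manifest after a rebuild. A changed checksum only says the bytes differ; it cannot tell corruption apart from a file you edited on purpose, so it works best on data that is not supposed to change.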
September 11, 2009 As far as the 100 errors... Think of randomly changing 100 of the bits on your replacement disk to a value other than what they originally held. It may have a drastic effect if it lands in a piece of code in a much-loved program... or it corrupts a zip file checksum and prevents it from being unzipped... or it might be somewhere in the file-system, causing a file-system corruption that prevents it from being mounted... or in unused space... and have no visible effect at all.
This is the Achilles heel of unRAID IMO. When you do a drive reconstruction there is no way to verify that everything is perfectly restored at the end. With something obvious like this, where the hardware is broken or incompatible, every parity check will produce sync errors, indicating that a future reconstruction will not perfectly restore your data. But parity can be fouled by crashes that surround true disk failures too. After a nasty failure and ensuing drive rebuild, it is not that uncommon to have someone ask if it rebuilt perfectly - and the answer is an unsatisfying "probably". Without access to a backup to compare with, a checksum of some type (e.g., md5), or a par2 set, knowing is not possible. One thing we veterans do is run the monthly parity check to rule out parity errors creeping into the system, and use a UPS to help keep clean power going to the server and avoid crashes after a power blip. It would be smart if someone created a script to run a periodic md5 calculation on each drive, with an ability to compare that with the current data on a disk. Although it would not help recover damaged files, at least it would identify files that have been corrupted.
I agree with this. I have occasionally run md5 on my data to see if anything has changed... but I have not in a while, as nothing was changing. I probably should do it again; thanks for reminding me.
Something similar to bubbaQ's smarthistory package would be great for a job like this. His smarthistory is a great little tool that warns me if/when a drive has reallocated sectors and the like. I have seen the sector reallocation counts on a couple of my drives go up over the year or so of using unRAID, which I would never have known about otherwise. I also took the advice of many here on the board and got a UPS to hook up to the tower, just in case something were to happen and the power were to go out. It is tied in with the powerdown script so that everything shuts down correctly in case of a power failure. I have had the power blip in and out on me many times and the UPS has saved me on those occasions. I have also lost power while at work and the tower has shut itself down cleanly and safely, so that when I get home I don't even realize that something has happened, short of checking my syslog to find out.
September 11, 2009 This is the Achilles heel of unRAID IMO.
Actually, this is the Achilles heel of ANY computer that does not immediately re-read what was written, to verify it got written correctly. Unfortunately, this describes almost ALL computers we use. The "bit-error-rate" of modern drives is good... but not perfect. They layer CRC checks on the sectors, but that does not prevent an error, it just detects it. Look at any SMART report... the "raw" read error rate is not zero. There's a LOT of hardware error correction going on, constantly... having the correct bit values on the drive's end of the cable does nothing to help if the bits are corrupted by noise when they get to the other end. The big question is what happens when a byte is not read accurately, regardless of the cause (disk vs. cabling vs. power-supply vs. MB chipset vs. disk-controller). On MS-Windows we get Blue-screen-of-death... or Cannot open document, or something in between. On unRAID, we get a parity check error. I know one thing for certain: we can at least check parity and verify we can read disk contents and do some basic tests... An MD5 verification sounds like a great addition... (It will take a while to run, but who cares...) Probably something out there already does it... anybody have one they like? Joe L.
September 11, 2009 Thanks for all the help. I'm loving the community support here... it really speaks volumes in favour of unRAID, and I'll be sure to share that with our viewers. I'm going to check on the RAM, and even remove one of the sticks (I have 2x 512 MB) just to be sure it's not something to do with incompatibility between RAM modules. I'll then try moving all drives off the motherboard and only use PCI controllers. If that fails, I'll buy a single PCI controller that supports all my drives. Feel a little like I'm grasping at straws, but I'm sure I'll find it (with all your wonderful help), and will report back what I discover. This morning I ran another parity check and immediately 2 errors showed up. I left it running when I left for work... I'll know the grand total after I get home. Thanks again!! Robbie
September 11, 2009 PS - just for the record, so you know:
- I run all my computers on UPSes. I have one 1,000 VA (or higher) UPS per PC. I would never (EVER!) run a PC without a UPS, especially a box that's meant to hold my data.
- I never dirty shut down. If I power off or reboot the server, I stop it first, and shut down clean.
- I'm using the latest "stable" (non beta) version of unRAID.
- I have 3 drives:
  - 750 GB Parity on Motherboard SATA 0
  - 160 GB Data Drive 1 on Motherboard SATA 1
  - 400 GB Data Drive 2 on PCI Card SATA 0
Anything else worth mentioning?
September 11, 2009 Thanks for all the help. I'm loving the community support here... it really speaks volumes in favour of unRAID, and I'll be sure to share that with our viewers. I'm going to check on the RAM, and even remove one of the sticks (I have 2x 512 MB) just to be sure it's not something to do with incompatibility between RAM modules. I'll then try moving all drives off the motherboard and only use PCI controllers. If that fails, I'll buy a single PCI controller that supports all my drives. Feel a little like I'm drawing at straws, but I'm sure I'll find it (with all your wonderful help), and will report back what I discover. This morning I ran another parity check and immediately 2 errors showed up. I left it running when I left for work... I'll know the grand total after I get home. Thanks again!! Robbie
You might want to post a syslog... in case it points out any other issues you might not be thinking about as of yet. Instructions are under "Troubleshooting" in the wiki. I know it may be frustrating... but once you get stable hardware, it tends to stay stable for a long time. There is a memory test built into unRAID's boot menu. You might run it for a few cycles to get an idea if your memory might be an issue before you start swapping out RAM sticks. Joe L.
September 11, 2009 I've run Memtest86+ on the system (during the build procedure) and all checked out... so I'm guessing it's gonna boil down to the controller... cheap intel headers on the motherboard... I hope that's it, because it's easy to replace. Robbie
September 11, 2009 PS - just for the record, so you know:
- I run all my computers on UPS's. I have one 1,000 VA (or higher) UPS per PC. I would never (EVER!) run a PC without a UPS, especially a box that's meant to hold my data
- I never dirty shut down. If I power off or reboot the server, I stop it first, and shut down clean.
- I'm using the latest "stable" (non beta) version of unRAID.
- I have 3 drives:
  - 750 GB Parity on Motherboard SATA 0
  - 160 GB Data Drive 1 on Motherboard SATA 1
  - 400 GB Data Drive 2 on PCI Card SATA 0
Anything else would be worth mentioning?
All good things... It shows you have more experience with power outages than many beginning unRAID users. Good that you know to stop the array before powering down. There are several add-ons that we have developed to help with the normal maintenance and usage of unRAID servers. Once you get settled, you will want to add the powerdown package, to allow a clean power-down from the command line, and a package to manage the UPS shutdown. (Most of us use apcupsd, as it supports the APC brand of UPS a lot of us use.) Don't focus on those until you have your hardware stable... but look in the wiki for add-ons... Joe L.
September 11, 2009 Anything else would be worth mentioning?
What motherboard & controllers are you using? Post a syslog so we can see a picture of your hardware. The quality of SATA cables can also affect reliability. Your power supply can also be an issue. I actually had a strange problem when I went to a full array. I upgraded a 1TB drive to 1.5TB and added an additional 1TB green drive. It seemed that whenever the drive went to sleep and unRAID tried to spin up the 1TB green drive, it would go offline. The same drive worked fine in an external SATA case, so it was not the drive. I dropped one of the spare drives from the array and right now it's stable, but I know there is an impending power issue at some point. I'm not saying this is your issue, but strange things can happen from power or cable issues. Memory issues will many times reveal themselves as a kernel OOPS. A smartctl long test is wise too: smartctl -t long /dev/sdX (substitute your drive's device name). Also, the pre-clear script is a great way to get a baseline on your drive's health. If you start reallocating sectors after using this script then the drive's health is questionable. I don't know if I would switch to a PCI controller. I would just choose one that is known to work reliably.
September 11, 2009 There are several add-ons that we have developed to help with the normal maintenance and usage of unRAID servers. Once you get settled, you will want to add the powerdown package, to allow a clean power-down from the command line, and a package to manage the UPS shutdown. (Most of us use apcupsd, as it supports the APC brand of UPS a lot of us use) Don't focus on those until you have your hardware stable... but Look in the wiki for add-ons... Joe L.
http://code.google.com/p/unraid-powercontrol/ This is an easy addition to your system and it will save the syslogs upon reboot. Make sure you have the latest smartctl on your system too.