April 12, 2008
OK... Oddly enough, none of the drives have errors (in the disk status), but when I did a parity sync (on newly formatted drives) I received a lot of "sync errors" down in the command area. I think it was 318 or something (close to the # of GB in the parity/data drives). All drives are 320GB drives, including the parity and the cache (which I don't think even comes into play for this scenario).

Now, I searched all over, and can't find anything on the forums about what a sync error is, exactly. Anyone have an idea? I copied about 300GB worth of data, and it seemed to work fine... and now I'm re-syncing the parity, and so far I've come up with 38 sync errors (14% of the way through). I am a little concerned about what this actually means.
April 12, 2008 (Author)
After doing it again... I have 517 sync errors. Still, none of the drives show errors in the Disk Status area.
April 12, 2008
The whole basis of unRAID is parity. Think of each disk as being a long string of bits (1s and 0s). Parity is computed by examining the corresponding bit positions of each disk (e.g., the first bit of disk 1, first bit of disk 2, ...) and determining if the result is even or odd. If it is even, the corresponding bit of parity becomes a 1. If it is odd, parity becomes a 0. So if you add up all the corresponding bits (including parity), you will always get an odd number. To do the check unRAID is rechecking these calculations and looking for positions that don't add up to odd numbers.

There are things that can happen to induce parity errors. For example, if you had a power failure it would not be unusual to have a few parity errors after restarting. If you took a disk out of the array and wrote to it on another system and then put it back, you'd expect lots of parity errors.

When unRAID encounters a parity check error, it recomputes parity (it assumes the parity is wrong and the data is right) and rewrites parity. So, if you did have some parity errors, they should fix themselves and the next parity check should be clean.

But if you have kept your drives in your array and run a full parity check and found hundreds of errors, and then ran the parity check again and again got hundreds of errors, that is not normal. In essence, you are not able to reliably write to your drives, reliably read from them, or both!

Are you, by chance, using any IDE drives connected by round IDE cables? These have been known to cause data errors. Sometimes using cable select settings on the drives can cause problems, so you should set master / slave settings on IDE drives explicitly. If you are using SATA300 drives on SATA150 ports, make sure they are jumpered correctly. I would double check all these types of hard disk jumpers and connections. Examine the cables for breaks or cuts and replace anything that looks suspicious. Make sure everything is plugged in securely.

Before shutting down, you should capture a syslog (follow the link at the bottom of my post for instructions). The log will contain the locations of your parity errors, which may give some hints as to what might be happening (if they are all in the same area of the disk, that may mean something else). I'd also make sure that the drive's SMART features are turned on in the BIOS.
April 12, 2008 (Author)
Thanks. This is the end of my syslog. I'm going to shut down and check cables this evening when I get back home (as well as run a few other checks someone suggested in another thread). Well, actually I won't post the whole thing... I've got it saved. But the gist of it is:

Apr 12 11:41:42 Tower logger: mover finished
Apr 12 11:43:31 Tower emhttp: shcmd (33): /usr/sbin/hdparm -y /dev/sdf >/dev/null
Apr 12 11:43:31 Tower emhttp: shcmd (34): /usr/sbin/hdparm -y /dev/sdh >/dev/null
Apr 12 11:43:31 Tower emhttp: shcmd (35): /usr/sbin/hdparm -y /dev/sdg >/dev/null
Apr 12 11:43:31 Tower emhttp: shcmd (36): /usr/sbin/hdparm -y /dev/sdi >/dev/null
Apr 12 11:43:31 Tower emhttp: shcmd (37): /usr/sbin/hdparm -y /dev/sda >/dev/null
Apr 12 11:43:31 Tower emhttp: shcmd (38): /usr/sbin/hdparm -y /dev/sdb >/dev/null
Apr 12 11:43:31 Tower emhttp: shcmd (39): /usr/sbin/hdparm -y /dev/sdd >/dev/null
Apr 12 11:43:31 Tower emhttp: shcmd (40): /usr/sbin/hdparm -y /dev/sde >/dev/null
Apr 12 11:53:09 Tower emhttp: spinning up: /dev/sdf
Apr 12 11:53:09 Tower emhttp: spinning up: /dev/sdh
Apr 12 11:53:09 Tower emhttp: spinning up: /dev/sdg
Apr 12 11:53:09 Tower emhttp: spinning up: /dev/sdi
Apr 12 11:53:09 Tower emhttp: spinning up: /dev/sda
Apr 12 11:53:09 Tower emhttp: spinning up: /dev/sdb
Apr 12 11:53:09 Tower emhttp: spinning up: /dev/sdd
Apr 12 11:53:09 Tower emhttp: spinning up: /dev/sde
Apr 12 11:53:52 Tower kernel: mdcmd (108): check
Apr 12 11:53:52 Tower kernel: md: recovery thread got woken up ...
Apr 12 11:53:52 Tower kernel: md: recovery thread checking parity...
Apr 12 11:53:52 Tower kernel: md: using 1152k window, over a total of 312571192 blocks.
Apr 12 11:56:11 Tower kernel: md0: parity incorrect: 7683800
Apr 12 11:57:34 Tower kernel: md0: parity incorrect: 12316504
Apr 12 11:59:18 Tower kernel: md0: parity incorrect: 18117360
Apr 12 11:59:24 Tower kernel: md0: parity incorrect: 18462824
Apr 12 11:59:27 Tower kernel: md0: parity incorrect: 18636776
Apr 12 11:59:31 Tower kernel: md0: parity incorrect: 18837096
Apr 12 12:00:09 Tower kernel: md0: parity incorrect: 20967768
Apr 12 12:00:19 Tower kernel: md0: parity incorrect: 21536872
Apr 12 12:00:33 Tower kernel: md0: parity incorrect: 22319336
Apr 12 12:01:00 Tower kernel: md0: parity incorrect: 23819496
Apr 12 12:01:42 Tower kernel: md0: parity incorrect: 26150272
Apr 12 12:03:31 Tower kernel: md0: parity incorrect: 32225608
Apr 12 12:04:03 Tower kernel: md0: parity incorrect: 34019704
Apr 12 12:04:38 Tower kernel: md0: parity incorrect: 35934936
. . .
Apr 12 14:55:39 Tower kernel: md0: parity incorrect: 609798048
Apr 12 14:55:39 Tower kernel: md0: parity incorrect: 609819296
Apr 12 14:55:41 Tower kernel: md0: parity incorrect: 609926048
Apr 12 14:55:45 Tower kernel: md0: parity incorrect: 610156832
Apr 12 14:55:57 Tower kernel: md0: parity incorrect: 610841736
Apr 12 14:56:34 Tower kernel: md0: parity incorrect: 612903440
Apr 12 14:56:41 Tower kernel: md0: parity incorrect: 613289472
Apr 12 14:57:15 Tower kernel: md0: parity incorrect: 615161688
Apr 12 14:57:33 Tower kernel: md0: parity incorrect: 616169416
Apr 12 14:57:50 Tower kernel: md0: parity incorrect: 617149040
Apr 12 14:58:07 Tower kernel: md0: parity incorrect: 618078088
Apr 12 14:58:13 Tower kernel: md0: parity incorrect: 618385816
Apr 12 14:59:40 Tower kernel: md0: parity incorrect: 623252720
Apr 12 14:59:45 Tower kernel: md0: parity incorrect: 623539808
Apr 12 14:59:54 Tower kernel: md0: parity incorrect: 624059776
Apr 12 14:59:57 Tower kernel: md0: parity incorrect: 624207496
Apr 12 14:59:58 Tower kernel: md0: parity incorrect: 624253808
Apr 12 14:59:59 Tower kernel: md0: parity incorrect: 624338576
Apr 12 15:00:08 Tower kernel: md0: parity incorrect: 624832112
Apr 12 15:00:10 Tower kernel: md0: parity incorrect: 624951344
Apr 12 15:00:14 Tower kernel: md: sync done. time=11181sec rate=27955K/sec
Apr 12 15:00:14 Tower kernel: md: recovery thread sync completion status: 0
Apr 12 17:45:53 Tower in.telnetd[32702]: connect from 192.168.2.198 (192.168.2.198)
Apr 12 17:45:58 Tower login[32703]: ROOT LOGIN on `pts/0' from `192.168.2.198'
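For anyone wanting to check whether the reported locations cluster in one area of the disk, as suggested earlier in the thread, here is a rough Python sketch. It assumes the syslog lines look like the "md0: parity incorrect: N" entries above; the function names and the bucketing approach are my own illustration, not anything unRAID provides, and the buckets are relative to the largest position seen rather than to the real disk size.

```python
import re

# Matches the kernel lines shown in the log excerpt above.
PARITY_RE = re.compile(r"md0: parity incorrect: (\d+)")


def parity_error_positions(log_text):
    """Return every reported parity-error position, in log order."""
    return [int(m.group(1)) for m in PARITY_RE.finditer(log_text)]


def bucket_counts(positions, buckets=10):
    """Count errors per equal-width slice of the range 0..max(positions).

    A roughly even spread points away from a single bad region on one disk;
    a single crowded bucket would point toward one.
    """
    counts = [0] * buckets
    if not positions:
        return counts
    span = max(positions) + 1
    for pos in positions:
        counts[pos * buckets // span] += 1
    return counts


# Three lines copied from the excerpt above, used as sample input.
SAMPLE_LOG = """\
Apr 12 11:56:11 Tower kernel: md0: parity incorrect: 7683800
Apr 12 11:57:34 Tower kernel: md0: parity incorrect: 12316504
Apr 12 15:00:10 Tower kernel: md0: parity incorrect: 624951344
"""

if __name__ == "__main__":
    found = parity_error_positions(SAMPLE_LOG)
    print(len(found), "errors; per-bucket counts:", bucket_counts(found, buckets=4))
```

To use it on a real log, read the saved syslog file into a string and pass it to parity_error_positions in place of SAMPLE_LOG.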
April 13, 2008
If I recall right, in another thread you ran some pretty drastic, low-level procedures that may have gone behind the back of unRAID, and the parity drive may not have been kept in sync. For future reference, I would forget about parity during operations that drastic: just un-assign the parity drive, then perform whatever formatting, clearing, testing, and copying operations you want on the data drives. When you have them the way you want them, and fully operational, then assign the parity drive and build parity. There are two advantages to this: you remove the complications associated with maintaining parity, and you will have much better performance, especially when copying to your array.

One point: when you had many sync errors but they did not appear on the Status screen, is it possible that you hadn't refreshed the screen? That screen does not automatically refresh at this time, but there is a Refresh button that will update all stats to current values. My apologies if you knew that, and something else was wrong.
April 13, 2008 (Author)
I'm not sure I understand how what I did previously could be a problem, if "sync error" stands for mismatched parity (which corrected itself).

I did do a bunch of crap in order to format the drives... but once I did all that, they showed up as "unformatted" (I did not bother doing anything to the parity disk, only to the data drives). Once I had the data drives unformatted (so disks 1-6 and cache were "unformatted"), I had unRAID format them all (except parity), and then did a parity sync. During that parity sync (when ALL the drives were empty), I came up with 320ish "sync errors". I copied about 300GB worth of data over, and then decided to do another parity sync, at which point I got 517 "sync errors".

I am concerned about what exactly this means, considering version 3 (which I just upgraded from) didn't even have a "sync error" notification, that I recall... so I don't know if it had been previously having issues or whether this is a new thing with 4.3. I'm really concerned that this may mean my parity isn't actually good.

Here is a screen crop:

I'm off to jiggle cables, run a memtest, and try some SMART tests...
April 13, 2008
I'm sorry, I in no way meant to imply that you had done anything wrong; you didn't. I only wanted to *try* and explain (not very well) why unRAID might be confused, and why so many parity errors occurred. On the good instructions of others, you performed some non-standard procedures, because unRAID does not currently have any other way to do what you wanted. Those non-standard commands may have *confused* unRAID, and caused out-of-sync parity values. That is why I suggested, in hindsight, it might be better to un-assign the parity drive, so you won't have any parity problems to worry about.

What possibly caused the parity sync problems is that unRAID has no concept of a parity-protected array that includes an unformatted drive; it just never happens. Drives are partitioned and cleared, then parity is updated along with the formatting of the drive.

I should also have mentioned that I too did not understand why there was a second batch of 'sync errors'. That is very unusual, but I figured it was still due to the system recovering from the 'unformatting'.

Anyway, your system found and fixed 517 errors, and is not reporting any since. To build confidence in the system, you may want to copy another 100GB or more to other drives, and do another parity check. I don't anticipate finding any more errors.
April 13, 2008 (Author)
Oh, I didn't mean to come off as offended. I just didn't see how a parity sync on completely new/formatted drives would show parity sync errors as was described to me (then show MORE later). It is weird: the drives themselves don't show any errors (like I have previously seen in unRAID 3.0 on an old dying hard drive), just the strange sync error which doesn't get logged to a drive at all.

I think I may have tracked down the problem, however... so far I'm running memtest and have had 2 errors in three passes on the memory. I certainly hope it is something as simple as bad RAM... but I'd still like an official word on what exactly the "sync error" entails, so that I can be a little more comfortable trusting the system to back up all our work.
April 13, 2008
"so far I'm running memtest and have had 2 errors in three passes on the memory."

This is a huge "uncorrectable" problem unless you have ECC RAM. It would be a cause of corruption before and after the drive maintenance procedure. I would not move any more data until your memtest runs clean.

After your memtest runs clean, I would suggest running reiserfsck on all your drives (after they are unmounted). Let us know if you need more details on doing the reiserfsck on your "unmounted" drives. Do not do it on the drives while they are mounted.
April 13, 2008 (Author)
Well, crap. After running memtest all night, it seems the memory I have is crap... 17 errors in 60 passes, at 218.4MB and at 411.1MB. And wonder of all wonders, the 4 different sticks of RAM I have as spares (or in other systems) just give me beeping error codes. Looks like I need to track down the mobo model # and see what kind of RAM is compatible with it.

Thanks guys... I'll keep this updated if/when the new RAM fixes the problem (or doesn't).
April 13, 2008
"Well, crap. After running memtest all night, it seems the memory I have is crap... 17 errors in 60 passes, at 218.4MB and at 411.1MB. And wonder of all wonders, the 4 different sticks of RAM I have as spares (or in other systems) just give me beeping error codes. Looks like I need to track down the mobo model # and see what kind of RAM is compatible with it. Thanks guys... I'll keep this updated if/when the new RAM fixes the problem (or doesn't)."

The bad RAM could also easily be the reason for the parity errors. You might be able to adjust the timing or voltage for the RAM in your BIOS to make your existing RAM work without errors (some need a bit higher voltage than others).

Joe L.
April 13, 2008
You could review the BIOS and adjust memory timings, i.e., relax them if they are aggressive or in turbo mode.
April 13, 2008
Not to pile on here with more bad news, but if it were my system, I would have to consider ALL of the data stored on the unRAID server as 'crap' too. When you get your memory situation fixed, I would run some file compare tests on your data, or just replace the data with their originals or their backups. There may not be any file corruption issues, or only a few corrupt files, but until you know for sure, how can you trust any of it? Sorry...

For archives saved on your server, you can usually run their self test. Zip-based archives, 7z and RAR files, and probably most compressed archives have a way to test for bit-perfect integrity.
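As a rough illustration of the two checks suggested above (comparing a stored file against its original, and running an archive's self test), here is a minimal Python sketch. The helper names are made up for this example, and it only covers zip archives, using the standard library's zipfile.ZipFile.testzip(); 7z and RAR need their own tools. It should only be run on a machine whose memory has passed memtest, since bad RAM can corrupt the comparison itself.

```python
import hashlib
import zipfile


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def files_match(original_path, copy_path):
    """True if the copy on the server is byte-identical to the original."""
    return sha256_of(original_path) == sha256_of(copy_path)


def zip_is_intact(zip_path):
    """True if every member of the zip archive passes its CRC check.

    testzip() returns the name of the first bad member, or None if all pass.
    """
    with zipfile.ZipFile(zip_path) as archive:
        return archive.testzip() is None
```

A mismatch from files_match, or False from zip_is_intact, means that copy should be replaced from the original or a backup.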
April 13, 2008
"The whole basis of unRAID is parity. Think of each disk as being a long string of bits (1s and 0s). Parity is computed by examining the corresponding bit positions of each disk (e.g., the first bit of disk 1, first bit of disk 2, ...), adding them together, and determining if the result is even or odd. If it is even, the corresponding bit of parity becomes a 1. If it is odd, parity becomes a 0. So if you add up all the corresponding bits (including parity), you will always get an odd number. To do the check unRAID is rechecking these calculations and looking for positions that don't add up to odd numbers."

Everything you said is absolutely true, except that unRAID uses "even parity" and you described "odd parity."

So, when adding up the bits, if the total is even across a bit position on all the data disks, the corresponding bit position on the parity drive is set to 0. If the data disks' total is odd at a given bit position, a "1" is written to the parity disk to bring the total across all drives at that bit position to an even number.

When adding up all the bits at the corresponding bit position across all the disks (including parity), you will always get an even number.

Joe L.
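The even-parity rule described above is the same thing as XOR-ing the corresponding bits of every data disk. Here is a small Python sketch of that idea, treating each "disk" as a short bytes object. It is only a toy model for illustration; the function names are invented and this is not unRAID's actual code.

```python
from functools import reduce
from operator import xor


def compute_parity(data_disks):
    """Even parity: XOR the bytes at each position across all data disks.

    XOR works bit by bit, so each parity bit is 1 exactly when the data
    bits at that position sum to an odd number.
    """
    return bytes(reduce(xor, column) for column in zip(*data_disks))


def find_sync_errors(data_disks, parity_disk):
    """Return the byte positions where stored parity disagrees with the data."""
    expected = compute_parity(data_disks)
    return [i for i, (want, have) in enumerate(zip(expected, parity_disk))
            if want != have]


if __name__ == "__main__":
    disks = [b"\x0f\x01", b"\xf0\x01", b"\xff\x00"]
    parity = compute_parity(disks)
    print(parity.hex())                               # parity matching the data
    print(find_sync_errors(disks, parity))            # no mismatches
    print(find_sync_errors(disks, b"\x00\x01"))       # one flipped bit in parity
```

In this model a "sync error" is simply a position where the stored parity does not match what the data disks say it should be, which is also what a bit flipped by bad RAM during a read or write would look like.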
April 13, 2008 (Author)
Unfortunately, this is one of the first SATA unRAID boxes you could buy pre-built. I'm not sure if they are the same now or not. But this mobo has no RAM settings, only has a (hot) Duron processor and really no customizations available for any sort of memory/processor/frequency/etc.

I'm thinking about replacing the motherboard, processor & RAM instead of trying to track down what RAM will actually work with this.
April 19, 2008 (Author)
Well, no luck with any RAM that I can (easily) get my hands on working with that mobo... so I'll probably use the mobo/cpu/ram for something else (non-critical) and start from scratch. For now, I'm back to storing multiple copies of the same data on different hard drives. Wish I had figured this out BEFORE I managed to format everything.

Any official word... is the bad RAM certainly the cause of this? Also, is the official definition of "sync error" = parity was invalid and needed to correct itself?
April 23, 2008 (Author)
Well... I found some RAM that worked (one 1GB stick, not dual channel anymore) and it passed 4 hours of memtest. So I put it back in... and I have ERRORS when checking parity! (oh teh noes!!one!11!)

But fortunately, it seems that it was correcting the old parity errors... because I have re-synced two more times (after the initial one) and have ZERO errors now! Woohoo!!

So, you guys were correct... fixing the RAM seems to have fixed the sync error issue. Thank you for everyone's input.