June 30, 2007: I just ran a parity check and it returned sync errors, but there were no drive errors. What are sync errors and what do they indicate?
July 3, 2007 (Author): I can't believe nobody knows the answer to this. I've run multiple parity syncs now and they all return 100+ errors. What does this mean and what should I do about it? I bought unRAID to help protect my data, but I don't feel too secure when it keeps telling me there are errors.
July 4, 2007: Sync errors count the number of mismatches between calculated parity and stored parity. Normally after a parity-check this should of course be 0. It can be non-zero as a result of an ungraceful shutdown, system crash or sudden power outage.
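To make the definition above concrete, here is a minimal sketch of what a sync-error count measures. It assumes simple XOR parity across the data disks and uses tiny in-memory byte strings in place of real drives; the function name and the sample values are made up for illustration and are not taken from unRAID itself.

```python
from functools import reduce
from operator import xor


def count_sync_errors(data_disks, parity_disk):
    """Count positions where parity calculated from the data disks
    differs from the parity actually stored on the parity disk."""
    errors = 0
    for position, stored in enumerate(parity_disk):
        # Calculated parity: XOR of the byte at this position on every data disk.
        calculated = reduce(xor, (disk[position] for disk in data_disks), 0)
        if calculated != stored:
            errors += 1
    return errors


# Three toy "disks" of four one-byte blocks each (hypothetical values).
data = [
    bytes([0x0F, 0xA0, 0x33, 0x01]),
    bytes([0xF0, 0x0A, 0x33, 0x02]),
    bytes([0xFF, 0x00, 0x55, 0x04]),
]

# Parity that agrees with the data, and a copy with one block corrupted.
good_parity = bytes(reduce(xor, column) for column in zip(*data))
bad_parity = bytes([good_parity[0], good_parity[1] ^ 0x01,
                    good_parity[2], good_parity[3]])

print(count_sync_errors(data, good_parity))
print(count_sync_errors(data, bad_parity))
```

Note that a mismatch only says that the data and the stored parity disagree at that position. It does not say which disk supplied the wrong byte, which is why the count alone cannot name a faulty drive.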
July 4, 2007 (Author): None of those things have happened. I haven't turned off or rebooted the system and we haven't had any power outages. I've run multiple parity checks and they all return errors. What else could cause them?
July 4, 2007: "Parity-check" will also correct parity mismatches, so if you run 2 parity-checks in a row and the 2nd check also produces errors, then there's probably a bad hard drive (or bad cable or bad power). Try running 2 consecutive parity checks, and if the second check shows a non-zero error count, then post the syslog.
July 5, 2007 (Author): I just started a consecutive parity check (the first one had 155 errors) and it's only 0.4% along, but there are already 2 errors. I will post a log when it's finished (morning).
July 5, 2007 (Author): The first parity check returned 155 errors and a second consecutive check returned 162 errors. I have attached the syslog.
July 5, 2007: [Quoting the author] "The first parity check returned 155 errors and a second consecutive check returned 162 errors. I have attached the syslog." I wonder if the full set of SMART data would be helpful. If that shows nothing wrong, I would first try new data cables before replacing drives. Bill
July 5, 2007: [Quoting the author] "The first parity check returned 155 errors and a second consecutive check returned 162 errors. I have attached the syslog." Thank you. Let me look at this for a while and I'll post back later this evening.
July 5, 2007: Well, I had some comments and suggestions, but I see Tom is working on it now. Perhaps I'll share my thoughts while you are waiting for the expert answer.
In your syslog, the parity errors stop as soon as it finishes with the smaller drives, so I would think that eliminates all of the larger drives as problematic. At one hour and 1 to 3 minutes after the last parity error, 6 drives are spun down: hde, hdf, hdg, hdi, hdk, hdl. That would seem to point the finger at one or more of these 6 drives, or one or more of the drive controllers involved with these 6, or perhaps some kind of bus contention involving them. The 3 SATA drives (sda, sdb, sdc) and the 2 IDE drives (hdh, hdj) continue to the end of both parity checks without any more errors.
Also, in comparing the lists of errors between the 2 parity check runs, about a fifth of them are the exact same sector/cluster/block, which seems way, way, way too coincidental to be random, which seems to eliminate cables, I think. It makes bus contention very unlikely too (I think). Actual data errors on the hard disks seem impossible to me, since you should have seen read errors listed, and there are none. Hard errors might not be visible because the hard drive would remap the sectors (which would show up in the SMART data as Reallocated Sectors), but then you would not see sectors repeated on a second parity check run.
Tom will have better ideas. What I would do while waiting for Tom is try some repeated file compares, with data on the 6 drives and the original source of that data, or a backup copy elsewhere (I'm assuming there is a backup of the data, or this data is a backup itself). It might be helpful also to try simultaneous reads of the drives and the suspect drive controllers, to force bus contention. You don't want to write to the drives, because that would involve the parity drive. Perhaps these file compares would help determine which drive or controller is bad.
Either the compares are successful, and you know your data is safe, or they fail consistently or inconsistently, and you learn which ones can be trusted and vice-versa.
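The repeated, read-only file compares suggested above could be scripted roughly as follows. This is only a sketch under stated assumptions: the function names, the directory layout, and the use of SHA-256 digests are illustrative choices, not anything from the thread, and the demo at the bottom runs against throwaway temporary directories rather than real array disks. Nothing is written to the directory being checked.

```python
import hashlib
import tempfile
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def compare_trees(array_root, backup_root, passes=3):
    """For every file under backup_root, read the matching file under
    array_root `passes` times and record whether each read matched.

    Returns {relative_path: [True/False per pass]}.
    """
    array_root, backup_root = Path(array_root), Path(backup_root)
    results = {}
    for backup_file in sorted(backup_root.rglob("*")):
        if not backup_file.is_file():
            continue
        relative = backup_file.relative_to(backup_root)
        expected = file_digest(backup_file)
        array_file = array_root / relative
        results[relative.as_posix()] = [
            array_file.is_file() and file_digest(array_file) == expected
            for _ in range(passes)
        ]
    return results


# Demo on temporary directories standing in for a backup and one array disk.
with tempfile.TemporaryDirectory() as tmp:
    backup = Path(tmp, "backup")
    array = Path(tmp, "disk1")
    backup.mkdir()
    array.mkdir()
    (backup / "a.bin").write_bytes(b"same")
    (array / "a.bin").write_bytes(b"same")
    (backup / "b.bin").write_bytes(b"original")
    (array / "b.bin").write_bytes(b"0riginal")  # simulated silent corruption
    demo_results = compare_trees(array, backup, passes=2)

print(demo_results)
```

A file that matches on every pass can be trusted, one that fails on every pass points to stored bad data, and one that passes only sometimes points toward the read path. One caveat: the operating system may serve repeated reads from its cache instead of the drive, so repeat passes are most informative on files larger than memory or after the cache has been cleared.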
July 7, 2007: RobJ posted a great analysis! At first I was thinking there was a bug in parity-check where it might not be writing corrected parity back to the parity drive, but after looking at the code and running several tests this is not the case; that is, the code works fine. The fact that the same block comes up in multiple passes as having bad parity is very troubling. The only explanation I can think of is that you have a drive silently returning bad data. Unfortunately there is no easy way to isolate this. Using RobJ's analysis, one of the smaller 6 drives is the culprit. You should remove 3 of them, rebuild parity, then do a parity-check. If it fails, add the other drives one-by-one until you find the one that's causing sync errors. If it succeeds, then remove those 3, put in the other set of 3 and repeat.
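The isolation strategy above is a form of group testing. The sketch below is a generic halving variant, not the exact sequence of steps described in the post: `run_check` is a hypothetical callback standing in for "rebuild parity with only these suspect drives included, run a parity-check, return the sync error count", and the drive names and error counts are invented. It also assumes the fault follows individual drives, which would not hold if a cable or controller were to blame.

```python
def find_culprits(suspects, run_check):
    """Return the suspects that produce sync errors.

    A group that checks clean is cleared in one test; a group that
    shows errors is split in half and each half is tested again.
    """
    if not suspects:
        return []
    if run_check(suspects) == 0:
        return []  # whole group is clean
    if len(suspects) == 1:
        return list(suspects)  # a single drive that still shows errors
    middle = len(suspects) // 2
    return (find_culprits(suspects[:middle], run_check)
            + find_culprits(suspects[middle:], run_check))


# Simulated array: one hypothetical bad drive producing 50 sync errors.
SIMULATED_ERRORS = {"disk4": 50}
checks_run = []


def simulated_check(drives):
    """Stand-in for a real rebuild plus parity-check; records each run."""
    checks_run.append(list(drives))
    return sum(SIMULATED_ERRORS.get(drive, 0) for drive in drives)


suspect_drives = ["disk1", "disk2", "disk3", "disk4", "disk5", "disk6"]
culprits = find_culprits(suspect_drives, simulated_check)
print(culprits, "found after", len(checks_run), "checks")
```

Each call to `run_check` represents hours of rebuilding and checking on real hardware, so the point of splitting into groups is to keep the number of such cycles small when only one or two drives are at fault.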
July 14, 2007 (Author): I've been trying to figure out which drive is the problem. The last parity check I performed on the full array yielded 222 errors. I unassigned all six of the suspect drives and I've been adding one back at a time, rebuilding the array/parity and then doing a parity check. After removing the suspect drives, it returned no errors. I added hde back in and it returned 48 errors. I removed hde again and added hdf, and a parity check reported 57 errors. I am currently trying yet another one of the suspect drives, but that would seem to indicate 2 problem drives so far, and they still don't add up to the total number of errors I was getting. I find it really hard to believe that 2 drives (at least, probably more) all of a sudden decided to go bad at exactly the same time. Something seems fishy here.
Also, isn't there any other way to check the physical well-being of a drive? This process is extremely slow and tedious. Will everyone have to do all of this every time a drive goes bad? I would think unRAID should tell you which drive is the problem. It sucks having half the array offline for days on end while I try to figure this out.
Not that it's all been unRAID's fault, but my server has been a problem more than it's been fine, and that's really disconcerting. Even my wife mentioned that we never had all these problems before when it was just a generic non-RAID Win2K server. I'm not trying to dog unRAID here, I'm just stating my experience. I really want unRAID to work for me, but this is beginning to get frustrating.
July 14, 2007: I'm not sure why you think unRAID is the problem - this appears to be hardware, not software. Your Win2K box would likely be having the same issue. Try moving the good drives to the connectors where the bad drives were and try again. If you get failures, then it isn't the drives. To confirm, try putting the "bad drives" on the good connectors. Bill
July 14, 2007: I greatly admire the brevity of users like Orb, but I'm afraid this will be another long-winded post. Forgive me.
I had originally thought to suggest the traditional tried-and-true, isolate-the-hardware, 'divide-and-conquer' approach that Tom suggested, of eliminating selected drives and retesting, and felt, as you have found, that it would be a lot of work and system downtime. Tom had less 'compunction'! It IS the best way though, and you HAVE gained useful info. I have to agree with you, after seeing parity errors on multiple drives (and still no actual disk errors reported), that the drives are probably fine. That leaves the bus and its communications as suspect. The suspect list looks to me like this: cables, heat, faulty power, either or both add-on disk controller cards, the motherboard buses.
Before I get to the suspects, I have a comment. I didn't mention it in the last post, because I was trying so hard to keep it short. The 2 IDE drives (hdh and hdj) that seem to be fine are larger drives and possibly newer, but there's one more distinguishing thing about them. They each are the last drive functioning on their particular add-on disk controller. That makes me wonder if the other 6 drives would be fine too, if each was the only drive on its controller, especially now that we think all of the drives are probably fine. This makes one or both of the controllers prime suspects. You could test this (if you have time) by unhooking hdh and hooking up hde again, running parity sync, then parity check. Also, you might try testing the other controller with a second drive, a drive on the same controller as hdj.
Cables: Elsewhere, you have said you are only using 'short, flat and 80 wire', and have changed them multiple times, so we can probably release this suspect.
Motherboard: The fact that a full parity check without the 6 drives works fine seems to me to indicate the motherboard is fine.
Two drives, hdh and hdj, are using the PCI bus without errors, so the PCI chipset is probably fine. (But see heat below.)
Heat: Elsewhere, you mention having a cool case, and extra fans for the drives, but you might check for good airflow across the motherboard and the controller cards. A parity check runs your hardware rather hard, generating extra heat. I noticed a Newegg review concerning your motherboard that mentioned very hot north and south bridge chips, 'gets sooo hot it will burn ya'. Try touching the bridges (2 square copper heatsinks), and the controller card chips (after touching grounded case metal). If a chip is getting too hot, that *might* be a cause of problems. Adding a side fan or 'spot' fan aimed at the motherboard might help.
Power: This is always a prime suspect, and although you have switched out your power one or several times, I strongly recommend purchasing a top quality power supply.
Disk controller cards: As stated above, I strongly suspect one or both of your cards. I noticed you only have one PCI Express X16 slot and 3 PCI slots, no PCI Express X1 slots, so you have limited options, but I recommend purchasing another controller, perhaps a different brand or model, and seeing if that makes a difference.
I'm sorry I can't think of a faster way to test, but you have to have a base state to test against, and a full parity sync is what provides that. The parity check can be aborted as soon as you have seen multiple errors. You probably already have determined a point by which you know if errors are going to happen, and there's really nothing to learn past that point. I'm guessing parity checking the first 20 gigs should be enough for a pass/fail judgment?
Sorry for another long-winded post. These are hard work. I'm afraid I'm not very dependable at followup posts, I seem to burn out quick.
July 14, 2007 (Author): [Quoting Bill] "I'm not sure why you think unRAID is the problem - this appears to be hardware, not software. Your Win2K box would likely be having the same issue. Try moving the good drives to the connectors where the bad drives were and try again. If you get failures, then it isn't the drives. To confirm, try putting the "bad drives" on the good connectors."
I didn't say unRAID is the problem. I'm just saying that this whole transition to unRAID has been frustrating, whatever the cause. Also, it would be nice (ultimately, I'd say expected) for unRAID to tell the user the exact drive(s) where the problem(s) are occurring, so that we can get right to the root of the problem instead of spending days on trial and error. These are just my opinions and feelings.
July 15, 2007 (Author): RobJ - I truly appreciate the time and effort put into all responses, especially long ones like yours, because it usually means the poster took the time and effort to cover many points. You have made many valid and useful points in your post. A few points of interest:
1. I am not using the on-board IDE channel on the mobo because I was having problems with the drives on it being detected in BIOS.
2. I am using two different PCI controller cards: a Promise Ultra 100 TXII and a Highpoint Rocket 133.
3. The southbridge is pretty hot to the touch. Everything else seems pretty cool.
Since hde and hdf both returned errors and they both have the common denominator of being on the same cable, channel, and controller, I have decided to replace the controller card, since I have a spare. They were connected to the Highpoint Rocket 133 controller card, which I have replaced with another Promise Ultra 100 TXII. I am currently rebuilding the parity with hde added back in and will then run a parity check to see if any errors are returned. Hopefully this will yield some definitive answers.
July 16, 2007 (Author): Replacing the Highpoint controller card seems to have done the trick. I re-installed and re-assigned all drives to the array, let it build parity, and then immediately ran a parity check, which returned no errors. I'll be keeping a close eye on it for the next few weeks. Hopefully everything will stay OK now.