Behavior I'm having a hard time understanding

May 21, 2008

It had been 2 months since my last parity check. I thought what the hell, why not. Well, I'm glad I did, because I had thousands of parity errors, and also thousands of errors on my parity drive itself.

Subsequent re-checks didn't show as high a number of parity errors, or errors on the parity drive. A couple of hundred. I didn't write down the exact numbers. In total I probably ran 4 parity checks. Each yielded a different number of errors, even though I wasn't using the server for anything else but parity checks.

I was starting to think the parity drive was bad, but I just had a feeling that it wasn't.

I decided to assign the parity drive to an unused data disk slot. I decided it was worth the risk to not have parity for a while, since it probably wasn't working right anyway. Once this process completed I ran a reiserfsck on it. No errors, everything looked good.

I then assigned the drive back to parity. The first parity check after doing this looks great. No errors of any kind showing up. I'm going to start another parity check before going to bed, but I'm interested in everyone's thoughts on this matter.

May 21, 2008

Running reiserfsck on a parity drive does not work, or shouldn't, and it could possibly corrupt the parity drive's integrity (the data, not the physical drive).

I'm curious whether the parity check just checks, or whether it rewrites/recreates the parity for each block.

If you've been running reiserfsck on your parity drive, that could be the cause of the discrepancies.

May 21, 2008

Author

No no, you misunderstand what I did. I made the parity drive into a data drive before running the reiserfsck. Then I set it back to parity. Aside from temporarily not having parity protection, I wouldn't think doing this would cause any problems.

And by the way, I now have run through the parity sync a second time since doing this procedure, and no errors anywhere again.

And like I said before, can anyone provide any insight into why this "fixed" my problem?

May 21, 2008

When you get an error on a drive (in the far right column in the GUI), that means that unRAID experienced an error while reading or writing to the drive. A write error is quite severe, and actually causes unRAID to take the drive out of service. A read error, by contrast, increments that count, and unRAID attempts to correct the error (see below). A read error usually means that there is a bad sector on the disk.

(When a read error occurs on a drive, the drive itself realizes that it has a bad sector. It marks that sector as "needs to be remapped".)

When unRAID encounters a read error on a drive, it reconstructs the contents of the sector by reading the contents of the other drives (similar to rebuilding a whole drive, but just for that one sector). It then writes that reconstructed sector back to the offending drive on the same sector number. The drive, realizing the sector "needs to be remapped" will assign a spare sector on the drive and take the actual bad sector out of service. The end result is that the sector will contain the right data the next time it is read.
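To make the reconstruction step concrete, here is a minimal Python sketch. It assumes single-parity XOR across equal-sized sectors; the function names and the tiny 4-byte "sectors" are made up for illustration and are not unRAID's actual code.

```python
from functools import reduce

def xor_sectors(sectors):
    """XOR equal-length byte strings together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*sectors))

def reconstruct_sector(surviving_sectors):
    """Rebuild one unreadable sector from the same sector on every other
    drive (the remaining data drives plus parity)."""
    return xor_sectors(surviving_sectors)

# Hypothetical 4-byte sectors on three data drives.
disk1 = bytes([0x10, 0x20, 0x30, 0x40])
disk2 = bytes([0x0F, 0x0E, 0x0D, 0x0C])
disk3 = bytes([0xAA, 0xBB, 0xCC, 0xDD])
parity = xor_sectors([disk1, disk2, disk3])

# Pretend disk2 returned a read error for this sector.
rebuilt = reconstruct_sector([disk1, disk3, parity])
print(rebuilt == disk2)
```

The rebuilt bytes are what would then be written back to the same sector number on the offending drive, giving the drive its chance to remap.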

So, once unRAID reports one of these errors, realize that unRAID has already taken steps to correct the errors. You should NOT get the same error twice in a row.

* * * * *

When you get a parity sync error (as reported at the bottom of the screen while doing a parity check), it does NOT mean that you have a bad drive. It normally means that you have powered down your array without a clean shutdown. It could also mean that you wrote to the drive without parity protection in place. Running reiserfsck to correct something might have such an effect, depending on how you run it.

When a parity sync error occurs, unRAID will recompute PARITY for the sector in question and rewrite it back to the parity drive. This should correct the problem. Since unRAID always corrects this type of problem by writing to the parity drive, unRAID will not corrupt any data. (unRAID "trusts" the data on the data drives over parity data unless there is an actual read error on a data disk or a need to reconstruct a drive).
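Roughly, the check-and-correct behavior described above could be sketched like this in Python. This is only an illustration under the assumption of simple XOR parity; the function name and the toy array (one integer per "sector") are hypothetical, not anything taken from unRAID itself.

```python
def parity_check(data_disks, parity_disk):
    """Check every sector; on a mismatch, trust the data drives and rewrite
    parity. Returns the number of sync errors found (and corrected)."""
    sync_errors = 0
    for sector in range(len(parity_disk)):
        expected = 0
        for disk in data_disks:
            expected ^= disk[sector]
        if parity_disk[sector] != expected:
            sync_errors += 1
            parity_disk[sector] = expected  # only the parity drive is written
    return sync_errors

# Hypothetical array: each "sector" is a single integer for brevity.
data_disks = [[1, 2, 3, 4], [5, 6, 7, 8]]
parity_disk = [1 ^ 5, 2 ^ 6, 0, 0]  # the last two parity sectors are stale

first_pass = parity_check(data_disks, parity_disk)
second_pass = parity_check(data_disks, parity_disk)
print(first_pass, second_pass)
```

The second pass finds nothing because the first pass already rewrote the stale parity sectors, which is why the same sync error should not show up twice in a row.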

Just as with the reported disk errors, once unRAID reports one of these "errors", realize that unRAID has already taken steps to correct the errors. You should NOT get the same error twice in a row.

* * * * *

I cannot explain the behavior you are reporting. What I would suggest is that you run "smartctl". (Click the link to the wiki in my sig and follow the link to the hard disk error section. You will find instructions to download and run it.)

One possible explanation is that your parity drive had a large number of sectors go bad, and over the course of time remapped them all. This seems unlikely, but fits the facts.

If you are asking whether you should trust the parity drive, I would say look at your smartctl data. If it is showing evidence of a bunch of sector remaps, then I would say the drive is going bad and you should replace it. If not, then it seems that perhaps a loose or bad cable or backplane might have been the cause. If such a problem spontaneously fixed itself, I'd be skeptical.

Run and post your output and the community may be able to help you interpret the results.

BTW, whenever I get any type of error (drive error or parity sync error) I ALWAYS capture a syslog. It can provide invaluable information to help figure out what happened, as all of these are logged. Once you reboot the chance is lost. If it is some kind of unRAID bug, Tom would need this log to see what happened and have any chance of fixing it. You can look at the bad sector numbers and see if they are sequential (meaning perhaps a certain part of the disk has failed) or seemingly random. All this can help lead to a likely cause. Otherwise we're just guessing without any facts. Instructions for capturing the log are also provided in the wiki.
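As a rough way to tell "sequential" from "seemingly random" once you have the sector numbers in hand, something like the following Python sketch could help. The sector numbers, the gap threshold of 8, and the "more than half" rule are all arbitrary assumptions for illustration; it does not parse a real syslog.

```python
def group_into_runs(sector_numbers, max_gap=8):
    """Group sector numbers into runs whose neighbours are at most max_gap apart."""
    runs = []
    for sector in sorted(set(sector_numbers)):
        if runs and sector - runs[-1][-1] <= max_gap:
            runs[-1].append(sector)
        else:
            runs.append([sector])
    return runs

def looks_sequential(sector_numbers, max_gap=8):
    """Heuristic: True if more than half of the failing sectors sit in multi-sector runs."""
    runs = group_into_runs(sector_numbers, max_gap)
    in_runs = sum(len(run) for run in runs if len(run) > 1)
    return in_runs > len(set(sector_numbers)) / 2

# Made-up sector numbers, as if copied by hand from a syslog.
clustered_sample = [1000, 1008, 1016, 1024, 1032, 500000]
scattered_sample = [12, 90431, 775200, 1200345, 88000111]

print(looks_sequential(clustered_sample), looks_sequential(scattered_sample))
```

A clustered result hints at one damaged region of the platter; a scattered one points more toward something outside the media, though neither is proof on its own.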

Post your smartctl results. That should give some good information.

May 21, 2008

No no, you misunderstand what I did. I made the parity drive into a data drive before running the reiserfsck. Then I set it back to parity. Aside from temporarily not having parity protection, I wouldn't think doing this would cause any problems.

No, it would not cause problems, but it would not be an effective test of the drive either. The reason is this...

Creating a reiser file system involves writing very few blocks of data to a disk. Checking the resulting file system just after you created it would read those same few blocks of data and nothing more. Reiserfsck is not a full disk scan. Anything that does a full scan would take hours on larger disks.

With all that in mind, all you proved was that some small number of blocks on the parity drive could be written to and read back. The errors originally reported could have been in entirely different places on the disk. (Odds are very high they were not involved in the file system check you performed.)

And by the way, I now have run through the parity sync a second time since doing this procedure, and no errors anywhere again.

That is good news. Your symptoms seem to match that of bad blocks on the parity drive being re-mapped by it to spare sectors. When the "read" errors occurred, it would have marked the sectors as bad and the next "write" to that sector would actually write to the remapped sector. The only way to know the true health of the parity drive is to run "smartctl" on it as bpj999 described.

And like I said before, can anyone provide any insight into why this "fixed" my problem?

Odds are the parity drive's bad sectors were re-mapped to good ones. If, over time, more bad sectors occur, it is time to RMA the drive and get a replacement. If no more bad sectors occur, you might be fine for years to come. You should do a monthly parity check. It reads all the sectors on all the drives, so if an error were to occur, it would be detected early. (Actually, it is best to run "smartctl", get the statistics on the disks, run a parity check, then run "smartctl" again. The statistics of bad and re-mapped blocks (or blocks pending re-mapping on their next write) should not change by much, if at all. If they do, you can decide if it is bad enough to replace the drive with the problems.)
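The before/after comparison suggested above can be sketched in a few lines of Python. This assumes you have already noted the raw values of the relevant attributes from two smartctl runs; the attribute names follow smartctl's usual labels, but the numbers and the function name are hypothetical.

```python
WATCHED = ("Reallocated_Sector_Ct", "Current_Pending_Sector")

def smart_changes(before, after, watched=WATCHED):
    """Return {attribute: (old_raw, new_raw)} for watched attributes whose raw value changed."""
    return {
        name: (before.get(name, 0), after.get(name, 0))
        for name in watched
        if before.get(name, 0) != after.get(name, 0)
    }

# Hypothetical raw values noted by hand before and after a parity check.
before = {"Reallocated_Sector_Ct": 3, "Current_Pending_Sector": 0}
after = {"Reallocated_Sector_Ct": 3, "Current_Pending_Sector": 2}

changes = smart_changes(before, after)
print(changes)
```

An empty result means nothing moved during the parity check; anything else is a reason to look harder at that drive.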

Joe L.

May 21, 2008

.... and also thousands of errors on my parity drive itself.

Subsequent re-checks didn't show as high a number of parity errors, or errors on the parity drive. A couple of hundred.

Thousands of errors on the parity drive is bad. 1-10 remapped sectors is one thing. 1000s is quite another. The drive only has so many spare sectors for remapping. The fact that even after getting and correcting 1000s of errors, there were still more the next time through is evidence of imminent failure. Although there could be some external cause (like a bad disk controller), if these are true disk errors it is way past time to RMA the drive! smartctl will help answer some of these questions.

JoeL is 100% right that formatting a drive is no test at all. Unless there is a bad sector on the first 50 MEGAbytes (maybe less) of the drive, you're not going to have format complain.

May 22, 2008

Author

Well, I guess I was hoping that unraid was giving me erroneous messages in some way. I guess not. SMART info is completely independent of the OS. Attached is my parity drive SMART info. I guess my only question now is whether I should replace the drive ASAP, or wait for other problems. My gut tells me to replace.

May 22, 2008

Interesting - your attributes look fine, but the fact that the drive has had 2142 errors is not good. It is hard to tell by looking at the last 5 what the error is. I don't have any reference guide that decodes these command register values.

One possible explanation could be a bad data cable. If the command strings are getting garbled on the wire, it could cause errors where the drive gets an invalid command and returns an error. This looks like what may be happening.

The other possible explanation is that the drive is broken somehow.

I'd suggest changing the data cable to your parity drive. If the drive is in a backplane, I'd suggest plugging it directly into the other end of the cable you just replaced. I'd then run a parity check. Don't worry if you get sync errors, as long as you are not getting drive errors. Then run smartctl. If the number of errors has not incremented (over 2142), and the attributes still look good, I'd say that was likely your problem. You could then run parity check again. There should be no sync errors the second time, and smartctl output should still show no more errors.

If this doesn't work, try plugging the parity drive into another SATA port. In rare situations users have reported one of the SATA ports on the MB not working.

May 22, 2008

Good analysis, Brian, and I concur, the drive looks fine.

TSM, keep in mind that parity errors are not disk errors. They are an indication of a parity mismatch, which could be caused by a disk error on any of the drives, or just a modification of a data drive outside of the parity protection, that is, a write to a data drive that does not go through the unRAID driver, and therefore does not update the parity info.

It's too bad you were not able to capture a syslog when you had the thousands of errors. It would probably have indicated what was wrong, and on which drive.

You might try a SMART drive test, short or long, on ALL of the drives. Here's one post and a wiki article, near the bottom. (I'm not sure, but I *think* that there should be a space after the -t and before the short or long.)
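For what it's worth, smartctl does take the test type as a separate argument after -t. A small Python sketch that only builds the command line (it does not run anything) might look like this; the helper name and the /dev/sdX device path are placeholders, so substitute your own device.

```python
def smart_test_command(device, kind="short"):
    """Build the argument list for starting a SMART self-test on one drive."""
    if kind not in ("short", "long"):
        raise ValueError("kind must be 'short' or 'long'")
    return ["smartctl", "-t", kind, device]

# /dev/sdX is a placeholder, not a real device name.
command = smart_test_command("/dev/sdX", "long")
print(" ".join(command))
```

The argument list could be handed to subprocess.run as root, once per drive in the array.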

May 22, 2008

Author

The last 2 times I ran a parity check, everything was good. No sync errors and no errors on the drive. I am using a drive enclosure. Same place in the enclosure, same cable, same SATA port. It's gotta be the drive. I guess I had a lot of bad sectors that it was just able to remap. It's a terabyte drive, so it stands to reason that a terabyte drive would have more "spare" sectors to use for remapping than a smaller drive would.

My plan at this point is to run a parity check weekly. If it's good for a month or 2, I will go to a monthly check.

May 22, 2008

Author

I wasn't going to post these, but in light of RobJ's most recent comment I thought I should. I had wondered whether my problems were somehow a result of the older drives I have in my system. Disk3 and Disk5 are probably almost 2 years old and were inside of my previous NAS. All SMART info on the other drives looks clean to me.

I have no disk4 by the way.

Thank you to everyone who hangs out on this board. You guys are awesome.

May 22, 2008

They all look good. Could have been that cable or port.

May 22, 2008

The last 2 times I ran a parity check, everything was good. No sync errors and no errors on the drive. I am using a drive enclosure. Same place in the enclosure, same cable, same SATA port. It's gotta be the drive. I guess I had a lot of bad sectors that it was just able to remap. It's a terabyte drive, so it stands to reason that a terabyte drive would have more "spare" sectors to use for remapping than a smaller drive would.

My plan at this point is to run a parity check weekly. If it's good for a month or 2, I will go to a monthly check.

According to the smartctl posted earlier, you did NOT have a single sector remapped on your parity drive. Your errors were apparently unrelated to the media. I believe that the errors were caused by bad commands sent from the computer to the drive. This was most likely caused by some form of data corruption (e.g., bad cable) that garbled the command en route. Maybe you jiggled something and it made a better connection. At the very least, you should start running parity checks frequently before you can have confidence that the problem won't suddenly recur.
