1st time for parity errors

April 1, 2014

Viewing the results after work today from last night's monthly check, I find there were errors. This is the first time I have experienced this, so I look to the experts for guidance and/or reassurance.

the main page displays this message:

Last checked on Tue Apr 1 13:07:15 2014 CDT (today), finding 2 errors.

> Duration: 13 hours, 7 minutes, 13 seconds. Average speed: 84.7 MB/sec

the unmenu main status page displays this message:

STARTED, 14 disks in array. Parity is Valid:. Last parity check < 1 day ago . Parity updated 2 times to address sync errors.

There are no disk errors in the error column on the main screen.

I ran a SMART report on the parity disk and it reports nothing wrong.

Copied from syslog, from the start of the parity check to the end of the log:

Apr 1 00:00:01 Tower kernel: mdcmd (328): check NOCORRECT (unRAID engine)

Apr 1 00:00:01 Tower kernel: (Routine)

Apr 1 00:00:01 Tower kernel: md: recovery thread woken up ... (unRAID engine)

Apr 1 00:00:01 Tower kernel: md: recovery thread checking parity... (unRAID engine)

Apr 1 00:00:01 Tower kernel: md: using 2560k window, over a total of 3907018532 blocks. (unRAID engine)

Apr 1 00:00:12 Tower kernel: md: parity incorrect, sector=5464 (Errors)

Apr 1 00:00:12 Tower kernel: md: parity incorrect, sector=5656 (Errors)

Apr 1 04:40:01 Tower logger: mover started

Apr 1 04:40:01 Tower logger: skipping applications/

Apr 1 04:40:01 Tower logger: mover finished

Apr 1 06:58:49 Tower kernel: mdcmd (329): spindown 2 (Routine)

Apr 1 06:58:50 Tower kernel: mdcmd (330): spindown 3 (Routine)

Apr 1 06:58:51 Tower kernel: mdcmd (331): spindown 4 (Routine)

Apr 1 06:58:51 Tower kernel: mdcmd (332): spindown 5 (Routine)

Apr 1 06:58:52 Tower kernel: mdcmd (333): spindown 7 (Routine)

Apr 1 06:58:53 Tower kernel: mdcmd (334): spindown 11 (Routine)

Apr 1 09:08:06 Tower kernel: mdcmd (335): spindown 13 (Routine)

Apr 1 11:19:48 Tower kernel: mdcmd (336): spindown 1 (Routine)

Apr 1 11:19:49 Tower kernel: mdcmd (337): spindown 6 (Routine)

Apr 1 11:19:49 Tower kernel: mdcmd (338): spindown 8 (Routine)

Apr 1 11:19:49 Tower kernel: mdcmd (339): spindown 10 (Routine)

Apr 1 11:19:50 Tower kernel: mdcmd (340): spindown 12 (Routine)

Apr 1 13:07:15 Tower kernel: md: sync done. time=47233sec (unRAID engine)

Apr 1 13:07:15 Tower kernel: md: recovery thread sync completion status: 0 (unRAID engine)

Apr 1 13:37:22 Tower kernel: mdcmd (341): spindown 0 (Routine)

Apr 1 16:07:24 Tower kernel: mdcmd (342): spindown 9 (Routine)
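As a sanity check, the figures on the main page can be reproduced from the syslog excerpt. The sketch below is only an illustration: it assumes the "blocks" reported by md are 1 KiB each and that "MB" means 10^6 bytes, and the function names are made up for this example.

```python
import re

# Values copied from the syslog excerpt above.
TOTAL_BLOCKS = 3907018532   # "over a total of 3907018532 blocks"
ELAPSED_SEC = 47233         # "sync done. time=47233sec"
BLOCK_BYTES = 1024          # assumption: md reports 1 KiB blocks

LOG = """\
Apr 1 00:00:12 Tower kernel: md: parity incorrect, sector=5464
Apr 1 00:00:12 Tower kernel: md: parity incorrect, sector=5656
"""

def format_duration(seconds):
    """Turn a number of seconds into hours, minutes and seconds."""
    hours, rest = divmod(seconds, 3600)
    minutes, secs = divmod(rest, 60)
    return f"{hours} hours, {minutes} minutes, {secs} seconds"

def average_speed_mb_s(blocks, seconds, block_bytes=BLOCK_BYTES):
    """Average throughput in MB/sec (1 MB = 10^6 bytes, an assumption)."""
    return blocks * block_bytes / seconds / 1_000_000

def error_sectors(log_text):
    """Collect the sector numbers from 'parity incorrect' lines."""
    return [int(s) for s in re.findall(r"parity incorrect, sector=(\d+)", log_text)]

print(format_duration(ELAPSED_SEC))                             # 13 hours, 7 minutes, 13 seconds
print(round(average_speed_mb_s(TOTAL_BLOCKS, ELAPSED_SEC), 1))  # 84.7
print(error_sectors(LOG))                                       # [5464, 5656]
```

Under those assumptions the log agrees with the summary: the same duration, the same average speed, and two sync errors.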

Should I run a "correct parity" check and then a "nocorrect" to verify?

Do I just let it be?

Is this something to worry about?

Thank You


April 2, 2014

There are a number of things that can cause parity errors.

By far the most common is a so-called dirty shutdown. This occurs when a power outage or a server hang forces you to power down the server without using the stop array feature. To make a long story short, when this happens, the data disks usually handle it better than the parity disk, and the parity disk can show a few sync errors. (It is possible that the last file you copied to the array got corrupted, but this is unlikely unless the copy was happening at the time the server was powered down. You might want to test the last few files you copied to the server and make sure they are good.)

After a dirty shutdown, a correcting parity check will update parity to match the data disks. Since the data disks handle the shutdown better than parity, this is your best bet. You can run integrity checks on your data disks to make sure nothing got corrupted in the file system, but as I said, data drives handle this pretty well and corruption is unlikely. If this happened to me, knowing I had a dirty shutdown, I would not be concerned.

Another telltale sign of a dirty shutdown parity error is when the parity block is very near the front of the disk (a low number), and this is what I see from your post - a parity error very early on the disk. The first part of the drive I refer to as the housekeeping section. It does not store file data, and includes a "journal" of writes that need to be performed on other parts of the disk. It is this journal area that holds recent data added to the array. If anything is corrupted it would likely be one of the last files written to the array.
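To put "very near the front of the disk" into numbers, here is a rough sketch using the two sectors from the posted log. It assumes md counts 512-byte sectors and that the "blocks" total in the log is in 1 KiB units; both are assumptions, not something the log states.

```python
SECTOR_BYTES = 512          # assumption: md reports 512-byte sectors
BLOCK_BYTES = 1024          # assumption: "blocks" in the log are 1 KiB each
TOTAL_BLOCKS = 3907018532   # from the log: "over a total of 3907018532 blocks"

def sector_position(sector):
    """Return (byte offset, fraction of the way through the disk)."""
    offset = sector * SECTOR_BYTES
    fraction = offset / (TOTAL_BLOCKS * BLOCK_BYTES)
    return offset, fraction

for sector in (5464, 5656):
    offset, fraction = sector_position(sector)
    print(f"sector {sector}: {offset / 2**20:.2f} MiB in, {fraction:.2e} of the disk")
```

With those assumptions both errors sit less than 3 MiB from the start of a roughly 4 TB disk, which is well under a millionth of the way in.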

But if there was no dirty shutdown, the possibilities of what might have gone wrong get very broad, and the probabilities of any one of them get very small. A drive could be malfunctioning. A drive could have bad memory. Your computer may have bad memory. You could have so-called "bit-rot", meaning that data has degraded, perhaps on a weak spot on a disk. You could have a loose cable that corrupted data transfer. There could have been a power surge. All sorts of possibilities. None of them very likely. You can test the memory in your computer, a wise course of action to rule that out. But most of the other issues are near impossible to determine.

But what you really want to know is - was any of my data corrupted? And that question cannot be answered. Of course, with a single parity error you can believe that it is unlikely, and that even if a tiny blip in a single file is corrupted you might never know or notice. But you won't know for sure. This is referred to as silent corruption.

You might be interested in reading the linked post below about ways you can protect yourself from silent corruption, or at least identify if any corruption has occurred.

I think it is a bit funny that people want to be protected from a double disk failure, a virtual statistical impossibility, but are willing to accept the risk of something like you have encountered, which although rare, does happen with some regularity here.

http://lime-technology.com/forum/index.php?topic=31020.msg299402#msg299402

Good luck!


April 2, 2014

should i run a "correct parity" check ...

Yes -- you certainly don't want to leave your array with parity incorrect !!

... and then a "nocorrect" to verify?

Your choice. I'd just run another correcting check to verify. [but I never run non-correcting checks -- if my parity's got errors, I want them fixed]

do i just let it be?

Absolutely NOT !!

Is this something to worry about?

No -- unless you start getting errors every time you check the array.

I think it is a bit funny that people want to be protected from a double disk failure, a virtual statistical impossibility, but are willing to accept the risk of something like you have encountered, which although rare, does happen with some regularity here.

I don't think anyone's "willing to accept the risk of something like you have encountered". I DO think a lot of folks aren't prepared to confirm the integrity of their data -- either via checksums or with backups they can compare to.

By the way, while it's certainly a fairly low risk, dual-drive failures are far from "... a ... statistical impossibility ..." with the size of many folks' UnRAID systems. Modern consumer-oriented disks have non-recoverable read error rates on the order of 1 in 10^14 bits read, which works out to one uncorrectable bit for every 12.5TB read. So it's actually quite likely there will be a bit error during reconstruction of a disk in a moderately large array. A dual parity system will, of course, mitigate this, as it will correct any of these errors.
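To put a rough number on that, the sketch below estimates the chance of at least one unrecoverable read error for a given amount of data read. It is a simplified model, not a measurement: it assumes bit errors are independent and occur at exactly the quoted spec rate, and the 52TB case (thirteen 4TB disks read during a rebuild) is a hypothetical array, not the one in this thread.

```python
import math

URE_RATE = 1e-14  # quoted spec figure: 1 unrecoverable error per 10^14 bits read

def p_read_error(terabytes_read, rate=URE_RATE):
    """Probability of at least one unrecoverable read error.

    Computes 1 - (1 - rate)**bits in a numerically stable way,
    assuming independent bit errors (a simplification).
    """
    bits = terabytes_read * 1e12 * 8
    return -math.expm1(bits * math.log1p(-rate))

# 52 TB is a hypothetical rebuild that reads thirteen 4 TB disks.
for tb in (4, 12.5, 52):
    print(f"{tb} TB read: {p_read_error(tb):.1%} chance of at least one error")
```

Under this model, reading 12.5TB gives about a 63% chance of hitting at least one error rather than a certainty, and the chance keeps climbing as more disks have to be read for a rebuild.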

As you know from my comments in my Backups thread, I think it's surprising folks go to the trouble of building extensive media collections; the expense of storing it on a fault-tolerant server; and then don't think it's worth the bother or expense to back it all up. As an absolute minimum, you should maintain checksums of your data, so if you get a parity error, you can confirm whether or not any of your data was corrupted. Knowing a file is bad isn't, of course, much use unless you have a backup to restore it from; but at least you'll know what's good & bad.


April 2, 2014


First I want to separate THIS user's issue from another possible scenario where a user has unexpected parity errors in the data area of the drive not associated with a dirty shutdown. I agree that THIS user should run a correcting parity check (if their initial parity check was a non-correcting check). With two parity errors in the housekeeping area, I think it very unlikely that there is any corruption. He has not confirmed there was a dirty shutdown but that remains my assumption. And even if this is not very satisfying to him, there is nothing else he can do unless he is collecting MD5 sums, creating PAR blocks, or has a backup. All relevant questions, but I am assuming the answer to each is no.

So the question arises, when would you NOT want to run a correcting check?

If a user were to run a non-correcting check and found unexplained parity errors with no dirty shutdown, and if that user had md5 checksums, I'd recommend running an md5 comparison before doing anything else. If that user found that one disk had corrupted files, while all of the other disks were fine, I would not suggest running a correcting check. In such a case, I would remove the problematic disk, and start the array. In this configuration the removed disk would be simulated using parity. The user could run his MD5 checks on the simulated files that were corrupted on the physical disk, and if the errors are gone, then clearly parity was good while the physical disk was corrupted. The user could then rebuild the disk (I'd suggest using a freshly precleared new disk, leaving the old disk for heavy diagnosing or possible RMAing later on).
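The "which disk has the corrupted files" step could look something like the sketch below. It is only an illustration under stated assumptions: the checksum list is a plain mapping of file path to expected md5, the files live under a root such as /mnt with one directory per disk (/mnt/disk1, /mnt/disk2, ...), and the function names are invented for this example.

```python
import hashlib
from collections import Counter
from pathlib import Path

def md5_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large media files are not loaded into memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def failures_by_disk(manifest, root="/mnt"):
    """Count missing or mismatching files per disk.

    manifest maps file path -> expected md5 hex digest (hypothetical format).
    The disk is taken to be the first directory below root (an assumption).
    """
    bad = Counter()
    for path, expected in manifest.items():
        file_path = Path(path)
        if not file_path.is_file() or md5_of(file_path) != expected:
            disk = file_path.relative_to(root).parts[0]
            bad[disk] += 1
    return bad
```

If every failure lands on one disk, that points at the disk rather than at parity, which is the case described above where a correcting check would be the wrong move.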


I won't engage in another "is it worthwhile to backup a media server" debate. There are pros and cons, and I certainly can't disagree that if data integrity is your sole motivator, then backups are the way to go. Some might say they are willing to live with a slight risk of data loss rather than the effort and expense of a full backup solution, and I could not call that a bad decision either. It all depends on your tolerance for the risk. But I do agree 100% that maintaining MD5s, a virtually free capability, is something everyone should do!

For those of us that do not have a backup (which is most everyone), you owe it to yourself to protect yourself from the most likely causes of data loss and corruption. unRAID is great and protects you from a single failed disk. A single failed disk is a reasonably likely scenario and it does a great job protecting you. It is worthwhile to monitor the SMART statistics of your drives and know the telltale signs of early failure. Running monthly parity checks is also a good idea. (It was my suggestion many many moons ago to do this and it seems to have caught on). I believe most users do these things. And it prevents a bunch of issues before they happen.

But now we need to take on the next level of protection to protect against issues resulting from very common scenarios. I'll call them "server problems". Be it a bad cable, a loose case, a bad splitter, faulty driver, bad port on a controller, memory error, ... all situations that can cause problems and lead to corrupted data. How you go about solving those problems can wreak havoc. Knowledge is key to protecting yourself from yourself. I would rate understanding unRAID, parity, and how to recover as the most important way to protect yourself. Know that immediately after an event is not the right time to try a drastic procedure. Asking for help in the forums is great, but it does not replace your own head evaluating the advice and making the best decision for you. It may surprise people to know that the most common source of lost data is taking the wrong step to solve a problem that could have been easily solved with the right know-how.

So being educated is important to protect yourself from shooting yourself in the foot, but it can still happen, even to the best of us. And you don't have to shoot yourself in the foot to have a server problem that may have caused data corruption. So if you do find signs of corruption, either because you did something ill-advised trying to recover from a problem, or something unexplained happened due to no fault of your own and you have or fear data corruption, you should have md5 checksums in a protected place to be able to detect corruption. Having 20T of data and finding that you have 5 corrupt files that are lost forever for lack of a backup is, IMO, far better than having 20T of data and having no idea how many files (if any) are corrupt and depending on serendipity to find them.

Real-world issues resulting from server problems are so much more common and likely than the dual failed disk scenario. Can 2 drives fail at the same time? Yes. Taking 20 disks off of an assembly line and making an array from them, it is not out of the realm of possibility that two (or more) of them would fail very close together. Disk diversity is therefore a help. But I have been monitoring these forums since 2008 and can't remember seeing a case where a user really had two fail at the same time. Dual parity will likely come with some significant compromises - including reduced performance, simultaneous spinup of all drives, extra cost, etc. It is not a slam dunk I would run it, because what it protects me from is so unlikely. But that's another topic for another day. Dual parity does not currently exist as an unRAID option.

But I would love to see some questions being posted and a plugin author create some scripts to help users easily make and maintain an md5 (or similar) database that can be loaded and silently maintain these checksums. Load it, forget it, and have it maintain these on new files late in the night, similar to the mover script. Maybe have a monthly md5 check to verify integrity of all files on the array. This would be so easy and so useful I can't understand why people aren't asking for it ...
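A minimal sketch of what such a script might do, assuming a JSON file as the checksum database and a nightly run that only hashes files it has not seen before. This is not an existing unRAID plugin; the function names, the JSON format, and the paths in the usage note are all hypothetical.

```python
import hashlib
import json
from pathlib import Path

def md5_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large media files are not loaded into memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def update_manifest(root, manifest_path):
    """Nightly job: hash files under root that are not recorded yet.

    Returns the number of newly added files. Existing entries are never
    rehashed, so a file that changes later shows up in verify_manifest().
    """
    manifest_file = Path(manifest_path)
    manifest = json.loads(manifest_file.read_text()) if manifest_file.exists() else {}
    added = 0
    for path in sorted(Path(root).rglob("*")):
        # Skip directories and the manifest itself if it lives under root.
        if not path.is_file() or path.resolve() == manifest_file.resolve():
            continue
        key = str(path)
        if key not in manifest:
            manifest[key] = md5_of(path)
            added += 1
    manifest_file.write_text(json.dumps(manifest, indent=2))
    return added

def verify_manifest(manifest_path):
    """Monthly job: list recorded files that are missing or no longer match."""
    manifest = json.loads(Path(manifest_path).read_text())
    return [
        path for path, digest in manifest.items()
        if not Path(path).is_file() or md5_of(path) != digest
    ]
```

As a usage example, a nightly run could call something like update_manifest("/mnt/disk1", "/boot/md5/disk1.json") and a monthly run verify_manifest("/boot/md5/disk1.json"); both paths are placeholders. Keeping the database off the array follows the "protected place" advice above. One design limit to be aware of: a file that is edited on purpose will be reported as a mismatch until its entry is refreshed.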


April 2, 2014


+1

This would be a great addition to unRAID.


April 2, 2014


Use this script as a start - maybe?:

http://lime-technology.com/forum/index.php?topic=28168.msg255931#msg255931


April 2, 2014

Author

I DO think a lot of folks aren't prepared to confirm the integrity of their data -- either via checksums or with backups they can compare to.

there is nothing else he can do unless he is collecting MD5 sums, creating PAR blocks, or has a backup. All relevant questions but which I am assuming the answer is no.

correct - GUILTY

Even more guilty in the fact that I have read garycase's backups thread and thought to myself "yeah I should do something like that", even so far as putting together a list of parts for a backup server build, but ignoring the easiest first step in the process: checksums.

No dirty shutdown, maybe?? 19 days ago we had a power outage, but I assume the apcups plugin installed through unmenu did what it was supposed to do. When the power was restored 12 hours later the server rebooted just fine and no parity check was automatically started, which is what I thought unRAID would do by default after a dirty shutdown. Guess I should look at the log file to see if that was indeed the case.

Started a "correct parity" check before leaving for work today. It should be finished by the time I get home, and I will start another to check again if there are no problems reported.

But I would love to see some questions being posted and a plugin author create some scripts to help users easily make and maintain an md5 (or similar) database that can be loaded and silently maintain these checksums. Load it, forget it, and have it maintain these on new files late in the night, similar to the mover script. Maybe have a monthly md5 check to verify integrity of all files on the array. This would be so easy and so useful I can't understand why people aren't asking for it ...

I'm asking now!

Thanks for the help and guidance.

Lesson learned
