Partial / Incremental Parity Check



I'd like to bring up an idea I had previously: it would be nice to be able to schedule a chunk of the check every day rather than one big one, say, every month.

 

This way, if I wanted, I could have 1/7th of the array checked every day after mover runs, the benefits being:

 

A) It is never more than 7/14/21 days since your parity was last checked.

B) Reduced stress on the drives (some are spun up already, plus less time at full load).

C) Reduces the chance that you're accessing the array at the same time. My parity check on a 2TB array takes 6 hours; on a 6TB array you're going for a full day.

 

Bottom line is, doing parity checks all in one go can't go on!!

Link to post


 

I like this idea. Maybe select a disk at a time? i.e. on a 6+1 disk array you could schedule a different disk every night, leaving Sunday as the day of rest :) Or a couple of disks a week.

 

I don't know if that is technically feasible, but with the size of arrays these days it's getting hard to find a window with little activity to run a parity check.

Link to post

Bottom line is, doing parity checks all in one go can't go on!!

I think the impact of parity checks depends on the hardware you have. I find on my system that array access is fine even with a parity check running. Access IS slower, but not noticeably so, and I can stream video with no stuttering while a parity check is running.

 

I think the critical thing is whether the disk controller(s) can sustain maximum throughput from all connected drives simultaneously. As long as the SATA connections are on the motherboard or on a disk controller in a high-speed PCI Express slot, the impact is minimised. If you are using a slower slot for the disk controller, performance can be badly constrained. On my system the non-motherboard disk controllers are plugged into PCI Express x16 slots.

Link to post

I like this idea. Maybe select a disk at a time? i.e. on a 6+1 disk array you could schedule a different disk every night, leaving Sunday as the day of rest :) Or a couple of disks a week.

I don't know if that is technically feasible, but with the size of arrays these days it's getting hard to find a window with little activity to run a parity check.

Not technically feasible.  By its very nature a parity check involves reading from all drives simultaneously.

Link to post


Not technically feasible.  By its very nature a parity check involves reading from all drives simultaneously.

 

This is a good thing, folks. You want to exercise every drive and every sector on a regular basis, or you could have problems that are not apparent.

 

B) Reduced stress on the drives (some being spun up already + less time at full load)

 

These drives are meant to be used. Doing a parity check is not stressful for the drives.

It's actually pretty gentle. Each drive is read round-robin, a sector at a time, sequentially.

It's a gentle continuous read.

 

The stress is probably more on the power supply, the bus, and any potential cable-interference or vibration issues that may crop up.

 

What's probably more useful is an auto-tuned parity check, i.e. a parity check/sync "auto-throttling" feature:

http://lime-technology.com/forum/index.php?topic=34656.msg322061#msg322061

 

When I used RAID-1 for my arrays years ago, this was a big help during a crash and recovery.

I could set a minimum and maximum throughput which would allow me to work and check at the same time.

Once I stopped working the recovery effort would speed back up to maximum.

 

Since the check is a mostly-read operation, auto-tuning/auto-throttling would go a long way toward set-it-and-forget-it operation without interfering with day-to-day use.

Link to post


These drives are meant to be used. Doing a parity check is not stressful for the drives.

It's actually pretty gentle. Each drive is read in round robin a sector at a time in a sequential method.

It's a gentle continuous read.

 

 

 

I disagree. My drives run hotter when running a parity check, which in my book means they are under more stress. However, since we are never going to prove the statistics of drive health vs. usage patterns, I'm happy to let the other two points push the idea forward.

 

I'd much prefer to be able to check a portion of my drives at a time, especially straight after a mover run (which for me is 1AM), as some or all of the drives are already spinning.

Link to post

Well, there is ample evidence that within reasonable ranges temperature and reliability have zero correlation. On top of that, temp != stress ... temp = inadequate cooling. Of course heat generation goes up, but that doesn't equal stress either. If you feel your drives exceed the reasonable range during a parity check, I suggest addressing cooling and NOT limiting drive activity.

 

Checking a portion after a move won't do anything unless the checked section happens, by complete luck, to completely encompass the newly written files. As already mentioned, you can't pick and choose drives to check. At best you can pick a fraction of the array held by every drive (i.e. every drive's first third, then second third, then final third, or any fraction you desire).

 

Mind you, I do think the idea of picking a fraction (maybe 1/7th) or a run time (1 hour) and running that every day sequentially across the entire array might have some benefits, especially for those with drives large enough that a full parity check takes 12+ hours. My 6TB array with 2TB drives takes ~6 hours. I won't be running that every single day or week, but I'd like each bit checked more often than every month. If I could do 1/7th every night (an unobtrusive ~50 min each run), I'd know my full parity was scanned every week and no single bit had gone unchecked longer than a week.

Link to post

Partial parity checks have merit for some; however, there are other health checks that need to be considered too.

 

Even though I ran a parity check, I still had a dual drive failure within an hour of the successful check.

The parity check was successful, but what I did not know was that there was a pending sector.

 

The only way to find those is a SMART long test, plus some kind of front-end indication in the emhttp interface.

The SMART long test still requires you to traverse the whole drive.
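For illustration, with smartmontools installed (the drive name /dev/sdb below is just an example):

```shell
# Run the drive's internal extended (long) self-test; it reads the
# entire surface, so it can flag pending/unreadable sectors that a
# parity check alone won't report as a drive problem.
smartctl -t long /dev/sdb

# Once the test finishes (hours later), review the self-test log and
# the pending-sector attribute:
smartctl -l selftest /dev/sdb
smartctl -A /dev/sdb | grep Current_Pending_Sector
```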

 

That is a good partial test. A single-drive surface scan plus a scheduled full parity check should be enough.

Doing partial parity checks is going to require more work in storing the start/end position.

Maybe a small mod to the md.c driver to indicate a start position.

 

Then some cron job could read a cached value, monitor it, and store the current position when a stop is requested.
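As a rough sketch of that cron job (heavily hypothetical -- the mdcmd output field and the stop command are assumptions and should be verified against your unRAID version):

```shell
# Save the running check's current position, then stop the check.
# Assumes unRAID's `mdcmd status` reports an mdResyncPos= field.
POS=$(mdcmd status | grep '^mdResyncPos=' | cut -d= -f2)
echo "$POS" > /boot/config/parity-check.pos

# Cancel the in-progress check (command name is an assumption here).
mdcmd nocheck
```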

 

Still, that's quite a bit of work to address a perceived drive-health risk, and it still doesn't reveal the crucial issue of a pending sector.

Link to post

Partial parity checks have merit for some; however, there are other health checks that need to be considered too.

I am all for additional health checks too. I'd like to see a system whereby unRAID does regular checks in the background and gives notifications for anything pending or needing attention. It could be combined with interpreting SMART reports to draw sensible conclusions for users to act upon.

 

 

Link to post

Well there is ample evidence that within reasonable ranges temp and reliability have zero correlation.  On top of that temp != stress ... temp = inadequate cooling.  Of course heat generation goes up, but that also doesn't equal stress.  If you feel your drives exceed the reasonable range during a parity check I suggest addressing cooling and NOT limiting drive activity.

 

Checking a portion after a move won't do anything unless the checked section happens, by complete luck, to completely encompass the newly written files. As already mentioned, you can't pick and choose drives to check. At best you can pick a fraction of the array held by every drive (i.e. every drive's first third, then second third, then final third, or any fraction).

 

Firstly, I was merely pointing out that my drives run hotter during a parity check than during any other read/write operation, not that they exceed an acceptable temperature range.

 

Secondly, I'd like to clarify that each night 1/7th of the array is checked. So the first night does 0-14.3%, the second night 14.3% to 28.6%, and so on.

 

i.e. over the course of a user-selectable timescale, a complete parity check is done.
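The per-night arithmetic is simple enough to sketch. The helper below is purely illustrative (nothing like it exists in unRAID); it splits a drive's sector range into equal nightly chunks:

```python
def nightly_chunks(total_sectors, nights):
    """Yield (night, start, end) covering [0, total_sectors) in equal chunks."""
    chunk = -(-total_sectors // nights)  # ceiling division
    for night in range(nights):
        start = night * chunk
        end = min(start + chunk, total_sectors)
        yield night + 1, start, end

# Roughly the sector count of a 2TB drive, spread over 7 nights:
for night, start, end in nightly_chunks(3_907_029_168, 7):
    print(f"night {night}: sectors {start}-{end}")
```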

 

My point about the mover was that I'd prefer to run the check then, as there's a good chance some of the drives are already spinning, nothing more.

Link to post

While clearly a drive used continuously is going to get a bit warmer, that does NOT mean it's being "stressed". A parity check is, from the drive's perspective, a very low-stress operation: virtually no seeks (just one cylinder at a time of head movement), and each entire cylinder is read at once.

 

There's really no reason to limit this to a small % of the check ... when you want to do a parity check, just do it.    Note that if you did what's suggested here (i.e. run perhaps 1/7th of a check after each Mover run) it means that ALL of the drives would have to spin up after the mover ran each night.    In general, only a few drives would normally be spinning at that point -- the cache, parity, and those data drives that had data copied to them (normally this would probably only be one or two data drives).

 

The one thing that would be nice about the concept is that this "1/7th" of a check would typically only take an hour or so ... minimizing any disruptions to normal use of the server that a long parity check might otherwise cause.    But that's easily resolved by simply scheduling your parity checks to run overnight.

 

 

Link to post

A similar concept that has been discussed is an intelligent parity check that would pause when there is other disk activity on the protected array.

 

Parity check/sync "auto-throttling" feature

http://lime-technology.com/forum/index.php?topic=34656.0

 

The auto-throttling feature is something that is (or was) in the standard Linux kernel RAID modules.

I've had great success with it.
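For reference, the knobs the stock Linux md driver exposes for this look like the following (shown as a sketch; these sysctls apply to mainline md arrays, not unRAID's custom driver):

```shell
# Current resync/check throttling limits, in KB/s per device:
cat /proc/sys/dev/raid/speed_limit_min   # guaranteed floor under load
cat /proc/sys/dev/raid/speed_limit_max   # ceiling when the array is idle

# e.g. let a check back off to 5 MB/s while other I/O is active,
# and burst up to 200 MB/s when the array is otherwise idle:
sysctl -w dev.raid.speed_limit_min=5000
sysctl -w dev.raid.speed_limit_max=200000
```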

Link to post

Yes, that would be very nice ... and it's already on the list to do.

 

WeeboTech => Do you already have this set up?    If so, is it just an option in the Go script?

 

Auto-throttling, no. It's part of the standard Linux md RAID drivers, though. It's been there for years.

 

It's not in unRAID yet.

 

Link to post


 

+1

Link to post
  • 4 weeks later...

Some of us live in the tropics and have servers with 6TB drives taking 16 hours to do a parity check.

 

I don't like running parity checks in the heat of the day, I would prefer to start it at night, stop it in the morning before the temperature rises 20 degrees F and restart it from where it left off at night again.

Link to post

I'm not sure whether this is the correct place to raise this issue, but it is closely related ....

 

I experience so many power cuts that it is sometimes impossible to complete a check after an unclean shutdown before the next power cut occurs.  Currently, in this situation (assuming a clean shutdown during the parity check), the system comes up as though the file system is now clean.  The system gives no indication that the parity check was incomplete, merely reporting the time it started and how many errors were detected/corrected.

 

Ideally, what should happen is that the parity check be reported as incomplete, and it should restart from where it left off.  Of course, if the second shutdown was unclean then, as now, the parity check should restart from the beginning.

Link to post


That sounds like a bug.

 

Even if it does not currently start from where it left off, it should NOT mark the parity as "clean" until AFTER it completes a parity check.

 

It sounds as if it is marking it "clean" at the start of a parity check.

Joe L.

Link to post

Actually Joe, it's done that for a long time. If you start a parity check, then immediately cancel it, it will show the last parity check date as when you just started it -- and no errors in parity.

 

Technically that's somewhat correct -- you DID run a parity check (albeit an incomplete one); and it didn't find any errors  :)

 

... I agree the details could be more complete (i.e. it should show that the last parity check was not completed).

 

Link to post
