[Plugin] Parity Check Tuning


Hi, love the plugin, thank you.

I have an issue where, if I have an unclean shutdown and an unscheduled parity check starts, it does not pause unless I pause it manually or until the designated pause time occurs.

For example, if I have an unclean shutdown at 18:00, the server will boot back up and the parity check will run until 05:30, when I have it set to pause. Parity checks should only be happening from 01:15-05:30, so the check ends up running during prime time.

 

Any ideas how to solve it?

 


Link to comment

I am afraid that at the moment that is expected behaviour.    The normal expectation is that you want such checks to complete as soon as possible to get the array back to a consistent state.  If you do not want this then you are expected to do a manual pause so that the scheduled pause/resume of increments can kick in.   I cannot think of a robust way of the plugin doing anything different but am open to suggestions.

 

The plugin DOES detect that an unclean shutdown has occurred, but at the moment this is only used (in the next release of the plugin) as a condition that prevents any running array operation from being restarted from the position it had reached when the array was stopped.

Link to comment
9 minutes ago, itimpi said:

I am afraid that at the moment that is expected behaviour.    The normal expectation is that you want such checks to complete as soon as possible to get the array back to a consistent state.  If you do not want this then you are expected to do a manual pause so that the scheduled pause/resume of increments can kick in.   I cannot think of a robust way of the plugin doing anything different but am open to suggestions.

 

The plugin DOES detect that an unclean shutdown has occurred, but at the moment this is only used (in the next release of the plugin) as a condition that prevents any running array operation from being restarted from the position it had reached when the array was stopped.

This seems to me to be the way it should work with unclean shutdown, including the last paragraph where you make it start over.

Link to comment
44 minutes ago, trurl said:

This seems to me to be the way it should work with unclean shutdown, including the last paragraph where you make it start over.

Glad to get agreement on this :)

 

The basic assumption I am working to is that automated actions involving the whole array should only be attempted when it appears to be completely safe.  If in doubt, avoid the automated action and get the user to take any needed action.  After all, running automated array operations when in doubt and causing any sort of data loss as a consequence would be completely unacceptable in my view :(

Link to comment
2 hours ago, Candle said:

I would like the option to immediately pause an unscheduled parity check and follow the schedule,

You certainly can cancel the parity check if you wish, but be aware that if you have a drive failure while parity is out of sync, the rebuilt drive will have bit errors. Often times that results in data loss.

Link to comment
14 hours ago, jonathanm said:

You certainly can cancel the parity check if you wish, but be aware that if you have a drive failure while parity is out of sync, the rebuilt drive will have bit errors. Often times that results in data loss.

Yes, I am aware of the risks, but would like the option within the scheduler to be able to do it.

Link to comment
3 hours ago, Candle said:

Yes, I am aware of the risks, but would like the option within the scheduler to be able to do it.

You can also pause it manually and parity-tuning will handle the resume.

If unclean shutdowns are so frequent that the extra click is burdensome, you've got bigger problems :)

Link to comment

At the moment I am not prepared to implement an option that would auto-pause the parity check that happens after an unclean shutdown.

 

The implementation I am currently testing will auto-pause a restarted array operation that was paused at the time of a shutdown, but that will only happen after a clean shutdown.    As soon as an unclean shutdown is detected, the decision is to err on the side of safety.

 

If I become convinced that auto-pausing the automatic check after an unclean shutdown would be a desirable feature then it could be added, but it is not going to be in the next release I make.
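
As a rough illustration of the policy described in this post, the decision could be sketched in Python as below. The function name `plan_restart` and its arguments are hypothetical and are not taken from the plugin's code; the sketch only encodes the rule that nothing automated is attempted once an unclean shutdown is detected.

```python
def plan_restart(clean_shutdown: bool, operation_in_progress: bool, was_paused: bool) -> dict:
    """Decide what to do with an array operation after the array comes back up.

    Hypothetical helper: the names and the returned keys are assumptions
    made for this sketch, not the plugin's real interface.
    """
    if not clean_shutdown:
        # Unclean shutdown: err on the side of safety. The saved position is
        # not trusted and the automatic check is left to run unpaused.
        return {"restart_from_saved_position": False, "auto_pause": False}
    if not operation_in_progress:
        # Nothing was running when the array stopped, so nothing to restart.
        return {"restart_from_saved_position": False, "auto_pause": False}
    # Clean shutdown with an operation in progress: resume where it left off,
    # and pause again only if it was paused at the time of the shutdown.
    return {"restart_from_saved_position": True, "auto_pause": was_paused}
```

With these assumptions, `plan_restart(True, True, True)` asks for a restart that is immediately paused again, while any call with `clean_shutdown=False` leaves both actions disabled.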

Link to comment
22 hours ago, Candle said:

This tool is there for convenience. I am just asking for one feature that I think would be a good one to add. Thanks for the lecture though.

Sorry about that. I meant it sincerely though, an unclean shutdown (and subsequent parity check) should be a big deal. Auto-pause mitigates one small consequence of that - the larger consequence is potential data loss.

Link to comment
  • 2 weeks later...

I have released the first version of the plugin that supports restarting array operations after stopping the array or after a shutdown/reboot sequence as long as it was a clean shutdown of the array.   You need to be on Unraid 6.9.0-rc2 for this option to be available to you in the plugin settings.  On earlier releases you will see the entry for activating the restart facility in the plugin settings but will not be able to change it to "Yes".  I had considered hiding this entry on such systems but decided to leave it visible to make it clear that you need a newer version of Unraid to use it.

 

Hopefully I have not introduced any regressions that have broken previously existing functionality.  If you notice any issues then please report them so that I can attempt to fix them.   One oddity I have noticed is that you can get some spurious notifications from the built-in Unraid parity check reporting and I have not (yet anyway) found a way to suppress these.

 

The plugin will also output a notification on all Unraid releases if it detects that there is an automatic parity check triggered by Unraid as a result of an unclean shutdown.  Hopefully as well as being generally useful it will make it clear why any array operation was not restarted.

Link to comment

The plugin does not seem to honor the tuning schedule. I have my parity check set to pause at 4:30 AM, but when I woke up today the parity check was still running.

I can see the following in the log at the time it was supposed to pause:

 

2021-01-02 04:30:01 5 unraid01 CROND exit status 126 from user root /usr/local/emhttp/plugins/parity.check.tuning/parity.check.tuning.php "pause" &> /dev/null

Edit: 4:30AM not PM.

Edited by makkish
Link to comment
12 hours ago, LateNight said:

Repeated messages in log, every seven minutes:

crond[2212]: exit status 126 from user root /usr/local/emhttp/plugins/parity.check.tuning/parity.check.tuning.php "monitor" &>/dev/null

Strange - that implies execute permission is not present on files used internally by the plugin.  I have just checked and if I remove and then re-install the plugin I can see the permissions are not as expected.  Not sure why that is suddenly an issue but I should be able to easily fix it by explicitly setting them as part of the plugin install processing.   
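
For readers hitting the same log message: exit status 126 is the shell's conventional code for "command found but not executable", which fits the missing execute permission described above. A minimal Python sketch of checking and restoring the execute bit follows. The helper name `ensure_executable` is made up for illustration, and this is not how the plugin installer itself sets permissions.

```python
import os
import stat


def ensure_executable(path: str) -> bool:
    """Add execute permission to `path` if the owner cannot execute it.

    Returns True if the mode had to be changed, False if it was already fine.
    Hypothetical helper for illustration only.
    """
    mode = os.stat(path).st_mode
    if mode & stat.S_IXUSR:
        return False  # owner can already execute the file
    # Add execute permission for owner, group and others, keeping other bits.
    os.chmod(path, mode | stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH)
    return True
```

It could be pointed at the script named in the log line, `/usr/local/emhttp/plugins/parity.check.tuning/parity.check.tuning.php`, although updating to a plugin version that sets the permissions during install is the proper fix.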

 

 

 

Link to comment

Now that the plugin is released with the capability to restart array operations, I am wondering if there is a sensible use case for allowing a partial parity check operation starting from an explicitly given point?   Perhaps something like running the check over 20%-30% of the normal check range.  

 

At the moment I do not intend to implement such a feature, but it is technically possible so I thought I would at least float the idea to see what others thought and what reservations there would be about such a feature being misused.  At the moment this is just a thought experiment (and not any sort of commitment) to try and get feedback.

Link to comment
8 minutes ago, JorgeB said:

Yes, I would like this, and the starting point ideally could be given as a percentage or sector.

What I am trying to assess is the pros and cons of providing such a feature.   In particular how it might be misused in a way that could lead to data loss.  If I DO implement it I would give positions as a percentage rather than a sector number.

Link to comment
7 hours ago, itimpi said:

If I DO implement it I would give positions as a percentage rather than a sector number.

Why no sector number? My use case would be to repeat a failed sector and see if it still fails, if so, flush disk buffers and try again. Perhaps we could figure out a way to better differentiate between a flaky data path (controller, RAM, drive) and a genuine bit error that should be corrected.

 

If a correcting parity check could be initiated ONLY IF and ONLY ON sectors that repeatedly pass the same bit error a specified number of times, it would make more sense to me than blindly writing possibly random parity. Yes, that would make correcting checks take much longer, but only on the incorrect sectors. You could also put in logic that could error out the correcting check if 100% of a configurable number of sectors were consistently wrong, and prompt to do a parity build instead.

7 hours ago, itimpi said:

In particular how it might be misused in a way that could lead to data loss.

As long as parity is not shown as fully checked when there are possible out of band modifications, I think you are fine. IOW, I wouldn't want to think parity was checked fully intact if the server was powered down between partial passes. If the array is stopped, the parity check percentage should reset.

 

If you have it set to incrementally check 25% each day, but stop the array and restart it, I don't think that should count as having checked 100% of parity until you have 4 consecutive checks that weren't interrupted by an array stop.

 

This is more of a reporting and array confidence thing than a data loss scenario though. I can see someone thinking it would be ok to only check 10% of parity each month and expect to have a flawless rebuild, when in reality that could mean they haven't fully checked parity in almost a year.
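
The reporting rule suggested here can be sketched in a few lines of Python. Everything below is hypothetical (the class name, the methods, and the idea of tracking coverage as a running percentage); it only shows how an array stop would reset the count, and how long a full pass takes at a given increment size.

```python
import math


class CoverageTracker:
    """Track how much of parity has been checked since the last array stop."""

    def __init__(self) -> None:
        self.covered_percent = 0.0

    def record_pass(self, percent: float) -> None:
        # Each partial pass adds to the running total, capped at 100%.
        self.covered_percent = min(100.0, self.covered_percent + percent)

    def array_stopped(self) -> None:
        # Out-of-band changes are possible once the array stops, so earlier
        # partial passes no longer count toward a full check.
        self.covered_percent = 0.0

    @property
    def fully_checked(self) -> bool:
        return self.covered_percent >= 100.0


def runs_for_full_check(percent_per_run: float) -> int:
    """Number of uninterrupted runs needed to cover 100% of parity."""
    return math.ceil(100 / percent_per_run)
```

At 25% per day this needs four uninterrupted runs, and at 10% per month `runs_for_full_check(10)` gives 10 runs, which is the "almost a year" case mentioned above.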

Link to comment
25 minutes ago, jonathanm said:

If you have it set to incrementally check 25% each day, but stop the array and restart it, I don't think that should count as having checked 100% of parity until you have 4 consecutive checks that weren't interrupted by an array stop.

This can sort of be done with the current increments capability, except that the bands are time limits rather than percentages.

 

26 minutes ago, jonathanm said:

Why no sector number? My use case would be to repeat a failed sector and see if it still fails, if so, flush disk buffers and try again. Perhaps we could figure out a way to better differentiate between a flaky data path (controller, RAM, drive) and a genuine bit error that should be corrected.

I feel something like this should be in a different plugin that is specifically geared to testing disks.    Then it makes sense to have specific sectors specified.   Adding it to the current plugin seems a bit off-topic for its general purpose and potentially confusing to many users.   In addition, although I can easily start a check at a specified offset, it is not easy to stop at a defined point with any sort of accuracy.

 

I am thinking of adding a column to the parity history that shows what percentage of the disk was checked on each record so it becomes clearer when a parity check is quickly abandoned or if it gets aborted for any reason (including unclean shutdowns).

33 minutes ago, jonathanm said:

If a correcting parity check could be initiated ONLY IF and ONLY ON sectors that repeatedly pass the same bit error a specified number of times, it would make more sense to me than blindly writing possibly random parity. Yes, that would make correcting checks take much longer, but only on the incorrect sectors. You could also put in logic that could error out the correcting check if 100% of a configurable number of sectors were consistently wrong, and prompt to do a parity build instead.

I do not think that the information required to implement something like this is readily available to the plugin.    It feels like it would require support right at the md driver level that is not currently there.  Still something to think about.

Link to comment
20 minutes ago, itimpi said:

I do not think that the information required to implement something like this is readily available to the plugin.

Parsing the syslog is what I had in mind. Start a non-correcting check and watch the syslog for parity errors. Restart the check at the first error sector; if the error is identical, force the drives to flush any on-drive cache and restart again at the error. If the same sector errors again, run a correcting check for just that segment, then start another non-correcting check at the same spot.

 

If the errors don't repeat on rechecks, don't correct.
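
The "only correct what repeats" rule can be sketched in Python without assuming anything about the syslog format. In the sketch below each pass is represented simply as the set of sector numbers that reported a parity error. The function name, the `required_repeats` parameter, and the idea that the sectors have already been extracted from the log are all assumptions made for illustration.

```python
from typing import List, Set


def sectors_to_correct(passes: List[Set[int]], required_repeats: int = 3) -> Set[int]:
    """Return sectors that errored on every one of the most recent passes.

    `passes` holds, oldest first, the error sectors seen on each
    non-correcting recheck. Sectors whose errors do not repeat are left alone.
    """
    if len(passes) < required_repeats:
        return set()  # not enough rechecks yet to trust any error
    recent = passes[-required_repeats:]
    # Only sectors present in all of the recent passes are candidates.
    return set.intersection(*recent)
```

For example, with passes `[{100, 200}, {100}, {100, 300}]` only sector 100 qualifies for a correcting check, since the errors at 200 and 300 did not repeat.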

Link to comment

Something strange...  I updated the plugin three days ago to the current version (2021.01.03) (main server running 6.9.0-rc2).  I did not touch the settings at that time which were to use increments, starting at 01:00, pausing at 08:00, repeat until done.  Last night (first Wednesday in the month) the check started.  The check was still running at 10:00 (should have paused at 08:00) and finished at 12:42.  There was nothing in the syslog at the time that it should have paused, although I did not have debug logging enabled for the plugin. 

 

I did notice that the notifications for start and end of the check were duplicated.  Previously I only ever had one of each.  I also received two emails telling me that the check had finished - one titled "unRAID Status: Notice [TOWER-V6] - Parity check finished (0 errors)", the second titled "unRAID Status: Non-Correcting Parity Check finished (0 errors)".  Previously I only had the usual start and finish emails. 

 

Would it be better to reinstall the plugin and try again, or would it be helpful if I could capture some debug logs?  Or both?

Edited by S80_UK
Link to comment

Since I need to work on the server later in the week I decided to try uninstalling and re-installing the plugin - it had previously been updated on multiple occasions. Uninstall was uneventful.  I decided to reboot the server before a clean install of the plugin.  However, for reasons unknown the server rebooted and went straight into a parity check.  I used the normal Reboot button in the web UI so I don't believe it should have done that.  Anyway, I will now let that run through until tomorrow and then I have a couple of hard disks to swap around for a capacity upgrade, so I shall get back to this in a day or so.

Link to comment
On 1/6/2021 at 1:18 PM, S80_UK said:

I did notice that the notifications for start and end of the check were duplicated.  Previously I only ever had one of each.  I also received two emails telling me that the check had finished - one titled "unRAID Status: Notice [TOWER-V6] - Parity check finished (0 errors)", the second titled "unRAID Status: Non-Correcting Parity Check finished (0 errors)".  Previously I only had the usual start and finish emails. 

This is quite normal in the current version.    If you look carefully you will see one has come from the built-in UnRAID support while the other from the plugin.    The UnRAID one only covers the last increment and so is typically wrong for the duration and speed of the whole check.  The plugin version correctly takes into account it was run in increments and thus gives the correct duration/speed as well as providing additional information.   I have not (yet anyway) found an easy way to suppress the built-in one.
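
To show why the two reports differ, here is a small Python sketch of the arithmetic, assuming each increment is recorded as a pair of elapsed seconds and bytes checked. That record layout and the function name `summarize_increments` are hypothetical, not the plugin's actual data format.

```python
from typing import List, Tuple


def summarize_increments(increments: List[Tuple[int, int]]) -> Tuple[int, float]:
    """Return (total_seconds, average_bytes_per_second) over all increments.

    Each increment is (elapsed_seconds, bytes_checked). Time spent paused
    between increments is not included.
    """
    total_seconds = sum(seconds for seconds, _ in increments)
    total_bytes = sum(checked for _, checked in increments)
    return total_seconds, total_bytes / total_seconds
```

A report built only from the last increment would give that increment's elapsed time as the duration of the whole check, whereas summing over all increments gives the time actually spent checking.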

 

On 1/6/2021 at 1:18 PM, S80_UK said:

The check was still running at 10:00 (should have paused at 08:00) and finished at 12:42.  There was nothing in the syslog at the time that it should have paused, although I did not have debug logging enabled for the plugin.

If you last changed the settings using the 2021-01-01/02 versions then there was a permissions issue that would mean the pause/resume would not have been correctly scheduled to run.   Redoing the settings using the 2021-01-03 version will correct this.   If it does not then I would welcome a syslog covering the problem period with at least Debug logging set (Testing level is even better but more voluminous).

Link to comment
