[Plugin] Parity Check Tuning



On 2/13/2021 at 11:25 AM, weirdcrap said:

I did have "Send notification for temperature related pause" set to NO, but I assume it would have still logged the pause and reason in the logs?

The release I have just pushed will create syslog entries for pause, resume and completion regardless of whether notifications are enabled.   Please let me know if there are some syslog entries that are not there that you think it would be a good idea to have present.


Hi!  I am happy to give a little feedback after a couple of passes through the process with this new version.

I set up a scheduled check earlier in the week which paused and resumed perfectly. That was a couple of days ago.

 

Yesterday, I had an unplanned, user-created hardware issue - my HBA went offline, taking six drives out of the array at once. This was a mechanical problem caused by the card not sitting properly in the PCIe slot. I shut down the array, powered down and attended to the mechanical issues. Everything came up OK, and since I had shut down first, the array seemed good - Unraid did not try to run an automatic parity check. But due to the nature of the problem I felt that a parity check would be a good idea. So I started a check just after midnight. I did not have pause/resume enabled for manual checks.

 

Two things then happened.  At 01:28 the test paused due to an over-heating drive (it was at 42 degrees - a bit hotter than normal). At 02:00 it had cooled and the check resumed.  By chance, 02:00 is also when I have the test start and resume if it's a scheduled check.  The test then ran OK until 08:00 when it paused.  However, the settings were such that it should not have paused for a manually run test.     I manually resumed at around 10:35 this morning - I am just waiting for that to finish.

 

So, all went well, but the manual parity check paused at 08:00 when it should not have (set to no pause for manual checks). I am wondering if this may have been a consequence of the pause and resume caused by an over-temperature event several hours earlier. Perhaps a "pause enable flag" or something got set by this?
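
The behaviour expected here can be written down as a small decision rule. What follows is only a minimal Python sketch under stated assumptions: the names `CheckType`, `Settings`, `should_pause_for_temperature` and `should_pause_at_scheduled_time` are hypothetical and are not taken from the plugin's source, and the 42 degree threshold is just an illustrative value. The point is that a temperature-related pause and resume should leave the check type untouched, so a manual check with increments disabled is never paused by the clock.

```python
from dataclasses import dataclass
from enum import Enum


class CheckType(Enum):
    SCHEDULED = "scheduled"
    MANUAL = "manual"


@dataclass
class Settings:
    # Hypothetical option names, chosen for illustration only.
    increments_for_manual: bool = False  # pause/resume manual checks on the schedule?
    temperature_pause: bool = True       # pause when a drive gets too hot?
    pause_temp_c: int = 42               # hypothetical threshold in degrees Celsius


def should_pause_for_temperature(settings: Settings, hottest_drive_c: int) -> bool:
    """Temperature pauses apply to every check type when enabled."""
    return settings.temperature_pause and hottest_drive_c >= settings.pause_temp_c


def should_pause_at_scheduled_time(check: CheckType, settings: Settings) -> bool:
    """Clock-based pauses apply to scheduled checks, and to manual ones only if opted in."""
    if check is CheckType.SCHEDULED:
        return True
    return settings.increments_for_manual


settings = Settings()
check = CheckType.MANUAL

# 01:28 - a drive reaches the threshold: the check pauses, but its type is unchanged.
assert should_pause_for_temperature(settings, hottest_drive_c=42)

# 08:00 - the scheduled pause time arrives: a manual check should keep running.
assert not should_pause_at_scheduled_time(check, settings)
```

In this model the check type is an input that only changes when a new check starts, which is the property that appears to have been violated in the run described above.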

 

From my point of view this is not a critical problem, but it looks as though something in the logic is still not quite right.

 

I did not have test or debug logging running for this - I was more concerned with checking the array after my HBA issue earlier the previous day.  So far, it appears that my hardware issue did not cause any disk corruption.   I will be happy to try again if it helps.

 

Cheers.   

Edited by S80_UK

@S80_UK Thanks for that detailed feedback.   It does sound as if the temperature handling may now be working as intended although I did not knowingly find and fix a bug in this area.

 

I will look into the unexpected pause. The plugin has some heuristics to try and decide if a check is scheduled or not, and it is always possible the sequence of events you describe caused something to not work quite as expected.

2 hours ago, itimpi said:

@S80_UK Thanks for that detailed feedback.   It does sound as if the temperature handling may now be working as intended although I did not knowingly find and fix a bug in this area.

 

I will look into the unexpected pause. The plugin has some heuristics to try and decide if a check is scheduled or not, and it is always possible the sequence of events you describe caused something to not work quite as expected.

Thanks for the reply.  The resumed check completed without errors.  I may try to repeat the process tonight, but with the temperature threshold set a little higher to avoid an unplanned pause.  I can then try again after that with a lower threshold to see what happens.   I shall let you know how it goes.     

1 hour ago, S80_UK said:

Thanks for the reply.  The resumed check completed without errors.  I may try to repeat the process tonight, but with the temperature threshold set a little higher to avoid an unplanned pause.  I can then try again after that with a lower threshold to see what happens.   I shall let you know how it goes.     


I suspect that the problem is the plugin getting confused about whether it is a scheduled check or not but confirmation would be nice.   If you do get anything unexpected can you grab the files from the plugins folder on the flash drive (except the .tgz one)  as seeing which files are present and what is in them may help with pinning down the cause.

 

4 hours ago, itimpi said:


I suspect that the problem is the plugin getting confused about whether it is a scheduled check or not but confirmation would be nice.   If you do get anything unexpected can you grab the files from the plugins folder on the flash drive (except the .tgz one)  as seeing which files are present and what is in them may help with pinning down the cause.

 

Understood - I will grab what I can.

15 hours ago, S80_UK said:

Understood - I will grab what I can.

Hi @itimpi. So, last night I ran a manual check with the threshold temperatures set higher to avoid any temperature-related pause. That worked up to the point where a pause would be expected for a scheduled check. At 08:00 the check was paused, even though this check was manual and no pause was expected. Currently it is paused at about 73%. I am attaching the files requested. Also attached is a diagnostics capture (just now). Please let me know if you need any more info.

parity.check.tuning.cron parity.check.tuning.progress parity.check.tuning.scheduled parity.check.tuning.cfg tower-v6-diagnostics-20210221-1146.zip


@S80_UK Thanks - that should be enough.   The fact that the parity.check.tuning.scheduled file is even present confirms that the plugin got confused :( There should have been a parity.check.tuning.unscheduled file present instead for a manually initiated check.  I may even change the .unscheduled file to be either .manual or .automatic so I can tell the difference between a manually initiated check and an automatic one started by the system after an unclean shutdown.

 

I am adding some consistency checks that will hopefully stop this happening in the first place or, if it still does, at least make it clearer where things are going wrong. I am also going to add some date/time information to the files created to indicate the check type, so that they point to the exact place in the syslog where things went wrong, which should also help.
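
As an illustration of what such a consistency check could look like, here is a minimal Python sketch. It is not the plugin's actual code: only the file names `parity.check.tuning.scheduled` and `parity.check.tuning.unscheduled` come from this thread, while the helper names `write_marker` and `detect_check_type`, the timestamp format, and the rule that the two markers are mutually exclusive are assumptions made for the example.

```python
from datetime import datetime
from pathlib import Path
from typing import Optional

PREFIX = "parity.check.tuning."
MARKERS = ("scheduled", "unscheduled")


def write_marker(plugin_dir: Path, check_type: str, now: Optional[datetime] = None) -> Path:
    """Record the check type, removing any stale marker of the other type."""
    if check_type not in MARKERS:
        raise ValueError(f"unknown check type: {check_type}")
    for other in MARKERS:
        if other != check_type:
            (plugin_dir / f"{PREFIX}{other}").unlink(missing_ok=True)
    marker = plugin_dir / f"{PREFIX}{check_type}"
    stamp = (now or datetime.now()).isoformat(timespec="seconds")
    marker.write_text(stamp + "\n")  # timestamp to correlate with the syslog
    return marker


def detect_check_type(plugin_dir: Path) -> Optional[str]:
    """Return the recorded check type, or None if no marker exists."""
    present = [m for m in MARKERS if (plugin_dir / f"{PREFIX}{m}").exists()]
    if len(present) > 1:
        # Both markers at once means the bookkeeping went wrong somewhere.
        raise RuntimeError(f"conflicting marker files: {present}")
    return present[0] if present else None
```

Writing the start time into the marker means a later inspection of the plugin folder shows not just which type was recorded but when, which narrows down the part of the syslog worth reading.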
 

On 2/21/2021 at 7:19 AM, itimpi said:

@S80_UK Thanks - that should be enough.   The fact that the parity.check.tuning.scheduled file is even present confirms that the plugin got confused :( There should have been a parity.check.tuning.unscheduled file present instead for a manually initiated check.  I may even change the .unscheduled file to be either .manual or .automatic so I can tell the difference between a manually initiated check and an automatic one started by the system after an unclean shutdown.

 

I am adding some consistency checks that will hopefully stop this happening in the first place or, if it still does, at least make it clearer where things are going wrong. I am also going to add some date/time information to the files created to indicate the check type, so that they point to the exact place in the syslog where things went wrong, which should also help.
 

 

I think I may have encountered the same/similar bug as @S80_UK.

I'm in the middle of upgrading a parity drive - and successfully finished a parity rebuild yesterday morning onto the new drive (the plugin did not interfere with this operation).

 

Following the rebuild I manually initiated a non correcting check late last night - with "Use increments for Unscheduled Parity Check" set to no.

 

I logged in just now and found it had paused at my scheduled pause time of 9:30 AM this morning. I resumed it and it is working along. While I do have the temperature threshold pause enabled, I can say with confidence that my drives did not reach the threshold (the server is in a cold New England basement), and the only pause I see in my syslog was at 9:30 AM.

I also have a parity.check.tuning.scheduled file instead of .unscheduled

 

I'll add that based off my (hopefully correct) memory, previous manual checks never got paused by the plugin - and I did update the plugin a few days ago - not sure what version I upgraded from - sorry

parity.check.tuning.progress.save parity.check.tuning.scheduled parity-checks.log parity.check.tuning.cfg parity.check.tuning.cron parity.check.tuning.progress cam-nas-diagnostics-20210226-1803.zip


@camjo99 I can confirm that there is a bug in the current version where the plugin gets confused as to whether a check was started manually or as a scheduled check, and ends up treating a manual check as a scheduled one. This typically (as you found) ends up with the check paused in the morning when it should not be. I can see that you encountered this by the fact that the parity.check.tuning.scheduled file was present.

 

I believe that I have now resolved this in the version I have under test. I have been treating it as a low priority fix since the ‘workaround’ is to simply resume the check manually, so the impact on end-users is minimal.


Just pushed a release that should now correctly track whether a parity check is scheduled or manual (or an automatic parity check after an unclean shutdown) and correctly obey the related pause/resume settings.  At this point I "think" all outstanding issues have been resolved.

 

If any unexpected behaviour is encountered then please let me know.

 

As always open to suggestions for improvement.

 

43 minutes ago, theruck said:

can it be finally disabled not to run automatically after not clean reboot? its the most annoying thing really when after a power outage the array performance sucks for half the day

 

No, the automatic parity check after an unclean reboot is something that Unraid does independently of the plugin. You have been able to manually pause/resume/cancel such checks for some time now.

 

The closest you can currently get using the plugin is to make sure that the option to pause/resume unscheduled parity checks is enabled, and then manually pause it on reboot and the plugin will then complete the check in increments (typically outside prime time) according to the schedule you have set for increments.   Of course the other option is to cancel the check but this is definitely not recommended as it introduces the chance of parity getting out of step with the array without you realising it.

 

I have thought of having a plugin setting that would automatically pause such a check without the user having to do the first pause manually after reboot but have avoided doing it as I would prefer the decision to do so to be an explicit decision by the user as unclean shutdowns should be an exception and not treated lightly.


well if your plugin schedules the parity check to run at specific time i really do not bother having a dirty parity until the schedule anyway. either the parity saved my data during the power outage or i am screwed anyway. the parity check running on boot does not help a single thing in this scenario. and your plugin's existence just confirms that people are annoyed by the parity check running at inappropriate times, which on boot is just always.

for home users this is crucial as a simple power outage creates low performance for home appliance services, vms, really anything. and if i am not at home there is just nobody who can click the stop parity check so the feature to cancel the current parity check and run only the scheduled one would be really helpful

7 hours ago, theruck said:

well if your plugin schedules the parity check to run at specific time i really do not bother having a dirty parity until the schedule anyway.

A couple of comments if I may...  This plugin does not do the scheduling that you describe - that scheduling is a standard function of Unraid, as is the automatically run parity check after any unclean shutdown.  This plugin allows a standard check to be split into parts with controlled suspend and resume timings as needed to help limit the effects of system slowdown. 

 

If the automatic check after an unclean shutdown is an issue for you, you should perhaps ask Limetech about it, although I suspect they would be reluctant to make it possible to disable the check from starting under those conditions (it kind of defeats the purpose).  As noted above, you also do have the option to cancel a check, and with the plugin, you have the option to pause it. 

 

As for not worrying about dirty parity, that would depend on the value of your data and the frequency of your scheduled checks, as well as the reliability of any backup scheme that you might have.  Don't forget that dirty parity will prevent Unraid from correctly recovering a disk if it fails before the parity error is detected and corrected.  Personally, I would prefer to have the automatic check run after an unclean shutdown and restart.  Then I can make an informed decision on whether to abort or let the check complete.

 

As @jonathanm has commented - a UPS may be a better approach if you have frequent unscheduled restarts (unless they are due to causes other than mains power).

Edited by S80_UK

come on guys,

sorry for writing it into this plugins topic but your arguments are from the past.

A UPS for small form factor home appliances is larger than the storage appliance itself and costs 25%-50% of the appliance price. And in case you did not know, it is 2021 and everybody is using journaled filesystems anyway. You are booting from USB so what is the big danger of having a dirty reboot here? The parity already causes enough of a performance hit, so i bet that the parity consistency is already being solved on a different level when writing.

 

imho if there is a dirty reboot for any reason, the automatic parity check after the reboot has no benefit. I am happy to learn otherwise.

 

If the reboot was due to failing hardware other than disk, running 12 hours of parity check does not help the situation. You either have the data readable on the drive or not and stressing the system more before investigation is just more risky. See the multiple topics on sudden reboots here on the forum.

 

If the reboot was due to failing hard drive the parity check is useless as you have already a failing drive and you either have the data from the parity drive or you have data loss.

 

If you have more drives and one or none of them was being written during the dirty reboot, there is no need to rebuild the non-used drives parity.

 

If the reboot was due to software error, you will likely experience another dirty reboot during the parity check while you try to find out and re-test why the reboot happened.

 

If you have services like vms and dockers running and need to get the services up and running asap, the parity check just lowers the performance of the whole storage system. Again, counterproductive.

 

If you unplug your boot USB by mistake there is nothing wrong with your drives or parity but you get another parity check on the reboot.

 

"unclean shutdown is if the system has to forcibly kill processes because they refused to shutdown normally" - another silly reason to rebuild parity

 

Is there anyone on this forum not running Unraid in 24/7 mode at home who does not cancel the check after the dirty reboot and lets it run for days? Whenever this happens to me i have a ton of other stuff to do rather than waiting for the parity check to complete. During a parity check you cannot back up Macs or iPhones effectively. Sometimes you can't even watch movies if their bitrate is too high for Unraid to handle both at the same time.

 

and last, with this type of parity consistency check, the moment your parity check stopped, you do not know if the parity is valid anyway. From its behavior there is no difference between a parity running for weeks and a parity after a dirty reboot. You just need to run the check to know that your parity is ok. So you can only hope that your parity is ok even without dirty reboots. The system is just not aware of the parity status all the time!

 

I was hoping that this plugin could have a checkbox which would just cancel the check on reboot automatically and let it run only during the scheduled window time. That would be the proper way of handling the parity check for home use storage imho.

 

Anyway i am just curious if after every dirty reboot the parity is really 100% invalid if limetech made it the default action. What will happen if you do a dirty reboot and fail (one of) the data drive?

 

 

 

 

Edited by theruck
3 hours ago, theruck said:

everybody is using journaled filesystems anyway.

Unraid parity has no filesystem and isn't journaled in any way.

4 hours ago, theruck said:

the moment your parity check stopped, you do not know if the parity is valid anyway

Exactly correct. But, as long as you cleanly shut down, and have no hardware failure that writes erroneous data, it's reasonable to trust that it is valid.

4 hours ago, theruck said:

What will happen if you do a dirty reboot and fail (one of) the data drive?

If all writes were complete and flushed out of RAM cache at the moment, absolutely nothing, parity would still be valid. If, however, there are uncommitted writes to any data drive, then the parity emulated failed drive will be corrupt at the address of those writes. Sometimes that address is in unused space so no effect, sometimes it's in the middle of a file, resulting in just one file being silently corrupted, sometimes it's in the filesystem resulting in an unmountable file system that may or may not be repairable.

 

Until a correcting check is run, Unraid has no way to know the parity disk is out of sync, so any errors will accumulate, but will not affect you until a data drive fails to read a sector and parity is used to recreate that sector.
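
To make the mechanism concrete, here is a small Python sketch. It assumes simple single-parity XOR across the data disks and uses tiny byte strings in place of real drives; the helpers `xor_parity` and `rebuild` are made up for this example, and the scenario (a data write that reached the disk while the matching parity update did not) is just one illustrative way for parity to go stale.

```python
from functools import reduce


def xor_parity(blocks):
    """XOR equally sized blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def rebuild(parity, surviving):
    """Recreate a missing disk from parity plus the surviving data disks."""
    return xor_parity([parity, *surviving])


disk1 = b"\x10\x20\x30\x40"
disk2 = b"\x01\x02\x03\x04"
parity = xor_parity([disk1, disk2])

# Parity in sync: losing disk1 is harmless, it can be recreated exactly.
assert rebuild(parity, [disk2]) == disk1

# Unclean shutdown: a write reached disk2, but parity was never updated.
disk2_after = b"\x01\xff\x03\x04"

# disk1 now fails before any correcting check has run.
emulated = rebuild(parity, [disk2_after])
bad_offsets = [i for i, (a, b) in enumerate(zip(emulated, disk1)) if a != b]

# The emulated disk1 is wrong only at the address of the interrupted write.
assert bad_offsets == [1]
```

The damage shows up on the emulated disk rather than the one that was being written, and only at the stale address, which is why it can go unnoticed until a drive actually fails.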

 

The longer you wait to run a correcting check after an unclean shutdown, the greater your risk of corrupted data. However, if the cause of the crash is unknown, you need to figure out WHY, before you correct parity. No point in writing bad data if the crash was caused by a failing stick of RAM.

 

Unraid is not designed to handle unclean shutdowns gracefully, so if your power is prone to failing, you must use a UPS.

20 hours ago, itimpi said:

Just pushed a release that should now correctly track whether a parity check is scheduled or manual (or an automatic parity check after an unclean shutdown) and correctly obey the related pause/resume settings.  At this point I "think" all outstanding issues have been resolved.

 

Meanwhile, back on topic...  :)

 

Thanks for this - I shall test accordingly and let you have my feedback in the coming days.  I will start by uninstalling and clearing out any old files so that the new plug-in has a fresh start.

10 hours ago, jonathanm said:

Exactly correct. But, as long as you cleanly shut down, and have no hardware failure that writes erroneous data, it's reasonable to trust that it is valid.

isn't it the same case with unclean shutdown? I still have my data on my ok drive with journaling filesystem so whenever it is suitable i want it to run the parity check. There is just no balance in it anymore. One unclean shutdown and half a day of unusable storage is just a too big price, especially if i know that the shutdown was not really so unclean and i know i was not writing to the drives any data i would need to be parity re-checked

