itimpi

[Plugin] Parity Check Tuning

125 posts in this topic Last Reply


Do this plugin's CLI commands still work? I'm on 6.8.3 and a parity.check pause does not seem to be working.

14 hours ago, wreave said:

Do this plugin's CLI commands still work? I'm on 6.8.3 and a parity.check pause does not seem to be working.

The CLI commands should work fine. What actually happens when you try them? You might get more information in the syslog if you enable debug logging for the plugin (although you probably do not want to leave it enabled when you are not investigating a problem with the plugin).

Posted (edited)

How is the temperature-based pausing and resuming of the parity check supposed to work? It seems like it ran through the entire time without pausing even though the debug messages clearly show drives were hot.

 

It appears that the scheduled resume happened with all drives already hot or warm. Does it only pause if a cool drive transitions to warm or hot? Perhaps there should be a check on whether to start in the first place at the current temperature: some days are hot enough to push the drives into the warm zone, and then the check shouldn't run at all.

 

Here is a log where the scheduled start time is 6am and the end time is noon, but 2 drives are already hot and 3 are warm. The check appears to have run the whole way, repeating the message "drives=24, hot=2, warm=3, cool=19. Correcting Parity Check with all drives below temperature threshold for a Pause" every 5 minutes but never pausing. Highlights from the attached log:

 

Quote


Jul  3 05:55:01 tower parity.check.tuning.php: DEBUG: Monitor:  Parity check appears to be paused
Jul  3 05:55:01 tower parity.check.tuning.php: DEBUG: drives=24, hot=2, warm=3, cool=19
Jul  3 05:55:01 tower parity.check.tuning.php: DEBUG: Array operation paused but not for temperature related reason


Jul  3 06:00:01 tower parity.check.tuning.php: DEBUG: Resume request
Jul  3 06:00:01 tower parity.check.tuning.php: DEBUG: drives=24, hot=2, warm=3, cool=19
Jul  3 06:00:01 tower parity.check.tuning.php: DEBUG: ...configured scheduled action for Correcting Parity Check
Jul  3 06:00:01 tower parity.check.tuning.php: DEBUG: Array operation paused but not for temperature related reason
Jul  3 06:00:06 tower parity.check.tuning.php: DEBUG: Resumed Correcting Parity Check  (23.1% completed) 


Jul  3 06:05:01 tower parity.check.tuning.php: DEBUG: drives=24, hot=2, warm=3, cool=19
Jul  3 06:05:01 tower parity.check.tuning.php: DEBUG: Correcting Parity Check with all drives below temperature threshold for a Pause


Jul  3 12:00:01 tower parity.check.tuning.php: DEBUG: Pause request
Jul  3 12:00:09 tower parity.check.tuning.php: DEBUG: Pause of Correcting Parity Check  (50.9% completed) 

Also, I have 5 drives (2 hot, 3 warm), not 24. All 19 cool drives do not exist.
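
For anyone wanting to compare these summary lines across a long log, here is a minimal Python sketch that pulls out the counts. It assumes the line layout shown in the excerpt above; the name `parse_summaries` is made up for illustration and is not part of the plugin.

```python
import re

# Matches a plugin syslog line carrying the drive summary, for example:
# "Jul  3 05:55:01 tower parity.check.tuning.php: DEBUG: drives=24, hot=2, warm=3, cool=19"
SUMMARY = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+[\d:]{8})\s+\S+\s+parity\.check\.tuning\.php:.*?"
    r"drives=(?P<drives>\d+), hot=(?P<hot>\d+), warm=(?P<warm>\d+), cool=(?P<cool>\d+)"
)

def parse_summaries(lines):
    """Return a list of (timestamp, counts) for every line with a summary."""
    found = []
    for line in lines:
        match = SUMMARY.search(line)
        if match is None:
            continue  # not a summary line (e.g. a Resume or Pause message)
        counts = {key: int(match.group(key)) for key in ("drives", "hot", "warm", "cool")}
        found.append((match.group("ts"), counts))
    return found

sample = [
    "Jul  3 05:55:01 tower parity.check.tuning.php: DEBUG: drives=24, hot=2, warm=3, cool=19",
    "Jul  3 06:00:06 tower parity.check.tuning.php: DEBUG: Resumed Correcting Parity Check  (23.1% completed) ",
]
results = parse_summaries(sample)
print(results)
```

In practice the lines would come from the syslog file rather than a hard-coded list.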

Unraid 6.8.3

Parity Tuning 2019.10.23

 

20200703_parity_tuning_temperature.log

parity_tuning_settings.png

Edited by robobub


You have uncovered a bug in the code for pausing when drives overheat, and I will rectify this. When you have activated temperature-related pause/resume, a monitor task runs every 5 minutes to check temperatures - that is the "Monitor" entry in your log. If temperature monitoring is not active then this task is not needed, so you get a lot less logged. The other resume/pause log entries come from the standard (not temperature-related) scheduled pause/resume.

 

What is MEANT to happen is that the Monitor task will list the overheated drives (assuming debug logging is active) each time following the summary message and then pause the parity check that is running.   It is this list + pause code that has the bug so that the pause is not taking place.  

 

I think the fact you have 19 "cool" phantom drives listed will be because you have 24 slots set on the Main tab (i.e. you have not reduced it to the number of drives you actually have).  Can you confirm if this is the case?  If so I need to add a check for slots allowed but not currently having a drive assigned to them to correct this count.  If that is not the case then I need to do further investigation to determine why the drive count might be wrong in your case.

 

I am in the middle of adding/testing multi-language support (ready for others to add translations to other languages) to the plugin so it will be at least a few days before I can release the fixes for the issues mentioned above.   Hopefully this will not inconvenience you too much.


I get this error when updating from 2019.10.23 to 2020.06.24:

-----------------
plugin: updating: parity.check.tuning.plg
plugin: downloading: https://raw.githubusercontent.com/itimpi/parity.check.tuning/master/archives/parity.check.tuning-2020.06.24.txz ... failed (Invalid URL / Server error response)
plugin: wget: https://raw.githubusercontent.com/itimpi/parity.check.tuning/master/archives/parity.check.tuning-2020.06.24.txz download failure (Invalid URL / Server error response)
-----------------


That is strange - I did not think I had made a release at that point. There has not been a release since last year as far as I know.
 

I have been working on changes recently to support the UnRAID 6.9 multi-language feature and I think these are nearly complete so there is likely to be an update sometime in the next few days which IS meant to be there and will include this new feature as well as a number of minor fixes.

1 minute ago, Squid said:

That was not meant to be checked into github - I will have to revert it until I have finished testing my current changes.   I was only intending to check in the source file changes - not the .plg file.

On 3/2/2019 at 2:46 PM, itimpi said:
  • Auto detect idle periods:  The idea is that instead of the user having to specify specific start/stop times for running parity check increments the plugin should automatically detect periods when the system is idle to resume a parity check.   This would need the complementary option of automatically detecting the system is no longer idle so that the check can be paused.
  • Partial parity Checks: This is just a different facet of the ability to Resume parity checks on array start where you deliberately set the system up to perform part of a parity check with reboots in between the parts.

Those are the features I personally need most. 🤤

Thanks for finally bringing this to life.

59 minutes ago, Fireball3 said:

Those are the features I personally need most. 🤤

Thanks for finally bringing this to life.

At the moment I have no idea how to detect idle periods for the first item, and the second is not currently possible from a technical standpoint as it requires Limetech to add a new capability to core UnRAID.

 


 

On 7/4/2020 at 2:56 AM, itimpi said:

I think the fact you have 19 "cool" phantom drives listed will be because you have 24 slots set on the Main tab (i.e. you have not reduced it to the number of drives you actually have).  Can you confirm if this is the case?  If so I need to add a check for slots allowed but not currently having a drive assigned to them to correct this count.  If that is not the case then I need to do further investigation to determine why the drive count might be wrong in your case.

I never realized I could change this; I had left it at the default of 24.

On 7/4/2020 at 2:56 AM, itimpi said:

You have uncovered a bug in the code for pausing when drives overheat, and I will rectify this. When you have activated temperature-related pause/resume, a monitor task runs every 5 minutes to check temperatures - that is the "Monitor" entry in your log. If temperature monitoring is not active then this task is not needed, so you get a lot less logged. The other resume/pause log entries come from the standard (not temperature-related) scheduled pause/resume.

 

What is MEANT to happen is that the Monitor task will list the overheated drives (assuming debug logging is active) each time following the summary message and then pause the parity check that is running.   It is this list + pause code that has the bug so that the pause is not taking place.  

On 7/4/2020 at 2:56 AM, itimpi said:

I am in the middle of adding/testing multi-language support (ready for others to add translations to other languages) to the plugin so it will be at least a few days before I can release the fixes for the issues mentioned above.   Hopefully this will not inconvenience you too much.

Not a problem at all, it's a low priority issue and the plugin otherwise works great. Thanks!


I have made available an updated version of the plugin that should fix all the issues I currently know about. This version is also ‘multi-language’ enabled if anyone feels the urge to provide translations to languages other than English.

Posted (edited)

First, thanks for this very handy plugin.

I think there are minor bugs in the temperature-pausing code:

  1. The assigned-slots check (which I believe you just added) reports an incorrect total. I have 3 array disks, 2 cache disks and 4 Unassigned Devices disks. The plugin shows 4 total (I assume array) disks
  2. Parity check is paused with hot=1, warm=2, cool=1 and resumed with hot=1, warm=1, cool=2. Seems like pause/resume should be based on hot alone
  3. The PAUSE debug reports the label of the overheated drive as "temp"
Aug  1 07:50:01 NAS parity.check.tuning.php: DEBUG: -----------MONITOR start------
Aug  1 07:50:01 NAS parity.check.tuning.php: DEBUG: Parity check appears to be paused
Aug  1 07:50:01 NAS parity.check.tuning.php: DEBUG: drives=4, hot=1, warm=2, cool=1
Aug  1 07:50:01 NAS parity.check.tuning.php: Resumed Non-Correcting Parity Check  (57.8% completed)  as drives now cooled down
Aug  1 07:50:01 NAS parity.check.tuning.php: DEBUG: written RESUME (COOL) record to  /boot/config/plugins/parity.check.tuning/parity.check.tuning.progress
Aug  1 07:50:01 NAS parity.check.tuning.php: DEBUG: -----------MDCMD start------
Aug  1 07:50:01 NAS parity.check.tuning.php: DEBUG: detected that mdcmd had been called from sh with command mdcmd check RESUME 
Aug  1 07:50:01 NAS parity.check.tuning.php: DEBUG: -----------MDCMD end-------
Aug  1 07:50:01 NAS kernel: mdcmd (162): check RESUME
Aug  1 07:50:01 NAS kernel: 
Aug  1 07:50:01 NAS kernel: md: recovery thread: check P ...
Aug  1 07:50:01 NAS parity.check.tuning.php: DEBUG: -----------MONITOR end-------
Aug  1 07:55:01 NAS parity.check.tuning.php: DEBUG: -----------MONITOR start------
Aug  1 07:55:01 NAS parity.check.tuning.php: DEBUG: drives=4, hot=1, warm=1, cool=2
Aug  1 07:55:01 NAS parity.check.tuning.php: Paused Non-Correcting Parity Check  (58.4% completed) : Following drives overheated: temp 
Aug  1 07:55:01 NAS parity.check.tuning.php: DEBUG: written PAUSE (HOT) record to  /boot/config/plugins/parity.check.tuning/parity.check.tuning.progress
Aug  1 07:55:01 NAS parity.check.tuning.php: DEBUG: -----------MDCMD start------
Aug  1 07:55:01 NAS parity.check.tuning.php: DEBUG: detected that mdcmd had been called from sh with command mdcmd nocheck PAUSE 
Aug  1 07:55:01 NAS parity.check.tuning.php: DEBUG: -----------MDCMD end-------
Aug  1 07:55:01 NAS kernel: mdcmd (163): nocheck PAUSE
Aug  1 07:55:01 NAS kernel: 
Aug  1 07:55:01 NAS kernel: md: recovery thread: exit status: -4
Aug  1 07:55:01 NAS parity.check.tuning.php: DEBUG: -----------MONITOR end-------

 

Edited by CS01-HS


1. I will have to see if I can reproduce the problem you are seeing with the drive count being reported incorrectly. Any chance you can ssh into your system and make a copy of the file /var/local/emhttp/disks.ini? That should give me the information I need to work out why the number of disks being reported is wrong. It is probably going to be something trivial to fix once I have an example from your system to work from.

 

2. It looks like there may be a bug in the Resume code, which is not correctly checking that all drives have cooled enough to resume. I need to work out why.

 

3.   Worked out why you are getting a drive listed as ‘temp’ rather than the correct name - a case of a missing ‘$’ on a variable name so easy to fix.

 

BTW:  Much as I would like to take credit the MacVM docker is nothing to do with me.

14 minutes ago, itimpi said:

Any chance you can ssh into your system and make a copy of the file /var/local/emhttp/disks.ini.

Happy to. Attached with identifying information anonymized (I think.)

14 minutes ago, itimpi said:

missing ‘$’ on a variable name so easy to fix.

If I had a nickel...

14 minutes ago, itimpi said:

BTW:  Much as I would like to take credit the MacVM docker is nothing to do with me.

Ha! Yeah I realized I confused you two and edited my post but not in time (my apologies.)

disks.ini.txt


I think I have now fixed all 3 issues reported and am now doing some testing to check that no obvious new bugs have been introduced. I have made an additional log level of ‘testing’ available via the GUI in case I ever want a user to activate it. Previously this was controlled by a local variable I set at my end, but adding it to the GUI makes it easier to switch on/off and also means I will not accidentally issue a release with it left activated.

 

Thanks for the disks.ini - it showed me that I was not correctly handling the case of a parity2 disk not being present when working out which slots contained drives.

 

The plugin ignores non-array drives - I am changing the message giving the count to make this clearer.
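
As an illustration of the slot-counting idea, here is a rough Python sketch that counts only array slots with a drive assigned. The sample text is a hypothetical, cut-down stand-in for /var/local/emhttp/disks.ini: the section layout, the `type`, `device` and `status` keys and the `DISK_NP` value are assumptions made for this example, and the plugin itself is PHP, so this is not its actual code.

```python
import configparser

# Hypothetical sample: parity2 slot exists but has no drive; cache is not an array drive.
SAMPLE_INI = """\
["parity"]
type="Parity"
device="sdb"
status="DISK_OK"
["parity2"]
type="Parity"
device=""
status="DISK_NP"
["disk1"]
type="Data"
device="sdc"
status="DISK_OK"
["disk2"]
type="Data"
device="sdd"
status="DISK_OK"
["cache"]
type="Cache"
device="sde"
status="DISK_OK"
"""

def assigned_array_disks(ini_text):
    """Return the names of array slots (parity or data) that hold a drive."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(ini_text)
    names = []
    for section in parser.sections():
        # Values keep their surrounding quotes, so strip them before comparing.
        disk = {key: value.strip('"') for key, value in parser[section].items()}
        if disk.get("type") not in ("Parity", "Data"):
            continue  # cache, flash and other non-array devices are ignored
        if not disk.get("device") or disk.get("status", "").startswith("DISK_NP"):
            continue  # slot is allowed but no drive is assigned to it
        names.append(section.strip('"'))
    return names

print(assigned_array_disks(SAMPLE_INI))
```

With the sample above, the empty parity2 slot and the cache drive are both left out of the count.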

 


I have just pushed what I hope is the ‘fixed’ version of the plugin to GitHub.    Let me know if you notice any further anomalies/bugs.


I just updated and tested. (1) and (3) are fixed but re: (2) it looks like PAUSE (correctly) references the highest temperature disk but RESUME references the lowest temperature disk (it should be highest). Still, plenty good enough. Thank you.

Aug  1 12:20:01 NAS parity.check.tuning.php: DEBUG: -----------MONITOR start------
Aug  1 12:20:01 NAS parity.check.tuning.php: TESTING: parity temp=38 (settings are: hot=40, cool=35))
Aug  1 12:20:01 NAS parity.check.tuning.php: TESTING: disk1 temp=40 (settings are: hot=40, cool=35))
Aug  1 12:20:01 NAS parity.check.tuning.php: TESTING: disk2 temp=41 (settings are: hot=40, cool=35))
Aug  1 12:20:01 NAS parity.check.tuning.php: DEBUG: array drives=3, hot=2, warm=1, cool=0
Aug  1 12:20:01 NAS parity.check.tuning.php: Paused Non-Correcting Parity Check  (86.5% completed) : Following drives overheated: 40 41
Aug  1 12:20:01 NAS parity.check.tuning.php: DEBUG: written PAUSE (HOT) record to  /boot/config/plugins/parity.check.tuning/parity.check.tuning.progress
Aug  1 12:20:01 NAS parity.check.tuning.php: DEBUG: -----------MDCMD start------
Aug  1 12:20:01 NAS parity.check.tuning.php: DEBUG: detected that mdcmd had been called from sh with command mdcmd nocheck PAUSE
Aug  1 12:20:01 NAS parity.check.tuning.php: DEBUG: -----------MDCMD end-------
Aug  1 12:20:01 NAS kernel: mdcmd (169): nocheck PAUSE
Aug  1 12:20:01 NAS kernel:
Aug  1 12:20:02 NAS kernel: md: recovery thread: exit status: -4
Aug  1 12:20:02 NAS parity.check.tuning.php: TESTING: Heat notifications disabled so Pause Following drives overheated: 40 41  not sent
Aug  1 12:20:02 NAS parity.check.tuning.php: DEBUG: -----------MONITOR end-------
Aug  1 12:35:01 NAS parity.check.tuning.php: DEBUG: -----------MONITOR start------
Aug  1 12:35:01 NAS parity.check.tuning.php: DEBUG: Parity check appears to be paused
Aug  1 12:35:01 NAS parity.check.tuning.php: TESTING: parity temp=38 (settings are: hot=40, cool=35))
Aug  1 12:35:01 NAS parity.check.tuning.php: TESTING: disk1 temp=40 (settings are: hot=40, cool=35))
Aug  1 12:35:01 NAS parity.check.tuning.php: TESTING: disk2 temp=41 (settings are: hot=40, cool=35))
Aug  1 12:35:01 NAS parity.check.tuning.php: DEBUG: array drives=3, hot=2, warm=1, cool=0
Aug  1 12:35:01 NAS parity.check.tuning.php: DEBUG: Array operation paused but drives not cooled enough to resume
Aug  1 12:35:01 NAS parity.check.tuning.php: DEBUG: -----------MONITOR end-------
Aug  1 12:40:02 NAS parity.check.tuning.php: DEBUG: -----------MONITOR start------
Aug  1 12:40:02 NAS parity.check.tuning.php: DEBUG: Parity check appears to be paused
Aug  1 12:40:02 NAS parity.check.tuning.php: TESTING: parity temp=36 (settings are: hot=40, cool=35))
Aug  1 12:40:02 NAS parity.check.tuning.php: TESTING: disk1 temp=34 (settings are: hot=40, cool=35))
Aug  1 12:40:02 NAS parity.check.tuning.php: TESTING: disk2 temp=38 (settings are: hot=40, cool=35))
Aug  1 12:40:02 NAS parity.check.tuning.php: DEBUG: array drives=3, hot=0, warm=2, cool=1
Aug  1 12:40:02 NAS parity.check.tuning.php: Resumed Non-Correcting Parity Check  (86.5% completed)  as drives now cooled down
Aug  1 12:40:02 NAS parity.check.tuning.php: DEBUG: written RESUME (COOL) record to  /boot/config/plugins/parity.check.tuning/parity.check.tuning.progress
Aug  1 12:40:02 NAS parity.check.tuning.php: DEBUG: -----------MDCMD start------
Aug  1 12:40:02 NAS parity.check.tuning.php: DEBUG: detected that mdcmd had been called from sh with command mdcmd check RESUME
Aug  1 12:40:02 NAS parity.check.tuning.php: DEBUG: -----------MDCMD end-------
Aug  1 12:40:02 NAS kernel: mdcmd (170): check RESUME
Aug  1 12:40:02 NAS kernel:
Aug  1 12:40:02 NAS kernel: md: recovery thread: check P ...
Aug  1 12:40:02 NAS parity.check.tuning.php: TESTING: Heat notifications disabled so Resume Drives cooled down not sent
Aug  1 12:40:02 NAS parity.check.tuning.php: DEBUG: -----------MONITOR end-------

 


Thanks for the feedback.

 

I will look at 2 again. I may be letting the resume happen as soon as no disks are classed as ‘hot’, rather than checking they have cooled by the amount specified in the plugin settings.
 

It looks like making the Testing log option available in the GUI may have been a good decision in helping to get to the bottom of such issues. Testing the temperature-related settings has always been a bit difficult as I have no problems with my disks overheating, so I have to artificially force such failures.
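
A sketch of the pause/resume rule being discussed, in Python for illustration only (the plugin is PHP and its real logic may differ). It assumes a pause is triggered when the hottest drive reaches the hot setting, and a resume only once the hottest drive is below the cool setting; `next_action` and its parameters are hypothetical names. The returned strings are the mdcmd arguments visible in the logs above, and nothing is executed here.

```python
def next_action(temps, hot_threshold, cool_threshold, paused_for_heat):
    """Return the mdcmd arguments to issue, or None to leave things alone.

    temps maps drive name -> temperature for the array drives only.
    """
    if not temps:
        return None  # no readings, so nothing to decide on
    hottest = max(temps.values())
    if not paused_for_heat:
        # Running: a single drive reaching the hot setting is enough to pause.
        return "nocheck PAUSE" if hottest >= hot_threshold else None
    # Paused for heat: wait until even the hottest drive is below the cool setting.
    return "check RESUME" if hottest < cool_threshold else None

# Readings taken from the 12:20 and 12:40 log entries (hot=40, cool=35).
at_1220 = {"parity": 38, "disk1": 40, "disk2": 41}
at_1240 = {"parity": 36, "disk1": 34, "disk2": 38}
print(next_action(at_1220, 40, 35, paused_for_heat=False))
print(next_action(at_1240, 40, 35, paused_for_heat=True))
```

Under this rule the 12:40 reading would not trigger a resume, because parity (36) and disk2 (38) are still above the cool setting of 35.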


Unraid was performing a parity check and it kept saying the temperature was too high and pausing the check.

 

But when I checked the temperature of the parity disk it wasn't too high.

The warning temp is set at 50 C.

894996989_Screenshotfrom2020-08-0313-46-23.png.098e98ae8c872cde7398347a994d3cf6.png

 

When I checked the SMART data it said: current 37 during the parity check and max temp 42.

655484115_Screenshotfrom2020-08-0313-45-21.png.ba824d57c31e53d06dfedd4741d24f99.png

 

So how can the parity check say temp is too high?

1964074660_Screenshotfrom2020-08-0316-30-40.thumb.png.8f66a8e81228553c7f856a2cf4c7e80b.png

Aug  2 12:29:46 ThaNekos kernel:
Aug  2 12:30:02 ThaNekos parity.check.tuning.php: Paused Read-Check  (24.2% completed) : Following drives overheated: temp temp
Aug  2 12:30:02 ThaNekos kernel: mdcmd (910): nocheck PAUSE
Aug  2 12:30:02 ThaNekos kernel:
Aug  2 12:30:05 ThaNekos kernel: md: recovery thread: exit status: -4
Aug  2 12:34:46 ThaNekos kernel: mdcmd (911): set md_write_method 0
Aug  2 12:34:46 ThaNekos kernel:
Aug  2 12:35:01 ThaNekos parity.check.tuning.php: Resumed Read-Check  (24.2% completed)  as drives now cooled down
Aug  2 12:35:01 ThaNekos kernel: mdcmd (912): check RESUME
Aug  2 12:35:01 ThaNekos kernel:
Aug  2 12:35:01 ThaNekos kernel: md: recovery thread: check ...
Aug  2 12:39:47 ThaNekos kernel: mdcmd (913): set md_write_method 1
Aug  2 12:39:47 ThaNekos kernel:
Aug  2 12:40:03 ThaNekos parity.check.tuning.php: Paused Read-Check  (24.4% completed) : Following drives overheated: temp temp
Aug  2 12:40:03 ThaNekos kernel: mdcmd (914): nocheck PAUSE
Aug  2 12:40:03 ThaNekos kernel:
Aug  2 12:40:06 ThaNekos kernel: md: recovery thread: exit status: -4
Aug  2 12:44:47 ThaNekos kernel: mdcmd (915): set md_write_method 0
Aug  2 12:44:47 ThaNekos kernel:
Aug  2 12:45:01 ThaNekos parity.check.tuning.php: Resumed Read-Check  (24.4% completed)  as drives now cooled down


It does not look as if you have the latest version of the plugin. There have been some fixes released in the last few days that relate to temperature-related pause/resume. You can upgrade the plugin even if you already have a parity check running and the fixes should kick in.

 

If you still get unexpected results using the latest version of the plugin (2020.08.03) then please enable the “Testing” log level for the plugin in Settings->Scheduler and post (or PM me) the results. When testing the temperature-related feature I have to artificially force the conditions for pause/resume, as normally my disks do not overheat, so it is always possible that I am missing some real-world conditions. That level of logging should tell me exactly what is going on so I know what to look for that might need fixing.

 


Aug 3 17:30:01 ThaNekos parity.check.tuning.php: DEBUG: -----------MONITOR start------
Aug 3 17:30:01 ThaNekos parity.check.tuning.php: DEBUG: array drives=6, hot=0, warm=0, cool=6
Aug 3 17:30:01 ThaNekos parity.check.tuning.php: DEBUG: Read-Check with all drives below temperature threshold for a Pause
Aug 3 17:30:01 ThaNekos parity.check.tuning.php: DEBUG: -----------MONITOR end-------

 

I enabled debug logging, which is normally off. I see 6 drives, which is correct for the array.

 

It doesn't check the cache drives?


No, it is only checking array drives as they are the only ones which take part in parity checks (or other long running array operations).     If a sensible use case could be made for cache drives overheating to pause main array operations it would be easy to add that, but it has not seemed something to consider at the moment.

 

If you increase the log level to “testing” you will get even more detail about each drive as it is checked.

 

Once you are happy that things are working as expected I recommend that you disable the debug logging to minimise writes to syslog.

 

FYI:   Hot means reached pause threshold temperature

        Warm means between pause and resume thresholds

        Cool means temperatures below threshold for resume.
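
To make the three bands concrete, here is a small Python sketch of that classification. The function names are hypothetical, and treating a temperature exactly equal to the resume threshold as "warm" is an assumption based on the wording above; the readings are the ones from the Testing-level log entries earlier in the thread (hot=40, cool=35).

```python
def classify(temp, hot_threshold, cool_threshold):
    """Place one drive temperature into the hot, warm or cool band."""
    if temp >= hot_threshold:
        return "hot"   # reached the pause threshold
    if temp < cool_threshold:
        return "cool"  # below the resume threshold
    return "warm"      # somewhere between the two thresholds

def summarize(temps, hot_threshold, cool_threshold):
    """Count drives per band, in the style of the plugin's summary line."""
    counts = {"hot": 0, "warm": 0, "cool": 0}
    for temp in temps.values():
        counts[classify(temp, hot_threshold, cool_threshold)] += 1
    return counts

first = summarize({"parity": 38, "disk1": 40, "disk2": 41}, 40, 35)
second = summarize({"parity": 36, "disk1": 34, "disk2": 38}, 40, 35)
print(first)
print(second)
```

These two readings give hot=2, warm=1, cool=0 and hot=0, warm=2, cool=1, which matches the summary lines logged for them.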

 

1 hour ago, itimpi said:

No, it is only checking array drives as they are the only ones which take part in parity checks (or other long running array operations).     If a sensible use case could be made for cache drives overheating to pause main array operations it would be easy to add that, but it has not seemed something to consider at the moment.

 

If you increase the log level to “testing” you will get even more detail about each drive as it is checked.

 

Once you are happy that things are working as expected I recommend that you disable the debug logging to minimise writes to syslog.

 

FYI:   Hot means reached pause threshold temperature

        Warm means between pause and resume thresholds

        Cool means temperatures below threshold for resume.

 

Ok thx :) Will put it on testing as it's still doing its check.
