[Plugin] Parity Check Tuning



@Quejo Thanks, your last diagnostics showed me where the plugin was losing track of the fact a pause had been done because the drives overheated and I have pushed an update that corrects that.  The question is whether there is any other lurking bug in this area, and if so another run with Testing logging active should show me where.  I have slightly improved my logging in this area to help me identify problems.

 

Note that if you have not set the option to run Manual checks in increments, it should not matter whether a pause/resume caused by hot drives falls inside the increment time window when you actually run a Manual check.

 

Link to comment
2 hours ago, itimpi said:

@Quejo Thanks, your last diagnostics showed me where the plugin was losing track of the fact a pause had been done because the drives overheated and I have pushed an update that corrects that.  The question is whether there is any other lurking bug in this area, and if so another run with Testing logging active should show me where.  I have slightly improved my logging in this area to help me identify problems.

 

Note that if you have not set the option to run Manual checks in increments, it should not matter whether a pause/resume caused by hot drives falls inside the increment time window when you actually run a Manual check.

 

I will test this new version and report with logs here.

You said that if the parity check was paused because of disk overheating and the disks spun down before reaching the resume temperature, they would be treated as cooled down and the check would be resumed. Just as an idea, wouldn't it be better to keep the disks spinning in that situation until they have really cooled down, so that resuming the check makes sense?

I think that if the check resumes just because the disks spun down, there will be trouble if they are still warm: the check will pause again, restart again after the next spin-down, and so on...

Link to comment
4 minutes ago, Quejo said:

Just as an idea, wouldn't it be better to keep the disks spinning in that situation until they have really cooled down, so that resuming the check makes sense?

I think that if the check resumes just because the disks spun down, there will be trouble if they are still warm: the check will pause again, restart again after the next spin-down, and so on...

The problem is that at the moment I have no logic that would keep the drive spinning so something like that would take some research (although I have some ideas).  I think it is worth waiting to see if it really would be needed as it adds complications that might not be needed.   If it was done it would need to take into account if the check has passed the size of the disk as in that case you would not want to keep it spinning.

Link to comment
36 minutes ago, itimpi said:

The problem is that at the moment I have no logic that would keep the drive spinning so something like that would take some research (although I have some ideas).  I think it is worth waiting to see if it really would be needed as it adds complications that might not be needed.   If it was done it would need to take into account if the check has passed the size of the disk as in that case you would not want to keep it spinning.

Great ideas.

Before updating to the latest version you pushed, I noticed that the plugin was pausing and resuming as it should, so I took the logs.

Another issue I noticed is that the warning temperature is set to 60 and the pause threshold to 3 degrees below that (so 57), but it only pauses at a higher temperature like 61 or 62.

tower-diagnostics-20220405-1235.zip

Edited by Quejo
Link to comment
20 minutes ago, Quejo said:

Another issue I noticed is that the warning temperature is set to 60 and the pause threshold to 3 degrees below that (so 57), but it only pauses at a higher temperature like 61 or 62.

 

From what I can see in the diagnostics, this is due to the speed at which the drives heat up between the temperature monitoring checks, so the drive temperatures overshoot before the excessive heat is detected. At the moment the check only runs every 7 minutes when temperature monitoring is active (I did not want it to be too frequent, to keep the cost of monitoring down). Maybe I need to consider using a more frequent check.

 

I could easily introduce some settings into the plugin's .cfg file that control this interval. They would not be exposed in the GUI but could be changed by editing the file directly, to allow experimentation with different values. The downside is that more frequent checks would result in more log entries for everything except Basic mode logging, unless I revisit what gets logged at the different levels.
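
To put rough numbers on the overshoot described above, here is a minimal sketch in Python. Only the 7-minute interval, the warning temperature of 60 and the 3-degree pause margin come from this thread. The heating rate is an assumed figure for illustration and was not measured from the diagnostics, and `worst_case_peak` is a hypothetical helper, not part of the plugin.

```python
def worst_case_peak(pause_at_c, heat_rate_c_per_min, interval_min):
    """Highest temperature a drive could reach before the next monitor run
    notices it, if it crosses the pause threshold just after a run."""
    return pause_at_c + heat_rate_c_per_min * interval_min


WARNING_C = 60        # warning temperature quoted in the thread
PAUSE_MARGIN_C = 3    # pause threshold is 3 degrees below the warning level
pause_at = WARNING_C - PAUSE_MARGIN_C

# ASSUMPTION: drives heat at 0.5 C per minute under parity-check load.
ASSUMED_HEAT_RATE = 0.5

for interval in (7, 3, 1):
    peak = worst_case_peak(pause_at, ASSUMED_HEAT_RATE, interval)
    print(f"monitor every {interval} min: worst-case peak {peak:.1f} C")
```

With the assumed 0.5 degrees per minute, a 7-minute interval allows a peak of 60.5, which is already above the warning level, while a 1-minute interval would keep the peak at 57.5. A faster-heating enclosure would overshoot proportionally more.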

Link to comment
1 hour ago, itimpi said:

 

From what I can see in the diagnostics, this is due to the speed at which the drives heat up between the temperature monitoring checks, so the drive temperatures overshoot before the excessive heat is detected. At the moment the check only runs every 7 minutes when temperature monitoring is active (I did not want it to be too frequent, to keep the cost of monitoring down). Maybe I need to consider using a more frequent check.

 

I could easily introduce some settings into the plugin's .cfg file that control this interval. They would not be exposed in the GUI but could be changed by editing the file directly, to allow experimentation with different values. The downside is that more frequent checks would result in more log entries for everything except Basic mode logging, unless I revisit what gets logged at the different levels.

Is the plugin's temperature reading frequency in any way related to "Settings -> Disk settings -> Tunable (poll_attributes)"?

Edited by Quejo
Link to comment
26 minutes ago, Quejo said:

Is the plugin's temperature reading frequency in any way related to "Settings -> Disk settings -> Tunable (poll_attributes)"?

No, the plugin sets up cron jobs with a frequency chosen according to the current plugin settings. The values are basically arbitrary; I picked ones that seemed appropriate. It is possible, however, that the setting you mention DOES affect how often the temperature values I read in the plugin are updated, but it is not something I have looked into.

Link to comment
On 4/5/2022 at 1:49 PM, itimpi said:

 

From what I can see in the diagnostics, this is due to the speed at which the drives heat up between the temperature monitoring checks, so the drive temperatures overshoot before the excessive heat is detected. At the moment the check only runs every 7 minutes when temperature monitoring is active (I did not want it to be too frequent, to keep the cost of monitoring down). Maybe I need to consider using a more frequent check.

 

I could easily introduce some settings into the plugin's .cfg file that control this interval. They would not be exposed in the GUI but could be changed by editing the file directly, to allow experimentation with different values. The downside is that more frequent checks would result in more log entries for everything except Basic mode logging, unless I revisit what gets logged at the different levels.

After the latest update, pausing and resuming are working correctly. The issue now is the temperature reading interval, since the disks are overheating above the warning temperature. Maybe the disk case fan is not working properly, which would contribute to the fast warming.

tower-diagnostics-20220406-1541.zip

Link to comment
10 minutes ago, itimpi said:

OK - Good to know the basic mechanism is working properly now.

 

I can post a release that will allow you to adjust the intervals at which the monitor task runs if you want to check out if shorter intervals give you better control.

That would be great. Thank you for your effort.

Link to comment
14 hours ago, Quejo said:

That would be great. Thank you for your effort.

I have now pushed the release that allows you to manually configure the intervals at which the plugin's monitor task runs.

 

Since you already have the plugin installed, you should start by using the Defaults button to get the entries into the stored parity.check.tuning.cfg file in the plugins folder on the flash. Then set the settings in the GUI as you want them and press Apply to update that file with those settings. At this point you can manually edit the file to play with the monitor task frequency. The entries of interest are:

parityTuningMonitorHeat="7"

which is the monitor frequency (in minutes) if you have enabled the plugin option to have the plugin checking for temperatures.  This is the delay that could happen before the plugin even detects you have started a parity check, and I suspect changing it would not make much difference to overall behaviour but you are welcome to try to see if it does.

 

parityTuningMonitorBusy="6"

which is the monitor frequency that the plugin sets after it has detected that an array operation is active. I think this is the setting most likely to reduce the chances of temperature overshoot for you, and I would be interested to know if reducing the value helps in any way.
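
For anyone who would rather script the change than hand-edit the file, here is a minimal Python sketch. The two key names and their `key="value"` form are taken from the post above. The helper `set_cfg_value` is hypothetical, the new value of 2 is an arbitrary example rather than a recommendation, and the sketch works on the file contents as a string so that no path on the flash drive has to be assumed.

```python
import re


def set_cfg_value(cfg_text, key, value):
    """Return cfg_text with the line key="..." rewritten to the new value.

    Raises KeyError if the key is not present, so a typo in the key name
    does not silently leave the file unchanged.
    """
    pattern = re.compile(rf'^{re.escape(key)}="[^"]*"[ \t]*$', re.MULTILINE)
    new_text, count = pattern.subn(lambda m: f'{key}="{value}"', cfg_text)
    if count == 0:
        raise KeyError(key)
    return new_text


# Sample contents using the two entries quoted in the post.
sample = 'parityTuningMonitorHeat="7"\nparityTuningMonitorBusy="6"\n'

# ASSUMPTION: 2 minutes is just an example value to experiment with.
updated = set_cfg_value(sample, "parityTuningMonitorBusy", 2)
print(updated)
```

To use it on the real file, read parity.check.tuning.cfg into a string, pass it through the helper, and write the result back after keeping a copy of the original.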

 

Let me know how things go.  I would be interested in seeing any diagnostics with at least Debug logging enabled regardless of the outcome of your tests.

Link to comment

Unraid seems to be logging scheduled checks as manual (I'm running 6.10.0-rc4.) 

 

I noticed it because my scheduled check, which should have paused in the morning, was still running. (I don't have it set to pause manual checks.)
 

Debug log from scheduled check:

Apr  7 00:05:01 NAS Parity Check Tuning: DEBUG:   Manual Non-Correcting Parity Check running
Apr  7 00:05:01 NAS Parity Check Tuning: DEBUG:   Resume request
Apr  7 00:05:01 NAS Parity Check Tuning: DEBUG:   ... Manual Non-Correcting Parity Check already running
Apr  7 00:07:01 NAS Parity Check Tuning: DEBUG:   Manual Non-Correcting Parity Check running
Apr  7 00:07:01 NAS Parity Check Tuning: DEBUG:   array drives=4, hot=0, warm=0, cool=4
Apr  7 00:07:01 NAS Parity Check Tuning: DEBUG:   All array drives below temperature threshold for a Pause

 

The plugin's reporting matches unraid's history:

1440951995_ScreenShot2022-04-07at7_40_27AM.png.5afe85921d02975609da648ddcc6be84.png

so I doubt it's a plugin problem, but I'm posting here because unless you're using it you probably won't notice.

 

Anyone else seeing this?

Edited by CS01-HS
Link to comment
47 minutes ago, CS01-HS said:

The plugin's reporting matches unraid's history:

1440951995_ScreenShot2022-04-07at7_40_27AM.png.5afe85921d02975609da648ddcc6be84.png

so I doubt it's a plugin problem, but I'm posting here because unless you're using it you probably won't notice.

 

Anyone else seeing this?

 

I have reproduced this and it IS a plugin bug, so I will get a fix out. I think a typo introduced in a recent update is causing this.

 

BTW:  The Parity History entry is from the plugin as well so it is not surprising they agree :) 

Link to comment
17 minutes ago, CS01-HS said:

 

Ha! Well that explains it, thanks.

It is slightly more complicated than that under the covers :)

 

Initially the Unraid built-in parity check code writes a history record, but with less detail than the plugin can provide.   The plugin then later updates that record with additional information.  If you look just after the check finishes then you may see the information written by the built-in code if the plugin has not gotten around to updating the record.

Link to comment
4 hours ago, Quejo said:

I think that there's another bug here. The plugin doesn't wait for the disks to cool down to the resume temperature before resuming; it resumes earlier.

tower-diagnostics-20220407-1329.zip

 

Not sure there is a bug per se here, as I think it is a side-effect of the disks spinning down, which the plugin is not currently designed to handle in the middle of a check (when the check has not got beyond the drive size).
 

Looking at the log in the diagnostics, it seems all the array drives had spun down (and thus were assumed to be cool). This is why some of the drives had a temperature logged as *C at that point. That was when the plugin decided the drives had cooled down enough and restarted the array operation. This caused the drives to spin up, so their temperature could be read again.
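
The effect is easier to see with a small Python sketch that counts drives the way the debug log line does (hot / warm / cool) but keeps spun-down drives, whose temperature shows as `*`, in a separate "unknown" bucket instead of treating them as cool. This is not the plugin's code. The meaning given to hot, warm and cool here, the resume temperature of 45 and the drive readings are all assumptions for illustration.

```python
def classify(temps, pause_at_c, resume_at_c):
    """Count drives per temperature band; '*' or None means unreadable."""
    counts = {"hot": 0, "warm": 0, "cool": 0, "unknown": 0}
    for reading in temps.values():
        if reading is None or reading == "*":
            counts["unknown"] += 1      # spun down: temperature not readable
        elif int(reading) >= pause_at_c:
            counts["hot"] += 1          # ASSUMPTION: hot = at or above pause level
        elif int(reading) > resume_at_c:
            counts["warm"] += 1         # ASSUMPTION: warm = between resume and pause
        else:
            counts["cool"] += 1
    return counts


# Hypothetical readings: one drive has spun down and reports '*'.
readings = {"parity": "*", "disk1": "58", "disk2": "50", "disk3": "40"}
print(classify(readings, pause_at_c=57, resume_at_c=45))
```

If the "unknown" drives were counted as cool, a fully spun-down array would look ready for a resume even though nothing has actually been measured.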

 

There is also something going on that I cannot fathom, in that the disks are reported as having exceeded the 'critical' value, not merely the warning level. It is as if the plugin is getting a value of 0C for that value from the Unraid configuration files. I would like to see the dynamix.cfg file from your system to see what is set there.

Link to comment
13 hours ago, itimpi said:

 

Not sure there is a bug per se here, as I think it is a side-effect of the disks spinning down, which the plugin is not currently designed to handle in the middle of a check (when the check has not got beyond the drive size).
 

Looking at the log in the diagnostics, it seems all the array drives had spun down (and thus were assumed to be cool). This is why some of the drives had a temperature logged as *C at that point. That was when the plugin decided the drives had cooled down enough and restarted the array operation. This caused the drives to spin up, so their temperature could be read again.

 

There is also something going on that I cannot fathom, in that the disks are reported as having exceeded the 'critical' value, not merely the warning level. It is as if the plugin is getting a value of 0C for that value from the Unraid configuration files. I would like to see the dynamix.cfg file from your system to see what is set there.

Hello,

here it is:

dynamix.cfg

Link to comment
33 minutes ago, Quejo said:


Thanks,   I can see that you have set the critical disk threshold to 0 which explains some of the unexpected values I was seeing in the logs around the critical value.   I need to cater for that special case to tidy things up.   I also need to cater for the warning threshold set to 0 as that is also a legitimate case.
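
A minimal Python sketch of why a threshold of 0 produces the odd log entries, and of one way to cater for it. It assumes that a threshold of 0 means "this threshold is not in use", which is my reading of the post above rather than something confirmed here. The function names are hypothetical.

```python
def exceeds_naive(temp_c, threshold_c):
    """Plain comparison: with threshold 0, every spinning drive 'exceeds' it."""
    return temp_c >= threshold_c


def exceeds(temp_c, threshold_c):
    """Comparison that treats a threshold of 0 as disabled (ASSUMPTION)."""
    if threshold_c == 0:
        return False
    return temp_c >= threshold_c


CRITICAL_C = 0   # value found in the posted dynamix.cfg
WARNING_C = 60   # warning temperature mentioned earlier in the thread

print(exceeds_naive(61, CRITICAL_C))  # the naive check flags the drive as critical
print(exceeds(61, CRITICAL_C))        # the guarded check ignores the disabled threshold
print(exceeds(61, WARNING_C))         # the warning threshold still applies
```

The same guard would apply to the warning threshold when that is set to 0.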

 

Regarding the fact that your disks spin down when the parity check is paused so that their temperature can not be read I am looking at some logic that will create then delete a file on such disks (which should spin  them up) and then wait for the next monitor point before making a decision on whether the temperatures (which should now be readable) are correct for a resume.   Whether this will work as I hope and help in your case I am not certain but it feels like it could be worth trying.   It is definitely very much an edge case as most people do not spin their disks down as aggressively, but maybe with energy prices rising rapidly it may become more common.
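
The create-then-delete idea can be sketched in a few lines of Python. This is an illustration only, not the plugin's implementation. The helper name is hypothetical, and whether a small write reliably spins a drive up depends on the system. On Unraid the array disks are normally mounted under paths such as /mnt/disk1, but the demo below deliberately uses a temporary directory instead of a real disk.

```python
import os
import tempfile


def touch_to_spin_up(mount_point):
    """Create a small file under mount_point, flush it, then delete it.

    Returns the path that was used so the caller can log it.
    """
    fd, path = tempfile.mkstemp(prefix=".spinup-", dir=mount_point)
    try:
        os.write(fd, b"x")
        os.fsync(fd)   # push the write to the device instead of leaving it cached
    finally:
        os.close(fd)
        os.remove(path)
    return path


# Demo against a throwaway directory rather than a real array disk.
with tempfile.TemporaryDirectory() as demo_dir:
    used = touch_to_spin_up(demo_dir)
    print("file removed again:", not os.path.exists(used))
```

After calling this for each spun-down disk, the temperatures would be read at the next monitor point, as described in the post.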

 

 

Link to comment
9 hours ago, itimpi said:


Thanks,   I can see that you have set the critical disk threshold to 0 which explains some of the unexpected values I was seeing in the logs around the critical value.   I need to cater for that special case to tidy things up.   I also need to cater for the warning threshold set to 0 as that is also a legitimate case.

 

Regarding the fact that your disks spin down when the parity check is paused so that their temperature can not be read I am looking at some logic that will create then delete a file on such disks (which should spin  them up) and then wait for the next monitor point before making a decision on whether the temperatures (which should now be readable) are correct for a resume.   Whether this will work as I hope and help in your case I am not certain but it feels like it could be worth trying.   It is definitely very much an edge case as most people do not spin their disks down as aggressively, but maybe with energy prices rising rapidly it may become more common.

 

 

My major problem is that I'm using an Orico 5-bay HDD case for my Unraid parity storage. Today I replaced the crappy fan with one that has much higher static pressure, but when all disks are spinning the temps still go higher than 60. The only way to keep temps down or cool them down is by spinning them down.

Edited by Quejo
Link to comment
12 hours ago, Quejo said:

the only way to keep temps down or cool them down is by spinning them down.

Do your disks cool down if they are spun up and idling?  If so my idea for spinning them up when I find they are spun down so that I can read their temperatures is not going to work very well.  Your diagnostics suggest they might but I am not certain it is enough?  I am assuming they do as otherwise they would be overheating even when not doing a parity check

Link to comment
On 4/9/2022 at 5:27 AM, itimpi said:

Do your disks cool down if they are spun up and idling?  If so my idea for spinning them up when I find they are spun down so that I can read their temperatures is not going to work very well.  Your diagnostics suggest they might but I am not certain it is enough?  I am assuming they do as otherwise they would be overheating even when not doing a parity check

They barely do.

Link to comment
1 hour ago, Quejo said:

They barely do.

Not sure then quite how well the code I have put in place to handle spin downs will work.

 

At least what I have implemented will not attempt to resume the check until the drives HAVE cooled down sufficiently although it might spin them up and then find the drives are still too hot and will continue waiting (and possibly another spindown/spinup sequence will happen).
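
The behaviour described above can be summarised as a small decision function. This Python sketch is only my reading of the post, not the released code. The action names, the use of `None` for an unreadable temperature and the resume temperature of 45 are assumptions.

```python
def next_action(temps, resume_at_c):
    """Decide what to do at a monitor point while a check is paused for heat.

    temps: list of drive temperatures in C, with None for a spun-down drive.
    """
    if any(t is None for t in temps):
        # Cannot judge yet: wake the drives and look again at the next monitor point.
        return "spin_up_and_wait"
    if all(t <= resume_at_c for t in temps):
        return "resume"
    # Readable but still too hot: keep waiting (the drives may spin down again).
    return "wait"


RESUME_AT_C = 45  # ASSUMPTION: example resume temperature

print(next_action([None, None, None], RESUME_AT_C))
print(next_action([52, 47, 44], RESUME_AT_C))
print(next_action([44, 43, 40], RESUME_AT_C))
```

The three calls correspond to the three situations in the post: drives spun down, drives readable but still hot, and drives cool enough to resume.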

 

I guess it is going to be a case of releasing this change (after a bit more testing) to see what happens :) 

Link to comment
1 hour ago, itimpi said:

Not sure then quite how well the code I have put in place to handle spin downs will work.

 

At least what I have implemented will not attempt to resume the check until the drives HAVE cooled down sufficiently although it might spin them up and then find the drives are still too hot and will continue waiting (and possibly another spindown/spinup sequence will happen).

 

I guess it is going to be a case of releasing this change (after a bit more testing) to see what happens :) 

As soon as you release it I will test and report.

Link to comment
