[Plugin] Parity Check Tuning



Since I have remote Plex users these days I decided I needed to do something about my monthly parity check.

 

I manually paused the monthly during the day, and after restarting and completing it on the 2nd night, I installed this plugin.

 

I get the error mentioned when checking History, but the plugin hasn't actually needed to do anything yet since there haven't been any parity checks since I installed it.

 

Don't know if that is useful to debugging this issue or not.

Link to post

3 hours ago, trurl said:

Since I have remote Plex users these days I decided I needed to do something about my monthly parity check.

 

I manually paused the monthly during the day, and after restarting and completing it on the 2nd night, I installed this plugin.

 

I get the error mentioned when checking History, but the plugin hasn't actually needed to do anything yet since there haven't been any parity checks since I installed it.

 

Don't know if that is useful to debugging this issue or not.


The error displayed trying to look at the parity check history is completely divorced from the main operation of the plugin so in that sense you should not be affected by it.

 

In addition, it does not seem to be fatal even when trying to display the history. However, since the line it complains about is in a file that handles multi-language support in unRaid, this may not be true if you have a language other than English set in unRaid (I have not tested whether this is the case). I am still trying to track down why it happens in the first place on one of my unRaid systems and not on another, both of which are running unRaid 6.9.1.

Link to post
10 hours ago, itimpi said:

 

Just thought I would let you know that tracking down the cause of this bug is proving more elusive than I had expected.  As I mentioned I can reproduce this bug on one 6.9.1 system and not another.  I have checked and they both have identical copies of the script for displaying the history, and on one it works without error and the other displays the error you see.  Still trying to pin down the difference that causes this issue.

I have this message too.  Is there anything that I can provide that may help?

Link to post
22 minutes ago, S80_UK said:

I have this message too.  Is there anything that I can provide that may help?

Not at the moment as I have a system that can reproduce the symptoms.

Link to post

Finally tracked down what has been causing the error line when displaying the Parity Check history.    Turned out to be a hidden ESC character that had crept into the script file.   Was not enough to cause a syntax check of the file to show an error, but it was enough to upset displaying the history results (deep inside one of the unRaid GUI support functions).   I will shortly issue an update to fix this.
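
For anyone who wants to check their own script files for this kind of stray character, here is a minimal sketch in Python. It is not the plugin's code, and the helper name `find_control_chars` is made up for illustration. It reports any ASCII control byte other than tab, line feed and carriage return, which would include a hidden ESC (0x1B).

```python
from typing import List, Tuple

# Control bytes that are normal in a text file: tab, line feed, carriage return.
ALLOWED = {0x09, 0x0A, 0x0D}


def find_control_chars(data: bytes) -> List[Tuple[int, int, int]]:
    """Return (line, column, byte value) for each unexpected control byte.

    Lines and columns are 1-based. Hypothetical helper, not part of the plugin.
    """
    hits = []
    for line_no, line in enumerate(data.split(b"\n"), start=1):
        for col, byte in enumerate(line, start=1):
            if (byte < 0x20 or byte == 0x7F) and byte not in ALLOWED:
                hits.append((line_no, col, byte))
    return hits


if __name__ == "__main__":
    # A made-up snippet with an ESC byte hidden at the end of line 2.
    sample = b"<?php\n$x = 1;\x1b\necho $x;\n"
    for line_no, col, byte in find_control_chars(sample):
        print(f"line {line_no}, column {col}: 0x{byte:02X}")
```

To scan a real file, read it in binary mode (for example `open(path, "rb").read()`) and pass the bytes to the function.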

Link to post

Hi @itimpi - I just thought I would update you.  I have run several scheduled and manual checks over the past few days  using 6th April version and 2nd April version, and I have tested with and without scheduled pauses of the checking.  All tests have worked and run to completion exactly as they should.  The only cases I have not tested are for unplanned checks after an unclean shutdown. 

 

So thanks again for your work on this, it's looking very good. 

Link to post
1 minute ago, S80_UK said:

Hi @itimpi - I just thought I would update you.  I have run several scheduled and manual checks over the past few days  using 6th April version and 2nd April version, and I have tested with and without scheduled pauses of the checking.  All tests have worked and run to completion exactly as they should.  The only cases I have not tested are for unplanned checks after an unclean shutdown. 

 

So thanks again for your work on this, it's looking very good. 


Thanks for the feedback.   Nice to know there is no issue outstanding that I should be looking at with any sense of urgency.  
 

I think the unplanned checks should be OK: since that was a new feature, it got a bit more testing in the recent releases than previously existing functionality.

Link to post
On 4/2/2021 at 1:12 PM, itimpi said:

 

Not sure why you should get this if on the latest release of the plugin :( 

 

To help with diagnosing the cause, can you install the new version of the plugin (2021-04-02) I have just released, and then in the plugin's settings set the logging level to "Testing" and select one of the options for logging to flash? The plugin will now create a file called 'parity.tuning.log' in the plugins folder on the flash drive when it is running, so if you can recreate the issue and let me have that file it should help me pinpoint where things are going wrong.

 

 

Hi, I did not really have the time to do this in the past weeks, but now I have. I installed the newest version, 2021.04.13, and set it to Testing with logging to both syslog and flash.

 

0,7% later I get the error while the drives aren't hot (36-37°C). 

 

You said the log is the "parity.tuning.log", but I've only a "parity.check.tuning.log". Is that the one? It is in the folder "/boot/config/plugins/parity.check.tuning". I saved all of the files from this specific folder.

 

I attached the file and hope this helps to make things clear. Maybe you need another file or something else? Just ask and I can provide it. Thank you for your help. The temperature feature is THE reason I installed this plugin.

 

Kind Regards

 

Nils

 

 

EDIT:

What I don't understand: when the check is running at about 260MB/s, why should it need over 13 hours when it reaches 0,7% in less than a minute? If 0,7% took exactly a minute, the estimated time should be about 143 minutes.

 

Edit 2:

I resumed and it stopped again at 1,4% (2x0,7...). with disk temperatures of 38, 39 and 40°C. 

What I noticed: I could not click on resume, because the button showed "pause". Refreshing the site helps and the button showed "resume".

 

Edit 3:

Same with 2,1%. Same temperatures as with 1,4%. It could be a very big coincidence, but it seems to pause with every 0,7% of progress.

 

Edit 4:

2,8% was no problem, but it stopped at 2,9%. Maybe some rounding? I resumed and disabled pause when overheating. When the check reached 4,2% I re-enabled the pause when overheating.

0,4% later, at 4,6%, it paused again. I attached that log, too.

 

parity.check.tuning.log

parity.check.tuning.log

Edited by Marino
Link to post
52 minutes ago, Marino said:

but I've only a "parity.check.tuning.log"

That is the correct file - sorry about giving you the wrong name initially.

54 minutes ago, Marino said:

What I don't understand: when the check is running at about 260MB/s, why should it need over 13 hours when it reaches 0,7% in less than a minute? If 0,7% took exactly a minute, the estimated time should be about 143 minutes

Early on the speed is just a rough estimate - it gets more accurate as you go further.
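
As a rough cross-check of the two figures in the question, here is a small Python sketch of both estimates. It is only arithmetic, not how Unraid or the plugin calculates the estimate. It assumes the parity drive is the 12 TB drive mentioned later in the thread and that the sizes and speeds are in decimal units; the function names are made up for illustration.

```python
def eta_minutes_from_progress(percent_done: float, minutes_elapsed: float) -> float:
    """Extrapolate the total run time from progress so far (linear assumption)."""
    return minutes_elapsed * 100.0 / percent_done


def eta_hours_from_speed(size_bytes: float, bytes_per_second: float) -> float:
    """Total run time if the whole drive is read at a constant speed."""
    return size_bytes / bytes_per_second / 3600.0


if __name__ == "__main__":
    # 0.7% in one minute, as in the question above.
    print(round(eta_minutes_from_progress(0.7, 1.0)))     # 143
    # Assumption: 12 TB parity drive (12e12 bytes) read at 260 MB/s (260e6 B/s).
    print(round(eta_hours_from_speed(12e12, 260e6), 1))   # 12.8
```

Under those assumptions the two estimates differ by a factor of about five, so at least one of the inputs (elapsed time, percentage, or speed) must be rougher than it looks.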

 

56 minutes ago, Marino said:

What I noticed: I could not click on resume, because the button showed "pause". Refreshing the site helps and the button showed "resume".

 

You are correct - you have to do a refresh to get it to show correctly.   I have asked if there is any way to force a refresh from within the plugin but had no reply that gives a way to do this.

 

58 minutes ago, Marino said:

Same with 2,1%. Same temperatures as with 1,4%. It could be a very big coincidence, but it seems to pause with every 0,7% of progress.

It will be a coincidence as there is nowhere in the plugin that is monitoring the percentage - it is just used for display purposes.

 

I'll give more feedback when I have had a chance to look at that log

Link to post

@Marino Found out what looks like the cause of your temperature problems.   I think the plugin is working correctly but you have misunderstood the way the temperature values are used for the temperature related pause/resume :(   From the log I think that you have entered actual temperatures rather than the amount away from the warning threshold set for the drive in the drive's settings?

The reason you do not get an immediate pause is that the task that looks for over-heating drives only runs at regular intervals (currently set to be every 7 minutes). 

 

As an example if the warning threshold on a particular drive is 50C then entering values of 

Pause=2 means pause at 48C (50-2)

Resume=7 means resume at 43C (50-7)

Unraid provides a global value for the warning threshold under Settings->Disk Settings but allows you to override the global value at the individual drive level by clicking on it on the Main tab. The plugin works this way as different drives can have different values set at the unRaid level so using relative values means each drive can potentially have different pause/resume temperatures. 
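
The worked example above can be written as a short Python sketch. This is not the plugin's code; the function name `thresholds` is made up for illustration, and the 45C warning value in the second call is a hypothetical figure chosen only to show what happens when actual temperatures such as 47 and 42 are entered in place of offsets.

```python
from typing import Tuple


def thresholds(warning: int, pause_offset: int, resume_offset: int) -> Tuple[int, int]:
    """Return (pause_at, resume_at) for one drive.

    Both offsets are degrees BELOW the drive's warning threshold,
    as described in the post above.
    """
    return warning - pause_offset, warning - resume_offset


if __name__ == "__main__":
    # The example from the post: warning 50C, Pause=2, Resume=7.
    print(thresholds(50, 2, 7))     # (48, 43)

    # Hypothetical 45C warning with absolute temperatures entered as offsets.
    # The pause point ends up below zero, so every drive would count as too hot.
    print(thresholds(45, 47, 42))   # (-2, 3)
```

Because the calculation is done per drive, two drives with different warning thresholds get different pause and resume temperatures from the same pair of plugin settings.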

 

Can you please confirm that my analysis is correct?   If it is I will enhance the built-in help with a worked example of the type given above.  I will also add some sort of upper limit to the values that can be entered to try and pick this type of misunderstanding up from the outset on the plugin's settings page.

Link to post

Yes, this is correct. I entered 47°C for pause and 42°C for resume. Could it be that this was the correct way to enter the temperature in the past? I haven't used the Unraid server very much in the last few years. Before that I used the plugin and it was working.

 

Maybe these are my old settings on a newer plugin? I also used time for pause and resume in the past. Now I see it is in crontab format. This has changed too. Maybe the settings for temperature too?

 

Thanks for looking at the log. I am in the middle of checking parity as I am getting a new drive which should be the new parity drive, and the "old 12 TB" drive should then become a data disk. In the past the temperatures were increasing very much while building parity and I don't know why. This plugin should help me to get this job done without any damage.

 

The first time I built parity on this (3x12TB), the disks reached 53°C while in the middle of the airflow. I built the first parity with an open window in winter (cold and dry outside)... That's why the plugin is important for me. Now 60% of the parity has been checked and the temperatures are the same as at the beginning (39-40°C). I don't know why they increased that much when building the first parity.

Link to post
31 minutes ago, Marino said:

Yes, this is correct. I entered 47°C for pause and 42°C for resume. Could it be that this was the correct way to enter the temperature in the past? I haven't used the Unraid server very much in the last few years. Before that I used the plugin and it was working

It has always been specified this way. At one point the temperature option was not working properly, so it may well be that when you had that setting it was simply having no effect.

 

32 minutes ago, Marino said:

I also used time for pause and resume in the past. Now I see it is in crontab format. This has changed too. Maybe the settings for temperature too?


If you have specified Daily (which is the default) for the frequency then you specify the time in hours + minutes. Originally this was the only option. If you specify Custom as the frequency then you can use crontab format, which gives you more control at the expense of not being as convenient to use. This was added some time ago now and is very useful to me when testing as it allows for options that are not simply daily.

Link to post

Maybe that is the case. I set it to 47°C, the setting was not working, and I did not notice because the check ran without problems and the disks weren't hot. The last time I checked the parity was over a year ago (most of the time the server was not running).

 

Good to hear that Daily has the right time format. Because the server is switched off most of the time, I'll start the parity check manually, so the "normal" time format fits better for me ;)

 

Thank you for your help and for your awesome plugin!

Link to post
  • 3 weeks later...

Having a problem with parity check reporting overheated, but I cannot figure out why:

 

Quote

2021 May 01 12:39:15 TOWER Parity Check Tuning: TESTING: ----------- UPDATECRON begin ------
2021 May 01 12:39:15 TOWER Parity Check Tuning: TESTING: Deleted cron marker file 
2021 May 01 12:39:15 TOWER Parity Check Tuning: DEBUG:   created cron entry for scheduled pause and resume
2021 May 01 12:39:15 TOWER Parity Check Tuning: DEBUG:   created cron entry for default monitoring 
2021 May 01 12:39:15 TOWER Parity Check Tuning: TESTING: updated cron settings are in /boot/config/plugins/parity.check.tuning/parity.check.tuning.cron
2021 May 01 12:39:15 TOWER Parity Check Tuning: TESTING: ----------- UPDATECRON end ------
2021 May 01 13:17:02 TOWER Parity Check Tuning: TESTING: ----------- MONITOR begin ------
2021 May 01 13:17:02 TOWER Parity Check Tuning: TESTING: progress marker file present
2021 May 01 13:17:02 TOWER Parity Check Tuning: TESTING: disks marker file present
2021 May 01 13:17:02 TOWER Parity Check Tuning: TESTING: hot marker file present
2021 May 01 13:17:02 TOWER Parity Check Tuning: DEBUG:   Parity check appears to be paused
2021 May 01 13:17:02 TOWER Parity Check Tuning: TESTING: plugin temperature settings: Pause 3, Resume 8
2021 May 01 13:17:02 TOWER Parity Check Tuning: TESTING: Drive 84 appears to be critical
2021 May 01 13:17:02 TOWER Parity Check Tuning: TESTING: parity temp=84, status=critical (drive settings: hot=27, cool=18)
2021 May 01 13:17:02 TOWER Parity Check Tuning: TESTING: Drive 84 appears to be critical
2021 May 01 13:17:02 TOWER Parity Check Tuning: TESTING: disk1 temp=84, status=critical (drive settings: hot=27, cool=18)
2021 May 01 13:17:03 TOWER Parity Check Tuning: TESTING: Drive 82 appears to be critical
2021 May 01 13:17:03 TOWER Parity Check Tuning: TESTING: disk2 temp=82, status=critical (drive settings: hot=27, cool=18)
2021 May 01 13:17:03 TOWER Parity Check Tuning: TESTING: Drive 91 appears to be critical
2021 May 01 13:17:03 TOWER Parity Check Tuning: TESTING: disk3 temp=91, status=critical (drive settings: hot=27, cool=18)
2021 May 01 13:17:03 TOWER Parity Check Tuning: DEBUG:   array drives=4, hot=4, warm=0, cool=0
2021 May 01 13:17:03 TOWER Parity Check Tuning: Paused Correcting Parity Check  (3.7%% completed): Following drives overheated: parity(84) disk1(84) disk2(82) disk3(91) 
2021 May 01 13:17:03 TOWER Parity Check Tuning: TESTING: PAUSE (HOT) record to be written
2021 May 01 13:17:04 TOWER Parity Check Tuning: TESTING: written PAUSE (HOT) record to  progress marker file 
2021 May 01 13:17:04 TOWER Parity Check Tuning: TESTING: ----------- MDCMD begin ------
2021 May 01 13:17:04 TOWER Parity Check Tuning: TESTING: progress marker file present
2021 May 01 13:17:04 TOWER Parity Check Tuning: TESTING: disks marker file present
2021 May 01 13:17:04 TOWER Parity Check Tuning: TESTING: hot marker file present
2021 May 01 13:17:04 TOWER Parity Check Tuning: DEBUG:   detected that mdcmd had been called from sh with command mdcmd nocheck PAUSE 
2021 May 01 13:17:04 TOWER Parity Check Tuning: TESTING: CANCELLED record to be written
2021 May 01 13:17:04 TOWER Parity Check Tuning: TESTING: written CANCELLED record to  progress marker file 
2021 May 01 13:17:05 TOWER Parity Check Tuning: TESTING:  array operation still running - so not time to analyze progess
2021 May 01 13:17:05 TOWER Parity Check Tuning: TESTING: Deleted cron marker file 
2021 May 01 13:17:05 TOWER Parity Check Tuning: DEBUG:   created cron entry for scheduled pause and resume
2021 May 01 13:17:05 TOWER Parity Check Tuning: DEBUG:   created cron entry for default monitoring 
2021 May 01 13:17:05 TOWER Parity Check Tuning: TESTING: updated cron settings are in /boot/config/plugins/parity.check.tuning/parity.check.tuning.cron
2021 May 01 13:17:05 TOWER Parity Check Tuning: TESTING: ----------- MDCMD end ------
2021 May 01 13:17:05 TOWER Parity Check Tuning: TESTING: Heat notification message: Pause: Following drives overheated: parity(84) disk1(84) disk2(82) disk3(91) 
2021 May 01 13:17:06 TOWER Parity Check Tuning: TESTING: Send notification: Pause: Following drives overheated: parity(84) disk1(84) disk2(82) disk3(91) <br>Correcting Parity Check (3.7%% completed)
2021 May 01 13:17:06 TOWER Parity Check Tuning: TESTING: ... using /usr/local/emhttp/webGui/scripts/notify -e 'Parity Check Tuning' -i 'normal' -s '[TOWER] Pause' -d 'Following drives overheated: parity(84) disk1(84) disk2(82) disk3(91) <br>Correcting Parity Check (3.7%% completed)'
2021 May 01 13:17:10 TOWER Parity Check Tuning: TESTING: ----------- MONITOR end ------
 

My drive settings have the temp warning at 45C and each drive shows the same. I don't know where the 27C value for the shutdown is coming from.

 

Any Ideas where to check?

Link to post

Strange - it looks like some of those messages may be reported in Celsius and others in Fahrenheit?    I wonder if that is causing an inconsistency somewhere?   What temperature unit do you have set under the unRaid Display settings?   Also, can you check which version of the plugin is installed?
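
To make the suspected unit mix-up easier to see, here is a small Python sketch that pulls the numbers out of one of the TESTING lines quoted above and shows what the reading would be if it were in Fahrenheit. It only illustrates the suspicion and is not a diagnosis. The regular expression is written against the log lines in this thread, and the function names are made up for illustration.

```python
import re
from typing import Optional

# Matches lines such as:
#   "... disk3 temp=91, status=critical (drive settings: hot=27, cool=18)"
LINE_RE = re.compile(
    r"(?P<drive>\w+) temp=(?P<temp>\d+), status=(?P<status>\w+) "
    r"\(drive settings: hot=(?P<hot>\d+), cool=(?P<cool>\d+)\)"
)


def fahrenheit_to_celsius(temp_f: float) -> float:
    return (temp_f - 32) * 5 / 9


def parse_temp_line(line: str) -> Optional[dict]:
    """Extract drive name, reading and thresholds from one log line."""
    match = LINE_RE.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    return {
        "drive": fields["drive"],
        "status": fields["status"],
        "temp": int(fields["temp"]),
        "hot": int(fields["hot"]),
        "cool": int(fields["cool"]),
    }


if __name__ == "__main__":
    line = ("2021 May 01 13:17:03 TOWER Parity Check Tuning: TESTING: "
            "disk3 temp=91, status=critical (drive settings: hot=27, cool=18)")
    info = parse_temp_line(line)
    as_celsius = fahrenheit_to_celsius(info["temp"])
    print(f'{info["drive"]}: {info["temp"]} read as Fahrenheit is '
          f'{as_celsius:.1f}C, hot threshold in the log is {info["hot"]}')
```

Read as Fahrenheit, the 82 to 91 readings in the log would be roughly 28C to 33C, which would be consistent with readings and thresholds being on different scales.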

 

It might also be useful if you can provide the contents of your config/plugins/dynamix/dynamix.cfg file from the flash drive so I can check the temp unit setting on your system. I would expect the temperatures in the log messages to have C or F appended and that does not appear to be happening. On checking the code, the only way I can see this happening is if a temperature unit type is not set in the .cfg file, as the plugin does not assume any default (which I can change so it does).

 

EDIT: I have found that you can definitely get some unexpected behaviour if the temperature unit is not set at the unRaid level. I am ready to push out an update for the plugin that applies a default if not set at the unRaid level. I would be grateful if you could let me know whether going into Settings -> Display Settings, setting Celsius and hitting Apply fixes your problem, as that would confirm I am fixing the correct issue.
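
The idea of applying a default when the unit is not set can be sketched in Python as below. This is not the plugin's code. It assumes the file has a `[display]` section with a `unit="C"` style entry, as in the dynamix.cfg posted in this thread, and the function name `read_temp_unit` is made up for illustration.

```python
import configparser


def read_temp_unit(cfg_text: str, default: str = "C") -> str:
    """Return "C" or "F" from dynamix.cfg-style text, or a default if unset.

    Interpolation is switched off because values such as date="%c"
    contain a percent sign that configparser would otherwise try to expand.
    """
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(cfg_text)
    raw = parser.get("display", "unit", fallback="")
    # Values in the file are wrapped in double quotes, e.g. unit="C".
    unit = raw.strip().strip('"').upper()
    return unit if unit in ("C", "F") else default


if __name__ == "__main__":
    with_unit = '[display]\ndate="%c"\nunit="F"\n'
    without_unit = '[display]\ndate="%c"\n'
    print(read_temp_unit(with_unit))      # F
    print(read_temp_unit(without_unit))   # C (the default is applied)
```

Falling back to a fixed unit keeps every temperature comparison on one scale even when the setting has never been saved.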

 

 

 

 

Link to post
Quote

[parity]
mode="3"
day="6"
hour="0 0"
write=""
dotm="1"
[ssmtp]
service="::NO:NO:none"
SetEmailPriority="True"
Subject="unRAID Status: "
port="465"
UseTLS="YES"
UseSTARTTLS="NO"
UseTLSCert="NO"
[notify]
entity="1"
normal="3"
warning="3"
alert="3"
unraid="3"
plugin="3"
docker_notify="3"
report="3"
display="0"
date="d-m-Y"
time="H:i"
position="top-right"
path="/tmp/notifications"
system="*/1 * * * *"
unraidos="11 0 * * 1"
version="10 0 1 * *"
docker_update="10 0 1 * *"
status="20 0 * * 1"
[display]
font=""
date="%c"
number=".,"
scale="-1"
tabs="1"
users="Tasks:3"
resize="0"
wwn="0"
total="1"
usage="0"
header=""
background=""
banner=""
dashapps="icons"
theme="white"
text="1"
unit="C"
 

 

Here is the dynamix.cfg (some entries removed email/password etc).

 

I did try to make sure that the Display settings were C.  When I viewed the settings, it was C but I toggled it to F and back to C.  Then resumed parity check.  It failed again.

Link to post
  • 4 weeks later...

New feature request

 

Run parity check after X amount of data has been written to the array since the last parity check.

 

I've been wondering how much sense it makes to do a parity check when we aren't writing much data to the array?

Link to post
14 minutes ago, badi95 said:

New feature request

 

Run parity check after X amount of data has been written to the array since the last parity check.

 

I've been wondering how much sense it makes to do a parity check when we aren't writing much data to the array?

Parity has nothing to do with individual data files; it's calculated across the entire capacity, full or empty, of all the drives.

 

If parity is not able to be checked successfully, a failed disk will not rebuild properly. Think of it like a confidence check.
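
A toy Python sketch of why the amount of data written does not matter is given below. It assumes single parity computed as a byte-wise XOR across all data drives, and it uses tiny equal-sized byte strings as stand-in "disks"; the function names are made up for illustration.

```python
from functools import reduce
from typing import List


def xor_parity(disks: List[bytes]) -> bytes:
    """Byte-wise XOR across equal-length disks, used or unused space alike."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*disks))


def rebuild(surviving: List[bytes], parity: bytes) -> bytes:
    """Recover the one missing disk from the surviving disks plus parity."""
    return xor_parity(surviving + [parity])


if __name__ == "__main__":
    disks = [
        b"\x01\x02\x03\x04",   # disk with data
        b"\x00\x00\x00\x00",   # "empty" disk still takes part in parity
        b"\xff\x10\x20\x30",   # another disk with data
    ]
    parity = xor_parity(disks)

    # Pretend disk index 2 failed and rebuild it from the rest.
    recovered = rebuild([disks[0], disks[1]], parity)
    print(recovered == disks[2])   # True
```

The rebuild only works if every byte position of parity is correct, which is what a parity check confirms regardless of how many files have changed.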

Link to post
2 hours ago, jonathanm said:

Parity has nothing to do with individual data files; it's calculated across the entire capacity, full or empty, of all the drives.

 

If parity is not able to be checked successfully, a failed disk will not rebuild properly. Think of it like a confidence check.

I guess my question is, if the data isn't changing much, do we need to check as often? If I was last confident in my parity at 2TB, do I need to re-check my confidence at 2.01 TB or can I wait until 2.1TB? Is confidence inversely proportional to amount of new data written or is it based on another factor like time?

Link to post

Time is probably the key criterion for when checks should be done. Most people seem to select something like monthly or quarterly for scheduled checks.

 

Link to post
  • 2 weeks later...

Issues with the temperature pausing/restart. It's pausing when disks are still in normal temp range. 

I started using this plugin because I had a case that I had neglected and needed to finish a parity check before cleaning it out. I don't have AC in the room the computer normally runs, so I figured this would also help with those upcoming hot summer days/nights. Run it at night to do parity checks and pause in case there is a heat issue.

 

Anyway, I replaced a drive and I'm trying to do a rebuild on the drive. But it keeps pausing the rebuild even though temps are nowhere near the warning state. (I have a warning temp on all drives of 113 Fahrenheit or 45 Celsius.) It was also doing the same when it was running the parity check. It would pause when temps were still well below, though now it seems even more sensitive when running the rebuild.

 

The plugin is set to pause at -1 (so 114 Fahrenheit or 46 Celsius) and restart at 8 degrees below the warning temp. (105 Fahrenheit or 37 Celsius).  Assuming I have those numbers accurate in what I understand from this forum and the help info, then it's pausing when the disks are well below those temps.
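
The numbers above can be checked with a short Python sketch. It is only arithmetic and not the plugin's code, and the thread does not say which unit the plugin applies the offsets in, so both readings are shown. The function names are made up for illustration.

```python
from typing import Tuple


def c_to_f(temp_c: float) -> float:
    return temp_c * 9 / 5 + 32


def f_to_c(temp_f: float) -> float:
    return (temp_f - 32) * 5 / 9


def thresholds(warning: float, pause_offset: float, resume_offset: float) -> Tuple[float, float]:
    """Offsets are degrees below the warning value, in the same unit as it."""
    return warning - pause_offset, warning - resume_offset


if __name__ == "__main__":
    # Offsets applied in Celsius to the 45C warning.
    pause_c, resume_c = thresholds(45, -1, 8)
    print(pause_c, resume_c, round(c_to_f(pause_c), 1), round(c_to_f(resume_c), 1))

    # Offsets applied in Fahrenheit to the 113F warning.
    pause_f, resume_f = thresholds(113, -1, 8)
    print(pause_f, resume_f, round(f_to_c(pause_f), 1), round(f_to_c(resume_f), 1))
```

An offset of 8 degrees is not the same size in both units: 8 below 45C is 37C (98.6F), while 8 below 113F is 105F (about 40.6C). Either way, drives at around 30C are well below both resume points.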

 

I had my system set to Fahrenheit. I changed it to Celsius to see if that would affect the problem, since I saw a few posts back that you suggested it in case the default wasn't set. But it doesn't seem to matter. (Though with Celsius and now doing a rebuild, it seems almost more touchy than it was when it was set to Fahrenheit during the parity check.)

 

The latest pause pushed out this notification:

Parity Check Tuning: 2021-06-01 13:35

[TOWER] Pause
Following drives overheated: parity(31C) disk1(31C) disk2(32C) disk3(28C) disk4(29C) disk5(31C) disk6(31C) disk7(26C) disk8(31C)
Parity Sync/Data Rebuild (2.0%% completed)"

 

I cleaned out the case and removed the dead fan that was the primary culprit of the heat issue I was having. I'm not sure why it's stopping as you can see none of the drives are overheating. I would think it would just tell which drive was overheating and not push out info for all 9 drives, but it's saying all 9. 

 

If I can help provide anything else, please let me know. 

 

I've actually disabled the temp pause feature for the time being so the rebuild can continue, and it's been running with the temperature of the drives staying pretty consistently well below 32 Celsius. But I've got more changes to the system coming. (Replacing another fan, replacing another drive, maybe adding a second parity drive, assuming I can find a way to power an extra drive... honestly I'm running out of room in my current case, so I'm not sure yet... but I know I need some more space coming up. But I'm rambling now... lol. ) 

 

Thanks for the plugin, and I appreciate your continued work on it.

Link to post

If you turn on the Testing mode logging and select the option to write to the flash drive that should create a log file that will provide enough information to try and work out why you are getting problems.

 

Can  you also please confirm what version of the plugin you are running?

 

Link to post

I'm using Unraid version 6.9.1 2021-03-08 with the Parity Check Tuning plugin version 2021.05.14

 

I turned on the logging. I then started a parity check. It ran for a few minutes and then it was paused by the plugin. About 20 minutes passed by and it did not restart. I ended up cancelling the parity check and turning off logging. The end file is what I've attached.


 

 

Edited by tnorman
Link to post

Strange: the log shows that the plugin thinks the warning threshold for the drive is set to 0C at the UnRaid level :(  I’ll think about what you can provide that will help me track down how this can happen.

 

As an aside, you seem to have set the pause threshold to a negative value - is this intentional? It would mean that if everything was working as expected you would only get the pause when the temperature went ABOVE the warning value. It would also mean you would normally get a temperature warning from UnRaid as the temperature crossed the warning level. Normally this setting in the plugin would be a small positive value to pause at that amount below the UnRaid warning threshold.
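
The effect of a warning threshold of 0C can be sketched in Python using the drive temperatures from the notification quoted earlier. This is not the plugin's code. The exact comparisons (hot at or above the pause point, cool at or below the resume point) are an assumption, and the function names are made up for illustration.

```python
from typing import Dict

# Drive temperatures (Celsius) from the pause notification in this thread.
TEMPS = {
    "parity": 31, "disk1": 31, "disk2": 32, "disk3": 28, "disk4": 29,
    "disk5": 31, "disk6": 31, "disk7": 26, "disk8": 31,
}


def classify(temp: int, warning: int, pause_offset: int, resume_offset: int) -> str:
    """Assumed rule: hot at or above the pause point, cool at or below the resume point."""
    pause_at = warning - pause_offset
    resume_at = warning - resume_offset
    if temp >= pause_at:
        return "hot"
    if temp <= resume_at:
        return "cool"
    return "warm"


def summarize(temps: Dict[str, int], warning: int, pause_offset: int, resume_offset: int) -> Dict[str, int]:
    counts = {"hot": 0, "warm": 0, "cool": 0}
    for temp in temps.values():
        counts[classify(temp, warning, pause_offset, resume_offset)] += 1
    return counts


if __name__ == "__main__":
    # Warning seen as 0C, Pause=-1, Resume=8: pause point 1C, resume point -8C.
    print(summarize(TEMPS, 0, -1, 8))    # {'hot': 9, 'warm': 0, 'cool': 0}
    # Warning of 45C as set on the drives: pause point 46C, resume point 37C.
    print(summarize(TEMPS, 45, -1, 8))   # {'hot': 0, 'warm': 0, 'cool': 9}
```

Under these assumptions a warning value of 0C would explain both symptoms reported above: all nine drives are listed as overheated, and the operation never resumes because no drive can get down to -8C.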

Link to post
