unraid-tunables-tester.sh - A New Utility to Optimize unRAID md_* Tunables



While your comments are very interesting, they are really outside the scope of the tunables tester.

 

Going back to the car analogy, the tunables tester is trying to identify the best gas and oil to use to get the manufacturer's rated horsepower.  You bought a car with 300 horsepower, but for some reason you're only getting 150 hp, so we test all the variables to find out why you're not getting what you bought.  Perhaps your car needs 93 octane and 5w40, but you've been running 87 octane and 10w30.  Or maybe you need a tuneup with new spark plugs.  Simple fixes to eliminate performance issues, and each car might have slightly different issues.

 

What you're talking about is swapping out camshafts, porting and polishing, upgrading the fuel system, and adding NOS, trying to get the maximum amount of power from the engine and hoping it doesn't blow.

 

The tunables tester should be limited to testing the user settable tuning values that Limetech provides access to in the GUI.  Hence the name - Unraid Tunables Tester.

 

Everything else that you're talking about sounds like a discussion you should be having with Limetech, or developing brand new tuning tools way beyond the scope of the tunables tester.

 

Years ago, when I upgraded from v4.x to v5.x, my parity check times immediately increased from 6.5 hours to 12+ hours.  I was fine with 6.5 hours, but knew there was a problem with the nearly doubled parity check times.  Thus the tunables tester was born, and I was able to get back to my 6.5 hour parity checks.  My system does everything I ask of it, and while it's certainly not the fastest server in the world, performance is fine and I can watch movies while a parity check is running with no stuttering.  I can simultaneously run 4 Windows 2016 server VMs too.  Not sure what else I could ask for.

 

I'm the type of person that is not interested in modifying hidden tuning parameters trying to eke out a bit of extra performance beyond what stock Unraid offers, and I would also be slow to upgrade to a new Unraid version that has made major changes to the I/O scheduler.  My 67 TB of data, and my time not spent doing data recovery, is much more important to me than a bit more performance that I probably won't even notice.  I'm still running 6.6.6 because I've seen things in 6.6.7 and 6.7.x that have me waiting for a more robust update.

 

Don't get me wrong, I think your ideas are great and I hope you pursue them.  If you can get Limetech to implement these enhancements in core Unraid, that would benefit us all, and after a few months of letting everyone else beta test it for me, I would happily upgrade for the free performance boost.

 

But elevating Unraid to a higher level of performance is not what the tunables tester is about - it's about fixing problematic settings that are dramatically hurting performance on certain machines.

 

2 hours ago, Pauven said:

But elevating Unraid to a higher level of performance is not what the tunables tester is about - it's about fixing problematic settings that are dramatically hurting performance on certain machines.

 

Indeed. I initially looked at this post for a hopeful answer to abysmal parity check times.
For me, it takes just shy of 1.5 days to complete a parity check on 24 8TB drives, totalling 176TB, with dual parity bringing the raw platters to 192TB.
Obviously I have roughly 3 times your surface area, which should come out to around 18-20 hours of parity check time, assuming similar speeds through and through. 8TB drives are substantially faster thanks to the higher density and various mechanical and controller improvements. I should be able to read at ~200MB/s off each drive. Thanks to a port expander bottleneck, that is not the case. (The math: 2 * 4 links = 8 links; 8 links * 3Gb/s = 24Gb/s; 24Gb/s / 24 drives = 1Gb/s per drive = ~125MB/s, give or take, maximum theoretical bi-directional.)
But even knowing that, my speeds are still slower than they should be - around 60MB/s for the majority of the parity check. Something else seems fishy here. Running the script yielded a ~10MB/s improvement for me, so I'll take what I can get until I can afford to toss money at the real problem.

 

 

And yes, the things I want to focus on are outside of the scope of this script - which is another reason I don't want to adopt it as a project.

Also, the I/O scheduler has nothing to do with the actual data sent to/from the disk - just when and where that data is sent or retrieved. Unraid currently ships with noop as the default scheduler for all block devices. Noop is generally bad for rotational media, as it doesn't do any planning for storing the data contiguously on the platters. This would be suitable for the parity disks, since that data will largely be unordered anyway. But for the disks storing filesystem data, it can have a negative impact on large file read performance (seeks will occur in large files) and write performance (holes will be scattered, resulting in excessive seeks for writes) over time. It seems like an odd decision on Limetech's part. They haven't included any advanced schedulers in their kernel build, so testing this will take me a lot of time. My plan is to enable DKMS so that nvidia-*-dkms packages can be leveraged (to support more than current-generation cards only), as well as trying to leverage BFQ and/or Kyber. I'll test that on a smaller, non-important box though, when I have the time.

12 minutes ago, Xaero said:

For me, it takes just shy of 1.5 days to complete a parity check on 24 8TB drives, totalling 176TB, with dual parity bringing the raw platters to 192TB.
Obviously I have roughly 3 times your surface area, which should come out to around 18-20 hours of parity check time, assuming similar speeds through and through. 8TB drives are substantially faster thanks to the higher density and various mechanical and controller improvements. I should be able to read at ~200MB/s off each drive. Thanks to a port expander bottleneck, that is not the case. (The math: 2 * 4 links = 8 links; 8 links * 3Gb/s = 24Gb/s; 24Gb/s / 24 drives = 1Gb/s per drive = ~125MB/s, give or take, maximum theoretical bi-directional.)
But even knowing that, my speeds are still slower than they should be - around 60MB/s for the majority of the parity check.

If you are using older hardware, you should read this:

 

   https://forums.unraid.net/topic/49598-unraid-server-release-620-rc4-available/#comment-487759

 

 

Dual parity is a very complicated bit of matrix algebra and requires a lot of CPU horsepower if it has to be done by brute force.  But special instructions were added to CPUs to do this calculation, which reduces the number of CPU cycles required.   (Intel and AMD realized that these functions were needed if they wanted to supply the CPUs for the big server farms, most of which are apparently using dual parity.)

Edited by Frank1940
1 hour ago, Frank1940 said:

If you are using older hardware, you should read this:

 

   https://forums.unraid.net/topic/49598-unraid-server-release-620-rc4-available/#comment-487759

 

 

Dual parity is a very complicated bit of matrix algebra and requires a lot of CPU horsepower if it has to be done by brute force.  But special instructions were added to CPUs to do this calculation, which reduces the number of CPU cycles required.   (Intel and AMD realized that these functions were needed if they wanted to supply the CPUs for the big server farms, most of which are apparently using dual parity.)

I have dual E5-2658 v3s. I believe they should be sufficient.


Just tried this for the first time after procrastinating... I tried it in "Very Fast" mode just now per the instructions and am getting these errors:

unRAID Tunables Tester v2.2 by Pauven
./unraid-tunables-tester.sh: line 80: /root/mdcmd: No such file or directory

./unraid-tunables-tester.sh: line 388: /root/mdcmd: No such file or directory
./unraid-tunables-tester.sh: line 389: /root/mdcmd: No such file or directory
./unraid-tunables-tester.sh: line 390: /root/mdcmd: No such file or directory
./unraid-tunables-tester.sh: line 394: /root/mdcmd: No such file or directory
./unraid-tunables-tester.sh: line 397: /root/mdcmd: No such file or directory
./unraid-tunables-tester.sh: line 400: [: : integer expression expected
Test 1 - md_sync_window=384 - Test Range Entered - Time Remaining: 1s ./unraid-tunables-tester.sh: line 425: /root/mdcmd: No such file or directory
./unraid-tunables-tester.sh: line 429: /root/mdcmd: No such file or directory
Test 1   - md_sync_window=384  - Completed in 4.011 seconds   =   0.0 MB/s
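
Every error above comes from the script calling /root/mdcmd, which does not exist on this system, so each test "completes" with 0.0 MB/s. As a minimal sketch (written in Python rather than the script's own shell, and not taken from any released version of the tester), a pre-flight check could look for the command before starting any tests. The second candidate path and the helper name `find_mdcmd` are assumptions for illustration only.

```python
import os
import shutil

# /root/mdcmd is the path the v2.2 script calls (per the errors above).
# /usr/local/sbin/mdcmd is an assumed alternative location, not confirmed here.
CANDIDATES = ["/root/mdcmd", "/usr/local/sbin/mdcmd"]


def find_mdcmd(candidates=CANDIDATES, search_path=True):
    """Return the first executable mdcmd found, or None if there is none."""
    for path in candidates:
        if os.path.isfile(path) and os.access(path, os.X_OK):
            return path
    # Optionally fall back to whatever "mdcmd" resolves to on PATH.
    return shutil.which("mdcmd") if search_path else None


mdcmd = find_mdcmd()
print("mdcmd found at:", mdcmd if mdcmd else "nowhere, do not start any tests")
```

Failing early like this would avoid the misleading "Completed in 4.011 seconds = 0.0 MB/s" result line.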

 

Just now, tmchow said:

Just tried this for the first time after procrastinating... I tried it in "Very Fast" mode just now per the instructions and am getting these errors:


unRAID Tunables Tester v2.2 by Pauven

 

You are using the version from the OP. It's broken. I posted a fixed version of the original, but note that it's not as effective on Unraid 6.x as the tunables have changed.

2 minutes ago, Xaero said:

You are using the version from the OP. It's broken. I posted a fixed version of the original, but note that it's not as effective on Unraid 6.x as the tunables have changed.

Forgive me, but how do I find your script? There's no search within this thread and I don't want to just keep going page by page through 32 pages :)


My parity checks are currently taking 14 hours, and are reported to run at approx. 109-114MB/s. Here was one recent summary:

 

 

Event: Unraid Parity check
Subject: Notice [TOWER] - Parity check finished (0 errors)
Description: Duration: 15 hours, 15 minutes, 19 seconds. Average speed: 109.3 MB/s
Importance: normal
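
As a quick sanity check on a summary like this one: the reported average speed multiplied by the duration gives the amount of data each drive had to cover. A small sketch of that arithmetic, using decimal units (1 TB = 1,000,000 MB); the function names are just for illustration.

```python
def duration_seconds(hours, minutes, seconds):
    """Convert an h/m/s duration into seconds."""
    return hours * 3600 + minutes * 60 + seconds


def data_covered_tb(avg_mb_per_s, secs):
    """MB/s * s = MB; divide by 1,000,000 to get decimal TB."""
    return avg_mb_per_s * secs / 1_000_000


secs = duration_seconds(15, 15, 19)      # duration from the notification
covered = data_covered_tb(109.3, secs)   # average speed from the notification
print(f"{secs} s at 109.3 MB/s covers about {covered:.2f} TB")
```

The point is that the reported figure is an average over the whole pass, beginning to end, not a peak speed.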

 

After running this tunables script last night, it looks like I'm barely going to get any perf improvements, but the stated speeds (both Bang for buck and unthrottled) are way faster than what my parity checks run at.   What's the reason for the disparity?

 

*******************************************************************************
Completed: 2 Hrs 9 Min 5 Sec.

Best Bang for the Buck: Test 1 with a speed of 135.0 MB/s

     Tunable (md_num_stripes): 1408
     Tunable (md_sync_window): 512

These settings will consume 33MB of RAM on your hardware.


Unthrottled values for your server came from Test 27 with a speed of 138.6 MB/s

     Tunable (md_num_stripes): 2968
     Tunable (md_sync_window): 1336

These settings will consume 69MB of RAM on your hardware.
This is 39MB more than your current utilization of 30MB.
NOTE: Adding additional drives will increase memory consumption.

In unRAID, go to Settings > Disk Settings to set your chosen parameter values.
*******************************************************************************

*******************************************************************************
* It is estimated that the Best Bang for the Buck values will provide 99% of  *
* the performance that the Unthrottled values will deliver, plus they provide *
* much lower memory consumption. The Best Bang for the Buck values may be the *
* smarter choice, especially if you run 3rd party plug-ins and add-ons that   *
* compete for memory.                                                         *
*******************************************************************************

 

Edited by tmchow
2 minutes ago, tmchow said:

After running this tunables script last night, it looks like I'm barely going to get any perf improvements, but the stated speeds (both Bang for buck and unthrottled) are way faster than what my parity checks run at.   What's the reason for the disparity?

The parity check times are an average, and ultimately include the much slower reads from the inner cylinders of the drives.  The tunables tester only runs each test for x number of minutes, and effectively only tests the fastest part of the drives.

On 7/20/2019 at 10:42 PM, Xaero said:

For me, it takes just shy of 1.5 days to complete a parity check on 24 8TB drives, totalling 176TB, with dual parity bringing the raw platters to 192TB.
Obviously I have roughly 3 times your surface area, which should come out to around 18-20 hours of parity check time, assuming similar speeds through and through.

 

Almost 36 hours is indeed too long.  My mixture of 5400 RPM 3TB & 4TB drives, and 7200 RPM 8TB drives finishes in 18.5 hours.  A pure 7200 RPM 8TB setup should complete in under 16.5 hours.  Even slower 5400 RPM 8TB drives should finish in under 22 hours.

 

On 7/20/2019 at 10:42 PM, Xaero said:

Thanks to a port expander bottleneck, that is not the case. (The math: 2 * 4 links = 8 links; 8 links * 3Gb/s = 24Gb/s; 24Gb/s / 24 drives = 1Gb/s per drive = ~125MB/s, give or take, maximum theoretical bi-directional.)
But even knowing that, my speeds are still slower than they should be - around 60MB/s for the majority of the parity check. Something else seems fishy here. Running the script yielded a ~10MB/s improvement for me, so I'll take what I can get until I can afford to toss money at the real problem.

 

I'm not sure that your math accounts for typical communication protocol overhead, nor the inefficiency inherent in port expanders.  I'm happy you did find a 10MB/s improvement.  This is exactly the reason I avoided port expanders.  60-70 MB/s sounds close to what I would expect.

 

2 hours ago, tmchow said:

After running this tunables script last night, it looks like I'm barely going to get any perf improvements, but the stated speeds (both Bang for buck and unthrottled) are way faster than what my parity checks run at.   What's the reason for the disparity?

 

2 hours ago, Squid said:

The parity check times are an average, and ultimately include the much slower reads from the inner cylinders of the drives.  The tunables tester only runs each test for x number of minutes, and effectively only tests the fastest part of the drives.

 

This is absolutely correct.  Though I think many users struggle to understand this without a visual:

 

[Image: HDD throughput chart from a drive review (MB/s across the span of the disk), with three dashed lines added]

 

This chart is from a HDD review, showing throughput speed (MB/s) over the course of the disk.  The lime green line starts off high around 200MB/s at the beginning (outside edge of the disk platter), then tapers off to around 95 MB/s at the end of the disk (inside edge of the disk platter).  This is just a random sample, and doesn't necessarily represent your drives, so this is just to show the concept of what is going on.

 

The average speed of the drive (and the resulting Parity Check) would be around 155 MB/s, not the 200 MB/s peak.

 

The Unraid Tunables Tester only tests the very beginning of the drive (i.e. the first 5-10%, where speeds are the highest).  This is way above the average speed of an entire drive, beginning to end.

 

On this chart I drew three dashed lines. 

 

The green line at the top represents Unraid Tunables set to a value that doesn't limit performance at all.  This is what we are trying to achieve with the Unraid Tunables Tester.

 

The yellow line in the middle represents how a typical system (one without any real performance issue) might perform with stock Unraid Tunables.  Notice that while peak performance is reduced from 200 MB/s to perhaps around 190 MB/s, this slight reduction is only for the first 17% of the drive, beyond which the performance is no longer limited.  A 5% speed reduction for 17% of the drive only reduces average throughput (for the entire drive) by less than 1%, so fixing this issue might only increase average throughput for the entire drive by 1-2 MB/s.  Sure, it's an improvement, but a very small one.
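
The "less than 1%" figure for the yellow line can be checked with a couple of lines of arithmetic. This is only a sketch: it treats the drive-wide average as a simple position-weighted mean and reuses the example figures from the chart (5% reduction, 17% of the drive, ~155 MB/s average), none of which are measurements from any particular server.

```python
def average_loss_fraction(speed_reduction, affected_fraction):
    """Fraction of whole-drive average throughput lost when `speed_reduction`
    (e.g. 0.05 for 5%) applies to only `affected_fraction` of the drive."""
    return speed_reduction * affected_fraction


loss = average_loss_fraction(0.05, 0.17)   # 5% slower over the first 17%
loss_mb_s = loss * 155                     # against a ~155 MB/s average
print(f"average loss: {loss:.2%}, roughly {loss_mb_s:.1f} MB/s")
```

Under these assumptions the loss works out to well under 1%, which lines up with the 1-2 MB/s estimate above.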

 

The red line at the bottom represents how some controllers have major performance issues when using the stock Unraid Tunables - like my controller.  In this case, the throughput is so constrained, over 90% of the drive performs slower than it is capable of performing.  Fixing the Tunables on my system unleashes huge performance gains.

 

Hopefully that helps show why most systems see extremely little improvement from adjusting the Unraid Tunables - these systems are already performing so close to optimum that any speed increase will hardly make a dent in parity check times.  It's only the systems that are misbehaving that truly benefit.

 

Paul

Edited by Pauven
On 7/21/2019 at 3:42 AM, Xaero said:

2 * 4 links = 8 links; 8 links * 3Gb/s = 24Gb/s; 24Gb/s / 24 drives = 1Gb/s per drive = ~125MB/s

I understand from this that you're using a SAS1 expander with dual link. If so, you'll need to account for protocol overhead and 8b/10b encoding: of the 2400MB/s theoretical max you'll get around 2200MB/s usable, so with 24 drives that's around 92MB/s per drive.
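
The same arithmetic as a small sketch, for anyone who wants to plug in their own link and drive counts. The 8b/10b step is standard (10 bits on the wire per byte of data); the 2200MB/s usable figure is simply the estimate from this post, not something the code derives.

```python
def link_payload_mb_s(line_rate_gbit):
    """Payload rate of one link in MB/s with 8b/10b encoding:
    10 bits on the wire carry 1 byte of data."""
    return line_rate_gbit * 1000 / 10


links = 2 * 4          # dual link, 4 lanes each
drives = 24

theoretical = links * link_payload_mb_s(3)   # 3Gb/s SAS1 lanes
usable = 2200                                # estimate after protocol overhead

print(f"theoretical: {theoretical:.0f} MB/s total, "
      f"{theoretical / drives:.0f} MB/s per drive")
print(f"usable:      {usable} MB/s total, "
      f"{usable / drives:.0f} MB/s per drive")
```

This is also why the earlier ~125MB/s figure is too optimistic: dividing the raw 24Gb/s by 8 ignores the encoding overhead.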

Edited by johnnie.black
1 hour ago, Pauven said:

Hopefully that helps show why most systems see extremely little improvement from adjusting the Unraid Tunables - these systems are already performing so close to optimum that any speed increase will hardly make a dent in parity check times.  It's only the systems that are misbehaving that truly benefit.

Yes, thank you!  I'll leave the current settings as they are then and not mess with them.


As mentioned by Pauven, this script can't currently find good tunables because it doesn't test for the new ones. But the default tunables are pretty conservative and can give bad performance, especially for larger arrays. Over the years I've found that the tunables below work well for most hardware configs, and are worth a try, mostly for larger arrays:

Settings -> Disk Settings

Tunable (md_num_stripes): 4096
Tunable (md_sync_window): 2048
Tunable (md_sync_thresh): 2000

 


Well, good job guys.  The conversation has prompted me to find and review my testing documentation, which includes my strategy for the next test routine.

 

And it just so happens that at this rare moment, I find myself with a bit of free time, and craving working on something different for a change of pace.

 

So I think I'm going to try implementing the new testing strategy.  I'll also take a look at Xaero's revision to see what best practices I need to apply.


The new logic is performing over 100 parity check start/stops.  Nothing really new there, but with Unraid v6 there is the adverse effect that these actions are logged in the parity check history, and you get separate on-screen notifications for each of these events.

 

Does anyone know of any way to temporarily disable the logging of these events to the parity check history, and/or temporarily disable the on-screen notifications?

 

If you don't know the answer, but know who might, can you help bring them into this conversation?

 

Paul

Edited by Pauven

I know where the configuration entry is stored, but changing this sounds a little dangerous in case something goes wrong trying to change it back.    Another possibility would be to intercept all calls to the notify script.    If you want to follow up on either of these, it may best be done via PM?

Edited by itimpi

Thought I would give a status update on the development of UTT v4.

 

@itimpi provided some great information, and I now have parity check notifications blocked for the duration of the tests.  The blocking function has safeguards built-in, so even if the script is aborted, within one minute parity check notifications are unblocked (I'm doing this with a flag file that expires after 1 minute).

 

I'm also preventing most of the parity checks from being logged in the parity check history.  I say most, because Unraid forcibly rewrites the status of the very last parity check to the log if you remove it.  Still, much better to have just a single entry instead of hundreds.

 

This next beta is pretty close to a full rewrite, at least as compared to the last v2.2 for Unraid v5.  Tons of new functionality.

 

UTT v2.2 essentially used a one-dimensional array of test values - that's all that was needed.  For Unraid v6 and the new md_sync_thresh and nr_requests, the test results are now being logged in a pseudo-three-dimensional array.  This is much more complicated, but I finally have it working.  I'm still fleshing out some of the new tests and options, and hope to have something for public release next week.
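
To illustrate the idea of results indexed by three tunables at once (this is not how UTT v4 actually stores them, just a sketch in Python), the results can be kept in a mapping keyed by the (md_sync_window, md_sync_thresh, nr_requests) combination. The `measure` callback, the fake speed function, and all the test values below are hypothetical stand-ins for starting a partial parity check and reading back its speed.

```python
from itertools import product


def run_grid(sync_windows, sync_threshes, nr_requests_values, measure):
    """Call measure(window, thresh, nr) for every combination and
    return {(window, thresh, nr): speed_in_mb_s}."""
    results = {}
    for combo in product(sync_windows, sync_threshes, nr_requests_values):
        results[combo] = measure(*combo)
    return results


def best_combo(results):
    """Return the (window, thresh, nr) key with the highest speed."""
    return max(results, key=results.get)


def fake_measure(window, thresh, nr):
    # Made-up speeds so the example runs without touching any array.
    speed = 100.0 + window / 1000
    if thresh == window // 2:
        speed += 5.0
    if nr == 8:
        speed += 1.0
    return speed


results = run_grid([512, 1024], [256, 512], [8, 128], fake_measure)
print(len(results), "combinations tested; best:", best_combo(results))
```

A one-dimensional list, as in v2.2, is the special case where only the first value varies.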

 

Fingers crossed that all this new logic will actually provide accurate tuning parameters for all types of machines...

 

Paul

11 hours ago, jonathanm said:

Cool! Have you tested what happens if the power is killed during a cycle? Obviously a mandatory check should happen at array start, will your safeguards deal with that ok?

 

The end-user notification system just controls emails and the on-screen popups informing you of events.  Whether or not you receive these notifications, the underlying events still occur.  The solution I've implemented defaults to allowing all notifications, and if a flag file is present, it omits the notifications for 'Unraid Parity check' events only, and it does this for a maximum of 60 seconds after the timestamp of the flag file, before reverting back to allowing all notifications.

 

But just because you aren't notified that a parity check is running doesn't mean Unraid can't do a parity check.  After all, the Unraid Tunables Tester starts and stops over a hundred partial parity checks while the parity check notifications are blocked.  UTT actually has to update the flag file hundreds of times during a test (right before every parity check start or stop) to set the current timestamp in order to block the Unraid notification that a parity check has started/stopped.
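
The flag-file safeguard described above can be sketched roughly as follows. This is not the actual UTT or Unraid notification code; it assumes the "timestamp of the flag file" is the file's modification time, and the function names and constants are made up for illustration.

```python
import os
import time

BLOCK_WINDOW_SECONDS = 60             # flag expires after one minute
BLOCKED_EVENT = "Unraid Parity check"  # only this event is ever suppressed


def refresh_flag(flag_path):
    """Create the flag file if needed and set its timestamp to now.
    Would be called right before every parity check start or stop."""
    with open(flag_path, "a"):
        pass
    os.utime(flag_path, None)


def should_suppress(event, flag_path, now=None):
    """Default is to allow everything; suppress only parity check events
    while a fresh (under 60 seconds old) flag file exists."""
    if event != BLOCKED_EVENT:
        return False
    try:
        stamp = os.path.getmtime(flag_path)
    except OSError:
        return False                   # no flag file: allow the notification
    if now is None:
        now = time.time()
    return now - stamp <= BLOCK_WINDOW_SECONDS
```

Because the default path is "allow", an aborted script simply stops refreshing the flag, and notifications come back on their own once the last timestamp is more than a minute old.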

 

Whether or not the notifications are shown or blocked, normal system events like Unraid starting a parity check after a power failure will simply not be affected.  I would also expect that if there was a power failure a split second after the flag file was updated to the current timestamp, that the time for the power to return and for the server to reboot and for Unraid to automatically begin a parity check would exceed 1 minute, so not only would the auto parity check begin but the flag file would have expired and you would receive the notifications too.

 

Oh, and no I haven't tested a power failure on my server.  Never have and never will.

 

Sorry for the long answer.  I presume many users may be concerned over the prospect of having any notifications blocked, so I thought it best to explain in a bit more detail how the safeguards work.  I will also be making the notification blocking feature optional in UTT, so if a user is uncomfortable with this feature they can avoid it.

 

Paul

19 minutes ago, Pauven said:

The solution I've implemented defaults to allowing all notifications, and if a flag file is present, it omits the notifications for 'Unraid Parity check' events only, and it does this for a maximum of 60 seconds after the timestamp of the flag file, before reverting back to allowing all notifications.

 

I do want to give a special credit to @itimpi for this solution.  He showed me how to safely block these notifications.  Also, my original concept was to block them for 24 hours (since a full-scale UTT test cycle can be around 20 hours), and he challenged me on this, and came up with the awesome methodology to block them for only a minute.

 

Paul

1 hour ago, Pauven said:

 

I do want to give a special credit to @itimpi for this solution.  He showed me how to safely block these notifications.  Also, my original concept was to block them for 24 hours (since a full-scale UTT test cycle can be around 20 hours), and he challenged me on this, and came up with the awesome methodology to block them for only a minute.

 

Paul

Glad you found the advice useful :)   If you want to remove the (very remote) possibility of a parity check notification being blocked after an unexpected reboot then your plugin installation logic (which would be run as part of the boot process) could remove the timestamp flag if it exists.    In fact if that flag is being stored in RAM rather than on the USB stick that would already be the default behaviour.

17 minutes ago, itimpi said:

Glad you found the advice useful :)   If you want to remove the (very remote) possibility of a parity check notification being blocked after an unexpected reboot then your plugin installation logic (which would be run as part of the boot process) could remove the timestamp flag if it exists.    In fact if that flag is being stored in RAM rather than on the USB stick that would already be the default behaviour.

 

Technically, UTT is not [yet] a plugin, just a script.  There is no plugin installation logic to run automatically at boot.  While I have thought about turning UTT into a plugin, I'm not there yet.

 

What's the path to store the file in RAM?
