Very slow speeds moving from cache



<TLDR>I'm getting ~20 MB/s read/write speeds when Mover is moving from the SSD cache to the array. What am I doing wrong?</TLDR>

 

Hi all,

 

I'm very new to Unraid and to these forums, so I apologize in advance for any frustration my ignorance will cause. I'm currently evaluating Unraid to see if it is for me, and in the beginning it all looked so promising that I decided to take the plunge and move everything I have in terms of storage into Unraid. I also bought a new 1TB SSD to use as a cache drive, so my disk setup is now:

 

Parity 1: ST8000DM004 8TB SATA

Parity 2: ST8000AS0002 8TB SATA

Disk 1: ST8000DM004 8TB SATA

Disk 2: ST8000AS0002 8TB SATA

Cache: Samsung SSD 860 QVO 1TB SSD

Unassigned pass-through disk: Samsung SSD 970 EVO 1TB NVMe M.2

 

After I added the cache drive, I started getting constant warnings and errors about the cache being full. I fully admit that I don't really understand how to use the cache, and I understood even less in the beginning. I guess I just assumed that if I added a cache drive and used the "Fix Common Problems" plugin, it would tell me if I was using it in a bad way. Anyway, I got frustrated with the constant notifications about the cache, so I changed the cache settings on all shares (I don't remember to what exactly) and started Mover manually.

It started moving at an extremely slow rate, and then something went wrong. Maybe it was because I changed the cache settings while Mover was running, in an attempt to make a difference since it had only moved a few gigabytes after a few hours. The result was that the system became extremely slow and unresponsive, and Mover stopped working altogether. This went on for hours with no change, so I tried to shut down Unraid, without success. In the end I was forced to kill it by holding the power button.

After it came back online, it started a parity check that took many hours, during which Mover woke up and started moving stuff off the cache. This slowed down the parity check which in the end took ~21 hours (and resulted in 6101 errors), but it got there eventually. Mover still keeps running, though, and it's still extremely slow at around 20 MB/s. I think it's scheduled to start at 03:30 in the morning, so by now (14:15 here in Sweden) it's been running for over 10 hours and is about halfway done.

 

The system is not doing anything else; the VM and Docker managers are both stopped. The cache settings right now are either "No" or "Prefer" on all shares, because I feel like I want to get rid of the cache drive. It's only causing me frustration, and as I said, I don't understand how it's supposed to work. Can anyone help me figure out what I'm doing wrong? I've attached diagnostics and a screenshot of the Main tab showing the read/write speeds.

 

Worth noting:
1. One of the drives, the ST8000DM004 used as disk 1, has gotten three "reported uncorrect" warnings in the past few days. It's quite a new disk, so as long as this value stays at 3, I'm writing it off as a one-time issue caused by the troubles I had setting up the array in the beginning.

2. The other array drive, the ST8000AS0002 used as disk 2, has its write cache disabled and ignores my attempts to enable it (roughly what I tried is sketched below).
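For reference, this is more or less what I've been trying from the console; sdX is a placeholder for whatever device disk 2 actually is:

    hdparm -W /dev/sdX     # query the current write-cache state (0 = off, 1 = on)
    hdparm -W1 /dev/sdX    # try to enable write cache; on this drive the setting doesn't stick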

 

If you've made it this far, thank you for taking the time to read all of it. Sorry for all the faux pas I've no doubt committed.

 

/Rickard

slowMover.png

eru-diagnostics-20200413-1326.zip


Ultimately, you want the appdata, domains, and system shares on cache, and staying on cache, so your Dockers' and VMs' performance won't be affected by the slower parity writes, and so they won't keep array disks spinning. And you have plenty of capacity on cache for that, plus caching some user share writes if you want.

 

37 minutes ago, rockard said:

The cache settings right now are either "No" or "Prefer" on all shares

None of your shares are set to Prefer. It looks like the only ones that have contents on cache are set to Yes, and that is good for now. You will make appdata, domains, and system Prefer eventually so they will get moved to cache.

 

Cache has plenty of space now; there's no telling how you were filling it before, since you have changed your settings. Were you trying to cache the initial data load?

 

Mover is intended for idle time, so if you were trying to write to the array and move at the same time, Mover would have been competing with other things for access to the disks. Maybe you were even running a parity check or something similar that slowed things down.

 

Enough about cache and Mover for now, though. Your syslog says you were trying to correct dual parity and getting read errors on disk1.

 

Shut down, check all connections, power and SATA, both ends, including any power splitters. Then start back up and do an extended SMART test on disk1.
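If you prefer the command line to the GUI, something like this should do it; sdX is only a placeholder, so check which device disk1 actually is first:

    smartctl -t long /dev/sdX   # start the extended (long) self-test; the drive runs it internally
    smartctl -a /dev/sdX        # re-run later: "Self-test execution status" shows progress and the result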


Hi,

 

Thanks for your reply! I really appreciate it!

37 minutes ago, trurl said:

Ultimately, you want the appdata, domains, and system shares on cache, and staying on cache, so your Dockers' and VMs' performance won't be affected by the slower parity writes, and so they won't keep array disks spinning. And you have plenty of capacity on cache for that, plus caching some user share writes if you want.

This is what I mean by "I don't understand how to use the cache". Wouldn't that leave my always-on dockers and VMs completely unprotected, since mover won't move things that are in use?

39 minutes ago, trurl said:

None of your shares are set to Prefer. It looks like the only ones that have contents on cache are set to Yes, and that is good for now. You will make appdata, domains, and system Prefer eventually so they will get moved to cache.

Sorry, I meant they are Yes; I set it that way because the description is "Mover transfers files from cache to array", which is what I want if I'm trying to get rid of the cache disk.

42 minutes ago, trurl said:

Cache has plenty of space now; there's no telling how you were filling it before, since you have changed your settings. Were you trying to cache the initial data load?

It has space now because everything has been off for over 24 hours and Mover has been running for over 10 hours, moving things off the cache. I don't know what you mean by "cache the initial data load", but I haven't actively tried to do anything other than make it stop filling up and stop being extremely slow. In fact, I turned off every VM and Docker image, and the cache disk kept filling up. Again, I don't understand how to use it. If I tell the domains share to use the cache, and I have a 1TB vdisk in there, would that not take all the cache space immediately?

48 minutes ago, trurl said:

Mover is intended for idle time, so if you were trying to write to the array and move at the same time, Mover would have been competing with other things for access to the disks. Maybe you were even running a parity check or something similar that slowed things down.

So if I have always-on VMs and Docker images that are constantly accessing their data, a cache disk is pretty much useless? I mean, if I expect something on my Unraid system to be writing to the array at all hours of the day, there will be no "idle time". So every time Mover runs, I should expect it to move things at 20 MB/s while making the system extremely slow and unresponsive. That is the opposite of the expected effect of having a cache for speed. This is what I'm leaning toward, and it is why I'm trying to get rid of it.

 

What I don't understand, though, is that the system cannot be any more idle than it is right now, so why is it so extremely slow? Nothing but Mover is running, and yet it keeps moving files at ~20 MB/s.

 

55 minutes ago, trurl said:

Enough about cache and Mover for now, though. Your syslog says you were trying to correct dual parity and getting read errors on disk1.

 

Shut down, check all connections, power and SATA, both ends, including any power splitters. Then start back up and do an extended SMART test on disk1.

Ok, thanks for the suggestion! I'll definitely do that, but I'll have to wait for Mover to finish, which at this rate will be sometime around midnight. Thank you very much for taking the time to answer and for trying to help; much appreciated!

 

/Rickard

3 hours ago, rockard said:

Wouldn't that leave my always-on dockers and VMs completely unprotected, since mover won't move things that are in use?

You don't want them moved to the array. You want them to stay on cache so they perform better and don't keep array disks spinning. You can have multiple disks in the cache pool for redundancy; a RAID1 mirror is the default. You can also back up some of the things that stay on cache with the CA Backup plugin.

 

3 hours ago, rockard said:

Sorry, I meant they are Yes; I set it that way because the description is "Mover transfers files from cache to array", which is what I want if I'm trying to get rid of the cache disk.

You can run totally without cache if you want, and arguably it would be simpler. But there are definite advantages to having cache as mentioned.

 

3 hours ago, rockard said:

I don't know what you mean by "cache the initial data load"

Some people set things up to cache everything they write to the server, because it's faster. But if you start that way and have a large amount of data to write at the beginning, such as transferring all your files from another system (the initial data load), then of course you will fill the cache, because it won't have enough capacity, and mover will just get in the way if you make it run more often, because it will be competing for the same disks you are writing to. Mover is intended for idle time. There is simply no way to move from cache to the slower array as fast as you can write to the faster cache.

 

3 hours ago, rockard said:

the cache disk kept filling up

All writes to any user share set to cache-prefer or cache-yes will go to cache. And Mover ignores cache-no and cache-only user shares, so just setting a share to cache-no won't get files that are already on cache moved off of it.
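An easy way to see what is actually occupying the cache, assuming the usual /mnt/cache mount point:

    du -sh /mnt/cache/*    # per-share usage on the cache drive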

 

3 hours ago, rockard said:

If I tell the domains share to use the cache, and I have a 1TB vdisk in there, would that not take all the cache space immediately?

If you really have some good reason to make a vdisk that large, then maybe you would have to leave domains uncached. But I think it is pretty rare to make a vdisk that large. Your VMs can access your Unraid storage for general file storage; the vdisk would normally just hold the VM OS.

 

3 hours ago, rockard said:

So if I have always-on VMs and docker images that is constantly accessing it's data, a cache disk is pretty much useless? I mean, if I expect there to be something on my Unraid system that is writing to the array at all hours of the day, there will be no "idle time". So every time mover runs, I will have to expect it to move things at 20 Mb/s, and also making my system extremely slow and unresponsive during that time. That is the opposite expected effect of having a cache for speed. This is kind of what I'm leaning at, and it is why I'm trying to get rid of it.

I think you have missed most of my point. VMs and Dockers can access whatever they need on the array or on cache. It is the appdata, vdisks (domains), and images (system) that I was talking about keeping on cache, since those files are always open regardless of what the Dockers and VMs are doing.

 

I think some of your slowness is likely the result of your disk1 problems, since disk1 is the default target until it reaches the high-water mark.

 

I actually don't cache much since most of the writes to my server are from scheduled backups and queued downloads, so I am not waiting for them to complete anyway. Those all go to cache-no user shares so they are written directly to the array where they are already protected and don't need to be moved.

 

But I still use cache, for the reasons already discussed, for these shares:

  • appdata - Docker working storage; for example, the Plex database (but not the media files themselves, which live on other user shares),
  • domains - the VM OS vdisks (for general storage the VMs just use the Unraid user shares), and
  • system - the libvirt image and the docker image (where the container executable code lives).
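For what it's worth, these per-share settings are just plain-text files on the flash drive. A sketch from memory (verify the exact key name on your own system before relying on it):

    cat /boot/config/shares/appdata.cfg
    # shareUseCache="prefer"    <- the "Use cache disk" setting for that share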

 

3 hours ago, rockard said:

What I don't understand, though, is that the system cannot be any more idle than it is right now, so why is it so extremely slow? Nothing but Mover is running, and yet it keeps moving files at ~20 MB/s.

Maybe the same disk1 problems that were in your syslog when you were doing the dual parity correction, plus the overhead of writing to the parity array. Though it should probably be faster than that, unless you are moving a lot of small files. What Dockers were you running before? Plex appdata is notorious for having a lot of small files.
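One quick way to test the small-file theory; the path assumes appdata is in the default location on cache:

    find /mnt/cache/appdata -type f | wc -l    # tens of thousands of small files would explain the crawl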

 

3 hours ago, rockard said:

I'll definitely do that, but I'll have to wait for Mover to finish, which at this rate will be sometime around midnight.

You might try stopping mover:

 

https://forums.unraid.net/topic/46802-faq-for-unraid-v6/?do=findComment&comment=554749

 

and then fix your other problems first.
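From memory, the mover script itself also accepts a stop argument, but treat the FAQ entry linked above as the authoritative method:

    /usr/local/sbin/mover stop    # ask a running Mover to stop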

4 hours ago, trurl said:

Shut down, check all connections, power and SATA, both ends, including any power splitters. Then start back up and do an extended SMART test on disk1.

5 hours ago, rockard said:

parity check which in the end took ~21 hours (and resulted in 6101 errors)

After you get disk1 sorted out, you must run another correcting parity check. Exactly zero parity errors is the only acceptable result.

 


Hi again,

 

Thanks for the extensive answer! Mover finally finished, and after reseating all the cables at both ends I booted the machine and started an extended SMART test. It's been running for over an hour now and has completed 10%, so I'll leave it overnight again and will probably have to kill it in the morning.

 

I guess my issue with slow Mover speeds is moot, since it finally finished and I will either get rid of the cache or move to something other than Unraid. So the following are just my own reflections on cache drives in Unraid, and can be ignored.

 

Thanks for explaining how the cache drive is used. I'm not sure I really missed your points; it's just that I don't want my appdata to be any less protected than my other data. I don't have enough connectors on my motherboard to connect another cache drive, and even if I did, I think it's absolutely not worth it. I have two parity disks to make sure I can suffer two disk failures and still have all my data intact. To upgrade to a cache pool, I would have to spend more money and only get protection from one disk failure. It seems to me that you can't get uninterrupted uptime (by not having to shut down VMs/dockers), protection against data loss (by writing to a parity-protected array), and speedy transfers (by using an SSD cache) with Unraid, and I must say I'm really surprised by that. The backup app you mention also requires the Docker to be shut down to be backed up.

3 hours ago, trurl said:

There is simply no way to move from cache to the slower array as fast as you can write to the faster cache.

That is absolutely true, and I'm not asking for that. A cache should, in my mind, be a subset of the underlying data, there for faster reads and/or writes. When I hear somebody talk about a cache, I think of a write-through or write-back cache, where your data always reaches the underlying storage, either immediately (at the speed of the underlying storage) or at a later time (at the speed of the cache, up to the size of the cache). I've never thought you would have a cache that doesn't, at some point, write back everything written to it to the underlying storage.

 

Anyway, this journey has been really interesting, and I'm thankful you took the time to try to help me. Cheers!

 

/Rickard

  • 2 weeks later...

Sigh. So I decided to try to make use of my cache drive after all; I thought there must be some way it can be useful. I stopped all my Docker containers, moved the storage paths that I cannot have unprotected (like my database) out of the appdata share, made a backup with the backup app mentioned, set the appdata share to prefer cache, and ran Mover. The speeds were not impressive, but acceptable, so I decided to keep this setup. Because of the hard resets mentioned in my other posts, parity checks have been running more or less constantly since then, and since I used the Mover Tuning app to stop Mover from running during a parity check, tonight was the first night it ran. And. Sigh. It's still running. At a whopping 90 kB/s as I write this. So. Once again, I'm wondering: what am I doing wrong? I've attached another diagnostics zip.

eru-diagnostics-20200428-0849.zip


Nothing obvious in the syslog. I see disk1 completed the extended test without error. Have you tried extended tests on the other disks?

7 hours ago, rockard said:

Because of the hard resets mentioned in my other posts, parity checks have been running more or less constantly since then

Don't understand this part. Do you mean you are repeatedly running parity checks?

 

You mention not having enough ports. Most would argue that a two-disk cache pool is more useful than dual parity when you only have two data disks.

 

But since you are still having problems, I am wondering: is there anything unusual about how these disks are installed and connected?
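If you want to see whether one particular disk is the bottleneck next time Mover crawls, per-disk stats would show it. A sketch, assuming iostat is available (stock Unraid may not ship it; the NerdPack plugin can install the sysstat package):

    iostat -mx 5    # refresh every 5 seconds; look for a disk with high %util but very few MB/s written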

 


Thanks for responding! Good point; I haven't tried extended tests on the other disks, so I'll start one immediately! It took a long time to finish last time, so I expect it to run overnight and be ready sometime tomorrow.

 

3 hours ago, trurl said:
11 hours ago, rockard said:

Because of the hard resets mentioned in my other posts, parity checks have been running more or less constantly since then

Don't understand this part. Do you mean you are repeatedly running parity checks?

Sorry for being unclear. I'll describe the timeline and hope that clears it up:

I was forced to do a hard reset due to a Docker stuck waiting for IO (as mentioned in my other post), and that in turn forced a parity check, which takes around 24 hours. When that finished (with 0 errors), it didn't take long (a few hours maybe) until the same thing happened again. This time, after another ~24 hours, the parity check had found and corrected 1,793 sync errors, and there was another SMART "reported uncorrect" warning on disk1 (up to 5 from the previous value of 4). So I ran another parity check manually and waited another ~24 hours for it to finish. This time it came back with 0 errors, but I wanted to be sure it wasn't a one-time thing, so I ran another manual check, which also came back with 0 errors. It was after this last manual check finished that Mover was finally able to run as scheduled, and I woke up to it plodding along at ~50 kB/s.

 

4 hours ago, trurl said:

You mention not having enough ports. Most would argue that a two-disk cache pool is more useful than dual parity when you only have two data disks.

 

I don't trust my disks (consumer products not intended for this kind of use), and honestly I don't trust (how I'm using) Unraid to keep my data safe with a setup that only allows for one disk failure. So far, using Unraid has forced me to do a number of hard resets that seem to have harmed my disks (since the "reported uncorrect" count has risen), so I have no desire to swap out a parity drive for another cache drive. Especially not given the extremely underwhelming performance of cache drives in Unraid in my setup so far.

 

4 hours ago, trurl said:

But since you are still having problems, I am wondering: is there anything unusual about how these disks are installed and connected?

 

Nothing unusual that I'm aware of. I have a Fractal Design Define R5 case with the disks mounted in the disk bays. They are connected to the SATA ports of my ASUS Z170 Pro Gaming motherboard via SATA cables, and to my EVGA SuperNOVA G2 850W power supply via the supplied power cables.

 

Thanks again for trying to help me!

 

/Rickard

4 hours ago, rockard said:

So far, using Unraid has forced me to do a number of hard resets that seem to have harmed my disks (since the "reported uncorrect" count has risen)

That will not harm the disks, but the fact that the count keeps increasing might mean the disk isn't reliable. Keep an eye on it.

 

As for the hangups, have you done memtest?

 

 


Alright, good tip about Syslog Server! I just enabled it, so hopefully there will be better diagnostics next time it happens! Thank you!

 

7 hours ago, trurl said:

As for the hangups, have you done memtest?

 

No, I haven't. My reading of what happens is that my Plex docker ends up waiting for disk IO, caused in some way by my gaming VM trying to reserve something that is in use by Docker. I don't think memory has anything to do with it, but I'm in no position to rule anything out. I'll run it when there's a chance; I'm still waiting for the extended SMART reports.

  • 3 weeks later...

Memtest and the extended SMART reports found no problems, but "reported uncorrect" nevertheless kept rising and Unraid reported read errors on the disk, so I decided to remove it from the array. I found https://wiki.unraid.net/Shrink_array and started to follow the steps in "The "Clear Drive Then Remove Drive" Method". After 48+ hours the "clear an array drive" script finally finished, and I wanted to continue with the next step. However, at this point I find that the disk has been mounted again and a parity check is running. According to the Main tab, the fs size is 16 GB and 1.21 GB is used, yet an ls -al at /mnt/disk1 shows no files (the checks I ran are below). How did this happen, and more importantly, how do I continue? Please don't tell me I need to start over with the clearing script; it's getting ridiculous, and I think my uptime/downtime ratio is getting close to 1:10 at this point. Submitted another diagnostics report.
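For completeness, ls -al is what I originally ran; the find and df below are extra checks I'll run next in case ls missed something:

    ls -al /mnt/disk1              # what I ran; shows no files at all
    find /mnt/disk1 -mindepth 1    # would list anything left, including hidden files and empty dirs
    df -h /mnt/disk1               # filesystem-level view of the 16 GB size / 1.21 GB used figures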

eru-diagnostics-20200518-1836.zip
