Large copy/write on btrfs cache pool locking up server temporarily



19 hours ago, limetech said:

There are 126 posts in this topic, can someone please write a tldr?

IIRC the issue mostly affects anyone using a Samsung drive, because their NAND erase block size is different and partitions starting on sector 64 aren't optimally aligned for it.
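If anyone wants to check where their own cache partition starts, a quick way from the console (sdX below is just a placeholder for your cache device, and the numbers are only there to illustrate the alignment math):

fdisk -l /dev/sdX                 # the "Start" column shows the first sector of each partition
cat /sys/block/sdX/sdX1/start     # same start sector, read straight from sysfs

# Sector 64 * 512 bytes = 32 KiB offset. Most modern partitioning tools align to
# 1 MiB (sector 2048), which is a multiple of common SSD erase block sizes, so
# large writes don't straddle erase blocks the way they can with a 32 KiB offset.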

 

https://forums.unraid.net/topic/58381-large-copywrite-on-btrfs-cache-pool-locking-up-server-temporarily/?do=findComment&comment=641245

 

Also posting the post he links to, since the link above doesn't go directly to it:

https://forums.unraid.net/topic/44104-unassigned-devices-managing-disk-drives-and-remote-shares-outside-of-the-unraid-array/?do=findComment&comment=640178

 

  • 3 weeks later...

I'm stumbling across this issue just before I set up my new build for unRAID. I have two 970 EVO Plus 500GB drives I was going to use as a BTRFS cache pool. Guess I won't bother with that; I can't hardware RAID them with my current hardware, and I'm not going to buy anything else to do it. I want redundancy, which is why I bought two. Should I just get different SSDs? If so, which ones are good with no issues?

9 hours ago, Iceman24 said:

I'm stumbling across this issue just before I set up my new build for unRAID. I have two 970 EVO Plus 500GB drives I was going to use as a BTRFS cache pool. Guess I won't bother with that; I can't hardware RAID them with my current hardware, and I'm not going to buy anything else to do it. I want redundancy, which is why I bought two. Should I just get different SSDs? If so, which ones are good with no issues?

If you were planning to use them for this specific purpose and still have the option to return them, I'd say do that. Right now there's no telling if or when this issue will be resolved.

 

As far as what to replace them with, I think anything non-Samsung will do. I believe the issue is specific to their drives, but someone correct me if I'm wrong.

7 minutes ago, deusxanime said:

If you were planning to use them for this specific purpose and still have the option to return them, I'd say do that. Right now there's no telling if or when this issue will be resolved.

 

As far as what to replace them with, I think anything non-Samsung will do. I believe the issue is specific to their drives, but someone correct me if I'm wrong.

I have some 1TB ADATA drives and have the issue. I figured it was because they're kind of cheaper drives. I bought some 860s to swap them out with. If I don't use them here, I have plenty of other places to use them.

19 minutes ago, FearlessUser said:

I have some 1TB ADATA drives and have the issue. I figured it was because they're kind of cheaper drives. I bought some 860s to swap them out with. If I don't use them here, I have plenty of other places to use them.

Good to know; I thought only Samsung drives were affected. I definitely want to do some research before purchasing new ones, then, to be sure they'll work correctly. I blew $600 (at the time) on two 850 EVO 1TB drives specifically to use as my unRAID cache drives in a mirror and was quite frustrated that it didn't work (and still doesn't, a couple of years later!). Hopefully others will be spared the pain and expense.


I can dump one and keep the other as a single XFS cache drive for now, maybe adding another later if the issue is resolved. I will need regular backups of the data on that cache drive, though, since it will house Dockers, etc.

 

Edit:

I'd much rather get drives that work, but which ones are those? I can't find an answer.

Edited by Iceman24

I would like to chime in here as well.

 

I have 2x 1TB NVMe drives in a BTRFS RAID1. My Radarr/SAB downloads all sit on the cache. During heavy downloading my iowait goes as high as 40%, and all Dockers become unusable during that time. Running 6.7.2.

 

System resources are not a problem with 64GB of RAM and a Ryzen 3900X; it seems to be the implementation of RAID1 Btrfs cache pools.

Edited by bobo89
3 hours ago, bobo89 said:

I would like to chime in here as well.

 

I have 2x 1TB NVMe drives in a BTRFS RAID1. My Radarr/SAB downloads all sit on the cache. During heavy downloading my iowait goes as high as 40%, and all Dockers become unusable during that time. Running 6.7.2.

 

System resources are not a problem with 64GB of RAM and a Ryzen 3900X; it seems to be the implementation of RAID1 Btrfs cache pools.

 

Everyone posting in here running 6.7.2 should upgrade to 6.8 stable, to at least rule out the chance that your slowdown is from the "writes starve reads" bug.

  • 2 weeks later...

So it's been a couple of years and this is still an issue? That's unfortunate. I'm in the process of building a new 6.8 server and was planning on using a couple of Samsung SSDs for a cache pool. Has anyone got that working without the issues mentioned in this thread, and if so, with which SSDs? Thanks!


I'm using two Samsung 860 EVO 1TB drives in my cache pool in RAID1 and the server is NOT locking up for me when I transfer large files. I already bought the drives before I saw this thread, but I can still return them. I like tweaking and tuning stuff, so I was trying to reproduce the issues others are seeing in this thread before deciding whether to return the drives. I can copy a 50GB file to the cache pool and don't see any issues.

 

My main Unraid server is still running 5.0.  I recently upgraded my backup server from 5.0rc11 to 6.8.  Also, I swapped the case from a 4U Norco 4020 to a silent mid tower because I'm relocating the server to a different location (noise is an issue) and added the SSDs.  

 

I installed a bunch of Docker containers and a couple of VMs. Tonight, when I shut down the server to add the second cache drive, after the restart my VMs were no longer visible in the GUI. Don't know why; I started another thread on that issue.

 

 

My Hardware Components:
CPU: Intel Xeon E3-1220 Sandy Bridge
Motherboard: Supermicro X9SCM-IIF-O
RAM: 32GB - 4x Super Talent DDR3-1333 8GB ECC Micron
Controllers: 1x IBM M1015.  Flashed in IT mode.
Case: Antec P101 Silent
Power Supply: CORSAIR HX750
Flash: 4GB Cruzer Micro
Parity Drive: 1x4TB Seagate ST4000DM000 5900RPM 64MB 4x1000GB CC43
Data Drives:  5x4TB Seagate ST4000DM000 5900RPM 64MB 4x1000GB CC43
Cache Drives: 2x1TB Samsung SSD 860 EVO 1TB
 

Hard drives are connected to the M1015.  SSDs are connected to SATA3 ports on the motherboard.

 

Multiple times I copied a 50GB file from a Win10 PC to my Unraid server over gigabit ethernet:

 

[screenshot: Windows file transfer]

 

Cache pool during transfer:

[screenshot: cache pool during transfer]

 

Top during transfer:

[screenshot: top during transfer]

 

So, during the transfer the load average was about 2; the highest I saw was ~3. I still need to figure out what's going on with the VMs, so I couldn't test with those. But during the transfer I used several docker containers and didn't notice any performance impact, including:

  • Krusader - browsing files/folders on the server
  • CouchDB - exploring the GUI/interface
  • DokuWiki - editing wiki pages
  • Oracle Database - browsing with the console

 

Everything appears to be working for me with 2 Samsung SSDs in my cache pool while copying large files.  Should my test have reproduced the problem others are seeing?  Anything else I can/should try? 
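In case it helps anyone reproduce this, here's roughly what I run from the console while a transfer is going (the path and size are just examples, and iostat may or may not be installed on your box):

# write a ~50GB test file straight to the cache pool
dd if=/dev/zero of=/mnt/cache/ddtest bs=1M count=50000 conv=fdatasync

# in a second terminal, watch per-device utilization and iowait
iostat -x 2        # %util and await per drive, refreshed every 2 seconds
top                # the "wa" field in the CPU line is the iowait percentage

rm /mnt/cache/ddtest   # clean up afterwards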

 

Best Regards,

Jimmy

 

Edited by JimmyJoe
  • 2 weeks later...

I had similar symptoms, using an older Samsung 830 SSD as a single Btrfs LUKS-encrypted cache. When copying a very large file, iowait would hit the 80s and at some point the system became unresponsive, with write speeds around 80 MB/s. However, moving to LUKS-encrypted XFS did not help things at all.

 

In my case, it had to do with LUKS encryption. Moving to a non-encrypted cache, either Btrfs or XFS, iowait was much lower and write speeds were around 200 MB/s. However, I'm on an i7-3770, which has AES acceleration, and I see barely any CPU utilization.
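If anyone wants to sanity-check their own setup, these are the two standard checks I used (nothing Unraid-specific):

grep -m1 -o aes /proc/cpuinfo    # prints "aes" if the CPU advertises AES-NI
cryptsetup benchmark             # shows the raw aes-xts throughput the kernel crypto layer can sustain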

 

One guess is that the 830's controller doesn't handle incompressible data as well, but looking at reviews, that's where it shone compared to SandForce controllers.

 

Some searching led me to this post:

 

Quote

For large writes, the default multiqueue scheduler can end up filling multiple queues of sequential IO that look like random IO (to some devices that have trouble with internal multiqueue scheduling), so it may be worth trying the "none" queuing algorithm to see if this improves things.

Setting the IO Scheduler to none for my cache drive helped a bit, but lowering nr_requests with any IO scheduler helped more, at least in my case.
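For anyone who wants to try the same thing, these are the sysfs knobs I'm talking about (sdX is a placeholder for your cache device; the values reset on reboot, so they'd have to go into the go file or a user script to stick):

cat /sys/block/sdX/queue/scheduler          # lists available schedulers, active one in [brackets]
echo none > /sys/block/sdX/queue/scheduler

cat /sys/block/sdX/queue/nr_requests        # current queue depth limit
echo 8 > /sys/block/sdX/queue/nr_requests   # try a much lower value and re-test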

Edited by robobub
  • 1 month later...

Exact same issue happening to me. The server locks up completely when copying to a BTRFS cache drive (single drive).

Seeing iowait of 50%-plus.

Samsung 850 Pro 2TB SSD using motherboard SATA.

Raised it in the bug section as a problem.

Frankly, I'm surprised this doesn't appear to be getting looked into by LT, given that Samsung makes arguably the most popular SSDs in the world.

Edited by sdamaged
10 hours ago, sdamaged said:

Exact same issue happening to me. The server locks up completely when copying to a BTRFS cache drive (single drive).

Seeing iowait of 50%-plus.

Samsung 850 Pro 2TB SSD using motherboard SATA.

Raised it in the bug section as a problem.

Frankly, I'm surprised this doesn't appear to be getting looked into by LT, given that Samsung makes arguably the most popular SSDs in the world.

LT dropped by once and asked for a summary, then crickets. Try emailing them and linking to this thread.

  • 4 weeks later...

Hi there,


I just started with Unraid but I am also affected - I have 2x 1TB 860 QVO SSDs.

My iowait sometimes goes above 60 and the server locks up almost completely. During a rebalance etc. I see 2x 500 MB/s, so bandwidth or the controller is hardly the issue.


I tried configuring the SSDs as RAID1 and RAID0; same issue. I did try to figure out how to change it to XFS, but unfortunately I found out that the btrfs RAID1 did not work as expected - so I am currently re-playing the backups and re-downloading metadata :( This is very annoying!
I hope this gets fixed soon! It can't be that difficult to allow for a different partition offset?


Server: Unraid Pro 6.8.3, T620, 2x 2690 v1 Xeons, 128GB, 8x 8TB, 5x 14TB - SSDs are on a 2118IT (p16, trim enabled).


So this also seems related to all the other cases where Unraid appears frozen/unresponsive, etc.

Why is no one looking into this?

It can't be that difficult to allow a different partition offset for some disks?

I just bought this Pro license and thought I'd be getting some support for this as well.

The system otherwise looks really nice and promising, but what good is that if issues like this are not being fixed?

 

 

 

Edited by ephigenie
On 3/29/2020 at 9:13 PM, allanp81 said:

I was seeing this with a pool of 2x 512GB SSDs. I have since switched to a single Intel NVMe drive and the problem has gone.

OK, I mean, "just throw more money at the problem" is also a possibility.

However, I think this should concern the Limetech team, and there needs to be a bugfix for this.

 

The Docker service is up because I tried earlier to update "one" Docker image. It took an hour before I gave up (binhex-plexpass). This is so bad.

I have a single SSD in my old box running plain Debian with 40+ containers (it was my previous media server) and have never had those kinds of performance issues. This is really a shame. I don't think it's anywhere near acceptable to have a 128GB, dual-Xeon, 2x SSD server sitting there basically completely and utterly busy with itself.

I used mergerfs in my old box before and it performed really nicely. Then I thought Unraid looked better and more neatly integrated, and in order not to fiddle around with those things anymore, I bought into Unraid. Only later did I see that, unfortunately, there are ZFS-based solutions that have also emerged with nice interfaces... and Docker, etc.

 

However, can we now get this fixed, please? What more information is needed to narrow down this bug?

 

[screenshots attached]

 

  • 4 weeks later...

@limetech, bumping this thread your way again; we got your attention in November but haven't heard from you since.
The issue is that anyone using Samsung SSDs (among other brands too) in a btrfs cache pool in Unraid will see performance fall off a cliff due to partitions starting on sector 64. E.g., if you transfer a large file to/from the btrfs cache pool, all the Dockers in Unraid lock up.

@wgards, the best option for now is to drop your cache down to one drive and reformat to XFS.
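If it helps, a rough outline of the manual data shuffle from the console (the paths are only examples; you can also let the Mover handle the copy by setting your cache shares to "Use cache: Yes"):

# back the cache contents up onto an array disk before touching the pool
rsync -avh --progress /mnt/cache/ /mnt/disk1/cache_backup/

# stop the array, set the pool to a single drive with XFS, format it, start the array,
# then copy everything back onto the freshly formatted cache
rsync -avh --progress /mnt/disk1/cache_backup/ /mnt/cache/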

