Large copy/write on btrfs cache pool locking up server temporarily

November 22, 2019

14 hours ago, aptalca said:

Good question

@limetech, is this issue on the radar? We'd like to be able to use btrfs cache pools, but the I/O issue is a non-starter.

Thanks

There are 126 posts in this topic, can someone please write a tldr?

November 23, 2019

Community Expert

19 hours ago, limetech said:

There are 126 posts in this topic, can someone please write a tldr?

IIRC the issue mostly affects anyone using Samsung drives, because they have a different NAND erase block size, and partitions starting on sector 64 aren't optimally aligned for them.
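To make the sector-64 point concrete, here is a minimal Python sketch that checks whether a partition's start offset lands on a given boundary. It assumes "sector 64" means 512-byte sectors, reads the start from the kernel's sysfs `start` attribute, and uses 1 MiB as the target boundary purely as an illustrative assumption; the real erase block size of any given SSD isn't stated in this thread and may differ. The helper names are hypothetical.

```python
from pathlib import Path

# sysfs reports partition start offsets in 512-byte units,
# regardless of the device's logical sector size.
SYSFS_UNIT = 512

# Illustrative target boundary (assumption): 1 MiB, a common partitioning default.
# The actual NAND erase block size of a given SSD may be different.
DEFAULT_BOUNDARY = 1024 * 1024


def is_aligned(start_sector: int, boundary_bytes: int = DEFAULT_BOUNDARY) -> bool:
    """Return True if a partition starting at start_sector begins on a multiple of boundary_bytes."""
    return (start_sector * SYSFS_UNIT) % boundary_bytes == 0


def read_start_sector(disk: str, partition: str) -> int:
    """Read a partition's start sector from sysfs, e.g. disk='sdb', partition='sdb1' (hypothetical names)."""
    return int(Path("/sys/block", disk, partition, "start").read_text().strip())


if __name__ == "__main__":
    # Compare a sector-64 start with the common sector-2048 (1 MiB) start.
    for start in (64, 2048):
        offset_kib = start * SYSFS_UNIT // 1024
        print(f"start={start} ({offset_kib} KiB): "
              f"4 KiB aligned={is_aligned(start, 4096)}, 1 MiB aligned={is_aligned(start)}")
```

Sector 64 works out to a 32 KiB offset: it is 4 KiB-aligned, but not aligned to a 1 MiB boundary, which is the kind of mismatch the post above describes.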

https://forums.unraid.net/topic/58381-large-copywrite-on-btrfs-cache-pool-locking-up-server-temporarily/?do=findComment&comment=641245

Also posting the post he links to, since the link doesn't go directly there:

https://forums.unraid.net/topic/44104-unassigned-devices-managing-disk-drives-and-remote-shares-outside-of-the-unraid-array/?do=findComment&comment=640178

November 28, 2019

I have also been having my server lock up in the morning lately. I recently put a second SSD in my cache, in BTRFS format. Here is one of the diagnostics I was able to grab before I shut the computer down.

finalizer-diagnostics-20191122-0449.zip

December 17, 2019

I'm stumbling across this issue just before I set up my new build for unRAID. I have two 970 EVO Plus 500GB drives that I was going to use as a BTRFS cache pool. Guess I won't bother with that: I can't hardware-RAID them with my current hardware, and I'm not going to get anything else to do that. I want redundancy, which is why I bought two. Should I just get different SSDs? If so, which ones work with no issues?

December 17, 2019

9 hours ago, Iceman24 said:

I'm stumbling across this issue just before I set up my new build for unRAID. I have two 970 EVO Plus 500GB drives that I was going to use as a BTRFS cache pool. Guess I won't bother with that: I can't hardware-RAID them with my current hardware, and I'm not going to get anything else to do that. I want redundancy, which is why I bought two. Should I just get different SSDs? If so, which ones work with no issues?

If you were planning to use them for this specific purpose and still have the option to return them, I'd say do that. Right now there's no telling if or when this issue will be resolved.

As far as what to replace them with, I think anything non-Samsung will do. I believe the issue is specific to their drives, but someone correct me if I'm wrong.

December 17, 2019

7 minutes ago, deusxanime said:

If you were planning to use them for this specific purpose and still have the option to return them, I'd say do that. Right now there's no telling if or when this issue will be resolved.

As far as what to replace them with, I think anything non-Samsung will do. I believe the issue is specific to their drives, but someone correct me if I'm wrong.

I have some 1TB ADATA drives and have the issue. I figured it was because they're on the cheaper side. I bought some 860s to swap them out with; if I don't use them here, I have plenty of other places to use them.

December 17, 2019

19 minutes ago, FearlessUser said:

I have some 1TB ADATA drives and have the issue. I figured it was because they're on the cheaper side. I bought some 860s to swap them out with; if I don't use them here, I have plenty of other places to use them.

Good to know; I thought only Samsung drives were affected. Definitely want to do some research before purchasing new ones, then, to be sure they'll work correctly. I blew $600 (at the time) on two 850 EVO 1TB drives specifically to use as my unRAID cache drives in a mirror and was quite frustrated that it didn't work (and still doesn't, a couple of years later!). Hopefully others will be spared the pain and expense.

December 18, 2019

I can dump one and keep the other as a single XFS cache drive for now, maybe adding another later if the issue is resolved. I'll need regular backups of the cache drive, though, since it will house Docker containers, etc.

Edit:

I'd much rather get drives that work, but which ones are those? I can't find an answer.

Edited December 18, 2019 by Iceman24

December 21, 2019

I would like to chime in here as well.

I have 2x1 TB NVMe drives in a BTRFS RAID1. My Radarr/SAB downloads also all sit on the cache. During heavy downloading, my iowait goes as high as 40% and all Docker containers become unusable. Running 6.7.2.

System resources are not the problem with 64 GB of RAM and a Ryzen 3900X; it seems to be the implementation of BTRFS RAID1 cache pools.
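Since iowait figures come up repeatedly in this thread, here is a minimal Python sketch of how that percentage can be derived on Linux: sample the aggregate `cpu` line of `/proc/stat` twice and divide the iowait delta by the total delta. The field order follows the proc(5) man page; the function names are hypothetical, and this is a rough monitor, not necessarily how Unraid's own dashboard computes its figure.

```python
import time
from typing import List

# Field order of the aggregate "cpu" line in /proc/stat (see proc(5)):
# user nice system idle iowait irq softirq steal guest guest_nice
IOWAIT_INDEX = 4
# guest/guest_nice are already counted inside user/nice, so only the first 8 fields are summed.
COUNTED_FIELDS = 8


def parse_cpu_line(line: str) -> List[int]:
    """Parse the aggregate 'cpu ...' line into a list of jiffy counters."""
    fields = line.split()
    if not fields or fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line from /proc/stat")
    return [int(x) for x in fields[1:]]


def iowait_percent(before: List[int], after: List[int]) -> float:
    """Percentage of CPU time spent in iowait between two samples."""
    deltas = [a - b for a, b in zip(after[:COUNTED_FIELDS], before[:COUNTED_FIELDS])]
    total = sum(deltas)
    if total <= 0:
        return 0.0
    return 100.0 * deltas[IOWAIT_INDEX] / total


def sample_iowait(interval: float = 1.0) -> float:
    """Take two /proc/stat samples `interval` seconds apart and return iowait %."""
    with open("/proc/stat") as f:
        before = parse_cpu_line(f.readline())
    time.sleep(interval)
    with open("/proc/stat") as f:
        after = parse_cpu_line(f.readline())
    return iowait_percent(before, after)
```

Calling `sample_iowait()` in a loop while a large copy to the cache pool is running would show whether iowait climbs the way it does in the reports above.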

Edited December 21, 2019 by bobo89

December 22, 2019

3 hours ago, bobo89 said:

I would like to chime in here as well.

I have 2x1 TB NVMe drives in a BTRFS RAID1. My Radarr/SAB downloads also all sit on the cache. During heavy downloading, my iowait goes as high as 40% and all Docker containers become unusable. Running 6.7.2.

System resources are not the problem with 64 GB of RAM and a Ryzen 3900X; it seems to be the implementation of BTRFS RAID1 cache pools.

Everyone posting in here running 6.7.2 should upgrade to 6.8 Stable to at least remove the chance of your slowdown being from the "writes starves reads" bug.

January 3, 2020

I'm on the latest release, 6.8, using two ADATA SU635 480GB 3D-NAND SATA SSDs in a BTRFS pool, and I also have this issue. I just built my server and had read to avoid Samsung disks, but it seems I also get it with the ADATA disks. I'll try with COW/checksums disabled, as @johnnie.black mentioned.

nas4x12-diagnostics-20200102-1903.zip

Edited January 3, 2020 by drjUnraid

January 4, 2020

So it's been a couple of years and this is still an issue? That's unfortunate. I'm in the process of building a new 6.8 server and was planning on using a couple of Samsung SSDs for a cache pool. Has anyone got that working without the issues mentioned in this thread, and if so, with which SSDs? Thanks!

January 4, 2020

Only way for me was to reformat to XFS 😞

January 8, 2020

I'm using two Samsung 860 EVO 1TB drives in my cache pool in RAID1, and the server is NOT locking up for me when I transfer large files. I had already bought the drives before I saw this thread, but I can still return them. I like tweaking and tuning stuff, so I tried to reproduce the issues others are seeing in this thread before deciding whether to return the drives. I can copy a 50GB file to the cache pool and don't see any issues.

My main Unraid server is still running 5.0. I recently upgraded my backup server from 5.0rc11 to 6.8. Also, I swapped the case from a 4U Norco 4020 to a silent mid tower because I'm relocating the server to a different location (noise is an issue) and added the SSDs.

I installed a bunch of Docker containers and a couple of VMs. Tonight, when I shut down the server to add the second cache drive, my VMs were no longer visible in the GUI after the restart. I don't know why; I started another thread on that issue here:

My Hardware Components:
CPU: Intel Xeon E3-1220 Sandy Bridge
Motherboard: Supermicro X9SCM-IIF-O
RAM: 32GB - 4x Super Talent DDR3-1333 8GB ECC Micron
Controllers: 1x IBM M1015. Flashed in IT mode.
Case: Antec P101 Silent
Power Supply: CORSAIR HX750
Flash: 4GB Cruzer Micro
Parity Drive: 1x4TB Seagate ST4000DM000 5900RPM 64MB 4x1000GB CC43
Data Drives: 5x4TB Seagate ST4000DM000 5900RPM 64MB 4x1000GB CC43
Cache Drives: 2x1TB Samsung SSD 860 EVO 1TB

Hard drives are connected to the M1015. SSDs are connected to SATA3 ports on the motherboard.

Multiple times I copied a 50GB file from a Win10 PC to my Unraid server over gigabit ethernet:

[Screenshot: transfer progress]

Cache pool during transfer:

Top during transfer:

[Screenshot: top output during the transfer]

So, during the transfer I was at a load average of about 2; the highest I saw was ~3. I still need to figure out what's going on with the VMs, so I couldn't test with those. But during the transfer I used several Docker containers and didn't notice any performance impact, including:

Krusader - browsing files/folders on the server
CouchDB - exploring the GUI/interface
DokuWiki - editing wiki pages
Oracle Database - browsing with the console

Everything appears to be working for me with 2 Samsung SSDs in my cache pool while copying large files. Should my test have reproduced the problem others are seeing? Anything else I can/should try?

Best Regards,

Jimmy

Edited January 8, 2020 by JimmyJoe

January 16, 2020

I had similar symptoms using an older Samsung 830 SSD as a single BTRFS LUKS-encrypted cache. When copying a very large file, iowait would hit the 80s, at some point the system became unresponsive, and write speeds were around 80 MB/s. However, moving to LUKS-encrypted XFS did not help at all.

In my case, it had to do with LUKS encryption. With a non-encrypted cache, either BTRFS or XFS, iowait was much lower and write speeds were around 200 MB/s. Yet I'm on an i7-3770, which has AES acceleration, and I saw barely any CPU utilization.

One guess is that the 830 controller doesn't handle incompressible data well, but judging by reviews, that's exactly where it shone compared to SandForce controllers.
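On the AES-acceleration point: whether the CPU advertises AES-NI can be checked from the `flags` line of `/proc/cpuinfo`. This is a minimal Python sketch assuming an x86 CPU (other architectures label that line differently); the function names are hypothetical. It only shows that the instruction set is present, not that dm-crypt is actually using it.

```python
def has_aes_ni(cpuinfo_text: str) -> bool:
    """Return True if the first 'flags' line of /proc/cpuinfo lists 'aes' (x86 assumption)."""
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags"):
            _, _, flags = line.partition(":")
            # Exact token match, so e.g. 'vaes' does not count as 'aes'.
            return "aes" in flags.split()
    return False


def read_cpuinfo(path: str = "/proc/cpuinfo") -> str:
    """Read the raw /proc/cpuinfo text."""
    with open(path) as f:
        return f.read()
```

For example, `has_aes_ni(read_cpuinfo())` should return True on an i7-3770. For actual cipher throughput, `cryptsetup benchmark` reports per-cipher speeds.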

Some searching led me to this post:

Quote

For large writes, the default multiqueue scheduler can end up filling multiple queues of sequential IO that look like random IO (to some devices that have trouble with internal multiqueue scheduling), so it may be worth trying the "none" queuing algorithm to see if this improves things.

Setting the IO Scheduler to none for my cache drive helped a bit, but lowering nr_requests with any IO scheduler helped more, at least in my case.
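Both tweaks correspond to per-device sysfs attributes: `/sys/block/<dev>/queue/scheduler` and `/sys/block/<dev>/queue/nr_requests`. Here is a minimal Python sketch, assuming root access and a hypothetical device name. Settings written this way do not survive a reboot, and no specific `nr_requests` value is suggested because the post above doesn't give one.

```python
from pathlib import Path
from typing import Optional, Tuple


def queue_dir(dev: str) -> Path:
    """sysfs queue directory for a whole block device, e.g. 'sdb' (hypothetical name)."""
    return Path("/sys/block") / dev / "queue"


def active_scheduler(scheduler_file_text: str) -> str:
    """The kernel lists available schedulers with the active one in brackets,
    e.g. 'mq-deadline kyber [none]'. Return the bracketed entry."""
    tokens = scheduler_file_text.split()
    for tok in tokens:
        if tok.startswith("[") and tok.endswith("]"):
            return tok[1:-1]
    # Fall back to the first entry if nothing is bracketed.
    return tokens[0] if tokens else ""


def tune_queue(dev: str, scheduler: str = "none", nr_requests: Optional[int] = None) -> None:
    """Switch the I/O scheduler and optionally lower nr_requests (requires root; not persistent)."""
    q = queue_dir(dev)
    (q / "scheduler").write_text(scheduler)
    if nr_requests is not None:
        (q / "nr_requests").write_text(str(nr_requests))


def current_settings(dev: str) -> Tuple[str, int]:
    """Return (active scheduler, nr_requests) for a device."""
    q = queue_dir(dev)
    return (active_scheduler((q / "scheduler").read_text()),
            int((q / "nr_requests").read_text()))
```

For example, `tune_queue("sdb", "none")` would switch a hypothetical `sdb` to the `none` scheduler, and `current_settings("sdb")` would confirm the result. Per the post, lowering `nr_requests` helped more than changing the scheduler.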

Edited January 16, 2020 by robobub

March 2, 2020

Exact same issue happening to me. Server locks up completely when copying to BTRFS cache drive (single drive)

Seeing IOWAIT up to 50% plus

Samsung 850 Pro 2TB SSD using motherboard SATA

Raised it as a problem in the bug reports section.

Frankly, I'm surprised this doesn't appear to be getting looked into by LT, given that Samsung makes arguably the most popular SSDs in the world.

Edited March 2, 2020 by sdamaged

March 2, 2020

Author

10 hours ago, sdamaged said:

Exact same issue happening to me. Server locks up completely when copying to BTRFS cache drive (single drive)

Seeing IOWAIT up to 50% plus

Samsung 850 Pro 2TB SSD using motherboard SATA

Raised it as a problem in the bug reports section.

Frankly, I'm surprised this doesn't appear to be getting looked into by LT, given that Samsung makes arguably the most popular SSDs in the world.

LT dropped by once and asked for a summary, then crickets. Try emailing them and linking to this thread

March 2, 2020

Messaged LT; let's hope they can help get this fixed!

March 29, 2020

Hi there,

I just started with Unraid, but I'm also affected. I have 2x 1TB 860 QVO SSDs.

My iowait sometimes goes above 60% and the server almost completely locks up. During a rebalance I see 2 x 500 MB/s, so bandwidth or the controller is hardly the issue.

I tried configuring the SSDs as RAID1 and as RAID0: same issue. I tried to figure out how to change to XFS, but unfortunately I found out that the BTRFS RAID1 did not work as expected, so I'm currently restoring the backups and re-downloading metadata. This is very annoying!
I hope this gets fixed soon! It can't be that difficult to allow for a partition offset?

Server: Unraid Pro 6.8.3, T620, 2 x 2690v1 Xeon, 128GB, 8x8TB, 5x14TB. The SSDs are on a 2118IT p16 (TRIM enabled).

March 29, 2020

I was seeing this with a pool of two 512GB SSDs. I have since switched to a single Intel NVMe drive and the problem has gone away.

March 31, 2020

So this then also seems related to all the other cases where Unraid appears frozen, unresponsive, etc.

Why is no one looking into this?

It can't be that difficult to allow a different partition offset for some disks?

I just bought the Pro license and thought I'd be getting some support for this as well.

The system otherwise looks really nice and promising, but what if the issues aren't fixed?

Edited March 31, 2020 by ephigenie

March 31, 2020

On 3/29/2020 at 9:13 PM, allanp81 said:

I was seeing this with a pool of two 512GB SSDs. I have since switched to a single Intel NVMe drive and the problem has gone away.

OK, I mean, this is also a possibility: "just throw more money at the problem".

However i think this should concern the Limetech Team and there needs to be a bugfix for this.

Docker is up because I tried earlier to update just "one" Docker image (binhex-plexpass). It took an hour before I gave up. This is so bad.

I have a single SSD in my old box running plain Debian with 40+ containers (it was my previous media server), and I never had these kinds of performance issues. This is really a shame. I don't think it's anywhere near acceptable to have a 128GB, dual-Xeon, 2x SSD (and so on) server sitting there basically idle, yet completely and utterly busy with itself.

I used mergerfs in my old box before, and it performed really nicely. I thought Unraid looked better and more neatly integrated, so I bought into it in order not to have to fiddle with those things anymore. Only later did I see that ZFS-based solutions with nice interfaces have emerged as well, with Docker etc.

Anyway: can we get this fixed, please? What more information is needed to narrow down this bug?

[Screenshot]

April 23, 2020

Same issues with two MX500s formatted as BTRFS. Extremely disconcerting that this has been a known issue for so long. Seriously thinking about moving away from Unraid, to be honest.

April 23, 2020

@limetech, bumping this thread your way again; we got your attention in November but have lost you since then.
Issue is, anyone using Samsung SSDs (among other brands too) in a btrfs cache pool in unraid will see performance fall off a cliff due to partitions starting on sector 64. E.g., if you transfer a large file from/to the btrfs cache pool, all the dockers in unraid will lock up.

@wgards, best option for now is to drop your cache down to one drive and reformat to XFS.

April 23, 2020

Community Expert

36 minutes ago, wgards said:

Same issues with two MX500s formatted as BTRFS. Extremely disconcerting that this has been a known issue for so long.

The problem is that it doesn't affect everyone; I've had a pool of MX500s working for more than a year without any issues.
