Large copy/write on btrfs cache pool locking up server temporarily


June 27, 2017

Hi,

I have a btrfs cache pool of 4 ssd drives, which hosts the docker image as well as downloaded data. I noticed that some of the docker apps were occasionally having issues like "database locked" and "write error, disk full?", etc.

After thorough testing, I realized that whenever there is a large file transfer on the cache pool, where the file is both read from and written to the cache pool, the server temporarily locks up. The unraid gui is unreachable, the docker apps stall and their guis are unreachable, and ssh access is slow and occasionally hangs in the middle of a basic operation like "ps -ef". This continues until the file transfer is over; a few minutes later, everything is back to normal. This seems to happen with files larger than about 10GB.

A typical scenario: sabnzbd downloads a file over 10GB, and while it is being unrarred to the cache pool, everything else is locked up. Then, when radarr moves the file from the sab temp folder to the Movies folder on the cache pool, everything else locks up again.

Because of the lock-up, it is difficult to troubleshoot. I don't know what else to try or test.

I have sabnzbd restricted to certain cores rather than all of them, so the issue should not be due to high cpu utilization during unrar. Therefore I believe it is due to disk io being fully taken up by the file transfer.
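One way to test the disk io theory directly is to measure iowait, the share of CPU time spent idle while I/O is outstanding. The following is a minimal sketch, assuming a Linux /proc/stat with the usual field order (user, nice, system, idle, iowait, irq, softirq, steal, ...). The function names are made up for illustration and nothing here is specific to unraid.

```python
def cpu_times(stat_text):
    """Return the numeric fields of the aggregate "cpu" line from /proc/stat text."""
    for line in stat_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "cpu":
            return [int(value) for value in fields[1:]]
    raise ValueError("no aggregate cpu line found")


def iowait_percent(before_text, after_text):
    """Percentage of CPU time spent in iowait between two /proc/stat samples."""
    before = cpu_times(before_text)
    after = cpu_times(after_text)
    deltas = [new - old for old, new in zip(before, after)]
    # Only the first eight fields (user..steal) are summed, because guest
    # time is already counted inside user time.
    total = sum(deltas[:8])
    if total == 0:
        return 0.0
    return 100.0 * deltas[4] / total  # index 4 is iowait


def read_stat(path="/proc/stat"):
    """Read the raw text of /proc/stat (Linux only)."""
    with open(path) as handle:
        return handle.read()
```

To use it, call read_stat() once, wait a few seconds while a large transfer is running, call it again and pass both results to iowait_percent(). A high value together with low user and system time would point at storage rather than CPU.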

Attached are my diagnostics, which should include info from this morning's 20GB file transfer (the unrar operation failed a couple of times with a disk full message although there was 100GB of free space, but then succeeded on the third try).

I would appreciate any ideas or suggestions. Thanks

PS. File transfers from the cache pool to the array are completely fine. Mover does not affect general server operations and neither does a regular copy to the array.

tower-diagnostics-20170627-1153.zip

June 27, 2017

Does the "load average" shown by the "top" command keep rising when you copy to the cache array? Something similar happens for me: lock-ups during writes of larger files (> 4GB) to the cache array.

Edited June 27, 2017 by thomast_88


June 27, 2017

Author

Load average goes up to 42 and stays there during transfer


June 28, 2017

On 6/27/2017 at 9:09 PM, aptalca said:

Load average goes up to 42 and stays there during transfer

Ok, sounds exactly like the issue I have been having since I changed my cache from 1 to 2 disks :-(.

When the load average goes above 40 or so, VMs and Dockers begin crashing until I stop the write command.


June 29, 2017

Author

After my last post I realized that my 2 VMs went into a paused state (they are never supposed to sleep).

I guess I didn't realize this before because most of the files I was downloading previously were in the 4-6GB range so the issue was minimal.

Lately I started downloading larger files in the 10-24GB range and now the issues are a lot more pronounced.


June 29, 2017

Community Expert

VMs pause when they run out of space; in this case timeout issues are probably causing it. I have no idea what's causing the timeouts, I'm afraid.


June 29, 2017

Any ideas how we can diagnose this @johnnie.black? Like @aptalca mentioned, it's hard to do any diagnosis, as the system is unresponsive when it happens. I can easily reproduce this by writing a large file to the cache array, but then the system becomes unresponsive (load average keeps rising).

I basically stopped using my cache array for bigger file transfers, as this problem is causing the whole system to lock up.
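Since an interactive shell is barely usable during a stall, one option is to start a small logger before the transfer and read its output afterwards. The following is a minimal sketch, assuming Linux /proc/loadavg and /proc/meminfo. The log path, interval and function names are hypothetical, and the log should go somewhere other than the pool under test. On Linux the load average counts tasks blocked in uninterruptible (usually disk) wait as well as runnable ones, so a climbing load alongside large Dirty/Writeback figures suggests writes are backing up.

```python
import time


def parse_loadavg(text):
    """Return the 1-, 5- and 15-minute load averages from /proc/loadavg text."""
    fields = text.split()
    return float(fields[0]), float(fields[1]), float(fields[2])


def parse_meminfo_kb(text, keys=("Dirty", "Writeback")):
    """Return the selected /proc/meminfo values in kB."""
    values = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        if name in keys:
            values[name] = int(rest.split()[0])
    return values


def format_sample(loadavg_text, meminfo_text, stamp):
    """Build one log line from raw /proc text and a timestamp string."""
    load1, _, _ = parse_loadavg(loadavg_text)
    mem = parse_meminfo_kb(meminfo_text)
    return (f"{stamp} load1={load1:.2f} "
            f"dirty_kb={mem.get('Dirty')} writeback_kb={mem.get('Writeback')}")


def log_samples(path, interval_seconds, count):
    """Append `count` samples to `path`, one every `interval_seconds`."""
    for _ in range(count):
        with open("/proc/loadavg") as f:
            loadavg_text = f.read()
        with open("/proc/meminfo") as f:
            meminfo_text = f.read()
        line = format_sample(loadavg_text, meminfo_text, time.strftime("%H:%M:%S"))
        with open(path, "a") as log:
            log.write(line + "\n")
        time.sleep(interval_seconds)


# Example (hypothetical path): log_samples("/tmp/io-watch.log", 5, 120)
```

The file is reopened for every sample so that each line is flushed to the log even if the session later hangs.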


June 29, 2017

Community Expert

4 minutes ago, thomast_88 said:

Any ideas how we can diagnose this @johnnie.black?

Not really... is this recent behavior? Is it the same with v6.3 and v6.4?

My cache is a single device at the moment, but until recently I was using an 8 SSD pool in RAID10 and copied a lot of data daily at about 800/900MB/s (using 10GbE) without any slowdowns.


June 29, 2017

Author

I started noticing this in the last few months. But then again, I also wasn't downloading such large files in the past, so I can't be sure whether the issue existed before.


June 29, 2017

Since 6.3 for me. I already posted an issue half a year ago about this.


June 30, 2017

Author

@johnnie.black what is the easiest way to go back to a single cache drive? I just want to do some comparison tests

Stop all VM and docker services, copy all cache files to the array, stop the array, remove all cache drives but one? Any other setting changes?

Thanks


June 30, 2017

Community Expert

If all data on the pool fits on a single device you don't even need to stop your docker/VMs, though it's always a good idea to back up any important data before starting:

Edited June 30, 2017 by johnnie.black


July 1, 2017

Author

@johnnie.black thanks so much, with your instructions, I was able to switch to a single cache drive from a 4 ssd raid1


July 2, 2017

@aptalca did it help?


July 2, 2017

Author

Unfortunately I had to go out of town right after so didn't get a chance to do tests yet but will let you know when I do


July 5, 2017

Author

I just got done testing. I am using a single cache drive, btrfs, a 500GB Samsung EVO (3D nand and pretty fast)

I tried a copy from cache to cache, a 24GB image file. Server didn't break a sweat, load average did not go up to more than 6, which is perfectly fine on my 8 core, 16 thread machine.
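As a rough way to compare the figures in this thread, load average can be divided by the number of hardware threads. This is only a rule-of-thumb sketch and the helper name is made up. Because the Linux load average also counts tasks blocked on disk I/O, a ratio well above 1 does not by itself mean the CPU is saturated.

```python
def load_per_thread(load_average, hardware_threads):
    """Load average divided by the number of hardware threads."""
    return load_average / hardware_threads


print(load_per_thread(6, 16))   # single drive test: 0.375
print(load_per_thread(42, 16))  # 4-drive pool during a transfer: 2.625
```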

Then I tried a sab download, a 7.5GB file. During unrar, load average was at about 1.6. Then, while radarr was renaming and moving (cache to cache) load average went up to about 4 and stayed there.

I'll try a larger file download in the near future to simulate the scenario where I had the issue with a cache pool. But so far, it looks like my issues are gone with a single cache drive.


3 weeks later...

July 24, 2017

@aptalca Sorry for bugging you. But how is your test progressing?

I'm in the same boat as you at the moment. I can conclude that with a single drive, everything works properly. But in raid1, things start to become unstable. This pretty much sucks, as it renders the cache functionality useless. But this has to be a configuration issue? Many people are running with several drives, and they don't report these issues...


July 24, 2017

Author

I'm using a single btrfs cache drive still. No problems. I added the other drives through unassigned devices


July 24, 2017

Hmpf - thanks for the quick reply. I'm hoping someone can step in with a working solution. I'd really like to utilize the btrfs raid function without the server crashing


5 weeks later...

August 23, 2017

Author

Another update.

After switching to a single btrfs cache drive, I continued to have minor issues.

Sonarr and Radarr still logged "database locked" error messages (sqlite errors), likely due to high disk io during unrar and repairs, although these were much shorter lived than with the btrfs cache pool and they did not cause any issues apart from the log messages.
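For anyone wondering how slow disk io can surface as a sqlite error: sqlite allows one writer at a time, and a second connection gives up with "database is locked" if the lock is not released within its busy timeout. The following is a minimal sketch that reproduces the error by deliberately holding the write lock. Whether Sonarr and Radarr hit exactly this path is an assumption, and the function name and the 0.1 second timeout are chosen only for the demonstration.

```python
import os
import sqlite3
import tempfile


def demo_locked_error():
    """Hold a write lock on one connection and try to write from another."""
    path = os.path.join(tempfile.mkdtemp(), "demo.db")

    # isolation_level=None leaves transaction control to the explicit SQL below.
    writer = sqlite3.connect(path, isolation_level=None)
    writer.execute("CREATE TABLE t (x INTEGER)")
    writer.execute("BEGIN IMMEDIATE")  # take the write lock and keep it
    writer.execute("INSERT INTO t VALUES (1)")

    # Stand-in for a write that is stalled on slow storage: the lock above
    # stays held while the second connection waits at most 0.1 seconds.
    other = sqlite3.connect(path, timeout=0.1, isolation_level=None)
    try:
        other.execute("BEGIN IMMEDIATE")
        return None
    except sqlite3.OperationalError as exc:
        return str(exc)
    finally:
        other.close()
        writer.execute("ROLLBACK")
        writer.close()


print(demo_locked_error())
```

The longer a write is held up by the storage layer, the more likely other connections are to run out their timeout, which would fit the errors being shorter lived on a single drive.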

Then I converted the drive to xfs and have not had any error messages logged.

I am convinced that the disk io problem is due to btrfs. The issues are much worse in a raid 0 config compared to a single btrfs drive.


August 23, 2017

I am having a similar issue with my cache drive, which is btrfs as well. Simple things like copying a file over the network onto the cache drive kill my docker container speeds, load shoots up to 12.00, and ssh is very slow even for ls commands.

If mover ever runs while copying, well, dockers start crashing.

How do you convert the cache from btrfs to xfs? That sounds like it improved things. Can we still mirror without btrfs? I was looking at picking up a second ssd.


August 23, 2017

1 hour ago, Maticks said:

How do you convert the cache from btrfs to xfs? That sounds like it improved things. Can we still mirror without btrfs?

Conversion erases everything on the drive, so back up what you need from the cache drive before doing it. When you are ready, stop the array, click on the cache drive on the main gui screen and change the format to XFS. When you start the array, you will see a checkbox asking for confirmation to format all unmountable drives. As long as the cache drive is the only one listed, check the box and click format.

XFS is single volume only; no software mirror is available.


August 23, 2017

I think this is a serious bug which should be addressed. Isn't BTRFS the recommended way to run a cache array? Right now it's working poorly, with all these crashes when writing large files.


August 23, 2017

Author

Maticks said:

I am having a similar issue with my cache drive, which is btrfs as well. Simple things like copying a file over the network onto the cache drive kill my docker container speeds, load shoots up to 12.00, and ssh is very slow even for ls commands.
If mover ever runs while copying, well, dockers start crashing.

How do you convert the cache from btrfs to xfs? That sounds like it improved things. Can we still mirror without btrfs? I was looking at picking up a second ssd.

What I did was
1) mount a second ssd through unassigned devices plugin,
2) shut down all Dockers and VMs (turn off the services in the settings so they don't automatically restart when the array starts),
3) rsync all data from cache to unassigned device (rsync preserves permissions, timestamps, etc. with the option "a"),
4) stop the array,
5) change the disk format from btrfs to xfs and
6) restart the array.

It will format the cache drive, which takes about a minute. Then you can transfer your data back to the cache drive and enable the docker and VM services

If you don't have a spare ssd, you can rsync to an array disk as well. Make sure you use a disk share and not a user share for that (i.e. /mnt/diskX).
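The copy in step 3 can be sketched as follows. This is a minimal sketch and not the exact command used above. The helper names and both paths are assumptions (/mnt/cache for the cache mount and a hypothetical /mnt/disks/spare_ssd for the unassigned device), and --dry-run with -v is added so the first pass only lists what would be copied.

```python
import subprocess


def build_rsync_command(source, destination, dry_run=True):
    """Build an rsync argument list; -a preserves permissions, timestamps, etc."""
    command = ["rsync", "-a"]
    if dry_run:
        # List what would be transferred without writing anything.
        command += ["--dry-run", "-v"]
    # A trailing slash on the source copies its contents rather than
    # the directory itself.
    command += [source.rstrip("/") + "/", destination.rstrip("/") + "/"]
    return command


def run(command):
    """Run the command and raise if rsync exits with an error."""
    subprocess.run(command, check=True)


# Hypothetical paths; adjust to the actual mount points before running.
print(build_rsync_command("/mnt/cache", "/mnt/disks/spare_ssd/cache_backup"))
```

Once the dry run lists what you expect, build the command again with dry_run=False and pass it to run(). Copying back after the format is the same call with source and destination swapped.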


August 24, 2017

I can second this now: since moving to xfs on cache, things are running much smoother.

Transferring a file over the network while running mover is fine now.

No Dockers are crashing like before and the system is still responsive. Load never exceeds 8.00.

ssh is completely responsive as well. Should this be raised as a bug to be looked at?

