Large copy/write on btrfs cache pool locking up server temporarily



Hi,

 

I have a btrfs cache pool of 4 ssd drives, which hosts the docker image as well as downloaded data. I noticed that some of the docker apps were occasionally having issues like "database locked" and "write error, disk full?", etc.

 

After thorough testing, I realized that whenever there is a large file transfer on the cache pool, where the file is both read from and written to the cache drives, the server temporarily locks up. The Unraid GUI is unreachable, the docker apps stall and their GUIs are unreachable, and ssh access is slow and occasionally hangs in the middle of a basic operation like "ps -ef". This continues until the file transfer is over, and a few minutes later everything is back to normal. It seems to happen with files larger than about 10GB.

 

A typical scenario: sabnzbd downloads a file over 10GB, and during unrar of the file to the cache pool, everything else is locked up. Then, when radarr moves the file from the sab temp folder to the Movies folder on the cache pool, again everything else is locked up.

 

Because of the lockup, it is difficult to troubleshoot. I don't know what else to try or test.

 

I have sabnzbd pinned to only certain cores and not all of them, so the issue should not be due to high CPU utilization during unrar. Therefore I believe it is disk I/O being fully consumed by the file transfer.
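
 

One way to confirm the disk I/O theory is to watch the drives while a transfer is in progress. A minimal sketch (iostat comes from the sysstat package, which as far as I know is not in stock Unraid and would need to be added via something like the NerdPack plugin; the /proc checks need nothing extra):

# Per-device utilization every 2 seconds; %util pinned near 100 on the
# cache members while everything else stalls points at disk I/O, not CPU.
iostat -x 2

# Without extra packages: load average plus any tasks stuck in
# uninterruptible I/O wait (state D).
cat /proc/loadavg
ps -eo state,pid,comm | awk '$1 == "D"'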

 

Attached are my diagnostics, which should include info from this morning's 20GB file transfer (the unrar operation failed a couple of times with a "disk full" message even though there was 100GB of free space, but then succeeded on the third try).

 

I would appreciate any ideas or suggestions. Thanks

 

PS. File transfers from the cache pool to the array are completely fine. Mover does not affect general server operations and neither does a regular copy to the array.

tower-diagnostics-20170627-1153.zip

Link to comment
On 6/27/2017 at 9:09 PM, aptalca said:

Load average goes up to 42 and stays there during transfer

 

Ok, sounds exactly like the issue I have been having since I changed my cache from 1 to 2 disks :-(

 

When it's at 40-ish or above, VMs / Dockers begin crashing until I stop the write command.

Link to comment

After my last post I realized that my 2 VMs went into a paused state (they are never supposed to sleep).

 

I guess I didn't realize this before because most of the files I was downloading previously were in the 4-6GB range so the issue was minimal.

 

Lately I started downloading larger files in the 10-24GB range and now the issues are a lot more pronounced.

 

 

Link to comment

Any ideas on how we can diagnose this, @johnnie.black? Like @aptalca mentioned, it's hard to do any diagnosis, as the system is unresponsive when it happens. I can easily reproduce this by writing a large file to the cache array, but then the system becomes unresponsive (load average keeps rising).
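
 

For anyone who wants to reproduce it, a large sequential write onto the pool seems to be enough; something along these lines (the 20GB size and the /mnt/cache/testfile path are just examples, adjust to taste):

# Write a ~20GB file straight onto the cache pool and flush it at the end.
dd if=/dev/zero of=/mnt/cache/testfile bs=1M count=20480 conv=fsync

# In a second shell, sample the load average every few seconds.
watch -n 5 cat /proc/loadavg

# Clean up afterwards.
rm /mnt/cache/testfile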

 

I have basically stopped using my cache array for bigger file transfers, as this problem causes the whole system to lock up.

Link to comment

I just got done testing. I am now using a single cache drive, btrfs, a 500GB Samsung EVO (3D NAND and pretty fast).

 

I tried a copy from cache to cache with a 24GB image file. The server didn't break a sweat; the load average did not go above 6, which is perfectly fine on my 8-core, 16-thread machine.

 

Then I tried a sab download, a 7.5GB file. During unrar, the load average was at about 1.6. Then, while radarr was renaming and moving (cache to cache), the load average went up to about 4 and stayed there.

 

I'll try a larger file download in the near future to simulate the scenario where I had the issue with a cache pool. But so far, it looks like my issues are gone with a single cache drive.

Link to comment
  • 3 weeks later...

@aptalca Sorry for bugging you. But how is your test progressing?

I'm in the same boat as you at the moment. I can conclude that with a single drive everything works properly, but in raid1 things start to become unstable. This pretty much sucks, as it renders the cache functionality useless. But this has to be a configuration issue? Many people are running with several drives, and they don't report these issues...

Link to comment
  • 5 weeks later...

Another update.

 

After switching to a single btrfs cache drive, I continued to have minor issues.

 

Sonarr and Radarr still logged "database locked" error messages (sqlite errors), likely due to high disk I/O during unrar and repairs, although these were much shorter lived than with the btrfs cache pool and did not cause any issues apart from the log messages.

 

Then I converted the drive to xfs and have not had any error messages logged.

 

I am convinced that the disk I/O issue is due to btrfs. The issues are much worse in a raid 0 pool than with a single btrfs drive.

Link to comment

I am having a similar issue with my cache drive, which is btrfs as well. Simple things like copying a file over the network to the cache drive kill my docker container speeds, the load shoots up to 12.00, and ssh is very slow even doing ls commands.

If mover ever runs while copying, dockers start crashing.

 

How do you convert the cache from btrfs to xfs? That sounds like it improved things. Can we still mirror without btrfs? I was looking at picking up a second ssd.

Link to comment
1 hour ago, Maticks said:

How do you convert the cache from btrfs to xfs? That sounds like it improved things. Can we still mirror without btrfs?

Conversion erases everything on the drive, so back up what you need from the cache drive before doing it. When you are ready, stop the array, click on the cache drive in the main GUI screen, and change the format to XFS. When you start the array, you will see a checkbox to confirm formatting all unmountable drives. As long as the cache drive is the only one listed, check the box and click format.

 

XFS is single volume only, no software mirror available.

Link to comment
What I did was:
1) mount a second ssd through the Unassigned Devices plugin,
2) shut down all dockers and VMs (turn off the services in the settings so they don't automatically restart when the array starts),
3) rsync all data from the cache to the unassigned device (rsync preserves permissions, timestamps, etc. with the -a option; see the sketch below),
4) stop the array,
5) change the disk format from btrfs to xfs, and
6) restart the array.

It will format the cache drive, which takes about a minute. Then you can transfer your data back to the cache drive and enable the docker and VM services again.

If you don't have a spare ssd, you can rsync to an array disk as well. Make sure you use a disk share and not a user share for that (i.e. /mnt/diskX).
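
In command form, step 3 and the later copy back look roughly like this (the /mnt/disks/backup_ssd mount point is just an example of where the unassigned device might appear; substitute your own paths):

# Step 3: copy everything from the cache pool to the unassigned ssd,
# preserving permissions, ownership and timestamps (-a).
rsync -avh /mnt/cache/ /mnt/disks/backup_ssd/

# After the reformat (steps 4-6), copy the data back the other way.
rsync -avh /mnt/disks/backup_ssd/ /mnt/cache/

# If you rsync to an array disk instead, use the disk share, not a user share:
# rsync -avh /mnt/cache/ /mnt/disk1/cache_backup/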
Link to comment

I can second this now; since moving to xfs on the cache, things are running much smoother.

Transferring a file over the network while mover is running is fine now.

No dockers are crashing like before and the system is still responsive. Load never exceeds 8.00.

ssh is completely responsive as well. Should this be raised as a bug to be looked at?

 

Link to comment
