Cache Drive Failure?



Hi

I think I'm having a BTRFS issue.

2x240GB SSD

unRAID 6.2.4

 

When I start to move files over to the cache drive, it sets /mnt/cache to read-only, causing all my Dockers to fail.

 

I'm seeing this in the syslog:

 

Jan 31 01:11:28 Tower kernel: ------------[ cut here ]------------
Jan 31 01:11:28 Tower kernel: WARNING: CPU: 7 PID: 26931 at fs/btrfs/extent-tree.c:4180 btrfs_free_reserved_data_space_noquota+0x5b/0x7b()
Jan 31 01:11:28 Tower kernel: Modules linked in: xt_nat veth ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_conntrack_ipv4 nf_nat_ipv4 iptable_filter ip_tables nf_nat md_mod e1000e ptp pps_core coretemp kvm_intel kvm mpt3sas ahci raid_class i2c_i801 i2c_core libahci i5500_temp scsi_transport_sas acpi_cpufreq [last unloaded: ipmi_devintf]
Jan 31 01:11:28 Tower kernel: CPU: 7 PID: 26931 Comm: kworker/u50:13 Tainted: G        W       4.4.30-unRAID #2
Jan 31 01:11:28 Tower kernel: Hardware name: Supermicro X8DT6/X8DT6, BIOS 2.0c    05/15/2012
Jan 31 01:11:28 Tower kernel: Workqueue: writeback wb_workfn (flush-btrfs-18)
Jan 31 01:11:28 Tower kernel: 0000000000000000 ffff880566523600 ffffffff8136f79f 0000000000000000
Jan 31 01:11:28 Tower kernel: 0000000000001054 ffff880566523638 ffffffff8104a4ab ffffffff812ada13
Jan 31 01:11:28 Tower kernel: 0000000000001000 ffff880603277400 ffff88062ff7a960 ffff880566523734
Jan 31 01:11:28 Tower kernel: Call Trace:
Jan 31 01:11:28 Tower kernel: [<ffffffff8136f79f>] dump_stack+0x61/0x7e
Jan 31 01:11:28 Tower kernel: [<ffffffff8104a4ab>] warn_slowpath_common+0x8f/0xa8
Jan 31 01:11:28 Tower kernel: [<ffffffff812ada13>] ? btrfs_free_reserved_data_space_noquota+0x5b/0x7b
Jan 31 01:11:28 Tower kernel: [<ffffffff8104a568>] warn_slowpath_null+0x15/0x17
Jan 31 01:11:28 Tower kernel: [<ffffffff812ada13>] btrfs_free_reserved_data_space_noquota+0x5b/0x7b
Jan 31 01:11:28 Tower kernel: [<ffffffff812c4e16>] btrfs_clear_bit_hook+0x143/0x272
Jan 31 01:11:28 Tower kernel: [<ffffffff812db58b>] clear_state_bit+0x8b/0x155
Jan 31 01:11:28 Tower kernel: [<ffffffff812db88d>] __clear_extent_bit+0x238/0x2c3
Jan 31 01:11:28 Tower kernel: [<ffffffff812dbd49>] clear_extent_bit+0x12/0x14
Jan 31 01:11:28 Tower kernel: [<ffffffff812dc2dc>] extent_clear_unlock_delalloc+0x46/0x18f
Jan 31 01:11:28 Tower kernel: [<ffffffff8111f019>] ? igrab+0x32/0x46
Jan 31 01:11:28 Tower kernel: [<ffffffff812d8c93>] ? __btrfs_add_ordered_extent+0x288/0x2cf
Jan 31 01:11:28 Tower kernel: [<ffffffff812c8c13>] cow_file_range+0x300/0x3bd
Jan 31 01:11:28 Tower kernel: [<ffffffff812c988f>] run_delalloc_range+0x321/0x331
Jan 31 01:11:28 Tower kernel: [<ffffffff812dc915>] writepage_delalloc.isra.14+0xaa/0x126
Jan 31 01:11:28 Tower kernel: [<ffffffff812dea19>] __extent_writepage+0x150/0x1f7
Jan 31 01:11:28 Tower kernel: [<ffffffff812ded16>] extent_write_cache_pages.isra.10.constprop.24+0x256/0x30c
Jan 31 01:11:28 Tower kernel: [<ffffffff812da4da>] ? submit_one_bio+0x81/0x88
Jan 31 01:11:28 Tower kernel: [<ffffffff812df214>] extent_writepages+0x46/0x57
Jan 31 01:11:28 Tower kernel: [<ffffffff812c69ca>] ? btrfs_direct_IO+0x28e/0x28e
Jan 31 01:11:28 Tower kernel: [<ffffffff812c555f>] btrfs_writepages+0x23/0x25
Jan 31 01:11:28 Tower kernel: [<ffffffff810c3738>] do_writepages+0x1b/0x24
Jan 31 01:11:28 Tower kernel: [<ffffffff8112a5d4>] __writeback_single_inode+0x3d/0x151
Jan 31 01:11:28 Tower kernel: [<ffffffff8112ab89>] writeback_sb_inodes+0x20d/0x3ad
Jan 31 01:11:28 Tower kernel: [<ffffffff8112ad9a>] __writeback_inodes_wb+0x71/0xa9
Jan 31 01:11:28 Tower kernel: [<ffffffff8112af80>] wb_writeback+0x10b/0x195
Jan 31 01:11:28 Tower kernel: [<ffffffff8112b4c9>] wb_workfn+0x18e/0x22b
Jan 31 01:11:28 Tower kernel: [<ffffffff8112b4c9>] ? wb_workfn+0x18e/0x22b
Jan 31 01:11:28 Tower kernel: [<ffffffff8105aede>] process_one_work+0x194/0x2a0
Jan 31 01:11:28 Tower kernel: [<ffffffff8105b894>] worker_thread+0x26b/0x353
Jan 31 01:11:28 Tower kernel: [<ffffffff8105b629>] ? rescuer_thread+0x285/0x285
Jan 31 01:11:28 Tower kernel: [<ffffffff8105fb24>] kthread+0xcd/0xd5
Jan 31 01:11:28 Tower kernel: [<ffffffff8105fa57>] ? kthread_worker_fn+0x137/0x137
Jan 31 01:11:28 Tower kernel: [<ffffffff81629f7f>] ret_from_fork+0x3f/0x70
Jan 31 01:11:28 Tower kernel: [<ffffffff8105fa57>] ? kthread_worker_fn+0x137/0x137
Jan 31 01:11:28 Tower kernel: ---[ end trace cc2c8a28b871c88c ]---

 

root@Tower:/mnt/cache# touch test.txt
touch: cannot touch 'test.txt': No space left on device

 

This appears to happen when the cache drive gets about 40-50% full. When I "move" files off, it appears to go away. I have run the scrub tool several times without any errors.

 

Is it possible that my SSDs are going bad?

 

Thanks

Dave

Link to comment

Sorry...  Posting below

 

root@Tower:/var/log# btrfs fi show /mnt/cache
Label: none  uuid: e2234ca5-51be-4ccb-9302-3194731c77e3
        Total devices 2 FS bytes used 67.43GiB
        devid    1 size 223.57GiB used 223.57GiB path /dev/sdg1
        devid    2 size 223.57GiB used 223.57GiB path /dev/sdh1

root@Tower:/var/log# btrfs device stats /mnt/cache
[/dev/sdg1].write_io_errs   0
[/dev/sdg1].read_io_errs    0
[/dev/sdg1].flush_io_errs   0
[/dev/sdg1].corruption_errs 0
[/dev/sdg1].generation_errs 0
[/dev/sdh1].write_io_errs   0
[/dev/sdh1].read_io_errs    0
[/dev/sdh1].flush_io_errs   0
[/dev/sdh1].corruption_errs 0
[/dev/sdh1].generation_errs 0

 

Currently, I can write to the cache, but I'm still getting the kernel warnings from fs/btrfs/extent-tree.

 

Thanks

Dave

Link to comment

Stopped the array, disabled Docker (set to no), deleted the docker.img file.  Tried to run the balance command, but got an error.

 

root@Tower:/mnt/cache# rm docker.img
root@Tower:/mnt/cache# ls -ltr
total 0
drwxrwxrwx 1 nobody users 246 Jan 30 22:41 appdata/
drwxrwxrwx 1 nobody users  16 Jan 31 01:00 Download/
drwxrwxrwx 1 nobody users  32 Jan 31 01:00 Movies/
drwxrwxrwx 1 nobody users  24 Jan 31 01:00 TV/
drwxrwxrwx 1 nobody users 240 Jan 31 08:36 transcode/

 

 

root@Tower:/mnt/cache# btrfs balance start -dusage=75 /mnt/cache

ERROR: error during balancing '/mnt/cache': No space left on device

There may be more info in syslog - try dmesg | tail

 

 

root@Tower:/mnt/cache# btrfs fi show /mnt/cache
Label: none  uuid: e2234ca5-51be-4ccb-9302-3194731c77e3
        Total devices 2 FS bytes used 30.86GiB
        devid    1 size 223.57GiB used 223.57GiB path /dev/sdg1
        devid    2 size 223.57GiB used 223.57GiB path /dev/sdh1

root@Tower:/mnt/cache# btrfs fi df /mnt/cache
Data, RAID1: total=222.54GiB, used=30.15GiB
System, RAID1: total=32.00MiB, used=48.00KiB
Metadata, RAID1: total=1.00GiB, used=725.61MiB
GlobalReserve, single: total=256.00MiB, used=0.00B
root@Tower:/mnt/cache# btrfs device stats /mnt/cache
[/dev/sdg1].write_io_errs   0
[/dev/sdg1].read_io_errs    0
[/dev/sdg1].flush_io_errs   0
[/dev/sdg1].corruption_errs 0
[/dev/sdg1].generation_errs 0
[/dev/sdh1].write_io_errs   0
[/dev/sdh1].read_io_errs    0
[/dev/sdh1].flush_io_errs   0
[/dev/sdh1].corruption_errs 0
[/dev/sdh1].generation_errs 0

 

Link to comment

I was afraid of that, as almost all space is allocated. v6.3 includes a more recent kernel and btrfs-progs that are a lot better at avoiding this.

 

You can try to delete some files (the larger the better) and run the balance again. If it still doesn't work, it's best to back up the cache and re-format it. You can follow this procedure, but format the cache instead of replacing it:

 

https://lime-technology.com/forum/index.php?topic=48508.msg516110#msg516110
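
For the backup itself, something along these lines is one option (just a sketch; the destination path is only an example, pick any array disk with enough free space):

# Copy everything off the cache to an array disk before the re-format
rsync -avX --progress /mnt/cache/ /mnt/disk1/cache_backup/
# After re-formatting, the same command in reverse restores the data:
# rsync -avX --progress /mnt/disk1/cache_backup/ /mnt/cache/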

Link to comment

Okay, trying to format the cache drives.

 

Is there a how-to on the format? I don't see any option to format the drives. I have removed them, started the array, stopped the array, then added them back, but it does not look like it has formatted them. Didn't find much with a search either.

 

Thanks

Dave

Link to comment

You can follow the rest of the cache replace procedure to move appdata back.

 

It's important to regularly monitor the file system slack, especially on unRAID <6.3. When there's a 20% or bigger difference between total and used, run a balance so the same thing doesn't happen again.

 

This is bad:

 

Data, RAID1: total=222.54GiB, used=66.79GiB

 

This is good:

 

Data, RAID10: total=906.00GiB, used=882.08GiB

 

Link to comment

So I want to be clear about what you're saying.

 

Hit the "Balance" button when in the cache drive, under "Balance Status"?

 

Or, run the command you gave me?

btrfs balance start -dusage=75 /mnt/cache

 

If it's the command, can I put it in a cron job weekly? Or will 6.3 be out soon enough?

 

Thanks

Dave

Link to comment

You can use the command on the cache page, but by default it will run a full balance, so it will take longer. The command I gave only balances chunks where data usage is at 75% or below, so it's faster, and you can use lower values: 50, 25, etc. You can also replace the default command options (-dconvert=raid1 -mconvert=raid1) with other filters, e.g. (-dusage=50).
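
To illustrate, you can also run it in stages, starting with a small filter and only raising it if not enough chunks get freed (just a sketch):

# Balance in stages so as few chunks as possible get rewritten
btrfs balance start -dusage=25 /mnt/cache
btrfs fi df /mnt/cache                        # check how much was reclaimed
btrfs balance start -dusage=50 /mnt/cache     # raise the filter only if needed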

 

No need to do it daily; monitor the file system for a few days/weeks and see how it progresses. A balance will re-write the data (all the data if it's a full balance), so it will cause wear on the SSDs.

Link to comment

This ought to be a FAQ. I just rebalanced my cache after reading this thread.

 

Before:

root@Lapulapu:~# btrfs fi df /mnt/cache

Data, RAID1: total=236.44GiB, used=44.44GiB

System, RAID1: total=32.00MiB, used=64.00KiB

Metadata, RAID1: total=2.00GiB, used=303.91MiB

GlobalReserve, single: total=38.64MiB, used=0.00B

 

After:

root@Lapulapu:~# btrfs fi df /mnt/cache

Data, RAID1: total=45.00GiB, used=44.43GiB

System, RAID1: total=32.00MiB, used=16.00KiB

Metadata, RAID1: total=1.00GiB, used=297.05MiB

GlobalReserve, single: total=31.78MiB, used=0.00B

Link to comment

Looks like it should be pretty easy to parse the output of the fi df command and recommend whether or not to balance. Squid, how about a FCP addon?
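
Something like this quick parse of the Data line could be a starting point (just a sketch, and it assumes the GiB units shown in the outputs above):

# Print total/used from the Data line of `btrfs fi df` plus the slack percentage
btrfs fi df /mnt/cache | awk -F'[=, ]+' '/^Data/ {
    gsub(/GiB/, "")                       # strip the unit so awk can do the math
    printf "Data total=%sGiB used=%sGiB slack=%.0f%%\n", $4, $6, ($4 - $6) / $4 * 100
}'
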
Link to comment

Yes, maybe a FCP warning would be best. Note that normally btrfs should delete unused data chunks and a manual balance should not be needed, but in practice this does not always happen, especially on older kernels, so it's a good idea to keep ahead of possible file-system-full issues.

 

Issues can happen when all available space is allocated like what happened to the OP:

 

devid    1 size 223.57GiB used 223.57GiB path /dev/sdg1
devid    2 size 223.57GiB used 223.57GiB path /dev/sdh1

 

So the filesystem is completely allocated. Here we can see the actual used space:

 

Data, RAID1: total=222.54GiB, used=66.79GiB
System, RAID1: total=32.00MiB, used=48.00KiB
Metadata, RAID1: total=1.00GiB, used=728.33MiB
GlobalReserve, single: total=256.00MiB, used=0.00B

 

The totals are the different allocated chunk types, mainly data and metadata, and used is the actual used space. Issues arise when the file system is fully allocated and metadata is close to full: for any new write btrfs tries to allocate a new metadata chunk and fails, resulting in an out-of-space error.

 

So although it's good practice to keep the slack (difference between total and used) low, it's only a problem when the device is almost fully allocated. Ideally there should be a suggestion to run a balance when there's a big slack, and a warning when there's slack and the device is almost fully allocated.
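
For example, a one-liner like this (just a sketch) would flag pool members that are fully allocated:

# List devices whose allocated space equals their size, i.e. the condition above
btrfs fi show /mnt/cache | awk '/devid/ && $4 == $6 { print $8, "is fully allocated (" $4 ")" }'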

 

Link to comment

This is fascinating stuff. So while my cache pool had a lot of slack

 

Data, RAID1: total=236.44GiB, used=44.44GiB

 

there was still sufficient free metadata space

 

Metadata, RAID1: total=2.00GiB, used=303.91MiB

 

for it not to be an immediate problem, while the OP was running dangerously low on free metadata space.

 

Metadata, RAID1: total=1.00GiB, used=728.33MiB

 

I've just installed 6.3.0rc9 on this server so I'll be interested to see if it becomes unbalanced again, and if so, how quickly.

 

Link to comment

Johnnie, let me know what commands and output I'm looking for.

 

For now I think you should just add a warning if there's some slack and the allocated space is near the maximum. There's a newer command that will eventually replace btrfs fi df and gives all the info needed; I wasn't sure if it was available on v6.2, but it is:

 

btrfs fi usage /mnt/cache

 

Overall:
    Device size:                  2.79TiB
    Device allocated:              1.78TiB
    Device unallocated:            1.01TiB
    Device missing:                  0.00B
    Used:                          1.57TiB
    Free (estimated):            624.84GiB      (min: 624.84GiB)
    Data ratio:                      2.00
    Metadata ratio:                  2.00
    Global reserve:              512.00MiB      (used: 0.00B)

 

There is also per-device info, but it's not needed for this; the values of interest are "Device unallocated" and the slack (the difference between "Used" and "Device allocated").

 

So, based on these values, give a warning if there's significant slack (e.g. the difference between used and allocated is larger than 20%) AND unallocated space is 5% or less of the total device size; the user should then run a balance to reclaim unused chunks, e.g.:

 

btrfs balance start -dusage=50 /mnt/cache

 

50 means it will only balance chunks that are 50% or less occupied; this should be enough to reclaim most unused space, but it can be changed to a higher value if needed.
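
As a rough sketch of that warning rule (assuming your btrfs-progs supports the -b raw-bytes flag, so no unit conversion is needed):

# Warn when slack is over 20% of allocated space AND unallocated space is 5% or less of the device
btrfs fi usage -b /mnt/cache | awk '
    /Device size:/        { size = $3 }
    /Device allocated:/   { alloc = $3 }
    /Device unallocated:/ { unalloc = $3 }
    /^ *Used:/            { used = $2 }
    END {
        slack = (alloc - used) / alloc * 100
        if (slack > 20 && unalloc <= size * 0.05)
            print "WARNING: large slack and almost no unallocated space - run a balance"
        else
            printf "OK (slack %.0f%%, unallocated %.0f%% of device)\n", slack, unalloc / size * 100
    }'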

 

As for the suggestion to run a balance based on slack alone, let me monitor some devices for a while and see how they behave, because in part due to how the cache is used, i.e. constantly filled and emptied by the mover, some slack may be unavoidable and not an issue as long as there is enough unallocated space.

Link to comment

 

As for the suggestion to run a balance based on slack alone, let me monitor some devices for a while and see how they behave, because in part due to how the cache is used, i.e. constantly filled and emptied by the mover, some slack may be unavoidable and not an issue as long as there is enough unallocated space.

 

I do run the mover every morning, as I can easily fill up the cache in a day.

Link to comment

About 80GiB was put on the cache today, then I ran the Mover:

Data, RAID1: total=122.00GiB, used=46.53GiB
Data, single: total=1.00GiB, used=0.00B
System, RAID1: total=32.00MiB, used=48.00KiB
Metadata, RAID1: total=2.00GiB, used=629.91MiB
GlobalReserve, single: total=224.00MiB, used=1.83MiB

 

After running Balance

Data, RAID1: total=48.00GiB, used=46.52GiB
System, RAID1: total=32.00MiB, used=16.00KiB
Metadata, RAID1: total=1.00GiB, used=634.72MiB
GlobalReserve, single: total=224.00MiB, used=0.00B

 

 

Do you still think I don't need to run the balance in a cron job?

Link to comment
  • 1 year later...

Hello,

 

It seems I have a similar problem...

First, Transmission was not able to download, saying "No space left on device", so I tried to back it up, but I get the same error in MC.

[screenshot: MC reporting "No space left on device"]

Also, not sure why I got "user0" with the same folders as "user".

[screenshot: /mnt showing "user0" with the same folders as "user"]

 

I also tried those commands, but still get the "no space" error:

 

root@unRAIDTower:~# btrfs fi show /mnt/cache
Label: none  uuid: 2bc7fced-04dc-491d-9449-09d79d7c8f5e
        Total devices 1 FS bytes used 74.12GiB
        devid    1 size 111.79GiB used 84.02GiB path /dev/sdd1

root@unRAIDTower:~# btrfs fi df /mnt/cache
Data, single: total=83.01GiB, used=73.98GiB
System, single: total=4.00MiB, used=16.00KiB
Metadata, single: total=1.01GiB, used=144.22MiB
GlobalReserve, single: total=65.17MiB, used=0.00B
root@unRAIDTower:~# btrfs device stats /mnt/cache
[/dev/sdd1].write_io_errs    0
[/dev/sdd1].read_io_errs     0
[/dev/sdd1].flush_io_errs    0
[/dev/sdd1].corruption_errs  0
[/dev/sdd1].generation_errs  0
root@unRAIDTower:~# btrfs balance start -dusage=75 /mnt/cache
Done, had to relocate 7 out of 87 chunks
root@unRAIDTower:~#

 

Any clue what else to try? Or should I just format the whole cache drive?

 

Thanks

 

Link to comment
5 hours ago, killeriq said:

also not sure why i got "user0" with same folders as "user"

Not the same: user0 excludes the cache, user includes the cache.
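
A quick way to see the difference (the share name here is just an example):

ls /mnt/user/Download     # array + cache: files still sitting on the cache show up here
ls /mnt/user0/Download    # array only: the cache is excluded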

 

5 hours ago, killeriq said:

Tried also those commands, but still "no space" error

Also not the same problem; your cache still has available space.

Link to comment
