BTRFS remove on cache drive

January 13, 2025

I wanted to replace my cache drives. They have been running great, but all of the drives in the pool are now 97 months old, so I figured it was time to be proactive. The pool was one 512GB and two 240GB drives.

Instead of a "replace" command, I chose to use "add" and then "remove". I successfully added a fourth drive (240GB, using "btrfs device add"). However, when I then tried to remove the old drive, I got an error. I ran a balance, but it didn't help:

root@Tower:~# btrfs balance start -dusage=75 /mnt/cache/
Done, had to relocate 0 out of 258 chunks
root@Tower:~# btrfs fi us -T /mnt/cache
Overall:
    Device size:                   1.12TiB
    Device allocated:            514.06GiB
    Device unallocated:          633.59GiB
    Device missing:                  0.00B
    Device slack:                    0.00B
    Used:                        494.13GiB
    Free (estimated):            325.41GiB      (min: 325.41GiB)
    Free (statfs, df):           289.73GiB
    Data ratio:                       2.00
    Metadata ratio:                   2.00
    Global reserve:              426.64MiB      (used: 0.00B)
    Multiple profiles:                  no

             Data      Metadata  System
Id Path      RAID1     RAID1     RAID1    Unallocated Total     Slack
-- --------- --------- --------- -------- ----------- --------- -----
 1 /dev/sdk1  92.00GiB         -        -   131.57GiB 223.57GiB     -
 3 /dev/sdf1  90.00GiB   1.00GiB        -   132.57GiB 223.57GiB     -
 4 /dev/sdl1 255.00GiB   2.00GiB 32.00MiB   219.91GiB 476.94GiB     -
 5 /dev/sdo1  73.00GiB   1.00GiB 32.00MiB   149.54GiB 223.57GiB     -
-- --------- --------- --------- -------- ----------- --------- -----
   Total     255.00GiB   2.00GiB 32.00MiB   633.59GiB   1.12TiB 0.00B
   Used      246.38GiB 703.03MiB 80.00KiB
root@Tower:~# btrfs device remove /dev/sdk /mnt/cache
ERROR: error removing device '/dev/sdk': No data available
root@Tower:~#

I had backed up the cache contents, but is there any way to do this without deleting and entirely recreating my cache pool?

January 13, 2025

Author

Whoops. I had specified the whole drive, not the partition. "btrfs device remove /dev/sdk1 /mnt/cache" did the trick.
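
For anyone hitting the same error: in this thread the pool listed its members by partition (/dev/sdk1), so passing the whole disk (/dev/sdk) did not match. The sketch below is only an illustration of checking the path before running the remove. `is_pool_member` is a hypothetical helper name, not part of btrfs-progs, and it only parses the text that "btrfs filesystem show" prints, which it reads from stdin.

```shell
# Hypothetical helper (not part of btrfs-progs): succeeds only if the exact
# device path given as $1 is listed on a "devid ... path <device>" line of
# `btrfs filesystem show` output supplied on stdin.
is_pool_member() {
    awk -v dev="$1" '
        $1 == "devid" && $(NF-1) == "path" && $NF == dev { found = 1 }
        END { exit !found }   # exit status 0 when the path was listed
    '
}

# Usage sketch (needs btrfs-progs and a mounted pool, so left as a comment):
#   btrfs filesystem show /mnt/cache | is_pool_member /dev/sdk1 \
#       && btrfs device remove /dev/sdk1 /mnt/cache
```

With the pool from this thread, the check would pass for /dev/sdk1 and fail for /dev/sdk, which is the mistake made above.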

January 13, 2025

Community Expert

Note that you will need to re-import the pool for the GUI to be in sync. If you need help with that, let us know and post which release you are running.

January 13, 2025

Author

37 minutes ago, JorgeB said:

Note that you will need to re-import the pool for the GUI to be in sync. If you need help with that, let us know and post which release you are running.

Is it this? https://docs.unraid.net/unraid-os/manual/storage-management/

The device is now removed from the pool. You don't need to stop the array now, but at the next array stop you need to make Unraid forget the now-deleted member. To achieve that:

Stop the array
Unassign all pool devices
Start the array to make Unraid "forget" the pool config
(If the Docker and/or VM services were using that pool, it is best to disable those services before starting, or Unraid will recreate the images somewhere else, assuming they are using /mnt/user paths)

Stop the array (re-enable the Docker/VM services if disabled above)
Re-assign all pool members except the removed device
Start the array

After removing all of the cache drives and hitting 'start', it is taking forever for the GUI to come back. If I open another browser window, I can see that the drive pool is actually online (but not the cache pool yet). The GUI is still busy and shows "Starting...". I think that this is probably ok but will give it an hour or so and see what happens.

By the way, the quoted bit above was helpful, but I did find one oversight (at least for me). In addition to the above points about disabling Docker and VMs, I discovered that I had syslog pointing to my (now missing) cache pool. I was able to disable syslog while the array was being brought up, so it was not that big a deal.

Edited January 13, 2025 by tcharron

January 13, 2025

Community Expert

It shouldn't take a long time; post the current syslog.

January 13, 2025

Author

39 minutes ago, JorgeB said:

It shouldn't take a long time; post the current syslog.

Here it is.

FYI, I had added the /dev/sdo device to the btrfs pool last night, and everything was working OK with 4 drives in the pool. I removed sdk this morning, and everything was still working OK. It was only after I took the array down to tell Unraid that the pool only had 3 devices that the problem started. I see in this log that sdo became unmountable after I tried to bring the array back up. That should be recoverable given the redundancy of btrfs (I confirmed it had balanced properly before taking the array down), but I think it's a good thing that I have a copy of the cache contents elsewhere!

Let me know what you think.

syslog.zip

January 13, 2025

Community Expert

The last start I see is with all the pool devices unassigned, and it's full of spam.

January 13, 2025

Author

Not sure what spam you refer to, but I suspect you mean the syslog errors. Those were all related to having the syslog daemon point to the cache drive. That activity stopped when I disabled syslog.

The GUI did finally complete the process of bringing the array up, and the status now shows as "Started". There is a gap in the syslog entries of about 24 minutes:

Jan 13 10:23:31 Tower emhttpd: Starting File Activity...
... [errors related to the misconfigured syslog]
Jan 13 10:57:15 Tower file.activity: File Activity inotify starting
Jan 13 10:57:15 Tower inotifywait[8273]: Setting up watches.  Beware: since -r was given, this may take a while!
Jan 13 10:57:17 Tower root: Delaying execution of fix common problems scan for 10 minutes

I had 1) shut down the array; 2) removed all the cache devices; 3) started the array; 4) stopped the array; 5) added the cache devices; 6) restarted. I think I should have deleted the cache pool after step 2 and then recreated it. So I tried adding the cache pool again, including that step (i.e., deleting the cache pool and recreating it). Now, I get these in my log when bringing the system up:

Jan 13 12:28:02 Tower emhttpd: shcmd (1687651): mkdir -p /mnt/cache
Jan 13 12:28:02 Tower emhttpd: /sbin/btrfs filesystem show /dev/sdl1 2>&1
Jan 13 12:28:02 Tower emhttpd: Label: none  uuid: 100735db-0e88-4450-a406-40f3efdd2bb7
Jan 13 12:28:02 Tower emhttpd: #011Total devices 3 FS bytes used 247.08GiB
Jan 13 12:28:02 Tower emhttpd: #011devid    3 size 223.57GiB used 124.00GiB path /dev/sdf1
Jan 13 12:28:02 Tower emhttpd: #011devid    4 size 476.94GiB used 249.03GiB path /dev/sdl1
Jan 13 12:28:02 Tower emhttpd: #011devid    5 size 223.57GiB used 124.03GiB path /dev/sdo1
Jan 13 12:28:02 Tower emhttpd: /mnt/cache uuid: 100735db-0e88-4450-a406-40f3efdd2bb7
Jan 13 12:28:02 Tower emhttpd: shcmd (1687652): mount -t btrfs -o noatime,space_cache=v2 -U 100735db-0e88-4450-a406-40f3efdd2bb7 /mnt/cache
Jan 13 12:28:02 Tower root: mount: /mnt/cache: wrong fs type, bad option, bad superblock on /dev/sdk1, missing codepage or helper program, or other error.

What is surprising is the reference to "bad superblock on /dev/sdk1". sdk1 is the partition that was removed from the pool several hours earlier:

Jan 13 09:20:19 Tower kernel: BTRFS info (device sdk1): device deleted: /dev/sdk1

This is a bit academic for me since I can recover from backup of the cache, but solving it may be helpful for someone else. Let me know if there's any other info or tests that I can help with.

January 13, 2025

Community Expert

If the spam stopped, post new diags.

January 13, 2025

Author

The rsyslog was turned off before the syslog above was extracted.

I've just rebuilt the cache pool from scratch and it's working now. I'll reboot tonight as I need to remove some physical drives. I'll post back here if it is still slow to bring the array up.

January 15, 2025

Author

I am running 6.12.10.

When the array is brought online, it still takes forever. The array and cache are up, but something is causing a huge pause in activity, and the SMB shares are not available for around 30 minutes or more.

I looked at the "main" page in a browser, and can see the 'reads' count rising slowly for each of my drives (sequentially, not all at once). I was able to use 'ps' and figured out that it is caused by these processes:

UID        PID  PPID  C STIME TTY          TIME CMD
root      4277  4276  0 23:34 ?        00:00:01 find /mnt/disk4 -type d
root      4276 22055  0 23:34 ?        00:00:00 /bin/bash /usr/local/emhttp/plugins/file.activity/scripts/rc.file.activity start
root     22055 22054  0 23:14 ?        00:00:00 /bin/bash /usr/local/emhttp/plugins/file.activity/scripts/rc.file.activity start
root     22054 21657  0 23:14 ?        00:00:00 /bin/bash /usr/local/emhttp/plugins/file.activity/event/disks_mounted disks_mounted
root     21657 10575  0 23:14 ?        00:00:00 /bin/bash /usr/local/sbin/emhttp_event disks_mounted
root     10575     1  0 Jan13 ?        00:04:13 /usr/local/bin/emhttpd
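
The listing above is easier to read as a parent chain: each line's PPID is the PID of the line below it. A minimal sketch of following that chain automatically is shown here. `ancestry` is a hypothetical helper name, and it assumes the standard "ps -ef" column order (PID in column 2, PPID in column 3). On systems with procps, "ps -ef --forest" draws a similar tree without any scripting.

```shell
# Hypothetical helper: given `ps -ef` output on stdin, print the chain of
# PIDs from the PID passed as $1 up to its oldest listed ancestor.
ancestry() {
    awk -v pid="$1" '
        $2 ~ /^[0-9]+$/ { parent[$2] = $3 }   # skips the header line
        END {
            chain = pid
            while ((pid in parent) && !(pid in seen)) {
                seen[pid] = 1                 # guard against loops in bad input
                pid = parent[pid]
                chain = chain " <- " pid
            }
            print chain
        }
    '
}

# Usage sketch:
#   ps -ef | ancestry 4277
```

Run against the listing above, this would walk from the 'find' process (4277) through the two rc.file.activity shells and the disks_mounted event scripts to emhttpd (10575).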

I could see the 'find' command above progressing through each of my drives, and the 'read' counts rising in the web GUI. Once that finished with the last drive, the array status in the web GUI changed from "Starting..." to "Started", and the SMB shares became available. My syslog showed a 34 minute gap just now as I brought the array up:

Jan 14 23:14:15 Tower emhttpd: Starting File Activity...
Jan 14 23:14:16 Tower rsyslogd: [origin software="rsyslogd" swVersion="8.2102.0" x-pid="22046" x-info="https://www.rsyslog.com"] start
Jan 14 23:48:07 Tower file.activity: File Activity inotify starting
Jan 14 23:48:07 Tower inotifywait[25916]: Setting up watches.  Beware: since -r was given, this may take a while!
Jan 14 23:48:09 Tower unassigned.devices: Mounting 'Auto Mount' Devices...
Jan 14 23:48:09 Tower emhttpd: Starting services...
Jan 14 23:48:09 Tower emhttpd: shcmd (202405): /etc/rc.d/rc.samba restart
Jan 14 23:48:11 Tower root: Starting Samba:  /usr/sbin/smbd -D

I wonder if the problem is file.activity and the related inotify? That message didn't show up in the log until after the 34 minute pause, but that seems like the kind of thing that would be scanning files.

January 15, 2025

Community Expert

Try booting in safe mode first to rule out any plugin issues. If it's the same, post new diags after the server is working normally.

January 16, 2025

Author

The "slow array start" problem is caused by a problem with the file.activity plugin. That plugin was updated on November 25, 2024, but users will not be affected until the array is restarted.

I described the issue here: https://forums.unraid.net/topic/54808-file-activity-plugin-how-can-i-figure-out-what-keeps-spinning-up-my-disks/page/17/#findComment-1512201
