Can't stop array (BTRFS operation is running)


lepis0
Solved by JorgeB

Hi,

I replaced one of my cache disks with a new, larger one.

 

Now when I start the array I can't stop it, because it says stopping is disabled while a BTRFS operation is running.

The syslog has thousands of lines like:

unraid kernel: BTRFS info (device sdh1): found 1 extents, stage: update data pointers

 

The cache balance status shows:

[screenshot: cache balance status]

The Balance button does nothing.

 

The pool also looks normal:

[screenshot: pool status]

 

I have waited about a week now for that BTRFS operation to finish.

 

I have also taken a diagnostics zip; it's attached.

 

What should I do to get my cache back to a normal state?
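(For reference, a few read-only commands can show from the console what the pool is actually doing; the /mnt/cache path below is assumed to be the usual Unraid cache mount point, so substitute yours:)

# check whether a balance is running and how far along it is:
btrfs balance status /mnt/cache

# per-device allocation; re-running this periodically shows whether a
# device removal is still relocating chunks:
btrfs device usage /mnt/cache

# list the devices btrfs still considers part of the pool:
btrfs filesystem show /mnt/cache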

 

unraid-diagnostics-20221017-1449.zip

Link to comment

Hi,

The Nvidia drivers are waiting for a server reboot, but I can't reboot yet because I can't stop the array :D

 

Here is the output of the command:

root@unraid:~# btrfs dev del missing /mnt/cache
ERROR: unable to start device remove, another exclusive operation 'device remove' in progress
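(Side note: on reasonably recent kernels, the in-flight exclusive operation can also be read from sysfs; the <UUID> placeholder below is the filesystem uuid reported by "btrfs filesystem show":)

# find the filesystem UUID:
btrfs filesystem show /mnt/cache

# then read which exclusive operation the kernel reports,
# e.g. "device delete", "balance" or "none":
cat /sys/fs/btrfs/<UUID>/exclusive_operation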

 

Link to comment
  • 6 months later...
On 10/18/2022 at 3:36 PM, lepis0 said:

I'm backing up data now.

How can I reformat the pool?

 

 

Edit: I managed to reformat the pool.

 

Thank you JorgeB for the help :)

 Can you share how you managed to reformat the pool if you were unable to stop the array? I'm having the same issue right now and it's looking like a reformat is my only option.
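(In case it helps while waiting for an answer: outside of Unraid's GUI, recreating a btrfs pool from the console boils down to the sketch below, assuming everything has been backed up first and the old filesystem can be unmounted. Device names and the raid1 profile are placeholders, and Unraid will normally do the equivalent itself when you assign the devices and format the pool from the GUI.)

# unmount the pool (fails if something still holds it open):
umount /mnt/cache

# wipe the old filesystem signatures from each pool member (destructive!):
wipefs -a /dev/sdX1 /dev/sdY1

# recreate the pool, here as a two-device raid1 with label "cache":
mkfs.btrfs -f -L cache -d raid1 -m raid1 /dev/sdX1 /dev/sdY1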

Link to comment
  • 3 months later...

One of my servers has been doing a disk replace on a pool for a couple of weeks now... it's still moving extents etc. in the log, so I'm assuming it's doing what it's meant to. I guess it's moving them at 5 1/4" floppy speeds? Not sure why it's taking this long (I added a 4TB drive to replace a 350GB one). "btrfs fi show" tells me it's moved about 48GB in the last 36 hours... I miss dial-up, I think it was faster.

I need to reboot my server, but I'm scared of losing the pool if I do...
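(To gauge whether the relocation is actually progressing, something like this can be left running; the mount point is a placeholder:)

# refresh every 60s; the "used" figure on the outgoing device should shrink
# as chunks are relocated to the remaining members:
watch -n 60 'btrfs device usage /mnt/pool; echo; btrfs filesystem show /mnt/pool'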

Edited by miicar
Link to comment
  • 2 months later...
On 10/18/2022 at 11:53 AM, JorgeB said:

It's still looping on deleting the missing device. This is the first time I've seen this; I suggest backing up and re-formatting the pool.

I think I am having the same issue... it's been "removing" a drive for a month, over many reboots. The raid5 (yeah, I know... call me crazy) BTRFS pool is made up of old retired disks with plenty of repaired sectors and such. The pool is not meant to be fast or incredibly safe. It worked great for my needs for the last couple of years, as a Storj node and for surveillance camera recording (archived footage we want to keep gets moved to the Unraid array by the surveillance software), replacing really bad drives along the way without issue (maybe a day or two max for the rebuilds). Now I tried just pulling a smaller HDD, as it was the only one of its size left and I was happy with the current size (and with the replacement disks if one died).

Recently, Unraid started telling me one of the other drives in the pool is missing, but it's not... it's still there, error free and seemingly working.
 

Now simply hitting reboot doesn't do much while this issue is going on! The log shows one line about rebooting, then it sits there until I type "powerdown -r"... then it ACTUALLY starts rebooting.

I am in the process of backing up and formatting... but I can't get the array to stop (except by rebooting with auto-start turned off) in order to add the backup disks to a temp pool, so it's taking a long time. But hopefully these logs show a possible bug. It could also be that I don't really know what I'm doing and royally messed up some things.

Anyway, here are the diags!! Thanks

elmstorage-diagnostics-20231027-2153.zip

 

(I should add that I was in the middle of trying to manually stop the array when I took these diags... so some things might be extra)

Edited by miicar
More info as I'm trying to back up and restart this pool
Link to comment
8 hours ago, JorgeB said:

Besides the btrfs pool, this one is also not unmounting:

 

Oct 27 21:43:38 ELMSTORAGE root: cannot unmount '/mnt/cache/Docker': pool or dataset is busy
Oct 27 21:43:38 ELMSTORAGE root: cannot unmount '/mnt/cache': pool or dataset is busy

 

 

I think that is due to the Storj DB files being on the (SSD) cache, and there is a Storj process that hates how slow BTRFS gets during a rebuild or with a degraded drive. My guess is that's what was holding it back. I can post today's diag. The array is in production and running fine overall; I'm still moving files off this pool to change it to a z-pool, but the move is painfully slow.

elmstorage-diagnostics-20231028-1251.zip

Link to comment

Figured out how to stop the "remove" procedure, so the backup is going a tiny bit faster now. I thought I would try removing the "missing" devid 1 that shows up when I type "btrfs fi show". It has zero used space, so it should just go away and let the pool balance properly, right? I typed "btrfs device remove "devid 1" /mnt/(pool name)", but it tells me "ERROR: not a block device: devid 1". I am probably getting the syntax wrong... (I'll admit, I made that line up from reading the --help output.) I would like to keep this as BTRFS, but I keep running into stability issues with this fs (partly from a lack of proper understanding, I am sure).
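(For the record, a sketch of the two forms btrfs-progs actually accepts here; the mount point is a placeholder:)

# remove by numeric devid, i.e. just the number from "btrfs fi show",
# not the literal string "devid 1":
btrfs device remove 1 /mnt/poolname

# or remove whichever device the metadata records as missing:
btrfs device remove missing /mnt/poolname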

Edited by miicar
More accurate info.
Link to comment
13 hours ago, JorgeB said:

Try 

btrfs device remove missing  /mnt/(pool name)

OK, yeah, I tried that last night too... this was its reply:

 

:~# btrfs device remove missing  /mnt/Servernstorj
ERROR: cannot access '/mnt/Servernstorj': No such file or directory


Tried it with all caps, as shown, and all lowercase... no dice.
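(When the path in the error message is a guess, it's worth confirming how the pool is actually mounted first; for example:)

# list mounted btrfs filesystems and their mount points:
findmnt -t btrfs

# or simply see what exists under /mnt:
ls /mnt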


I also tried pulling the disks out of the pool assignment and mounting them in UD (something I've done before to rescue my last crashed pool). It wouldn't mount, in UD or via the CLI; something about a missing profile tree... (I should have taken a screenshot). So I put it back as it was in the pool, and it mounts immediately (degraded and slow, of course).

 

====


This pool is going to get smashed and rebuilt as soon as the painfully slow file move is done, but if y'all wanna debug it before I do that, I'm game to play along! Or maybe my use case is so rare that it's not really worth troubleshooting for the average user (a mixed bag of random-sized, not-so-healthy SATA, and sometimes even IDE drives, in the forbidden BTRFS raid5). I have multiple pools that are for production and contain perfect drives that get swapped out at the first sign of distress; those pools don't give me any issues (although I'm moving most of them to ZFS now that Unraid supports it).

 

I don't know if this is more a BTRFS issue or a matter of how Unraid asks it to do things. This is the second time a drive removal has ended up destroying a BTRFS pool for me in Unraid, using the documented methods to replace or remove a drive through the GUI.

 

====

My personal takeaway from all this is that I'm going to do these operations through the CLI going forward. I think that will entice me to look at the status of the pool between steps and catch potential issues before I compound more problems on top of them; that's something Unraid's GUI doesn't really give you much insight into between rebuilds. And really, I don't want it to give more info on the GUI side of things... more stuff happening in the background means a slower OS in the long run. I also wonder if that will let me manage pools while the rest of the system is still live... which would be wonderful. I chose this path. Guess I gotta learn the proper way to walk it like a proper penguin!
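(As a rough sketch of what that CLI workflow could look like on a mounted pool; device paths and the mount point are placeholders, and since Unraid's GUI tracks pool membership itself, anything done behind its back may need the pool re-assigned afterwards:)

# rebuild directly onto the new disk instead of doing an add followed by a
# remove; the target must be at least as large as the source:
btrfs replace start /dev/sdX1 /dev/sdY1 /mnt/poolname

# check rebuild progress at any time:
btrfs replace status /mnt/poolname

# sanity-check the pool between steps before touching the next drive:
btrfs filesystem show /mnt/poolname
btrfs device usage /mnt/poolname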

 

Thx,

C

Edited by miicar
more rambling
Link to comment

So, late last night I accidentally rebooted this server while attempting to reboot another one (got my browsers and my sleep mixed up). After the reboot the BTRFS operation started automatically again; I had to go to work, so I left it. Just got home, and it's done!

watch -n 10 sudo btrfs fi show
Label: 'Emergency_ONLY_UNraid_Spare_Drive_(SmartErrors)'  uuid: c9fdb522-0c4c-4b78-89e7-c9518d596bf1
        Total devices 5 FS bytes used 1.77TiB
        devid    2 size 931.51GiB used 552.00GiB path /dev/sdh1
        devid    4 size 931.51GiB used 552.00GiB path /dev/sdi1
        devid    7 size 931.51GiB used 552.03GiB path /dev/sdb1
        devid    8 size 3.64TiB used 569.03GiB path /dev/sdj1
        devid    9 size 3.64TiB used 467.00GiB path /dev/sdl1

It no longer shows a missing device. The space used isn't the same across like-sized disks, as I would expect it to be. I might attempt a balance and see what happens. Are there things I can check so I can trust this pool without formatting and starting over?
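(A few read-only checks that can help build confidence in the pool; the mount point is a placeholder:)

# per-device I/O error counters; all zeros is what you want to see:
btrfs device stats /mnt/poolname

# full data and metadata checksum verification (slow on spinning disks):
btrfs scrub start /mnt/poolname
btrfs scrub status /mnt/poolname

# confirm allocation and raid profiles look sane:
btrfs filesystem usage /mnt/poolname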

It's still moving files uselessly slowly, but I think that is partly due to the tiny size of each file, and one of the drives might also be quitting.
Here is today's diag...

 

 

elmstorage-diagnostics-20231103-1917.zip

Edited by miicar
Link to comment
10 hours ago, JorgeB said:

Looks fine to me. You can run a balance, but IMHO there's not much point; it's pretty well balanced, just one drive has more metadata.

Fair enough. I guess the question remains: can I trust this pool again, or should I keep moving the millions of tiny files off the pool and rebuild it? I'm trying to figure out why I can't break the 20MB/s R/W wall... can't even get close most of the time (and not just with the tiny files, in general). (I was wrong about the speed issues. It seems able to go over 100MB/s again, which, for the drives I'm using, is fine.)

It seems (from my own experience and from reading through others'), removing a drive from a BTRFS pool is a risky move. Sometimes the FS doesn't wanna let it go (it's happened to me on two different pools now). Drive adds and replacements, while slow in raid5, have completed without issue, but every issue I have had with BTRFS revolves around removing a drive. It always seems to leave stuff dangling behind, in my experience. Then (before I knew to check), I would add/swap another drive and all hell would break loose!

Edited by miicar
Corrected my lies, and telling some more
Link to comment
17 hours ago, miicar said:

It seems (from my own experience and from reading through others'), removing a drive from a BTRFS pool is a risky move.

Using btrfs raid5/6 is risky in general, since it's considered experimental. Unless you really need the flexibility and can live with the risk, I would recommend converting to zfs raidz.

Link to comment
8 hours ago, JorgeB said:

Using btrfs raid5/6 is risky in general, since it's considered experimental. Unless you really need the flexibility and can live with the risk, I would recommend converting to zfs raidz.

Yeah, the reason this pool exists is to make the most of the space on random HDDs. I accept the risk and only use it for non-critical data (Steam games, printer scans, a wallpaper share, surveillance camera cache, etc.).

Link to comment
