Can't stop array (BTRFS operation is running)


lepis0
Solved by JorgeB

Hi,

I replaced one of my cache disks with a new, larger one.

 

Now when I start the array I can't stop it, because it says stopping is disabled while a BTRFS operation is running.

The syslog has thousands of lines like:

unraid kernel: BTRFS info (device sdh1): found 1 extents, stage: update data pointers

 

The cache balance status shows:

[screenshot: cache balance status]

The Balance button does nothing.

 

The pool also looks normal:

[screenshot: pool status]

 

I have waited about a week now for that BTRFS operation to finish.

 

I have also taken a diagnostics zip; it's attached.

 

What should I do to get my cache back to a normal state?
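(For reference, a few read-only commands can show from the console what the pool is actually doing; the /mnt/cache path below is assumed to be the usual Unraid cache mount point, so substitute yours:)

# check whether a balance is running and how far along it is:
btrfs balance status /mnt/cache

# per-device allocation; re-running this periodically shows whether a
# device removal is still relocating chunks:
btrfs device usage /mnt/cache

# list the devices btrfs still considers part of the pool:
btrfs filesystem show /mnt/cache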

 

unraid-diagnostics-20221017-1449.zip

Link to comment

Hi,

The Nvidia drivers are waiting for a server reboot, but I can't reboot yet because I can't stop the array :D

 

Here is the output of the command:

root@unraid:~# btrfs dev del missing /mnt/cache
ERROR: unable to start device remove, another exclusive operation 'device remove' in progress
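(Side note: on reasonably recent kernels, the in-flight exclusive operation can also be read from sysfs; the <UUID> placeholder below is the filesystem uuid reported by "btrfs filesystem show":)

# find the filesystem UUID:
btrfs filesystem show /mnt/cache

# then read which exclusive operation the kernel reports,
# e.g. "device delete", "balance" or "none":
cat /sys/fs/btrfs/<UUID>/exclusive_operation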

 

Link to comment
  • 6 months later...
On 10/18/2022 at 3:36 PM, lepis0 said:

I'm backing up data now.

How can I reformat the pool?

 

 

Edit: I managed to reformat the pool.

 

Thank you JorgeB for the help :)

 Can you share how you managed to reformat the pool if you were unable to stop the array? I'm having the same issue right now and it's looking like a reformat is my only option.
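(In case it helps while waiting for an answer: outside of Unraid's GUI, recreating a btrfs pool from the console boils down to the sketch below, assuming everything has been backed up first and the old filesystem can be unmounted. Device names and the raid1 profile are placeholders, and Unraid will normally do the equivalent itself when you assign the devices and format the pool from the GUI.)

# unmount the pool (fails if something still holds it open):
umount /mnt/cache

# wipe the old filesystem signatures from each pool member (destructive!):
wipefs -a /dev/sdX1 /dev/sdY1

# recreate the pool, here as a two-device raid1 with label "cache":
mkfs.btrfs -f -L cache -d raid1 -m raid1 /dev/sdX1 /dev/sdY1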

Link to comment
  • 3 months later...

One of my servers has been doing a disk replace on a pool for a couple of weeks now... it's still moving extents etc. in the log, so I'm assuming it's doing what it's meant to. I guess it's moving them at 5 1/4" floppy speeds? Not sure why it's taking this long (I added a 4TB drive to replace a 350GB one). "btrfs fi show" tells me it's moved about 48GB in the last 36 hours... I miss dial-up, I think it was faster.

I need to reboot my server, but I'm scared of losing the pool if I do...
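(To gauge whether the relocation is actually progressing, something like this can be left running; the mount point is a placeholder:)

# refresh every 60s; the "used" figure on the outgoing device should shrink
# as chunks are relocated to the remaining members:
watch -n 60 'btrfs device usage /mnt/pool; echo; btrfs filesystem show /mnt/pool'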

Edited by miicar
Link to comment
  • 2 months later...
On 10/18/2022 at 11:53 AM, JorgeB said:

It's still looping on deleting the missing device. This is the first time I've seen this; I suggest backing up and re-formatting the pool.

I think I am having the same issue... it's been "removing" a drive for a month, over many reboots. The raid5 (yeah, I know... call me crazy) BTRFS pool is made up of old retired disks with plenty of repaired sectors and such. The pool is not meant to be fast or incredibly safe. It worked great for my needs for the last couple of years, as a Storj node and for surveillance camera recording (archived footage we want to keep gets moved to the Unraid array by the surveillance software), replacing really bad drives along the way without issue (maybe a day or two max for the rebuilds). Now I tried just pulling a smaller HDD, as it was the only one of its size left and I was happy with the current size (and with the replacement disks if one died).

Recently, Unraid started telling me one of the other drives in the pool is missing, but it's not... it's still there, error free and seemingly working.
 

Now simply hitting reboot doesn't do much while this issue is going on! The log shows one line about rebooting, then it sits there until I type "powerdown -r"... then it ACTUALLY starts rebooting.

I am in the process of backing up and formatting... but I can't get the array to stop (except by rebooting with auto-start turned off) in order to add the backup disks to a temp pool, so it's taking a long time. But hopefully these logs show a possible bug. It could also be that I don't really know what I'm doing and royally messed up some things.

Anyway, here are the diags!! Thanks

elmstorage-diagnostics-20231027-2153.zip

 

(I should add that I was in the middle of trying to manually stop the array when I took these diags... so some things might be extra)

Edited by miicar
More info as I'm trying to back up and restart this pool
Link to comment
8 hours ago, JorgeB said:

Besides the btrfs pool, this one is also not unmounting:

 

Oct 27 21:43:38 ELMSTORAGE root: cannot unmount '/mnt/cache/Docker': pool or dataset is busy
Oct 27 21:43:38 ELMSTORAGE root: cannot unmount '/mnt/cache': pool or dataset is busy

 

 

I think that is due to the Storj DB files being on the (SSD) cache, and there is a Storj process that hates how slow BTRFS gets during a rebuild or with a degraded drive. My guess is that's what was holding it back. I can post today's diag. The array is in production and running fine overall; I'm still moving files off this pool to change it to a z-pool, but the move is painfully slow.

elmstorage-diagnostics-20231028-1251.zip

Link to comment

Figured out how to stop the "remove" procedure, so the backup is going a tiny bit faster now. I thought I would try removing the "missing" devid 1 that shows up when I type "btrfs fi show". It has zero used space, so it should just go away and let the pool balance properly, right? I typed "btrfs device remove "devid 1" /mnt/(pool name)", but it tells me "ERROR: not a block device: devid 1". I am probably getting the syntax wrong... (I'll admit, I made that line up from reading the --help output.) I would like to keep this as BTRFS, but I keep running into stability issues with this fs (partly from a lack of proper understanding, I am sure).
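(For the record, a sketch of the two forms btrfs-progs actually accepts here; the mount point is a placeholder:)

# remove by numeric devid, i.e. just the number from "btrfs fi show",
# not the literal string "devid 1":
btrfs device remove 1 /mnt/poolname

# or remove whichever device the metadata records as missing:
btrfs device remove missing /mnt/poolname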

Edited by miicar
More accurate info.
Link to comment
13 hours ago, JorgeB said:

Try 

btrfs device remove missing  /mnt/(pool name)

OK, yeah, I tried that last night too... this was its reply:

 

:~# btrfs device remove missing  /mnt/Servernstorj
ERROR: cannot access '/mnt/Servernstorj': No such file or directory


Tried it with all caps, as shown, and all lowercase... no dice.
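(When the path in the error message is a guess, it's worth confirming how the pool is actually mounted first; for example:)

# list mounted btrfs filesystems and their mount points:
findmnt -t btrfs

# or simply see what exists under /mnt:
ls /mnt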


I also tried pulling the disks out of the pool assignment and mounting them in UD (something I've done before to rescue my last crashed pool). It wouldn't mount, in UD or via the CLI; something about a missing profile tree... (I should have taken a screenshot). So I put it back as it was in the pool, and it mounts immediately (degraded and slow, of course).

 

====


This pool is going to get smashed and rebuilt as soon as the painfully slow file move is done, but if y'all wanna debug it before I do that, I'm game to play along! Or maybe my use case is so rare that it's not really worth troubleshooting for the average user (a mixed bag of random-sized, not-so-healthy SATA, and sometimes even IDE drives, in the forbidden BTRFS raid5). I have multiple pools that are for production and contain perfect drives that get swapped out at the first sign of distress; those pools don't give me any issues (although I'm moving most of them to ZFS now that Unraid supports it).

 

I don't know if this is more a BTRFS issue or a matter of how Unraid asks it to do things. This is the second time a drive removal has ended up destroying a BTRFS pool for me in Unraid, using the documented methods to replace or remove a drive through the GUI.

 

====

My personal takeaway from all this is that I'm going to do these operations through the CLI going forward. I think that will entice me to look at the status of the pool between steps and catch potential issues before I compound more problems on top of them; that's something Unraid's GUI doesn't really give you much insight into between rebuilds. And really, I don't want it to give more info on the GUI side of things... more stuff happening in the background means a slower OS in the long run. I also wonder if that will let me manage pools while the rest of the system is still live... which would be wonderful. I chose this path. Guess I gotta learn the proper way to walk it like a proper penguin!
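(As a rough sketch of what that CLI workflow could look like on a mounted pool; device paths and the mount point are placeholders, and since Unraid's GUI tracks pool membership itself, anything done behind its back may need the pool re-assigned afterwards:)

# rebuild directly onto the new disk instead of doing an add followed by a
# remove; the target must be at least as large as the source:
btrfs replace start /dev/sdX1 /dev/sdY1 /mnt/poolname

# check rebuild progress at any time:
btrfs replace status /mnt/poolname

# sanity-check the pool between steps before touching the next drive:
btrfs filesystem show /mnt/poolname
btrfs device usage /mnt/poolname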

 

Thx,

C

Edited by miicar
more rambling
Link to comment

So, late last night I accidentally rebooted this server while attempting to reboot another one (got my browsers and my sleep mixed up). After the reboot the BTRFS operation started automatically again; I had to go to work, so I left it. Just got home, and it's done!

watch -n 10 sudo btrfs fi show
Label: 'Emergency_ONLY_UNraid_Spare_Drive_(SmartErrors)'  uuid: c9fdb522-0c4c-4b78-89e7-c9518d596bf1
        Total devices 5 FS bytes used 1.77TiB
        devid    2 size 931.51GiB used 552.00GiB path /dev/sdh1
        devid    4 size 931.51GiB used 552.00GiB path /dev/sdi1
        devid    7 size 931.51GiB used 552.03GiB path /dev/sdb1
        devid    8 size 3.64TiB used 569.03GiB path /dev/sdj1
        devid    9 size 3.64TiB used 467.00GiB path /dev/sdl1

It no longer shows a missing device. The space used isn't the same across like-sized disks, as I would expect it to be. I might attempt a balance and see what happens. Are there things I can check so I can trust this pool without formatting and starting over?
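(A few read-only checks that can help build confidence in the pool; the mount point is a placeholder:)

# per-device I/O error counters; all zeros is what you want to see:
btrfs device stats /mnt/poolname

# full data and metadata checksum verification (slow on spinning disks):
btrfs scrub start /mnt/poolname
btrfs scrub status /mnt/poolname

# confirm allocation and raid profiles look sane:
btrfs filesystem usage /mnt/poolname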

It's still moving files uselessly slowly, but I think that is partly due to the tiny size of each file, and one of the drives might also be quitting.
Here is today's diag...

 

 

elmstorage-diagnostics-20231103-1917.zip

Edited by miicar
Link to comment
10 hours ago, JorgeB said:

Looks fine to me. You can run a balance, but IMHO there's not much point; it's pretty well balanced, just one drive has more metadata.

Fair enough. I guess the question remains: can I trust this pool again, or should I keep moving the millions of tiny files off the pool and rebuild it? I'm trying to figure out why I can't break the 20MB/s R/W wall... can't even get close most of the time (and not just with the tiny files, in general). (I was wrong about the speed issues. It seems able to go over 100MB/s again, which, for the drives I'm using, is fine.)

It seems (from my own experience and from reading through others'), removing a drive from a BTRFS pool is a risky move. Sometimes the FS doesn't wanna let it go (it's happened to me on two different pools now). Drive adds and replacements, while slow in raid5, have completed without issue, but every issue I have had with BTRFS revolves around removing a drive. It always seems to leave stuff dangling behind, in my experience. Then (before I knew to check), I would add/swap another drive and all hell would break loose!

Edited by miicar
Corrected my lies, and telling some more
Link to comment
17 hours ago, miicar said:

It seems (from my own experience and from reading through others'), removing a drive from a BTRFS pool is a risky move.

Using btrfs raid5/6 is risky in general, since it's considered experimental. Unless you really need the flexibility and can live with the risk, I would recommend converting to zfs raidz.

Link to comment
8 hours ago, JorgeB said:

Using btrfs raid5/6 is risky in general, since it's considered experimental. Unless you really need the flexibility and can live with the risk, I would recommend converting to zfs raidz.

Yeah, the reason this pool exists is to make the most of the space on random HDDs. I accept the risk and only use it for non-critical data (Steam games, printer scans, a wallpaper share, surveillance camera cache, etc.).

Link to comment
