[SOLVED] Rebuild Cache Pool



Hi,

 

I don't know how it happened, but one of my cache drives disappeared after a reboot.

After another reboot it's back, but only as an "unassigned device".

 

The other cache drive says "Unmountable: no file system".

 

I took a quick look at what happens when I stop the array and attach the drive as a cache drive again --> Warning: All data on the drive will be deleted!

 

Now I'm stuck... what should I do now?

 

 

Bildschirmfoto 2020-07-01 um 21.56.07.jpg


@johnnie.black - here are the diagnostics.

Hopefully you can help me, because my isos share was also on the cache pool. I thought it was safe there while using a pool.

AppData was backed up successfully with the CA tool, so that's fine.

 

But for now (logically) all shares (appdata, domains, isos and system) are gone ...

And, if it gets worse and I (or you) can't restore those shares, I'll need some help to start from scratch.

I don't think it's just a matter of "build a new cache pool" and restoring AppData with the CA tool...

 

v1ew-s0urce-diagnostics-20200702-1235.zip


It doesn't look very good, since the pool is being detected as a single device. If it was a redundant pool (the default), you can try mounting just the other device. To do that, try this:

 

Stop the array. If Docker/VM services are using the cache pool, disable them. Unassign the current cache device, then start the array to make Unraid "forget" the current cache config. Stop the array again and reassign the other cache device (there can't be an "All existing data on this device will be OVERWRITTEN when array is Started" warning for any cache device), then start the array.

 

If it doesn't work, post new diags.

 

 


OK -- sadly I have to work for about 3 hours before I get home :)

But then I'll try it.

Docker and VMs are not usable, because these services are also configured to run only on the cache drive!

 

So, to be sure that I'm doing everything correctly, step by step:

1) Stop array

2) Unassign the current cache disk (sdb!) --> so no cache drives are configured

3) Start the array with no cache disks to make Unraid forget any cache config

4) Stop array

5) Attach both cache disks (sdb and sdc) --> You said "there can't be a warning" ... don't you mean "there can be a warning"?!

Because I already tried to assign the unassigned drive as a cache drive yesterday, and it immediately showed me the "all existing data..." warning.

6) Start array

 

6 minutes ago, Maddeen said:

5) Attach both cache disks (sdb and sdc)

No, assign only the other device (currently sdc).

 

8 minutes ago, Maddeen said:

You said "there can't be a warning" ... don't you mean "there can be a warning"?!

I mean there can't; that's why we start the array first without any cache device assigned.


@johnnie.black - done.

But now I get the error "Unmountable: No pool uuid" (see screenshot -- new diagnostics attached)

Bildschirmfoto 2020-07-02 um 16.27.20.png

 

Additionally, the other cache drive (sdb) can't even be mounted now, as you can see in the screenshot. Everything is greyed out.

 

And I got these two mail warnings:

 

First:

Event: Unraid Cache disk error
Subject: Alert [V1EW-S0URCE] - Cache disk in error state (disk missing)
Description: No device identification ()
Importance: alert

Second:

Event: Unraid Cache disk message
Subject: Notice [V1EW-S0URCE] - Cache disk returned to normal operation
Description: Samsung_SSD_850_EVO_500GB_S2RBNXAH243889M (sdc)
Importance: normal

v1ew-s0urce-diagnostics-20200702-1629.zip


Yep, as suspected, it doesn't look good. That device was removed from the pool, and the other one has a damaged superblock. You can try these recovery options; you can try them against both devices, but if it works, it should be with sdb.
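
For readers following along: the linked FAQ's recovery options aren't quoted here, but a typical btrfs superblock-recovery sequence looks roughly like the sketch below. The device path and rescue target are examples only, and `DRY_RUN=1` just prints the commands -- these operations touch the filesystem, so read the FAQ before running anything for real.

```shell
# Sketch of common btrfs superblock recovery steps (NOT the FAQ verbatim).
# DEV is an example; point it at the remaining pool member.
DRY_RUN=1
DEV=/dev/sdb1

run() {
    # Print instead of execute while DRY_RUN=1
    if [ "$DRY_RUN" = 1 ]; then echo "would run: $*"; else "$@"; fi
}

# 1) Try restoring the superblock from one of its backup copies
run btrfs rescue super-recover -v "$DEV"

# 2) Read-only check first; never reach for --repair as a first step
run btrfs check --readonly "$DEV"

# 3) Last resort: copy files off the unmountable device to the array
run btrfs restore -v "$DEV" /mnt/disk1/cache_rescue
```

With `DRY_RUN=0` these run for real; `btrfs rescue super-recover` only rewrites the primary superblock from a backup copy, and `btrfs restore` never writes to the source device.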


Thx -- I'll try that.
If it's not possible, is there any best practice for starting from scratch?

My plan would be

  1. Adding both cache drives
  2. Formatting and bringing Unraid into a clean state
  3. Creating the folders (appdata, domains, isos and system) on the cache drive
  4. Setting up the shares
  5. Restore AppData via CA Plugin
  6. Deactivate the docker service and reactivate it (to write a new docker.img to the cache drive)
  7. Download the needed dockers -- all settings should be restored automatically (if I remember right)
  8. Deactivate the VM service and reactivate it (to write a new libvirt.img to the cache drive)
  9. Done

And just for my information, because I'm not a specialist, let me ask the following....

 

Do you have any idea what caused my problem? I don't want to have this happen again.

Or is there an option to periodically back up the shares (domains, isos and system) to the array?

 

Why is the cache pool configured as RAID1 but wasn't restorable?

I don't see the benefit of a pool when, in the case of crashes like mine, all data is lost...

 

Thank you very much for all your explanations and the time you've spent helping me!!

16 minutes ago, Maddeen said:

My plan would be

Looks OK to me; the docker image should be recreated.

 

17 minutes ago, Maddeen said:

Do you have any idea what caused my problem? I don't want to have this happen again.

The problem was created after starting the array with a missing cache device; this removed one of the devices from the pool. The remaining device should still work, assuming it was a redundant pool, but there's some issue with the superblock. I already asked LT not to allow the array to auto-start with a missing cache device, and hopefully that will be implemented soon. But for now, anyone using a cache pool is best off disabling array auto-start and always checking that every device is present before starting.

 

20 minutes ago, Maddeen said:

Why is the cache pool configured as RAID1 but wasn't restorable?

There's an issue with the remaining cache device's superblock, and I'm not sure why; it's not common. A btrfs maintainer might be able to help restore it without data loss; you'd need to ask for help on IRC or the mailing list, both of which are mentioned in the FAQ linked earlier.


Thanks again for your help. 

Hopefully LT will hear (or has already heard) your voice, because that's exactly what I thought when reading your explanation.

Why does Unraid perform an array auto-start when it recognizes that drives are missing and a potential data loss might happen? 🥴
That behavior makes the whole "pool feature" useless and suggests a redundancy which isn't there -- imho!

Is there any feature request queue where I can add my points to that and affect the priority of implementation?

 

Until then, I'll follow your advice and disable array auto-start after a power down / reboot.

I'll read through your link -- luckily there is no sensitive data on the cache, and before I waste a btrfs maintainer's time on "not very urgent data", I'll start from scratch.
AppData was backed up with the CA plugin.
Docker container settings will be restored by default.
Domains -- I've got none ;-)
And the isos share only included the Win10 installation image and 2-3 virtIO images.

 

I'll post an update on whether I could restore my data... or not :)

But as I said -- it's not the end of the world!!

 

Thank you very much!! 🤙

33 minutes ago, Maddeen said:

Is there any feature request queue where I can add my points to that and affect the priority of implementation?

I recently traded messages with Tom about this, but it won't hurt to add a post to the feature requests.


@johnnie.black - seems that I'm a lucky boy (see screenshot) :)

 

But - as you can see in the screenshot - my cmd-line skills are pretty shitty. 🙈

So before I self-destruct my luck... is this the correct command for copying the folders to my array?
 

cp -r /appdata/ /mnt/user/unraid_backup/appdata/
cp -r /system/ /mnt/user/unraid_backup/system/
cp -r /isos/ /mnt/user/unraid_backup/isos/

 

Actually, there is no "appdata", "system" or "isos" folder in the target folder "unraid_backup" -- so does it work as written, or do I need to create the folders at the target first?
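
For what it's worth, `cp -r` creates the final path component of the destination itself if it doesn't exist (only the parent has to exist already). A throwaway sandbox, using temp directories instead of the real shares, shows the behavior -- note the real source paths depend on where the recovered pool is actually mounted:

```shell
# Sandbox check: does cp -r create the target folder? (temp dirs only)
src=$(mktemp -d)
dst=$(mktemp -d)

mkdir -p "$src/appdata/nginx"
echo "config" > "$src/appdata/nginx/site.conf"

# $dst/appdata does not exist yet -- cp -r creates it
cp -r "$src/appdata" "$dst/"

cat "$dst/appdata/nginx/site.conf"   # -> config
```

So pre-creating the folders isn't necessary, as long as the target share path itself already exists.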

 

Bildschirmfoto 2020-07-02 um 20.09.00.png


mhh, but my dockers are on that cache drive 🙈 for now I can't even start the docker service, because all the necessary resources are on the cache disks...

 

Or would you prefer that I first set up docker to run on the array, copy the files, and then move all docker resources back to the cache?


mhh, I started midnight commander via the web terminal, but I'm not able to use the copy command because it needs a press of F5... and F5 is "reload page" in the browser.
For heaven's sake... the solution is right in front of my nose and I'm not able to catch it.

I'll look for a workaround to disable F5 in the browser so I can use F5 to start copying...

 

UPDATE: 

Got it!!! ESC + number = F-key...

In my case --> ESC+5 = F5 --- copying starts 🙌


@johnnie.black - ok, now I'm completely confused.

I copied all the data to my array. Stopped the array. Added both cache drives. Formatted the one where Unraid said "Unmountable".

And ... magic ... my pool is restored. All data is back ... including VMs and Docker containers 🤔 WTF?! 

 

Currently it's doing something, because first I received this warning:

Event: Unraid Cache disk message
Subject: Warning [V1EW-S0URCE] - Cache pool BTRFS too many profiles (You can ignore this warning when a cache pool balance operation is in progress)
Description: Samsung_SSD_850_EVO_500GB_S2RBNX0HB13914M (sdb)
Importance: warning

 

I checked the system log -- something's going on, like this:

Jul 2 22:04:46 v1ew-s0urce kernel: BTRFS info (device sdb1): found 703 extents
Jul 2 22:04:46 v1ew-s0urce kernel: BTRFS info (device sdb1): found 703 extents
Jul 2 22:04:46 v1ew-s0urce kernel: BTRFS info (device sdb1): found 703 extents
Jul 2 22:04:46 v1ew-s0urce kernel: BTRFS info (device sdb1): relocating block group 198764920832 flags data
Jul 2 22:04:51 v1ew-s0urce kernel: BTRFS info (device sdb1): found 584 extents
Jul 2 22:04:51 v1ew-s0urce kernel: BTRFS info (device sdb1): found 584 extents
Jul 2 22:04:51 v1ew-s0urce kernel: BTRFS info (device sdb1): found 584 extents
Jul 2 22:04:51 v1ew-s0urce kernel: BTRFS info (device sdb1): relocating block group 197691179008 flags data
Jul 2 22:04:56 v1ew-s0urce kernel: BTRFS info (device sdb1): found 1069 extents
Jul 2 22:04:56 v1ew-s0urce kernel: BTRFS info (device sdb1): found 1069 extents
Jul 2 22:04:56 v1ew-s0urce kernel: BTRFS info (device sdb1): relocating block group 196617437184 flags data
Jul 2 22:05:01 v1ew-s0urce kernel: BTRFS info (device sdb1): found 926 extents
Jul 2 22:05:01 v1ew-s0urce kernel: BTRFS info (device sdb1): found 926 extents
Jul 2 22:05:01 v1ew-s0urce kernel: BTRFS info (device sdb1): relocating block group 195543695360 flags data
Jul 2 22:05:06 v1ew-s0urce kernel: BTRFS info (device sdb1): found 394 extents
Jul 2 22:05:07 v1ew-s0urce kernel: BTRFS info (device sdb1): found 394 extents
Jul 2 22:05:07 v1ew-s0urce kernel: BTRFS info (device sdb1): relocating block group 194469953536 flags data
Jul 2 22:05:12 v1ew-s0urce kernel: BTRFS info (device sdb1): found 554 extents
Jul 2 22:05:12 v1ew-s0urce kernel: BTRFS info (device sdb1): found 554 extents
Jul 2 22:05:12 v1ew-s0urce kernel: BTRFS info (device sdb1): relocating block group 193396211712 flags data
Jul 2 22:05:17 v1ew-s0urce kernel: BTRFS info (device sdb1): found 673 extents
Jul 2 22:05:17 v1ew-s0urce kernel: BTRFS info (device sdb1): found 673 extents

 

And when clicking on the first cache drive --> btrfs balance status: running...

 

Is everything fine now?! Or do I need to double-check some things?? Or run any self-checking/repair mechanisms?
I'm not sure if my current "status" is safe, or if it's like a damaged car... running its last miles until the next little hiccup results in a complete crash :)

10 hours ago, Maddeen said:

And ... magic ... my pool is restored. All data is back ... including VMs and Docker containers 🤔 WTF?! 

That's weird; please post diags if you didn't reboot afterwards.

 

10 hours ago, Maddeen said:

Is now everything fine?!

What I would guess is that it recovered/replaced the damaged superblock and it's now balancing the data onto the other device. If it's showing data and your pool was redundant, all data should be correct.
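
If you want to double-check the pool once the balance finishes, the usual read-only btrfs health checks look like this. `/mnt/cache` is Unraid's standard cache mount point; the commands are shown behind `echo` so nothing runs by accident -- drop the `echo` to execute each one:

```shell
# Post-recovery health checklist for a btrfs pool (echo = dry run)
echo btrfs balance status /mnt/cache   # reports no balance found once it is done
echo btrfs scrub start -B /mnt/cache   # re-reads all data and verifies checksums
echo btrfs scrub status /mnt/cache     # look for 0 uncorrectable errors
echo btrfs device stats /mnt/cache     # per-device error counters, should all be 0
```

A clean scrub plus zeroed device stats is about as much assurance as btrfs can give that both copies of the data are intact.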


@johnnie.black - weird -- that's the correct wording :) 

No restart since yesterday - here's the new diagnostics file

 

 

But what I don't understand is this information (screenshot)... why does it say that there is only 90GB (total) space and 86GB used? (I've got two 500GB SSDs as cache drives in RAID1.)

 

And below it says "Data to scrub = 173GB" ...

And -- to complete the confusion -- the "Main" tab says the cache drive is about 93GB used (screenshot #2).

All this information doesn't fit together, or isn't consistent -- imho...

 

Bildschirmfoto 2020-07-03 um 16.02.21.png

Bildschirmfoto 2020-07-03 um 16.10.08.png

v1ew-s0urce-diagnostics-20200703-1600.zip

11 minutes ago, Maddeen said:

why it says that there is only 90GB (total)

That's normal with btrfs; it's the allocated size. Btrfs first creates empty data and metadata chunks before writing data there, so there are 90GiB of chunks on disk and 86GiB are used.

 

12 minutes ago, Maddeen said:

And below it says "Data to scrub = 173GB" ...

RAID1, so 86GiB x 2.

 

12 minutes ago, Maddeen said:

to complete the confusion -- the "Main" tab says the cache drive is about 93GB used.

Because the GUI uses GB, not GiB: 86GiB ≈ 93GB.
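
The conversion is easy to sanity-check: GiB counts 2^30 bytes while GB counts 10^9, so the same usage reads about 7.4% higher in GB. A flat 86GiB comes out at roughly 92.3GB; the GUI's 93GB reflects the exact byte count, which sits a bit above a round 86GiB.

```shell
# 86 GiB expressed in GB: GiB counts 2^30 bytes, GB counts 10^9
awk 'BEGIN { printf "%.1f GB\n", 86 * 1073741824 / 1e9 }'
# -> 92.3 GB
```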

17 minutes ago, Maddeen said:

weird -- that's the correct wording

I can't really understand what happened, and why the pool recovered after a format attempt. I can see that wiping a device failed because it was busy; that's likely part of what helped. Still, most likely you were just lucky -- that should never happen.

23 hours ago, johnnie.black said:

I can't really understand what happened .... most likely you were just lucky

 

Haha -- sometimes you just need luck...

Thanks again for your valuable feedback! It helped me a lot in killing my blind spots with Unraid.

  • 11 months later...

I faced a similar experience when a reboot with a bad SATA card occurred, knocking out my entire cache array.

 

Because no actual drives were lost, after rewiring to another SATA port (SATA 2, sadly) I am able to see the entire btrfs filesystem, even though I'm unable to add the drives back to Unraid (as they are unassigned, and show a warning that all data will be formatted when reassigning).

 

$ btrfs filesystem show
Label: none  uuid: 0bfdf8d7-1073-454b-8dec-5a03146de885
        Total devices 6 FS bytes used 1.37TiB
        devid    2 size 111.79GiB used 37.00GiB path /dev/sdo1
        devid    3 size 223.57GiB used 138.00GiB path /dev/sdm1
        devid    4 size 223.57GiB used 138.00GiB path /dev/sdi1
        devid    5 size 1.82TiB used 1.60TiB path /dev/sdd1
        devid    6 size 1.82TiB used 1.60TiB path /dev/sde1
        devid    7 size 111.79GiB used 37.00GiB path /dev/sdp1
        
 ... there will probably be other btrfs disks listed here if you have them as well ...

 

While attempting to remount this cache pool using the steps found at 

I was unfortunately faced with this error:

$ mount -o degraded,usebackuproot,ro /dev/sdo1 /dev/sdm1 /dev/sdi1 /dev/sdd1 /dev/sde1 /dev/sdp1  /recovery/cache-pool
mount: bad usage

 

 

So alternatively I mounted using the UUID (with /recovery/cache-pool being the recovery folder I created):

 

$ mount -o degraded,usebackuproot,ro --uuid 0bfdf8d7-1073-454b-8dec-5a03146de885  /recovery/cache-pool

 

With that, I presume I can then safely remove the drives from the cache pool (for the last 2 disks that were left), and slowly, manually reorganize and recover the data.

 

14 minutes ago, PicoCreator said:

I am able to see the entire btrfs filesystem, even though I'm unable to add the drives back to Unraid (as they are unassigned, and show a warning that all data will be formatted when reassigning)

You can do this:

 

Stop the array. If Docker/VM services are using the cache pool, disable them. Unassign all cache devices, then start the array to make Unraid "forget" the current cache config. Stop the array again and reassign all cache devices (there can't be an "All existing data on this device will be OVERWRITTEN when array is Started" warning for any cache device). Re-enable Docker/VMs if needed, then start the array.

