Rebooted, lost cache disk, BTRFS operation running?

johnsanc · October 13, 2019

I'm not really sure what happened. I woke up this morning and most of my dockers and VMs were disabled, and I saw what looked like I/O errors on one of the drives in my cache pool. I rebooted, and when the system came back up I could see only one cache drive and a BTRFS operation was running (disk log below). On the cache disk tab there is only one disk and the size still shows as 2TB, but it appears to be freeing up space and reducing used space. I'm hoping this is just rebalancing back to one disk or something, but I cannot clearly tell from the UI what is going on.

What exactly is happening here and how do I get my cache pool back to normal?

I assumed I should just let this run instead of trying to stop it. I fear I will lose all my cache data if I try to interfere.

Oct 13 10:55:36 Tower kernel: ata7: SATA max UDMA/133 abar m512@0xecf00000 port 0xecf00100 irq 48
Oct 13 10:55:36 Tower kernel: ata7: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Oct 13 10:55:36 Tower kernel: ata7.00: supports DRM functions and may not be fully accessible
Oct 13 10:55:36 Tower kernel: ata7.00: ATA-11: Samsung SSD 860 EVO 1TB, [REDACTED], [REDACTED], max UDMA/133
Oct 13 10:55:36 Tower kernel: ata7.00: 1953525168 sectors, multi 1: LBA48 NCQ (depth 32), AA
Oct 13 10:55:36 Tower kernel: ata7.00: supports DRM functions and may not be fully accessible
Oct 13 10:55:36 Tower kernel: ata7.00: configured for UDMA/133
Oct 13 10:55:36 Tower kernel: ata7.00: Enabling discard_zeroes_data
Oct 13 10:55:36 Tower kernel: sd 7:0:0:0: [sdr] 1953525168 512-byte logical blocks: (1.00 TB/932 GiB)
Oct 13 10:55:36 Tower kernel: sd 7:0:0:0: [sdr] Write Protect is off
Oct 13 10:55:36 Tower kernel: sd 7:0:0:0: [sdr] Mode Sense: 00 3a 00 00
Oct 13 10:55:36 Tower kernel: sd 7:0:0:0: [sdr] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
Oct 13 10:55:36 Tower kernel: ata7.00: Enabling discard_zeroes_data
Oct 13 10:55:36 Tower kernel: sdr: sdr1
Oct 13 10:55:36 Tower kernel: ata7.00: Enabling discard_zeroes_data
Oct 13 10:55:36 Tower kernel: sd 7:0:0:0: [sdr] Attached SCSI disk
Oct 13 10:55:36 Tower kernel: BTRFS: device fsid ec85aae9-18b3-4066-9749-8195b3bee6e8 devid 1 transid 14803447 /dev/sdr1
Oct 13 10:55:52 Tower emhttpd: Samsung_SSD_860_EVO_1TB_[REDACTED] (sdr) 512 1953525168
Oct 13 10:55:52 Tower emhttpd: import 30 cache device: (sdr) Samsung_SSD_860_EVO_1TB_[REDACTED]
Oct 13 10:56:02 Tower kernel: BTRFS info (device sdr1): allowing degraded mounts
Oct 13 10:56:02 Tower kernel: BTRFS info (device sdr1): disk space caching is enabled
Oct 13 10:56:02 Tower kernel: BTRFS info (device sdr1): has skinny extents
Oct 13 10:56:02 Tower kernel: BTRFS warning (device sdr1): devid 3 uuid 1d8f16d9-c1d4-4ba2-8a87-729545783443 is missing
Oct 13 10:56:02 Tower kernel: BTRFS warning (device sdr1): devid 3 uuid 1d8f16d9-c1d4-4ba2-8a87-729545783443 is missing
Oct 13 10:56:02 Tower kernel: BTRFS info (device sdr1): bdev (null) errs: wr 1, rd 2, flush 0, corrupt 0, gen 0
Oct 13 10:56:02 Tower kernel: BTRFS info (device sdr1): bdev /dev/sdr1 errs: wr 0, rd 2, flush 0, corrupt 0, gen 0
Oct 13 10:56:03 Tower kernel: BTRFS info (device sdr1): enabling ssd optimizations
Oct 13 10:56:03 Tower kernel: BTRFS info (device sdr1): resizing devid 1
Oct 13 10:56:03 Tower kernel: BTRFS info (device sdr1): new size for /dev/sdr1 is 1000204853248
Oct 13 10:56:03 Tower kernel: BTRFS info (device sdr1): relocating block group 14118149029888 flags data
Oct 13 10:56:03 Tower kernel: BTRFS info (device sdr1): found 1 extents
Oct 13 10:56:03 Tower kernel: BTRFS info (device sdr1): found 1 extents
Oct 13 10:56:03 Tower kernel: BTRFS info (device sdr1): relocating block group 14117008179200 flags data|raid1
Oct 13 10:56:03 Tower kernel: BTRFS info (device sdr1): found 1 extents
Oct 13 10:56:03 Tower kernel: BTRFS info (device sdr1): found 1 extents
Oct 13 10:56:03 Tower kernel: BTRFS info (device sdr1): relocating block group 14112230211584 flags data|raid1
etc...

Edited October 13, 2019 by johnsanc

jpowell8672 · October 13, 2019

johnsanc · October 13, 2019

Thanks - it just finished rebalancing and it did not do what I assumed it would do. Instead of balancing to a single drive, it now looks like I have a single "Cache" that is twice the size of my physical drive. If I shut down the array I don't see my other disk even available for slot2.

How can the size of my cache be twice the size?

Why does unraid AUTOMATICALLY rebalance when the array starts in this scenario? That seems like an action that the user should have to initiate.

Now when I stop the array and try to restart it I cannot even restart. My only option is to reboot or power down. I get a warning at the bottom saying I have a stale configuration.

Edited October 13, 2019 by johnsanc

jpowell8672 · October 13, 2019

What was your cache pool raid1, raid0, etc.? Did you have your cache backed up? What kind of drives are you using for cache and how are they connected? Could be a bad cable or drive failure.

johnsanc · October 13, 2019

OK I rebooted and now my other drive is available to add to cache, but I get a warning that says "All existing data on this device will be OVERWRITTEN when array is Started". If I were to add my disk back would it rebalance back to the original state?

I use two 1TB Samsung EVO SSDs

These were in RAID1, mirrored

I never did anything to change the RAID setting, whatever happened happened automatically when I rebooted the first time.

johnsanc · October 13, 2019

OK - well, I did not add a 2nd disk. I started the array to back up the disk, but now it's rebalancing AGAIN... what is going on?!

jpowell8672 · October 13, 2019

https://wiki.unraid.net/Replace_A_Cache_Drive

johnsanc · October 13, 2019

So basically back everything up to the array and start over?

Also, is there an explanation for what happened in my case? Why did the cache size appear twice as large as the single physical disk remaining?

Edited October 13, 2019 by johnsanc

John_M · October 13, 2019

1 hour ago, johnsanc said:

Also, is there an explanation for what happened in my case?

One of the disks in your cache pool dropped offline:

Oct 13 10:56:02 Tower kernel: BTRFS warning (device sdr1): devid 3 uuid 1d8f16d9-c1d4-4ba2-8a87-729545783443 is missing

probably due to a cable problem. You ought to have grabbed diagnostics at that point (Tools -> Diagnostics) and posted the resulting zip file. Syslog fragments are of very limited use - in this case it showed what happened, but not why. When Unraid notices a disk missing from the cache pool it will rebalance the pool to protect the data on the remaining disk(s) in the pool as much as it can.
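The missing-device warning quoted above can be picked out of a saved syslog automatically. What follows is only a minimal Python sketch, assuming the kernel message format shown in this thread; the function name `find_missing_devices` is made up for illustration and is not part of Unraid.

```python
import re

# Matches kernel lines such as:
# "BTRFS warning (device sdr1): devid 3 uuid 1d8f... is missing"
MISSING_RE = re.compile(
    r"BTRFS warning \(device (?P<dev>[^)]+)\): "
    r"devid (?P<devid>\d+) uuid (?P<uuid>[0-9a-f-]+) is missing"
)

def find_missing_devices(syslog_text):
    """Return a sorted list of unique (mounted_device, devid, uuid) tuples."""
    found = set()
    for line in syslog_text.splitlines():
        m = MISSING_RE.search(line)
        if m:
            found.add((m.group("dev"), int(m.group("devid")), m.group("uuid")))
    return sorted(found)

# Sample lines copied from the syslog fragment in this thread.
sample = """\
Oct 13 10:56:02 Tower kernel: BTRFS info (device sdr1): has skinny extents
Oct 13 10:56:02 Tower kernel: BTRFS warning (device sdr1): devid 3 uuid 1d8f16d9-c1d4-4ba2-8a87-729545783443 is missing
Oct 13 10:56:02 Tower kernel: BTRFS warning (device sdr1): devid 3 uuid 1d8f16d9-c1d4-4ba2-8a87-729545783443 is missing
"""

for dev, devid, uuid in find_missing_devices(sample):
    print(f"pool mounted via {dev}: devid {devid} ({uuid}) is missing")
```

Like the syslog fragment itself, this only shows that a pool member dropped out, not why; the full diagnostics zip is still needed for that.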

If I understand correctly, the missing disk has now reappeared after a reboot and you want to add it back to the pool? Before you do, I recommend shutting down, checking its cables, then rebooting and grabbing diagnostics. If the disk is properly recognised and in good health then you can re-allocate it to its former slot and it will be reformatted and the pool will be rebalanced again. When it has completed rebalancing everything should be as it was before. That's the meaning of the message that the disk will be overwritten - it will be overwritten with data from the remaining healthy members of the pool, which is exactly what you want.

Edited October 13, 2019 by John_M
typos

johnsanc · October 13, 2019

Yes, that's what I thought, but I was bitten before when the entire pool was overwritten instead of just the disk, so I am extra cautious. Right now I'm backing everything up to my array just in case something goes south when trying to rebuild my pool.

Any explanation as to why it showed my cache as 2TB even though I only had a single 1TB disk at the time? That part I really don't understand.

John_M · October 13, 2019

2 minutes ago, johnsanc said:

Any explanation as to why it showed my cache as 2TB even though I only had a single 1TB disk at the time?

I recall it being mentioned somewhere before and being confirmed as a bug. If you're worried that the healthy members of your cache pool might be overwritten by the bad data on your once missing, now returned disk, you can wipe it. You should see it under the Unassigned Devices heading on the Main page. Make a note of its device name, sdX. In a terminal window type

wipefs -a /dev/sdX

and double, triple check before you press RETURN that you're wiping the correct disk. Then reassign it to its rightful slot in the cache pool and let the system format it and rebalance the pool.

johnsanc · October 13, 2019

OK so my backup is complete and I reseated the cables. Everything booted up fine. Before the array was started I went to add disk2 back to its rightful slot... now it says disk2 is part of the cache pool and disk1 (which was previously good) is now unmountable with no filesystem. From the web GUI it looks like my only option is to format disk1. Is this correct? I would think that disk2 would need to be formatted and re-added... Doesn't make sense to me. If I try to remove disk2 again I get errors saying that disk2 is missing now and I cannot change the number of slots to 1.

I guess my worries aren't completely unfounded.

The only thing noteworthy I see in the log is this:

Oct 13 16:58:39 Tower emhttpd: shcmd (212): mount -t btrfs -o noatime,nodiratime,degraded -U ec85aae9-18b3-4066-9749-8195b3bee6e8 /mnt/cache
Oct 13 16:58:39 Tower kernel: BTRFS info (device sdr1): allowing degraded mounts
Oct 13 16:58:39 Tower kernel: BTRFS info (device sdr1): disk space caching is enabled
Oct 13 16:58:39 Tower kernel: BTRFS info (device sdr1): has skinny extents
Oct 13 16:58:39 Tower kernel: BTRFS warning (device sdr1): devid 3 uuid 1d8f16d9-c1d4-4ba2-8a87-729545783443 is missing
Oct 13 16:58:39 Tower kernel: BTRFS info (device sdr1): bdev /dev/sdr1 errs: wr 0, rd 2, flush 0, corrupt 0, gen 0
Oct 13 16:58:39 Tower kernel: BTRFS info (device sdr1): bdev /dev/sds1 errs: wr 1, rd 2, flush 0, corrupt 0, gen 0
Oct 13 16:58:39 Tower kernel: BTRFS warning (device sdr1): chunk 14780647735296 missing 1 devices, max tolerance is 0 for writeable mount
Oct 13 16:58:39 Tower kernel: BTRFS warning (device sdr1): writeable mount is not allowed due to too many missing devices
Oct 13 16:58:39 Tower emhttpd: /mnt/cache mount error: No file system
Oct 13 16:58:39 Tower kernel: BTRFS error (device sdr1): open_ctree failed

Edited October 13, 2019 by johnsanc

JonathanM · October 13, 2019

Try removing both cache disks from the pool and starting the array, then stop the array and assign both disks at the same time.

johnsanc · October 13, 2019

Well since I had a backup of the cache I just went ahead and formatted. No surprise that everything in the pool was wiped out after doing that.

Everything appeared fine on the surface now, clean slate. So then I went to copy things back to the cache... after a while I decided to check the cache settings and I noticed this:

Why is only the Data in RAID1?

Data, RAID1: total=205.00GiB, used=204.18GiB
System, single: total=4.00MiB, used=48.00KiB
Metadata, single: total=1.01GiB, used=588.72MiB
GlobalReserve, single: total=228.83MiB, used=0.00B

John_M · October 13, 2019

On the Home page, click on the "cache" device to open its window. Scroll down to the Balance Status section and, to the right of the Balance button, enter the following

-dconvert=raid1 -mconvert=raid1

then click the Balance button.

johnsanc · October 13, 2019

Thanks - running that now... I first tried it without that because the help content on the page said that's what is used by default. I guess that's not the case in all scenarios. Now does this look like it's going to be correct when it's done?

Data, RAID1: total=204.00GiB, used=203.17GiB
System, RAID1: total=32.00MiB, used=48.00KiB
Metadata, RAID1: total=1.00GiB, used=256.00KiB
Metadata, single: total=1.00GiB, used=568.91MiB
GlobalReserve, single: total=228.12MiB, used=21.38MiB

EDIT: just refreshed and I think this is looking better:

Data, RAID1: total=205.00GiB, used=204.17GiB
System, RAID1: total=32.00MiB, used=48.00KiB
Metadata, RAID1: total=1.00GiB, used=588.12MiB
GlobalReserve, single: total=228.27MiB, used=0.00B

Edited October 13, 2019 by johnsanc

John_M · October 13, 2019

That looks good now. The view you had mid-balance showed that some of the metadata had been converted to RAID1, while some remained as single, pending conversion.
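The check being done by eye here can be sketched in a few lines of Python. This is only an illustration, assuming the listing format pasted in this thread (one `Kind, profile: total=..., used=...` line per chunk type); the function name `non_raid1_chunks` is hypothetical.

```python
import re

# Matches lines like "Metadata, single: total=1.01GiB, used=588.72MiB"
PROFILE_RE = re.compile(r"^(?P<kind>\w+), (?P<profile>\w+): total=")

def non_raid1_chunks(listing):
    """Return (kind, profile) pairs whose profile is not RAID1.

    GlobalReserve is skipped because it is reported as "single" in every
    listing in this thread, including the one confirmed as good.
    """
    problems = []
    for line in listing.splitlines():
        m = PROFILE_RE.match(line.strip())
        if not m or m.group("kind") == "GlobalReserve":
            continue
        if m.group("profile").upper() != "RAID1":
            problems.append((m.group("kind"), m.group("profile")))
    return problems

# The three listings posted in this thread: before, during and after the balance.
before = """\
Data, RAID1: total=205.00GiB, used=204.18GiB
System, single: total=4.00MiB, used=48.00KiB
Metadata, single: total=1.01GiB, used=588.72MiB
GlobalReserve, single: total=228.83MiB, used=0.00B
"""
mid_balance = """\
Data, RAID1: total=204.00GiB, used=203.17GiB
System, RAID1: total=32.00MiB, used=48.00KiB
Metadata, RAID1: total=1.00GiB, used=256.00KiB
Metadata, single: total=1.00GiB, used=568.91MiB
GlobalReserve, single: total=228.12MiB, used=21.38MiB
"""
after = """\
Data, RAID1: total=205.00GiB, used=204.17GiB
System, RAID1: total=32.00MiB, used=48.00KiB
Metadata, RAID1: total=1.00GiB, used=588.12MiB
GlobalReserve, single: total=228.27MiB, used=0.00B
"""

for name, text in (("before", before), ("mid-balance", mid_balance), ("after", after)):
    print(name, non_raid1_chunks(text))
```

An empty list for the final listing corresponds to the "looks good now" state, while the mid-balance listing still reports the leftover `Metadata, single` chunk.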

JorgeB · October 14, 2019

7 hours ago, johnsanc said:

Why is only the Data in RAID1?

It's a bug.

Make sure for the future you monitor the pool for errors, to catch a dropped device sooner.
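The monitoring script referred to here is not reproduced in the thread. As a rough idea of what such a check might look like, here is a minimal Python sketch that flags non-zero error counters. It assumes the per-device counters are available as text in the `[device].counter value` layout printed by `btrfs device stats`; the function name `nonzero_counters` is hypothetical, and the sample values simply mirror the `errs:` lines from the syslog earlier in the thread.

```python
import re

# Matches lines like "[/dev/sdr1].read_io_errs     2"
STAT_RE = re.compile(r"^\[(?P<dev>[^\]]+)\]\.(?P<counter>\w+)\s+(?P<value>\d+)$")

def nonzero_counters(stats_text):
    """Return {device: {counter: value}} for every counter above zero."""
    errors = {}
    for line in stats_text.splitlines():
        m = STAT_RE.match(line.strip())
        if m and int(m.group("value")) > 0:
            errors.setdefault(m.group("dev"), {})[m.group("counter")] = int(m.group("value"))
    return errors

# Illustrative input only; the numbers follow the "errs: wr .., rd .." log lines.
sample = """\
[/dev/sdr1].write_io_errs    0
[/dev/sdr1].read_io_errs     2
[/dev/sdr1].flush_io_errs    0
[/dev/sdr1].corruption_errs  0
[/dev/sdr1].generation_errs  0
[/dev/sds1].write_io_errs    1
[/dev/sds1].read_io_errs     2
[/dev/sds1].flush_io_errs    0
[/dev/sds1].corruption_errs  0
[/dev/sds1].generation_errs  0
"""

for dev, counters in nonzero_counters(sample).items():
    print(dev, counters)
```

On a live system the text would presumably come from running `btrfs device stats /mnt/cache` on a schedule and sending a notification whenever the result is not empty.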

johnsanc · October 14, 2019

@johnnie.black - THANK YOU! I'm glad the RAID1 bug has been logged and hopefully there is a fix soon so people don't think they are protected when they really aren't. If it weren't for digging through the forums or responses like yours people would never know.

Also thanks for the monitoring tip. I added this script just now. Great idea!

JorgeB · October 15, 2019

10 hours ago, johnsanc said:

and hopefully there is a fix soon

It's already been fixed on v6.8rc1

Kosmos · July 14, 2021

Greetings everyone,

I do have a similar problem, but there is a key difference: Only after rebooting the server, my cache pool device can not be reassigned.

So my server starts throwing this error message every minute: Cache pool BTRFS missing device(s).

Then of course the "pool" continues on a single drive only. Logs below.

Consequently, I can not run the balance command to raid1.

The pool size is displayed as the sum of both drives, which should not be the case; free space is displayed correctly, though.

When setting up the pool (again) everything works fine with raid1 and r/w access.

Curiously, the missing device is displayed as the one continuing to work. But maybe that's just for identification of the respective pool.

When I'm using the SSDs individually, they work as intended.

They are even attached to the same SATA controller.

I'm happy to get your opinions.

Best regards

UNRAID Version: 6.9.2

Quote

Jul 14 10:32:54 Server emhttpd: shcmd (1418): mkdir -p /mnt/ssd2cache
Jul 14 10:32:54 Server emhttpd: /mnt/ssd2cache uuid: 45e569e9-eb23-4a71-a966-df286a5e680a
Jul 14 10:32:54 Server emhttpd: /mnt/ssd2cache TotDevices: 2
Jul 14 10:32:54 Server emhttpd: /mnt/ssd2cache NumDevices: 2
Jul 14 10:32:54 Server emhttpd: /mnt/ssd2cache NumFound: 1
Jul 14 10:32:54 Server emhttpd: /mnt/ssd2cache NumMissing: 1
Jul 14 10:32:54 Server emhttpd: /mnt/ssd2cache NumMisplaced: 0
Jul 14 10:32:54 Server emhttpd: /mnt/ssd2cache NumExtra: 1
Jul 14 10:32:54 Server emhttpd: /mnt/ssd2cache LuksState: 1
Jul 14 10:32:54 Server emhttpd: shcmd (1419): mount -t btrfs -o noatime,space_cache=v2,discard=async,degraded -U 45e569e9-eb23-4a71-a966-df286a5e680a /mnt/ssd2cache
Jul 14 10:32:54 Server kernel: BTRFS info (device dm-8): turning on async discard
Jul 14 10:32:54 Server kernel: BTRFS info (device dm-8): allowing degraded mounts
Jul 14 10:32:54 Server kernel: BTRFS info (device dm-8): using free space tree
Jul 14 10:32:54 Server kernel: BTRFS info (device dm-8): has skinny extents
Jul 14 10:32:54 Server kernel: BTRFS warning (device dm-8): devid 2 uuid aae9367d-7288-498d-a637-35cbb816dd99 is missing
Jul 14 10:32:54 Server kernel: BTRFS warning (device dm-8): devid 2 uuid aae9367d-7288-498d-a637-35cbb816dd99 is missing
Jul 14 10:32:54 Server kernel: BTRFS info (device dm-8): enabling ssd optimizations
Jul 14 10:32:54 Server emhttpd: shcmd (1420): /sbin/btrfs device delete missing /mnt/ssd2cache && /sbin/btrfs balance start --full-balance /mnt/ssd2cache &

[...]

Jul 14 12:28:32 Server ool www[26914]: /usr/local/emhttp/plugins/dynamix/scripts/btrfs_balance 'start' '/mnt/ssd2cache' '-dconvert=raid1,soft -mconvert=raid1,soft'
Jul 14 12:28:32 Server kernel: BTRFS error (device dm-8): balance: invalid convert data profile raid1

server-diagnostics-20210714-1540.zip

Edited July 14, 2021 by Kosmos
uploaded diagnostics
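The emhttpd counter lines in the log above can be collected per pool to spot a degraded mount quickly. This is a minimal Python sketch, assuming the line layout shown in this post; the function names are made up, and treating `NumMissing` greater than zero as "a member was not found" is an interpretation of the log rather than documented behaviour.

```python
import re

# Matches lines like "... emhttpd: /mnt/ssd2cache NumMissing: 1"
LINE_RE = re.compile(r"emhttpd: (?P<mount>/\S+) (?P<key>\w+): (?P<value>\S+)\s*$")

def pool_counters(log_text):
    """Return {mount_point: {key: value}} for emhttpd pool status lines."""
    pools = {}
    for line in log_text.splitlines():
        m = LINE_RE.search(line)
        if m:
            pools.setdefault(m.group("mount"), {})[m.group("key")] = m.group("value")
    return pools

def degraded_pools(log_text):
    """Return mount points whose NumMissing counter is above zero."""
    return [
        mount
        for mount, counters in pool_counters(log_text).items()
        if int(counters.get("NumMissing", 0)) > 0
    ]

# Sample lines copied from the log in this post.
sample = """\
Jul 14 10:32:54 Server emhttpd: shcmd (1418): mkdir -p /mnt/ssd2cache
Jul 14 10:32:54 Server emhttpd: /mnt/ssd2cache uuid: 45e569e9-eb23-4a71-a966-df286a5e680a
Jul 14 10:32:54 Server emhttpd: /mnt/ssd2cache TotDevices: 2
Jul 14 10:32:54 Server emhttpd: /mnt/ssd2cache NumDevices: 2
Jul 14 10:32:54 Server emhttpd: /mnt/ssd2cache NumFound: 1
Jul 14 10:32:54 Server emhttpd: /mnt/ssd2cache NumMissing: 1
Jul 14 10:32:54 Server emhttpd: /mnt/ssd2cache NumMisplaced: 0
Jul 14 10:32:54 Server emhttpd: /mnt/ssd2cache NumExtra: 1
"""

print(degraded_pools(sample))
```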

JorgeB · July 14, 2021

Please post the diagnostics: Tools -> Diagnostics

Kosmos · July 14, 2021

2 hours ago, JorgeB said:

Please post the diagnostics: Tools -> Diagnostics

my bad, you can find them attached to the previous post
