johnsanc Posted October 13, 2019

I'm not really sure what happened... I woke up this morning and most of my dockers and VMs were disabled, and I had what looked like some I/O errors with one of my cache drives in the cache pool. I rebooted and when the system came back up I only see one cache drive now, and there is a BTRFS operation running (disk log below). I can also see on my cache disk tab that there is only one disk, and the size still shows as 2TB, but it appears to be freeing up space and reducing used space... I'm hoping this is just rebalancing back to one disk or something, but I cannot clearly tell from the UI what is going on. What exactly is happening here and how do I get my cache pool back to normal? I assumed I should just let this run instead of trying to stop it. I fear I will lose all my cache data if I try to interfere.

Oct 13 10:55:36 Tower kernel: ata7: SATA max UDMA/133 abar m512@0xecf00000 port 0xecf00100 irq 48
Oct 13 10:55:36 Tower kernel: ata7: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Oct 13 10:55:36 Tower kernel: ata7.00: supports DRM functions and may not be fully accessible
Oct 13 10:55:36 Tower kernel: ata7.00: ATA-11: Samsung SSD 860 EVO 1TB, [REDACTED], [REDACTED], max UDMA/133
Oct 13 10:55:36 Tower kernel: ata7.00: 1953525168 sectors, multi 1: LBA48 NCQ (depth 32), AA
Oct 13 10:55:36 Tower kernel: ata7.00: supports DRM functions and may not be fully accessible
Oct 13 10:55:36 Tower kernel: ata7.00: configured for UDMA/133
Oct 13 10:55:36 Tower kernel: ata7.00: Enabling discard_zeroes_data
Oct 13 10:55:36 Tower kernel: sd 7:0:0:0: [sdr] 1953525168 512-byte logical blocks: (1.00 TB/932 GiB)
Oct 13 10:55:36 Tower kernel: sd 7:0:0:0: [sdr] Write Protect is off
Oct 13 10:55:36 Tower kernel: sd 7:0:0:0: [sdr] Mode Sense: 00 3a 00 00
Oct 13 10:55:36 Tower kernel: sd 7:0:0:0: [sdr] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
Oct 13 10:55:36 Tower kernel: ata7.00: Enabling discard_zeroes_data
Oct 13 10:55:36 Tower kernel: sdr: sdr1
Oct 13 10:55:36 Tower kernel: ata7.00: Enabling discard_zeroes_data
Oct 13 10:55:36 Tower kernel: sd 7:0:0:0: [sdr] Attached SCSI disk
Oct 13 10:55:36 Tower kernel: BTRFS: device fsid ec85aae9-18b3-4066-9749-8195b3bee6e8 devid 1 transid 14803447 /dev/sdr1
Oct 13 10:55:52 Tower emhttpd: Samsung_SSD_860_EVO_1TB_[REDACTED] (sdr) 512 1953525168
Oct 13 10:55:52 Tower emhttpd: import 30 cache device: (sdr) Samsung_SSD_860_EVO_1TB_[REDACTED]
Oct 13 10:56:02 Tower kernel: BTRFS info (device sdr1): allowing degraded mounts
Oct 13 10:56:02 Tower kernel: BTRFS info (device sdr1): disk space caching is enabled
Oct 13 10:56:02 Tower kernel: BTRFS info (device sdr1): has skinny extents
Oct 13 10:56:02 Tower kernel: BTRFS warning (device sdr1): devid 3 uuid 1d8f16d9-c1d4-4ba2-8a87-729545783443 is missing
Oct 13 10:56:02 Tower kernel: BTRFS warning (device sdr1): devid 3 uuid 1d8f16d9-c1d4-4ba2-8a87-729545783443 is missing
Oct 13 10:56:02 Tower kernel: BTRFS info (device sdr1): bdev (null) errs: wr 1, rd 2, flush 0, corrupt 0, gen 0
Oct 13 10:56:02 Tower kernel: BTRFS info (device sdr1): bdev /dev/sdr1 errs: wr 0, rd 2, flush 0, corrupt 0, gen 0
Oct 13 10:56:03 Tower kernel: BTRFS info (device sdr1): enabling ssd optimizations
Oct 13 10:56:03 Tower kernel: BTRFS info (device sdr1): resizing devid 1
Oct 13 10:56:03 Tower kernel: BTRFS info (device sdr1): new size for /dev/sdr1 is 1000204853248
Oct 13 10:56:03 Tower kernel: BTRFS info (device sdr1): relocating block group 14118149029888 flags data
Oct 13 10:56:03 Tower kernel: BTRFS info (device sdr1): found 1 extents
Oct 13 10:56:03 Tower kernel: BTRFS info (device sdr1): found 1 extents
Oct 13 10:56:03 Tower kernel: BTRFS info (device sdr1): relocating block group 14117008179200 flags data|raid1
Oct 13 10:56:03 Tower kernel: BTRFS info (device sdr1): found 1 extents
Oct 13 10:56:03 Tower kernel: BTRFS info (device sdr1): found 1 extents
Oct 13 10:56:03 Tower kernel: BTRFS info (device sdr1): relocating block group 14112230211584 flags data|raid1
etc...
johnsanc Posted October 13, 2019

Thanks - it just finished rebalancing, and it did not do what I assumed it would... Instead of balancing to a single drive, it now looks like I have a single "Cache" which is twice the size of my physical drive. If I stop the array I don't even see my other disk available for slot 2. How can the size of my cache be twice the size of the disk? Why does Unraid AUTOMATICALLY rebalance when the array starts in this scenario? That seems like an action the user should have to initiate.

Now when I stop the array and try to restart it, I cannot: my only options are to reboot or power down, and I get a warning at the bottom saying I have a stale configuration.
jpowell8672 Posted October 13, 2019

What was your cache pool: RAID1, RAID0, etc.? Did you have your cache backed up? What kind of drives are you using for cache and how are they connected? It could be a bad cable or a drive failure.
johnsanc Posted October 13, 2019

OK, I rebooted and now my other drive is available to add to the cache, but I get a warning that says "All existing data on this device will be OVERWRITTEN when array is Started". If I were to add my disk back, would it rebalance back to the original state?

I use two 1TB Samsung EVO SSDs. These were in RAID1, mirrored. I never did anything to change the RAID setting; whatever happened happened automatically when I rebooted the first time.
johnsanc Posted October 13, 2019

OK - well, I did not add a 2nd disk. I started the array to back up the disk, but now it's rebalancing AGAIN... what is going on?!
jpowell8672 Posted October 13, 2019

https://wiki.unraid.net/Replace_A_Cache_Drive
johnsanc Posted October 13, 2019

So basically back everything up to the array and start over? Also, is there an explanation for what happened in my case? Why did the cache size appear twice as large as the single physical disk remaining?
John_M Posted October 13, 2019

1 hour ago, johnsanc said:
Also, is there an explanation for what happened in my case?

One of the disks in your cache pool dropped offline:

Oct 13 10:56:02 Tower kernel: BTRFS warning (device sdr1): devid 3 uuid 1d8f16d9-c1d4-4ba2-8a87-729545783443 is missing

probably due to a cable problem. You ought to have grabbed diagnostics at that point (Tools -> Diagnostics) and posted the resulting zip file. Syslog fragments are of very limited use - in this case they showed what happened, but not why.

When Unraid notices a disk missing from the cache pool, it rebalances the pool to protect the data on the remaining disk(s) as much as it can. If I understand correctly, the missing disk has now reappeared after a reboot and you want to add it back to the pool? Before you do, I recommend shutting down, checking its cables, then rebooting and grabbing diagnostics. If the disk is properly recognised and in good health, you can re-allocate it to its former slot; it will be reformatted and the pool will be rebalanced again. When the rebalance completes, everything should be as it was before. That's the meaning of the message that the disk will be overwritten - it will be overwritten with data from the remaining healthy members of the pool, which is exactly what you want.
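For future reference, a dropped pool member leaves telltale "is missing" warnings in the syslog, like the one quoted above. A minimal sketch of pulling those warnings out of a saved log (the sample file here reuses the fragment from this thread; on a live Unraid box the log is /var/log/syslog):

```shell
# Sketch: list unique btrfs "missing device" warnings from a saved syslog.
detect_missing() {
    grep -E 'BTRFS warning.*devid [0-9]+ uuid [0-9a-f-]+ is missing' "$1" | sort -u
}

# Demo against the fragment quoted in this thread:
cat > /tmp/syslog.sample <<'EOF'
Oct 13 10:56:02 Tower kernel: BTRFS warning (device sdr1): devid 3 uuid 1d8f16d9-c1d4-4ba2-8a87-729545783443 is missing
Oct 13 10:56:02 Tower kernel: BTRFS warning (device sdr1): devid 3 uuid 1d8f16d9-c1d4-4ba2-8a87-729545783443 is missing
Oct 13 10:56:03 Tower kernel: BTRFS info (device sdr1): found 1 extents
EOF
detect_missing /tmp/syslog.sample   # the duplicated warning collapses to one line
```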
johnsanc Posted October 13, 2019

Yes, that's what I thought, but I was bitten before when the entire pool was overwritten instead of just the one disk, so I am extra cautious. Right now I'm backing everything up to my array just in case something goes south when trying to rebuild the pool. Any explanation as to why it showed my cache as 2TB even though I only had a single 1TB disk at the time? That part I really don't understand.
John_M Posted October 13, 2019

2 minutes ago, johnsanc said:
Any explanation as to why it showed my cache as 2TB even though I only had a single 1TB disk at the time?

I recall it being mentioned somewhere before and being confirmed as a bug.

If you're worried that the healthy members of your cache pool might be overwritten by the bad data on your once-missing, now-returned disk, you can wipe it. You should see it under the Unassigned Devices heading on the Main page. Make a note of its device name, sdX. In a terminal window type wipefs -a /dev/sdX and double, triple check before you press RETURN that you're wiping the correct disk. Then reassign it to its rightful slot in the cache pool and let the system format it and rebalance the pool.
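Worth noting: with no options, wipefs only lists signatures; the -a flag does the actual erasing. A safe way to see the difference is a throwaway image file rather than a real disk (the file path here is illustrative; on the real system the target would be the returned cache device):

```shell
# Demonstrate wipefs on a throwaway image file instead of a real disk.
# On the real system the target is /dev/sdX -- triple-check the name first.
truncate -s 1M /tmp/fake-disk.img
mkswap /tmp/fake-disk.img > /dev/null     # stamp a filesystem signature onto the file
wipefs /tmp/fake-disk.img                 # no options: only LISTS signatures
wipefs -a /tmp/fake-disk.img > /dev/null  # -a actually erases all signatures
wipefs /tmp/fake-disk.img                 # prints nothing now
```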
johnsanc Posted October 13, 2019

OK, so my backup is complete and I reseated the cables. Everything booted up fine. Before the array was started I went to add disk2 back to its rightful slot... now it says disk2 is part of the cache pool and disk1 (which was previously good) is unmountable with no filesystem. From the web GUI it looks like my only option is to format disk1. Is this correct? I would think that disk2 would need to be formatted and re-added... it doesn't make sense to me. If I try to remove disk2 again, I get errors saying that disk2 is missing and I cannot change the number of slots to 1. I guess my worries weren't completely unfounded. The only thing noteworthy I see in the log is this:

Oct 13 16:58:39 Tower emhttpd: shcmd (212): mount -t btrfs -o noatime,nodiratime,degraded -U ec85aae9-18b3-4066-9749-8195b3bee6e8 /mnt/cache
Oct 13 16:58:39 Tower kernel: BTRFS info (device sdr1): allowing degraded mounts
Oct 13 16:58:39 Tower kernel: BTRFS info (device sdr1): disk space caching is enabled
Oct 13 16:58:39 Tower kernel: BTRFS info (device sdr1): has skinny extents
Oct 13 16:58:39 Tower kernel: BTRFS warning (device sdr1): devid 3 uuid 1d8f16d9-c1d4-4ba2-8a87-729545783443 is missing
Oct 13 16:58:39 Tower kernel: BTRFS info (device sdr1): bdev /dev/sdr1 errs: wr 0, rd 2, flush 0, corrupt 0, gen 0
Oct 13 16:58:39 Tower kernel: BTRFS info (device sdr1): bdev /dev/sds1 errs: wr 1, rd 2, flush 0, corrupt 0, gen 0
Oct 13 16:58:39 Tower kernel: BTRFS warning (device sdr1): chunk 14780647735296 missing 1 devices, max tolerance is 0 for writeable mount
Oct 13 16:58:39 Tower kernel: BTRFS warning (device sdr1): writeable mount is not allowed due to too many missing devices
Oct 13 16:58:39 Tower emhttpd: /mnt/cache mount error: No file system
Oct 13 16:58:39 Tower kernel: BTRFS error (device sdr1): open_ctree failed
JonathanM Posted October 13, 2019

Try removing both cache disks from the pool and starting the array, then stop the array and assign both disks at the same time.
johnsanc Posted October 13, 2019

Well, since I had a backup of the cache I just went ahead and formatted. No surprise that everything in the pool was wiped out after doing that. Everything appeared fine on the surface now - a clean slate. So then I went to copy things back to the cache... after a while I decided to check the cache settings and I noticed this. Why is only the Data in RAID1?

Data, RAID1: total=205.00GiB, used=204.18GiB
System, single: total=4.00MiB, used=48.00KiB
Metadata, single: total=1.01GiB, used=588.72MiB
GlobalReserve, single: total=228.83MiB, used=0.00B
John_M Posted October 13, 2019

On the Main page, click on the "cache" device to open its window. Scroll down to the Balance Status section and to the right of the Balance button enter the following:

-dconvert=raid1 -mconvert=raid1

then click the Balance button.
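For anyone who prefers the command line, the same conversion can be started from a terminal. This is a sketch only: it assumes btrfs-progs is installed and the pool is mounted at Unraid's default /mnt/cache, and it skips itself when no pool is mounted there.

```shell
# Sketch: terminal equivalent of the GUI balance described above.
# Assumes the default Unraid cache mount point; pass a different path if yours differs.
run_convert() {
    local mnt=$1
    if mountpoint -q "$mnt"; then
        btrfs balance start -dconvert=raid1 -mconvert=raid1 "$mnt"
    else
        echo "no pool mounted at $mnt; nothing to do"
    fi
}

run_convert /mnt/cache
```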
johnsanc Posted October 13, 2019

Thanks - running that now... I first tried it without those options because the help content on the page said that's what is used by default. I guess that's not the case in all scenarios. Does this look like it's going to be correct when it's done?

Data, RAID1: total=204.00GiB, used=203.17GiB
System, RAID1: total=32.00MiB, used=48.00KiB
Metadata, RAID1: total=1.00GiB, used=256.00KiB
Metadata, single: total=1.00GiB, used=568.91MiB
GlobalReserve, single: total=228.12MiB, used=21.38MiB

EDIT: just refreshed and I think this is looking better:

Data, RAID1: total=205.00GiB, used=204.17GiB
System, RAID1: total=32.00MiB, used=48.00KiB
Metadata, RAID1: total=1.00GiB, used=588.12MiB
GlobalReserve, single: total=228.27MiB, used=0.00B
John_M Posted October 13, 2019

That looks good now. The view you had mid-balance showed that some of the metadata had been converted to RAID1, while some remained as single, pending conversion.
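That "everything except GlobalReserve should be RAID1" check can also be scripted. A sketch that reads saved `btrfs filesystem df` output (the sample reuses the post-balance figures above; GlobalReserve always reports single and is expected to):

```shell
# Sketch: verify every allocation class except GlobalReserve is RAID1.
# Feed it saved `btrfs filesystem df <mountpoint>` output.
all_raid1() {
    ! grep -v '^GlobalReserve' "$1" | grep -qv 'RAID1'
}

cat > /tmp/df.sample <<'EOF'
Data, RAID1: total=205.00GiB, used=204.17GiB
System, RAID1: total=32.00MiB, used=48.00KiB
Metadata, RAID1: total=1.00GiB, used=588.12MiB
GlobalReserve, single: total=228.27MiB, used=0.00B
EOF
all_raid1 /tmp/df.sample && echo "pool fully RAID1"
```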
JorgeB Posted October 14, 2019

7 hours ago, johnsanc said:
Why is only the Data in RAID1?

It's a bug. Make sure in future you monitor the pool for errors, to catch a dropped device sooner.
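One way to script the monitoring JorgeB recommends is to alert on any nonzero counter in `btrfs device stats`. A sketch (the sample stats text below is illustrative; in real use you would pipe the live command output in, and the notification hook is up to you):

```shell
# Sketch: flag nonzero btrfs per-device error counters.
# Real usage: btrfs device stats /mnt/cache | pool_has_errors && <send a notification>
pool_has_errors() {
    # stats lines look like: [/dev/sdr1].write_io_errs   0
    awk '$2 != 0 { found = 1 } END { exit !found }'
}

# Demo with sample stats output (read_io_errs is nonzero):
cat > /tmp/stats.sample <<'EOF'
[/dev/sdr1].write_io_errs   0
[/dev/sdr1].read_io_errs    2
[/dev/sdr1].flush_io_errs   0
[/dev/sdr1].corruption_errs 0
[/dev/sdr1].generation_errs 0
EOF
pool_has_errors < /tmp/stats.sample && echo "errors detected -- investigate"
```

On Unraid such a check could run on a cron schedule; newer btrfs-progs also offer `btrfs device stats --check`, which exits nonzero by itself when any counter is nonzero.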
johnsanc Posted October 14, 2019

@johnnie.black - THANK YOU! I'm glad the RAID1 bug has been logged, and hopefully there is a fix soon so people don't think they are protected when they really aren't. If it weren't for digging through the forums or responses like yours, people would never know. Also thanks for the monitoring tip - I added that script just now. Great idea!
JorgeB Posted October 15, 2019

10 hours ago, johnsanc said:
and hopefully there is a fix soon

It's already been fixed in v6.8-rc1.
Kosmos Posted July 14, 2021

Greetings everyone, I have a similar problem, but there is a key difference: it is only after rebooting the server that my cache pool device cannot be reassigned. My server starts throwing this error message every minute: "Cache pool BTRFS missing device(s)". Then, of course, the "pool" continues on a single drive only (logs below). Consequently, I cannot run the balance command to convert to RAID1. The pool size is displayed as the sum of both drives, which should not be the case; free space is displayed correctly, though. When setting up the pool (again), everything works fine with RAID1 and r/w access. Curiously, the missing device is displayed as the one that continues to work, but maybe that's just for identification of the respective pool. When I use the SSDs individually, they work as intended. They are even attached to the same SATA controller. I'm happy to get your opinions. Best regards

Unraid version: 6.9.2

Jul 14 10:32:54 Server emhttpd: shcmd (1418): mkdir -p /mnt/ssd2cache
Jul 14 10:32:54 Server emhttpd: /mnt/ssd2cache uuid: 45e569e9-eb23-4a71-a966-df286a5e680a
Jul 14 10:32:54 Server emhttpd: /mnt/ssd2cache TotDevices: 2
Jul 14 10:32:54 Server emhttpd: /mnt/ssd2cache NumDevices: 2
Jul 14 10:32:54 Server emhttpd: /mnt/ssd2cache NumFound: 1
Jul 14 10:32:54 Server emhttpd: /mnt/ssd2cache NumMissing: 1
Jul 14 10:32:54 Server emhttpd: /mnt/ssd2cache NumMisplaced: 0
Jul 14 10:32:54 Server emhttpd: /mnt/ssd2cache NumExtra: 1
Jul 14 10:32:54 Server emhttpd: /mnt/ssd2cache LuksState: 1
Jul 14 10:32:54 Server emhttpd: shcmd (1419): mount -t btrfs -o noatime,space_cache=v2,discard=async,degraded -U 45e569e9-eb23-4a71-a966-df286a5e680a /mnt/ssd2cache
Jul 14 10:32:54 Server kernel: BTRFS info (device dm-8): turning on async discard
Jul 14 10:32:54 Server kernel: BTRFS info (device dm-8): allowing degraded mounts
Jul 14 10:32:54 Server kernel: BTRFS info (device dm-8): using free space tree
Jul 14 10:32:54 Server kernel: BTRFS info (device dm-8): has skinny extents
Jul 14 10:32:54 Server kernel: BTRFS warning (device dm-8): devid 2 uuid aae9367d-7288-498d-a637-35cbb816dd99 is missing
Jul 14 10:32:54 Server kernel: BTRFS warning (device dm-8): devid 2 uuid aae9367d-7288-498d-a637-35cbb816dd99 is missing
Jul 14 10:32:54 Server kernel: BTRFS info (device dm-8): enabling ssd optimizations
Jul 14 10:32:54 Server emhttpd: shcmd (1420): /sbin/btrfs device delete missing /mnt/ssd2cache && /sbin/btrfs balance start --full-balance /mnt/ssd2cache &
[...]
Jul 14 12:28:32 Server ool www[26914]: /usr/local/emhttp/plugins/dynamix/scripts/btrfs_balance 'start' '/mnt/ssd2cache' '-dconvert=raid1,soft -mconvert=raid1,soft'
Jul 14 12:28:32 Server kernel: BTRFS error (device dm-8): balance: invalid convert data profile raid1

server-diagnostics-20210714-1540.zip
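The `balance: invalid convert data profile raid1` error at the end of that log is consistent with the pool having only one attached device at that point (raid1 needs at least two). A sketch of a sanity check before attempting the conversion, run against saved `btrfs filesystem show` output (the sample below is hypothetical apart from the pool uuid quoted in the log; sizes and paths are made up for illustration):

```shell
# Sketch: count attached devices in saved `btrfs filesystem show` output
# before attempting a raid1 conversion (raid1 needs at least two devices).
attached_devices() {
    grep -c 'devid' "$1"
}

cat > /tmp/show.sample <<'EOF'
Label: none  uuid: 45e569e9-eb23-4a71-a966-df286a5e680a
        Total devices 2 FS bytes used 200.00GiB
        devid    1 size 931.51GiB used 205.00GiB path /dev/mapper/sdr1
        *** Some devices missing
EOF
attached_devices /tmp/show.sample   # prints 1: the pool is degraded, so conversion will fail
```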
JorgeB Posted July 14, 2021

Please post the diagnostics: Tools -> Diagnostics
Kosmos Posted July 14, 2021

2 hours ago, JorgeB said:
Please post the diagnostics: Tools -> Diagnostics

My bad - you can find them attached to the previous post.