fusselnerd Posted July 2

Hello community, I need some help or suggestions on how to rebuild/repair an array with two disks that seemingly have a corrupted BTRFS file system.

Setup
Unraid array (BTRFS)
2 x parity disk 18 TB
6 x array disk 8 TB

Current state
Parity: OK
Disks 2, 4, 5, 6: OK
Disks 1 and 3: not mountable, BTRFS error: superblock checksum mismatch...
Array: unsafe, missing disks are emulated

What happened
For some time, different disks went missing repeatedly. When that happened I managed to repair the array, either because I got the missing disk running again or by clearing the disk and rebuilding the array. When two array disks went missing at the same time, I had enough and stopped all containers to search for the root cause. Because the missing disks were emulated, I first made a full backup with rsync, which went well after the second try. Then I figured out that the main issue was the 10-port PCIe SATA controller, so I switched to a proper SAS controller. The two missing disks are back, and the boot log doesn't throw any errors. So far so good. Then I assigned the missing disks back to the array, started the array and started the rebuild. Only after a couple of minutes did I see the warning that the disks are unmountable. My mistake... So I stopped the rebuild/sync.

Issue
[...]
Jul 2 12:22:13 GrayBigBerta emhttpd: Mounting disks...
Jul 2 12:22:13 GrayBigBerta emhttpd: mounting /mnt/disk1
Jul 2 12:22:13 GrayBigBerta emhttpd: shcmd (1960): mkdir -p /mnt/disk1
Jul 2 12:22:13 GrayBigBerta emhttpd: /mnt/disk1: no btrfs or device /dev/md1p1 is not single
Jul 2 12:22:13 GrayBigBerta emhttpd: /mnt/disk1 mount error: Unsupported or no file system
Jul 2 12:22:13 GrayBigBerta emhttpd: shcmd (1961): rmdir /mnt/disk1
Jul 2 12:22:13 GrayBigBerta emhttpd: mounting /mnt/disk2
Jul 2 12:22:13 GrayBigBerta emhttpd: shcmd (1962): mkdir -p /mnt/disk2
Jul 2 12:22:14 GrayBigBerta emhttpd: shcmd (1963): mount -t btrfs -o noatime,space_cache=v2 /dev/md2p1 /mnt/disk2
Jul 2 12:22:14 GrayBigBerta kernel: BTRFS info (device md2p1): using crc32c (crc32c-intel) checksum algorithm
Jul 2 12:22:14 GrayBigBerta kernel: BTRFS info (device md2p1): using free space tree
Jul 2 12:22:15 GrayBigBerta kernel: BTRFS info (device md2p1): bdev /dev/md2p1 errs: wr 0, rd 0, flush 0, corrupt 22, gen 0
Jul 2 12:22:25 GrayBigBerta emhttpd: shcmd (1964): btrfs filesystem resize 1:max /mnt/disk2
Jul 2 12:22:25 GrayBigBerta root: Resize device id 1 (/dev/md2p1) from 7.28TiB to max
Jul 2 12:22:25 GrayBigBerta kernel: BTRFS info (device md2p1): resizing devid 1
Jul 2 12:22:25 GrayBigBerta emhttpd: mounting /mnt/disk3
Jul 2 12:22:25 GrayBigBerta emhttpd: shcmd (1965): mkdir -p /mnt/disk3
Jul 2 12:22:26 GrayBigBerta emhttpd: /mnt/disk3: no btrfs or device /dev/md3p1 is not single
Jul 2 12:22:26 GrayBigBerta emhttpd: /mnt/disk3 mount error: Unsupported or no file system
Jul 2 12:22:26 GrayBigBerta emhttpd: shcmd (1966): rmdir /mnt/disk3
[...]

Trying to mount the disks as unassigned devices:

Jul 2 12:38:30 GrayBigBerta unassigned.devices: Mounting partition 'sdg1' at mountpoint '/mnt/disks/VRJW879K'...
Jul 2 12:38:30 GrayBigBerta unassigned.devices: Mount cmd: /sbin/mount -t 'btrfs' -o rw,relatime,space_cache=v2 '/dev/sdg1' '/mnt/disks/VRJW879K'
Jul 2 12:38:30 GrayBigBerta kernel: BTRFS: device fsid d92a06ea-1eb0-4fd1-8aa3-47e0d921bdd8 devid 1 transid 29956 /dev/sdg1 scanned by mount (4220)
Jul 2 12:38:30 GrayBigBerta kernel: BTRFS info (device sdg1): using crc32c (crc32c-intel) checksum algorithm
Jul 2 12:38:30 GrayBigBerta kernel: BTRFS error (device sdg1): superblock checksum mismatch
Jul 2 12:38:30 GrayBigBerta kernel: BTRFS error (device sdg1): open_ctree failed
Jul 2 12:38:32 GrayBigBerta unassigned.devices: Mount of 'sdg1' failed: 'mount: /mnt/disks/VRJW879K: wrong fs type, bad option, bad superblock on /dev/sdg1, missing codepage or helper program, or other error. dmesg(1) may have more information after failed mount system call.'
Jul 2 12:38:32 GrayBigBerta unassigned.devices: Partition 'VRJW879K' cannot be mounted.

My own idea
Because the array is still functional due to the emulated disks, I had the idea to simply format the two disks and assign them as "new" disks to the array. I have done that with single disks before, but not with two at the same time, so I'm not sure if that is a good idea.

Is there a way to repair the filesystem? If yes, I'd kindly ask for help. Note: the rebuild was running for ten minutes or so before I cancelled it; I don't know if this makes any difference. Or should I go with my idea?

Thank you very much in advance!

graybigberta-diagnostics-20240702-1251.zip
JorgeB Posted July 2

24 minutes ago, fusselnerd said:
Because the array is still functional due to the emulated disks, I had the idea to simply format the two disks and assign them as "new" disks to the array.

Don't do this, formatting disks is never a solution when trying to recover data. Post the output of:

btrfs fi show
fusselnerd Posted July 2

Thank you for your fast response.

3 minutes ago, JorgeB said:
Don't do this, formatting disks is never a solution when trying to recover data.

OK, I'll keep that in mind. Here's the requested output:

root@GrayBigBerta:~# btrfs fi show
ERROR: superblock checksum mismatch
ERROR: cannot scan /dev/sdb1: Input/output error
ERROR: superblock checksum mismatch
ERROR: cannot scan /dev/sdg1: Input/output error
Label: none  uuid: af3b39b5-c791-4ea4-880d-fc1ad26cfc2d
        Total devices 1 FS bytes used 3.64TiB
        devid 1 size 7.28TiB used 3.72TiB path /dev/sdf1
Label: none  uuid: 39a4d42e-8ae1-436c-ae74-488cb24183bb
        Total devices 1 FS bytes used 685.45MiB
        devid 1 size 465.76GiB used 4.02GiB path /dev/nvme0n1p1
Label: none  uuid: ffae4078-e89d-4329-b2b9-bdd13773a8ec
        Total devices 1 FS bytes used 3.64TiB
        devid 1 size 7.28TiB used 3.71TiB path /dev/sdd1
Label: none  uuid: c40c0298-85ed-4130-aa76-cdacac9ccfa5
        Total devices 1 FS bytes used 120.40GiB
        devid 1 size 465.76GiB used 177.02GiB path /dev/sdm1
Label: none  uuid: 5f5f56e8-f435-4b81-9042-8cccd1fb7f8e
        Total devices 1 FS bytes used 76.42GiB
        devid 2 size 223.58GiB used 78.03GiB path /dev/sdi1
Label: none  uuid: d326d8d7-9da5-4d55-b3ad-43541260b369
        Total devices 1 FS bytes used 144.00KiB
        devid 1 size 931.51GiB used 3.02GiB path /dev/nvme2n1p1
Label: none  uuid: cf55c94a-4fd3-4030-a415-1d96a475aa3c
        Total devices 1 FS bytes used 5.40TiB
        devid 1 size 7.28TiB used 5.47TiB path /dev/sde1
Label: none  uuid: a93250e3-43bc-41c9-adbc-76ac0b3b0b16
        Total devices 1 FS bytes used 46.88MiB
        devid 1 size 111.79GiB used 3.02GiB path /dev/sdl1
Label: none  uuid: 2e238485-e144-4d1f-aa1a-13097d3a3e99
        Total devices 1 FS bytes used 66.90GiB
        devid 1 size 232.88GiB used 83.02GiB path /dev/nvme1n1p1
Label: none  uuid: 05612964-8ba0-475a-b544-e716f5a03167
        Total devices 1 FS bytes used 196.00KiB
        devid 1 size 465.76GiB used 5.02GiB path /dev/sdj1
Label: none  uuid: b43c9020-8d0b-4e0e-a3c3-39ec11f9e096
        Total devices 1 FS bytes used 3.64TiB
        devid 1 size 7.28TiB used 3.70TiB path /dev/sdh1
JorgeB Posted July 2

With the array stopped, type in the CLI:

echo 1 > /sys/block/sdb/device/delete

Wait 5 seconds, then post the output of:

btrfs fi show
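For context: the echo only detaches sdb from the kernel's view - nothing is written to the disk - so the remaining scan errors can be matched to the other device. A minimal sketch, assuming sdb really is one of the two unmountable array disks:

lsblk -o NAME,SIZE,MODEL,SERIAL /dev/sdb   # confirm this is the expected disk before detaching it
echo 1 > /sys/block/sdb/device/delete      # detach sdb; it comes back after a reboot or controller rescan
btrfs fi show                              # sdb1 should no longer show up or produce a scan error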
fusselnerd Posted July 2

Here's the output:

root@GrayBigBerta:~# echo 1 > /sys/block/sdb/device/delete
root@GrayBigBerta:~# btrfs fi show
ERROR: superblock checksum mismatch
ERROR: cannot scan /dev/sdg1: Input/output error
Label: none  uuid: af3b39b5-c791-4ea4-880d-fc1ad26cfc2d
        Total devices 1 FS bytes used 3.64TiB
        devid 1 size 7.28TiB used 3.72TiB path /dev/sdf1
Label: none  uuid: 39a4d42e-8ae1-436c-ae74-488cb24183bb
        Total devices 1 FS bytes used 685.45MiB
        devid 1 size 465.76GiB used 4.02GiB path /dev/nvme0n1p1
Label: none  uuid: ffae4078-e89d-4329-b2b9-bdd13773a8ec
        Total devices 1 FS bytes used 3.64TiB
        devid 1 size 7.28TiB used 3.71TiB path /dev/sdd1
Label: none  uuid: c40c0298-85ed-4130-aa76-cdacac9ccfa5
        Total devices 1 FS bytes used 120.40GiB
        devid 1 size 465.76GiB used 177.02GiB path /dev/sdm1
Label: none  uuid: 5f5f56e8-f435-4b81-9042-8cccd1fb7f8e
        Total devices 1 FS bytes used 76.42GiB
        devid 2 size 223.58GiB used 78.03GiB path /dev/sdi1
Label: none  uuid: d326d8d7-9da5-4d55-b3ad-43541260b369
        Total devices 1 FS bytes used 144.00KiB
        devid 1 size 931.51GiB used 3.02GiB path /dev/nvme2n1p1
Label: none  uuid: cf55c94a-4fd3-4030-a415-1d96a475aa3c
        Total devices 1 FS bytes used 5.40TiB
        devid 1 size 7.28TiB used 5.47TiB path /dev/sde1
Label: none  uuid: a93250e3-43bc-41c9-adbc-76ac0b3b0b16
        Total devices 1 FS bytes used 46.88MiB
        devid 1 size 111.79GiB used 3.02GiB path /dev/sdl1
Label: none  uuid: 2e238485-e144-4d1f-aa1a-13097d3a3e99
        Total devices 1 FS bytes used 66.90GiB
        devid 1 size 232.88GiB used 83.02GiB path /dev/nvme1n1p1
Label: none  uuid: 05612964-8ba0-475a-b544-e716f5a03167
        Total devices 1 FS bytes used 196.00KiB
        devid 1 size 465.76GiB used 5.02GiB path /dev/sdj1
Label: none  uuid: b43c9020-8d0b-4e0e-a3c3-39ec11f9e096
        Total devices 1 FS bytes used 3.64TiB
        devid 1 size 7.28TiB used 3.70TiB path /dev/sdh1
JorgeB Posted July 3

Sorry, I missed your last post. Reboot to bring the other device back online and post new diags before starting the array.
fusselnerd Posted July 3

No problem, diags attached, and here's the output from btrfs fi show (I skipped the other drives):

root@GrayBigBerta:~# btrfs fi show
ERROR: superblock checksum mismatch
ERROR: cannot scan /dev/sdb1: Input/output error
ERROR: superblock checksum mismatch
ERROR: cannot scan /dev/sdg1: Input/output error
[...]

graybigberta-diagnostics-20240703-1405.zip
JorgeB Posted July 3

Start the array with both disks unassigned and post new diags. If I understood correctly you have a backup of both disks?
fusselnerd Posted July 3

Diags after starting the array attached.

10 minutes ago, JorgeB said:
If I understood correctly you have a backup of both disks?

Kind of... I have a backup of most Unraid shares and their content, so the content of both disks is included. Note: I made the backup from the emulated fs (luckily there are two parity disks...). Note 2: shares are split automatically at directory level (high-water, standard configuration). So sadly, I don't have a copy or clone of the disks themselves, if that's what you mean.

graybigberta-diagnostics-20240703-1458.zip
fusselnerd Posted July 3

On a second look, the missing disks don't appear as locations in the shares anymore. Seems like the data is "lost" from the array. I guess that happened when Unraid started to rebuild the array automatically a couple of days ago...
JorgeB Posted July 3

Both emulated disks are not mounting; are you sure whether they were mounting or not when you did the backup? If they weren't mounting at the time, and that would be the most likely, no data would have been copied from them.

This error is kind of strange, but I think it may not be recoverable. I also see data corruption being detected on multiple disks, so you may have bad RAM, which may or may not be related to the current problem:

Jul 3 14:58:54 GrayBigBerta kernel: BTRFS info (device md2p1): bdev /dev/md2p1 errs: wr 0, rd 0, flush 0, corrupt 22, gen 0
Jul 3 14:59:05 GrayBigBerta kernel: BTRFS info (device md4p1): bdev /dev/md4p1 errs: wr 359, rd 1, flush 0, corrupt 71, gen 0
Jul 3 14:59:09 GrayBigBerta kernel: BTRFS info (device md5p1): bdev /dev/md5p1 errs: wr 0, rd 0, flush 0, corrupt 287, gen 0
Jul 3 14:59:12 GrayBigBerta kernel: BTRFS info (device md6p1): bdev /dev/md6p1 errs: wr 0, rd 0, flush 0, corrupt 7, gen 0

First thing I would recommend is to run memtest for at least a couple of passes. You will also need to scrub all those disks, but that can be for later; run memtest now and post back the results, though keep in mind that memtest is only definitive if errors are found.
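The same counters can also be read per mounted filesystem, without digging through the syslog - a sketch, assuming disk2 is mounted at /mnt/disk2:

btrfs device stats /mnt/disk2   # prints write_io_errs, read_io_errs, flush_io_errs, corruption_errs, generation_errs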
fusselnerd Posted July 3

Thank you, memtest v7 is running now. I'll post the results after a couple of passes.
fusselnerd Posted July 3

38 minutes ago, JorgeB said:
If they weren't mounting at the time, and that would be the most likely, no data would have been copied from them.

No, they weren't. But this confuses me. Maybe you could help me understand in the meantime... From my understanding, the Unraid array parity can absorb disk failures, similar to RAID parity: if I have an array with one parity disk and one array disk fails for whatever reason, then the data on the failed disk is calculated from the parity and the remaining disks. So as long as a second disk doesn't fail, the data should be available (emulated disk). The same should apply to two parity disks and a maximum of two failing array disks (which is the case in my setup). Do I fundamentally misunderstand something here?
JorgeB Posted July 3

54 minutes ago, fusselnerd said:
From my understanding, the Unraid array parity can absorb disk failures, similar to RAID parity:

Parity helps if a disk fails, but if on top of that there's filesystem corruption, it cannot help with that part.
fusselnerd Posted July 3

So the btrfs filesystems on the array disks themselves have the problem... OK, then that makes sense. Thank you for clarifying!

BTW, 2 memtest passes so far, no errors...
fusselnerd Posted July 4

Hi JorgeB, memtest86 results attached. No errors after 4 passes.
JorgeB Posted July 4

Keep in mind that memtest is only definitive if it finds errors, but for now, let's make a final try to recover the data: finish rebuilding both disks, then we can try to use a backup superblock to see if that works, but I don't have much hope. If that doesn't work, you can then try using a file recovery app like UFS Explorer on the rebuilt disks.

You will also need to scrub all the disks that have corruption detected, then reset the errors and monitor to see if new ones come up, but this is for later.
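For later reference, the scrub and the error-counter reset could look something like this - a sketch, assuming the array is started and the affected disks are mounted at /mnt/diskX (repeat for each disk that showed corruption):

btrfs scrub start -B /mnt/disk2    # -B stays in the foreground and prints a summary when finished
btrfs scrub status /mnt/disk2      # progress plus the number of checksum errors found
btrfs device stats -z /mnt/disk2   # print the error counters and reset them to zero for monitoring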
fusselnerd Posted July 4

Ah, now I understand what you meant by "definitive" - letting memtest run until it finds an error. Sorry, English is not my first language 😅

OK, I assigned both disks and started the array. The rebuild started automatically, but both disks are labeled as "Unmountable: Unsupported or no file system". I read in another post that in this case the rebuild will not actually write anything to the disks. I paused the rebuild for now. Shall I proceed?
JorgeB Posted July 4

It will still write to the disks. BTW, I forgot to ask: those disks are the original disks 1 and 3, right? Or are they new and you still have the old ones?
fusselnerd Posted July 4

Just now, JorgeB said:
It will still write to the disks.

OK, thanks. The rebuild is resuming.

Just now, JorgeB said:
those disks are the original disks 1 and 3, right? Or are they new and you still have the old ones?

They are the original ones in the original order.
JorgeB Posted July 4

OK. If this happens in the future, never start rebuilding on top of the old disk if the emulated disk doesn't mount; if you hadn't tried that, the original disks could still be OK. Rebuilding an unmountable disk will always result in an unmountable disk, but now there's no other option. Once they are rebuilt, we can see if the backup superblock helps; if it doesn't, you can run UFS Explorer on them, which cannot be run on emulated disks.
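Besides UFS Explorer, btrfs restore from btrfs-progs is sometimes worth a try on the rebuilt disks; it copies files out without needing to mount the filesystem. A sketch only - the target path is just an example and must be on a different, healthy disk with enough free space:

btrfs restore -v -i -D /dev/md1p1 /mnt/disks/recovery_target   # -D dry run: list what would be restored
btrfs restore -v -i /dev/md1p1 /mnt/disks/recovery_target      # real run; -i ignores errors and continues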
fusselnerd Posted July 5

Hi @JorgeB, the rebuild is complete. Next step is

On 7/4/2024 at 12:49 PM, JorgeB said:
try to use a backup superblock

right? Could you please guide me through the process?

On 7/4/2024 at 1:27 PM, JorgeB said:
If this happens in the future, never start rebuilding on top of the old disk if the emulated disk doesn't mount; if you hadn't tried that, the original disks could still be OK. Rebuilding an unmountable disk will always result in an unmountable disk

Got it, and I'll keep it in mind. I went through the Unraid docs again, and it's mentioned there several times. Lesson learned the hard way... The only issue I have is that I couldn't see if the drives were mountable before starting the array, and starting the array will automatically trigger the rebuild (though I might be remembering that wrong). So if such a situation ever happens again, I will test a temporarily failed disk first, e.g. by mounting it separately, before starting the array. And of course, keep an eye open for fs errors. But maybe this is a topic for another discussion.
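Something like this is what I have in mind for such a test - read-only, so nothing gets changed on the disk; the device name is just a placeholder:

mkdir -p /mnt/test
mount -o ro /dev/sdX1 /mnt/test   # read-only test mount; fails immediately if the filesystem is damaged
umount /mnt/test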
JorgeB Posted July 5

1 hour ago, fusselnerd said:
The only issue I have is that I couldn't see if the drives were mountable before starting the array, and starting the array will automatically trigger the rebuild

After the disks get disabled they won't start rebuilding automatically; if they were unassigned, you can start the array without the disks assigned to see if the emulated disks are working.

Start the array in maintenance mode and post the output of:

btrfs-select-super -s 1 /dev/md1p1

and

btrfs-select-super -s 1 /dev/md3p1
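If you want to see what state the superblock copies are in before switching, they can be dumped first - a sketch:

btrfs inspect-internal dump-super -f -a /dev/md1p1   # -a prints all superblock copies, -f the full contents
btrfs inspect-internal dump-super -f -a /dev/md3p1

Note that btrfs-select-super -s 1 just overwrites the damaged primary superblock with backup copy 1, so it can only help if at least one backup copy is still valid.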
fusselnerd Posted July 5

Here we go:

root@GrayBigBerta:~# btrfs-select-super -s 1 /dev/md1p1
ERROR: superblock checksum mismatch
ERROR: superblock checksum mismatch
No valid Btrfs found on /dev/md1p1
ERROR: open ctree failed
root@GrayBigBerta:~# btrfs-select-super -s 1 /dev/md3p1
ERROR: superblock checksum mismatch
ERROR: superblock checksum mismatch
No valid Btrfs found on /dev/md3p1
ERROR: open ctree failed
fusselnerd Posted July 5

6 minutes ago, JorgeB said:
After the disks get disabled they won't start rebuilding automatically; if they were unassigned, you can start the array without the disks assigned to see if the emulated disks are working.

I see, thank you! Can you recommend something to read about btrfs and the Unraid array? I'm obviously missing the fundamentals, so I'd like to dig into it a bit.