
Need help: rebuilding Array after controller failure - two array disks seem to have a broken BTRFS file system



Hello community,

 

I need some help or suggestions on how to rebuild/repair an array where two disks seem to have a corrupted BTRFS file system.

 

Setup

  • Unraid Array (BTRFS)
  • 2 x Parity Disk 18TB
  • 6 x Array Disk 8TB

 

Current State

  • Parity: OK
  • Disks 2, 4, 5, 6: OK
  • Disk 1 and Disk 3: not mountable, BTRFS Error: superblock checksum mismatch...
  • Array: unsafe, missing disks are emulated

 

What happened

For some time, different disks kept going missing. When that happened, I managed to repair the array either by getting the missing disk running again or by clearing the disk and rebuilding the array.

When two array disks went missing at the same time, I had enough and stopped all containers to search for the root cause.

Because the missing disks were emulated, I made a full backup with rsync first, which went well on the second try.

Then I figured out that the main issue was the 10-port PCIe SATA controller, so I switched to a proper SAS controller.

The two missing disks are back, and the boot log doesn't show any errors. So far, so good.

 

Then I assigned the missing disks back to the array, started the array and started the rebuild.

Only after a couple of minutes did I see the warning that the disks are unmountable. My mistake...

[Screenshot: array status showing the two disks as unmountable]

 

So I stopped the rebuild/sync.

 

Issue

[...]
Jul  2 12:22:13 GrayBigBerta emhttpd: Mounting disks...
Jul  2 12:22:13 GrayBigBerta emhttpd: mounting /mnt/disk1
Jul  2 12:22:13 GrayBigBerta emhttpd: shcmd (1960): mkdir -p /mnt/disk1
Jul  2 12:22:13 GrayBigBerta emhttpd: /mnt/disk1: no btrfs or device /dev/md1p1 is not single
Jul  2 12:22:13 GrayBigBerta emhttpd: /mnt/disk1 mount error: Unsupported or no file system
Jul  2 12:22:13 GrayBigBerta emhttpd: shcmd (1961): rmdir /mnt/disk1
Jul  2 12:22:13 GrayBigBerta emhttpd: mounting /mnt/disk2
Jul  2 12:22:13 GrayBigBerta emhttpd: shcmd (1962): mkdir -p /mnt/disk2
Jul  2 12:22:14 GrayBigBerta emhttpd: shcmd (1963): mount -t btrfs -o noatime,space_cache=v2 /dev/md2p1 /mnt/disk2
Jul  2 12:22:14 GrayBigBerta kernel: BTRFS info (device md2p1): using crc32c (crc32c-intel) checksum algorithm
Jul  2 12:22:14 GrayBigBerta kernel: BTRFS info (device md2p1): using free space tree
Jul  2 12:22:15 GrayBigBerta kernel: BTRFS info (device md2p1): bdev /dev/md2p1 errs: wr 0, rd 0, flush 0, corrupt 22, gen 0
Jul  2 12:22:25 GrayBigBerta emhttpd: shcmd (1964): btrfs filesystem resize 1:max /mnt/disk2
Jul  2 12:22:25 GrayBigBerta root: Resize device id 1 (/dev/md2p1) from 7.28TiB to max
Jul  2 12:22:25 GrayBigBerta kernel: BTRFS info (device md2p1): resizing devid 1
Jul  2 12:22:25 GrayBigBerta emhttpd: mounting /mnt/disk3
Jul  2 12:22:25 GrayBigBerta emhttpd: shcmd (1965): mkdir -p /mnt/disk3
Jul  2 12:22:26 GrayBigBerta emhttpd: /mnt/disk3: no btrfs or device /dev/md3p1 is not single
Jul  2 12:22:26 GrayBigBerta emhttpd: /mnt/disk3 mount error: Unsupported or no file system
Jul  2 12:22:26 GrayBigBerta emhttpd: shcmd (1966): rmdir /mnt/disk3
[...]

 

Trying to mount the disks as unassigned devices:

Jul  2 12:38:30 GrayBigBerta unassigned.devices: Mounting partition 'sdg1' at mountpoint '/mnt/disks/VRJW879K'...
Jul  2 12:38:30 GrayBigBerta unassigned.devices: Mount cmd: /sbin/mount -t 'btrfs' -o rw,relatime,space_cache=v2 '/dev/sdg1' '/mnt/disks/VRJW879K'
Jul  2 12:38:30 GrayBigBerta kernel: BTRFS: device fsid d92a06ea-1eb0-4fd1-8aa3-47e0d921bdd8 devid 1 transid 29956 /dev/sdg1 scanned by mount (4220)
Jul  2 12:38:30 GrayBigBerta kernel: BTRFS info (device sdg1): using crc32c (crc32c-intel) checksum algorithm
Jul  2 12:38:30 GrayBigBerta kernel: BTRFS error (device sdg1): superblock checksum mismatch
Jul  2 12:38:30 GrayBigBerta kernel: BTRFS error (device sdg1): open_ctree failed
Jul  2 12:38:32 GrayBigBerta unassigned.devices: Mount of 'sdg1' failed: 'mount: /mnt/disks/VRJW879K: wrong fs type, bad option, bad superblock on /dev/sdg1, missing codepage or helper program, or other error.        dmesg(1) may have more information after failed mount system call. '
Jul  2 12:38:32 GrayBigBerta unassigned.devices: Partition 'VRJW879K' cannot be mounted.
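
For reference, the superblock copies can be inspected read-only without mounting; this is just a sketch, assuming the disk still shows up as /dev/sdg1 as in the log above:

# Dump all superblock copies (read-only). -F forces output even when the
# primary superblock looks invalid, so the backup copies can be compared.
btrfs inspect-internal dump-super -fFa /dev/sdg1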

 

My own idea

Because the array is still functional due to the emulated disks, I had the idea to simply format the two disks and assign them as "new" disks to the array.

I have done that with single disks before, but not with two at the same time, so I'm not sure whether that is a good idea.

 

 

Is there a way to repair the file system? If so, I'd kindly ask for help.

Note: the rebuild ran for ten minutes or so before I canceled it; I don't know if this makes any difference.

 

Or should I go with my idea?

 

 

Thank you very much in advance!

 

 

 

graybigberta-diagnostics-20240702-1251.zip

Link to comment
24 minutes ago, fusselnerd said:

Because the array is still functional due to the emulated disks, I had the idea to simply format the two disks and assign them as "new" disks to the array.

Don't do this, formatting disks is never a solution when trying to recover data.

 

Post the output of

btrfs fi show

 

Link to comment

Thank you for your fast response.

 

3 minutes ago, JorgeB said:

Don't do this, formatting disks is never a solution when trying to recover data.

Ok, I keep that in mind.

 

 

Here's the requested output:

root@GrayBigBerta:~# btrfs fi show
ERROR: superblock checksum mismatch
ERROR: cannot scan /dev/sdb1: Input/output error
ERROR: superblock checksum mismatch
ERROR: cannot scan /dev/sdg1: Input/output error
Label: none  uuid: af3b39b5-c791-4ea4-880d-fc1ad26cfc2d
        Total devices 1 FS bytes used 3.64TiB
        devid    1 size 7.28TiB used 3.72TiB path /dev/sdf1

Label: none  uuid: 39a4d42e-8ae1-436c-ae74-488cb24183bb
        Total devices 1 FS bytes used 685.45MiB
        devid    1 size 465.76GiB used 4.02GiB path /dev/nvme0n1p1

Label: none  uuid: ffae4078-e89d-4329-b2b9-bdd13773a8ec
        Total devices 1 FS bytes used 3.64TiB
        devid    1 size 7.28TiB used 3.71TiB path /dev/sdd1

Label: none  uuid: c40c0298-85ed-4130-aa76-cdacac9ccfa5
        Total devices 1 FS bytes used 120.40GiB
        devid    1 size 465.76GiB used 177.02GiB path /dev/sdm1

Label: none  uuid: 5f5f56e8-f435-4b81-9042-8cccd1fb7f8e
        Total devices 1 FS bytes used 76.42GiB
        devid    2 size 223.58GiB used 78.03GiB path /dev/sdi1

Label: none  uuid: d326d8d7-9da5-4d55-b3ad-43541260b369
        Total devices 1 FS bytes used 144.00KiB
        devid    1 size 931.51GiB used 3.02GiB path /dev/nvme2n1p1

Label: none  uuid: cf55c94a-4fd3-4030-a415-1d96a475aa3c
        Total devices 1 FS bytes used 5.40TiB
        devid    1 size 7.28TiB used 5.47TiB path /dev/sde1

Label: none  uuid: a93250e3-43bc-41c9-adbc-76ac0b3b0b16
        Total devices 1 FS bytes used 46.88MiB
        devid    1 size 111.79GiB used 3.02GiB path /dev/sdl1

Label: none  uuid: 2e238485-e144-4d1f-aa1a-13097d3a3e99
        Total devices 1 FS bytes used 66.90GiB
        devid    1 size 232.88GiB used 83.02GiB path /dev/nvme1n1p1

Label: none  uuid: 05612964-8ba0-475a-b544-e716f5a03167
        Total devices 1 FS bytes used 196.00KiB
        devid    1 size 465.76GiB used 5.02GiB path /dev/sdj1

Label: none  uuid: b43c9020-8d0b-4e0e-a3c3-39ec11f9e096
        Total devices 1 FS bytes used 3.64TiB
        devid    1 size 7.28TiB used 3.70TiB path /dev/sdh1

 

Link to comment

Here's the output:

 

root@GrayBigBerta:~# echo 1 > /sys/block/sdb/device/delete
root@GrayBigBerta:~# btrfs fi show
ERROR: superblock checksum mismatch
ERROR: cannot scan /dev/sdg1: Input/output error
Label: none  uuid: af3b39b5-c791-4ea4-880d-fc1ad26cfc2d
        Total devices 1 FS bytes used 3.64TiB
        devid    1 size 7.28TiB used 3.72TiB path /dev/sdf1

Label: none  uuid: 39a4d42e-8ae1-436c-ae74-488cb24183bb
        Total devices 1 FS bytes used 685.45MiB
        devid    1 size 465.76GiB used 4.02GiB path /dev/nvme0n1p1

Label: none  uuid: ffae4078-e89d-4329-b2b9-bdd13773a8ec
        Total devices 1 FS bytes used 3.64TiB
        devid    1 size 7.28TiB used 3.71TiB path /dev/sdd1

Label: none  uuid: c40c0298-85ed-4130-aa76-cdacac9ccfa5
        Total devices 1 FS bytes used 120.40GiB
        devid    1 size 465.76GiB used 177.02GiB path /dev/sdm1

Label: none  uuid: 5f5f56e8-f435-4b81-9042-8cccd1fb7f8e
        Total devices 1 FS bytes used 76.42GiB
        devid    2 size 223.58GiB used 78.03GiB path /dev/sdi1

Label: none  uuid: d326d8d7-9da5-4d55-b3ad-43541260b369
        Total devices 1 FS bytes used 144.00KiB
        devid    1 size 931.51GiB used 3.02GiB path /dev/nvme2n1p1

Label: none  uuid: cf55c94a-4fd3-4030-a415-1d96a475aa3c
        Total devices 1 FS bytes used 5.40TiB
        devid    1 size 7.28TiB used 5.47TiB path /dev/sde1

Label: none  uuid: a93250e3-43bc-41c9-adbc-76ac0b3b0b16
        Total devices 1 FS bytes used 46.88MiB
        devid    1 size 111.79GiB used 3.02GiB path /dev/sdl1

Label: none  uuid: 2e238485-e144-4d1f-aa1a-13097d3a3e99
        Total devices 1 FS bytes used 66.90GiB
        devid    1 size 232.88GiB used 83.02GiB path /dev/nvme1n1p1

Label: none  uuid: 05612964-8ba0-475a-b544-e716f5a03167
        Total devices 1 FS bytes used 196.00KiB
        devid    1 size 465.76GiB used 5.02GiB path /dev/sdj1

Label: none  uuid: b43c9020-8d0b-4e0e-a3c3-39ec11f9e096
        Total devices 1 FS bytes used 3.64TiB
        devid    1 size 7.28TiB used 3.70TiB path /dev/sdh1

 

Link to comment

Diags after starting array attached.

 

10 minutes ago, JorgeB said:

If I understood correctly you have a backup of both disks?

Kind of... I have a backup of most Unraid shares and their contents, so the content of both disks is included.

 

Note: I made the backup from the emulated file systems (luckily there are two parity disks...).

Note 2: Shares are split automatically at the directory level (High-water, the standard configuration). So...

 

Sadly, I don't have a copy or clone of the disks themselves, if that's what you mean.

graybigberta-diagnostics-20240703-1458.zip

Link to comment

On second look, the missing disks don't appear as locations in the shares anymore. It seems like the data is "lost" from the array.

I guess that happened when Unraid started to rebuild the array automatically a couple of days ago...

Link to comment

Both emulated disks are not mounting. Are you sure whether they were mounting or not when you did the backup? If they weren't mounting at the time, which is most likely, no data would have been copied from them.

 

This error is kind of strange, and I think it may not be recoverable. I also see data corruption being detected on multiple disks, so you may have bad RAM, which may or may not be related to the current problem:

 

Jul  3 14:58:54 GrayBigBerta kernel: BTRFS info (device md2p1): bdev /dev/md2p1 errs: wr 0, rd 0, flush 0, corrupt 22, gen 0

Jul  3 14:59:05 GrayBigBerta kernel: BTRFS info (device md4p1): bdev /dev/md4p1 errs: wr 359, rd 1, flush 0, corrupt 71, gen 0

Jul  3 14:59:09 GrayBigBerta kernel: BTRFS info (device md5p1): bdev /dev/md5p1 errs: wr 0, rd 0, flush 0, corrupt 287, gen 0

Jul  3 14:59:12 GrayBigBerta kernel: BTRFS info (device md6p1): bdev /dev/md6p1 errs: wr 0, rd 0, flush 0, corrupt 7, gen 0

 

The first thing I would recommend is to run memtest for at least a couple of passes. You will also need to scrub all those disks, but that can wait; run memtest now and post back the results, though keep in mind that memtest is only definitive if errors are found.
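
For what it's worth, here is a read-only way to check the accumulated error counters on those disks; just a sketch, assuming the mount points /mnt/diskN from the logs above:

# Read the persistent BTRFS error counters (read-only, nothing is changed)
for d in /mnt/disk2 /mnt/disk4 /mnt/disk5 /mnt/disk6; do
    echo "== $d =="
    btrfs device stats "$d"
done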

 

 

 

 

 

Link to comment
38 minutes ago, JorgeB said:

If they weren't mounting at the time, which is most likely, no data would have been copied from them.

No, they weren't.

 

But this confuses me. Maybe you could help me understand in the meantime...

 

From my understanding, the Unraid array parity can compensate for disk failures, similar to RAID parity:

If I have an array with one parity disk and one array disk fails for whatever reason, then the data on the failed disk is calculated from the parity and the remaining disks.

So as long as a second disk doesn't fail, the data should be available (emulated disk).

 

The same should apply with two parity disks and a maximum of two array disks failing (which is the case in my setup).

 

Do I fundamentally misunderstand something here?

Link to comment
54 minutes ago, fusselnerd said:

From my understanding, the Unraid array parity can compensate for disk failures, similar to RAID parity:

 

Parity helps if a disk fails, but if there is also filesystem corruption on top of that, parity cannot help with that part.
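
A toy illustration of why that is (just single-parity XOR on one byte, not Unraid's actual implementation):

# Single parity is the XOR of the data disks, computed on raw blocks with
# no knowledge of the filesystem on top.
d1=0xA5; d3=0x0F
parity=$(( d1 ^ 0x3C ^ d3 ))                      # 0x3C is disk 2's byte

# If disk 2 drops out, its byte is reconstructed from parity + survivors:
printf 'rebuilt disk2 byte = 0x%02X\n' $(( parity ^ d1 ^ d3 ))   # -> 0x3C

# The reconstruction is exact at the block level, so if that byte was part
# of an already-corrupted superblock, the emulated disk reproduces the
# corruption faithfully; parity cannot tell valid metadata from garbage.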

Link to comment

Keep in mind that memtest is only definitive if it finds errors. But for now, let's make a final try to recover the data: finish rebuilding both disks, and we can then try to use a backup superblock to see if that works, though I don't have much hope. If that doesn't work, you can then try a file recovery app like UFS Explorer on the rebuilt disks.
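
One hedged sketch of what the backup-superblock attempt could look like (not necessarily the exact commands JorgeB has in mind); it assumes the rebuilt disks show up as /dev/md1p1 and /dev/md3p1 as in the earlier log, with the array started in maintenance mode:

# Check all superblock copies; only if a good backup copy is found does it
# offer (with a y/n prompt) to write it over the bad primary.
btrfs rescue super-recover -v /dev/md1p1
btrfs rescue super-recover -v /dev/md3p1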

 

You will also need to scrub all the disks where corruption was detected, then reset the error counters and monitor whether new ones come up, but that is for later.
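
A minimal sketch of what that scrub-and-reset step might look like later, assuming the array disks stay mounted at /mnt/diskN:

# Scrub verifies every data/metadata checksum (on a single-device filesystem
# it can mostly only report, not repair, data errors).
btrfs scrub start /mnt/disk2        # repeat for disk4, disk5, disk6
btrfs scrub status /mnt/disk2       # progress and error summary

# Afterwards, zero the persistent error counters so new errors stand out:
btrfs device stats -z /mnt/disk2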

Link to comment
Posted (edited)

Ah, now I understand what you meant by "definitive": letting memtest run until it finds an error.

Sorry, English is not my first language 😅

 

Ok, I assigned both disks and started the array. Rebuild started automatically.

But both disks are labeled as "Unmountable: Unsupported or no file system".

I read in another post that in this case, the rebuild will not actually write anything on the disks.

I paused the rebuild for now.

 

Shall I proceed?

 

[Screenshot: array view with both disks marked "Unmountable: Unsupported or no file system"]

Edited by fusselnerd
Link to comment
Just now, JorgeB said:

It will still write to the disks

Ok, thx. Rebuild is resuming.

 

Just now, JorgeB said:

those disks are the original disks 1 and 3 right? Or are they new and you still have the old ones?

They are the original ones in the original order.

Link to comment

OK. If this happens in the future, never start rebuilding on top of the old disk if the emulated disk doesn't mount; if you hadn't tried that, the original disks could still be OK. Rebuilding an unmountable disk will always result in an unmountable disk. But now there's no other option, and once they are rebuilt, we can see if the backup superblock helps. If it doesn't, you can run UFS Explorer on them, which cannot be done on emulated disks.

Link to comment
Posted (edited)

Hi @JorgeB

Rebuild is complete.

Next step is

On 7/4/2024 at 12:49 PM, JorgeB said:

try to use a backup superblock

right?

Could you please guide me through the process?

 

 

On 7/4/2024 at 1:27 PM, JorgeB said:

If this happens in the future, never start rebuilding on top of the old disk if the emulated disk doesn't mount; if you hadn't tried that, the original disks could still be OK. Rebuilding an unmountable disk will always result in an unmountable disk.

Got it, and I'll keep it in mind.

I went through the Unraid docs again; it's mentioned there several times.

Lesson learned the hard way...

The only issue I have is that I couldn't see whether the drives were mountable before starting the array. And starting the array automatically triggers the rebuild (though I might be remembering that wrong).

So in the future, if such a situation ever happens again, I will test a temporarily failed disk first, e.g. by mounting it separately, before starting the array.
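
For the record, a minimal sketch of such a read-only test (sdX1 is a placeholder for the suspect disk's partition, not a device from this thread):

# Mount the suspect disk read-only; 'ro' guarantees nothing is written
# while checking whether the filesystem mounts at all.
mkdir -p /mnt/test
mount -t btrfs -o ro /dev/sdX1 /mnt/test && echo "mounts fine" && umount /mnt/test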

And of course, keep an eye open for fs errors.

 

But maybe this is a topic for another discussion.

Edited by fusselnerd
Link to comment
1 hour ago, fusselnerd said:

The only issue I have is that I couldn't see whether the drives were mountable before starting the array. And starting the array automatically triggers the rebuild

After the disks get disabled, they won't start rebuilding automatically if they were unassigned: you can start the array without the disks assigned to see if the emulated disks are working.

 

Start the array in maintenance mode and post the output of:

 

btrfs-select-super -s 1 /dev/md1p1

and

btrfs-select-super -s 1 /dev/md3p1

 

 

Link to comment

Here we go:

root@GrayBigBerta:~# btrfs-select-super -s 1 /dev/md1p1
ERROR: superblock checksum mismatch
ERROR: superblock checksum mismatch
No valid Btrfs found on /dev/md1p1
ERROR: open ctree failed
root@GrayBigBerta:~# btrfs-select-super -s 1 /dev/md3p1
ERROR: superblock checksum mismatch
ERROR: superblock checksum mismatch
No valid Btrfs found on /dev/md3p1
ERROR: open ctree failed
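
Not something suggested in the thread so far, but for completeness: btrfs also ships a read-only salvage tool (btrfs restore) that can sometimes copy files off a filesystem that will not mount; whether it gets past two bad superblock copies here is uncertain. A sketch, with /mnt/recovery_target as a placeholder for a destination with enough free space:

# List recoverable tree roots first (read-only, writes nothing to md1p1):
btrfs restore -l /dev/md1p1

# Try to copy files out without mounting: -i ignores errors and keeps going,
# -u 1 reads superblock mirror 1 instead of the damaged primary.
btrfs restore -i -u 1 /dev/md1p1 /mnt/recovery_target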
Link to comment
6 minutes ago, JorgeB said:

After the disks get disabled, they won't start rebuilding automatically if they were unassigned: you can start the array without the disks assigned to see if the emulated disks are working.

I see, thank you!

Can you recommend something to read about BTRFS and the Unraid array? I'm obviously missing the fundamentals, so I'd like to dig into it a bit.

Link to comment
