Corrupt cache drive

CorneliusCornbread · December 9, 2023

My cache recently died on me. Unraid reports it as unmountable, and trying to read the drive from another computer produces exactly the same output as Unraid. Is there anything I can do to recover my files, or are they gone forever?

[  665.270184] BTRFS info (device sdc1): using crc32c (crc32c-intel) checksum algorithm
[  665.270196] BTRFS info (device sdc1): using free space tree
[  665.271019] BTRFS error (device sdc1): devid 1 uuid 1534a103-adb6-4af8-97dd-604441e7394f is missing
[  665.271031] BTRFS error (device sdc1): failed to read the system array: -2
[  665.271450] BTRFS error (device sdc1): open_ctree failed
[  677.052903] BTRFS: device fsid a1f55575-2547-45b5-b8a3-86cc81362f17 devid 2 transid 698149 /dev/sdc1 scanned by mount (11480)
[  677.053635] BTRFS info (device sdc1): using crc32c (crc32c-intel) checksum algorithm
[  677.053644] BTRFS info (device sdc1): using free space tree
[  677.054517] BTRFS error (device sdc1): devid 1 uuid 1534a103-adb6-4af8-97dd-604441e7394f is missing
[  677.054525] BTRFS error (device sdc1): failed to read the system array: -2
[  677.054827] BTRFS error (device sdc1): open_ctree failed
[  679.731194] BTRFS: device fsid a1f55575-2547-45b5-b8a3-86cc81362f17 devid 2 transid 698149 /dev/sdc1 scanned by mount (11506)
[  679.732081] BTRFS info (device sdc1): using crc32c (crc32c-intel) checksum algorithm
[  679.732094] BTRFS info (device sdc1): using free space tree
[  679.733048] BTRFS error (device sdc1): devid 1 uuid 1534a103-adb6-4af8-97dd-604441e7394f is missing
[  679.733059] BTRFS error (device sdc1): failed to read the system array: -2
[  679.733930] BTRFS error (device sdc1): open_ctree failed

Also, for what it's worth, it seems to have been related to a cache move (moving a movie from the cache to the array), as that's about the time our drives went corrupt and our server went down.

Edited December 9, 2023 by CorneliusCornbread

CorneliusCornbread · December 9, 2023

Here's another log when trying to start the array on my server with the cache drives connected

rose-plex-log-section.txt

Edited December 9, 2023 by CorneliusCornbread

JorgeB · December 9, 2023

Please post the diagnostics, but according to that snippet a device is missing. Any idea where it is?
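For anyone reading along, here is a rough sketch of one way to pull the "is missing" lines out of a kernel log like the one above. The function name `missing_btrfs_devids` is made up for illustration, and the pattern assumes the message wording shown in the snippet, which may differ between kernel versions.

```shell
# Hypothetical helper (not part of btrfs-progs or Unraid):
# reads kernel log text on stdin and prints "<devid> <uuid>" once
# for every device that BTRFS reported as missing.
missing_btrfs_devids() {
  sed -n 's/.*BTRFS error.*devid \([0-9][0-9]*\) uuid \([0-9a-f-]*\) is missing.*/\1 \2/p' | sort -u
}
```

Typical usage would be `dmesg | missing_btrfs_devids`. The repeated mount attempts in the log collapse to a single line per device because of `sort -u`.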

CorneliusCornbread · December 9, 2023

11 hours ago, JorgeB said:

Please post the diagnostics, but according to that snippet a device is missing. Any idea where it is?

The first set of logs is from plugging just the paritied drive into my other PC with a USB-to-SATA adapter; the second set is from the Unraid server itself with both drives connected. It seems to stop complaining about the missing device on the actual Unraid box with both drives, which is why I included the second one.

Also, while my server's been down, I ran Memtest86 all of last night and through part of the afternoon. 16 hours of tests found no memory errors, so I've ruled that out.

Here are the diagnostics: rose-plex-diagnostics-20231209-1637.zip

JorgeB · December 10, 2023

If the log tree is the only problem this may help:

btrfs rescue zero-log /dev/nvme0n1p1

Then restart the array
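A cautious way to run the command above might look like this sketch. It assumes the array is stopped so the device is not mounted, and it prints the commands instead of running them unless `DRY_RUN=0` is set. `DRY_RUN`, `run` and `rescue_log_tree` are hypothetical names used only for illustration; the two `btrfs` subcommands are the real ones. Keep in mind that zeroing the log discards any writes that existed only in the log tree.

```shell
#!/bin/sh
# Sketch only: DRY_RUN, run and rescue_log_tree are illustrative names,
# not part of btrfs-progs. With DRY_RUN=1 (the default here) the commands
# are printed instead of executed.
DRY_RUN="${DRY_RUN:-1}"

run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

rescue_log_tree() {
  dev="$1"
  [ -n "$dev" ] || { echo "usage: rescue_log_tree <device>" >&2; return 2; }
  # Read-only check first, so nothing is modified before the errors are seen
  run btrfs check --readonly "$dev" || echo "check reported errors" >&2
  # Clears the log tree; data that only existed in the log is lost
  run btrfs rescue zero-log "$dev"
}
```

Calling `rescue_log_tree /dev/nvme0n1p1` in dry-run mode just lists the two commands, which makes it easy to double-check the device name before anything is written.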

CorneliusCornbread · December 10, 2023

8 hours ago, JorgeB said:
If the log tree is the only problem this may help:
btrfs rescue zero-log /dev/nvme0n1p1
Then restart the array

That seems to have worked! Thank you so much! I'm backing up my appdata directory

CorneliusCornbread · December 17, 2023

On 12/10/2023 at 6:16 AM, JorgeB said:
If the log tree is the only problem this may help:
btrfs rescue zero-log /dev/nvme0n1p1
Then restart the array

Sorry for the late response, I just finished finals. After this I'm only able to read from the cache (kind of: sometimes I can write, sometimes I can't). I've backed up everything I need, so at this point I'm trying to figure out whether I need to blow away my cache and start from scratch or whether I can get the file system sorted.

We replaced the drive that we think was causing the issue. We suspect our NVMe drive was going bad; for some reason we couldn't get SMART reports to work on it at all. After replacing the drive I let the cache array rebuild itself overnight.

Running a btrfs check yields this:

    [1/7] checking root items
    [2/7] checking extents
    data extent[5392462663680, 16384] referencer count mismatch (root 5 owner 12824979 offset 6780850176) wanted 0 have 1
    data extent[5392462663680, 16384] bytenr mimsmatch, extent item bytenr 5392462663680 file item bytenr 0
    data extent[5392462663680, 16384] referencer count mismatch (root 5583673057798520837 owner 4294936705 offset 6780850176) wanted 1 have 0
    backpointer mismatch on [5392462663680 16384]
    ERROR: errors found in extent allocation tree or chunk allocation
    [3/7] checking free space tree
    [4/7] checking fs roots
    [5/7] checking only csums items (without verifying data)
    [6/7] checking root refs
    [7/7] checking quota groups skipped (not enabled on this FS)
    Opening filesystem to check...
    warning, device 1 is missing
    Checking filesystem on /dev/sdd1
    UUID: a1f55575-2547-45b5-b8a3-86cc81362f17
    found 283721043968 bytes used, error(s) found
    total csum bytes: 163032844
    total tree bytes: 798720000
    total fs tree bytes: 477118464
    total extent tree bytes: 121241600
    btree space waste bytes: 154372644
    file data blocks allocated: 1237013446656
     referenced 270644371456

Attempting a repair via the check just makes the repair abort:

    enabling repair mode
    WARNING:

    	Do not use --repair unless you are advised to do so by a developer
    	or an experienced user, and then only after having accepted that no
    	fsck can successfully repair all types of filesystem corruption. E.g.
    	some software or hardware bugs can fatally damage a volume.
    	The operation will start in 10 seconds.
    	Use Ctrl-C to stop it.
    10 9 8 7 6 5 4 3 2 1[1/7] checking root items
    Fixed 0 roots.
    [2/7] checking extents
    data extent[5392462663680, 16384] referencer count mismatch (root 5 owner 12824979 offset 6780850176) wanted 0 have 1
    data extent[5392462663680, 16384] bytenr mimsmatch, extent item bytenr 5392462663680 file item bytenr 0
    data extent[5392462663680, 16384] referencer count mismatch (root 5583673057798520837 owner 4294936705 offset 6780850176) wanted 1 have 0
    backpointer mismatch on [5392462663680 16384]
    Unable to find block group for 0
    Unable to find block group for 0
    Unable to find block group for 0
    failed to repair damaged filesystem, aborting

    Starting repair.
    Opening filesystem to check...
    warning, device 1 is missing
    Checking filesystem on /dev/sdd1
    UUID: a1f55575-2547-45b5-b8a3-86cc81362f17

Is the file system beyond repair? If so, what's the easiest way to blow it away and start from scratch? I'm going to need to recreate my system and appdata directories for sure, as those were cache-only.

Edited December 17, 2023 by CorneliusCornbread

JorgeB · December 18, 2023

I recommend you back up and recreate the pool. To wipe the pool, you can click on "erase".
