BTRFS Cache Disk Unmountable


Solved by Rattus


Hey guys,

 

So I seem to have run into the same issue as a few other people, specifically the one in this post. I have been trying to work through the FAQ but I'm not having any luck. I'll copy the terminal commands and output below.

 

From the linked FAQ post instructions:


root@Radon:~# ls
x/

{Comment added: Array NOT Started}
root@Radon:~# mount -o usebackuproot,ro /dev/nvme0n1 /x
mount: /x: mount point does not exist.
root@Radon:~# mount -o degraded,usebackuproot,ro /dev/nvme0n1 /x

 

{Comment added: Array Started}
mount: /x: mount point does not exist.
root@Radon:~# mount -o usebackuproot,ro /dev/nvme0n1 /x
mount: /x: mount point does not exist.
root@Radon:~# mount -o degraded,usebackuproot,ro /dev/nvme0n1 /x
mount: /x: mount point does not exist.
root@Radon:~# mount -o ro,notreelog,nologreplay /dev/nvme0n1 /x
mount: /x: mount point does not exist.

 

From the BTRFS filesystem check advice on Unraid:


root@Radon:~# btrfs check --readonly /dev/nvme0n1
Opening filesystem to check...
No valid Btrfs found on /dev/nvme0n1
ERROR: cannot open file system
root@Radon:~# btrfs check --readonly --check-data-csum /dev/nvme0n1
Opening filesystem to check...
No valid Btrfs found on /dev/nvme0n1
ERROR: cannot open file system
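
In hindsight, a quick sanity check at this point would have been to confirm whether the NVMe device actually contains a BTRFS filesystem and on which partition it lives; something like the following should show that (just a suggestion, not output from my system):

blkid /dev/nvme0n1p1
btrfs filesystem show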

A different variation of the same BTRFS guide:


root@Radon:~# btrfs restore /dev/nvme0n1 /mnt/disk4/restore/
No valid Btrfs found on /dev/nvme0n1
Could not open root, trying backup super
No valid Btrfs found on /dev/nvme0n1
Could not open root, trying backup super
No valid Btrfs found on /dev/nvme0n1
Could not open root, trying backup super

 

From the Log:


Mar 30 00:42:29 Radon winbindd[23110]: initialize_winbindd_cache: clearing cache and re-creating with version number 2
Mar 30 00:42:29 Radon winbindd[23110]: [2022/03/30 00:42:29.797777, 0] ../../lib/util/become_daemon.c:135(daemon_ready)
Mar 30 00:42:29 Radon winbindd[23110]: daemon_ready: daemon 'winbindd' finished starting up and ready to serve connections
Mar 30 00:42:29 Radon emhttpd: shcmd (2047): /usr/local/sbin/mount_image '/mnt/user/system/docker/docker.img' /var/lib/docker 25
Mar 30 00:42:29 Radon kernel: BTRFS info (device loop2): using free space tree
Mar 30 00:42:29 Radon kernel: BTRFS info (device loop2): has skinny extents
Mar 30 00:42:30 Radon root: Resize '/var/lib/docker' of 'max'
Mar 30 00:42:30 Radon emhttpd: shcmd (2049): /etc/rc.d/rc.docker start
Mar 30 00:42:30 Radon root: starting dockerd ...

 

Does anyone think they might be able to point me in the right direction on this one?

 

Thanks,

 

Rattus

Edited by Rattus
added command line detail
Link to comment

Ok ... Bit of an update.

 

Found Memtest and rebooted into it, but the server went into a kind of boot loop: it brought up the blue boot menu, auto-selected Memtest, attempted to run, and rebooted, then did the same thing over and over. Assuming it was a RAM issue, I removed the RAM and replaced it with assumed-good RAM from my gaming rig. The same thing happened. The original RAM is now back in the server.

 

Note: the server would boot into normal Unraid with both sets of RAM, but neither set would run Memtest.

 

Clearly there is something wrong with the system overall, so I will look at replacing it, but in the meantime is there any way of recovering the data on the cache drive? My appdata folder was apparently on there and not on the main array, so I've got none of the setup data I would need if I were to rebuild the server with new hardware.

 

On a side note, I have ordered two 2.5" SSDs that I will use for the cache so that there is redundancy in the future, but if this is a RAM issue then that actually wouldn't make a difference, would it?

 

Thanks for the help thus far @JorgeB

Link to comment
8 minutes ago, Rattus said:

the server went into a kind of boot loop: it brought up the blue boot menu, auto-selected Memtest, attempted to run, and rebooted, then did the same thing over and over.

 

As mentioned:

1 hour ago, JorgeB said:

it only works with legacy/CSM boot, not UEFI.

That means you're booting UEFI. Either change to legacy BIOS/CSM boot or download the latest Passmark Memtest86, which is UEFI-only.

Link to comment

Thanks again for the help thus far gents.

 

OK, update: I put Memtest on its own USB and ran the test on the server. It was showing errors. Lots of errors.

Memory setup was 4x8GB DDR4 modules.

Removed one pair of DIMMs and ran the test. Test OK.

Swapped DIMM pairs: errors.

Removed one stick: no errors. Swapped in the other stick from the pair: errors. Moved the erroring DIMM to different slots to make sure it wasn't bad traces on the mobo; still errors in multiple slots.

Conclusion: one bad stick. The stick has been removed from the system and quarantined.

 

Concurrently ran Memtest on the gaming rig. The gaming rig showed no errors.

Memory setup on rig was 4x16GB DDR4.

The gaming rig RAM has now been moved to the server. Ran Memtest (for safety): no errors.

 

Now that I have known-good RAM in the server, how should I proceed with recovering the cache?

Link to comment

Now try the recovery options in the FAQ again.

 

17 hours ago, Rattus said:

mount: /x: mount point does not exist.

This means you didn't create the mountpoint first as instructed there.

 

17 hours ago, Rattus said:

/dev/nvme0n1

Also, as instructed, you must specify the partition, so it should be /dev/nvme0n1p1.
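
Putting the two corrections together, the sequence from the FAQ would look roughly like this (/x is just the example mountpoint name used there):

mkdir /x
mount -o usebackuproot,ro /dev/nvme0n1p1 /x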

Link to comment
15 minutes ago, JorgeB said:

This means you didn't create the mountpoint first as instructed there.

So I'm confused about what I'm doing wrong. The guide says to use the mkdir command, which is what I have just done (output below), and it's giving me a very similar error. The directory is there according to the ls command.

 


root@Radon:~# ls
root@Radon:~# mkdir /x
root@Radon:~# ls
root@Radon:~# sudo mkdir /x
mkdir: cannot create directory ‘/x’: File exists
root@Radon:~# ls
root@Radon:~# sudo ls
root@Radon:~# mount -o usebackuproot,ro /dev/nvme0n1p1 /x
mount: /x: mount(2) system call failed: No such file or directory.
root@Radon:~# mkdir /y
root@Radon:~# ls
root@Radon:~# mount -o usebackuproot,ro /dev/nvme0n1p1 /y
mount: /y: mount(2) system call failed: No such file or directory.
root@Radon:~# ls
root@Radon:~# cd ..
root@Radon:/# ls
bin/   dev/  home/       init@  lib64/  opt/   root/  sbin/  tmp/  var/  y/
boot/  etc/  hugetlbfs/  lib/   mnt/    proc/  run/   sys/   usr/  x/
root@Radon:/# mount -o usebackuproot,ro /dev/nvme0n1p1 /x
mount: /x: mount(2) system call failed: No such file or directory.
root@Radon:/# mount -o usebackuproot,ro /dev/nvme0n1p1 x/
mount: /x: mount(2) system call failed: No such file or directory.
root@Radon:/# mount -o usebackuproot,ro /dev/nvme0n1p1 x
mount: /x: mount(2) system call failed: No such file or directory.
root@Radon:/#
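
For reference (not something I have run yet): when the mount(2) call itself fails even though the mountpoint exists, the kernel log should record the underlying BTRFS error, so checking it would be something like:

dmesg | tail -n 20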

 

From the guide:


Create a temporary mount point, e.g.:

mkdir /x

 

Edited by Rattus
Link to comment

Quick update,

 

I moved forward with the guide to see if any other errors might shine a light on what is happening.

I didn't get any errors this time, and the following command is now running, with files being copied to the destination:

> btrfs restore -v /dev/nvme0n1p1 /mnt/disk4/restore
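
Depending on the btrfs-progs version, restore can apparently also bring back ownership/timestamps, symlinks and extended attributes with extra flags; a variant along these lines (which I haven't verified on this exact version) would be:

> btrfs restore -v -i -m -S -x /dev/nvme0n1p1 /mnt/disk4/restore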

 

There have been a few times when it has said something along the lines of:

> we seem to be looping a lot on <file name>, do you want to continue? (y/n/a)

 

It seems like it's only been doing that on rather large files (databases, logs, and now my VMs). I can live without the logs, and probably the VMs, but will the database files be OK?

 

Also, it is clear that the files are still there and they are (relatively) OK, so how do I go about bringing the cache disk back? Hopefully without nuking it? (Fingers crossed.)

Link to comment
7 minutes ago, Rattus said:

there have been a few times when it has said something along the lines of:

> we seem to be looping a lot on <file name>, do you want to continue? (y/n/a)

That's normal; you can answer a for all.

 

8 minutes ago, Rattus said:

but will the database files be ok?

Could be, though possibly there will be data corruption due to the RAM issues.

 

8 minutes ago, Rattus said:

how do I go about bringing the cache disk back?

After recovering everything you can, you should format it and then restore the data.

Link to comment
14 minutes ago, JorgeB said:

Could be, though possibly there will be data corruption due to the RAM issues.

Oh Lord, I hope not ... I guess we will see.

 

14 minutes ago, JorgeB said:

After recovering everything you can you should format it then restore the data.

So that will be the nerve-racking part. I'm going to try that tomorrow after I've had some sleep and will hopefully be able to concentrate and not f it up.

 

A few final things here; could you please clarify/answer the below?

  1. There's no way to compare what was copied vs what is on the drive to make sure it was all moved across before I nuke the cache disk?
  2. Is restoring the data as simple as copying the "restored" data to the freshly formatted cache disk? Will Unraid know what to do with it, or will I need to put it in the correct place on the array?
  3. If formatting the disk will completely erase it anyway, is there any harm in trying the "btrfs check --repair /dev/nvme0n1p1" command to see if that will fix it, and then doing the format if it doesn't work?
  4. Assuming that point 3 is not recommended/doesn't work, I assume XFS would be a better format for the cache disk, just based on what I've read through this whole experience?
  5. Last question: we are working on the assumption that the faulty RAM module caused the corruption. If that problem happens again and I have two cache drives in a parity-type arrangement (however Unraid actually manages that), having them in parity/replicated wouldn't help because the bad RAM will affect both? That particular setup will only help with a drive failure?

Well guys, if we can get answers to the above, hopefully the final post in this thread is that it's all up and running again.

 

I want to say, whatever happens now, you guys have been great. Thanks for the speedy replies, and hopefully I'll have good news for you tomorrow.

Link to comment
35 minutes ago, Rattus said:

There's no way to compare what was copied vs what is on the drive to make sure it was all moved across before I nuke the cache disk?

Not unless you can mount it.
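
If you can mount it later, a rough comparison would be a read-only mount plus a recursive diff, for example (using the paths from earlier in the thread):

mount -o ro /dev/nvme0n1p1 /x
diff -qr /x /mnt/disk4/restore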

 

36 minutes ago, Rattus said:

Is restoring the data as simple as copying the "restored" data to the freshly formatted cache disk? Will Unraid know what to do with it, or will I need to put it in the correct place on the array?

Btrfs restore should maintain the paths; just copy the data back keeping them.
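
For example, assuming the freshly formatted pool mounts back at /mnt/cache (the usual Unraid path), something like this would copy the restored tree back while keeping the structure:

rsync -av /mnt/disk4/restore/ /mnt/cache/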

 

37 minutes ago, Rattus said:

If formatting the disk will completely erase it anyway, is there any harm in trying the "btrfs check --repair /dev/nvme0n1p1" command to see if that will fix it, and then doing the format if it doesn't work?

You can try.
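
If you do, the rough order would be something like the below, run against the unmounted partition and only after the restore has finished, since --repair can sometimes make a badly damaged filesystem worse:

btrfs check --repair /dev/nvme0n1p1
btrfs check --readonly /dev/nvme0n1p1
mount -o ro /dev/nvme0n1p1 /x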

 

37 minutes ago, Rattus said:

Assuming that point 3 is not recommended/doesn't work, I assume XFS would be a better format for the cache disk?

XFS is usually more tolerant of hardware issues and easier to recover; on the other hand, hardware issues/data corruption can go unnoticed for longer.

 

38 minutes ago, Rattus said:

Last question: we are working on the assumption that the faulty RAM module caused the corruption. If that problem happens again and I have two cache drives in a parity-type arrangement (however Unraid actually manages that), having them in parity/replicated wouldn't help because the bad RAM will affect both? That particular setup will only help with a drive failure?

 

No, neither parity nor mirrors can help with data corruption caused by RAM/bad hardware. You can change your hardware to a board/CPU combo that supports ECC RAM to avoid bad RAM issues, but also keep in mind that you always need backups of anything important; there are many different ways you can lose data.

Link to comment
  • Solution

Well Gents,

It's been a rough ride, and a rough couple of days. But "btrfs check --repair" did the trick: the disk mounted, I rebooted just to make sure everything reloaded OK, and all my data is back, my Docker containers are working fine, and all is good with the world.

Thanks heaps for your help and for answering my stupid questions; hopefully your answers can help someone else in a similar boat down the line.

I think this has reminded me how past due I am for an offsite backup of these config and other important files. Time to investigate a Backblaze type of solution.

 

Thanks again to everyone who commented, much appreciated and much love 💗

 

Rattus

Link to comment
