BTRFS Corruption



Cross-posting here as recommended. I am running v6.10-rc4.

 


After almost 24 hours and 15 passes of memtest, the machine is totally stable. I don’t have any other noticeable issues aside from the btrfs corruption in the disk log. I’m running Plex and other containers in Docker, plus a Windows 11 VM with its C: drive on the cache.

 

Isn’t there any other option besides btrfs? I feel like it causes so many problems, and I’ve seen numerous posts online about stability issues, especially in a RAID configuration.

 

Can someone help? I want to make sure I’m not on a path to losing the data on the cache.

 

I’ve already rebuilt the cache by moving everything off with mover, deleting and recreating the pool, then putting it all back. The fresh pool started producing corruption errors within 24 hours.

 

An additional note: after finishing memtest I booted up normally but didn’t start my Windows 11 VM for a few hours. Now that I’ve started it and am using it, the errors are popping up in the log. Could my img file itself be corrupt, so that using mover to move it on or off the array just carries the corruption along? Is there an effective way to copy the contents of the img file into a new one? The Windows 11 VM has been operating fine.

 

thanks


@JorgeB, is there some way to tell which file has the corruption? I have 6 uncorrectable errors, but the log output doesn’t say which file is corrupted or where. I’d like to find out if I can. If it’s a file I can live without, I’ll just delete it and see if I get further corruption. I’m really starting to think mover is just moving the corrupted files off and back on whenever I rebuild the cache.

 

[ 8803.054697] BTRFS error (device nvme0n1p1): bdev /dev/nvme1n1p1 errs: wr 0, rd 0, flush 0, corrupt 12, gen 0
[ 8803.057091] BTRFS error (device nvme0n1p1): bdev /dev/nvme0n1p1 errs: wr 0, rd 0, flush 0, corrupt 14, gen 0
[ 8803.058429] BTRFS error (device nvme0n1p1): bdev /dev/nvme0n1p1 errs: wr 0, rd 0, flush 0, corrupt 15, gen 0
[ 8803.059273] BTRFS error (device nvme0n1p1): unable to fixup (regular) error at logical 127944531968 on dev /dev/nvme1n1p1
[ 8803.059390] BTRFS error (device nvme0n1p1): bdev /dev/nvme1n1p1 errs: wr 0, rd 0, flush 0, corrupt 13, gen 0
[ 8803.059559] BTRFS error (device nvme0n1p1): unable to fixup (regular) error at logical 127941509120 on dev /dev/nvme1n1p1
[ 8803.060744] BTRFS error (device nvme0n1p1): bdev /dev/nvme0n1p1 errs: wr 0, rd 0, flush 0, corrupt 16, gen 0
[ 8803.061994] BTRFS error (device nvme0n1p1): unable to fixup (regular) error at logical 127941509120 on dev /dev/nvme0n1p1
[ 8803.063656] BTRFS error (device nvme0n1p1): unable to fixup (regular) error at logical 127944531968 on dev /dev/nvme0n1p1
[ 8803.064907] BTRFS error (device nvme0n1p1): unable to fixup (regular) error at logical 127948152832 on dev /dev/nvme1n1p1
[ 8803.065907] BTRFS error (device nvme0n1p1): unable to fixup (regular) error at logical 127948152832 on dev /dev/nvme0n1p1
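
Side note: the per-device counters in those lines (wr/rd/flush/corrupt/gen) can also be read from the console. A minimal sketch, assuming the pool is mounted at the default /mnt/cache:

    # Show the persistent per-device error counters for the pool
    btrfs device stats /mnt/cache

    # Optionally reset the counters once the cause has been dealt with,
    # so any new corruption is easy to spot
    btrfs device stats -z /mnt/cache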

 

I was looking online and it says you can use btrfs inspect-internal to resolve logical block addresses to paths/file names; however, this utility does not seem to be included in Unraid.

 

The dmesg logging doesn’t seem to include the sector or path info that the examples I’ve seen online include, only the logical address and the device.
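
For reference, on systems where btrfs-progs does ship the inspect-internal subcommand, mapping one of those logical addresses back to a file would look roughly like the sketch below (the address is taken from the log above, and the pool is assumed to be mounted at /mnt/cache):

    # Resolve a logical address from the dmesg output to the file(s) that reference it
    btrfs inspect-internal logical-resolve 127944531968 /mnt/cache

    # No output would mean the extent is no longer referenced by any file
    # (for example because it has since been deleted or rewritten)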

12 hours ago, nickp85 said:

After almost 24 hours and 15 passes of memtest, the machine is totally stable.

It still looks like a memory issue. I suggest testing in two ways:

 

1. Download MemTest86 from the official site; it is a different version from the one bundled with Unraid, a USB-bootable UEFI version. Set it to test with **all CPU cores**.

 

2. Try the free version of HCI MemTest and set it to cover ~90% of free memory in a Windows environment (a VM is also fine), e.g. with 16 GB free, run it as 2.5 GB x 6 instances (see the sizing sketch below).

 

I always use the above methods to quickly identify memory issues.
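
As a rough sizing sketch for method 2 (the ~90% figure and six instances are just the rule of thumb above, nothing HCI-specific):

    # ~90% of free RAM split across six HCI MemTest instances; with 16 GB free
    # this works out to roughly the 2.5 GB x 6 suggested above
    awk 'BEGIN { printf "%.1f GB per instance\n", 16 * 0.9 / 6 }'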

 

12 hours ago, nickp85 said:

especially in a RAID configuration.

No, I have not found this issue with RAID 0 or 1.

 

1 minute ago, ChatNoir said:

Not the log for the scrub, but the Unraid log I think.

It doesn’t. Based on what I’ve read online, the scrub output is supposed to show the file path as well, but in this case it does not. Does that mean the uncorrectable error is somewhere in free space?

58 minutes ago, nickp85 said:

Does that mean the uncorrectable error is somewhere in free space?

Either they are in metadata, though usually "metadata leaf" or similar is mentioned in those cases, or they are related to now-empty extents. In any case, I recommend backing up and re-formatting the pool.

 

This is an example of how it usually looks in the log when a scrub finds data corruption:

 

Tower1 kernel: BTRFS warning (device md1): checksum error at logical 1006087647232 on dev /dev/md1, physical 1005022294016, root 5, inode 512268, offset 16712871936, length 4096, links 1 (path: plots/plot-k32-2022-01-03-12-03-d59a0e1f87141f9355fd42074dd671c706152c741767e01bb52946991f4a9e59.plot) 
Tower1 kernel: BTRFS error (device md1): bdev /dev/md1 errs: wr 0, rd 0, flush 0, corrupt 1, gen 0
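
For completeness, a minimal sketch of producing that kind of output, assuming the pool is mounted at /mnt/cache: run a scrub, then check the kernel log for the checksum lines.

    # Run a scrub in the foreground (-B waits and prints a summary at the end)
    btrfs scrub start -B /mnt/cache

    # Or start it in the background and poll the progress
    btrfs scrub start /mnt/cache
    btrfs scrub status /mnt/cache

    # Checksum errors on data that is still referenced by a file show up
    # in the kernel log with the path, as in the example above
    dmesg | grep -i "checksum error"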

 

17 hours ago, Vr2Io said:

1. Download MemTest86 from the official site; it is a different version from the one bundled with Unraid, a USB-bootable UEFI version. Set it to test with **all CPU cores**.

 

2. Try the free version of HCI MemTest and set it to cover ~90% of free memory in a Windows environment (a VM is also fine), e.g. with 16 GB free, run it as 2.5 GB x 6 instances.


The 24-hour memtest I ran was the UEFI one from the MemTest86 site, using a different USB stick. The Unraid one doesn’t boot UEFI.

 

I assigned 28 GB to my Windows 11 VM and ran 12 copies of the HCI tool at 2000 MB each (the last one at 1000 MB), and all ran to completion without error. Unraid was showing 98% system memory used and Windows showed 97%.

 

I also used qemu-img convert to duplicate my VM’s raw image file and deleted the old one, thinking some of the corruption might be related to the img file. The system still shows 6 uncorrectable errors without a file path when doing a scrub, but there have been no new errors for over 24 hours.
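
In case it’s useful to anyone, a rough sketch of that kind of qemu-img convert call (the exact paths and the -new name are just examples, and the VM has to be shut down first):

    # Copy the raw image into a fresh file, showing progress
    qemu-img convert -p -f raw -O raw \
        "/mnt/user/domains/Windows 11/windows11.img" \
        "/mnt/user/domains/Windows 11/windows11-new.img"

    # Sanity-check the new image before pointing the VM's XML at it
    qemu-img info "/mnt/user/domains/Windows 11/windows11-new.img"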

 

The system is definitely stable… I guess the next step is to recreate the cache pool, for the second time in a week.

 

As I’ve mentioned, I would not even have known there was an error if I hadn’t looked. The machine is on 24/7 and I use it for multiple tasks in Unraid plus a gaming VM. No trouble other than the btrfs log lines.

48 minutes ago, JorgeB said:

Unlikely

Let me expand a little on this. Nothing I’ve seen so far suggests an Unraid/btrfs problem. Btrfs corruption among Unraid users is not that uncommon, though most of the time it’s caused by hardware issues. I subscribe to the btrfs mailing list and AFAIK there are no unexplained corruption issues with recent kernels; also, no one else using Unraid with rc4 has complained so far, and many users have btrfs pool(s).

 

The detected data corruption still makes me think this was hardware related, but it’s not certain. Re-format the pool and see if it happens again quickly; if it does, go back to the last Unraid release that was stable for you and confirm it doesn’t still happen.

12 hours ago, JorgeB said:

As mentioned, first try again with rc4 after re-formatting the pool.

I tried to move everything off, but my docker.img file would not move. I am using the XFS format for the docker image. Every time mover tried to move it, there were corruption errors in the log. I ended up deleting my docker.img file entirely, and then I even formatted each disk as XFS in Unassigned Devices before deleting all the partitions again and setting the pool back up. I ran a scrub while the pool was empty and no errors were found. Now I’m moving everything back.

 

Interestingly enough, even though docker.img would not move (I even tried stopping and starting the array to release anything that might have had it locked), Docker would still start up fine if I turned it on.

 

I know the XFS image format for Docker is newer to Unraid, so could there be some weirdness with an XFS docker image on a btrfs cache pool?

 

I pulled a fresh diagnostic after rebuilding the pool, transferring everything back, and rebuilding my Docker containers.

 

Update: I ran a scrub after all my stuff was back the way it should be, and it came back clean. At least I have a good starting point now.

 

nicknas2-diagnostics-20220331-2348.zip

6 months later...

@JorgeB - After a long while, I think I discovered what triggers this. If you use a raw image file for your Windows VM with cache='none', these errors start up pretty quickly from normal use of my gaming VM. Setting the caching to writeback seems to prevent further errors.

 

Here is the current config for the Windows OS disk, which seems to prevent the errors. I was reading online about how not using writeback for guests like Windows can be weird with btrfs because of how Windows writes to the disk. The suggested solution was either to set the btrfs NOCOW attribute on the folder holding the disk image (see the sketch after the XML below), or simply to use the writeback/writethrough cache mode.

    <disk type='file' device='disk'>
      <driver name='qemu' type='raw' cache='writeback' io='threads' discard='unmap'/>
      <source file='/mnt/user/domains/Windows 11/windows11.img' index='2'/>
      <backingStore/>
      <target dev='hdc' bus='scsi'/>
      <boot order='1'/>
      <alias name='scsi0-0-0-1'/>
      <address type='drive' controller='0' bus='0' target='0' unit='1'/>
    </disk>
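
For the NOCOW alternative mentioned above, a rough sketch is below. Note that +C only applies to files created after the attribute is set on the directory, so the image would need to be recreated or copied in afterwards, and btrfs no longer checksums NOCOW data. The path is just an example:

    # Example directory for VM images with copy-on-write disabled
    mkdir -p /mnt/cache/domains/nocow
    chattr +C /mnt/cache/domains/nocow

    # Verify the attribute took effect ('C' should show up in the flags)
    lsattr -d /mnt/cache/domains/nocow

    # Files created in (or copied into) this directory from now on inherit NOCOW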

 

