Lots of BTRFS errors


Solved by pervel


My cache pool, consisting of 2 Samsung SSD 970 EVO NVMe drives, seems to be having issues. First, it was marked read-only. Then I decided to move all data to the array, reformat the cache pool and move the files back. But even after that I still get lots of errors (though so far the pool has not been marked read-only again):

 

Nov 22 13:35:59 Basse kernel: BTRFS error (device dm-1): bdev /dev/mapper/nvme0n1p1 errs: wr 0, rd 0, flush 0, corrupt 67, gen 0
Nov 22 13:35:59 Basse kernel: BTRFS warning (device dm-1): csum failed root 5 ino 20192 off 283238006784 csum 0x270bc44d expected csum 0x0c0e90f7 mirror 1
Nov 22 13:35:59 Basse kernel: BTRFS error (device dm-1): bdev /dev/mapper/nvme1n1p1 errs: wr 0, rd 0, flush 0, corrupt 42, gen 0
Nov 22 13:35:59 Basse kernel: BTRFS warning (device dm-1): csum failed root 5 ino 20179 off 133788278784 csum 0xaa8b2d45 expected csum 0x3b4fa3e7 mirror 1
Nov 22 13:35:59 Basse kernel: BTRFS warning (device dm-1): csum failed root 5 ino 20179 off 133788286976 csum 0x30968432 expected csum 0x0d3e5c34 mirror 1
Nov 22 13:35:59 Basse kernel: BTRFS error (device dm-1): bdev /dev/mapper/nvme1n1p1 errs: wr 0, rd 0, flush 0, corrupt 43, gen 0
Nov 22 13:35:59 Basse kernel: BTRFS error (device dm-1): bdev /dev/mapper/nvme1n1p1 errs: wr 0, rd 0, flush 0, corrupt 44, gen 0
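
For anyone reading similar logs: the "errs:" lines carry cumulative per-device counters for write errors (wr), read errors (rd), flush errors (flush), checksum mismatches (corrupt) and generation mismatches (gen). In the lines above only "corrupt" is non-zero, so the drives are returning data without I/O errors but the data does not match its checksum. A minimal Python sketch for pulling the latest counters out of a syslog excerpt follows; the function name and the sample list are illustrative only, not part of any Unraid tool.

```python
import re

# Matches the per-device counter line the kernel logs for Btrfs, e.g.
# "... bdev /dev/mapper/nvme0n1p1 errs: wr 0, rd 0, flush 0, corrupt 67, gen 0"
ERRS_RE = re.compile(
    r"bdev (?P<dev>\S+) errs: wr (?P<wr>\d+), rd (?P<rd>\d+), "
    r"flush (?P<flush>\d+), corrupt (?P<corrupt>\d+), gen (?P<gen>\d+)"
)


def latest_counters(lines):
    """Return {device: {counter: value}}, keeping the last line seen per device."""
    latest = {}
    for line in lines:
        match = ERRS_RE.search(line)
        if match:
            fields = match.groupdict()
            device = fields.pop("dev")
            latest[device] = {name: int(value) for name, value in fields.items()}
    return latest


# Sample lines copied from the log excerpt in this post.
sample = [
    "Nov 22 13:35:59 Basse kernel: BTRFS error (device dm-1): bdev /dev/mapper/nvme0n1p1 errs: wr 0, rd 0, flush 0, corrupt 67, gen 0",
    "Nov 22 13:35:59 Basse kernel: BTRFS error (device dm-1): bdev /dev/mapper/nvme1n1p1 errs: wr 0, rd 0, flush 0, corrupt 43, gen 0",
    "Nov 22 13:35:59 Basse kernel: BTRFS error (device dm-1): bdev /dev/mapper/nvme1n1p1 errs: wr 0, rd 0, flush 0, corrupt 44, gen 0",
]

for device, counters in latest_counters(sample).items():
    print(device, counters)
```

Because the counters are cumulative, comparing the "corrupt" value between two points in time shows whether new corruption is still being detected.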

 

What could be the cause of this? Are the SSDs failing?

basse-diagnostics-20221122-1605.zip

Link to comment
20 minutes ago, JorgeB said:

Btrfs is detecting data corruption. This is usually RAM related, and Ryzen with overclocked RAM like you have is known to corrupt data in some cases. Start by correcting that, then run a scrub.

 

Thanks for the fast response. I adjusted the RAM down to 2133 MHz (this was the "Auto" setting in the BIOS) and ran a scrub. Sadly, I'm seeing the same errors both when running the scrub and when running a Windows VM from the cache pool.

 

basse-diagnostics-20221122-1811.zip

Link to comment

I don't understand what's going on. I have now run memtest on each of the 4 RAM sticks. 3 of them showed errors. Only one did not. What are the chances of that? I wonder if this means there is something else wrong with the computer.

 

Then I started Unraid with the single good RAM stick and deleted the files that scrub had complained about in the syslog. I then ran a new scrub and it finished without reporting errors.

 

However, when I start both of my Windows VMs, I still get loads of BTRFS errors in syslog. But scrub says nothing. How can this be?

 

basse-diagnostics-20221122-2152.zip

Link to comment

When I ran the scrub I got BTRFS errors that referenced paths to files on the cache, such as:

Nov 22 21:33:06 Basse kernel: BTRFS error (device dm-1): bdev /dev/mapper/nvme0n1p1 errs: wr 0, rd 0, flush 0, corrupt 763, gen 0
Nov 22 21:33:06 Basse kernel: BTRFS warning (device dm-1): checksum error at logical 264171896832 on dev /dev/mapper/nvme0n1p1, physical 135347519488, root 5, inode 20179, offset 159330914304, length 4096, links 1 (path: domains/GamePC/vdisk2.qcow2)

After deleting those files I no longer get scrub errors.
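
As an aside, the list of files to delete or restore can be collected from the scrub messages automatically. The following is a rough Python sketch that extracts the "(path: ...)" part of each "checksum error at logical" line; the regular expression is based only on the line format shown above, and the function name is made up for illustration.

```python
import re

# Based on the scrub message format shown in this post:
# "... checksum error at logical N on dev DEV, ..., inode N, offset N, ... (path: PATH)"
SCRUB_RE = re.compile(
    r"checksum error at logical \d+ on dev (?P<dev>\S+), "
    r".*?inode (?P<inode>\d+), "
    r".*?\(path: (?P<path>.+)\)\s*$"
)


def damaged_paths(lines):
    """Return a sorted list of unique file paths named in scrub checksum errors."""
    paths = set()
    for line in lines:
        match = SCRUB_RE.search(line)
        if match:
            paths.add(match.group("path"))
    return sorted(paths)


# Sample line copied from the log excerpt in this post.
sample = [
    "Nov 22 21:33:06 Basse kernel: BTRFS warning (device dm-1): checksum error at logical 264171896832 "
    "on dev /dev/mapper/nvme0n1p1, physical 135347519488, root 5, inode 20179, offset 159330914304, "
    "length 4096, links 1 (path: domains/GamePC/vdisk2.qcow2)",
]

print(damaged_paths(sample))
```

The paths in these messages are relative to the filesystem root, so on the pool they would sit under its mount point.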

 

But I still get errors when trying to run a VM from the cache. These are more generic errors, such as:

Nov 22 21:42:06 Basse kernel: BTRFS error (device dm-1): bdev /dev/mapper/nvme0n1p1 errs: wr 0, rd 0, flush 0, corrupt 780, gen 0
Nov 22 21:42:08 Basse kernel: BTRFS warning (device dm-1): csum failed root 5 ino 20188 off 10248269824 csum 0xbf691df4 expected csum 0xd7cd4596 mirror 1

 

Can you explain what is going on? Are some files still damaged? Or is the physical SSD damaged? What is the best way forward for me?

 

I should note that it's only the two big Windows VMs that give these errors. I have a number of Linux VMs that don't seem to give errors. But they are also noticeably smaller.

Link to comment
12 hours ago, pervel said:

I don't understand what's going on. I have now run memtest on each of the 4 RAM sticks. 3 of them showed errors. Only one did not. What are the chances of that? I wonder if this means there is something else wrong with the computer.

That is strange. There could be a board/CPU issue, though if the other DIMM doesn't give any errors after multiple runs it could just be a bad batch.

 

6 hours ago, pervel said:

But I still get errors when trying to run a VM from the cache. These are more generic errors, such as:

And if you run another scrub there are no errors? If yes, it suggests there's still a RAM problem, or another hardware issue, causing data corruption.

Link to comment

I've run 4 passes of memtest on the remaining good RAM stick. It gives no errors. I've used these 4 sticks for 2 years now without issues. So it is really strange.

 

There are no errors reported when running scrub. But when I run the VMs, I get BTRFS errors. The inode in the error message does indeed belong to the virtual disk file for the corresponding VMs. These files were copied over from the SSD at the time when I was still using the bad RAM sticks. Should that matter?
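
For reference, one way to check which file an "ino" value in a "csum failed" line belongs to is to search the mounted pool for that inode number. Below is a minimal Python sketch. It assumes the files live in the top-level subvolume (the "root 5" in the messages), and the mount point /mnt/cache in the usage comment is an assumption about this setup. The function name is illustrative.

```python
import os


def find_by_inode(root, inode):
    """Return every path below `root` whose inode number equals `inode`."""
    matches = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            try:
                # lstat so that symlinks are checked themselves, not their targets
                if os.lstat(path).st_ino == inode:
                    matches.append(path)
            except OSError:
                # Entry vanished or is unreadable; skip it
                continue
    return matches


# Hypothetical usage, with the inode taken from a "csum failed ... ino 20188" line:
# print(find_by_inode("/mnt/cache", 20188))
```

Walking a large pool this way can be slow, and hard links mean one inode may map to more than one path, which is why the function returns a list.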

 

I plan to buy new RAM sticks today. I really hope there are no other hardware issues present. But it is worrying that 3 sticks have failed.

Link to comment
1 hour ago, JorgeB said:

It's strange. Try moving the vdisk to the array (the VM must be off) and see if there are any errors. If there aren't, move it back.

I did the following:

  1. Copied the vdisk from cache to array
  2. Changed the VM to point to the new vdisk location on the array
  3. Ran the VM. Did not get any errors in the log
  4. Copied the vdisk back from array to cache
  5. Changed the VM back to point to the vdisk location on the cache
  6. Ran the VM. Got the same BTRFS errors as before, all pointing to the inode for the vdisk

What can I learn from this? The array is formatted XFS and not BTRFS. Should XFS give the same kind of errors if the file is damaged?
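
A note on that question: XFS checksums its own metadata but not file contents, so a run from the array without log errors does not by itself show that the vdisk is intact. One filesystem-independent check is to hash the file on both sides of a copy and compare. The Python sketch below shows the idea; the function names are made up, and the paths in the usage comment (including /mnt/disk1) are assumptions for illustration.

```python
import hashlib


def sha256_of(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so that large vdisks do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copies_match(path_a, path_b):
    """Return True when both files have identical contents."""
    return sha256_of(path_a) == sha256_of(path_b)


# Hypothetical usage with the VM shut down (paths are assumptions):
# copies_match("/mnt/cache/domains/WinVPN/vdisk1.qcow2",
#              "/mnt/disk1/domains/WinVPN/vdisk1.qcow2")
```

This only shows whether the two copies are identical. It cannot tell whether the source was already damaged before the copy, and with faulty RAM in the machine the result itself may not be trustworthy.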

 

Btw, the VM appears to run just fine despite these errors. But maybe that's just luck.

 

Link to comment

I think I have found the problem! It's completely unrelated to any RAM issues. It has to do with how I have defined my vdisk. For a long time I have been using the following definition:

 

<disk type='file' device='disk'>
  <driver name='qemu' type='qcow2' cache='none'/>
  <source file='/mnt/user/domains/WinVPN/vdisk1.qcow2'/>
  <target dev='hdc' bus='virtio'/>
  <boot order='1'/>
  <address type='pci' domain='0x0000' bus='0x00' slot='0x05' function='0x0'/>
</disk>

 

Notice the line with cache='none'. I have changed that from the default cache='writeback'. I honestly cannot remember why I did that. I think I read somewhere that this had better performance. However, this appears to be the reason for all the BTRFS errors. If I change back to cache='writeback', I no longer get any BTRFS errors.
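
For context, cache='none' makes QEMU open the image with direct I/O, bypassing the host page cache, while cache='writeback' goes through the host page cache. Switching modes only means changing that one attribute on the driver element. The Python sketch below does it on a copy of the disk definition above (address and boot lines left out for brevity). The helper name is made up, and in practice the change would be made in the VM's XML view or with virsh edit.

```python
import xml.etree.ElementTree as ET

# Disk definition from this post, shortened for the example.
DISK_XML = """<disk type='file' device='disk'>
  <driver name='qemu' type='qcow2' cache='none'/>
  <source file='/mnt/user/domains/WinVPN/vdisk1.qcow2'/>
  <target dev='hdc' bus='virtio'/>
</disk>"""


def set_cache_mode(disk_xml, mode):
    """Return the disk XML with the driver's cache attribute set to `mode`."""
    disk = ET.fromstring(disk_xml)
    driver = disk.find("driver")
    if driver is None:
        raise ValueError("disk element has no <driver> child")
    driver.set("cache", mode)
    return ET.tostring(disk, encoding="unicode")


updated = set_cache_mode(DISK_XML, "writeback")
print(updated)
```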

 

I'm happy to change back to the default but I do wonder if this is perhaps a bug?

Link to comment
2 hours ago, JorgeB said:

Not sure, I don't remember previous complaints, but let me see if I can duplicate that.

 

Sadly, I did get a couple of BTRFS errors today even with these changes. But far fewer than before, when the log was filling up with them. So I don't know what to conclude right now. I'm getting new RAM sticks today. I will try those and report back.

Link to comment

I've installed the new RAM. So far that really seems to have helped. I'm getting none of the BTRFS errors from before. 😀

 

So I guess the conclusion so far is that it really was bad RAM. Of course that worries me since it's hard to believe all 4 RAM sticks failed without some external reason.

 

I have not yet tried to change my vdisks back to cache='none'. I would like to see the system run stable for a few days straight first.

 

Thanks for the assistance, @JorgeB!

Link to comment
  • Solution

I have now also tested with vdisk using cache='none' again. This time without any errors. Yay!

 

I think I can finally conclude that my problems were all hardware related and that new RAM modules have fixed the problem - at least for now. It is still quite strange to me that all of my four RAM modules apparently failed. That suggests an external cause. But I have no clue what that cause might be. I guess I just have to wait and see if it happens again.

Link to comment
