pervel Posted November 22, 2022

My cache pool, consisting of 2 Samsung SSD 970 EVO NVMe drives, seems to be having issues. First, it was marked read-only. Then I decided to move all data to the array, reformat the cache pool, and move the files back. But even after that I still get lots of errors (though so far it has not been marked read-only again):

Nov 22 13:35:59 Basse kernel: BTRFS error (device dm-1): bdev /dev/mapper/nvme0n1p1 errs: wr 0, rd 0, flush 0, corrupt 67, gen 0
Nov 22 13:35:59 Basse kernel: BTRFS warning (device dm-1): csum failed root 5 ino 20192 off 283238006784 csum 0x270bc44d expected csum 0x0c0e90f7 mirror 1
Nov 22 13:35:59 Basse kernel: BTRFS error (device dm-1): bdev /dev/mapper/nvme1n1p1 errs: wr 0, rd 0, flush 0, corrupt 42, gen 0
Nov 22 13:35:59 Basse kernel: BTRFS warning (device dm-1): csum failed root 5 ino 20179 off 133788278784 csum 0xaa8b2d45 expected csum 0x3b4fa3e7 mirror 1
Nov 22 13:35:59 Basse kernel: BTRFS warning (device dm-1): csum failed root 5 ino 20179 off 133788286976 csum 0x30968432 expected csum 0x0d3e5c34 mirror 1
Nov 22 13:35:59 Basse kernel: BTRFS error (device dm-1): bdev /dev/mapper/nvme1n1p1 errs: wr 0, rd 0, flush 0, corrupt 43, gen 0
Nov 22 13:35:59 Basse kernel: BTRFS error (device dm-1): bdev /dev/mapper/nvme1n1p1 errs: wr 0, rd 0, flush 0, corrupt 44, gen 0

What could be the cause of this? Are the SSDs failing?

basse-diagnostics-20221122-1605.zip
JorgeB Posted November 22, 2022

Btrfs is detecting data corruption. This is usually RAM related, and Ryzen with overclocked RAM like yours is known to corrupt data in some cases. Start by correcting that, then run a scrub.
pervel Posted November 22, 2022 Author

20 minutes ago, JorgeB said:
"Btrfs is detecting data corruption. This is usually RAM related, and Ryzen with overclocked RAM like yours is known to corrupt data in some cases. Start by correcting that, then run a scrub."

Thanks for the fast response. I adjusted the RAM down to 2133 MHz (this was the "Auto" setting in the BIOS) and ran a scrub. Sadly, I'm seeing the same errors, both when running the scrub and when running a Windows VM from the cache pool.

basse-diagnostics-20221122-1811.zip
JorgeB Posted November 22, 2022

Assuming the RAM issue is corrected, you can also run memtest to check for a bad DIMM, then delete (or restore from backups) the corrupt files listed in the syslog after the scrub.
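For anyone following along, the scrub-and-clean-up workflow JorgeB describes can be sketched roughly like this. The `/mnt/cache` mount point is an assumption for a typical Unraid cache pool; the privileged btrfs commands are shown as comments, and the runnable part only demonstrates pulling a file path out of a checksum warning of the kind seen later in this thread:

```shell
# On the Unraid host (run as root; /mnt/cache is an assumed mount point):
#   btrfs scrub start -B /mnt/cache   # -B runs the scrub in the foreground
#   btrfs scrub status /mnt/cache
#
# The scrub logs checksum errors to the syslog with a "(path: ...)" suffix.
# Extract the affected file path from such a line with sed:
sample='Nov 22 21:33:06 Basse kernel: BTRFS warning (device dm-1): checksum error at logical 264171896832 on dev /dev/mapper/nvme0n1p1, physical 135347519488, root 5, inode 20179, offset 159330914304, length 4096, links 1 (path: domains/GamePC/vdisk2.qcow2)'
printf '%s\n' "$sample" | sed -n 's/.*(path: \(.*\))$/\1/p'
# → domains/GamePC/vdisk2.qcow2
```

Each path printed this way is a file whose contents no longer match their checksums; those are the ones to delete or restore from backup.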
pervel Posted November 22, 2022 Author

I'm running MemTest86 right now and, sadly, it is indeed reporting errors. I have no experience with this program. Does it sound like this is the cause of my problems? I don't love having to replace all that RAM. ☹️
JorgeB Posted November 22, 2022

If memtest finds errors, there's a problem; try testing the DIMMs one by one.
pervel Posted November 22, 2022 Author

Yes, I will do that. Thanks for your help.
pervel Posted November 22, 2022 Author

I don't understand what's going on. I have now run memtest on each of the 4 RAM sticks. 3 of them showed errors; only one did not. What are the chances of that? I wonder if this means something else is wrong with the computer.

Then I started Unraid with the single good RAM stick and deleted the files that scrub had complained about in the syslog. A new scrub then finished without reporting errors. However, when I start both of my Windows VMs, I still get loads of BTRFS errors in the syslog, yet scrub reports nothing. How can this be?

basse-diagnostics-20221122-2152.zip
pervel Posted November 23, 2022 Author

When I ran the scrub I got BTRFS errors that referenced paths to files on the cache, such as:

Nov 22 21:33:06 Basse kernel: BTRFS error (device dm-1): bdev /dev/mapper/nvme0n1p1 errs: wr 0, rd 0, flush 0, corrupt 763, gen 0
Nov 22 21:33:06 Basse kernel: BTRFS warning (device dm-1): checksum error at logical 264171896832 on dev /dev/mapper/nvme0n1p1, physical 135347519488, root 5, inode 20179, offset 159330914304, length 4096, links 1 (path: domains/GamePC/vdisk2.qcow2)

After deleting those files I no longer get scrub errors. But when I try to run a VM from the cache, I still get a more generic error such as:

Nov 22 21:42:06 Basse kernel: BTRFS error (device dm-1): bdev /dev/mapper/nvme0n1p1 errs: wr 0, rd 0, flush 0, corrupt 780, gen 0
Nov 22 21:42:08 Basse kernel: BTRFS warning (device dm-1): csum failed root 5 ino 20188 off 10248269824 csum 0xbf691df4 expected csum 0xd7cd4596 mirror 1

Can you explain what is going on? Are some files still damaged? Or is the physical SSD damaged? What is the best way forward for me?

I should note that it's only the two big Windows VMs that give these errors. I have a number of Linux VMs that don't seem to give errors, but they are also noticeably smaller.
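Side note on those "corrupt N" numbers: they are cumulative per-device error counters, not a count of new errors each time. A rough sketch, assuming `/mnt/cache` is the pool mount point (the privileged commands are comments; the runnable part just pulls the count out of the sample syslog line above):

```shell
# On the Unraid host the counters can be read and reset with:
#   btrfs device stats /mnt/cache
#   btrfs device stats -z /mnt/cache   # -z resets the counters after printing
#
# Extract the cumulative corrupt count from a syslog line like the ones above:
line='Nov 22 21:33:06 Basse kernel: BTRFS error (device dm-1): bdev /dev/mapper/nvme0n1p1 errs: wr 0, rd 0, flush 0, corrupt 763, gen 0'
printf '%s\n' "$line" | grep -o 'corrupt [0-9]*' | awk '{print $2}'
# → 763
```

So a jump from "corrupt 763" to "corrupt 780" means 17 new corruption events since the counter was last reset, not 780 fresh errors.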
JorgeB Posted November 23, 2022

12 hours ago, pervel said:
"I don't understand what's going on. I have now run memtest on each of the 4 RAM sticks. 3 of them showed errors; only one did not. What are the chances of that? I wonder if this means something else is wrong with the computer."

That is strange; there could be a board/CPU issue, though if the other DIMM doesn't give any errors after multiple runs, it could just be a bad batch.

6 hours ago, pervel said:
"But when I try to run a VM from the cache, I still get a more generic error such as:"

And if you run another scrub there are no errors? If yes, it suggests there's still a RAM problem, or some other hardware issue still causing data corruption.
pervel Posted November 23, 2022 Author

I've run 4 passes of memtest on the remaining good RAM stick. It gives no errors. I've used these 4 sticks for 2 years now without issues, so it is really strange.

There are no errors reported when running scrub. But when I run the VMs, I get BTRFS errors. The inode in the error message does indeed belong to the virtual disk file for the corresponding VM. These files were copied over from the SSD at the time when I was still using the bad RAM sticks. Should that matter?

I plan to buy new RAM sticks today. I really hope there are no other hardware issues present, but it is worrying that 3 sticks have failed.
JorgeB Posted November 23, 2022

It's strange. Try moving the vdisk to the array (the VM must be off) and see if there are any errors; if there aren't, move it back.
pervel Posted November 23, 2022 Author

1 hour ago, JorgeB said:
"It's strange. Try moving the vdisk to the array (the VM must be off) and see if there are any errors; if there aren't, move it back."

I did the following:

1. Copied the vdisk from cache to array
2. Changed the VM to point to the new vdisk location on the array
3. Ran the VM. Did not get any errors in the log
4. Copied the vdisk back from array to cache
5. Changed the VM back to point to the vdisk location on the cache
6. Ran the VM. Got the same BTRFS errors as before, all pointing to the inode for the vdisk

What can I learn from this? The array is formatted XFS, not BTRFS. Would XFS give the same kind of errors if the file were damaged?

Btw, the VM appears to run just fine despite these errors. But maybe that's just luck.
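One way to rule out a bad copy during a round-trip like this is to hash the vdisk at each hop; if the hashes match, the bits on disk are identical and any remaining errors come from elsewhere. A minimal sketch; the real paths are assumptions, and the runnable part demonstrates the idea on a scratch file:

```shell
# On the real system this would be something like (VM shut down first):
#   sha256sum /mnt/cache/domains/GamePC/vdisk2.qcow2
#   cp /mnt/cache/domains/GamePC/vdisk2.qcow2 /mnt/user/domains/GamePC/
#   sha256sum /mnt/user/domains/GamePC/vdisk2.qcow2
#
# Demonstrated here on a scratch file:
src=$(mktemp) && dst=$(mktemp)
printf 'fake vdisk contents' > "$src"
cp "$src" "$dst"
a=$(sha256sum "$src" | cut -d' ' -f1)   # hash before the copy
b=$(sha256sum "$dst" | cut -d' ' -f1)   # hash after the copy
[ "$a" = "$b" ] && echo "copy verified"
rm -f "$src" "$dst"
```

Note that XFS has no data checksumming, so it would not log csum errors for a damaged file the way btrfs does; a clean run on the array therefore doesn't prove the file is intact.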
JorgeB Posted November 23, 2022

Copy the vdisk from the array back to cache, replacing the old one.
pervel Posted November 23, 2022 Author

2 minutes ago, JorgeB said:
"Copy the vdisk from the array back to cache, replacing the old one."

But that's what I did. And still the same errors.
pervel Posted November 23, 2022 Author

I think I have found the problem! It's completely unrelated to any RAM issues; it has to do with how I have defined my vdisk. For a long time I have been using the following definition:

<disk type='file' device='disk'>
  <driver name='qemu' type='qcow2' cache='none'/>
  <source file='/mnt/user/domains/WinVPN/vdisk1.qcow2'/>
  <target dev='hdc' bus='virtio'/>
  <boot order='1'/>
  <address type='pci' domain='0x0000' bus='0x00' slot='0x05' function='0x0'/>
</disk>

Notice the line with cache='none'. I changed that from the default cache='writeback'. I honestly cannot remember why; I think I read somewhere that it had better performance. However, this appears to be the reason for all the BTRFS errors. If I change back to cache='writeback', I no longer get any BTRFS errors. I'm happy to change back to the default, but I do wonder if this is perhaps a bug?
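For reference, a quick way to audit which cache mode each VM's disks currently use, assuming the standard virsh CLI is available on the Unraid host (the runnable part demonstrates the grep on the driver line from the definition above):

```shell
# On the host, for a VM named e.g. WinVPN (name is an assumption):
#   virsh dumpxml WinVPN | grep -o "cache='[^']*'"
#
# Demonstrated on the <driver> line from the vdisk definition above:
xml="<driver name='qemu' type='qcow2' cache='none'/>"
printf '%s\n' "$xml" | grep -o "cache='[^']*'"
# → cache='none'
```

With cache='none', QEMU opens the image with O_DIRECT, bypassing the host page cache; with the writeback default, I/O goes through the page cache instead. That difference in I/O path is plausibly why the two settings behaved differently here.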
JorgeB Posted November 24, 2022

14 hours ago, pervel said:
"I'm happy to change back to the default, but I do wonder if this is perhaps a bug?"

Not sure; I don't remember previous complaints, but let me see if I can duplicate that.
pervel Posted November 24, 2022 Author

2 hours ago, JorgeB said:
"Not sure; I don't remember previous complaints, but let me see if I can duplicate that."

Sadly, I did get a couple of BTRFS errors today even with these changes. But far fewer than before, when the log was filling up with them. So I don't know what to conclude right now. I'm getting new RAM sticks today; I will try those and report back.
pervel Posted November 24, 2022 Author

I've installed the new RAM. So far that really seems to have helped: I'm getting none of the BTRFS errors from before. 😀 So the conclusion so far is that it really was bad RAM. Of course that worries me, since it's hard to believe all 4 RAM sticks failed without some external reason.

I have not yet tried changing my vdisks back to cache='none'; I would like to see the system run stable for a few days straight first.

Thanks for the assistance, @JorgeB!
pervel Posted November 25, 2022 Author (Solution)

I have now also tested with the vdisk using cache='none' again, this time without any errors. Yay!

I think I can finally conclude that my problems were all hardware related and that new RAM modules have fixed the problem - at least for now. It is still quite strange to me that all four of my RAM modules apparently failed; that suggests an external cause, but I have no clue what that cause might be. I guess I just have to wait and see if it happens again.
Josef Posted January 2, 2023

I am actually experiencing a very similar issue: an Unraid server on an AMD platform with 32GB RAM. I was using the XMP profile per the RAM specifications and all was fine for about 2 years, but recently I started getting btrfs issues, ran MemTest86, and got lots of errors. I swapped the RAM with my Windows PC (also AMD, but different board/CPU) and it confirmed the above: the RAM from Unraid failed in the other PC as well, but the RAM from that PC was fine in Unraid.

I am currently in the RMA process (lifetime warranty), and Unraid is running on 16GB of 3200 MHz RAM from reserve. Also memtested, and all good; running on the XMP profile, as memtest did not report any errors at this speed.

I wonder why it happened after about 2 years without any issues. I was even running VMs, so the RAM was about 80% used at times. Until now, any RAM I've had either did not work from day 1 or worked reliably the whole time.
Fastcompjason Posted July 3

I just found this post and had replied to a similar post (linked). I have been testing my RAM, and have tested before without any errors being reported at all. I am also running an AMD Ryzen 9 3900X system on an MSI X570 Prestige Creation mobo with 32GB (4 sticks of) G.Skill F43600-C16-8GVKC. I built this system back in late 2019 and it was running perfectly for a long time; then I started having issues after some BIOS updates and updating Unraid. I am unsure now if I can backdate to an earlier BIOS revision, and I am unsure of the version I was most stable with. Also, one of the benefits I was looking forward to with the newer BIOS revisions was the better (more separated) IOMMU groups for VMs and other uses.

I am curious if you are still running that same system and if you could help with any information about what "fixed" your issues. Thanks.
pervel Posted July 3 Author

9 hours ago, Fastcompjason said:
"I am curious if you are still running that same system and if you could help with any information about what "fixed" your issues. Thanks."

My issue was solved by replacing faulty RAM modules. If your RAM is not faulty, then I guess you have a different issue. You're probably better off starting a new thread. 😊