pervel Posted November 22, 2022

My cache pool, consisting of 2 Samsung SSD 970 EVO NVMe drives, seems to be having issues. First, it was marked read-only. Then I decided to move all data to the array, reformat the cache pool, and move the files back. But even after that I still get lots of errors (though so far it has not been marked read-only again):

Nov 22 13:35:59 Basse kernel: BTRFS error (device dm-1): bdev /dev/mapper/nvme0n1p1 errs: wr 0, rd 0, flush 0, corrupt 67, gen 0
Nov 22 13:35:59 Basse kernel: BTRFS warning (device dm-1): csum failed root 5 ino 20192 off 283238006784 csum 0x270bc44d expected csum 0x0c0e90f7 mirror 1
Nov 22 13:35:59 Basse kernel: BTRFS error (device dm-1): bdev /dev/mapper/nvme1n1p1 errs: wr 0, rd 0, flush 0, corrupt 42, gen 0
Nov 22 13:35:59 Basse kernel: BTRFS warning (device dm-1): csum failed root 5 ino 20179 off 133788278784 csum 0xaa8b2d45 expected csum 0x3b4fa3e7 mirror 1
Nov 22 13:35:59 Basse kernel: BTRFS warning (device dm-1): csum failed root 5 ino 20179 off 133788286976 csum 0x30968432 expected csum 0x0d3e5c34 mirror 1
Nov 22 13:35:59 Basse kernel: BTRFS error (device dm-1): bdev /dev/mapper/nvme1n1p1 errs: wr 0, rd 0, flush 0, corrupt 43, gen 0
Nov 22 13:35:59 Basse kernel: BTRFS error (device dm-1): bdev /dev/mapper/nvme1n1p1 errs: wr 0, rd 0, flush 0, corrupt 44, gen 0

What could be the cause of this? Are the SSDs failing?

basse-diagnostics-20221122-1605.zip
JorgeB Posted November 22, 2022

Btrfs is detecting data corruption. This is usually RAM related, and Ryzen with overclocked RAM like yours is known to corrupt data in some cases. Start by correcting that, then run a scrub.
pervel Posted November 22, 2022 Author

20 minutes ago, JorgeB said:
"Btrfs is detecting data corruption. This is usually RAM related, and Ryzen with overclocked RAM like yours is known to corrupt data in some cases. Start by correcting that, then run a scrub."

Thanks for the fast response. I adjusted the RAM down to 2133 MHz (this was the "Auto" setting in the BIOS) and ran a scrub. Sadly, I'm seeing the same errors, both when running the scrub and when running a Windows VM from the cache pool.

basse-diagnostics-20221122-1811.zip
JorgeB Posted November 22, 2022

Assuming the RAM issue is corrected, you can also run memtest to check for a bad DIMM, then delete (or restore from backups) the corrupt files listed in the syslog after the scrub.
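For anyone following along, the scrub-and-clean-up workflow JorgeB describes can be sketched roughly like this. The `/mnt/cache` mount point is an assumption for a typical Unraid cache pool; the privileged btrfs commands are shown as comments, and the runnable part only demonstrates pulling a file path out of a checksum warning of the kind seen later in this thread:

```shell
# On the Unraid host (run as root; /mnt/cache is an assumed mount point):
#   btrfs scrub start -B /mnt/cache   # -B runs the scrub in the foreground
#   btrfs scrub status /mnt/cache
#
# The scrub logs checksum errors to the syslog with a "(path: ...)" suffix.
# Extract the affected file path from such a line with sed:
sample='Nov 22 21:33:06 Basse kernel: BTRFS warning (device dm-1): checksum error at logical 264171896832 on dev /dev/mapper/nvme0n1p1, physical 135347519488, root 5, inode 20179, offset 159330914304, length 4096, links 1 (path: domains/GamePC/vdisk2.qcow2)'
printf '%s\n' "$sample" | sed -n 's/.*(path: \(.*\))$/\1/p'
# → domains/GamePC/vdisk2.qcow2
```

Each path printed this way is a file whose contents no longer match their checksums; those are the ones to delete or restore from backup.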
pervel Posted November 22, 2022 Author

I'm running MemTest86 right now and, sadly, it is indeed reporting errors. I have no experience with this program. Does it sound like this is the cause of my problems? I don't love having to replace all that RAM. ☹️
JorgeB Posted November 22, 2022

If memtest finds errors, there's a problem; try testing the DIMMs one by one.
pervel Posted November 22, 2022 Author

Yes, I will do that. Thanks for your help.
pervel Posted November 22, 2022 Author

I don't understand what's going on. I have now run memtest on each of the 4 RAM sticks. 3 of them showed errors; only one did not. What are the chances of that? I wonder if this means something else is wrong with the computer.

Then I started Unraid with the single good RAM stick and deleted the files that scrub had complained about in the syslog. A new scrub then finished without reporting errors. However, when I start both of my Windows VMs, I still get loads of BTRFS errors in the syslog, yet scrub reports nothing. How can this be?

basse-diagnostics-20221122-2152.zip
pervel Posted November 23, 2022 Author

When I ran the scrub I got BTRFS errors that referenced paths to files on the cache, such as:

Nov 22 21:33:06 Basse kernel: BTRFS error (device dm-1): bdev /dev/mapper/nvme0n1p1 errs: wr 0, rd 0, flush 0, corrupt 763, gen 0
Nov 22 21:33:06 Basse kernel: BTRFS warning (device dm-1): checksum error at logical 264171896832 on dev /dev/mapper/nvme0n1p1, physical 135347519488, root 5, inode 20179, offset 159330914304, length 4096, links 1 (path: domains/GamePC/vdisk2.qcow2)

After deleting those files I no longer get scrub errors. But when I try to run a VM from the cache, I still get a more generic error such as:

Nov 22 21:42:06 Basse kernel: BTRFS error (device dm-1): bdev /dev/mapper/nvme0n1p1 errs: wr 0, rd 0, flush 0, corrupt 780, gen 0
Nov 22 21:42:08 Basse kernel: BTRFS warning (device dm-1): csum failed root 5 ino 20188 off 10248269824 csum 0xbf691df4 expected csum 0xd7cd4596 mirror 1

Can you explain what is going on? Are some files still damaged? Or is the physical SSD damaged? What is the best way forward for me?

I should note that it's only the two big Windows VMs that give these errors. I have a number of Linux VMs that don't seem to give errors, but they are also noticeably smaller.
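Side note on those "corrupt N" numbers: they are cumulative per-device error counters, not a count of new errors each time. A rough sketch, assuming `/mnt/cache` is the pool mount point (the privileged commands are comments; the runnable part just pulls the count out of the sample syslog line above):

```shell
# On the Unraid host the counters can be read and reset with:
#   btrfs device stats /mnt/cache
#   btrfs device stats -z /mnt/cache   # -z resets the counters after printing
#
# Extract the cumulative corrupt count from a syslog line like the ones above:
line='Nov 22 21:33:06 Basse kernel: BTRFS error (device dm-1): bdev /dev/mapper/nvme0n1p1 errs: wr 0, rd 0, flush 0, corrupt 763, gen 0'
printf '%s\n' "$line" | grep -o 'corrupt [0-9]*' | awk '{print $2}'
# → 763
```

So a jump from "corrupt 763" to "corrupt 780" means 17 new corruption events since the counter was last reset, not 780 fresh errors.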
JorgeB Posted November 23, 2022

12 hours ago, pervel said:
"I don't understand what's going on. I have now run memtest on each of the 4 RAM sticks. 3 of them showed errors; only one did not. What are the chances of that? I wonder if this means something else is wrong with the computer."

That is strange; there could be a board/CPU issue, though if the other DIMM doesn't give any errors after multiple runs, it could just be a bad batch.

6 hours ago, pervel said:
"But when I try to run a VM from the cache, I still get a more generic error such as:"

And if you run another scrub there are no errors? If yes, it suggests there's still a RAM problem, or some other hardware issue still causing data corruption.
pervel Posted November 23, 2022 Author

I've run 4 passes of memtest on the remaining good RAM stick. It gives no errors. I've used these 4 sticks for 2 years now without issues, so it is really strange.

There are no errors reported when running scrub. But when I run the VMs, I get BTRFS errors. The inode in the error message does indeed belong to the virtual disk file for the corresponding VM. These files were copied over from the SSD at the time when I was still using the bad RAM sticks. Should that matter?

I plan to buy new RAM sticks today. I really hope there are no other hardware issues present, but it is worrying that 3 sticks have failed.
JorgeB Posted November 23, 2022

It's strange. Try moving the vdisk to the array (the VM must be off) and see if there are any errors; if there aren't, move it back.
pervel Posted November 23, 2022 Author

1 hour ago, JorgeB said:
"It's strange. Try moving the vdisk to the array (the VM must be off) and see if there are any errors; if there aren't, move it back."

I did the following:

1. Copied the vdisk from cache to array
2. Changed the VM to point to the new vdisk location on the array
3. Ran the VM. Did not get any errors in the log
4. Copied the vdisk back from array to cache
5. Changed the VM back to point to the vdisk location on the cache
6. Ran the VM. Got the same BTRFS errors as before, all pointing to the inode for the vdisk

What can I learn from this? The array is formatted XFS, not BTRFS. Would XFS give the same kind of errors if the file were damaged?

Btw, the VM appears to run just fine despite these errors. But maybe that's just luck.
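One way to rule out a bad copy during a round-trip like this is to hash the vdisk at each hop; if the hashes match, the bits on disk are identical and any remaining errors come from elsewhere. A minimal sketch; the real paths are assumptions, and the runnable part demonstrates the idea on a scratch file:

```shell
# On the real system this would be something like (VM shut down first):
#   sha256sum /mnt/cache/domains/GamePC/vdisk2.qcow2
#   cp /mnt/cache/domains/GamePC/vdisk2.qcow2 /mnt/user/domains/GamePC/
#   sha256sum /mnt/user/domains/GamePC/vdisk2.qcow2
#
# Demonstrated here on a scratch file:
src=$(mktemp) && dst=$(mktemp)
printf 'fake vdisk contents' > "$src"
cp "$src" "$dst"
a=$(sha256sum "$src" | cut -d' ' -f1)   # hash before the copy
b=$(sha256sum "$dst" | cut -d' ' -f1)   # hash after the copy
[ "$a" = "$b" ] && echo "copy verified"
rm -f "$src" "$dst"
```

Note that XFS has no data checksumming, so it would not log csum errors for a damaged file the way btrfs does; a clean run on the array therefore doesn't prove the file is intact.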
JorgeB Posted November 23, 2022

Copy the vdisk from the array back to cache, replacing the old one.
pervel Posted November 23, 2022 Author

2 minutes ago, JorgeB said:
"Copy the vdisk from the array back to cache, replacing the old one."

But that's what I did. And still the same errors.
pervel Posted November 23, 2022 Author

I think I have found the problem! It's completely unrelated to any RAM issues; it has to do with how I have defined my vdisk. For a long time I have been using the following definition:

<disk type='file' device='disk'>
  <driver name='qemu' type='qcow2' cache='none'/>
  <source file='/mnt/user/domains/WinVPN/vdisk1.qcow2'/>
  <target dev='hdc' bus='virtio'/>
  <boot order='1'/>
  <address type='pci' domain='0x0000' bus='0x00' slot='0x05' function='0x0'/>
</disk>

Notice the line with cache='none'. I changed that from the default cache='writeback'. I honestly cannot remember why; I think I read somewhere that it had better performance. However, this appears to be the reason for all the BTRFS errors. If I change back to cache='writeback', I no longer get any BTRFS errors. I'm happy to change back to the default, but I do wonder if this is perhaps a bug?
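For reference, a quick way to audit which cache mode each VM's disks currently use, assuming the standard virsh CLI is available on the Unraid host (the runnable part demonstrates the grep on the driver line from the definition above):

```shell
# On the host, for a VM named e.g. WinVPN (name is an assumption):
#   virsh dumpxml WinVPN | grep -o "cache='[^']*'"
#
# Demonstrated on the <driver> line from the vdisk definition above:
xml="<driver name='qemu' type='qcow2' cache='none'/>"
printf '%s\n' "$xml" | grep -o "cache='[^']*'"
# → cache='none'
```

With cache='none', QEMU opens the image with O_DIRECT, bypassing the host page cache; with the writeback default, I/O goes through the page cache instead. That difference in I/O path is plausibly why the two settings behaved differently here.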
JorgeB Posted November 24, 2022

14 hours ago, pervel said:
"I'm happy to change back to the default, but I do wonder if this is perhaps a bug?"

Not sure; I don't remember previous complaints, but let me see if I can duplicate that.
pervel Posted November 24, 2022 Author

2 hours ago, JorgeB said:
"Not sure; I don't remember previous complaints, but let me see if I can duplicate that."

Sadly, I did get a couple of BTRFS errors today even with these changes. But far fewer than before, when the log was filling up with them. So I don't know what to conclude right now. I'm getting new RAM sticks today; I will try those and report back.
pervel Posted November 24, 2022 Author

I've installed the new RAM. So far that really seems to have helped: I'm getting none of the BTRFS errors from before. 😀 So the conclusion so far is that it really was bad RAM. Of course that worries me, since it's hard to believe all 4 RAM sticks failed without some external reason.

I have not yet tried changing my vdisks back to cache='none'; I would like to see the system run stable for a few days straight first.

Thanks for the assistance, @JorgeB!
pervel Posted November 25, 2022 Author (Solution)

I have now also tested with the vdisk using cache='none' again, this time without any errors. Yay!

I think I can finally conclude that my problems were all hardware related and that new RAM modules have fixed the problem - at least for now. It is still quite strange to me that all four of my RAM modules apparently failed; that suggests an external cause, but I have no clue what that cause might be. I guess I just have to wait and see if it happens again.
Josef Posted January 2, 2023

I am actually experiencing a very similar issue: an Unraid server on an AMD platform with 32GB RAM. I was using the XMP profile per the RAM specifications and all was fine for about 2 years, but recently I started getting btrfs issues, ran MemTest86, and got lots of errors. I swapped the RAM with my Windows PC (also AMD, but different board/CPU) and it confirmed the above: the RAM from Unraid failed in the other PC as well, but the RAM from that PC was fine in Unraid.

I am currently in the RMA process (lifetime warranty), and Unraid is running on 16GB of 3200 MHz RAM from reserve. Also memtested, and all good; running on the XMP profile, as memtest did not report any errors at this speed.

I wonder why it happened after about 2 years without any issues. I was even running VMs, so the RAM was about 80% used at times. Until now, any RAM I've had either did not work from day 1 or worked reliably the whole time.
Fastcompjason Posted July 3

I just found this post and had replied to a similar post (linked). I have been testing my RAM, and have tested before without any errors being reported at all. I am also running an AMD Ryzen 9 3900X system on an MSI X570 Prestige Creation mobo with 32GB (4 sticks of) G.Skill F43600-C16-8GVKC. I built this system back in late 2019 and it was running perfectly for a long time; then I started having issues after some BIOS updates and updating Unraid. I am unsure now if I can backdate to an earlier BIOS revision, and I am unsure of the version I was most stable with. Also, one of the benefits I was looking forward to with the newer BIOS revisions was the better (more separated) IOMMU groups for VMs and other uses.

I am curious if you are still running that same system and if you could help with any information about what "fixed" your issues. Thanks.
pervel Posted July 3 Author

9 hours ago, Fastcompjason said:
"I am curious if you are still running that same system and if you could help with any information about what "fixed" your issues. Thanks."

My issue was solved by replacing faulty RAM modules. If your RAM is not faulty, then I guess you have a different issue. You're probably better off starting a new thread. 😊