
Constant btrfs corruption errors. Would like some advice.



I'm running two 2TB NVMe drives for my cache pool.  After some random amount of writes, `btrfs dev stats /mnt/cache` will report corruption_errs on both drives.

 

When I scrub the cache, I will sometimes get checksum errors, sometimes not. I don't get any other errors from the scrub. The files in question pass their individual md5 checksums (if I have them) and I can safely copy them off the cache. Another odd thing is that it's reporting only 186.35GiB total to scrub. That is nowhere near the full 2TB, but maybe I don't understand the scrub output.

 

UUID:             71382277-c2da-416b-86b5-6725b66b58d1
Scrub started:    Mon Nov 14 21:09:36 2022
Status:           finished
Duration:         0:00:33
Total to scrub:   186.35GiB
Rate:             6.00GiB/s
Error summary:    no errors found
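
For what it's worth, scrub only reads allocated data and metadata (both copies on a raid1 pool), so "Total to scrub" reflects how much is actually stored, not the 2TB device size. A quick way to compare, assuming the pool is mounted at /mnt/cache as above:

# Allocated vs. unallocated space; the used figures should line up
# roughly with the scrub's "Total to scrub" number.
btrfs filesystem usage /mnt/cache
btrfs filesystem df /mnt/cache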

 

I've disabled all overclocks on RAM and CPU on the motherboard, as far as I can tell. It's an Asus ROG Maximus XIII Hero, so if there's something I should be checking in the BIOS, let me know.

 

I've also run memtest86 against the RAM for ~12 hours with no errors.
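
memtest86 can miss faults that only show up under a live OS workload, so an in-OS stress run is sometimes worth trying as well. A rough sketch, assuming the memtester utility is available (it isn't bundled with Unraid, so treat this as illustrative) and that around 8 GiB of RAM can be spared while the system is otherwise idle:

# Lock 8 GiB of RAM and run 4 test passes; any failure here points at
# the RAM, memory controller, or board rather than the filesystem.
memtester 8G 4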

 

Is there anything I can do to diagnose the cause of this?  At best, it results in `btrfs dev stats` reporting errors.  At worst, I've had my docker image get corrupt and not start, my VMs get corrupt and not start, and one time my entire cache pool got borked and I lost everything on it.

diagnostics-20221114-2118.zip
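
A rough triage sequence, assuming the pool is mounted at /mnt/cache; the -z flag zeroes the counters after printing, so any errors that show up afterwards are new:

# 1. Print the per-device error counters and reset them to zero.
btrfs dev stats -z /mnt/cache

# 2. Re-read all allocated data and metadata (runs in the foreground).
btrfs scrub start -B /mnt/cache

# 3. See what the kernel logged about checksum failures during the scrub.
dmesg | grep -iE 'btrfs.*(checksum|csum)'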

  • 1 month later...
On 11/15/2022 at 1:19 AM, JorgeB said:

Corruption on multiple devices is usually the result of a RAM issue; try with just one DIMM, and if it's the same, try with just the other one.

 

 

 

Pulled one stick out and have been running for a few days now with no corruption errors. Not sure if that means the stick I pulled out is bad or what. Figured I'll give it another few days and if there are still no corruption errors, I'll swap the sticks and see if it's a bad stick, or if it's something with running 2 sticks on the mobo.

  • 3 weeks later...

Okay, ran with one DIMM for a while without any issue. Switched to the other DIMM, no issue for about 2 weeks. Then, tonight, I started getting corruption errors again. So I figured the DIMM was bad, swapped back to the "good" DIMM, and I'm still getting errors.

 

So now I'm not sure what's going on. Bad mobo?

On 11/14/2022 at 9:32 PM, Hollandex said:

I've disabled all overclocks on RAM and CPU on the motherboard, as far as I can tell. 

 

Overclocking doesn't just affect what you're overclocking; your mainboard has to handle it as well (all those connections trying to deal with the heat/power).

 

I don't have an answer, but I wouldn't trust anything after overclocking.

 

MrGrey.

10 minutes ago, MrGrey said:

 

Overclocking doesn't just affect what you're overclocking; your mainboard has to handle it as well (all those connections trying to deal with the heat/power).

 

I don't have an answer, but I wouldn't trust anything after overclocking.

 

MrGrey.

 

Yup. As the text you quoted says, "I've disabled all overclocks".


I swapped the RAM sticks again, just to make sure I wasn't crazy. Corruption errors started almost immediately after starting Unraid. This is how it happens every time: it's fine for a while, then the errors slowly start appearing, and they get more frequent until something gets badly corrupted, like my docker image. Then I have to format the drives completely and start the process all over.
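
Since the counters creep up gradually, it may help to watch them instead of waiting for something to break. A small sketch, assuming a cron entry is acceptable (the logger tag and message are just examples):

# Hourly check: 'btrfs dev stats -c' exits non-zero if any counter is
# non-zero, so a syslog line appears as soon as the first error lands.
0 * * * * btrfs dev stats -c /mnt/cache > /dev/null || echo "cache pool reporting btrfs errors" | logger -t btrfs-monitor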

 

Another interesting thing I've noticed is that the errors seem to double on one of the drives. For instance, here's my current dev stats output. It shows 4 errors on nvme0 and 8 on nvme1. It's not always double but it often is, or close to it.

 

[/dev/nvme1n1p1].write_io_errs    0
[/dev/nvme1n1p1].read_io_errs     0
[/dev/nvme1n1p1].flush_io_errs    0
[/dev/nvme1n1p1].corruption_errs  8
[/dev/nvme1n1p1].generation_errs  0
[/dev/nvme0n1p1].write_io_errs    0
[/dev/nvme0n1p1].read_io_errs     0
[/dev/nvme0n1p1].flush_io_errs    0
[/dev/nvme0n1p1].corruption_errs  4
[/dev/nvme0n1p1].generation_errs  0
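
Since those counters are cumulative per device, the asymmetry on its own doesn't say which drive (if either) is at fault; a per-device scrub summary shows where the mismatching blocks were actually read from. A minimal check, assuming a scrub has already been run:

# Per-device statistics for the most recent scrub.
btrfs scrub status -d /mnt/cache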

15 hours ago, JorgeB said:

If the other DIMM was bad it will have corrupted the fs/data; you need to fix/clear the errors before returning to the good stick.

 

Sorry, I realize what you're saying now. I ran a scrub and got 6 errors that it can't fix. The files it lists are odd. Things like:

usr/lib/freerdp2/libparallel-client.so

etc/ssl/certs/ca-certificates.crt

 

I'm not sure those files are even on the cache drives, are they?
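
One hedged way to check: scrub logs the inode and, where it can resolve it, the path of every block it cannot fix, so the kernel log (while the messages are still in the ring buffer) usually shows which files on the pool the errors belong to:

# Checksum mismatches found by the scrub, with file paths where available.
dmesg | grep -i 'checksum error'
# Blocks the scrub could not repair.
dmesg | grep -i 'unable to fixup'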

56 minutes ago, apandey said:

Since the RAM is a potential suspect, have you tried running a long memtest session? You will need the download from the memtest86 site if it's ECC RAM.

 

Yeah, I ran memtest for about 12 hours a while back without any errors at all. This really seems to be a btrfs issue. Obviously, my hardware is playing some part in it, but this RAM works fine in every other application. When it comes to btrfs, though, something isn't right.

 

I'm sort of at a point where I don't care about redundancy on my cache pool. It hasn't helped anyway. I've had the entire pool go corrupt and become unrecoverable, losing everything on it. So I'm tempted to just use XFS and have an aggressive backup schedule or something.

6 minutes ago, JorgeB said:

Never seen anything similar; suggest backing up and re-formatting the pool, then monitoring for new errors. If new corruption is found, there's still some hardware issue, usually RAM related.

 

Yeah, I'm running Mover now to get everything off the cache so I can format it again. I might abandon btrfs and go to XFS with an aggressive backup schedule or something. At the very least, I'm curious whether I run into any data corruption issues outside of btrfs.

 

I appreciate all your help. I believe you've been the one replying to every help thread I've started. Thank you.

  • 11 months later...
On 12/26/2023 at 7:19 PM, Toepocalypse said:

Sorry to resurrect a dead thread - just checking in to see how things resolved. Did you move from btrfs to xfs for cache? Did that resolve corruption issues?

 

I switched to a single drive using xfs, with an aggressive backup schedule. I haven't had a single corruption issue since I switched.

 

I'd like to move to zfs at some point, but I don't see a reason to yet. If it ain't broke...
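
For reference, an "aggressive backup schedule" can be as simple as a nightly rsync of the cache to an array share; the paths and timing below are only an example, not the actual setup:

# 03:00 every night: mirror the cache to a backup share on the array.
# --delete keeps the copy exact; drop it to retain removed files.
0 3 * * * rsync -aH --delete /mnt/cache/ /mnt/user/backups/cache/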

