constant btrfs corruption errors. would like some advice.

Hollandex · November 15, 2022

I'm running two 2TB NVMe drives for my cache pool. After some random amount of writes, `btrfs dev stats /mnt/cache` will report corruption_errs on both drives.

When I scrub the cache, I will sometimes get checksum errors, sometimes not. I don't get any other errors from the scrub. The files in question pass their individual md5 checksums (if I have them) and I can safely copy them off the cache. Another odd thing is that it's reporting only 186.35GiB total to scrub. That is nowhere near the full 2TB, but maybe I don't understand the scrub output.

UUID:             71382277-c2da-416b-86b5-6725b66b58d1
Scrub started:    Mon Nov 14 21:09:36 2022
Status:           finished
Duration:         0:00:33
Total to scrub:   186.35GiB
Rate:             6.00GiB/s
Error summary:    no errors found
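
For anyone who wants to log scrub results over time, here is a minimal Python sketch that parses status text like the output above. The function names (`parse_scrub_status`, `duration_seconds`, `scrub_is_clean`) are hypothetical, and it assumes the `Key: value` layout and `H:MM:SS` duration format shown in this post. Other btrfs-progs versions may print different fields.

```python
def parse_scrub_status(text):
    """Split 'Key: value' lines from scrub status text into a dict."""
    fields = {}
    for line in text.splitlines():
        # Split on the first colon only, so timestamps keep their colons.
        key, sep, value = line.partition(":")
        if sep and value.strip():
            fields[key.strip()] = value.strip()
    return fields


def duration_seconds(duration):
    """Convert an 'H:MM:SS' string (assumed format) to seconds."""
    hours, minutes, seconds = (int(part) for part in duration.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def scrub_is_clean(fields):
    """True when the error summary matches the 'no errors found' text."""
    return fields.get("Error summary") == "no errors found"


# Sample copied from the scrub output in this post.
SAMPLE = """\
UUID:             71382277-c2da-416b-86b5-6725b66b58d1
Scrub started:    Mon Nov 14 21:09:36 2022
Status:           finished
Duration:         0:00:33
Total to scrub:   186.35GiB
Rate:             6.00GiB/s
Error summary:    no errors found
"""

if __name__ == "__main__":
    fields = parse_scrub_status(SAMPLE)
    print(fields["Total to scrub"], duration_seconds(fields["Duration"]))
    print("clean" if scrub_is_clean(fields) else "errors reported")
```

Saving each parsed result with a timestamp would show whether checksum errors appear on some scrubs and not others.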

I've disabled all overclocks on RAM and CPU on the motherboard, as far as I can tell. It's an Asus ROG Maximus XIII Hero so if there's something I should be checking in the BIOS, let me know.

I've also run memtest86 against the RAM for ~12 hours with no errors.

Is there anything I can do to diagnose the cause of this? At best, it results in `btrfs dev stats` reporting errors. At worst, I've had my docker image get corrupt and not start, my VMs get corrupt and not start, and one time my entire cache pool got borked and I lost everything on it.

diagnostics-20221114-2118.zip

Edited November 15, 2022 by Hollandex

JorgeB · November 15, 2022

3 hours ago, Hollandex said:

it's reporting only 186.35GiB total to scrub.

That is normal; only used capacity is scrubbed.

Corruption in multiple devices is usually the result of a RAM issue. Try with just one DIMM; if the errors persist, try with just the other one.

Hollandex · December 18, 2022

On 11/15/2022 at 1:19 AM, JorgeB said:

Corruption in multiple devices is usually the result of a RAM issue. Try with just one DIMM; if the errors persist, try with just the other one.

Pulled one stick out and have been running for a few days now with no corruption errors. Not sure if that means the stick I pulled out is bad or what. Figured I'll give it another few days and if there are still no corruption errors, I'll swap the sticks and see if it's a bad stick, or if it's something with running 2 sticks on the mobo.

Hollandex · January 9, 2023

Okay, ran with one DIMM for a while without any issue. Switched to the other DIMM, no issue for about 2 weeks. Then, tonight, I started getting corruption errors again. So I figured the DIMM was bad, swapped back to the "good" DIMM, and I'm still getting errors.

So now I'm not sure what's going on. Bad mobo?

Edited January 9, 2023 by Hollandex

MrGrey · January 9, 2023

On 11/14/2022 at 9:32 PM, Hollandex said:

I've disabled all overclocks on RAM and CPU on the motherboard, as far as I can tell.

Overclocking doesn't just affect what you're overclocking; your mainboard has to handle it as well (all those connections trying to deal with the heat/power).

I don't have an answer, but I wouldn't trust anything after overclocking.

MrGrey.

Hollandex · January 9, 2023

10 minutes ago, MrGrey said:

Overclocking doesn't just affect what you're overclocking; your mainboard has to handle it as well (all those connections trying to deal with the heat/power).

I don't have an answer, but I wouldn't trust anything after overclocking.

MrGrey.

Yup. As the text you quoted says, "I've disabled all overclocks".

JorgeB · January 9, 2023

3 hours ago, Hollandex said:

So I figured the DIMM was bad, swapped back to the "good" DIMM, and I'm still getting errors.

If the other DIMM was bad, it will have corrupted the fs/data; you need to fix/clear the errors before returning to the good stick.

Hollandex · January 9, 2023

14 hours ago, JorgeB said:

If the other DIMM was bad, it will have corrupted the fs/data; you need to fix/clear the errors before returning to the good stick.

Yup, I did that. I pass the -z flag every time I run it.

Hollandex · January 9, 2023

I swapped the RAM sticks again, just to make sure I wasn't crazy. Corruption errors started almost immediately after starting Unraid. This is how it happens every time: it's fine for a while, then the errors slowly start appearing, and they get more frequent until something gets really corrupt, like my docker image. Then I have to format the drives completely and start the process all over.

Another interesting thing I've noticed is that the errors seem to double on one of the drives. For instance, here's my current dev stats output. It shows 4 errors on nvme0 and 8 on nvme1. It's not always double but it often is, or close to it.

[/dev/nvme1n1p1].write_io_errs    0
[/dev/nvme1n1p1].read_io_errs     0
[/dev/nvme1n1p1].flush_io_errs    0
[/dev/nvme1n1p1].corruption_errs 8
[/dev/nvme1n1p1].generation_errs 0
[/dev/nvme0n1p1].write_io_errs    0
[/dev/nvme0n1p1].read_io_errs     0
[/dev/nvme0n1p1].flush_io_errs    0
[/dev/nvme0n1p1].corruption_errs 4
[/dev/nvme0n1p1].generation_errs 0
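
To compare the per-device counts over time (for example, to check how often one drive shows double the other), a minimal Python sketch along these lines could parse the stats text. The names `parse_dev_stats` and `nonzero_counters` are hypothetical, and the regular expression assumes the `[device].counter value` layout shown above.

```python
import re
from collections import defaultdict

# Matches lines such as "[/dev/nvme1n1p1].corruption_errs 8"
STAT_LINE = re.compile(
    r"^\[(?P<device>[^\]]+)\]\.(?P<counter>\w+)\s+(?P<value>\d+)$"
)


def parse_dev_stats(text):
    """Return {device: {counter: value}} from `btrfs dev stats` text."""
    stats = defaultdict(dict)
    for line in text.splitlines():
        match = STAT_LINE.match(line.strip())
        if match:
            stats[match["device"]][match["counter"]] = int(match["value"])
    return dict(stats)


def nonzero_counters(stats):
    """Keep only the counters that are above zero, per device."""
    result = {}
    for device, counters in stats.items():
        bad = {name: value for name, value in counters.items() if value > 0}
        if bad:
            result[device] = bad
    return result


# Sample copied from the dev stats output in this post.
SAMPLE = """\
[/dev/nvme1n1p1].write_io_errs    0
[/dev/nvme1n1p1].read_io_errs     0
[/dev/nvme1n1p1].flush_io_errs    0
[/dev/nvme1n1p1].corruption_errs 8
[/dev/nvme1n1p1].generation_errs 0
[/dev/nvme0n1p1].write_io_errs    0
[/dev/nvme0n1p1].read_io_errs     0
[/dev/nvme0n1p1].flush_io_errs    0
[/dev/nvme0n1p1].corruption_errs 4
[/dev/nvme0n1p1].generation_errs 0
"""

if __name__ == "__main__":
    for device, counters in nonzero_counters(parse_dev_stats(SAMPLE)).items():
        print(device, counters)
```

Feeding it the text captured from `btrfs dev stats /mnt/cache` at regular intervals would give a history of when the counters start climbing.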

Edited January 10, 2023 by Hollandex

Hollandex · January 10, 2023

15 hours ago, JorgeB said:

If the other DIMM was bad, it will have corrupted the fs/data; you need to fix/clear the errors before returning to the good stick.

Sorry, I realize what you're saying now. I ran a scrub and got 6 errors that it can't fix. The files are odd. Things like

usr/lib/freerdp2/libparallel-client.so

etc/ssl/certs/ca-certificates.crt

I'm not sure those files are even on the cache drives, are they?

Edited January 10, 2023 by Hollandex

JorgeB · January 10, 2023

Post new diags after running a scrub.

Hollandex · January 10, 2023

11 hours ago, JorgeB said:

Post new diags after running a scrub.

sanctuary-diagnostics-20230110-1228.zip

JorgeB · January 11, 2023

That's strange, I assume no etc folder exists on cache?

apandey · January 11, 2023

Since the RAM is a potential suspect, have you tried running a long memtest session? You will need the download from the memtest86 site if it's ECC RAM.

Hollandex · January 11, 2023

7 hours ago, JorgeB said:

That's strange, I assume no etc folder exists on cache?

Correct. At the moment, all the cache has on it is domains, system, and appdata. No other directories.

JorgeB · January 11, 2023

I've never seen anything similar. I suggest backing up and re-formatting the pool, then monitoring for new errors. If new corruptions are found, there's still some hardware issue, usually RAM related.

Hollandex · January 11, 2023

56 minutes ago, apandey said:

Since the RAM is a potential suspect, have you tried running a long memtest session? You will need the download from the memtest86 site if it's ECC RAM.

Yeah, I ran memtest for about 12 hours a while back without any errors at all. This really seems to be a BTRFS issue. Obviously, my hardware is playing some part in it, but this RAM works fine in every other application. When it comes to BTRFS, though, something isn't right.

I'm sort of at a point where I don't care about redundancy on my cache drive. It hasn't helped anyway. I've had my entire cache drives go corrupt and become unrecoverable, leading me to lose everything on them. So I'm tempted to just use XFS and have an aggressive backup schedule or something.

Hollandex · January 11, 2023

6 minutes ago, JorgeB said:

I've never seen anything similar. I suggest backing up and re-formatting the pool, then monitoring for new errors. If new corruptions are found, there's still some hardware issue, usually RAM related.

Yeah, I'm running Mover now to get everything off the cache so I can format it again. I might abandon BTRFS and go to XFS with an aggressive backup schedule or something. At the very least, I'm curious whether I run into any data corruption issues outside of BTRFS.

I appreciate all your help. I believe you've been the one replying to every help thread I've started. Thank you.

Toepocalypse · December 27, 2023

Sorry to resurrect a dead thread - just checking in to see how things resolved. Did you move from btrfs to xfs for cache? Did that resolve corruption issues?

Best

Hollandex · December 29, 2023

On 12/26/2023 at 7:19 PM, Toepocalypse said:

Sorry to resurrect a dead thread - just checking in to see how things resolved. Did you move from btrfs to xfs for cache? Did that resolve corruption issues?

I switched to a single drive using xfs, with an aggressive backup schedule. I haven't had a single corruption issue since I switched.

I'd like to move to zfs at some point, but I don't see a reason to yet. If it ain't broke...
