Hollandex Posted November 15, 2022 (edited November 15, 2022)
I'm running two 2TB NVMe drives for my cache pool. After some random amount of writes, `btrfs dev stats /mnt/cache` will report corruption_errs on both drives. When I scrub the cache, I will sometimes get checksum errors, sometimes not. I don't get any other errors from the scrub. The files in question pass their individual md5 checksums (if I have them) and I can safely copy them off the cache.
Another odd thing is that it's reporting only 186.35GiB total to scrub. That is nowhere near the full 2TB, but maybe I don't understand the scrub output.
    UUID: 71382277-c2da-416b-86b5-6725b66b58d1
    Scrub started: Mon Nov 14 21:09:36 2022
    Status: finished
    Duration: 0:00:33
    Total to scrub: 186.35GiB
    Rate: 6.00GiB/s
    Error summary: no errors found
I've disabled all overclocks on RAM and CPU on the motherboard, as far as I can tell. It's an Asus ROG Maximus XIII Hero, so if there's something I should be checking in the BIOS, let me know. I've also run memtest86 against the RAM for ~12 hours with no errors.
Is there anything I can do to diagnose the cause of this? At best, it results in `btrfs dev stats` reporting errors. At worst, I've had my docker image get corrupt and not start, my VMs get corrupt and not start, and one time my entire cache pool got borked and I lost everything on it.
Attachment: diagnostics-20221114-2118.zip
JorgeB Posted November 15, 2022
3 hours ago, Hollandex said: "it's reporting only 186.35GiB total to scrub."
That is normal; only used capacity is scrubbed. Corruption on multiple devices is usually the result of a RAM issue. Try with just one DIMM; if you see the same errors, try with just the other one.
Hollandex (Author) Posted December 18, 2022
On 11/15/2022 at 1:19 AM, JorgeB said: "Corruption on multiple devices is usually the result of a RAM issue. Try with just one DIMM; if you see the same errors, try with just the other one."
Pulled one stick out and have been running for a few days now with no corruption errors. Not sure if that means the stick I pulled out is bad or what. Figured I'll give it another few days and if there are still no corruption errors, I'll swap the sticks and see if it's a bad stick, or if it's something with running 2 sticks on the mobo.
Hollandex (Author) Posted January 9, 2023 (edited January 9, 2023)
Okay, ran with one DIMM for a while without any issue. Switched to the other DIMM, no issue for about 2 weeks. Then, tonight, I started getting corruption errors again. So I figured the DIMM was bad, swapped back to the "good" DIMM, and I'm still getting errors. So now I'm not sure what's going on. Bad mobo?
MrGrey Posted January 9, 2023
On 11/14/2022 at 9:32 PM, Hollandex said: "I've disabled all overclocks on RAM and CPU on the motherboard, as far as I can tell."
Overclocking doesn't just affect what you're overclocking; your mainboard has to handle it as well (all those connections trying to deal with the heat/power). I don't have an answer, but I wouldn't trust anything after overclocking.
MrGrey.
Hollandex (Author) Posted January 9, 2023
10 minutes ago, MrGrey said: "Overclocking doesn't just affect what you're overclocking; your mainboard has to handle it as well (all those connections trying to deal with the heat/power). I don't have an answer, but I wouldn't trust anything after overclocking. MrGrey."
Yup. As the text you quoted says, "I've disabled all overclocks".
JorgeB Posted January 9, 2023
3 hours ago, Hollandex said: "So I figured the DIMM was bad, swapped back to the "good" DIMM, and I'm still getting errors."
If the other DIMM was bad, it will have corrupted the fs/data. You need to fix/clear the errors before returning to the good stick.
Hollandex (Author) Posted January 9, 2023
14 hours ago, JorgeB said: "If the other DIMM was bad it will corrupt the fs/data, you need to fix/clear the errors before returning to the good stick."
Yup, I did that. I pass the `-z` flag every time I run it.
Hollandex (Author) Posted January 9, 2023 (edited January 10, 2023)
I swapped the RAM sticks, again, just to make sure I wasn't crazy. Corruption errors started almost immediately after starting Unraid.
This is how it happens every time. It's fine for a while, then they slowly start happening. And they get faster and more frequent until something gets really corrupt, like my docker image. Then I have to format the drives completely and start the process all over.
Another interesting thing I've noticed is that the errors seem to double on one of the drives. For instance, here's my current dev stats output. It shows 4 errors on nvme0 and 8 on nvme1. It's not always double but it often is, or close to it.
    [/dev/nvme1n1p1].write_io_errs 0
    [/dev/nvme1n1p1].read_io_errs 0
    [/dev/nvme1n1p1].flush_io_errs 0
    [/dev/nvme1n1p1].corruption_errs 8
    [/dev/nvme1n1p1].generation_errs 0
    [/dev/nvme0n1p1].write_io_errs 0
    [/dev/nvme0n1p1].read_io_errs 0
    [/dev/nvme0n1p1].flush_io_errs 0
    [/dev/nvme0n1p1].corruption_errs 4
    [/dev/nvme0n1p1].generation_errs 0
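For anyone who wants to log these counters over time rather than compare them by eye, here is a minimal Python sketch that parses text in the format posted above. The function names `parse_dev_stats` and `nonzero_counters` are made up for this illustration and are not part of btrfs-progs or Unraid. The sample text is the output from this post, and the sketch assumes every line has the `[device].counter value` shape shown there.

```python
import re
from collections import defaultdict

# Matches lines such as "[/dev/nvme1n1p1].corruption_errs 8"
STAT_LINE = re.compile(r"^\[(?P<device>[^\]]+)\]\.(?P<counter>\w+)\s+(?P<value>\d+)$")


def parse_dev_stats(text):
    """Return {device: {counter: value}} parsed from dev stats text output."""
    stats = defaultdict(dict)
    for line in text.splitlines():
        match = STAT_LINE.match(line.strip())
        if match:  # silently skip anything that is not a counter line
            stats[match["device"]][match["counter"]] = int(match["value"])
    return dict(stats)


def nonzero_counters(stats):
    """Keep only the counters that are above zero, per device."""
    return {
        device: {name: value for name, value in counters.items() if value > 0}
        for device, counters in stats.items()
    }


# Sample taken from the output posted in this thread.
SAMPLE = """\
[/dev/nvme1n1p1].write_io_errs 0
[/dev/nvme1n1p1].read_io_errs 0
[/dev/nvme1n1p1].flush_io_errs 0
[/dev/nvme1n1p1].corruption_errs 8
[/dev/nvme1n1p1].generation_errs 0
[/dev/nvme0n1p1].write_io_errs 0
[/dev/nvme0n1p1].read_io_errs 0
[/dev/nvme0n1p1].flush_io_errs 0
[/dev/nvme0n1p1].corruption_errs 4
[/dev/nvme0n1p1].generation_errs 0
"""

if __name__ == "__main__":
    for device, counters in nonzero_counters(parse_dev_stats(SAMPLE)).items():
        print(device, counters)
```

Feeding it a saved copy of the output each day (for example, from a scheduled script) would show whether the counters really do grow faster over time, as described above.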
Hollandex (Author) Posted January 10, 2023 (edited January 10, 2023)
15 hours ago, JorgeB said: "If the other DIMM was bad it will corrupt the fs/data, you need to fix/clear the errors before returning to the good stick."
Sorry, I realize what you're saying now. I ran a scrub and get 6 errors that it can't fix. The files are odd. Things like
    usr/lib/freerdp2/libparallel-client.so
    etc/ssl/certs/ca-certificates.crt
I'm not sure those files are even on the cache drives, are they?
JorgeB Posted January 10, 2023
Post new diags after running a scrub.
Hollandex (Author) Posted January 10, 2023
11 hours ago, JorgeB said: "Post new diags after running a scrub."
Attachment: sanctuary-diagnostics-20230110-1228.zip
JorgeB Posted January 11, 2023
That's strange, I assume no etc folder exists on cache?
apandey Posted January 11, 2023
Since the RAM is a potential suspect, have you tried running a long memtest session? You will need the download from the memtest86 site if it's ECC RAM.
Hollandex (Author) Posted January 11, 2023
7 hours ago, JorgeB said: "That's strange, I assume no etc folder exists on cache?"
Correct. At the moment, all the cache has on it is domains, system, and appdata. No other directories.
JorgeB Posted January 11, 2023
Never seen anything similar. I suggest backing up and re-formatting the pool, then monitoring for new errors. If new corruption is found, there's still some hardware issue, usually RAM related.
Hollandex (Author) Posted January 11, 2023
56 minutes ago, apandey said: "Since the RAM is a potential suspect, have you tried running a long memtest session? You will need the download from the memtest86 site if it's ECC RAM."
Yeah, I ran memtest for about 12 hours a while back without any errors at all. This really seems to be a BTRFS issue. Obviously, my hardware is playing some part in it, but this RAM works fine in every other application. But when it comes to BTRFS, something isn't right.
I'm sort of at a point where I don't care about redundancy on my cache drive. It hasn't helped anyway. I've had my entire cache drives go corrupt and become unrecoverable, leading me to lose everything on them. So I'm tempted to just use XFS and have an aggressive backup schedule or something.
Hollandex (Author) Posted January 11, 2023
6 minutes ago, JorgeB said: "Never seen anything similar. I suggest backing up and re-formatting the pool, then monitoring for new errors. If new corruption is found, there's still some hardware issue, usually RAM related."
Yeah, I'm running Mover now to get everything off the cache so I can format it again. I might abandon BTRFS and go to XFS with an aggressive backup schedule or something. At the very least, I'm curious if I run into any data corruption issues outside of BTRFS.
I appreciate all your help. I believe you've been the one replying to every help thread I've started. Thank you.
Toepocalypse Posted December 27, 2023
Sorry to resurrect a dead thread - just checking in to see how things resolved. Did you move from btrfs to xfs for cache? Did that resolve corruption issues?
Best
Hollandex (Author) Posted December 29, 2023
On 12/26/2023 at 7:19 PM, Toepocalypse said: "Sorry to resurrect a dead thread - just checking in to see how things resolved. Did you move from btrfs to xfs for cache? Did that resolve corruption issues?"
I switched to a single drive using xfs, with an aggressive backup schedule. I haven't had a single corruption issue since I switched. I'd like to move to zfs at some point, but I don't see a reason to yet. If it ain't broke...