ryoko227 Posted September 27, 2018

This and multiple variants of this error have been spamming my system log since updating to 6.6.0:

BTRFS warning (device sdb1): csum failed root 5 ino 274 off 62905892864 csum 0x2d8eafc5 expected csum 0xea417956 mirror 1

The device in question is a single SSD mounted outside the array via the go file at startup. Running stats on the device gives:

[/dev/sdb1].write_io_errs 0
[/dev/sdb1].read_io_errs 0
[/dev/sdb1].flush_io_errs 0
[/dev/sdb1].corruption_errs 0
[/dev/sdb1].generation_errs 0

Scrubbing the device also reports 0 errors:

scrub started at Thu Sep 27 14:24:13 2018 and finished after 00:05:14
total bytes scrubbed: 153.88GiB with 0 errors

I have copied all of the data off and reformatted the device. (This did have the added benefit of the UD plug-in now showing the FS, temp, and capacity.) However, the errors continue. I've read quite a few posts related to this, but still haven't found a solution on my own. I admittedly don't know enough about the btrfs file system to understand which file(s)/metadata are causing the error. Any help would be greatly appreciated. Thank you in advance o/

yes-mediaserver-diagnostics-20180927-1436.zip
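For future readers: the output above comes from the standard btrfs-progs tooling. A sketch of the commands that produce it (the device node and mount point here are illustrative, adjust them to your own system):

```shell
# Per-device I/O and corruption counters (the [/dev/sdb1].* lines above)
btrfs device stats /dev/sdb1

# Run a full scrub in the foreground (-B) and print the summary when done
btrfs scrub start -B /mnt/ssd

# The "csum failed" warnings land in the kernel log; ino/off identify
# the affected inode and the byte offset within it
dmesg | grep 'csum failed'
```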
JorgeB Posted September 27, 2018

Those are checksum errors, and if you keep getting them after reformatting there's likely a hardware problem, like bad RAM.
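To illustrate the point: a data checksum is recorded at write time and recompared at read time, so any bit that changes in between (bad RAM, a failing controller, etc.) is caught. A toy sketch of that idea, using sha256sum on an ordinary file purely for illustration (btrfs actually checksums each data block with crc32c, and the file paths here are made up):

```shell
# Write the data and record a checksum, as btrfs does per block at write time
printf 'important data' > /tmp/block.dat
sum_at_write=$(sha256sum /tmp/block.dat | cut -d' ' -f1)

# Simulate a single flipped bit (e.g. from bad RAM) in the stored copy
printf 'imporTant data' > /tmp/block.dat

# On read, recompute and compare; a mismatch is a checksum error
sum_at_read=$(sha256sum /tmp/block.dat | cut -d' ' -f1)
if [ "$sum_at_write" != "$sum_at_read" ]; then
    echo "csum failed"   # analogous to the kernel warning in the OP
fi
```

Note the checksum only detects the corruption; without a second good copy there is nothing to repair from, which is why a single-device filesystem can only report the error.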
ryoko227 Posted September 28, 2018 (edited)

Should I try running memtest or something to troubleshoot it further? I'm thinking that might help with sorting out which piece of hardware is causing it. I'm also going to check whether there is a new BIOS for this specific motherboard. Thank you as always for your help johnnie!

EDIT - Updated the BIOS, as it specifically listed increased memory compatibility. Also took the time to clean the dust out and reseat everything. Still popping the error. I'll run memtest on it later when people aren't using it and see if that identifies the issue.

Edited September 28, 2018 by ryoko227
JorgeB Posted September 28, 2018

You should run memtest. Note that existing errors can't be fixed; you'll need to reformat, or replace all the corrupt files.
ryoko227 Posted October 23, 2018 (edited)

Just a follow-up to this, since I thought it was weird, and also so that when I forget what I did I can check back later, www

So I never got around to running the memtest, partially out of laziness, mostly out of not wanting to stay after hours. I noticed yesterday that the mount point for the drive no longer seemed to have any files or folders in it. So I tried copying the backups over just to see, and it said the disk was full... I decided I would try reformatting the drive (again, just to see), pulled the mount commands from the go file, and rebooted. The drive showed up totally fine in Unassigned Devices, could be mounted with no issues, and all the files and folders were still there. So I set it to automount, changed my VM settings to point to this new mount point, and profit.

EDIT - Even with multiple reboots nothing has changed... weird. Everything seems to be running fine and I have had 0 checksum errors since. Keep in mind, I did the 6.6.3 update right before all of this. So TBH, I don't know what "fixed" it, or even what was ultimately wrong with it. I know I should still run the memtest to verify.

Edited October 23, 2018 by ryoko227
Marshalleq Posted February 25, 2020

Hey, did you ever find out anything further? I've just started getting these too. Personally I suspect it's a typical BTRFS issue - I've heard more than one person say BTRFS often corrupts. But hey, what do I know. It's only BTRFS complaining though, not XFS and not ZFS.
JorgeB Posted February 25, 2020

7 hours ago, Marshalleq said:
I've heard more than one person say BTRFS often corrupts.

I won't argue that btrfs doesn't have its bugs, but I've been using it for a long time, as well as following development on the mailing list, and I've never heard of any data-checksum-related bug; that feature is pretty much bulletproof. That is, a checksum error means the data no longer matches the checksum stored at write time. In Unraid this happens most often with raid-based pools when one of the members drops offline and then comes back online: the old data on that member will be stale and fail checksums, and a scrub will bring it up to date. But if this is happening on a single-device filesystem, then you can be pretty sure data corruption occurred, or there's a hardware problem, like bad RAM.

7 hours ago, Marshalleq said:
It's only BTRFS complaining though, not XFS

Well, XFS would never complain, since it doesn't checksum data; it will happily feed you corrupted data. I also use ZFS on a couple of servers; it's no doubt more stable than btrfs, but it's not perfect, and not as flexible.
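The stale-mirror case described above can be sketched with a toy model: two file "copies" stand in for the raid1 mirror members, and sha256sum again stands in for btrfs's per-block crc32c (everything here is illustrative, not how btrfs stores things on disk):

```shell
# Two mirror copies of the same block, plus a checksum recorded at write time
printf 'payload' > /tmp/copy_a
printf 'payload' > /tmp/copy_b
good_sum=$(sha256sum /tmp/copy_a | cut -d' ' -f1)

# Copy A goes stale, e.g. its device dropped offline and missed a write
printf 'payl0ad' > /tmp/copy_a

# Scrub: verify each copy against the stored checksum and rewrite any
# failing copy from a mirror member that still passes
if [ "$(sha256sum /tmp/copy_a | cut -d' ' -f1)" != "$good_sum" ]; then
    cp /tmp/copy_b /tmp/copy_a
fi
```

This is why a scrub can heal a mirror but not a single device: the repair needs a second copy that still matches the checksum.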
Marshalleq Posted February 25, 2020

4 hours ago, johnnie.black said:
I won't argue that btrfs doesn't have its bugs, if this is happening on a single device filesystem then you can be pretty sure data corruption occurred, or there's a hardware problem, like bad RAM.

It's a BTRFS mirror. I wouldn't have chosen BTRFS if I'd had any other choice. Over the few years I've been using it I've had 3, maybe 4 issues, all different: I think one other where I needed to format and start again, and two related to the mirror not being created as advertised - that's probably an Unraid bug more than a btrfs bug though.

4 hours ago, johnnie.black said:
Well, XFS would never complain, since it doesn't checksum data, it will happily feed you corrupted data.

The context was provided to indicate it wasn't likely a whole-system issue impacting all disks, as one other person on here seemed to think when I dug back into the archives. I'll just invoke the mover and reformat it. I'm half tempted to get rid of the mirror altogether, as I think a single disk will have fewer issues than BTRFS will. But provided it doesn't actually corrupt my data, I'll keep the mirror. Though the jury's out on that one.
JorgeB Posted February 25, 2020

57 minutes ago, Marshalleq said:
It's a BTRFS mirror.

Then, as mentioned, the most likely reason for checksum errors would be one of the devices having dropped offline for some time and then rejoined the pool; if you post the diagnostics we can confirm whether that was the case.
Marshalleq Posted February 25, 2020

Thanks, but it has impacted both the BTRFS mirror and the btrfs docker image. I assume it was caused when I had to force-restart the box the other day due to an Nvidia lockup. I don't really expect otherwise, but btrfs does seem to be pickier about such things, and I would have thought a scrub or repair would have sorted it. But no.
JorgeB Posted February 25, 2020

1 hour ago, Marshalleq said:
on the mirror and the btrfs on the docker image.

Just FYI, the docker image share on Unraid is NOCOW by default, so checksums are disabled there and any corruption can't be fixed with a scrub.
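For anyone wanting to check this on their own system: NOCOW is the `C` file attribute, visible with lsattr. Note that `chattr +C` only takes effect on newly created or empty files, so it can't be toggled on an existing docker image in place; it is usually set on the containing directory so new files inherit it. The paths below are illustrative, not necessarily where your docker image lives:

```shell
# Show attributes; a 'C' among the flags means NOCOW (no CoW, no data checksums)
lsattr /mnt/cache/system/docker/docker.img

# The attribute is typically set on the containing directory, so that
# files created inside it inherit NOCOW
lsattr -d /mnt/cache/system/docker
chattr +C /mnt/cache/system/docker
```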
Marshalleq Posted February 26, 2020 (edited)

Thanks, the saga continues. Fixed all this up, but now getting "transport endpoint is not connected" on /mnt/user. Also lots of segfaults when docker tries to run, and the shares have all disappeared. Fun times. I officially dislike BTRFS - even though I recognise it may not be entirely the fault of BTRFS, I have to blame something until I know better!

Edited February 26, 2020 by Marshalleq
JorgeB Posted March 2, 2020

On 2/26/2020 at 7:37 PM, Marshalleq said:
I officially dislike BTRFS - even though I recognise it may not be entirely the fault of BTRFS - I have to blame something until I know better!

Just for any future reader: as suspected, this wasn't a btrfs problem: https://forums.unraid.net/topic/41333-zfs-plugin-for-unraid/?do=findComment&comment=828132

On 2/29/2020 at 7:21 PM, Marshalleq said:
Edit 2: Memtest confirms I have faulty, or possibly misconfigured, memory. There goes my morning....
Marshalleq Posted March 2, 2020

Yep, your gut was right, it was a faulty memory stick. Through this exercise I have learnt the following:

I don't know how long the memory was faulty - my assumption is many months; the fault was at about 79GB, so it may not always have been used. Also that:

- All file systems were impacted.
- The BTRFS file system had to be formatted to be recovered, because it wouldn't rebalance.
- The BTRFS docker image had to be deleted and recreated (or an older version restored from backup), because it wouldn't repair.
- ZFS pointed me directly at the corrupted file (on my single-disk ZFS volume) so I could restore it, which was nice.
- The ZFS mirror healed with a simple scrub.
- XFS of course just ran an fsck-type thing, so something could still be lingering, but that data is not very important; that's why it's on XFS.

I'd like something more robust, but short of memory errors like this and cold reboots, it's probably pretty safe for what it is. I do believe BTRFS will also point me at the corrupted file; maybe it did, my memory on that is struggling.

It has been a good exercise. It's possible someone with more knowledge of BTRFS could have fixed it, though a corrupted image file didn't sound very fixable to me, so I elected to start again because I don't trust it based on past experience. I could be completely wrong, but that's where I landed.