BTRFS error CSUM

hundsboog · March 8, 2022

Hey folks,

Recently I got an error on one device of my cache pool (2x2TB SSD).

Mar 8 14:29:12 Tower kernel: BTRFS warning (device sdi1): checksum error at logical 35017768960 on dev /dev/sdi1, physical 18878087168: metadata leaf (level 0) in tree 5

So I started a scrub and checked the box to correct errors when possible. The output in the logs is this:

Mar 8 14:29:12 Tower kernel: BTRFS error (device sdi1): bdev /dev/sdi1 errs: wr 0, rd 0, flush 0, corrupt 1, gen 0
Mar 8 14:29:13 Tower kernel: BTRFS error (device sdi1): unable to fixup (regular) error at logical 35017768960 on dev /dev/sdh1
Mar 8 14:29:13 Tower kernel: BTRFS error (device sdi1): unable to fixup (regular) error at logical 35017768960 on dev /dev/sdi1

So, since I have a cache pool of two identical devices, why do I have an uncorrectable error? Is there any way to solve this?

Thank you in advance!

P.S. I don't know if it matters, but I deleted orphaned docker images the day before I noticed it... And second, since I have a RAID1 cache pool with two devices, why can't the error be repaired?

Edited March 8, 2022 by hundsboog
addition

JorgeB · March 8, 2022

Checksum errors mean btrfs is detecting data corruption. Make sure your RAM is not overclocked, since that is a known source of data corruption with Ryzen/Threadripper, and/or run memtest.

hundsboog · March 8, 2022

@JorgeB, the page you linked was the source for my late-night work yesterday. I have some really sporadic lockups, maybe once a month. They are related to the Threadripper/RAM/C6 state problem. I really couldn't figure out a pattern for when it crashes, but it's always at little more than idle. Yesterday I tried another approach to get rid of it.

So, what I did was set the RAM manually back to the MHz your table suggests, set the power to "typical" and *enabled* the deep sleep feature, which could possibly lead to those crashes. I found this comment on reddit:

Quote

Enabling deep sleep in bios, seems to have resulted in idle/sleep functioning properly under linux kernels 4.14 and 4.17 without having to disable any of the C6, S3, and Power Supply Idle functions in kernel or bios.

Disabling the C6 states should then not be necessary. So far the server runs as expected, but I'm syslogging to the array anyway, because a lockup can happen anytime within that month...

So I think, as you suggested, this is caused by some RAM error. All RAM sticks are working; I memtested them without any errors. Is there a way to get rid of the BTRFS errors then? Move all data to the array, erase the pool and put the data back? Do I have to do that even though it is a cache pool with two disks in RAID1? Would the corrupted data also be copied, or is there a way to figure out which file is broken so I can delete it in advance?

Thank you for your patience!!

JorgeB · March 8, 2022

2 minutes ago, hundsboog said:

Disabling the C6 states should then not be necessary.

Correct. As mentioned in the link, just set the correct power supply idle control; you should disable C-states only if that option doesn't exist.

6 minutes ago, hundsboog said:

Is there a way to get rid of the BTRFS errors then?

After a scrub the corrupt file(s) will be listed in the syslog; delete them or replace them from backups.
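
For anyone who wants to pull the affected paths out of a long syslog, here is a minimal sketch in Python. The regular expression is modelled only on the scrub warnings quoted in this thread, so it is an assumption that other kernel versions word the message the same way; `corrupt_files` and `SAMPLE` are hypothetical names used for illustration. The paths are printed as they appear in the log, without the pool's mount point in front.

```python
import re

# Pattern modelled on the scrub warnings quoted in this thread; other kernel
# versions may word the message differently (assumption).
PATH_RE = re.compile(
    r"BTRFS warning \(device \S+\): checksum error at logical \d+ "
    r"on dev (?P<dev>\S+),.*\(path: (?P<path>.*)\)\s*$"
)


def corrupt_files(lines):
    """Map each reported file path to the sorted list of devices reporting it."""
    found = {}
    for line in lines:
        match = PATH_RE.search(line)
        if match:
            found.setdefault(match.group("path"), set()).add(match.group("dev"))
    return {path: sorted(devs) for path, devs in found.items()}


# Log lines copied from this thread: two data errors and one metadata error.
SAMPLE = [
    "Mar  8 14:54:28 Tower kernel: BTRFS warning (device sdi1): checksum error at logical 908941897728 on dev /dev/sdi1, physical 752108482560, root 5, inode 6129850, offset 3596288, length 4096, links 1 (path: Nextcloud/appdata_ocies0pc9lnb/preview/b/b/1/7/d/1/6/109600/3398-3398-max.png)",
    "Mar  8 14:54:28 Tower kernel: BTRFS warning (device sdi1): checksum error at logical 908941897728 on dev /dev/sdh1, physical 752087511040, root 5, inode 6129850, offset 3596288, length 4096, links 1 (path: Nextcloud/appdata_ocies0pc9lnb/preview/b/b/1/7/d/1/6/109600/3398-3398-max.png)",
    "Mar  8 16:42:09 Tower kernel: BTRFS warning (device sdi1): checksum error at logical 35017768960 on dev /dev/sdh1, physical 18857115648: metadata leaf (level 0) in tree 5",
]

if __name__ == "__main__":
    for path, devs in corrupt_files(SAMPLE).items():
        print(path, "reported on", ", ".join(devs))
```

To run it against a real log, pass an open file instead of `SAMPLE`, for example `corrupt_files(open(syslog_path))`, where `syslog_path` is wherever your syslog is stored. Warnings that carry no `(path: ...)` part, such as the metadata one in the sample, are skipped on purpose.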

hundsboog · March 8, 2022

6 minutes ago, JorgeB said:

Correct. As mentioned in the link, just set the correct power supply idle control; you should disable C-states only if that option doesn't exist.

After a scrub the corrupt file(s) will be listed in the syslog; delete them or replace them from backups.

So these should be the culprits?

Mar  8 14:54:28 Tower kernel: BTRFS warning (device sdi1): checksum error at logical 908941897728 on dev /dev/sdi1, physical 752108482560, root 5, inode 6129850, offset 3596288, length 4096, links 1 (path: Nextcloud/appdata_ocies0pc9lnb/preview/b/b/1/7/d/1/6/109600/3398-3398-max.png)
Mar  8 14:54:28 Tower kernel: BTRFS error (device sdi1): bdev /dev/sdi1 errs: wr 0, rd 0, flush 0, corrupt 2, gen 0
Mar  8 14:54:28 Tower kernel: BTRFS error (device sdi1): unable to fixup (regular) error at logical 908941897728 on dev /dev/sdi1
Mar  8 14:54:28 Tower kernel: BTRFS warning (device sdi1): checksum error at logical 908941897728 on dev /dev/sdh1, physical 752087511040, root 5, inode 6129850, offset 3596288, length 4096, links 1 (path: Nextcloud/appdata_ocies0pc9lnb/preview/b/b/1/7/d/1/6/109600/3398-3398-max.png)
Mar  8 14:54:28 Tower kernel: BTRFS error (device sdi1): bdev /dev/sdh1 errs: wr 0, rd 0, flush 0, corrupt 2, gen 0
Mar  8 14:54:28 Tower kernel: BTRFS error (device sdi1): unable to fixup (regular) error at logical 908941897728 on dev /dev/sdh1

@JorgeB thank you very much! It was an awesome lesson for me and I learned a lot! Hopefully this thread will also show other people how to fix BTRFS errors!! Thank you so much!

And I will report back on whether the server now runs stably with the config I set in the BIOS. Fingers crossed this was the screw I had to tighten to get the Mofo stable.... 😄

JorgeB · March 8, 2022

7 minutes ago, hundsboog said:

So these should be the culprits?

Yes.

hundsboog · March 8, 2022

Ok, I deleted those two preview files and started the scrub again, which did its thing. The errors belonging to them are completely gone.

This is now the final result, where I still need some assistance:

Mar  8 16:41:31 Tower ool www[877]: /usr/local/emhttp/plugins/dynamix/scripts/btrfs_scrub 'start' '/mnt/cache' ''
Mar  8 16:41:31 Tower kernel: BTRFS info (device sdi1): scrub: started on devid 1
Mar  8 16:41:31 Tower kernel: BTRFS info (device sdi1): scrub: started on devid 2
Mar  8 16:42:09 Tower kernel: BTRFS warning (device sdi1): checksum error at logical 35017768960 on dev /dev/sdh1, physical 18857115648: metadata leaf (level 0) in tree 5
Mar  8 16:42:09 Tower kernel: BTRFS warning (device sdi1): checksum error at logical 35017768960 on dev /dev/sdh1, physical 18857115648: metadata leaf (level 0) in tree 5
Mar  8 16:42:09 Tower kernel: BTRFS error (device sdi1): bdev /dev/sdh1 errs: wr 0, rd 0, flush 0, corrupt 3, gen 0
Mar  8 16:42:09 Tower kernel: BTRFS error (device sdi1): unable to fixup (regular) error at logical 35017768960 on dev /dev/sdh1
Mar  8 16:42:09 Tower kernel: BTRFS warning (device sdi1): checksum error at logical 35017768960 on dev /dev/sdi1, physical 18878087168: metadata leaf (level 0) in tree 5
Mar  8 16:42:09 Tower kernel: BTRFS warning (device sdi1): checksum error at logical 35017768960 on dev /dev/sdi1, physical 18878087168: metadata leaf (level 0) in tree 5
Mar  8 16:42:09 Tower kernel: BTRFS error (device sdi1): bdev /dev/sdi1 errs: wr 0, rd 0, flush 0, corrupt 3, gen 0
Mar  8 16:42:09 Tower kernel: BTRFS error (device sdi1): unable to fixup (regular) error at logical 35017768960 on dev /dev/sdi1

JorgeB · March 8, 2022

That's metadata corruption; for this it's best to back up and re-format the pool.
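
The difference is visible in the log lines themselves: the earlier warnings end with a `(path: ...)` naming a file, while the remaining ones say `metadata leaf` and name no file. A minimal sketch in Python that sorts checksum warnings into those two groups is below. The pattern is based only on the lines quoted in this thread (an assumption about the message format), and `classify`, `summarize` and `FINAL_SCRUB` are hypothetical names used for illustration.

```python
import re

# Captures the logical address, the device and whatever follows the physical
# address; format assumed from the log lines quoted in this thread.
CSUM_RE = re.compile(
    r"checksum error at logical (\d+) on dev (\S+), physical \d+(.*)$"
)


def classify(line):
    """Return (logical, device, kind) for a checksum warning, else None."""
    match = CSUM_RE.search(line)
    if match is None:
        return None
    logical, dev, rest = int(match.group(1)), match.group(2), match.group(3)
    if rest.startswith(": metadata"):
        kind = "metadata"  # no file to delete
    elif "(path:" in rest:
        kind = "data"      # the named file can be deleted or restored
    else:
        kind = "unknown"
    return logical, dev, kind


def summarize(lines):
    """Group warnings by logical address: {logical: (kind, sorted devices)}."""
    grouped = {}
    for line in lines:
        parsed = classify(line)
        if parsed is None:
            continue
        logical, dev, kind = parsed
        grouped.setdefault(logical, (kind, set()))[1].add(dev)
    return {addr: (kind, sorted(devs)) for addr, (kind, devs) in grouped.items()}


# Warnings copied from the final scrub in this thread.
FINAL_SCRUB = [
    "Mar  8 16:42:09 Tower kernel: BTRFS warning (device sdi1): checksum error at logical 35017768960 on dev /dev/sdh1, physical 18857115648: metadata leaf (level 0) in tree 5",
    "Mar  8 16:42:09 Tower kernel: BTRFS warning (device sdi1): checksum error at logical 35017768960 on dev /dev/sdi1, physical 18878087168: metadata leaf (level 0) in tree 5",
]

if __name__ == "__main__":
    for addr, (kind, devs) in summarize(FINAL_SCRUB).items():
        print(addr, kind, devs)
```

When the same logical address shows up as `metadata` on both devices of the pool, as it does here, neither copy passes its checksum, which matches the "unable to fixup" lines in the log.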
