hundsboog, posted March 8, 2022 (edited)

Hey folks,

I recently got an error on one device of my cache pool (2x 2TB SSD):

Mar 8 14:29:12 Tower kernel: BTRFS warning (device sdi1): checksum error at logical 35017768960 on dev /dev/sdi1, physical 18878087168: metadata leaf (level 0) in tree 5

So I started a scrub and checked the box to correct errors where possible. The output in the logs is this:

Mar 8 14:29:12 Tower kernel: BTRFS error (device sdi1): bdev /dev/sdi1 errs: wr 0, rd 0, flush 0, corrupt 1, gen 0
Mar 8 14:29:13 Tower kernel: BTRFS error (device sdi1): unable to fixup (regular) error at logical 35017768960 on dev /dev/sdh1
Mar 8 14:29:13 Tower kernel: BTRFS error (device sdi1): unable to fixup (regular) error at logical 35017768960 on dev /dev/sdi1

Since I have a cache pool of two identical devices, why do I have an uncorrectable error? Is there any way to solve this?

Thank you in advance!

P.S. I don't know if it matters, but I deleted orphaned Docker images the day before I noticed it.

And second, since I have a RAID1 cache pool with two devices, why can't the error be repaired?

JorgeB, posted March 8, 2022

Checksum errors mean btrfs is detecting data corruption. Make sure your RAM is not overclocked, since that is a known source of data corruption with Ryzen/Threadripper, and/or run memtest.

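To illustrate why a mirrored pool can still report an uncorrectable error, here is a toy model in Python. It is only a sketch under stated assumptions: `write_raid1` and `scrub` are hypothetical helpers, not btrfs code, and `zlib.crc32` merely stands in for the filesystem's real checksum. The idea is that if a block is corrupted in memory after its checksum is computed but before it is written, both mirrors receive the same bad copy, so neither can repair the other.

```python
import zlib


def write_raid1(block: bytes, flip_bit_in_ram: bool = False):
    """Toy RAID1 write: checksum first, then mirror one buffer to two devices."""
    checksum = zlib.crc32(block)   # checksum taken from the clean data
    buf = bytearray(block)
    if flip_bit_in_ram:
        buf[0] ^= 0x01             # simulated RAM bit flip before the write
    dev_a = bytes(buf)             # both mirrors are written from the same buffer
    dev_b = bytes(buf)
    return checksum, dev_a, dev_b


def scrub(checksum: int, dev_a: bytes, dev_b: bytes) -> str:
    """Simplified scrub verdict: 'ok', 'repaired', or 'uncorrectable'."""
    good_a = zlib.crc32(dev_a) == checksum
    good_b = zlib.crc32(dev_b) == checksum
    if good_a and good_b:
        return "ok"
    if good_a or good_b:
        return "repaired"          # one good copy can overwrite the bad one
    return "uncorrectable"         # no good copy is left to repair from
```

Calling `scrub(*write_raid1(b"some block", flip_bit_in_ram=True))` gives the uncorrectable case, which is consistent with the "unable to fixup" messages appearing for both /dev/sdh1 and /dev/sdi1 at the same logical address.
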
hundsboog, posted March 8, 2022

@JorgeB, the page you linked was my source for my late-night work yesterday. I have some really sporadic lockups, maybe once a month. This seems to be related to the Threadripper/RAM/C6 state problem. I couldn't figure out a pattern for when it crashes, but it's always at little more than idle.

Yesterday I tried another approach to get rid of it. I manually set the RAM back to the MHz your table suggests, set the power to "typical", and *enabled* the deep sleep feature, which could possibly lead to those crashes. I found this comment on reddit:

Quote: Enabling deep sleep in bios, seems to have resulted in idle/sleep functioning properly under linux kernels 4.14 and 4.17 without having to disable any of the C6, S3, and Power Supply Idle functions in kernel or bios.

Disabling the C6 states should then not be necessary. So far the server runs as expected, but I'm writing the syslog to the array anyway, because a lockup can happen anytime within that month...

So I think, as you suggested, this is caused by some RAM error. All RAM sticks are working; I ran memtest on them without any errors.

Is there a way to get rid of the BTRFS errors then? Move all data to the array, erase the pool, and put the data back? Do I have to do that, even though it is a cache pool with two disks in RAID1? Would the corrupted data also be copied, or is there a way to figure out which file is broken so I can delete it in advance?

Thank you for your patience!!

JorgeB, posted March 8, 2022

hundsboog said: Disabling the C6 states should then not be necessary.

Correct. As mentioned in the link, just set the correct power supply idle control; only if that option doesn't exist should you disable C-states.

hundsboog said: Is there a way to get rid of the BTRFS errors then?

After a scrub, the corrupt file(s) will be listed in the syslog. Delete them or replace them from backups.

hundsboog, posted March 8, 2022

JorgeB said: Correct. As mentioned in the link, just set the correct power supply idle control; only if that option doesn't exist should you disable C-states. After a scrub, the corrupt file(s) will be listed in the syslog. Delete them or replace them from backups.

So these should be the culprits?

Mar 8 14:54:28 Tower kernel: BTRFS warning (device sdi1): checksum error at logical 908941897728 on dev /dev/sdi1, physical 752108482560, root 5, inode 6129850, offset 3596288, length 4096, links 1 (path: Nextcloud/appdata_ocies0pc9lnb/preview/b/b/1/7/d/1/6/109600/3398-3398-max.png)
Mar 8 14:54:28 Tower kernel: BTRFS error (device sdi1): bdev /dev/sdi1 errs: wr 0, rd 0, flush 0, corrupt 2, gen 0
Mar 8 14:54:28 Tower kernel: BTRFS error (device sdi1): unable to fixup (regular) error at logical 908941897728 on dev /dev/sdi1
Mar 8 14:54:28 Tower kernel: BTRFS warning (device sdi1): checksum error at logical 908941897728 on dev /dev/sdh1, physical 752087511040, root 5, inode 6129850, offset 3596288, length 4096, links 1 (path: Nextcloud/appdata_ocies0pc9lnb/preview/b/b/1/7/d/1/6/109600/3398-3398-max.png)
Mar 8 14:54:28 Tower kernel: BTRFS error (device sdi1): bdev /dev/sdh1 errs: wr 0, rd 0, flush 0, corrupt 2, gen 0
Mar 8 14:54:28 Tower kernel: BTRFS error (device sdi1): unable to fixup (regular) error at logical 908941897728 on dev /dev/sdh1

@JorgeB thank you very much! This was an awesome lesson for me and I learned a lot. Hopefully this thread will also show other people how to fix BTRFS errors. I will report back on whether the server now runs stable with the BIOS config I set. Fingers crossed that this was the screw I had to tighten to get the thing stable.... 😄

JorgeB, posted March 8, 2022

hundsboog said: So these should be the culprits?

Yes.

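Picking the affected files out of a long syslog by eye gets tedious, so the lookup can be sketched in Python. This is a minimal sketch, not an official tool: the regular expression is written only against the warning lines quoted in this thread, and `corrupt_paths` is a hypothetical helper name. The reported paths appear to be relative to the pool's mount point (`/mnt/cache` in this thread's scrub command).

```python
import re

# Matches data checksum warnings of the form quoted in this thread, e.g.
# "... BTRFS warning (device sdi1): checksum error at logical N on dev /dev/sdi1, ... (path: some/file)"
PATH_RE = re.compile(
    r"BTRFS warning \(device \S+\): checksum error at logical (?P<logical>\d+) "
    r"on dev (?P<dev>\S+),.*\(path: (?P<path>.+)\)\s*$"
)


def corrupt_paths(lines):
    """Map each affected file path to the set of devices that reported it."""
    result = {}
    for line in lines:
        match = PATH_RE.search(line)
        if match:
            result.setdefault(match.group("path"), set()).add(match.group("dev"))
    return result
```

For example, `corrupt_paths(open("/var/log/syslog"))` would return one entry per affected file, assuming the syslog lives at that path on your system. A file that is reported on both devices has no good copy left, so it has to be deleted or restored from backup. Metadata errors carry no `(path: ...)` part and are deliberately not matched here.
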
hundsboog, posted March 8, 2022

OK, I deleted those two preview files and started the scrub again, which did its thing. The errors belonging to those files are completely gone. This is now the final result, where I still need some assistance:

Mar 8 16:41:31 Tower ool www[877]: /usr/local/emhttp/plugins/dynamix/scripts/btrfs_scrub 'start' '/mnt/cache' ''
Mar 8 16:41:31 Tower kernel: BTRFS info (device sdi1): scrub: started on devid 1
Mar 8 16:41:31 Tower kernel: BTRFS info (device sdi1): scrub: started on devid 2
Mar 8 16:42:09 Tower kernel: BTRFS warning (device sdi1): checksum error at logical 35017768960 on dev /dev/sdh1, physical 18857115648: metadata leaf (level 0) in tree 5
Mar 8 16:42:09 Tower kernel: BTRFS warning (device sdi1): checksum error at logical 35017768960 on dev /dev/sdh1, physical 18857115648: metadata leaf (level 0) in tree 5
Mar 8 16:42:09 Tower kernel: BTRFS error (device sdi1): bdev /dev/sdh1 errs: wr 0, rd 0, flush 0, corrupt 3, gen 0
Mar 8 16:42:09 Tower kernel: BTRFS error (device sdi1): unable to fixup (regular) error at logical 35017768960 on dev /dev/sdh1
Mar 8 16:42:09 Tower kernel: BTRFS warning (device sdi1): checksum error at logical 35017768960 on dev /dev/sdi1, physical 18878087168: metadata leaf (level 0) in tree 5
Mar 8 16:42:09 Tower kernel: BTRFS warning (device sdi1): checksum error at logical 35017768960 on dev /dev/sdi1, physical 18878087168: metadata leaf (level 0) in tree 5
Mar 8 16:42:09 Tower kernel: BTRFS error (device sdi1): bdev /dev/sdi1 errs: wr 0, rd 0, flush 0, corrupt 3, gen 0
Mar 8 16:42:09 Tower kernel: BTRFS error (device sdi1): unable to fixup (regular) error at logical 35017768960 on dev /dev/sdi1

JorgeB, posted March 8, 2022

That's metadata corruption. For this, it's best to back up and re-format the pool.

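To tell this case apart from ordinary file corruption after a scrub, the remaining warnings can be checked for the metadata form. The snippet below is a rough sketch under assumptions: the pattern is modelled on the "metadata leaf (level 0) in tree 5" lines quoted in this thread (with `node` allowed as an assumed variant for non-leaf blocks), and `metadata_errors` is a hypothetical helper name.

```python
import re

# Matches metadata checksum warnings of the form quoted in this thread, e.g.
# "checksum error at logical N on dev /dev/sdh1, physical M: metadata leaf (level 0) in tree 5"
META_RE = re.compile(
    r"checksum error at logical (?P<logical>\d+) on dev (?P<dev>\S+), "
    r"physical \d+: metadata (?:leaf|node) \(level \d+\) in tree (?P<tree>\d+)"
)


def metadata_errors(lines):
    """Return sorted unique (logical, device, tree) tuples for metadata errors."""
    found = set()
    for line in lines:
        match = META_RE.search(line)
        if match:
            found.add(
                (int(match.group("logical")), match.group("dev"), int(match.group("tree")))
            )
    return sorted(found)
```

If this list is still non-empty after the affected files have been deleted, there is no file to remove, and the advice above applies: back up the data and re-format the pool.
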