Simom Posted January 30, 2023

Hello! I am running a RAID1 SATA SSD cache pool and I am getting some BTRFS errors:

Jan 29 23:57:12 Turing kernel: BTRFS warning (device sdd1): csum failed root 5 ino 3599 off 2499960832 csum 0x60341ddd expected csum 0x88e58ce3 mirror 2
Jan 29 23:57:12 Turing kernel: BTRFS error (device sdd1): bdev /dev/sdd1 errs: wr 0, rd 0, flush 0, corrupt 1, gen 0
Jan 29 23:57:12 Turing kernel: BTRFS warning (device sdd1): csum failed root 5 ino 3599 off 2499964928 csum 0x1470dccc expected csum 0x8188ffff mirror 2
Jan 29 23:57:12 Turing kernel: BTRFS error (device sdd1): bdev /dev/sdd1 errs: wr 0, rd 0, flush 0, corrupt 2, gen 0
Jan 29 23:57:12 Turing kernel: BTRFS warning (device sdd1): csum failed root 5 ino 3599 off 2499960832 csum 0x60341ddd expected csum 0x88e58ce3 mirror 1
Jan 29 23:57:12 Turing kernel: BTRFS error (device sdd1): bdev /dev/sde1 errs: wr 0, rd 0, flush 0, corrupt 1, gen 0
Jan 29 23:57:12 Turing kernel: BTRFS warning (device sdd1): csum failed root 5 ino 3599 off 2499964928 csum 0x1470dccc expected csum 0x8188ffff mirror 1
Jan 29 23:57:12 Turing kernel: BTRFS error (device sdd1): bdev /dev/sde1 errs: wr 0, rd 0, flush 0, corrupt 2, gen 0
Jan 29 23:57:12 Turing kernel: BTRFS warning (device sdd1): csum failed root 5 ino 3599 off 2499960832 csum 0x60341ddd expected csum 0x88e58ce3 mirror 2
Jan 29 23:57:12 Turing kernel: BTRFS error (device sdd1): bdev /dev/sdd1 errs: wr 0, rd 0, flush 0, corrupt 3, gen 0
Jan 29 23:57:12 Turing kernel: BTRFS warning (device sdd1): csum failed root 5 ino 3599 off 2499964928 csum 0x1470dccc expected csum 0x8188ffff mirror 2
Jan 29 23:57:12 Turing kernel: BTRFS error (device sdd1): bdev /dev/sdd1 errs: wr 0, rd 0, flush 0, corrupt 4, gen 0

As I had some issues with this pool (and another one) some time ago with similar-looking logs, I decided to format both drives about 2 days ago. I recreated the cache pool and moved the data back, and less than 48h later I got the errors above. I would be thankful if someone has an idea or can point me in a direction! (Diagnostics are attached.)

turing-diagnostics-20230130-0205.zip
JorgeB Posted January 30, 2023

One of the pool devices dropped offline in the past:

Jan 28 05:43:28 Turing kernel: BTRFS info (device nvme1n1p1): bdev /dev/nvme1n1p1 errs: wr 364810, rd 272, flush 32498, corrupt 169, gen 0

See here for more info and better pool monitoring.
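(The monitoring JorgeB links to is a forum-provided script; as a rough sketch of the idea only, and not the actual script, something like the following can poll `btrfs dev stats` and flag any nonzero error counter. The mount point /mnt/cache is an assumption for illustration.)

```shell
#!/bin/bash
# Sketch of a pool-error check, assuming the pool is mounted at /mnt/cache.
# "btrfs dev stats" prints one counter per line, e.g.:
#   [/dev/sdd1].corruption_errs 4
# Any nonzero value means btrfs has logged an error since the counters
# were last reset.
check_stats() {
  # Read stats on stdin; print nonzero counters, exit 0 only if any were found.
  awk '$NF != 0 { print; found = 1 } END { exit !found }'
}

# On the server itself:
if command -v btrfs >/dev/null 2>&1; then
  btrfs dev stats /mnt/cache | check_stats && echo "WARNING: pool has logged errors"
fi
```

Run from cron or the User Scripts plugin this gives a notification hook point; the script linked in the forum is the authoritative version.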
Simom Posted January 30, 2023 (Author)

Hey, thanks for the response. You are kinda right, but that error is from another cache pool (my NVMe one); I am currently working on clearing the pool you mentioned so I can scrub and format it. As far as I understand, the error you mentioned shouldn't be connected to the one I posted about, because the devices are part of different pools, or am I missing something?
JorgeB Posted January 30, 2023

4 minutes ago, Simom said: "As far as I understand the error you mentioned shouldn't be connected to the one I posted"

Correct, I missed that you had two pools. Unexpected csum errors can be the result of RAM issues, so I suggest running memtest. I still recommend the monitoring script for both pools.
Simom Posted January 30, 2023 (Author)

Alright, I will try that! As I said, I had a similar issue to this one some time ago but have swapped CPU, MoBo and RAM since then. Nevertheless, thanks for the help!
trurl Posted January 30, 2023

31 minutes ago, JorgeB said: "unexpected csum errors can be the result of RAM issues"

23 minutes ago, Simom said: "swapped CPU, MoBo and RAM"

In which case, the errors could have been the result of previous bad RAM. In any case,

32 minutes ago, JorgeB said: "suggest running memtest"
Simom Posted January 31, 2023 (Author)

18 hours ago, trurl said: "In which case, the errors could have been the result of previous bad RAM."

I think I didn't clearly state my previous troubleshooting steps, and this is leading to some confusion. So, all in order:

1. I had some csum errors like the one in my first comment
2. I swapped systems with new CPU, MoBo and RAM
3. I unassigned both drives, formatted them and created a new pool
4. Not even 48h later I got a new csum error

As I created a new pool, this still might be a problem with my current RAM, but not with my old one, or am I missing something? Memtest is running, nothing found so far. Any advice on how long I should leave it running (I read 24-48 hours somewhere)?
JorgeB Posted January 31, 2023

3 minutes ago, Simom said: "not even 48h later I get a new csum error"

That suggests there's still a problem, usually RAM related.
Simom Posted February 3, 2023 (Author)

Just wanted to get back to this: memtest has been running for over 72 hours without finding anything. Are there any other troubleshooting steps I can try?
JorgeB Posted February 3, 2023

Remove one of the RAM sticks, scrub the pool and, if no errors are found, reset the filesystem stats, then work normally. If new errors appear, try with just the other stick.
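(For reference, a sketch of that sequence on the command line; the mount point /mnt/cache is an assumption, and Unraid's GUI scrub button does the same job.)

```shell
# With one RAM stick removed:
btrfs scrub start -B /mnt/cache    # -B: run in the foreground and print a summary
btrfs scrub status /mnt/cache      # confirm no csum errors were found
btrfs device stats -z /mnt/cache   # -z: print the error counters, then reset them to zero
# ...then use the system normally; if new csum errors appear, swap in the other stick.
```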
Simom Posted February 3, 2023 (Author)

Thanks for the quick response! I will try that and see how it goes.
mrpainnogain Posted February 3

On 2/4/2023 at 2:16 AM, Simom said: "Thanks for the quick response! I will try that and see how it goes."

Hey, I am having similar issues with a RAID1 config. How did you manage to move the data from the cache to the array and move it back again? I am planning to re-format the NVMe disks.
trurl Posted February 3

Nothing can move open files. Disable Docker and VM Manager in Settings. Dynamix File Manager will let you work directly with the disks and pools on the server.
Simom Posted February 3 (Author, marked as Solution)

Just realized that I never followed up on this: I switched from macvlan to ipvlan for my Docker containers and that seems to have fixed it. No crashing, no corruption since then. (I guess the macvlan issues led to kernel panics, which led to the corruption of the files; but I am no expert.)

p.s. I also read that there have been changes to macvlan.
mrpainnogain Posted February 3

3 hours ago, trurl said: "Nothing can move open files. Disable Docker and VM Manager in Settings. Dynamix File Manager will let you work directly with the disks and pools on the server."

So after I disable Docker and the VMs, I can move the files from my cache to the array? And after formatting the cache, can I move the files back and use the VMs and Docker without additional settings?
Simom Posted February 3 (Author)

In theory, yes. You should also make sure that no one else is accessing the files over SMB, NFS or whatever before starting to move them. But I would highly advise checking that the filesystem runs as expected before moving important data back to the cache.
mrpainnogain Posted February 4

8 hours ago, Simom said: "In theory, yes. You should also make sure that no one else is accessing the files over SMB, NFS or whatever before starting to move them. But I would highly advise checking that the filesystem runs as expected before moving important data back to the cache."

Ok, also do I need to re-direct the VM vdisk directory and other things for the VMs? Or can I just hit the play button afterwards?