Simom Posted January 30, 2023

Hello! I am running a RAID1 SATA SSD cache pool and I am getting some BTRFS errors:

Jan 29 23:57:12 Turing kernel: BTRFS warning (device sdd1): csum failed root 5 ino 3599 off 2499960832 csum 0x60341ddd expected csum 0x88e58ce3 mirror 2
Jan 29 23:57:12 Turing kernel: BTRFS error (device sdd1): bdev /dev/sdd1 errs: wr 0, rd 0, flush 0, corrupt 1, gen 0
Jan 29 23:57:12 Turing kernel: BTRFS warning (device sdd1): csum failed root 5 ino 3599 off 2499964928 csum 0x1470dccc expected csum 0x8188ffff mirror 2
Jan 29 23:57:12 Turing kernel: BTRFS error (device sdd1): bdev /dev/sdd1 errs: wr 0, rd 0, flush 0, corrupt 2, gen 0
Jan 29 23:57:12 Turing kernel: BTRFS warning (device sdd1): csum failed root 5 ino 3599 off 2499960832 csum 0x60341ddd expected csum 0x88e58ce3 mirror 1
Jan 29 23:57:12 Turing kernel: BTRFS error (device sdd1): bdev /dev/sde1 errs: wr 0, rd 0, flush 0, corrupt 1, gen 0
Jan 29 23:57:12 Turing kernel: BTRFS warning (device sdd1): csum failed root 5 ino 3599 off 2499964928 csum 0x1470dccc expected csum 0x8188ffff mirror 1
Jan 29 23:57:12 Turing kernel: BTRFS error (device sdd1): bdev /dev/sde1 errs: wr 0, rd 0, flush 0, corrupt 2, gen 0
Jan 29 23:57:12 Turing kernel: BTRFS warning (device sdd1): csum failed root 5 ino 3599 off 2499960832 csum 0x60341ddd expected csum 0x88e58ce3 mirror 2
Jan 29 23:57:12 Turing kernel: BTRFS error (device sdd1): bdev /dev/sdd1 errs: wr 0, rd 0, flush 0, corrupt 3, gen 0
Jan 29 23:57:12 Turing kernel: BTRFS warning (device sdd1): csum failed root 5 ino 3599 off 2499964928 csum 0x1470dccc expected csum 0x8188ffff mirror 2
Jan 29 23:57:12 Turing kernel: BTRFS error (device sdd1): bdev /dev/sdd1 errs: wr 0, rd 0, flush 0, corrupt 4, gen 0

As I had some issues with this pool (and another one) some time ago with similar-looking logs, I decided to format both drives about 2 days ago. I recreated the cache pool and moved the data back, and less than 48h later I got the errors above. I would be thankful if someone has an idea or can point me in a direction! (Diagnostics are attached.)

turing-diagnostics-20230130-0205.zip
JorgeB Posted January 30, 2023

One of the pool devices dropped offline in the past:

Jan 28 05:43:28 Turing kernel: BTRFS info (device nvme1n1p1): bdev /dev/nvme1n1p1 errs: wr 364810, rd 272, flush 32498, corrupt 169, gen 0

See here for more info and better pool monitoring.
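(The monitoring JorgeB links to is a forum-provided script; as a rough sketch of the idea only, and not the actual script, something like the following can poll `btrfs dev stats` and flag any nonzero error counter. The mount point /mnt/cache is an assumption for illustration.)

```shell
#!/bin/bash
# Sketch of a pool-error check, assuming the pool is mounted at /mnt/cache.
# "btrfs dev stats" prints one counter per line, e.g.:
#   [/dev/sdd1].corruption_errs 4
# Any nonzero value means btrfs has logged an error since the counters
# were last reset.
check_stats() {
  # Read stats on stdin; print nonzero counters, exit 0 only if any were found.
  awk '$NF != 0 { print; found = 1 } END { exit !found }'
}

# On the server itself:
if command -v btrfs >/dev/null 2>&1; then
  btrfs dev stats /mnt/cache | check_stats && echo "WARNING: pool has logged errors"
fi
```

Run from cron or the User Scripts plugin this gives a notification hook point; the script linked in the forum is the authoritative version.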
Simom Posted January 30, 2023 (Author)

Hey, thanks for the response. You are kinda right, but that error is from another cache pool (my NVMe one); I am currently working on clearing the pool you mentioned so I can scrub and format it. As far as I understand, the error you mentioned shouldn't be connected to the one I posted about, because the devices are part of different pools, or am I missing something?
JorgeB Posted January 30, 2023

4 minutes ago, Simom said: "As far as I understand the error you mentioned shouldn't be connected to the one I posted"

Correct, I missed that you had two pools. Unexpected csum errors can be the result of RAM issues, so I suggest running memtest. I still recommend the monitoring script for both pools.
Simom Posted January 30, 2023 (Author)

Alright, I will try that! As I said, I had a similar issue to this one some time ago but have swapped CPU, MoBo and RAM since then. Nevertheless, thanks for the help!
trurl Posted January 30, 2023

31 minutes ago, JorgeB said: "unexpected csum errors can be the result of RAM issues"

23 minutes ago, Simom said: "swapped CPU, MoBo and RAM"

In which case, the errors could have been the result of previous bad RAM. In any case,

32 minutes ago, JorgeB said: "suggest running memtest"
Simom Posted January 31, 2023 (Author)

18 hours ago, trurl said: "In which case, the errors could have been the result of previous bad RAM."

I think I didn't clearly state my previous troubleshooting steps, and this is leading to some confusion. So, all in order:

1. I had some csum errors like the one in my first comment
2. I swapped systems with new CPU, MoBo and RAM
3. I unassigned both drives, formatted them and created a new pool
4. Not even 48h later I got a new csum error

As I created a new pool, this still might be a problem with my current RAM, but not with my old one, or am I missing something? Memtest is running, nothing found so far. Any advice on how long I should leave it running (I read 24-48 hours somewhere)?
JorgeB Posted January 31, 2023

3 minutes ago, Simom said: "not even 48h later I get a new csum error"

That suggests there's still a problem, usually RAM related.
Simom Posted February 3, 2023 (Author)

Just wanted to get back to this: memtest has been running for over 72 hours without finding anything. Are there any other troubleshooting steps I can try?
JorgeB Posted February 3, 2023

Remove one of the RAM sticks, scrub the pool and, if no errors are found, reset the filesystem stats, then work normally. If new errors appear, try with just the other stick.
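(For reference, a sketch of that sequence on the command line; the mount point /mnt/cache is an assumption, and Unraid's GUI scrub button does the same job.)

```shell
# With one RAM stick removed:
btrfs scrub start -B /mnt/cache    # -B: run in the foreground and print a summary
btrfs scrub status /mnt/cache      # confirm no csum errors were found
btrfs device stats -z /mnt/cache   # -z: print the error counters, then reset them to zero
# ...then use the system normally; if new csum errors appear, swap in the other stick.
```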
Simom Posted February 3, 2023 (Author)

Thanks for the quick response! I will try that and see how it goes.
mrpainnogain Posted February 3

On 2/4/2023 at 2:16 AM, Simom said: "Thanks for the quick response! I will try that and see how it goes."

Hey, I am having similar issues with a RAID1 config. How did you manage to move the data from the cache to the array and move it back again? I am planning to re-format the NVMe disks.
trurl Posted February 3

Nothing can move open files. Disable Docker and VM Manager in Settings. Dynamix File Manager will let you work directly with the disks and pools on the server.
Simom Posted February 3 (Author, marked as Solution)

Just realized that I never followed up on this: I switched from macvlan to ipvlan for my Docker containers and that seems to have fixed it. No crashing, no corruption since then. (I guess the macvlan issues led to kernel panics, which led to the corruption of the files; but I am no expert.)

p.s. I also read that there have been changes to macvlan.
mrpainnogain Posted February 3

3 hours ago, trurl said: "Nothing can move open files. Disable Docker and VM Manager in Settings. Dynamix File Manager will let you work directly with the disks and pools on the server."

So after I disable Docker and the VMs, I can move the files from my cache to the array? And after formatting the cache, can I move the files back and use the VMs and Docker without additional settings?
Simom Posted February 3 (Author)

In theory, yes. You should also make sure that no one else is accessing the files over SMB, NFS or whatever before starting to move them. But I would highly advise checking that the filesystem runs as expected before moving important data back to the cache.
mrpainnogain Posted February 4

8 hours ago, Simom said: "In theory, yes. You should also make sure that no one else is accessing the files over SMB, NFS or whatever before starting to move them. But I would highly advise checking that the filesystem runs as expected before moving important data back to the cache."

Ok, also do I need to re-direct the VM vdisk directory and other things for the VMs? Or can I just hit the play button afterwards?