BTRFS csum failed ino



I am seeing the following errors in my log:

 

Feb 10 18:55:48 Tower kernel: BTRFS warning (device md9): csum failed ino 1137873 off 504983552 csum 4004706540 expected csum 1180943189
Feb 10 18:55:48 Tower kernel: BTRFS warning (device md9): csum failed ino 1137873 off 506032128 csum 1577116343 expected csum 2990069891
Feb 10 18:55:48 Tower kernel: BTRFS warning (device md9): csum failed ino 1137873 off 506732544 csum 3515546368 expected csum 135346473
Feb 10 18:55:48 Tower kernel: BTRFS warning (device md9): csum failed ino 1137873 off 507781120 csum 1284283823 expected csum 1771133338
Feb 10 18:55:48 Tower kernel: BTRFS warning (device md9): csum failed ino 1137873 off 508829696 csum 755813392 expected csum 3990879880
Feb 10 18:55:48 Tower kernel: BTRFS warning (device md9): csum failed ino 1137873 off 509964288 csum 4163699914 expected csum 4089351303
Feb 10 18:55:48 Tower shfs/user: err: shfs_read: read: (5) Input/output error
Feb 10 18:55:48 Tower kernel: BTRFS warning (device md9): csum failed ino 1137873 off 504983552 csum 4004706540 expected csum 1180943189
Feb 10 18:55:48 Tower kernel: BTRFS warning (device md9): csum failed ino 1137873 off 504983552 csum 4004706540 expected csum 1180943189
Feb 10 18:55:48 Tower kernel: BTRFS warning (device md9): csum failed ino 1137873 off 510926848 csum 24089256 expected csum 1351376558

 

(for those reading other posts: this was already happening before I exchanged my SATA cards)

 

md9 is one of my array drives; they are all BTRFS.

 

I am now running a scrub on it to see if it will find errors.

 

Is this some kind of data rot it is detecting? The drive is fine SMART-wise.

Link to comment

[/dev/md9].write_io_errs   0
[/dev/md9].read_io_errs    0
[/dev/md9].flush_io_errs   0
[/dev/md9].corruption_errs 1728
[/dev/md9].generation_errs 0

 

Which is:

 

corruption_errs

A block checksum mismatched or a corrupted metadata header was found
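 

For reference, these counters presumably come from btrfs device stats (the exact invocation isn't shown in the post); once the underlying problem is fixed, the same command with -z prints the counters and then resets them:

 

btrfs device stats /dev/md9
btrfs device stats -z /dev/md9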

 

I checked all devices (really nice command) and all other disks are fine. I am now running a non-correcting scrub and will then run a correcting scrub.
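 

For the commands themselves, a minimal sketch (assuming the disk is mounted at /mnt/disk9): scrub corrects by default, and -r makes it read-only. Note that with the single data profile there is no redundant copy to repair from, so even a correcting scrub cannot fix data csum errors here.

 

btrfs scrub start -rdB /mnt/disk9   # non-correcting (read-only); -d per-device stats, -B wait for completion
btrfs scrub start -dB /mnt/disk9    # correcting scrub (the default mode)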

 

Link to comment

In this case checksums are useful to alert you to the problem; you'll need to restore the affected files from backups or other sources.

 

If disk9 was one of the disks rebuilt during your recent controller issues, you may want to run a scrub on the other rebuilt disks.

 

I get that, and I have backups... The only thing I do not know is how to correlate the corruption found with a specific file. I am not running any specific checksumming dockers.

 

If that is what is needed, I will not be able to fix the issue and would like to "correct" the filesystem to reflect the current (and possibly corrupt) situation. Any idea how to do this?
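 

For the correlation question: the "ino" value in the csum warnings can be mapped to a path with btrfs inspect-internal (a sketch using the inode from the log above, assuming the filesystem is mounted at /mnt/disk9):

 

btrfs inspect-internal inode-resolve 1137873 /mnt/disk9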

Link to comment

I just created a bunch of scripts for the User Scripts plugin; I made one script per disk, as follows:

 

#!/bin/bash
# Notify that the scrub is starting.
/usr/local/emhttp/plugins/dynamix/scripts/notify -e "Scrubber" -s "Scrub disk1" -d "Scrub of disk1 started" -i "normal"
# -r: read-only (non-correcting), -d: per-device stats, -B: wait until finished.
if btrfs scrub start -rdB /mnt/disk1
then
   /usr/local/emhttp/plugins/dynamix/scripts/notify -e "Scrubber" -s "Scrub disk1" -d "Disk1 scrub completed; no errors" -i "normal"
else
   /usr/local/emhttp/plugins/dynamix/scripts/notify -e "Scrubber" -s "Scrub disk1" -d "Error in scrub of disk1 !" -i "alert"
fi
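 

For what it's worth, the per-disk copies could also be collapsed into one parameterized script (a hypothetical variant, not part of the original setup):

 

#!/bin/bash
# Usage: scrub_disk.sh <disk number>, e.g. "scrub_disk.sh 3" scrubs /mnt/disk3.
DISK="disk${1:?usage: scrub_disk.sh <disk number>}"
NOTIFY=/usr/local/emhttp/plugins/dynamix/scripts/notify

"$NOTIFY" -e "Scrubber" -s "Scrub $DISK" -d "Scrub of $DISK started" -i "normal"
if btrfs scrub start -rdB "/mnt/$DISK"
then
   "$NOTIFY" -e "Scrubber" -s "Scrub $DISK" -d "$DISK scrub completed; no errors" -i "normal"
else
   "$NOTIFY" -e "Scrubber" -s "Scrub $DISK" -d "Error in scrub of $DISK !" -i "alert"
fi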

 

I have scheduled these monthly: disk 1 on day 1, disk 2 on day 2, etc., starting at 22:00 each evening.
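 

Assuming the plugin's custom cron option is used for this, the disk1 schedule would look something like the following; bumping the day-of-month field gives the schedule for each subsequent disk:

 

# 22:00 on day 1 of each month (disk1); disk2 would be 0 22 2 * *, and so on
0 22 1 * *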

 

This way I will get a full scrub of every disk every month, with the results presented via Notifications.

Link to comment

In this case checksums are useful to alert you to the problem; you'll need to restore the affected files from backups or other sources.

 

If disk9 was one of the disks rebuilt during your recent controller issues, you may want to run a scrub on the other rebuilt disks.

 

The scrub completed; the webpage status shows the following:

 

scrub status for 8aa516d1-471a-4f1b-85e1-2a660e0d9ebf
        scrub started at Sat Feb 11 06:51:38 2017, running for 04:10:56
        total bytes scrubbed: 2.16TiB with 1728 errors
        error details: csum=1728
        corrected errors: 0, uncorrectable errors: 0, unverified errors: 0

 

The log shows exactly which file is causing the issues. Unfortunately, it appears that file (and only that file) is not in my CrashPlan backup... I guess it was corrupted when it was initially written and therefore could not be backed up. I would have expected CrashPlan to tell me that somehow; I'll dive into the logs.

 

I am downloading the file again, and that should make the issue go away.

 

There are also lines in the log that show no file names... I will redo the scrub when it has finished to make sure everything is OK.
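 

In case it helps others: during a scrub the kernel typically logs a "checksum error at logical ..." line that includes the affected path, so the file names can be pulled out of the syslog with something like this (the exact message wording varies by kernel version):

 

grep 'checksum error' /var/log/syslog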

 

Link to comment

There are also lines in the log that show no file names...

 

These can be metadata. I use the DUP metadata profile for all my btrfs data disks; this way I can recover from any metadata checksum issues, since serious metadata corruption can result in (all) data loss.

 

It will use some disk space, but a relatively small amount; you can see current usage with:

 

btrfs fi df /mnt/diskX

 

Metadata DUP will use twice the current usage. To convert any disk, use:

 

btrfs balance start -mconvert=dup /mnt/diskX
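 

The conversion runs as a regular balance, so if it takes a while its progress can be checked from another shell with:

 

btrfs balance status /mnt/diskX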

Link to comment

For my disk9 this shows:

 

Data, single: total=4.83TiB, used=4.70TiB
System, single: total=36.00MiB, used=576.00KiB
Metadata, single: total=8.01GiB, used=5.70GiB
GlobalReserve, single: total=512.00MiB, used=0.00B

 

If that just doubles, it's a non-issue... That would be only 6 GiB extra; not an issue whatsoever. This will create a metadata duplicate on another part of the disk? Sounds like a nice solution!

Link to comment

Metadata, single: total=8.01GiB, used=5.70GiB

 

If that just doubles, it's a non-issue... That would be only 6 GiB extra; not an issue whatsoever. This will create a metadata duplicate on another part of the disk? Sounds like a nice solution!

 

Yes, 6 to 8 GiB, because btrfs allocates chunks first and you currently have 8 GiB allocated for metadata, but that is still a very small price to pay for some extra security. By the way, DUP metadata is the default profile for single-device btrfs filesystems (except on SSDs). I already asked for this and Tom agreed, so I hope it will also become the default profile for btrfs data disks in the future.

 

 

Link to comment

Something went wrong with disk8. The DUP conversion got killed and the filesystem is read-only now... I am rebooting the system to see if that solves anything.

 

UPDATE: The system refused to finish the reboot. Telnet is no longer available, and I have waited for five minutes. The last mention in the log is "unregister_netdevice: waiting for lo to become free. Usage count = 1".

 

No telnet sessions or mappings are open. I will hard-reset the box.

 

 

Link to comment

I am seeing the following errors in my log:

 

Feb 10 18:55:48 Tower kernel: BTRFS warning (device md9): csum failed ino 1137873 off 504983552 csum 4004706540 expected csum 1180943189
[the full run of csum warnings is quoted in the first post]

 

Is this some kind of data rot it is detecting? The drive is fine SMART-wise.

 

(Sorry, I read the initial post and responded. Looking at the responses, this may not be relevant, but I am running out the door and am posting it anyway. If it is helpful, great; if not, ignore.)

 

I would doubt this is bitrot. Bitrot is debatable but certainly extremely rare, and you are seeing several instances; I also expect the bits are relatively fresh. This looks like either data corruption (something wrote to the disk outside the file system), a memory problem, a controller problem, or a BTRFS filesystem bug.

 

I might suggest a memtest to quickly rule out anything memory-related. That kind of fault would be insidious, and you'd want to fix it before doing anything else.

 

Given that you just changed your controller, that would be my #1 suspect. Of course, it is the data that is most concerning at this point. I am not a BTRFS expert, but I have to believe there is a way to run a file-by-file checksum test. I would explore that: see if you can identify the specific files where the checksum is off, and then compare them to backups (if you have them) or to md5s (if you have them).
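 

One way to approximate that file-by-file test on btrfs (a sketch, assuming the disk is mounted at /mnt/disk9): reading every file forces checksum verification, so corrupt files surface as read errors and syslog entries:

 

find /mnt/disk9 -type f -exec cat {} + > /dev/null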

 

You could run a NON-correcting parity check. I have to assume that the problem, whether unlikely bitrot, an errant disk, or a controller problem, would not be reflected in parity, so a parity check SHOULD find parity mismatches. If it doesn't, that points very squarely at BTRFS itself flaking out, or some malware or something corrupting the data intentionally (unlikely, but I have to throw that in).

 

If the NON-correcting check DOES find sync errors, I would suggest pulling the disk and letting unRAID simulate the drive. You would run the checksum verification on the simulated drive, then compare backups / md5s to the simulated files and see if they match. If the simulated disk is good but the actual disk is bad, you either have a bad disk (unlikely) or a controller issue. But at least you would have a version of truth regarding your data.

Link to comment

The btrfs errors were there before the controller swap...

 

My issue is bigger now... It seems one BTRFS disk is now in read-only mode, and it refuses to mount :-(

 

See my prior post. Run a memtest!

 

I will. But at the moment the volume is flagged read-only, and I would like to solve that first; then I will reboot and run a memtest. I do not expect an issue there, though: I am running ECC memory and have never had issues with it. This issue arose when making a change to the BTRFS volume (setting DUP, see above); for some reason that failed, and it caused BTRFS to mark the volume read-only. The first thing I would like to do is make sure that read-only flag goes away.

 

Does anyone have any idea how to do this? I have already googled and cannot find anything.
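 

For anyone in the same spot, a cautious sketch (assuming disk8 maps to /dev/md8 on this system): once btrfs forces a filesystem read-only after an error, a plain rw remount generally won't stick; the usual first steps are to unmount and assess read-only before attempting any repair:

 

umount /mnt/disk8
btrfs check /dev/md8   # read-only by default; --repair is risky and a last resort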

Link to comment
