BTRFS csum failed ino



I am seeing the following errors in my log:

 

Feb 10 18:55:48 Tower kernel: BTRFS warning (device md9): csum failed ino 1137873 off 504983552 csum 4004706540 expected csum 1180943189
Feb 10 18:55:48 Tower kernel: BTRFS warning (device md9): csum failed ino 1137873 off 506032128 csum 1577116343 expected csum 2990069891
Feb 10 18:55:48 Tower kernel: BTRFS warning (device md9): csum failed ino 1137873 off 506732544 csum 3515546368 expected csum 135346473
Feb 10 18:55:48 Tower kernel: BTRFS warning (device md9): csum failed ino 1137873 off 507781120 csum 1284283823 expected csum 1771133338
Feb 10 18:55:48 Tower kernel: BTRFS warning (device md9): csum failed ino 1137873 off 508829696 csum 755813392 expected csum 3990879880
Feb 10 18:55:48 Tower kernel: BTRFS warning (device md9): csum failed ino 1137873 off 509964288 csum 4163699914 expected csum 4089351303
Feb 10 18:55:48 Tower shfs/user: err: shfs_read: read: (5) Input/output error
Feb 10 18:55:48 Tower kernel: BTRFS warning (device md9): csum failed ino 1137873 off 504983552 csum 4004706540 expected csum 1180943189
Feb 10 18:55:48 Tower kernel: BTRFS warning (device md9): csum failed ino 1137873 off 504983552 csum 4004706540 expected csum 1180943189
Feb 10 18:55:48 Tower kernel: BTRFS warning (device md9): csum failed ino 1137873 off 510926848 csum 24089256 expected csum 1351376558

 

(for those reading other posts: this was already happening before I exchanged my SATA cards)

 

md9 is one of my array drives; they are all BTRFS.

 

I am now running a scrub on it to see if it will find errors.

 

Is this some kind of data rot it is detecting? The drive is fine SMART-wise.

Link to comment

[/dev/md9].write_io_errs   0
[/dev/md9].read_io_errs    0
[/dev/md9].flush_io_errs   0
[/dev/md9].corruption_errs 1728
[/dev/md9].generation_errs 0

 

Which is:

 

corruption_errs

A block checksum mismatched or a corrupted metadata header was found
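 

For reference, these counters presumably come from btrfs device stats (the exact invocation isn't shown in the post); once the underlying problem is fixed, the same command with -z prints the counters and then resets them:

 

btrfs device stats /dev/md9
btrfs device stats -z /dev/md9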

 

I checked all devices (really nice command) and all other disks are fine. I am now running a non-correcting scrub and will then run a correcting scrub.
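 

For the commands themselves, a minimal sketch (assuming the disk is mounted at /mnt/disk9): scrub corrects by default, and -r makes it read-only. Note that with the single data profile there is no redundant copy to repair from, so even a correcting scrub cannot fix data csum errors here.

 

btrfs scrub start -rdB /mnt/disk9   # non-correcting (read-only); -d per-device stats, -B wait for completion
btrfs scrub start -dB /mnt/disk9    # correcting scrub (the default mode)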

 

Link to comment

In this case checksums are useful to alert you to the problem; you'll need to restore the affected files from backups or other sources.

 

If disk9 was one of the disks rebuilt during your recent controller issues, you may want to run a scrub on the other rebuilt disks.

 

I get that, and I have backups... The only thing I do not know is how to correlate the corruption found with a specific file. I am not running any specific checksumming dockers.

 

If that is what is needed, I will not be able to fix the issue and would like to "correct" the filesystem to reflect the current (and possibly corrupt) situation. Any idea how to do this?
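 

For the correlation question: the "ino" value in the csum warnings can be mapped to a path with btrfs inspect-internal (a sketch using the inode from the log above, assuming the filesystem is mounted at /mnt/disk9):

 

btrfs inspect-internal inode-resolve 1137873 /mnt/disk9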

Link to comment

I just created a bunch of scripts for the User Scripts plugin; I made one script per disk, as follows:

 

#!/bin/bash
# Notify that the scrub is starting.
/usr/local/emhttp/plugins/dynamix/scripts/notify -e "Scrubber" -s "Scrub disk1" -d "Scrub of disk1 started" -i "normal"
# -r: read-only (non-correcting), -d: per-device stats, -B: wait until finished.
if btrfs scrub start -rdB /mnt/disk1
then
   /usr/local/emhttp/plugins/dynamix/scripts/notify -e "Scrubber" -s "Scrub disk1" -d "Disk1 scrub completed; no errors" -i "normal"
else
   /usr/local/emhttp/plugins/dynamix/scripts/notify -e "Scrubber" -s "Scrub disk1" -d "Error in scrub of disk1 !" -i "alert"
fi
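 

For what it's worth, the per-disk copies could also be collapsed into one parameterized script (a hypothetical variant, not part of the original setup):

 

#!/bin/bash
# Usage: scrub_disk.sh <disk number>, e.g. "scrub_disk.sh 3" scrubs /mnt/disk3.
DISK="disk${1:?usage: scrub_disk.sh <disk number>}"
NOTIFY=/usr/local/emhttp/plugins/dynamix/scripts/notify

"$NOTIFY" -e "Scrubber" -s "Scrub $DISK" -d "Scrub of $DISK started" -i "normal"
if btrfs scrub start -rdB "/mnt/$DISK"
then
   "$NOTIFY" -e "Scrubber" -s "Scrub $DISK" -d "$DISK scrub completed; no errors" -i "normal"
else
   "$NOTIFY" -e "Scrubber" -s "Scrub $DISK" -d "Error in scrub of $DISK !" -i "alert"
fi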

 

I have scheduled these monthly: disk 1 on day 1, disk 2 on day 2, etc., starting at 22:00 each evening.
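 

Assuming the plugin's custom cron option is used for this, the disk1 schedule would look something like the following; bumping the day-of-month field gives the schedule for each subsequent disk:

 

# 22:00 on day 1 of each month (disk1); disk2 would be 0 22 2 * *, and so on
0 22 1 * *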

 

This way I will get a full scrub of every disk every month, with the results presented via Notifications.

Link to comment

In this case checksums are useful to alert you to the problem; you'll need to restore the affected files from backups or other sources.

 

If disk9 was one of the disks rebuilt during your recent controller issues, you may want to run a scrub on the other rebuilt disks.

 

The scrub completed; the webpage status shows the following:

 

scrub status for 8aa516d1-471a-4f1b-85e1-2a660e0d9ebf
        scrub started at Sat Feb 11 06:51:38 2017, running for 04:10:56
        total bytes scrubbed: 2.16TiB with 1728 errors
        error details: csum=1728
        corrected errors: 0, uncorrectable errors: 0, unverified errors: 0

 

The log shows exactly which file is causing the issues. Unfortunately, it appears that file (and only that file) is not in my CrashPlan backup... I guess it was corrupted when it was initially written and therefore could not be backed up. I would have expected CrashPlan to tell me that somehow; I'll dive into the logs.

 

I am downloading the file again, and that should make the issue go away.

 

There are also lines in the log that show no file names... I will redo the scrub when it has finished to make sure everything is OK.
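 

In case it helps others: during a scrub the kernel typically logs a "checksum error at logical ..." line that includes the affected path, so the file names can be pulled out of the syslog with something like this (the exact message wording varies by kernel version):

 

grep 'checksum error' /var/log/syslog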

 

Link to comment

There are also lines in the log that show no file names...

 

These can be metadata. I use the DUP metadata profile for all my btrfs data disks; this way I can recover from any metadata checksum issues, since serious metadata corruption can result in (all) data loss.

 

It will use some disk space, but a relatively small amount; you can see current usage with:

 

btrfs fi df /mnt/diskX

 

Metadata DUP will use twice the current usage. To convert any disk, use:

 

btrfs balance start -mconvert=dup /mnt/diskX
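 

The conversion runs as a regular balance, so if it takes a while its progress can be checked from another shell with:

 

btrfs balance status /mnt/diskX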

Link to comment

For my disk9 this shows:

 

Data, single: total=4.83TiB, used=4.70TiB
System, single: total=36.00MiB, used=576.00KiB
Metadata, single: total=8.01GiB, used=5.70GiB
GlobalReserve, single: total=512.00MiB, used=0.00B

 

If that just doubles, it's a non-issue... That would be only 6 GiB extra; not an issue whatsoever. This will create a metadata duplicate on another part of the disk? Sounds like a nice solution!

Link to comment

Metadata, single: total=8.01GiB, used=5.70GiB

 

If that just doubles, it's a non-issue... That would be only 6 GiB extra; not an issue whatsoever. This will create a metadata duplicate on another part of the disk? Sounds like a nice solution!

 

Yes, 6 to 8 GiB, because btrfs allocates chunks first and you currently have 8 GiB allocated for metadata, but that is still a very small price to pay for some extra security. By the way, DUP metadata is the default profile for single-device btrfs filesystems (except on SSDs). I already asked for this and Tom agreed, so I hope it will also become the default profile for btrfs data disks in the future.

 

 

Link to comment

Something went wrong with disk8. The DUP conversion got killed and the filesystem is read-only now... I am rebooting the system to see if that solves anything.

 

UPDATE: The system refused to finish the reboot. Telnet is no longer available, and I have waited for five minutes. The last mention in the log is "unregister_netdevice: waiting for lo to become free. Usage count = 1".

 

No telnet sessions or mappings are open. I will hard-reset the box.

 

 

Link to comment

I am seeing the following errors in my log:

 

Feb 10 18:55:48 Tower kernel: BTRFS warning (device md9): csum failed ino 1137873 off 504983552 csum 4004706540 expected csum 1180943189
[the full run of csum warnings is quoted in the first post]

 

Is this some kind of data rot it is detecting? The drive is fine SMART-wise.

 

(Sorry, I read the initial post and responded. Looking at the responses, this may not be relevant, but I am running out the door and am posting it anyway. If it is helpful, great; if not, ignore.)

 

I would doubt this is bitrot. Bitrot is debatable but certainly extremely rare, and you are seeing several instances; I also expect the bits are relatively fresh. This looks like either data corruption (something wrote to the disk outside the file system), a memory problem, a controller problem, or a BTRFS filesystem bug.

 

I might suggest a memtest to quickly rule out anything memory-related. That kind of fault would be insidious, and you'd want to fix it before doing anything else.

 

Given that you just changed your controller, that would be my #1 suspect. Of course, it is the data that is most concerning at this point. I am not a BTRFS expert, but I have to believe there is a way to run a file-by-file checksum test. I would explore that: see if you can identify the specific files where the checksum is off, and then compare them to backups (if you have them) or to md5s (if you have them).
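 

One way to approximate that file-by-file test on btrfs (a sketch, assuming the disk is mounted at /mnt/disk9): reading every file forces checksum verification, so corrupt files surface as read errors and syslog entries:

 

find /mnt/disk9 -type f -exec cat {} + > /dev/null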

 

You could run a NON-correcting parity check. I have to assume that the problem, whether unlikely bitrot, an errant disk, or a controller problem, would not be reflected in parity, so a parity check SHOULD find parity mismatches. If it doesn't, that points very squarely at BTRFS itself flaking out, or some malware or something corrupting the data intentionally (unlikely, but I have to throw that in).

 

If the NON-correcting check DOES find sync errors, I would suggest pulling the disk and letting unRAID simulate the drive. You would run the checksum verification on the simulated drive, then compare backups / md5s to the simulated files and see if they match. If the simulated disk is good but the actual disk is bad, you either have a bad disk (unlikely) or a controller issue. But at least you would have a version of truth regarding your data.

Link to comment

The btrfs errors were there before the controller swap...

 

My issue is bigger now... It seems one BTRFS disk is now in read-only mode, and it refuses to mount :-(

 

See my prior post. Run a memtest!

 

I will. But at the moment the volume is flagged read-only, and I would like to solve that first; then I will reboot and run a memtest. I do not expect an issue there, though: I am running ECC memory and have never had issues with it. This issue arose when making a change to the BTRFS volume (setting DUP, see above); for some reason that failed, and it caused BTRFS to mark the volume read-only. The first thing I would like to do is make sure that read-only flag goes away.

 

Does anyone have any idea how to do this? I have already googled and cannot find anything.
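 

For anyone in the same spot, a cautious sketch (assuming disk8 maps to /dev/md8 on this system): once btrfs forces a filesystem read-only after an error, a plain rw remount generally won't stick; the usual first steps are to unmount and assess read-only before attempting any repair:

 

umount /mnt/disk8
btrfs check /dev/md8   # read-only by default; --repair is risky and a last resort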

Link to comment
