
Thousands of checksum errors on BTRFS drive. Why?


Solved by JorgeB


PREAMBLE

So, without actually experiencing any issues, I went to the syslog and saw many entries like the following:

Dec 29 18:00:21 UnBlue kernel: BTRFS error (device md1): bdev /dev/md1 errs: wr 0, rd 0, flush 0, corrupt 42601, gen 0

All of those point to the same device: md1.
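
For anyone who wants to track how that counter changes over time, here is a minimal Python sketch that pulls the five counters out of a line like the one above. The helper name `parse_btrfs_errs` and the regex are illustrative assumptions, not part of any BTRFS tool, and the sketch assumes the kernel keeps the `errs: wr ..., rd ..., flush ..., corrupt ..., gen ...` layout shown in this syslog.

```python
import re

# Matches the per-device counter summary as it appears in the syslog line above.
ERRS_RE = re.compile(
    r"BTRFS error \(device (?P<device>[^)]+)\): bdev (?P<bdev>\S+) errs: "
    r"wr (?P<wr>\d+), rd (?P<rd>\d+), flush (?P<flush>\d+), "
    r"corrupt (?P<corrupt>\d+), gen (?P<gen>\d+)"
)


def parse_btrfs_errs(line):
    """Return the device names and integer counters, or None if the line does not match."""
    m = ERRS_RE.search(line)
    if m is None:
        return None
    result = {"device": m.group("device"), "bdev": m.group("bdev")}
    for key in ("wr", "rd", "flush", "corrupt", "gen"):
        result[key] = int(m.group(key))
    return result


line = (
    "Dec 29 18:00:21 UnBlue kernel: BTRFS error (device md1): "
    "bdev /dev/md1 errs: wr 0, rd 0, flush 0, corrupt 42601, gen 0"
)
print(parse_btrfs_errs(line))
```

The same counters can also be read directly with `btrfs device stats /mnt/disk1`, which avoids parsing the log at all.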

root@UnBlue:~# df -h
Filesystem      Size  Used Avail Use% Mounted on
rootfs           32G  1.6G   30G   5% /
devtmpfs         32G     0   32G   0% /dev
tmpfs            32G     0   32G   0% /dev/shm
cgroup_root     8.0M     0  8.0M   0% /sys/fs/cgroup
tmpfs           128M  2.7M  126M   3% /var/log
/dev/sda1        15G  947M   14G   7% /boot
overlay          32G  1.6G   30G   5% /lib/modules
overlay          32G  1.6G   30G   5% /lib/firmware
tmpfs           1.0M     0  1.0M   0% /mnt/disks
tmpfs           1.0M     0  1.0M   0% /mnt/remotes
/dev/md1        7.3T  5.9T  1.5T  80% /mnt/disk1
/dev/md2        7.3T  1.4T  6.0T  18% /mnt/disk2
/dev/md3        1.9T  372M  1.9T   1% /mnt/disk3
/dev/nvme0n1p1  932G  580G  351G  63% /mnt/cache
shfs             17T  7.2T  9.3T  44% /mnt/user0
shfs             17T  7.2T  9.3T  44% /mnt/user
/dev/loop2       24G  4.7G   20G  20% /var/lib/docker
/dev/loop3      1.0G  5.2M  904M   1% /etc/libvirt

So I went and checked the disk1 drive. The SMART test shows no errors, but a scrub does find checksum errors.

UUID:             328d4811-6fe2-4785-a6c3-1d05a7cf6133
Scrub started:    Wed Dec 29 12:42:49 2021
Status:           finished
Duration:         7:59:57
Total to scrub:   5.80TiB
Rate:             211.30MiB/s
Error summary:    csum=7411
  Corrected:      0
  Uncorrectable:  0
  Unverified:     0

That sounds like a lot of errors. I can now see the files with those errors in the syslog. There are quite a lot of them too, spread among different folders: Plex videos, some Steam games, 2 of my VM disks... That's probably irrelevant though.

I have no idea whether those files are fine and the error is actually in the checksum, so I don't intend to fix them with scrub. I can delete all the problematic files, but it will take some time to do it safely and make sure I don't lose anything from the VMs.

 

WHAT THIS POST IS ACTUALLY ABOUT

This post is about figuring out what caused those errors. What can I do about it?

Some extra information: All of the drives that form the array (parity disk included) are new (months old) to the system. Up until recently I had only a parity and another disk. But they started to report some "Reallocated sector count" that eventually turned into a few "Offline uncorrectable", so I replaced them and saw the opportunity to expand the array. Is it possible I've carried some issue from the past drives into these?

15 hours ago, BlueSialia said:
corrupt 42601

This means data corruption is being detected, most often due to RAM issues. In your case, since you're running a Ryzen CPU with RAM above the maximum officially supported speed, that would be the main suspect. It could also just be bad RAM or another hardware issue.

 

After the problem is fixed, a scrub will list the corrupt files in the syslog. Those need to be deleted or replaced from backups.
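
With thousands of entries, collecting the affected paths by hand gets tedious. Below is a minimal Python sketch, assuming the scrub warnings in the syslog contain the words `checksum error` and end with a `(path: ...)` field. The exact wording can differ between kernel versions, so check it against a real line first. The sample lines are invented for illustration and are not from this thread, and `corrupt_paths` is a hypothetical helper name.

```python
import re

# Assumption: each scrub warning for a damaged file names the device and
# ends with "(path: <path relative to the filesystem root>)".
PATH_RE = re.compile(
    r"BTRFS warning \(device (?P<device>[^)]+)\): checksum error .*?"
    r"\(path: (?P<path>.*)\)\s*$"
)


def corrupt_paths(lines, device):
    """Return the sorted, de-duplicated file paths reported for one device."""
    paths = set()
    for line in lines:
        m = PATH_RE.search(line)
        if m and m.group("device") == device:
            paths.add(m.group("path"))
    return sorted(paths)


# Illustrative lines only: offsets, inodes, and paths are made up.
sample = [
    "kernel: BTRFS warning (device md1): checksum error at logical 1 on dev /dev/md1, "
    "physical 1, root 5, inode 257, offset 0, length 4096, links 1 (path: media/a.mkv)",
    "kernel: BTRFS warning (device md1): checksum error at logical 2 on dev /dev/md1, "
    "physical 2, root 5, inode 257, offset 4096, length 4096, links 1 (path: media/a.mkv)",
    "kernel: BTRFS warning (device md1): checksum error at logical 3 on dev /dev/md1, "
    "physical 3, root 5, inode 300, offset 0, length 4096, links 1 (path: games/b (1).bin)",
    "kernel: BTRFS warning (device md2): checksum error at logical 4 on dev /dev/md2, "
    "physical 4, root 5, inode 400, offset 0, length 4096, links 1 (path: other/c.img)",
    "kernel: BTRFS error (device md1): bdev /dev/md1 errs: wr 0, rd 0, flush 0, corrupt 42601, gen 0",
]

for path in corrupt_paths(sample, "md1"):
    print(path)
```

To run it against a real log, pass an open file instead of the sample list, for example `corrupt_paths(open("/var/log/syslog"), "md1")`. The resulting list is only a starting point for deciding which files to delete or restore from backup.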


Hmmm, I didn't know my CPU wasn't meant to run RAM at 3200 MT/s when using 4 sticks. It's true the default in the BIOS was lower. To increase it I used an XMP profile (is that the name of it?) that the BIOS itself suggested to me, so I thought it was safe.

 

Assuming that's the cause and the RAM modules themselves and the other hardware are healthy, should I then tune it back to 2667 MT/s to avoid further corruption? Or is there some other recommended procedure that will allow me to keep it at 3200 MT/s?


I've never been into overclocking so I have no idea how risky each thing is. I knew XMP was an overclock, but since it was a suggestion from the system itself and I didn't know about the limit for Ryzen CPUs I guess I didn't worry at all.

 

Anyway, thank you for the help. I hope I don't see further corruption after undoing the overclock.

