BlueSialia
Posted December 29, 2021 (edited)

PREAMBLE

So, without actually experiencing any issues, I went to the syslog and saw many entries like the following:

Dec 29 18:00:21 UnBlue kernel: BTRFS error (device md1): bdev /dev/md1 errs: wr 0, rd 0, flush 0, corrupt 42601, gen 0

All of those point to the same device: md1.

root@UnBlue:~# df -h
Filesystem      Size  Used Avail Use% Mounted on
rootfs           32G  1.6G   30G   5% /
devtmpfs         32G     0   32G   0% /dev
tmpfs            32G     0   32G   0% /dev/shm
cgroup_root     8.0M     0  8.0M   0% /sys/fs/cgroup
tmpfs           128M  2.7M  126M   3% /var/log
/dev/sda1        15G  947M   14G   7% /boot
overlay          32G  1.6G   30G   5% /lib/modules
overlay          32G  1.6G   30G   5% /lib/firmware
tmpfs           1.0M     0  1.0M   0% /mnt/disks
tmpfs           1.0M     0  1.0M   0% /mnt/remotes
/dev/md1        7.3T  5.9T  1.5T  80% /mnt/disk1
/dev/md2        7.3T  1.4T  6.0T  18% /mnt/disk2
/dev/md3        1.9T  372M  1.9T   1% /mnt/disk3
/dev/nvme0n1p1  932G  580G  351G  63% /mnt/cache
shfs             17T  7.2T  9.3T  44% /mnt/user0
shfs             17T  7.2T  9.3T  44% /mnt/user
/dev/loop2       24G  4.7G   20G  20% /var/lib/docker
/dev/loop3      1.0G  5.2M  904M   1% /etc/libvirt

So I go and check the disk1 drive. SMART test shows no errors. scrub does find checksum errors.

UUID:             328d4811-6fe2-4785-a6c3-1d05a7cf6133
Scrub started:    Wed Dec 29 12:42:49 2021
Status:           finished
Duration:         7:59:57
Total to scrub:   5.80TiB
Rate:             211.30MiB/s
Error summary:    csum=7411
  Corrected:      0
  Uncorrectable:  0
  Unverified:     0

That sounds like a lot of errors. I can see the files with those errors in the syslog now. They are quite a lot too, spread among different folders: Plex videos, some Steam games, 2 of my VM disks... That's probably irrelevant though. I have no idea if those files are fine and the error is actually on the checksum so I don't intend to fix them with scrub. And I can delete all the problematic files. It will take some time to do it safely and make sure I don't lose anything from the VMs.

WHAT THIS POST IS ACTUALLY ABOUT

This post is about figuring out what caused those errors. What can I do about it?
Some extra information: All of the drives that form the array (parity disk included) are new (months old) to the system. Up until recently I had only a parity and another disk. But they started to report some "Reallocated sector count" that eventually turned into a few "Offline uncorrectable", so I replaced them and saw the opportunity to expand the array. Is it possible I've carried some issue from the past drives into these?

Edited December 29, 2021 by BlueSialia (Typo)
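The syslog entry quoted in this post carries five per-device counters (wr, rd, flush, corrupt, gen). As a minimal sketch, assuming the entries always follow the exact layout of the single line shown above, they could be pulled out of a syslog with a regular expression. The function name `parse_btrfs_errs` is hypothetical and only for illustration.

```python
import re

# Pattern built from the one syslog line quoted in the post; other kernel
# versions may format this message differently (assumption).
BTRFS_ERRS = re.compile(
    r"BTRFS error \(device (?P<device>[^)]+)\): bdev (?P<bdev>\S+) errs: "
    r"wr (?P<wr>\d+), rd (?P<rd>\d+), flush (?P<flush>\d+), "
    r"corrupt (?P<corrupt>\d+), gen (?P<gen>\d+)"
)

COUNTER_NAMES = ("wr", "rd", "flush", "corrupt", "gen")


def parse_btrfs_errs(line):
    """Return the device names and integer counters, or None if no match."""
    match = BTRFS_ERRS.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    result = {"device": fields["device"], "bdev": fields["bdev"]}
    for name in COUNTER_NAMES:
        result[name] = int(fields[name])
    return result


# The line below is copied from the post.
sample = (
    "Dec 29 18:00:21 UnBlue kernel: BTRFS error (device md1): "
    "bdev /dev/md1 errs: wr 0, rd 0, flush 0, corrupt 42601, gen 0"
)
print(parse_btrfs_errs(sample))
```

Running this over the whole syslog and watching whether `corrupt` keeps growing between entries is one way to tell whether new corruption is still being detected.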
JorgeB
Posted December 30, 2021

Please post the diagnostics.
BlueSialia (Author)
Posted December 30, 2021

Sure! Here it is.

unblue-diagnostics-20211230-1018.zip
JorgeB
Posted December 30, 2021 (Solution)

15 hours ago, BlueSialia said:
    corrupt 42601

This counter means data corruption is being detected, which is most often caused by RAM issues. In your case, since you're running a Ryzen CPU with RAM speeds above the maximum supported spec, that would be the main suspect, though it could also just be bad RAM or another hardware issue. After the problem is fixed, a scrub will list the corrupt files in the syslog; those need to be deleted or replaced from backups.
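Once the hardware issue is addressed and the affected files are dealt with, a follow-up scrub should report no remaining checksum errors. The following is a rough sketch of reading the error counters out of scrub status text, assuming it has the same layout as the output quoted in the first post. The function name `parse_scrub_summary` is hypothetical, and other btrfs-progs versions may word the summary differently.

```python
import re


def parse_scrub_summary(text):
    """Collect error counters from scrub status text shaped like the
    output quoted in the first post (layout is an assumption)."""
    counters = {}
    # "Error summary:" is followed by key=value pairs such as csum=7411.
    summary = re.search(r"Error summary:\s*(.*)", text)
    if summary:
        for key, value in re.findall(r"(\w+)=(\d+)", summary.group(1)):
            counters[key] = int(value)
    # The remaining counters appear as "Label: number".
    for label in ("Corrected", "Uncorrectable", "Unverified"):
        match = re.search(rf"{label}:\s*(\d+)", text)
        if match:
            counters[label.lower()] = int(match.group(1))
    return counters


# Values copied from the scrub output in the first post.
status_text = """\
Status:           finished
Error summary:    csum=7411
  Corrected:      0
  Uncorrectable:  0
  Unverified:     0
"""
print(parse_scrub_summary(status_text))
```

A scrub whose summary still shows a nonzero `csum` count after the corrupt files were deleted or restored would suggest the underlying problem is not fixed yet.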
BlueSialia (Author)
Posted December 30, 2021 (edited)

Hmmm, I didn't know my CPU wasn't meant to run RAM at 3200 MT/s when using 4 sticks. It's true the default in the BIOS was lower. To increase it I used an XMP profile (is that the name of it?) that the BIOS itself suggested to me, so I thought it was safe. Assuming that's the cause and the RAM modules themselves and the other hardware are healthy, should I then tune it back to 2667 MT/s to avoid further corruption? Or is there some other recommended procedure that will allow me to keep it at 3200 MT/s?

Edited December 30, 2021 by BlueSialia
itimpi
Posted December 30, 2021

1 minute ago, BlueSialia said:
    used a XMP profile

XMP profiles are always overclocks. You may get away with it but they are typically not a good idea when stability is more important than performance.
BlueSialia (Author)
Posted December 30, 2021

I've never been into overclocking so I have no idea how risky each thing is. I knew XMP was an overclock, but since it was a suggestion from the system itself and I didn't know about the limit for Ryzen CPUs I guess I didn't worry at all. Anyway, thank you for the help. I hope I don't see further corruption after undoing the overclock.