thompn4 Posted April 5, 2023

Hey everyone – I've been trying to troubleshoot an error for over a month now and I can't figure it out. I have a pretty big array (10 disks total, double parity, 68TB usable) with a btrfs cache pool with parity (originally I had two 256 GB and two 1 TB SSDs for 1.25 TB of cache). A little over a month ago all my docker containers stopped working, and I started getting errors saying my cache drive was read-only. Restarting the server would fix things temporarily, but then at some random point (sometimes the same day, sometimes a few days later) docker would die again.

Here's what I tried:
1) Rebuilding the docker image.
2) Replacing my SSDs (took the existing 4 out and put in two brand new 1 TB SSDs).
3) Changing the cables to my SSDs.
4) Changing the OS version to 6.12.0-rc2.
5) Changing where Frigate writes to an NVMe (and not my main cache).

At this point I'm kind of out of ideas. Maybe it's still bad cables to my SSDs, resulting in them dropping offline when it gets too cold in my server room? I don't feel like I use my server significantly more or differently than anyone else; the only really taxing thing I do is run Frigate for CCTV. I have that set up for 24/7 recording, so that is a lot of writing to cache, but I figured moving that to the NVMe would solve that problem.

Diagnostics attached; Fix Common Problems is not showing any errors right now. When I try to re-enable docker, it currently fails to start. I can rebuild the docker image again and it'll work for a while (this last time was 4 days), but I fear I'm just in a loop of rebuilding my docker image every few days. Before this it worked solidly for well over a year.

unraid-diagnostics-20230405-0801.zip
JorgeB Posted April 5, 2023 (Solution)

Mar 29 20:50:51 Unraid kernel: BTRFS info (device sdm1): bdev /dev/sdm1 errs: wr 0, rd 0, flush 0, corrupt 113, gen 0
Mar 29 20:50:51 Unraid kernel: BTRFS info (device sdm1): bdev /dev/sdl1 errs: wr 201748643, rd 247526, flush 1001565, corrupt 86544, gen 0

This shows one of the cache devices dropped offline in the past, and there is also some possible data corruption. Run a correcting scrub and post the results. Also see here:

https://forums.unraid.net/topic/46802-faq-for-unraid-v6/?do=findComment&comment=700582

and here – Ryzen with overclocked RAM like you have is known to corrupt data:

https://forums.unraid.net/topic/46802-faq-for-unraid-v6/?do=findComment&comment=819173
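For anyone landing here later, a minimal sketch of the commands involved, assuming the pool is mounted at /mnt/cache (adjust the path if your pool is named differently):

# show the per-device error counters for the pool
btrfs dev stats /mnt/cache

# start a scrub; on a read-write mounted filesystem it repairs whatever
# it can correct from the redundant copy (this is the "correcting" scrub)
btrfs scrub start /mnt/cache

# check progress and the final error summary
btrfs scrub status /mnt/cache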
thompn4 Posted April 7, 2023 (Author)

Thanks for your help @JorgeB! When I ran the 'btrfs dev stats' command there were a lot of errors. I removed the RAM overclock, checked the cables to the SSDs again, ran a scrub, and set up the 'btrfs dev stats' script on a schedule. Hopefully everything will continue to work this time around. Scrub stats here:

UUID:             39eb07fa-6f31-43bb-b9e0-e5bc3349ce85
Scrub started:    Fri Apr 7 00:10:39 2023
Status:           finished
Duration:         0:11:18
Total to scrub:   240.50GiB
Rate:             389.42MiB/s
Error summary:    verify=3123 csum=6073889
  Corrected:      6076994
  Uncorrectable:  18
  Unverified:     0
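(For reference, a minimal sketch of that kind of scheduled check – not necessarily the exact script from the FAQ – assuming the pool is at /mnt/cache and that Unraid's notification helper lives at /usr/local/emhttp/webGui/scripts/notify:)

#!/bin/bash
# warn if any btrfs device error counter on the cache pool is non-zero
POOL=/mnt/cache
ERRORS=$(btrfs dev stats "$POOL" | grep -vw 0)
if [ -n "$ERRORS" ]; then
    /usr/local/emhttp/webGui/scripts/notify -i warning \
        -s "btrfs errors on $POOL" -d "$ERRORS"
fi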
JorgeB Posted April 7, 2023

Run another scrub and post the diagnostics once it finishes.
thompn4 Posted April 7, 2023 (Author)

UUID:             39eb07fa-6f31-43bb-b9e0-e5bc3349ce85
Scrub started:    Fri Apr 7 06:39:42 2023
Status:           finished
Duration:         0:04:11
Total to scrub:   200.20GiB
Rate:             918.02MiB/s
Error summary:    csum=18
  Corrected:      0
  Uncorrectable:  18
  Unverified:     0
thompn4 Posted April 7, 2023 (Author)

Sorry – attached now – thanks for your help!!

unraid-diagnostics-20230407-0902.zip
JorgeB Posted April 7, 2023

Some errors were uncorrectable; the affected files are listed in the syslog during the scrub, e.g.:

Apr 7 06:39:56 Unraid kernel: BTRFS warning (device sdk1): checksum error at logical 7364440064 on dev /dev/sdj1, physical 6261338112, root 5, inode 181866, offset 86016, length 4096, links 1 (path: appdata/jellyfin/metadata/library/8a/8a0efc38e8f26aa46396532f81472c17/chapters/637569150190000000_3000000000.jpg)

These should be deleted or restored from a backup, then run another scrub to confirm 0 errors.
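A quick way to pull the affected paths out of the syslog is something like this (a sketch; it just greps the standard syslog for the scrub warnings shown above):

# list the unique file paths flagged with checksum errors during the scrub
grep 'checksum error' /var/log/syslog | grep -o 'path: [^)]*' | sort -u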
thompn4 Posted April 7, 2023 (Author)

No more errors – thanks again!

UUID:             39eb07fa-6f31-43bb-b9e0-e5bc3349ce85
Scrub started:    Fri Apr 7 12:30:54 2023
Status:           finished
Duration:         0:01:36
Total to scrub:   77.38GiB
Rate:             826.39MiB/s
Error summary:    no errors found

unraid-diagnostics-20230407-1232.zip
thompn4 Posted April 10, 2023 (Author)

@JorgeB – not sure if you have any other advice, but I am still getting a few btrfs corrupt errors (14 yesterday, 1 today) when the 'btrfs dev stats /mnt/cache' command runs. When I run a scrub it comes up with 0 errors, but the errors do show up in the syslog as "csum failed". Diagnostics attached. Thanks again.

unraid-diagnostics-20230409-2112.zip
JorgeB Posted April 10, 2023

At this point I suggest backing up the pool and then reformatting it.
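A minimal sketch of what that backup could look like, assuming the pool is mounted at /mnt/cache and copying to a hypothetical /mnt/user/cache_backup share on the array (stop docker, VMs, and anything else writing to the pool first):

# copy everything off the pool, preserving attributes and sparse files
rsync -avhPX --sparse /mnt/cache/ /mnt/user/cache_backup/

# after reformatting the pool from the GUI, copy the data back
rsync -avhPX --sparse /mnt/user/cache_backup/ /mnt/cache/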
thompn4 Posted May 2, 2023 (Author)

So I found at least one of the issues – one of my sticks of RAM was bad. I was able to get the server up and running, mover was working fine, and I could read and write to the cache drives fine. However, as soon as I started docker, the cache went read-only again. I removed the docker image and recreated it (adding apps from the Previous Apps section), and before it could even finish adding all my apps, the cache drive went read-only again. No corruption errors this time. I'm beginning to think one of the files for docker itself (not the image) is corrupted, or something in my appdata? But the server runs fine until I try to start docker. Attached are diagnostics from last night when this happened.

unraid-diagnostics-20230502-0002.zip
JorgeB Posted May 2, 2023

There's corruption on the cache filesystem, so you need to fix that first. Since you had bad RAM before, it's best to back up what you can, reformat the cache, and then recreate the docker image.
trurl Posted May 2, 2023

3 minutes ago, thompn4 said:
one of my sticks of ram was bad

If you've been running on bad RAM, then anything has some chance of being corrupt.