January 6

Hi everyone,

Ever since I built my server PC in November, I have been having worsening issues with stability. I built this server to run a few docker containers including Plex and other media management tools. The symptoms began as occasional BTRFS corruption errors on the NVMe cache drive and have escalated within the last month or so to full kernel panics requiring a power cycle. These kernel panics have increased in frequency from once every few days to now after about 20-30 minutes of total uptime. I have tried almost everything at this point ranging from BIOS setting changes to replacing hardware and I am honestly completely lost as to how to fix this issue. I will list out some of the symptoms, my hardware before and after changes, and all steps I have taken to address the issue so far. I have also attached my diagnostics file and syslog from right after the last kernel panic I experienced. Thanks.

HARDWARE:
CPU: AMD Ryzen 7 3700X --> replaced with AMD Ryzen 7 5700X
RAM: G.Skill Trident Z RGB 32GB (2x16) DDR4-3600 CL18 --> replaced with G.SKILL Trident Z Neo Series DDR4 32GB (2x16GB) 3600MT/s CL16
MOTHERBOARD: MSI PRO B550M-VC WIFI Micro ATX AM4
CACHE DRIVE: Crucial 1TB SSD --> replaced with Crucial P510 1TB SSD
PSU: Corsair RM850e (2025) 850W
GPU: MSI RTX 2070 SUPER
STORAGE: 2 10TB WD Red Plus, 1 8TB WD Red Plus, 1 8TB Seagate Barracuda, 2 2TB Seagate Barracuda

SYMPTOMS:
Began with consistent and increasing frequency BTRFS corruption errors on the cache drive
With random kernel panics or reboots every week or so
After RAM was replaced, no errors or panics for about a week and then everything came back again
Has progressed to continuous BTRFS errors and kernel panics and segfaults under load and at idle (and in safe mode) of completely random services/applications. Every single panic has cited a different application exiting as far as I can tell.
Kernel panics always require full restart before server can even be accessed via WebGUI
Also just noticed there are some occasional I/O errors on dev loop2. Not sure what that means

STEPS TAKEN SO FAR:
Replaced NVMe cache drive and fully wiped and reformatted
Ran memtest on first set of RAM and was hit with immediate and numerous errors on both sticks together and each individually. Replaced RAM and ran 8 hours of memtest with no errors.
Replaced CPU
Updated from Unraid 7.1.4 to 7.2.3
Reformatted cache drive from BTRFS to ZFS (did nothing)
Followed instructions from AMD FAQ post on forum (changed C states to disabled, typical current idle, still crashed)
Recreated docker image multiple times
Updated BIOS to latest stable build

Please let me know if anything is visible from the diagnostics or the syslog or if more information is needed on my end. I am close to just replacing the motherboard and PSU just to see if that will help at this point.

Attachments: theark-diagnostics-20260106-1812.zip, syslog.txt

Edited January 7 by noahofberks: added some more steps that I have taken to fix
January 7 (Author)

I have recreated the docker image about three separate times at this point. It should have been fully recreated after the RAM was replaced, since I replaced the cache drive at the same time and wiped my appdata and system shares. The most recent crash/panic actually happened after I recreated the docker image and re-added a few of my containers.
January 7 (Author)

Quick update on this issue. I have rebooted again and left the server on for a few hours to collect more crash data, and I am attaching that log to this post. Additionally, I have run memtest again and am now seeing immediate, high-frequency errors. This is confusing, as this RAM passed 8 hours of memtest with no problems when I first installed it less than a month ago. Could something else be messing with the RAM to cause these problems and show errors in memtest, or did I just get incredibly unlucky and get more bad RAM?

Attachment: syslog-detailed.txt
January 7

If you are still getting RAM errors, most likely the RAM is the problem. Note that Memtest doesn't always detect them. I would run the server with just one stick of RAM; if it still crashes, try the other one. That will basically rule out bad RAM.
January 8 (Author)

Alright, I have done some testing of the RAM sticks. Both sticks show almost immediate errors in memtest when tested individually, and they show the same errors in another system I have. I am not entirely sure what this means. I did boot the server with just one stick (one that errored in memtest individually) and it was able to stay up for a lot longer than with both in. Not sure if this is just a random crash pattern or a symptom. I will likely replace the motherboard and PSU this weekend unless anyone thinks of anything else.
January 8

You must not attempt to run any computer unless memory is working perfectly. Everything goes through RAM. The OS and other executable code, your data, EVERYTHING. The CPU can't do anything with anything until it is loaded into RAM.
January 8 (Author)

Yes, this makes sense. I was trying to see if the errors were repeatable on different (old, no longer in use) hardware, and it appears they are. Not sure if this means I just got a bad memory kit with two defective sticks or if there is something else causing them to fail.