Random, consistent, and worsening kernel panics and cache corruption

January 6

Hi everyone,

Ever since I built my server PC in November, I have been having worsening stability issues. I built this server to run a few Docker containers, including Plex and other media management tools. The symptoms began as occasional BTRFS corruption errors on the NVMe cache drive and have escalated within the last month or so to full kernel panics requiring a power cycle. These kernel panics have increased in frequency from once every few days to one after about 20-30 minutes of total uptime. I have tried almost everything at this point, ranging from BIOS setting changes to replacing hardware, and I am honestly completely lost as to how to fix this issue. I will list out the symptoms, my hardware before and after changes, and all steps I have taken to address the issue so far. I have also attached my diagnostics file and the syslog from right after the last kernel panic I experienced. Thanks.

HARDWARE:

CPU: AMD Ryzen 7 3700X --> replaced with AMD Ryzen 7 5700X
RAM: G.Skill Trident Z RGB 32GB (2x16) DDR4-3600 CL18 --> replaced with G.SKILL Trident Z Neo Series DDR4 32GB (2x16GB) 3600MT/s CL16
MOTHERBOARD: MSI PRO B550M-VC WIFI Micro ATX AM4
CACHE DRIVE: Crucial 1TB SSD --> replaced with Crucial P510 1TB SSD
PSU: Corsair RM850e (2025) 850W

GPU: MSI RTX 2070 SUPER

Storage: 2 10TB WD Red Plus, 1 8TB WD Red Plus, 1 8TB Seagate Barracuda, 2 2TB Seagate Barracuda

SYMPTOMS:

Began with BTRFS corruption errors on the cache drive, consistent and increasing in frequency
Along with random kernel panics or reboots every week or so
After the RAM was replaced, no errors or panics for about a week, and then everything came back
Has progressed to continuous BTRFS errors, kernel panics, and segfaults of completely random services/applications, under load and at idle (and in safe mode). As far as I can tell, every single panic has cited a different application exiting. Kernel panics always require a full restart before the server can even be accessed via the WebGUI
Also just noticed some occasional I/O errors on dev loop2. Not sure what that means
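
On the dev loop2 errors: a loop device is a block device backed by an ordinary file, so I/O errors on it point at whatever file sits behind it. On Unraid that is often the Docker image, but that is an assumption here and not something confirmed from the diagnostics. A minimal sketch for checking, assuming the standard Linux sysfs attribute /sys/block/<dev>/loop/backing_file; the helper name loop_backing_file is made up for illustration:

```python
from pathlib import Path
from typing import Optional


def loop_backing_file(device: str, sysfs_root: str = "/sys/block") -> Optional[str]:
    """Return the file backing a loop device such as 'loop2', or None if unknown."""
    attr = Path(sysfs_root) / device / "loop" / "backing_file"
    try:
        return attr.read_text().strip()
    except OSError:
        # Device not present, not a configured loop device, or not readable.
        return None


if __name__ == "__main__":
    # Prints the backing file path for loop2, or None if it cannot be determined.
    print(loop_backing_file("loop2"))
```

If the path printed is the Docker image, the loop2 errors are most likely the same corruption seen elsewhere showing up inside that image, rather than a separate fault.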

STEPS TAKEN SO FAR:

Replaced NVMe cache drive and fully wiped and reformatted
Ran memtest on first set of RAM and was hit with immediate and numerous errors on both sticks together and each individually. Replaced RAM and ran 8 hours of memtest with no errors.
Replaced CPU
Updated from Unraid 7.1.4 to 7.2.3
Reformatted cache drive from BTRFS to ZFS (made no difference)
Followed the instructions from the AMD FAQ post on the forum (disabled C-states and set typical current idle; still crashed)
Recreated docker image multiple times
Updated BIOS to latest stable build

Please let me know if anything is visible from the diagnostics or the syslog or if more information is needed on my end. I am close to just replacing the motherboard and PSU just to see if that will help at this point.

theark-diagnostics-20260106-1812.zip syslog.txt

Edited January 7 by noahofberks
added some more steps that I have taken to fix

January 7

Did you also recreate your docker.img after fixing your RAM problems?

January 7

Author

I have recreated the Docker image about three separate times at this point. It should have been fully recreated after the RAM was replaced, as I replaced the cache drive at the same time and wiped my appdata and system shares. The most recent crash/panic actually happened after I recreated the Docker image and re-added a few of my containers.

January 7

Author

Quick update on this issue. I have rebooted again and left the server on for a few hours to collect more crash data, and I am attaching that log to this post. Additionally, I have run memtest again and am now seeing immediate, high-frequency errors. This is confusing, as this RAM passed 8 hours of memtest with no problems when I first installed it less than a month ago. Could something else be messing with the RAM to cause these problems and show errors in memtest, or did I just get incredibly unlucky and get more bad RAM?

syslog-detailed.txt
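
For anyone going through a log like this, one way to check whether the faulting process really is different every time is to tally the segfault lines per process name. A minimal sketch, assuming the usual syslog shape "kernel: name[pid]: segfault at ..."; the function name and the sample lines are made up for illustration and are not taken from the attached logs:

```python
import re
from collections import Counter
from typing import Iterable

# Assumed shape of a kernel segfault line in syslog: "kernel: <name>[<pid>]: segfault at ..."
SEGFAULT_RE = re.compile(r"kernel: (?P<name>.+?)\[(?P<pid>\d+)\]: segfault at ")


def tally_segfaults(lines: Iterable[str]) -> Counter:
    """Count segfault messages per process name."""
    counts: Counter = Counter()
    for line in lines:
        match = SEGFAULT_RE.search(line)
        if match:
            counts[match.group("name")] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical sample lines for illustration only.
    sample = [
        "Jan  7 10:01:02 host kernel: python3[4121]: segfault at 0 ip 0000000000000000 sp 0000000000000000 error 4",
        "Jan  7 10:05:40 host kernel: nginx[5210]: segfault at 8 ip 0000000000000000 sp 0000000000000000 error 6",
        "Jan  7 10:09:13 host kernel: python3[6002]: segfault at 0 ip 0000000000000000 sp 0000000000000000 error 4",
        "Jan  7 10:09:14 host kernel: some unrelated message",
    ]
    for name, count in tally_segfaults(sample).most_common():
        print(name, count)
```

A tally spread thinly across many unrelated programs fits a hardware fault such as bad RAM better than a bug in any one application.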

January 7

If you are still getting RAM errors, most likely the RAM is the problem. Note that Memtest doesn't always detect them.

I would run the server with just one stick of RAM; if it still crashes, try the other one. That will basically rule out bad RAM.

January 8

Author

Alright, I have done some testing of the RAM sticks. Both sticks show almost immediate errors in memtest individually, and they show the same errors in another system I have. I am not entirely sure what this means. I did boot the server with just one stick (one that errored in memtest individually) and it was able to stay up a lot longer than with both in. Not sure if this is just a random crash pattern or a symptom. I will likely replace the motherboard and PSU this weekend unless anyone thinks of anything else.

January 8

You must not attempt to run any computer unless memory is working perfectly. Everything goes through RAM. The OS and other executable code, your data, EVERYTHING. The CPU can't do anything with anything until it is loaded into RAM.
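
To illustrate why bad RAM shows up as filesystem corruption errors: a single flipped bit in a block held in memory is enough to make its checksum stop matching. A rough sketch, using CRC-32 from Python's zlib purely as a stand-in for whatever checksum the filesystem really uses; flip_bit is a made-up helper:

```python
import zlib


def flip_bit(data: bytes, bit_index: int) -> bytes:
    """Return a copy of data with one bit inverted, simulating a single RAM bit error."""
    corrupted = bytearray(data)
    corrupted[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(corrupted)


if __name__ == "__main__":
    block = bytes(4096)  # a 4 KiB block of zeros standing in for file data
    stored_checksum = zlib.crc32(block)  # checksum recorded when the block was written

    damaged = flip_bit(block, 12345)  # one bit flips while the block sits in RAM
    # Compare the checksum of the damaged block against the stored one.
    print(zlib.crc32(damaged) == stored_checksum)
```

The same flipped bit landing in executable code or a pointer instead of file data is what produces segfaults and panics in unrelated programs.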

January 8

Author

Yes, this makes sense. I was trying to see if the errors were repeatable on different (old, no longer in use) hardware, and it appears they are. Not sure if this means I just got a bad memory kit with two defective sticks or if there is something else causing them to fail.
