calabriel Posted June 17 (edited)

Hello! I have been using Unraid for about a month and a half now, and have been dealing with random crashes essentially the entire time. I've lurked on this forum and the subreddit, looking to solve my own issues, but have had no luck. My original system was an old gaming PC that I had from around 2017, but now I am running a system with 100% new parts and it is still crashing.

My server will crash randomly. Sometimes it stays on for 6 hours, sometimes 30 hours, sometimes 2 hours. I have set up syslog to a local share and mirrored it to my flash drive, but there is no rhyme or reason in the syslog. I found a thread about an issue with 1st Gen Ryzen CPUs, so I replaced mine with an Intel chip, along with most other parts, just to do a full upgrade. I'm still encountering crashes.

I have set it up today with just two of my Docker containers running and we are at 9 hours of runtime. I am going to leave all my other containers off and see if I can make it more than three days without a crash. If so, I'm going to add a few more containers and see how things go from there. I've also attached my diags as well as my syslog from the last three days, in case anyone can help me find a reason before my testing ends.

Syslog is already being forwarded to the internal syslog server. In this testing, I'd also like to forward the logs from all of my Docker containers to the same share as my Unraid syslog, but I am not familiar enough with Docker yet to know how to do that. I'm doing my own research into that, but if anyone here knows of Unraid-specific commands I would be very grateful. Thanks!

EDIT: I have enabled log forwarding from my Docker containers to the syslog server using the following options:

--log-driver syslog --log-opt syslog-address=udp://x.x.x.x:514 --log-opt tag={{.Name}}

I haven't had a crash since my original post, so I am going to let it ride for a couple of days before enabling other containers.
Attachments: kirane-diagnostics-20240615-2300.zip, Syslog_Since_Rebuild.log

Edited June 18 by calabriel (Update)
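For anyone wanting to reproduce the log forwarding from the edit above, here is a minimal sketch. The helper name `syslog_log_opts` and the address `192.0.2.10` are placeholders made up for illustration; the three flags themselves are the ones quoted in the post. On Unraid these would most likely go in the container template's Extra Parameters field, though that placement is an assumption, since the post does not say where they were entered.

```shell
# Hypothetical helper: prints the Docker logging flags from the post for a
# given syslog server. $1 is the server address, $2 the UDP port (default 514).
syslog_log_opts() {
  printf '%s\n' "--log-driver syslog --log-opt syslog-address=udp://$1:${2:-514} --log-opt tag={{.Name}}"
}

# Example (192.0.2.10 is a placeholder address, not from the thread):
#   syslog_log_opts 192.0.2.10
# Paste the printed flags into the container's extra parameters, or append
# them to a plain `docker run` command.
```

The `tag={{.Name}}` option labels each forwarded line with the container name, which makes it easier to tell containers apart in a shared syslog file.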
JorgeB Posted June 18

There are multiple apps segfaulting. Start by running memtest, but keep in mind that memtest is only definitive if it finds errors. If you have multiple sticks, try using the server with just one; if the crashes continue, try a different one. That will basically rule out bad RAM.
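To check a saved syslog for the segfaults mentioned above, something like the following may help. It is a rough sketch: the function names are made up, and it assumes the kernel lines look like `... kernel: NAME[PID]: segfault at ...`, which may not match every system.

```shell
# Count kernel segfault lines in a syslog read from stdin.
segfault_count() {
  grep -c 'segfault at'
}

# Tally segfaults per process name, most frequent first.
# Assumes lines shaped like: "... kernel: NAME[PID]: segfault at ..."
segfaults_by_process() {
  grep 'segfault at' |
    sed -E 's/.*kernel: ([^[]+)\[[0-9]+\]: segfault at.*/\1/' |
    sort | uniq -c | sort -rn
}

# Example, using the log attached to the first post:
#   segfault_count < Syslog_Since_Rebuild.log
#   segfaults_by_process < Syslog_Since_Rebuild.log
```

Segfaults spread across many unrelated processes point more toward hardware (often RAM) than toward a single misbehaving app.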
calabriel (Author) Posted June 18 (edited)

I ran Memtest for three passes on the original used RAM in the AMD build, and then for one pass on the brand new RAM I put into the Intel build. There were no errors found. I can run it again on the new RAM to really bake it in. I do have multiple sticks as well, and pulling those is easy enough. Having the same or a similar issue on two different sets of RAM bought years apart would be an interesting coincidence. Thanks, I'll be back with results!

Edited June 18 by calabriel
JorgeB Posted June 18 (Solution)

Something is causing the data corruption, and 9 times out of 10 it's RAM related, but it could be a different issue. After any change you make, scrub the pool, then reset the stats, and keep monitoring for more corruption errors. If new ones appear, there's still a problem. More info below:

https://forums.unraid.net/topic/46802-faq-for-unraid-v6/?do=findComment&comment=700582
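The scrub, reset, and monitor cycle described above might look like this on a ZFS pool. This is a sketch under assumptions: `cache` is a hypothetical pool name, `pool_is_clean` is a made-up helper, and it assumes "reset the stats" corresponds to `zpool clear` on ZFS (the linked FAQ is the authority on the exact steps).

```shell
# Sketch only: "cache" is a hypothetical pool name; substitute your own.
# Typical cycle after a hardware change (run manually on the server):
#   zpool scrub cache       # start a scrub, then wait for it to finish
#   zpool status -v cache   # list any files with permanent errors
#   zpool clear cache       # reset the pool's error counters
#
# pool_is_clean reads `zpool status` output on stdin and succeeds only if
# ZFS reports no known data errors.
pool_is_clean() {
  grep -q 'errors: No known data errors'
}

# Example check to repeat over the following days:
#   zpool status -v cache | pool_is_clean && echo "clean" || echo "errors found"
```

Because the counters are cleared after each scrub, any error that shows up later is new, which is what makes the monitoring step meaningful.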
calabriel (Author) Posted June 18

Good to know. I'll have to check in on this after work. I did run the ZFS scrub once: the pool showed no errors before the scrub, but one checksum error after it. I'll update in about 12-13 hours when I get back home and can run Memtest and check on the cache error.
calabriel (Author) Posted June 18

"zpool status -v" showed one error in a Tdarr file. Tdarr is currently off and I don't need that file, so I deleted it and re-ran the scrub. The pool currently shows as clean, my syslog shows normal activity from Unraid and all active containers, and we have the longest uptime in a week or so. I'm still going to run Memtest, probably overnight tonight.
calabriel (Author) Posted June 19

I shut down the server to run Memtest, and we have major errors (561 of them and counting, in pass 2). I'm refunding this set of RAM, since I got it less than a week ago, and I'm getting a new set. Before shutdown, I had 48 hours of uptime, and I think everything was stable without Tdarr. I think I'm clear once this RAM is replaced and I find an alternative to Tdarr. Thanks Jorge!