October 29, 2024

Over the last few years I've had on-and-off issues with my Unraid server crashing. At first it was a C-state issue that I fixed. Then a year later, my mobo somehow re-enabled all of that, and I changed the C-states again as well as how the machine delivers power at idle. Both times I got stability for a few months. A couple of weeks ago, the crashes started speeding up again. I found some time to check the BIOS, assuming the changes had magically been reverted again, but they hadn't. I've run memtest for about 15 hours and no errors were found. What else should I check for the root cause of the crashes?

I disconnected my UPS because I had heard that it could be buggy and cause random shutdowns, but this is a hard reboot. I am at the point now where it is happening every 3-5 hours, give or take. It happens regardless of whether a parity check is running (last time, it was consistently after parity finished). SMART data is all good except for a couple of one-off command errors, but I checked Seagate and they are not the issue (one of the 188 attribute codes).

Thoughts at this point are, in order:
PSU
CPU
MOBO

Not sure what else it could be. Setup is:
MOBO: Gigabyte Technology Co., Ltd. B550M DS3H, Version Default string
American Megatrends International, LLC., Version F19 BIOS dated: Fri 22 Mar 2024 12:00:00 AM CDT
CPU: AMD Ryzen 7 5700G with Radeon Graphics @ 3800 MHz
RAM: 64 GB Gskill
PSU: Corsair RM650x

Diagnostic from yesterday attached.
tower-diagnostics-20241028-0821.zip

October 29, 2024 Author

4 hours ago, JorgeB said: Enable the syslog server and post that after a crash.

I should note that syslog has been next to useless, but here is the file. Note that the first log line after each reboot is always "Tower root: Delaying execution of fix common problems scan for 10 minutes", so you can see from those timestamps how often it crashes now.

syslog-192.168.1.124.log
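
For anyone who wants to turn those boot markers into actual numbers, here is a rough Python sketch. It assumes the file uses the classic syslog prefix ("Oct 28 08:21:03 Tower root: ...") with no year field, so the year has to be supplied by hand and a log that spans New Year would need to be split first. The function names and the sample lines are made up for illustration; only the marker text comes from the post above. The gap between two markers is uptime plus reboot time, so treat the result as approximate.

```python
import re
from datetime import datetime, timedelta

# Marker quoted in the post above: the first line logged after every boot.
BOOT_MARKER = "Delaying execution of fix common problems scan"

# Assumed classic syslog prefix, e.g. "Oct 28 08:21:03 Tower root: ..." (no year).
STAMP = re.compile(r"^(\w{3})\s+(\d{1,2})\s+(\d{2}:\d{2}:\d{2})\s")


def boot_times(lines, year):
    """Return a datetime for every line that contains the boot marker."""
    times = []
    for line in lines:
        if BOOT_MARKER not in line:
            continue
        match = STAMP.match(line)
        if match is None:
            continue  # unexpected prefix: skip rather than guess
        month, day, clock = match.groups()
        stamp = f"{year} {month} {day} {clock}"
        times.append(datetime.strptime(stamp, "%Y %b %d %H:%M:%S"))
    return times


def gaps(times):
    """Time between consecutive boots (roughly uptime plus reboot time)."""
    return [later - earlier for earlier, later in zip(times, times[1:])]


# Made-up sample lines, only to show the shape of the input.
SAMPLE = [
    "Oct 28 08:21:03 Tower root: Delaying execution of fix common problems scan for 10 minutes",
    "Oct 28 09:00:00 Tower kernel: some unrelated message",
    "Oct 28 12:06:03 Tower root: Delaying execution of fix common problems scan for 10 minutes",
]

for gap in gaps(boot_times(SAMPLE, 2024)):
    print(gap)
```

To run it on a real log, replace SAMPLE with the lines read from the syslog file, for example `open("syslog-192.168.1.124.log", errors="replace").readlines()`.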

October 29, 2024 Community Expert

There are multiple segfaults from this, but I don't know what "P" is:

Dec 4 06:50:07 TOWER kernel: P[proc][21012]: segfault at 100000002 ip 0000149a8c174d59 sp 0000149a79ccc1b8 error 4 in libc.so.6[149a8c045000+155000] likely on CPU 13 (core 5, socket 0)

One thing you can try is to boot the server in safe mode with all Docker containers/VMs disabled and let it run as a basic NAS for a few days. If it still crashes, it's likely a hardware problem. If it doesn't, start turning the other services back on one by one, including the Docker containers.
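
As a side note on reading that kernel line, here is a small Python sketch that pulls the fields apart. It assumes the standard x86 page-fault error code layout (bit 0 = page was present, bit 1 = write access, bit 2 = fault came from user mode) and that the number in square brackets is the start address of the mapped region, so the computed offset is relative to that mapping. The function name and the returned keys are made up for illustration.

```python
import re

# Field layout as it appears in the kernel message quoted above.
# All numbers in the message are hexadecimal, including the error code.
SEGFAULT = re.compile(
    r"segfault at (?P<addr>[0-9a-f]+) ip (?P<ip>[0-9a-f]+) sp (?P<sp>[0-9a-f]+) "
    r"error (?P<err>[0-9a-f]+) in (?P<obj>[^\[\s]+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)


def decode_segfault(line):
    """Return the interesting fields of a kernel segfault line, or None."""
    match = SEGFAULT.search(line)
    if match is None:
        return None
    err = int(match["err"], 16)
    ip = int(match["ip"], 16)
    base = int(match["base"], 16)
    return {
        "object": match["obj"],
        "fault_address": hex(int(match["addr"], 16)),
        # Offset of the faulting instruction inside the mapped region.
        "offset_in_mapping": hex(ip - base),
        # Assumed x86 page-fault error code bits.
        "page_was_present": bool(err & 1),
        "was_write": bool(err & 2),
        "from_user_mode": bool(err & 4),
    }


LINE = (
    "Dec 4 06:50:07 TOWER kernel: P[proc][21012]: segfault at 100000002 "
    "ip 0000149a8c174d59 sp 0000149a79ccc1b8 error 4 in "
    "libc.so.6[149a8c045000+155000] likely on CPU 13 (core 5, socket 0)"
)

for key, value in decode_segfault(LINE).items():
    print(f"{key}: {value}")
```

Read this way, "error 4" is a user-mode read of an address with no page mapped. That points at a process dereferencing a bad pointer inside libc, which on its own does not say whether the cause is software or flaky hardware.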

October 29, 2024 Author

11 minutes ago, JorgeB said: There are multiple segfaults from this, but I don't know what "P" is: Dec 4 06:50:07 TOWER kernel: P[proc][21012]: segfault at 100000002 ip 0000149a8c174d59 sp 0000149a79ccc1b8 error 4 in libc.so.6[149a8c045000+155000] likely on CPU 13 (core 5, socket 0) One thing you can try is to boot the server in safe mode with all Docker containers/VMs disabled and let it run as a basic NAS for a few days. If it still crashes, it's likely a hardware problem. If it doesn't, start turning the other services back on one by one, including the Docker containers.

The segfault hasn't happened in the last calendar year though, and there have been dozens of crashes in the last few weeks. There also aren't many crashes around those segfault messages. Some of the crashes seem to happen after several attempts of this:

Tower flash_backup: adding task: /usr/local/emhttp/plugins/dynamix.my.servers/scripts/UpdateFlashBackup update

Is there any process for hardware troubleshooting that doesn't involve rip-and-replace, where I can't return the used items after the test? Or could that flash_backup be an issue?

October 30, 2024 Community Expert

If you have multiple RAM sticks, try running the server with just one. If it behaves the same, try a different one. That will basically rule out bad RAM.

October 30, 2024 Author

I've run extensive testing there and am about 95% sure it isn't RAM.

Interestingly, I saw that the writes to my USB were kinda high, sometimes dozens of writes at a time. For example, that pic was after about an hour of runtime. As stated in my previous reply, I'm thinking there may be something to that based on the syslog around the crashes. This led me to another thread about a high number of writes, where they talked about uninstalling the Unassigned Devices Preclear and My Servers plugins. I can't find the My Servers plugin though (even though that log file references it). I removed Preclear, as I had that installed, and now my USB interactions look like this after 8-9 hours:

That also means I have doubled my average uptime. It does seem that there was one overnight crash after around 9 or 10 hours instead of what had been 3-5... So we're getting somewhere, I think. Any other thoughts on root cause based on that information?

October 30, 2024

Take this with a grain of salt... but this kinda feels like a PSU issue... Do you have a spare one you could use/test?

Running the system without the dockers/services does reduce power draw (CPU usage, HDD reads/writes), making it easier on the PSU.

IMO that's where I would start.

October 30, 2024 Author

38 minutes ago, mathomas3 said: Take this with a grain of salt... but this kinda feels like a PSU issue... Do you have a spare one you could use/test? Running the system without the dockers/services does reduce power draw (CPU usage, HDD reads/writes), making it easier on the PSU. IMO that's where I would start.

I have been leaning more and more in that direction. I don't have a spare, unfortunately, but I'm watching over the next couple of days to see if some of the changes I've made help. Based on the other things I've tested, I'm at either PSU or Unraid OS. You might have solidified my choice to grab a spare PSU to have around anyway...

October 30, 2024

You could also try booting into a live OS (Linux Mint or whatever) and running a stress test on it (I'm sure there is something out there you could find to do this, YouTube on loop?). If it crashes there, then you know... if it doesn't... well, that can't really rule out the PSU, but it's something.

To eliminate the OS side of the house, you could try a fresh install of Unraid, but I have found that the OS is good and stable.

Just something to consider/try.

October 30, 2024 Author

17 minutes ago, mathomas3 said: youtube on loop?

lol - in like 20 Chrome tabs! I do have a bootable drive that I could try, and I have a PSU on its way. Something to try while I wait for sure, although as we speak I'm on my longest uptime in more than two weeks, at 12 hours even.

October 30, 2024

Random issues like this are always frustrating to work out... if anything, get the PSU, test it... and send it back if that doesn't solve it.

October 31, 2024 Author

10 hours ago, mathomas3 said: Random issues like this are always frustrating to work out... if anything, get the PSU, test it... and send it back if that doesn't solve it.

Thinking more and more that it is power related - it just crashed randomly after a 16-hour uptime. I thought I should cancel the parity check, since it keeps trying to run on reboot, and then it immediately crashed again, which I missed. I've just cancelled that parity check and am waiting to see. I tried to up the idle load via the BIOS, but yeah. The PSU gets here Friday and I'm off that day, so I know what I'll be working on...

October 31, 2024 Author

New update: I am noticing a lot of writes to the flash drive again. I checked which files are being updated the most and they are under the .git/objects folder. It's adding 5-10 every few minutes. I had 300 writes an hour ago and am at 600 now. I can't find anything else changing. Not sure why git is updating like crazy...
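
For checking which folders on the flash drive are actually being written to, something like the following Python sketch could help. It assumes the flash is mounted at /boot (the usual Unraid location) and that Python is available, which it is not on a stock Unraid install, so it may be easier to run with the stick mounted on another machine. The function name and the ten-minute window are arbitrary choices for illustration. It only sees files whose modification time changed, so it will not catch every kind of write.

```python
import os
import sys
import time
from collections import Counter


def recent_writes(root, window_seconds, now=None):
    """Count files per folder whose mtime falls within the last window_seconds."""
    now = time.time() if now is None else now
    counts = Counter()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                mtime = os.stat(path).st_mtime
            except OSError:
                continue  # file vanished or is unreadable, skip it
            if now - mtime <= window_seconds:
                # Key by folder relative to the root, e.g. ".git/objects/ab".
                counts[os.path.relpath(dirpath, root)] += 1
    return counts


if __name__ == "__main__" and len(sys.argv) > 1:
    # Example: python recent_writes.py /boot
    for folder, count in recent_writes(sys.argv[1], 600).most_common(20):
        print(f"{count:5d}  {folder}")
```

Running it a few times, some minutes apart, shows whether the same folders keep reappearing. If .git/objects dominates, whatever is committing to that repository is the thing to look for.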

November 3, 2024 Author

On 10/30/2024 at 11:35 AM, mathomas3 said: Random issues like this are always frustrating to work out... if anything, get the PSU, test it... and send it back if that doesn't solve it.

Welp. Got the new PSU installed yesterday and the crashing actually got slightly worse, if anything. Ruling that out... I'm noticing again that the writes to flash are ticking up... I can't figure out what is causing that though, as the git logs are not showing anything being updated...

November 5, 2024 Author

Cool - well, here is the latest.

Replaced PSU - no dice
RAM - no dice
Docker turned off - finally made it through a full parity check, crashed after parity was done
Checked BIOS - no issues to report, all config as it should be
Disks are fine, minus a random error on my original disks that just won't clear - the error ID translates to 1 error in 65538 operations (it's the 188 error that is ignored)

I'm in the danger zone at this point. I can't finish parity checks most of the time, and because it crashes when parity finishes or is cancelled, I cannot invoke the mover and the disks are starting to get into an undesirable state.

Please advise.

EDIT: Crash occurs within roughly 7 minutes of parity finishing or being cancelled.

Edited November 5, 2024 by seecs2011: Added context

November 5, 2024 Community Expert

Board or CPU would be the other main suspects, and since it's a Ryzen CPU, make sure this has been taken care of: https://forums.unraid.net/topic/46802-faq-for-unraid-v6/?do=findComment&comment=819173

November 5, 2024 Author

6 minutes ago, JorgeB said: Board or CPU would be the other main suspects, and since it's a Ryzen CPU, make sure this has been taken care of: https://forums.unraid.net/topic/46802-faq-for-unraid-v6/?do=findComment&comment=819173

As previously mentioned, yes, that was done.

November 5, 2024 Community Expert

Then:

34 minutes ago, JorgeB said: Board or CPU would be the other main suspects

November 9, 2024 Author

Welp - I've now replaced all components and it is rebooting within a minute of the web portal becoming available.

November 9, 2024 Author

Worth noting: safe mode also crashes. The only stability I've had for more than a minute or two has been the memtest I've been running, just to make sure the system can stay up apart from Unraid OS for a bit...

November 9, 2024 Author

Something broke in this process - I reinstalled the old hardware and am getting the same behavior.

November 9, 2024 Author

Now it crashed with an Ubuntu live disk... what the hell? I can't figure out what is wrong with this thing.