seecs2011, posted Tuesday at 02:21 PM:

Over the last few years I've had issues off and on with my unraid server crashing. At first it was a c-state issue that I fixed. Then a year later, my mobo re-enabled all that stuff somehow and I made changes to c-states again as well as how the machine delivered power at idle. Both times I got stability for a few months.

A couple weeks ago, crashes started speeding up again. I found some time to check BIOS, assuming that the changes had magically been reverted again, but they hadn't. I've run memtest for about 15 hours and no errors were found. What else should I check for root cause of crashes?

I disconnected my UPS because I had heard that it could be buggy and cause random shutdowns, but this is a hard reboot. I am at the point now where it is happening every 3-5 hours give or take. It happens regardless of whether a parity check is running (last time, it was consistently after parity finished). SMART data is all good except for a couple of one-off command errors, but I checked Seagate and they are not the issue (one of the 188 attribute codes).

Thoughts at this point are, in order:
1. PSU
2. CPU
3. MOBO

Not sure what else it could be.

Setup is:
MOBO: Gigabyte Technology Co., Ltd. B550M DS3H, Version Default string
BIOS: American Megatrends International, LLC., Version F19, dated Fri 22 Mar 2024 12:00:00 AM CDT
CPU: AMD Ryzen 7 5700G with Radeon Graphics @ 3800 MHz
RAM: 64 GB Gskill
PSU: Corsair RM650x

Diagnostic from yesterday attached: tower-diagnostics-20241028-0821.zip
JorgeB, posted Tuesday at 02:29 PM:

Enable the syslog server and post that after a crash.
seecs2011 (Author), posted Tuesday at 06:52 PM:

4 hours ago, JorgeB said:
> Enable the syslog server and post that after a crash.

Should note: syslog has been next to useless, but here is the file. The first log line after each reboot is always "Tower root: Delaying execution of fix common problems scan for 10 minutes", so you can see from those timestamps how often it crashes now.

syslog-192.168.1.124.log
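Since the post-reboot marker is the same every time, the crash frequency can be read out of the syslog mechanically. The following is only a rough sketch of that idea, not anything posted in the thread: the function names are made up for illustration, it assumes the classic syslog timestamp prefix seen in the quoted log lines (month, day, time, no year), and the year has to be supplied by hand. The gap between two markers is roughly uptime plus boot time, not exact uptime.

```python
from datetime import datetime

# Marker quoted in the post above: the first line logged after each reboot.
REBOOT_MARKER = "Delaying execution of fix common problems scan"


def reboot_times(lines, year):
    """Return the timestamp of every reboot marker found in the syslog lines.

    `year` is an assumption supplied by the caller, because the classic
    syslog prefix ("Dec  4 06:50:07 host ...") does not carry one.
    """
    times = []
    for line in lines:
        if REBOOT_MARKER not in line:
            continue
        # split() copes with the padded day-of-month ("Dec  4").
        month, day, clock = line.split()[:3]
        stamp = f"{year} {month} {day} {clock}"
        times.append(datetime.strptime(stamp, "%Y %b %d %H:%M:%S"))
    return times


def gaps_between_reboots(times):
    """Time between consecutive reboot markers (about uptime plus boot time)."""
    return [later - earlier for earlier, later in zip(times, times[1:])]
```

To use it, open the saved syslog file, pass the file object to `reboot_times` along with the year, and print the result of `gaps_between_reboots`. A log that spans New Year would need to be split first, since a single year is applied to every line.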
JorgeB, posted Tuesday at 07:02 PM:

There are multiple segfaults from this, but I don't know what "P" is:

Dec 4 06:50:07 TOWER kernel: P[proc][21012]: segfault at 100000002 ip 0000149a8c174d59 sp 0000149a79ccc1b8 error 4 in libc.so.6[149a8c045000+155000] likely on CPU 13 (core 5, socket 0)

One thing you can try is to boot the server in safe mode with all docker containers/VMs disabled and let it run as a basic NAS for a few days. If it still crashes, it's likely a hardware problem; if it doesn't, start turning on the other services one by one, including the docker containers.
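For anyone trying to read that segfault line, a small parser can pull the fields apart. This is a sketch only: the regular expression is written against the single line quoted above and may not match other kernel versions, and the error-code decoding assumes an x86-64 kernel (bit 0 = page was present, bit 1 = write access, bit 2 = user mode). The function and key names are illustrative.

```python
import re

# Pattern written against the one kernel line quoted in the thread.
SEGFAULT_RE = re.compile(
    r"kernel: (?P<comm>.+?)\[(?P<pid>\d+)\]: segfault at (?P<addr>[0-9a-f]+) "
    r"ip (?P<ip>[0-9a-f]+) sp (?P<sp>[0-9a-f]+) error (?P<error>\d+) "
    r"in (?P<object>[^\[\s]+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)


def parse_segfault(line):
    """Split a kernel segfault line into fields, or return None if no match."""
    m = SEGFAULT_RE.search(line)
    if m is None:
        return None
    error = int(m["error"])
    return {
        "comm": m["comm"],                      # process name as logged
        "pid": int(m["pid"]),
        "fault_address": int(m["addr"], 16),    # address that was accessed
        "object": m["object"],                  # mapped file holding the code
        # Instruction pointer relative to the start of that mapping.
        "offset_in_mapping": int(m["ip"], 16) - int(m["base"], 16),
        # x86 page-fault error code bits (assumption: x86-64 kernel).
        "page_present": bool(error & 1),
        "write_access": bool(error & 2),
        "user_mode": bool(error & 4),
    }


LINE = (
    "Dec 4 06:50:07 TOWER kernel: P[proc][21012]: segfault at 100000002 "
    "ip 0000149a8c174d59 sp 0000149a79ccc1b8 error 4 in "
    "libc.so.6[149a8c045000+155000] likely on CPU 13 (core 5, socket 0)"
)
info = parse_segfault(LINE)
```

Read this way, the quoted line describes a user-mode read of an unmapped address (0x100000002) from code inside libc, in a process logged as "P[proc]" with PID 21012. That points at a bad pointer in that one process; on its own it does not say whether the cause is software or hardware.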
seecs2011 (Author), posted Tuesday at 07:19 PM:

11 minutes ago, JorgeB said:
> There are multiple segfaults from this, but I don't know what "P" is:
> Dec 4 06:50:07 TOWER kernel: P[proc][21012]: segfault at 100000002 ip 0000149a8c174d59 sp 0000149a79ccc1b8 error 4 in libc.so.6[149a8c045000+155000] likely on CPU 13 (core 5, socket 0)
> One thing you can try is to boot the server in safe mode with all docker containers/VMs disabled and let it run as a basic NAS for a few days. If it still crashes, it's likely a hardware problem; if it doesn't, start turning on the other services one by one, including the docker containers.

The segfault hasn't happened in the last calendar year though, and there have been dozens of crashes in the last few weeks. There also weren't many crashes around those segfault messages. Some of the crashes seem to happen after several attempts of this:

Tower flash_backup: adding task: /usr/local/emhttp/plugins/dynamix.my.servers/scripts/UpdateFlashBackup update

Are there any processes for hardware troubleshooting that don't involve rip and replace and being stuck with used items I can't return after the test? Or could that flash_backup be an issue?
JorgeB, posted Wednesday at 08:58 AM:

If you have multiple RAM sticks, try using the server with just one. If it's the same, try with a different one; that will basically rule out bad RAM.
seecs2011 (Author), posted Wednesday at 01:36 PM:

I've run extensive testing there and am about 95% sure it isn't RAM.

Interestingly, I saw that the writes to my USB were kinda high, sometimes dozens of writes at a time. For example that pic was after like an hour of runtime. As stated in my previous reply, I'm thinking there may be something to that based on the syslog around the crashes. This led me to another thread about a high number of writes, and they talked about uninstalling unassigned devices preclear and myservers from plugins. I can't find the myservers plugin though (even though that log file references it). I removed preclear as I had that installed, and now my USB interactions look like this after 8-9 hours:

That also means I have doubled my uptime average. It does seem that there was one overnight crash after around 9 or 10 hours instead of what had been 3-5... So we're getting somewhere, I think. Any other thoughts on root cause based on that information?
mathomas3, posted Wednesday at 02:33 PM:

Take this with a grain of salt... but this kinda feels like a PSU issue... Do you have a spare one you could use/test? Running the system without the dockers/services does reduce power draw (cpu usage, hdd write/read), making it easier for the PSU.

IMO that's where I would start.
seecs2011 (Author), posted Wednesday at 03:14 PM:

38 minutes ago, mathomas3 said:
> Take this with a grain of salt... but this kinda feels like a PSU issue... Do you have a spare one you could use/test? Running the system without the dockers/services does reduce power draw (cpu usage, hdd write/read), making it easier for the PSU.
> IMO that's where I would start.

I have been leaning more and more in that direction. I don't have a spare, unfortunately, but am watching over the next couple of days to see if some of the changes I've made help. Based on the other things I've tested, I'm at: it's either PSU or UnRAID OS. You might have solidified my choice to grab a spare PSU to have around anyway...
mathomas3, posted Wednesday at 04:10 PM:

You could also try to boot into a live OS (mint linux or whatever) and run a stress test on it (sure there is something out there you could find to do this, youtube on loop?). If it crashes there then you know... if it doesn't... well, that can't really rule out the PSU, but... it's something.

To eliminate the OS side of the house, you could try a fresh install of unraid, but I have found that the OS is good and stable.

Just something to consider/try.
seecs2011 (Author), posted Wednesday at 04:29 PM:

17 minutes ago, mathomas3 said:
> youtube on loop?

lol - in like 20 chrome tabs! I do have a bootable drive that I could try, and I have a psu on its way. Something to try while I wait for sure, although as we speak I'm on my longest uptime in more than two weeks, at 12 hours even.
mathomas3, posted Wednesday at 04:35 PM:

Random issues like this are always frustrating to work out... If anything, get the PSU, test it, and send it back if that doesn't solve it.
seecs2011 (Author), posted Thursday at 02:56 AM:

10 hours ago, mathomas3 said:
> Random issues like this are always frustrating to work out... If anything, get the PSU, test it, and send it back if that doesn't solve it.

Thinking more and more it is power related: it just crashed randomly after a 16 hour uptime. I thought I should cancel the parity check since it keeps trying to run on reboot, and then it immediately crashed again, which I missed. I just cancelled that parity check and am waiting to see. I tried to up the idle load via bios, but yeah. The PSU gets here Friday and I'm off that day, so I know what I'll be working on...
seecs2011 (Author), posted Thursday at 02:32 PM:

New update: I am noticing a lot of writes to the flash drive again. I checked which files are being updated the most and they are under the .git/objects folder. It's adding 5-10 every few minutes. I had 300 writes an hour ago and am at 600 now. Can't find anything else changing. Not sure why git is updating like crazy...
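One way to repeat that check without clicking through folders is to walk the flash drive and list whatever was modified recently. The sketch below is a generic illustration, not a tool mentioned in the thread: `recently_modified` is a made-up helper name, and the mount point you pass in is an assumption you would need to confirm on your own system (Unraid normally mounts the flash drive at /boot).

```python
import os
import time


def recently_modified(root, max_age_seconds, now=None):
    """List (mtime, path) for files under `root` changed within the window.

    Results are sorted newest first. Files that vanish or cannot be read
    while walking are skipped.
    """
    now = time.time() if now is None else now
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                mtime = os.stat(path).st_mtime
            except OSError:
                continue  # file disappeared or is unreadable
            if now - mtime <= max_age_seconds:
                hits.append((mtime, path))
    hits.sort(reverse=True)
    return hits
```

For example, `recently_modified("/boot", 3600)` would return the files touched in the last hour under that path. Running it a few minutes apart and comparing the output shows which paths keep changing, which is the same observation as the .git/objects files described above.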
seecs2011 (Author), posted 11 hours ago:

On 10/30/2024 at 11:35 AM, mathomas3 said:
> Random issues like this are always frustrating to work out... If anything, get the PSU, test it, and send it back if that doesn't solve it.

Welp. Got the new psu installed yesterday and the crashing actually got slightly worse, if anything. Ruling that out... Noticing again that the writes to flash are ticking up... Can't figure out what is causing that though, as the git logs are not showing anything being updated...
seecs2011 (Author), posted 10 hours ago:

Anyone want to give me a sanity check on this pre-boot stuff?