
Back again with more crashing issues!


seecs2011


Over the last few years I've had issues off and on with my unraid server crashing. At first it was a c-state issue that I fixed. Then a year later, my mobo re-enabled all that stuff somehow and I made changes to c-states again as well as how the machine delivered power at idle. Both times I got stability for a few months. 

 

A couple of weeks ago, the crashes started speeding up again. I found some time to check the BIOS, assuming the settings had magically changed again, but they hadn't. I've run memtest for about 15 hours and no errors were found. What else should I check to find the root cause of the crashes? I disconnected my UPS because I had heard that it could be buggy and cause random shutdowns, but this is a hard reboot. I am at the point now where it happens every 3-5 hours, give or take. It happens regardless of whether a parity check is running (last time, it was consistently after parity finished). SMART data is all good except for a couple of one-off command errors (one of the attribute 188 codes), but I checked Seagate and they are not the issue.

 

Thoughts at this point are, in order:

 

PSU

CPU

MOBO

 

 

Not sure what else it could be. Setup is:

 

MOBO: Gigabyte Technology Co., Ltd. B550M DS3H , Version Default string
American Megatrends International, LLC., Version F19
BIOS dated: Fri 22 Mar 2024 12:00:00 AM CDT

CPU: AMD Ryzen 7 5700G with Radeon Graphics @ 3800 MHz

RAM: 64 GB Gskill

PSU: Corsair RM650x

 

Diagnostic from yesterday attached.

tower-diagnostics-20241028-0821.zip

Link to comment

There are multiple segfaults from this, but I don't know what "P" is:

 

Dec  4 06:50:07 TOWER kernel: P[proc][21012]: segfault at 100000002 ip 0000149a8c174d59 sp 0000149a79ccc1b8 error 4 in libc.so.6[149a8c045000+155000] likely on CPU 13 (core 5, socket 0)

 

One thing you can try is to boot the server in safe mode with all docker containers/VMs disabled and let it run as a basic NAS for a few days. If it still crashes, it's likely a hardware problem; if it doesn't, start turning on the other services one by one, including the docker containers.
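If more of these segfault lines show up, it may help to pull the fields out and see whether they cluster on one CPU or one library. Below is a minimal Python sketch of that idea. The regex is written against the single line quoted above, so treat it as an assumption about the format rather than a complete parser, and the function names `parse_segfault` and `segfaults_per_cpu` are made up for illustration.

```python
import re
from collections import Counter

# Pattern modelled on the one kernel segfault line quoted in this thread.
# The "in <object>[base+size]" and "likely on CPU N" parts are treated as optional.
SEGFAULT_RE = re.compile(
    r"kernel: (?P<comm>.+?)\[(?P<pid>\d+)\]: segfault at (?P<addr>[0-9a-f]+) "
    r"ip (?P<ip>[0-9a-f]+) sp (?P<sp>[0-9a-f]+) error (?P<err>\d+)"
    r"(?: in (?P<obj>\S+?)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\])?"
    r"(?: likely on CPU (?P<cpu>\d+))?"
)


def parse_segfault(line):
    """Return a dict of fields for a segfault line, or None for any other line."""
    m = SEGFAULT_RE.search(line)
    if m is None:
        return None
    rec = m.groupdict()
    rec["pid"] = int(rec["pid"])
    rec["cpu"] = int(rec["cpu"]) if rec["cpu"] else None
    # Offset of the faulting instruction inside the mapped object (ip - base).
    rec["offset"] = (
        int(rec["ip"], 16) - int(rec["base"], 16) if rec["base"] else None
    )
    return rec


def segfaults_per_cpu(lines):
    """Count segfault lines per logical CPU number."""
    hits = (parse_segfault(line) for line in lines)
    return Counter(h["cpu"] for h in hits if h)
```

Feeding it the syslog from the diagnostics zip, for example `segfaults_per_cpu(open("syslog.txt"))` (the file name is an assumption), gives a quick count per CPU. Faults that always land on the same core would point more toward hardware, while faults spread across cores but always in the same process would point more toward software.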

Link to comment
11 minutes ago, JorgeB said:

There are multiple segfaults from this, but I don't know what "P" is:

 

Dec  4 06:50:07 TOWER kernel: P[proc][21012]: segfault at 100000002 ip 0000149a8c174d59 sp 0000149a79ccc1b8 error 4 in libc.so.6[149a8c045000+155000] likely on CPU 13 (core 5, socket 0)

 

One thing you can try is to boot the server in safe mode with all docker containers/VMs disabled and let it run as a basic NAS for a few days. If it still crashes, it's likely a hardware problem; if it doesn't, start turning on the other services one by one, including the docker containers.

The segfault hasn't happened in the last calendar year though and there have been dozens of crashes in the last few weeks. Also not many crashes around those segfault messages. 

 

Some of the crashes seem to happen after several attempts of this:

Tower flash_backup: adding task: /usr/local/emhttp/plugins/dynamix.my.servers/scripts/UpdateFlashBackup update

 

 

Are there any processes for hardware troubleshooting that don't involve rip and replace, and not being able to return used items bought for the test? Or could that flash_backup be an issue?
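One way to check the flash_backup hunch rather than eyeballing it is to count how many of those "adding task" lines appear shortly before each reboot. The Python sketch below does that under several assumptions: the syslog survives the crash (for example, it is mirrored somewhere persistent), each boot logs a `kernel: Linux version` line, timestamps use the classic syslog format without a year, and the 30-minute window is an arbitrary choice. The function names are hypothetical.

```python
from datetime import datetime, timedelta

BOOT_MARKER = "kernel: Linux version"      # assumed to be logged once per boot
TASK_MARKER = "flash_backup: adding task"  # the message quoted above


def parse_ts(line, year=2024):
    """Parse a 'Dec  4 06:50:07' style prefix; the year is an assumption."""
    return datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")


def tasks_before_boots(lines, window=timedelta(minutes=30)):
    """For each boot, count flash_backup task lines logged within `window` before it."""
    recent = []   # timestamps of task lines seen since the previous boot
    counts = []   # list of (boot time, number of task lines inside the window)
    for line in lines:
        if TASK_MARKER in line:
            recent.append(parse_ts(line))
        elif BOOT_MARKER in line:
            boot = parse_ts(line)
            counts.append((boot, sum(1 for t in recent if boot - t <= window)))
            recent = []
    return counts
```

If most boots show several task lines in the window and quiet periods show none, that is a correlation worth chasing. If the counts look the same before crashes as at any other time, flash_backup is probably just background noise. The sketch ignores logs that span a year boundary.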

Link to comment

I've run extensive testing there and am about 95% sure it isn't RAM. Interestingly - I saw that the writes to my USB were kinda high, sometimes dozens of writes at a time. 

 

image.png.4096e6d30fd71b791fcead710b5a892d.png

 

For example, that pic was after about an hour of runtime. As stated in my previous reply, I'm thinking there may be something to that based on the syslog around the crashes. This led me to another thread about a high number of writes, where they talked about uninstalling the unassigned devices preclear and myservers plugins. I can't find the myservers plugin though (even though that log file references it). I removed preclear, as I had that installed, and now my USB interactions look like this after 8-9 hours:

 

image.thumb.png.b68df2ff30679a2ff5b48ecc8f0ee513.png

 

That also means I have doubled my uptime average. It does seem that there was one overnight crash after around 9 or 10 hours instead of what had been 3-5... So we're getting somewhere, I think. 

 

Any other thoughts on root cause based on that information?
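For anyone wanting numbers instead of screenshots, the write counters for the flash drive can be sampled from `/proc/diskstats` before and after an interval. This is a rough Python sketch, not an Unraid tool: the device name `sda` is an assumption (check which device the flash actually is), and the helper names are made up. It relies on the documented diskstats layout, where the eighth field is writes completed and the tenth is sectors written, in 512-byte units.

```python
import time


def parse_diskstats(text):
    """Map device name -> write counters from the contents of /proc/diskstats."""
    stats = {}
    for row in text.splitlines():
        fields = row.split()
        if len(fields) < 11:
            continue  # skip blank or malformed rows
        stats[fields[2]] = {
            "writes": int(fields[7]),            # writes completed
            "sectors_written": int(fields[9]),   # always counted in 512-byte sectors
        }
    return stats


def write_delta(before, after, dev):
    """Writes and bytes written to `dev` between two snapshots."""
    b, a = before[dev], after[dev]
    return {
        "writes": a["writes"] - b["writes"],
        "bytes_written": (a["sectors_written"] - b["sectors_written"]) * 512,
    }


def sample(dev="sda", seconds=3600):
    """Take two snapshots `seconds` apart; 'sda' is only a guess at the flash device."""
    with open("/proc/diskstats") as f:
        before = parse_diskstats(f.read())
    time.sleep(seconds)
    with open("/proc/diskstats") as f:
        after = parse_diskstats(f.read())
    return write_delta(before, after, dev)
```

Running `sample()` once with a plugin installed and once after removing it gives a like-for-like comparison over the same interval, which is easier to reason about than totals since boot.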

Link to comment
38 minutes ago, mathomas3 said:

Take this with a grain of salt... but this kinda feels like a PSU issue... Do you have a spare one you could use/test?

 

Running the system without the dockers/services does reduce power draw(cpu usage, hdd write/read) making it easier for the PSU

 

IMO that's where I would start

I have been leaning more and more in that direction. I don't have a spare, unfortunately, but I'm watching over the next couple of days to see if some of the changes I've made help. Based on the other things I've tested, I'm at the point where it's either the PSU or the UnRAID OS. You might have solidified my choice to grab a PSU to have around as a spare anyway...

Link to comment

You could also try to boot into a live OS (Linux Mint or whatever) and run a stress test on it (I'm sure there is something out there you could find to do this, YouTube on loop?). If it crashes there, then you know... if it doesn't... well, that can't really rule out the PSU, but... it's something

 

To eliminate the OS side of the house, you could try a fresh install of unraid but I have found that the OS is good and stable

 

Just something to consider/try

 

Link to comment

 

10 hours ago, mathomas3 said:

having random issues like this are always frustrating to work out... if anything get the PSU test it... and send it back if that doesnt solve it

Thinking more and more it is power related - just had it crash after a 16 hour uptime randomly. Thought I should cancel parity check since it keeps trying to do it on reboot and then it immediately crashed again which I missed. Just cancelled that parity check and am waiting to see. I tried to up the idle load via bios but yeah. PSU gets here Friday and I'm off that day so I know what I'll be working on...

Link to comment
On 10/30/2024 at 11:35 AM, mathomas3 said:

having random issues like this are always frustrating to work out... if anything get the PSU test it... and send it back if that doesnt solve it

Welp. Got the new psu installed yesterday and the crashing actually got slightly worse if anything. Ruling that out... Noticing again that the writes to flash are ticking up... Can't figure out what is causing that though as the git logs are not showing anything being updated...

Link to comment
