Back again with more crashing issues!

October 29, 2024

Over the last few years I've had issues off and on with my unraid server crashing. At first it was a c-state issue that I fixed. Then a year later, my mobo re-enabled all that stuff somehow and I made changes to c-states again as well as how the machine delivered power at idle. Both times I got stability for a few months.

A couple weeks ago, crashes started speeding up again. I found some time to check the BIOS, assuming the settings had magically been reset again, but they hadn't. I've run memtest for about 15 hours and no errors were found. What else should I check for the root cause of the crashes? I disconnected my UPS because I had heard that it could be buggy and cause random shutdowns, but this is a hard reboot. I am at the point now where it is happening every 3-5 hours, give or take. It happens regardless of whether a parity check is running (last time, it was consistently after parity finished). SMART data is all good except for a couple of one-off command errors, but I checked Seagate and they are not the issue (one of the 188 attribute codes).

Thoughts at this point are, in order:

PSU

CPU

MOBO

Not sure what else it could be. Setup is:

MOBO: Gigabyte Technology Co., Ltd. B550M DS3H , Version Default string
American Megatrends International, LLC., Version F19
BIOS dated: Fri 22 Mar 2024 12:00:00 AM CDT

CPU: AMD Ryzen 7 5700G with Radeon Graphics @ 3800 MHz

RAM: 64 GB Gskill

PSU: Corsair RM650x

Diagnostic from yesterday attached.

tower-diagnostics-20241028-0821.zip

October 29, 2024

Community Expert

Enable the syslog server and post that after a crash.

October 29, 2024

Author

4 hours ago, JorgeB said:

Enable the syslog server and post that after a crash.

Should note - syslog has been next to useless but here is the file - note - the first log of the reboot is always "Tower root: Delaying execution of fix common problems scan for 10 minutes" so you can see from those timestamps how often it crashes now

syslog-192.168.1.124.log
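
Since the first line logged after every reboot is the "Delaying execution of fix common problems scan" message, the time between two such lines gives an upper bound on each uptime. A minimal sketch of that idea follows. It assumes classic syslog timestamps ("Oct 29 08:21:03", no year) and an English locale for month names; the function names, the `year` parameter, and the sample lines are made up for illustration and are not taken from the attached log.

```python
from datetime import datetime

# Marker text as quoted in the post above; matched case-insensitively because
# the exact capitalisation in the real log is an assumption.
BOOT_MARKER = "delaying execution of fix common problems scan"


def boot_times(lines, year=2024):
    """Return one datetime per syslog line that contains the boot marker."""
    times = []
    for line in lines:
        if BOOT_MARKER not in line.lower():
            continue
        # Classic syslog lines start with "Mon DD HH:MM:SS" and omit the year,
        # so the year is supplied by the caller (an assumption of this sketch).
        month, day, clock = line.split()[:3]
        stamp = f"{year} {month} {day} {clock}"
        times.append(datetime.strptime(stamp, "%Y %b %d %H:%M:%S"))
    return times


def uptimes(times):
    """Gaps between consecutive boots; each gap is an upper bound on an uptime."""
    return [later - earlier for earlier, later in zip(times, times[1:])]


# Made-up sample lines, only to show the shape of the input.
sample = [
    "Oct 29 08:00:00 Tower root: Delaying execution of fix common problems scan for 10 minutes",
    "Oct 29 09:15:00 Tower kernel: unrelated message",
    "Oct 29 12:30:00 Tower root: Delaying execution of fix common problems scan for 10 minutes",
]
for gap in uptimes(boot_times(sample)):
    print(gap)  # prints 4:30:00
```

To run it against a real file, pass the open file object to `boot_times` instead of `sample`. Logs that span a year boundary would need extra handling.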

October 29, 2024

Community Expert

There are multiple segfaults from this, but I don't know what "P" is:

Dec  4 06:50:07 TOWER kernel: P[proc][21012]: segfault at 100000002 ip 0000149a8c174d59 sp 0000149a79ccc1b8 error 4 in libc.so.6[149a8c045000+155000] likely on CPU 13 (core 5, socket 0)

One thing you can try is to boot the server in safe mode with all docker containers/VMs disabled and let it run as a basic NAS for a few days. If it still crashes, it's likely a hardware problem; if it doesn't, start turning the other services back on one by one, including the docker containers.
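
For reference, the kernel segfault line quoted above can be picked apart mechanically. The sketch below assumes the usual x86 message layout (all numbers in hex) and the standard x86 page-fault error-code bits (bit 0: page was present, bit 1: write access, bit 2: user mode). `decode_segfault` is a hypothetical helper name, and the offset it reports is relative to the mapping shown in brackets, which is not necessarily the library's load base.

```python
import re

# Layout of the x86 kernel message: "segfault at ADDR ip IP sp SP error CODE
# in OBJECT[START+SIZE]". All numeric fields are hexadecimal.
SEGFAULT_RE = re.compile(
    r"segfault at (?P<addr>[0-9a-f]+) ip (?P<ip>[0-9a-f]+) sp (?P<sp>[0-9a-f]+) "
    r"error (?P<err>[0-9a-f]+) in (?P<obj>[^\[\s]+)\[(?P<start>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)


def decode_segfault(line):
    """Return a dict describing a kernel segfault line, or None if no match."""
    m = SEGFAULT_RE.search(line)
    if m is None:
        return None
    ip = int(m["ip"], 16)
    start = int(m["start"], 16)
    err = int(m["err"], 16)
    return {
        "object": m["obj"],
        "fault_address": hex(int(m["addr"], 16)),
        # Offset of the faulting instruction inside the bracketed mapping.
        "offset_in_mapping": hex(ip - start),
        "page_present": bool(err & 1),
        "access": "write" if err & 2 else "read",
        "mode": "user" if err & 4 else "kernel",
    }


LINE = (
    "Dec  4 06:50:07 TOWER kernel: P[proc][21012]: segfault at 100000002 "
    "ip 0000149a8c174d59 sp 0000149a79ccc1b8 error 4 in "
    "libc.so.6[149a8c045000+155000] likely on CPU 13 (core 5, socket 0)"
)
print(decode_segfault(LINE))
```

For this line the decode gives a user-mode read of a page that was not mapped, at offset 0x12fd59 inside the libc mapping. That describes what the process did; on its own it does not say whether the cause was a software bug or bad hardware.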

October 29, 2024

Author

11 minutes ago, JorgeB said:
There are multiple segfaults from this, but I don't know what "P" is:
Dec  4 06:50:07 TOWER kernel: P[proc][21012]: segfault at 100000002 ip 0000149a8c174d59 sp 0000149a79ccc1b8 error 4 in libc.so.6[149a8c045000+155000] likely on CPU 13 (core 5, socket 0)
One thing you can try is to boot the server in safe mode with all docker containers/VMs disabled and let it run as a basic NAS for a few days. If it still crashes, it's likely a hardware problem; if it doesn't, start turning the other services back on one by one, including the docker containers.

The segfault hasn't happened in the last calendar year though and there have been dozens of crashes in the last few weeks. Also not many crashes around those segfault messages.

Some of the crashes seem to happen after several attempts of this:

Tower flash_backup: adding task: /usr/local/emhttp/plugins/dynamix.my.servers/scripts/UpdateFlashBackup update

Is there any process for hardware troubleshooting that doesn't involve rip and replace, where I can't return the used items after the test? Or could that flash_backup be an issue?

October 30, 2024

Community Expert

If you have multiple RAM sticks, try running the server with just one; if it behaves the same, try a different one. That will basically rule out bad RAM.

October 30, 2024

Author

I've run extensive testing there and am about 95% sure it isn't RAM. Interestingly - I saw that the writes to my USB were kinda high, sometimes dozens of writes at a time.

image.png

For example, that pic was after like an hour of runtime. As stated in my previous reply, I'm thinking there may be something to that based on the syslog around the crashes. This led me to another thread about a high number of writes, where they talked about uninstalling the unassigned devices preclear and myservers plugins. I can't find the myservers plugin though (even though that log file references it). I removed preclear, as I had that installed, and now my USB interactions look like this after 8-9 hours:

That also means I have doubled my uptime average. It does seem that there was one overnight crash after around 9 or 10 hours instead of what had been 3-5... So we're getting somewhere, I think.

Any other thoughts on root cause based on that information?
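
The flash write counts discussed above can also be read straight from the kernel instead of the GUI. On Linux, `/proc/diskstats` has one line per block device, and the eighth whitespace-separated field is the number of writes completed. The sketch below parses that text and compares two snapshots. The device name `sda` and the sample numbers are made up, and whether this counter matches the number the Unraid GUI displays is an assumption.

```python
def writes_completed(diskstats_text, device):
    """Return the 'writes completed' counter for one block device, or None."""
    for line in diskstats_text.splitlines():
        fields = line.split()
        # Field order: major, minor, name, reads completed, reads merged,
        # sectors read, ms spent reading, writes completed, ...
        if len(fields) >= 8 and fields[2] == device:
            return int(fields[7])
    return None


# Made-up snapshots in /proc/diskstats format, taken some time apart.
BEFORE = "   8       0 sda 1500 20 64000 800 300 5 2400 350 0 900 1150\n"
AFTER = "   8       0 sda 1510 20 64200 805 600 9 4800 700 0 1300 1505\n"

delta = writes_completed(AFTER, "sda") - writes_completed(BEFORE, "sda")
print(f"writes completed between snapshots: {delta}")
```

On a live system, the text would come from reading `/proc/diskstats` twice, with the flash drive's actual device name in place of `sda`.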

October 30, 2024

Take this with a grain of salt... but this kinda feels like a PSU issue... Do you have a spare one you could use/test?

Running the system without the dockers/services does reduce power draw(cpu usage, hdd write/read) making it easier for the PSU

IMO that's where I would start

October 30, 2024

Author

38 minutes ago, mathomas3 said:

Take this with a grain of salt... but this kinda feels like a PSU issue... Do you have a spare one you could use/test?

Running the system without the dockers/services does reduce power draw(cpu usage, hdd write/read) making it easier for the PSU

IMO that's where I would start

I have been leaning more and more in that direction. I don't have a spare, unfortunately, but I'm watching over the next couple of days to see if some of the changes I've made help. Based on the other things I've tested, I'm at: it's either the PSU or UnRAID OS. You might have solidified my choice to grab a second PSU to have around as a spare anyway...

October 30, 2024

You could also try to boot into a live OS (Linux Mint or whatever) and run a stress test on it (sure there is something out there you could find to do this, youtube on loop?). If it crashes there, then you know... if it doesn't... well, that can't really rule out the PSU, but... it's something

To eliminate the OS side of the house, you could try a fresh install of unraid but I have found that the OS is good and stable

Just something to consider/try

October 30, 2024

Author

17 minutes ago, mathomas3 said:

youtube on loop?

lol - in like 20 chrome tabs!

I do have a bootable that I could try and have a psu on its way. Something to try while I wait for sure, although as we speak I'm on my longest uptime in more than two weeks at 12 hours even

October 30, 2024

Random issues like this are always frustrating to work out... if anything, get the PSU, test it... and send it back if that doesn't solve it

October 31, 2024

Author

10 hours ago, mathomas3 said:

Random issues like this are always frustrating to work out... if anything, get the PSU, test it... and send it back if that doesn't solve it

Thinking more and more it is power related - it just crashed randomly after a 16 hour uptime. I figured I should cancel the parity check, since it keeps restarting on reboot, and then it immediately crashed again, which I missed. I've just cancelled that parity check and am waiting to see. I tried to up the idle load via BIOS but yeah. The PSU gets here Friday and I'm off that day, so I know what I'll be working on...

October 31, 2024

Author

New update: I am noticing a lot of writes to the flash drive again. I checked which files are being updated the most and they are under the .git/objects folder. It's adding 5-10 every few minutes. I had 300 writes an hour ago and am at 600 now. Can't find anything else changing. Not sure why git is updating like crazy...
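
One way to repeat the "which files are being updated" check described above is to walk the flash drive and list everything modified in the last few minutes. The sketch below does that with the standard library only. `recently_modified` is a hypothetical helper name, and `/boot` as the flash mount point is an assumption about the system.

```python
import os
import time


def recently_modified(root, within_seconds, now=None):
    """List files under root modified in the last within_seconds, newest first."""
    now = time.time() if now is None else now
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                mtime = os.stat(path).st_mtime
            except OSError:
                # File vanished or is unreadable; skip it.
                continue
            if now - mtime <= within_seconds:
                hits.append((mtime, path))
    return [path for _mtime, path in sorted(hits, reverse=True)]


# Example call (assumes the flash drive is mounted at /boot):
#     for path in recently_modified("/boot", 600):
#         print(path)
```

Running it every few minutes and comparing the output would show whether the new files are all under `.git/objects` or whether something else is changing too. It shows which files changed, not which process wrote them.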

November 3, 2024

Author

On 10/30/2024 at 11:35 AM, mathomas3 said:

Random issues like this are always frustrating to work out... if anything, get the PSU, test it... and send it back if that doesn't solve it

Welp. Got the new psu installed yesterday and the crashing actually got slightly worse if anything. Ruling that out... Noticing again that the writes to flash are ticking up... Can't figure out what is causing that though as the git logs are not showing anything being updated...

November 3, 2024

Author

Anyone want to give me a sanity check on this pre-boot stuff?

November 4, 2024

Community Expert

Looks normal to me.

November 5, 2024

Author

Cool - well here is the latest.

Replaced PSU - no dice

RAM - no dice

Docker turned off - finally made it through full parity check, crashed after parity was done.

Checked bios - no issues to report - all config as it should be

Disks are fine, minus a random error on my original disks that just won't clear - the error id translates to 1 error in 65538 operations (it's the 188 error that is ignored).

I'm in the danger zone at this point - I can't finish parity checks most of the time and because it crashes when parity finishes or is cancelled, I cannot invoke the mover and disks are starting to get into an undesirable state.

Please advise.

EDIT: Crash occurs within roughly 7 min of parity finishing or being cancelled

Edited November 5, 2024 by seecs2011
Added context

November 5, 2024

Community Expert

Board or CPU would be the other main suspects, and since it's a Ryzen CPU, make sure this has been taken care of:

https://forums.unraid.net/topic/46802-faq-for-unraid-v6/?do=findComment&comment=819173

November 5, 2024

Author

6 minutes ago, JorgeB said:

Board or CPU would be the other main suspects, and since it's a Ryzen CPU, make sure this has been taken care of:

https://forums.unraid.net/topic/46802-faq-for-unraid-v6/?do=findComment&comment=819173

as previously mentioned, yes that was done.

November 5, 2024

Community Expert

Then

34 minutes ago, JorgeB said:

Board or CPU would be the other main suspects

November 9, 2024

Author

Welp - I've now replaced all components and it is rebooting within a minute of the web portal becoming available now

November 9, 2024

Author

Worth noting: safe mode also crashes. The only stability I've had for more than a minute or two has been the memtest I've been running, just to make sure the system can stay up apart from UnRAID OS for a bit...

November 9, 2024

Author

Something broke in this process - I reinstalled the old hardware and am getting the same behavior

November 9, 2024

Author

Now it crashed with an Ubuntu live disk...what the hell? I can't figure out what is wrong with this thing
