I'm at my wits end with these daily reboots! Please Help!


Posted (edited)
56 minutes ago, Geck0 said:

Seriously? How annoying. It's difficult to diagnose when it happens, since you haven't got logs.

I know how you feel. I had the same issue for more than a couple of weeks.

 

Maybe unclean power? Are you connected to mains or UPS?

I run through filtered power, and I also have a server cluster with ESXi and TrueNAS (3 hosts). Zero issues with those. They're all part of the same 20 amp circuit, and nothing else runs on there but my servers. Mine has literally been going on for a year. It seems like it started when I upgraded the OS about 3 versions ago, but when I reverted it was still rebooting. So who knows.

Edited by Dro
Link to comment
Posted (edited)
1 minute ago, Geck0 said:

List your plugins and dockers.

Check the 1st page; they are all listed about halfway down. The system has run fine in safe mode with all dockers enabled but no plugins enabled.

 

I've already tried disabling a bunch so far, the main ones suggested like the GPU one and NerdTools, and it's still rebooting =(

Edited by Dro
Link to comment

That's what I've got for plugins and dockers, fairly unremarkable. I don't have any NVIDIA plugins, etc. I run a couple of basic Windows 11 VMs. I still don't think it's the plugins though.

The only other thing I can think of is your motherboard's firmware. Sometimes the latest firmware doesn't play nice. I'm guessing that if you've tried everything, you've already upgraded it, since people normally do that when they get a new motherboard.

Have you Googled to check if anybody is having trouble with the firmware version that you're on? (Sorry if that's obvious.)

 

Also, are you using an LSI SAS board? If you are, have you tried moving the drives to the motherboard's SATA?

Have you Memtested?

 

Short of the above, have you got an empty SSD / NVMe that you can use with a Windows or Linux build? Boot from it and leave it for a few days to see if this still occurs (leaving the physical part of the server exactly the same). If nothing happens, trash the unRaid image and start clean.

 

Backups

 

If your machine is going down as often as this, I would really look at backing up the important stuff, and I mean daily. You're going to end up with corruption somewhere along the line.

 

When I was dealing with a similar issue (which was the PSU), I eventually trashed the whole lot and started again, but it was an old rig that I'd had in storage for a while.

Link to comment
Posted (edited)

Every single hardware component in this computer has been replaced, down to every single cable. Completely different motherboards too. The computer prior to this was rebooting too. That's what led me to start replacing parts one by one. The PSU was the last remaining piece, and I replaced it this week. I have maybe 2% of what you have in plugins and dockers lol. My setup is so basic. I don't have much at all. I literally use it just for my media server.
 

I do run an LSI 9211-8i, but the drives I have are attached directly to the board. My other hot-swap bays are on the card, but I have no drives in those bays.
 

So the question is, what could be happening outside the server that may be causing this? I've even replaced all the network cables, the switches the server is attached to, etc. There is not one component that hasn't been moved or replaced with something new. It's nuts.
 

But then, why is the server rock solid in safe mode? I just had another unclean shutdown about 1 hr ago. So far I've had one last night at 12:20am and now another around 7pm. 🤯

 

My backups are solid.
 

That is a good idea. Maybe I'll load unRAID onto a brand new USB drive and just let it run without any configs and see what it does.

 

 

Edited by Dro
Link to comment

I would load a trial version and not mount the disks. If that lasts for 24 hours, then mount the disks and give it another 24 hours.

Then load a basic docker, like Plex, alongside the disks and give it another 24 hours. If it's conking out at any point along this basic line, then you have a problem that has to be physical. There is no way a clean unRaid build running Plex and some disks can fall over because of the dockers or issues in the base config. Anything is possible, but it's highly unlikely.

 

If it doesn't conk out, I would use the same usb and re-build the image (don't restore the image) and re-license. 
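
The staged test above can be written down as a small checklist so it's clear what each outcome means. This is only a sketch in Python, assuming you note by hand how many hours the server stayed up in each stage. The stage names and the `first_failing_stage` / `verdict` helpers are made up for illustration and are not part of unRAID.

```python
from dataclasses import dataclass


@dataclass
class Stage:
    name: str
    hours_required: float = 24.0  # soak time suggested in the post


# Stages in the order suggested above; the wording is my own shorthand.
STAGES = [
    Stage("trial image, disks not mounted"),
    Stage("disks mounted"),
    Stage("disks mounted plus one basic docker (e.g. Plex)"),
]


def first_failing_stage(stages, observed_uptime_hours):
    """Return the first stage whose observed uptime fell short, or None.

    observed_uptime_hours[i] is how long the server stayed up in stage i.
    Stages that have not been run yet are simply missing from the list.
    """
    for stage, uptime in zip(stages, observed_uptime_hours):
        if uptime < stage.hours_required:
            return stage
    return None


def verdict(stages, observed_uptime_hours):
    """Translate the results so far into the next step from the post."""
    failed = first_failing_stage(stages, observed_uptime_hours)
    if failed is not None:
        return f"went down during '{failed.name}': suspect a physical problem"
    if len(observed_uptime_hours) < len(stages):
        return "no failure yet: move on to the next stage"
    return "all stages survived: rebuild the image on the same USB and re-license"


# Example: survived the first stage, went down 7.5 hours into the second.
print(verdict(STAGES, [24, 7.5]))
```

The point of doing it one stage at a time is that a failure on the bare trial image rules out plugins, dockers and config in one go.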

 

My two cents, for what it's worth. 🤪

 

 

 

Link to comment
20 hours ago, Dro said:

I run through filtered power, and I also have a server cluster with ESXi and TrueNAS (3 hosts). Zero issues with those. They're all part of the same 20 amp circuit, and nothing else runs on there but my servers. Mine has literally been going on for a year. It seems like it started when I upgraded the OS about 3 versions ago, but when I reverted it was still rebooting. So who knows.

Here the power runs through a smart switch that measures energy usage. I just got rid of it, so let's see.

In my case the computer turns off completely, which is pretty interesting; it's not a simple crash or reboot. Hopefully it's the plug, otherwise I will start from scratch.

Link to comment
Posted (edited)
7 hours ago, guiper said:

Here the power runs through a smart switch that measures energy usage. I just got rid of it, so let's see.

In my case the computer turns off completely, which is pretty interesting; it's not a simple crash or reboot. Hopefully it's the plug, otherwise I will start from scratch.

Hmm, in your BIOS do you have it set to automatically turn back on after power cuts? This sounds like a power issue, or possibly the same issue I have. In my BIOS I had to set it to automatically power on and restart, and I also turned off quick boot. Try that and I bet you'll see the same issues I'm having, because mine would just shut off as well, but I think that was mostly related to quick boot in the BIOS.

 

I also have some power monitoring that everything feeds through. Good thinking, maybe I should remove mine too. Although no other devices are impacted, and this runs switches, servers, tons of equipment.

Edited by Dro
Link to comment
7 hours ago, Geck0 said:

I would load a trial version and not mount the disks. If that lasts for 24 hours, then mount the disks and give it another 24 hours.

Then load a basic docker, like Plex, alongside the disks and give it another 24 hours. If it's conking out at any point along this basic line, then you have a problem that has to be physical. There is no way a clean unRaid build running Plex and some disks can fall over because of the dockers or issues in the base config. Anything is possible, but it's highly unlikely.

 

If it doesn't conk out, I would use the same usb and re-build the image (don't restore the image) and re-license. 

 

My two cents, for what it's worth. 🤪

 

 

 

Appreciate the ideas, because I'm out of them lol. So if I rebuild the image, I would have to reconfigure every docker, etc., right? Also, what about the data on the disks?

Link to comment
Posted (edited)

OH MY GOD. I just pulled out the server rack in my server room (it's under the stairs in the basement, built just for this purpose). I haven't been behind there in a long time. So I go in there and see that my unRAID server was plugged into a smart switch. How much do you want to bet it's the stupid smart switch causing the reboots? IT HAS TO BE!!! I just removed it. I never actually removed the cable from the smart switch, not thinking anything of it, because I have spares back there and was plugging them in from the front end of the plug. Fingers crossed this is it!!! There is hope! 🤣

Edited by Dro
Link to comment
Posted (edited)
2 hours ago, Geck0 said:

How much money and time did you spend on this issue? 🤣

Well, let's just say a top-of-the-line 13th gen i9 setup with high-end stuff 😂 I'm not sure the smart switch is the issue, BUT it kinda makes sense, doesn't it? But why didn't it happen in safe mode? That's the only “wtf” right now. Time will tell.

Edited by Dro
Link to comment
43 minutes ago, Dro said:

Well, let's just say a top-of-the-line 13th gen i9 setup with high-end stuff 😂 I'm not sure the smart switch is the issue, BUT it kinda makes sense, doesn't it? But why didn't it happen in safe mode? That's the only “wtf” right now. Time will tell.

A few months ago, I started having random reboots (sometimes complete shutdowns) on my 13900k setup.  I set up a remote syslog and could never find anything in the logs when it would happen.  Like others here I systematically replaced everything other than the CPU with the same result. 
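
When the syslog shows nothing before the reset, as described above, one low-tech way to at least pin down when the box went dark is a heartbeat file. This is only a sketch and not something anyone in the thread used: the helper names and the 60-second default interval are assumptions, and the file needs to live on storage that survives a reboot.

```python
import os
import time
from datetime import datetime, timezone
from pathlib import Path


def write_heartbeat(path: Path) -> None:
    """Append the current UTC time as one ISO-8601 line and force it to disk."""
    with path.open("a", encoding="utf-8") as fh:
        fh.write(datetime.now(timezone.utc).isoformat() + "\n")
        fh.flush()
        os.fsync(fh.fileno())  # a power cut should not lose the last beat


def parse_stamps(lines):
    """Turn heartbeat lines back into datetime objects, skipping blank lines."""
    return [datetime.fromisoformat(line.strip()) for line in lines if line.strip()]


def find_gaps(stamps, max_gap_seconds):
    """Return (last_seen, next_seen) pairs that are further apart than expected."""
    return [
        (earlier, later)
        for earlier, later in zip(stamps, stamps[1:])
        if (later - earlier).total_seconds() > max_gap_seconds
    ]


def run_forever(path: Path, interval_seconds: float = 60.0) -> None:
    """Write one heartbeat per interval until the machine goes down."""
    while True:
        write_heartbeat(path)
        time.sleep(interval_seconds)
```

Leave `run_forever` running in the background, and after the next unclean shutdown feed the file's lines through `parse_stamps` and `find_gaps`. The first element of each pair is the last moment the server was known to be alive, which can then be lined up against anything else happening around the house at that time.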

 

It wasn't until trying to boot some other OSs I had lying around on other USB sticks that something caught my eye in the logs on boot... I think it was when I was booting pfSense.

 

Anyway, the error was something about how Core 20 had some kind of fault. 

 

That got me searching around and finding all sorts of information about 13th and 14th gen chips degrading and going bad at an alarming rate.  It seems that many motherboard manufacturers had pretty bad settings for these chips that were destroying them.

 

I ended up disabling the e-cores in the BIOS and after that I had an uptime of 2 months again.  Everything was completely stable as soon as I did that.

 

Maybe something to look into?
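
If you want to look into it without reading boot logs by eye, something like the sketch below can flag suspicious lines in a saved log. Treat it as a rough filter under stated assumptions: the keyword list is my guess at how such faults are usually worded (it varies between operating systems and versions), the helper names are made up, and the sample lines are invented for the demo rather than copied from any real log.

```python
import re
from collections import Counter

# Assumed keywords; real wording differs between kernels and OS versions.
FAULT_PATTERNS = [
    re.compile(r"machine[ -]check", re.IGNORECASE),
    re.compile(r"hardware error", re.IGNORECASE),
    re.compile(r"\bmc[ae]\b", re.IGNORECASE),
]
# Pulls a core number out of text such as "core 20" or "CPU 20".
CORE_PATTERN = re.compile(r"\b(?:cpu|core)\s*#?(\d+)\b", re.IGNORECASE)


def scan_for_cpu_faults(lines):
    """Yield (line_number, core_or_None, text) for lines that look like CPU faults."""
    for number, line in enumerate(lines, start=1):
        if any(pattern.search(line) for pattern in FAULT_PATTERNS):
            match = CORE_PATTERN.search(line)
            core = int(match.group(1)) if match else None
            yield number, core, line.rstrip()


def faults_per_core(lines):
    """Count flagged lines per core number, ignoring lines without one."""
    return Counter(
        core for _, core, _ in scan_for_cpu_faults(lines) if core is not None
    )


# Invented sample lines, only here to exercise the filter.
SAMPLE = [
    "boot: probing devices",
    "example: machine check reported on core 20",
    "example: hardware error logged, cpu 20",
    "net0: link up",
]

for number, core, text in scan_for_cpu_faults(SAMPLE):
    print(number, core, text)
```

If the same core number keeps showing up across boots, that points at the CPU rather than at plugins or power, which is the pattern described above.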

 

I just swapped out the 13900k for a 14900k this evening and I'm going to see if I can RMA the 13900k.  I ordered the 14900k almost two months ago (same as the uptime I had up until today) and it's just been sitting.  I put off swapping it out because while it seemed urgent at the time... it turns out that running the server with the e-cores disabled made no noticeable difference in my life and I wanted to see how long it would run without freezing/rebooting.  It was nice having a working server again for a couple months.  But I'm back up and running on the 14900k now with all cores enabled and hoping it doesn't happen again with this chip.

 

 

Link to comment
Posted (edited)

Great info. Thank you. If it continues I will check that out. I have a 13900K in my main PC too with no issues, and this is the first I'm hearing about this. I was having reboot issues before the 13900K though; this CPU replaced an 8700K.

Edited by Dro
Link to comment

Simple, in safe mode you're on a reduced system. No plugins, etc. This results in lower wattage. Go to your normal system, that's pushing all the plugins, graphics cards, VMs, parity checks, etc. It's under more load: more processing, more fan usage / water cooling pumps.

The smart switch will likely have a maximum load before it trips, so look up the model. Especially if you have other devices on the same plug (don't know what kind of smart switch).

Edited by Geck0
Link to comment
Posted (edited)
2 hours ago, Geck0 said:

Simple, in safe mode you're on a reduced system. No plugins, etc. This results in lower wattage. Go to your normal system, that's pushing all the plugins, graphics cards, VMs, parity checks, etc. It's under more load: more processing, more fan usage / water cooling pumps.

The smart switch will likely have a maximum load before it trips, so look up the model. Especially if you have other devices on the same plug (don't know what kind of smart switch).

 

It's a Kasa smart switch. They can handle a ton, like 1800 watts or something, so I don't think it's a wattage issue, and only Unraid was plugged into it. My entire server rack was only using around 500. And I remember now: the whole reason I installed it over a year ago was because my old system would black-screen every so often and become unresponsive, so if I was away I could remotely reboot it and the server would work again. Once I figured out that issue, the server ran fine for a while but then started to reboot randomly out of nowhere. I can't recall adding any hardware during that time that would have changed anything. It seemed to start after upgrading the OS, and there began my journey of slowly replacing every.single.part lol. I actually made a similar post to this a while ago, but back then my logs were actually showing some errors. So I fixed that stuff, but the reboots continued. Now the logs show nothing.
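
For what it's worth, the wattage argument can be checked with quick arithmetic. A rough sketch using only the two figures from the post above: the 1800 W rating is the poster's from-memory number and not a verified spec, and the helper names are made up. 1800 W is what 15 A at 120 V works out to, but whether that is how this particular switch is rated is an assumption.

```python
def rated_watts(volts: float, amps: float) -> float:
    """Continuous power implied by a voltage and a current limit."""
    return volts * amps


def load_fraction(load_watts: float, rating_watts: float) -> float:
    """Share of the rating that the load uses, as a number between 0 and 1."""
    if rating_watts <= 0:
        raise ValueError("rating must be positive")
    return load_watts / rating_watts


SWITCH_RATING_W = 1800.0  # "like 1800 watts or something" from the post
RACK_LOAD_W = 500.0       # whole-rack draw quoted in the post

fraction = load_fraction(RACK_LOAD_W, SWITCH_RATING_W)
print(f"rack draw is {fraction:.0%} of the assumed switch rating")
```

Even with the whole rack on it, that is well under a third of the assumed rating, so if the switch is to blame it would have to be something other than a sustained overload.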

Edited by Dro
Link to comment
15 hours ago, iankaufmann said:

That got me searching around and finding all sorts of information about 13th and 14th gen chips degrading and going bad at an alarming rate.  It seems that many motherboard manufacturers had pretty bad settings for these chips that were destroying them.

It's not the motherboard. Hopefully someone gets to the bottom of why Intel 13xxx and 14xxx CPUs are so horrible:

 

 

 

Link to comment
19 minutes ago, Geck0 said:

Yeah, I was wondering if you'd taken a hammer to it 😄

Well, no smartswitch, stable so far. I have a feeling that was the problem. No hammer YET.  I swear if this wasn’t the problem you’ll see fireworks over Chicago from anywhere in the US. 😂

Link to comment
17 hours ago, Dro said:

Well, no smartswitch, stable so far. I have a feeling that was the problem. No hammer YET.  I swear if this wasn’t the problem you’ll see fireworks over Chicago from anywhere in the US. 😂

Same thing here!!! Getting rid of the smart switch apparently solved the issue and there are no more shutdowns! 4 days without any issues, which is a record!

Strange that another PC on the same smart switch (mostly on standby) wasn't affected by any shutdowns... probably a load of around 30-40W triggers the issue...

 

Thanks for helping sort out the issue; it would have taken me a while to get there.

Link to comment
