(SOLVED) Random Reboots



Over the past week I've been getting random reboots. From what I've seen in the logs, there doesn't seem to be an obvious cause, so I'm posting the diagnostics in case someone with more knowledge than me can help.

 

Normally I'd suspect the PSU, but I bought this one (Corsair RMX 850) around 4 years ago and have had zero issues until now. The board and CPU are both older, but again, no issues until now.

 

Let me know what else I can provide.

unraid-diagnostics-20190702-1314.zip

Edited by PanteraGSTK
title update

That did not solve the problem (got more space now though).

 

Got through a 17-hour parity check without any issues, but as soon as I start downloading with NZBGet, the server restarts after about 10 minutes. Very odd.

 

I've started using the syslog function, so I've attached that log file. The reboot happened around 6 pm (18:00).

syslog
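
For anyone trying to pin down reboot times in a saved syslog like the one attached, here is a minimal Python sketch. It assumes each boot writes a kernel line containing `Linux version`, which is typical for Linux syslogs but worth confirming against your own file. `find_boots` and `scan_file` are hypothetical helper names, not part of Unraid.

```python
from collections import deque
from typing import Iterable, List, Tuple

# Assumption: the kernel logs a "Linux version ..." line early in every boot.
BOOT_MARKER = "kernel: Linux version"


def find_boots(lines: Iterable[str], context: int = 5) -> List[Tuple[str, List[str]]]:
    """Return (boot_line, preceding_lines) for every boot marker found.

    preceding_lines holds up to `context` log lines written just before
    the boot, i.e. the last things logged before the machine went down.
    """
    recent = deque(maxlen=context)
    boots = []
    for raw in lines:
        line = raw.rstrip("\n")
        if BOOT_MARKER in line:
            boots.append((line, list(recent)))
            recent.clear()  # start collecting context for the next boot
        else:
            recent.append(line)
    return boots


def scan_file(path: str, context: int = 5) -> None:
    """Print each boot marker in the file with the lines that preceded it."""
    with open(path, errors="replace") as handle:
        for boot_line, before in find_boots(handle, context):
            print("BOOT:", boot_line)
            for entry in before:
                print("  before:", entry)
```

Calling `scan_file("syslog")` would list whatever was logged just before each boot. If nothing resembling a shutdown sequence appears before a boot marker, that is consistent with an abrupt reset rather than a software-initiated restart.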

Edited by PanteraGSTK
4 hours ago, itimpi said:

It could well be worth stopping the array and then restarting in Maintenance mode and then clicking on each array drive on the Main tab in turn and running a file system check.

 

Thanks for the tip.

 

I did that and didn't see any errors, but I'm not all that familiar with xfs_repair.

 

I let it repair as needed and all the checks finished very quickly. Only took a few seconds per disk. Not sure if that's good or bad.

19 minutes ago, PanteraGSTK said:

 

Thanks for the tip.

 

I did that and didn't see any errors, but I'm not all that familiar with xfs_repair.

 

I let it repair as needed and all the checks finished very quickly. Only took a few seconds per disk. Not sure if that's good or bad.

xfs_repair IS very quick, especially if no errors (or only a small number) are found. It will be interesting to see whether it has helped in any way.
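
For XFS-formatted disks, the check described above can also be run from the command line. A rough equivalent can be sketched in Python as follows. The `-n` (no-modify) flag and its exit codes come from the xfs_repair man page; the device path `/dev/md1` for array disk 1 in Maintenance mode is an assumption that depends on the Unraid version, and the helper names are made up for illustration.

```python
import subprocess
from typing import Callable, List


def build_check_command(device: str) -> List[str]:
    """Build a read-only xfs_repair invocation (-n = check only, no changes)."""
    return ["xfs_repair", "-n", device]


def interpret_exit_code(code: int) -> str:
    """Map the exit status of `xfs_repair -n` to a short verdict.

    Per xfs_repair(8): with -n, status 1 means corruption was detected
    and status 0 means none was detected.
    """
    if code == 0:
        return "clean"
    if code == 1:
        return "corruption detected"
    return f"unexpected exit status {code}"


def check_device(device: str, runner: Callable = subprocess.run) -> str:
    """Run the read-only check on one device and return the verdict.

    `runner` is injectable so the logic can be exercised without touching
    a real disk. The array must be started in Maintenance mode first.
    """
    result = runner(build_check_command(device), capture_output=True, text=True)
    return interpret_exit_code(result.returncode)
```

For example, `check_device("/dev/md1")` would check the first array disk under the path assumption above. A run that finishes in a few seconds and reports "clean" matches what was described in this thread.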

22 hours ago, itimpi said:

xfs_repair IS very quick, especially if no errors (or only a small number) are found. It will be interesting to see whether it has helped in any way.

 

It would seem everything is OK: 24 hours of uptime, including stress testing, and no issues so far.

 

Thanks again for the help.

1 hour ago, jonathanm said:

Memory. Try removing half and running for a while.

 

This kind of thing is usually hardware. Have you been sitting near or in front of it when it power cycled yet?

Yeah, I have. Many times. It's in a closet, but I'm 6 feet away from the actual server.

 

I'll pull some memory, but it's a dual proc board so I'll have to be careful.

32 minutes ago, PanteraGSTK said:

Yeah, I have. Many times. It's in a closet, but I'm 6 feet away from the actual server.

 

I'll pull some memory, but it's a dual proc board so I'll have to be careful.

Hmm. Maybe temporarily pull 1 whole processor and the associated memory.

 

When it reboots, does it sound like a normal power cycle? Is there anything that catches your attention before it actually fully reboots?

1 hour ago, jonathanm said:

Hmm. Maybe temporarily pull 1 whole processor and the associated memory.

 

When it reboots, does it sound like a normal power cycle? Is there anything that catches your attention before it actually fully reboots?

It does sound like a normal power cycle. That's the strange part. I've not been able to capture anything that points me in any specific direction.

 

When I first got the board and CPUs, I had an issue where some memory wasn't recognized. I moved those DIMMs to alternate slots and forgot about it for a few years. I've now pulled those DIMMs and am testing again.

 

If it reboots again, I'll pull the associated CPU and remaining DIMMs and just leave CPU1 and its memory.

7 hours ago, PanteraGSTK said:

Removing the memory may have helped, but it didn't fix it.

If it increased the uptime in a predictable and repeatable way but didn't solve the issue, that would point to the power supply. Not necessarily the PSU itself: there are power supply circuits on the motherboard that further smooth, condition, and regulate the power to the memory and CPU.

 

So, if totally removing one CPU and its associated RAM "fixes" it, swap to the unused set of CPU and RAM and see if it stays "fixed". If it does, that further implicates the power supply chain. You could have a marginal motherboard power supply that has degraded over time. I've had boards that originally ran 4 sticks of RAM just fine, but later developed instability with more than 2 sticks of perfectly good RAM.
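
The swap test above boils down to a small decision table. Here is a sketch in Python, where only the "stable with either set" branch comes directly from the post; the other branches are an extrapolation, and the function name is hypothetical.

```python
def interpret_swap_test(stable_with_set_a: bool, stable_with_set_b: bool) -> str:
    """Interpret running the board with only one CPU+RAM set at a time.

    Set A and set B are the two CPU/RAM groups of a dual-socket board.
    Each flag records whether the server stayed up under load with only
    that set installed.
    """
    if stable_with_set_a and stable_with_set_b:
        # From the post: stable with either set alone implicates the
        # power supply chain (PSU or the motherboard's own regulation).
        return "power delivery suspect (PSU or motherboard regulation)"
    if stable_with_set_a:
        # Extrapolation: only set B misbehaves on its own.
        return "set B suspect (CPU, RAM, or socket)"
    if stable_with_set_b:
        # Extrapolation: only set A misbehaves on its own.
        return "set A suspect (CPU, RAM, or socket)"
    # Not covered by the post: nothing was isolated by the swap.
    return "inconclusive: unstable with either set"
```

Since a set is moved together with its socket, a result pointing at one set does not by itself separate a bad CPU from a bad socket; moving the known-good CPU into the suspect socket would be the follow-up test.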

14 hours ago, jonathanm said:

If it increased the uptime in a predictable and repeatable way but didn't solve the issue, that would point to the power supply. Not necessarily the PSU itself: there are power supply circuits on the motherboard that further smooth, condition, and regulate the power to the memory and CPU.

 

So, if totally removing one CPU and its associated RAM "fixes" it, swap to the unused set of CPU and RAM and see if it stays "fixed". If it does, that further implicates the power supply chain. You could have a marginal motherboard power supply that has degraded over time. I've had boards that originally ran 4 sticks of RAM just fine, but later developed instability with more than 2 sticks of perfectly good RAM.

Removing two sticks of RAM seems to have helped, but with that RAM installed the behavior was the same as before: sometimes it would be fine with no reboots, and sometimes the reboots would happen quickly.

 

Removing CPU0 and its associated RAM allowed an additional parity check to complete with no issues. I'll continue to test, but I got memory controller errors with CPU0 when I first installed it, and they stopped when I moved the RAM around the board. That points to CPU0 being the culprit.

 

My PSU is a Corsair RMX850 and it's less than 5 years old. I tend to lean toward the PSU in these situations too, but with all the other factors I'm confident that CPU0 is most likely the issue.


Marking this solved...for the third time.

 

I'm pretty confident this time that CPU0 is to blame. The memory all tests fine, but the memory controller errors I saw when I first got the CPU/mobo/RAM combo make it pretty obvious that is the case.

 

It is possible that socket 0 is the culprit, but I'll find that out when I move CPU1 to socket 0. I'll also move the confirmed-working RAM. If it all checks out, I may grab another Xeon 2670 off eBay and hope for the best.

 

I was able to get through a parity check and check multiple drives using the File Integrity plugin. That plugin was an easy test because, when I tried to use it in the past, it pretty much always triggered a reboot on my 4TB drives.

 

Another test that passed was downloading large files via Usenet while also remuxing a file using MakeMKV. That's quite a lot of operations at once, especially considering my kids were watching movies/TV via Plex at the same time. No reboot. If that didn't make it reboot, nothing will.

 

Hopefully.

 

Thanks for the help.

Edited by PanteraGSTK
