PanteraGSTK

(SOLVED) Random Reboots


Posted (edited)

During the past week I've apparently been getting random reboots. From what I can see in the logs, there doesn't seem to be an obvious cause. I figured I'd post the diagnostics to see if anyone with more knowledge than me can help.

 

Normally, I'd look to the PSU, but I got this one (a Corsair RMX 850) around 4 years ago and have had zero issues until now. The board and CPU are both older, but again, no issues until now.

 

Let me know what else I can provide.

unraid-diagnostics-20190702-1314.zip

Edited by PanteraGSTK
title update


It's gotten better (I think the Unassigned Devices plugin was doing something; after removing it, the log isn't throwing so many errors), but it happened again this morning.

 

I have my dockers set to back up every night, and the reboots somehow seem tied to that time of day (night).

unraid-diagnostics-20190706-1714.zip


May have found the issue. I have a disk that is dying. I'll replace it tomorrow and monitor for random reboots.

 

It turns out I hadn't modified my SMART settings for this particular drive to work with my 3ware controller.
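For anyone hitting the same thing: drives sitting behind a 3ware RAID controller aren't visible to plain SMART queries, so smartctl has to be told the controller type and port. A minimal sketch, assuming a 9000-series card exposed as /dev/twa0 — the port number (0 here) and the device node are assumptions, so check them against your own hardware:

```shell
# Query SMART data through a 3ware controller with smartmontools.
# "-d 3ware,N" selects the drive on controller port N; /dev/twa0 is the
# controller's character device on Linux (older 6000/7000/8000-series
# cards use /dev/twe0 instead).
smartctl -a -d 3ware,0 /dev/twa0   # full SMART report for the drive on port 0
smartctl -H -d 3ware,0 /dev/twa0   # overall health assessment only
```

In Unraid, the equivalent lives in each disk's individual settings page (the SMART controller type and port fields), which is what needed changing here.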


It would appear that replacing the failing drive resolved the issue.


I am curious as to why a bad/failing drive would cause reboots though. What would I look at to make sure that was my issue?

Posted (edited)

That did not solve the problem (I've got more space now, though).

 

Got through a 17-hour parity check without any issues, but as soon as I start downloading with NZBGet the server restarts after about 10 minutes. Very odd.

 

I've started using the syslog function, so I've attached that log file. The reboot happened around 6 pm (18:00).

syslog

Edited by PanteraGSTK


It could well be worth stopping the array, restarting in Maintenance mode, and then clicking on each array drive on the Main tab in turn to run a file system check.
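For reference, a sketch of what that GUI check runs under the hood. In Maintenance mode the array disks are presented as unmounted /dev/mdX devices, so they can be checked safely; the device name below is an example and must match your own disk assignments:

```shell
# Read-only XFS check of array disk 1 (-n reports problems without
# modifying anything on the disk).
xfs_repair -n /dev/md1

# Only after reviewing that output, run the actual repair (no -n):
# xfs_repair /dev/md1
# If xfs_repair asks for -L (zero the log), be aware that can discard
# recent metadata; treat it as a last resort.
```

The same check can be launched per disk from the disk's page on the Main tab while in Maintenance mode, which is the safer route since it picks the right device for you.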

4 hours ago, itimpi said:

It could well be worth stopping the array, restarting in Maintenance mode, and then clicking on each array drive on the Main tab in turn to run a file system check.

 

Thanks for the tip.

 

I did that and didn't see any errors, but I'm not all that familiar with xfs_repair.

 

I let it repair as needed, and all the checks finished very quickly; only a few seconds per disk. Not sure if that's good or bad.

19 minutes ago, PanteraGSTK said:

 

Thanks for the tip.

 

I did that and didn't see any errors, but I'm not all that familiar with xfs_repair.

 

I let it repair as needed, and all the checks finished very quickly; only a few seconds per disk. Not sure if that's good or bad.

xfs_repair IS very quick, especially if no errors (or only a small number) are found. It will be interesting to see whether it has helped in any way.

9 minutes ago, itimpi said:

xfs_repair IS very quick, especially if no errors (or only a small number) are found. It will be interesting to see whether it has helped in any way.

 

I will let you know either way.

22 hours ago, itimpi said:

xfs_repair IS very quick, especially if no errors (or only a small number) are found. It will be interesting to see whether it has helped in any way.

 

It would seem everything is OK. 24hrs uptime and I've stress tested and no issues so far.

 

Thanks again for the help.


Good to hear! File system corruption should not really cause crashes, but for some reason it occasionally does.


Man, I really thought the new SAS card was going to fix this.

 

I recently removed my SAS2LP card in favor of an LSI HBA, but during a parity check the system rebooted at 85%.

 

I'm not finding anything in the logs that tells me where to look.

 

Any ideas on where to start?

5 hours ago, PanteraGSTK said:

Any ideas on where to start?

Memory. Try removing half and running for a while.

 

This kind of thing is usually hardware. Have you been sitting near or in front of it when it power cycled yet?

1 hour ago, jonathanm said:

Memory. Try removing half and running for a while.

 

This kind of thing is usually hardware. Have you been sitting near or in front of it when it power cycled yet?

Yeah, I have. Many times. It's in a closet, but I'm 6 feet away from the actual server.

 

I'll pull some memory, but it's a dual proc board so I'll have to be careful.

32 minutes ago, PanteraGSTK said:

Yeah, I have. Many times. It's in a closet, but I'm 6 feet away from the actual server.

 

I'll pull some memory, but it's a dual proc board so I'll have to be careful.

Hmm. Maybe temporarily pull 1 whole processor and the associated memory.

 

When it reboots, does it sound like a normal power cycle? Is there anything that catches your attention before it actually fully reboots?

1 hour ago, jonathanm said:

Hmm. Maybe temporarily pull 1 whole processor and the associated memory.

 

When it reboots, does it sound like a normal power cycle? Is there anything that catches your attention before it actually fully reboots?

It does sound like a normal power cycle. That's the strange part. I've not been able to capture anything that points me in any specific direction.

 

When I first got the board and CPUs, I had an issue where some memory wasn't recognized. I put it in alternate slots and forgot about it for a few years. I've now pulled those DIMMs and am testing again.

 

If it reboots again, I'll pull the associated CPU and remaining DIMMs and just leave CPU1 and its memory.


Removing the memory may have helped, but it didn't fix it. I got through an entire parity check (with 2 parity drives, at 50 MB/s). However, when I checked this morning it had rebooted again.

 

Now to remove CPU0 and its memory.

7 hours ago, PanteraGSTK said:

Removing the memory may have helped, but it didn't fix it.

If it increased the uptime in a predictable and repeatable way but didn't solve the issue, that would point to the power supply. Not necessarily the PSU itself; there are power supply circuits on the motherboard that further smooth, condition, and regulate the power to the memory and CPU.

 

So, if totally removing 1 cpu and associated RAM "fixes" it, swap to the unused set of CPU and RAM and see if it stays "fixed". If it does, that further implicates the power supply chain. You could have a marginal motherboard power supply that has degraded over time. I've had boards that originally ran 4 sticks of RAM just fine, but later developed instability with more than 2 sticks of perfectly good RAM.

14 hours ago, jonathanm said:

If it increased the uptime in a predictable and repeatable way but didn't solve the issue, that would point to the power supply. Not necessarily the PSU itself; there are power supply circuits on the motherboard that further smooth, condition, and regulate the power to the memory and CPU.

 

So, if totally removing 1 cpu and associated RAM "fixes" it, swap to the unused set of CPU and RAM and see if it stays "fixed". If it does, that further implicates the power supply chain. You could have a marginal motherboard power supply that has degraded over time. I've had boards that originally ran 4 sticks of RAM just fine, but later developed instability with more than 2 sticks of perfectly good RAM.

While removing two sticks of RAM seemed to help, having that RAM in netted the same behavior: sometimes it would be fine with no reboots, and sometimes the reboots would happen quickly.

 

Removing CPU0 and its associated RAM allowed an additional parity check to complete with no issues. I'll continue to test, but the fact that I got memory controller errors with CPU0 when I first installed it, and that they stopped when I moved the RAM around the board, points to CPU0 being the culprit.

 

My PSU is a Corsair RMX850 less than 5 years old. I tend to lean toward the PSU in these situations too, but given all the other factors I'm confident that CPU0 is most likely the issue.


Marking this solved...for the third time.

 

I'm pretty confident this time that CPU0 is to blame. The memory all tests fine, but the memory controller issue I had when I first got the CPU/mobo/RAM combo makes it pretty obvious that's the case.

 

It is possible that socket 0 is the culprit, but I'll find that out when I move CPU1 to socket 0. I'll also move the confirmed-working RAM. If it all checks out I may grab another Xeon 2670 off eBay and hope for the best.

 

I was able to get through a parity check and verify multiple drives using the File Integrity plugin. That plugin made for an easy test, because in the past using it pretty much always triggered a reboot on my 4TB drives.

 

Another test that passed: downloading large files via Usenet while also remuxing a file with MakeMKV. That's quite a lot of operations at once, especially considering my kids were watching movies/TV via Plex at the same time. No reboot. If that didn't make it reboot, nothing will.

 

Hopefully.

 

Thanks for the help.

Edited by PanteraGSTK

