PanteraGSTK Posted July 2, 2019 (edited) Over the past week I've been getting random reboots. From what I've seen in the logs, there is no obvious reason for them. I figured I would post the diagnostics to see if anyone with more knowledge than me can help. Normally I'd suspect the PSU, but I got it around 4 years ago (Corsair RMX 850) and have had zero issues until now. The board and CPU are both older, but again no issues until now. Let me know what else I can provide. Attachment: unraid-diagnostics-20190702-1314.zip Edited August 15, 2019 by PanteraGSTK (title update)
Alphahelix Posted July 2, 2019 Following here, as I have the same issue.
PanteraGSTK Posted July 6, 2019 Author It's gotten better (I think Unassigned Devices was doing something; I removed the plugin and it's not throwing so many errors), but it happened again this morning. I have my dockers set to back up every night and the reboots somehow seem tied to that time of night. Attachment: unraid-diagnostics-20190706-1714.zip
PanteraGSTK Posted July 7, 2019 Author May have found the issue. I have a disk that is dying. Will replace it tomorrow and monitor for random reboots. Turns out I hadn't modified my SMART settings for this particular drive to work with my 3ware controller.
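For anyone hitting the same SMART problem: disks behind a 3ware controller usually have to be queried through the controller device with smartctl's `-d 3ware,N` device type, where N is the controller port. Below is a minimal Python sketch of building that command. The device node `/dev/twa0` and the port number are assumptions for illustration; the thread does not say which values apply to this server.

```python
import subprocess


def smartctl_3ware_cmd(port, device="/dev/twa0"):
    """Build a smartctl command for a disk behind a 3ware controller.

    port   -- controller port the disk hangs off (assumed value, check your setup)
    device -- controller device node; /dev/twa0 is an assumption
    """
    if port < 0:
        raise ValueError("port must be non-negative")
    # -a prints all SMART information; -d 3ware,N selects the disk on port N
    return ["smartctl", "-a", "-d", f"3ware,{port}", device]


def read_smart(port, device="/dev/twa0"):
    """Run smartctl and return its text output (needs root and smartmontools)."""
    result = subprocess.run(
        smartctl_3ware_cmd(port, device),
        capture_output=True,
        text=True,
        check=False,  # smartctl uses non-zero exit codes to flag disk problems
    )
    return result.stdout
```

Calling `read_smart(3)` would run `smartctl -a -d 3ware,3 /dev/twa0`. In the Unraid GUI the equivalent is setting the SMART controller type on the individual disk's settings page.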
PanteraGSTK Posted July 9, 2019 Author It would appear that replacing the failing drive resolved the issue. I am curious as to why a bad/failing drive would cause reboots though. What would I look at to make sure that was my issue?
PanteraGSTK Posted July 9, 2019 Author (edited) That did not solve the problem (got more space now though). I got through a 17hr parity check without any issues, but as soon as I start downloading with NZBget the server restarts after about 10 minutes. Very odd. I've started using the syslog function, so I've attached that log file. The reboot happened around 6pm (1800). Attachment: syslog Edited July 9, 2019 by PanteraGSTK
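One way to narrow down a saved syslog is to pull out the last few lines written before each boot, since those are the closest thing to a clue an abrupt reset leaves behind. The Python sketch below assumes each boot can be recognized by the kernel's "Linux version" banner line; that marker, the function name, and the default context size are assumptions for illustration, not something taken from the attached log.

```python
from collections import deque

BOOT_MARKER = "Linux version"  # assumption: kernel banner logged at every boot


def lines_before_boots(lines, context=5, marker=BOOT_MARKER):
    """Return, for each boot marker, the `context` lines logged just before it."""
    recent = deque(maxlen=context)  # rolling window of the latest lines
    groups = []
    for line in lines:
        line = line.rstrip("\n")
        if marker in line:
            if recent:  # skip a marker at the very top of the file
                groups.append(list(recent))
            recent.clear()
        recent.append(line)
    return groups
```

To use it, open the saved log and pass the file object in, for example `lines_before_boots(open("syslog"), context=20)`. If the lines before a boot look completely ordinary, with no shutdown messages and no errors, that tends to point toward a hardware-level reset rather than a software crash.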
itimpi Posted July 10, 2019 It could well be worth stopping the array, restarting it in Maintenance mode, and then clicking on each array drive on the Main tab in turn and running a file system check.
PanteraGSTK Posted July 10, 2019 Author 4 hours ago, itimpi said: "It could well be worth stopping the array and then restarting in Maintenance mode and then clicking on each array drive on the Main tab in turn and running a file system check." Thanks for the tip. I did that and didn't see any errors, but I'm not all that familiar with xfs_repair. I let it repair as needed and all the checks finished very quickly, only a few seconds per disk. Not sure if that's good or bad.
itimpi Posted July 10, 2019 19 minutes ago, PanteraGSTK said: "Thanks for the tip. I did that and didn't see any errors, but I'm not all that familiar with xfs_repair. I let it repair as needed and all the checks finished very quickly. Only took a few seconds per disk. Not sure if that's good or bad." Xfs_repair IS very quick, especially if no errors (or just a small number) are found. It will be interesting to see whether it has helped in any way.
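For reference, the same check can be run read-only from a terminal with `xfs_repair -n` (no-modify mode), which reports problems without changing anything. The Python sketch below loops that over a list of devices. The device names in the usage note are an assumption about how the array disks appear in Maintenance mode, and the reading of the exit status (0 for clean, 1 for corruption detected when `-n` is used) follows the xfs_repair man page as I understand it.

```python
import subprocess


def xfs_check_cmd(device):
    """Command for a read-only XFS check; -n is xfs_repair's no-modify mode."""
    return ["xfs_repair", "-n", device]


def check_devices(devices, run=subprocess.run):
    """Run the read-only check on each device and return {device: exit_status}.

    `run` is injectable so the loop can be exercised without touching real disks.
    """
    results = {}
    for dev in devices:
        proc = run(xfs_check_cmd(dev), capture_output=True, text=True, check=False)
        results[dev] = proc.returncode
    return results
```

A call such as `check_devices(["/dev/md1", "/dev/md2"])` (hypothetical device names) would need root and a stopped or Maintenance-mode array. A check that returns in a few seconds with status 0 is consistent with the file system being clean.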
PanteraGSTK Posted July 10, 2019 Author 9 minutes ago, itimpi said: "Xfs_repair IS very quick, especially if no errors (or just a small number) are found. Be interesting to see whether it has helped in any way." I will let you know either way.
PanteraGSTK Posted July 11, 2019 Author 22 hours ago, itimpi said: "Xfs_repair IS very quick, especially if no errors (or just a small number) are found. Be interesting to see whether it has helped in any way." It would seem everything is OK. 24hrs uptime and I've stress tested with no issues so far. Thanks again for the help.
itimpi Posted July 11, 2019 Good to hear! File system corruption should not really cause crashes, but for some reason occasionally it does.
PanteraGSTK Posted August 13, 2019 Author Man, I really thought the new SAS card was going to fix this. I recently removed my SAS2LP card in favor of an LSI HBA, but during a parity check the system rebooted at 85%. I'm not finding anything in the logs that tells me where to look. Any ideas on where to start?
JonathanM Posted August 13, 2019 5 hours ago, PanteraGSTK said: "Any ideas on where to start?" Memory. Try removing half and running for a while. This kind of thing is usually hardware. Have you been sitting near or in front of it when it power cycled yet?
PanteraGSTK Posted August 13, 2019 Author 1 hour ago, jonathanm said: "Memory. Try removing half and running for a while. This kind of thing is usually hardware. Have you been sitting near or in front of it when it power cycled yet?" Yeah, I have. Many times. It's in a closet, but I'm 6 feet away from the actual server. I'll pull some memory, but it's a dual proc board so I'll have to be careful.
JonathanM Posted August 13, 2019 32 minutes ago, PanteraGSTK said: "Yeah, I have. Many times. It's in a closet, but I'm 6 feet away from the actual server. I'll pull some memory, but it's a dual proc board so I'll have to be careful." Hmm. Maybe temporarily pull 1 whole processor and the associated memory. When it reboots, does it sound like a normal power cycle? Is there anything that catches your attention before it actually fully reboots?
PanteraGSTK Posted August 14, 2019 Author 1 hour ago, jonathanm said: "Hmm. Maybe temporarily pull 1 whole processor and the associated memory. When it reboots, does it sound like a normal power cycle? Is there anything that catches your attention before it actually fully reboots?" It does sound like a normal power cycle. That's the strange part. I've not been able to capture anything that points me in any specific direction. When I first got the board and CPUs, I had an issue where some memory wasn't recognized. I put those DIMMs in alternate slots and forgot about it for a few years. I have now pulled those DIMMs and am testing again. If it reboots again, I'll pull the associated CPU and remaining DIMMs and just leave CPU1 and its memory.
PanteraGSTK Posted August 14, 2019 Author Removing the memory may have helped, but it didn't fix it. I got through an entire parity check (with 2 parity drives at 50mbps). However, when I checked this morning it had rebooted again. Next step is to remove CPU0 and its memory.
JonathanM Posted August 14, 2019 7 hours ago, PanteraGSTK said: "Removing the memory may have helped, but it didn't fix it." If it increased the uptime in a predictable and repeatable way but didn't solve the issue, that would point to power supply. Not necessarily the PSU itself: there are power supply circuits on the motherboard that further smooth, condition, and regulate the power to the memory and CPU. So, if totally removing 1 CPU and its associated RAM "fixes" it, swap to the unused set of CPU and RAM and see if it stays "fixed". If it does, that further implicates the power supply chain. You could have a marginal motherboard power supply that has degraded over time. I've had boards that originally ran 4 sticks of RAM just fine, but later developed instability with more than 2 sticks of perfectly good RAM.
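The swap test described above boils down to a small decision table. The Python sketch below is only one reading of that reasoning; the set labels "A" and "B", the function name, and the wording of the conclusions are made up for illustration.

```python
def likely_suspect(stable_with_set_a, stable_with_set_b):
    """Interpret the outcome of running each CPU+RAM set on its own.

    stable_with_set_a -- True if the server stayed up with only set A installed
    stable_with_set_b -- True if the server stayed up with only set B installed
    """
    if stable_with_set_a and stable_with_set_b:
        # Each half is fine alone, so the fault shows up only under full load
        return "power delivery (PSU or motherboard regulators)"
    if stable_with_set_a != stable_with_set_b:
        bad = "B" if stable_with_set_a else "A"
        return f"CPU/RAM set {bad} (or its socket and slots)"
    # Unstable either way: halving the load did not help at all
    return "inconclusive, look beyond CPU and RAM"
```

For example, `likely_suspect(True, False)` points at set B, while `likely_suspect(True, True)` matches the case where the power supply chain is implicated. The results only mean something if each configuration is tested long enough to be confident it is really stable.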
PanteraGSTK Posted August 15, 2019 Author 14 hours ago, jonathanm said: "If it increased the uptime in a predictable and repeatable way but didn't solve the issue, that would point to power supply. Not necessarily the PSU itself, there are power supply circuits on the motherboard to further smooth, condition, and regulate the power to the memory and CPU. So, if totally removing 1 cpu and associated RAM 'fixes' it, swap to the unused set of CPU and RAM and see if it stays 'fixed'. If it does, that further implicates the power supply chain. You could have a marginal motherboard power supply that has degraded over time. I've had boards that originally ran 4 sticks of RAM just fine, but later developed instability with more than 2 sticks of perfectly good RAM." While removing two sticks of RAM seems to have helped, having that RAM in netted the same behavior: sometimes it would be fine with no reboots, sometimes the reboots would happen quickly. Removing CPU0 and its associated RAM allowed an additional parity check to complete with no issues. I'll continue to test, but the fact that I got memory controller errors with CPU0 when I first installed it, and that they stopped when I moved the RAM around the board, points to CPU0 being the culprit. My PSU is a Corsair RMX850 and it's less than 5 years old. I tend to lean toward the PSU in these situations too, but with all the other factors I'm confident that CPU0 is most likely the issue.
PanteraGSTK Posted August 15, 2019 Author (edited) Marking this solved...for the third time. I'm pretty confident this time that CPU0 is to blame. The memory all tests fine, but the memory controller issue I had when I first got the CPU/mobo/RAM combo makes it pretty obvious that is the case. It is possible that socket 0 is the culprit, but I'll find that out when I move CPU1 to socket 0. I'll also move the confirmed working RAM. If it all checks out I may grab another Xeon 2670 off eBay and hope for the best. I was able to get through a parity check and to check multiple drives using the file integrity plugin. That plugin was an easy test, because when I tried to use it in the past a reboot was pretty much always triggered on my 4TB drives. Another test that passed was downloading large files via Usenet while also remuxing a file using MakeMKV. That's quite a lot of operations at once, especially considering my kids were watching movies/TV via Plex at the same time. No reboot. If that didn't make it reboot, nothing will. Hopefully. Thanks for the help. Edited August 15, 2019 by PanteraGSTK