doctor15 Posted April 21, 2015
I've been experiencing an issue for several weeks now where unRAID crashes every 2-3 days. It usually happens while idle, and the server becomes completely unresponsive both locally and over the network, so I have to force a restart via the power button. The most frustrating part is that after the restart I can't view the logs to see what went wrong. I finally left the log tailing and hooked the server up to a monitor so I could watch it, and did not see any sign of activity during the crash. Every ~20 minutes there was a log entry saying it could not communicate with the UPS (I have the server plugged in at my desk to troubleshoot, and the UPS is in the network closet). Around 30 minutes before the last UPS entry there were entries like "spindown(0)", "spindown(1)", etc. I was initially on beta 14b when this started, but recently upgraded to 15 and the issue has not resolved. I know this is not much to go on... Any suggestions on how I should continue troubleshooting?
jonp Posted April 21, 2015
Are you using any plugins? What kind of hardware do you have?
doctor15 Posted April 21, 2015 (Author)
For plugins I have powerdown (although it still can't save the logs when the system freezes). I also have a few Docker containers and an Ubuntu VM running on KVM. Hardware is a Dell T20, Pentium G3220 3.0 GHz with 8GB ECC RAM. For drives I have a 60GB SSD assigned as the cache drive for Docker/KVM, 2x3TB drives in the array, and a 3TB parity disk. I should also note I have e-mail alerts enabled and don't receive any warnings before it crashes.
jonp Posted April 21, 2015
Ok. You mentioned that you've had this issue for several weeks now. Did this just start out of the blue? Sounds like there could be a hardware problem. If you had just upgraded to beta 14 or 15 when the issues started, that would point to software, but if it just started out of the blue, that would lead me to believe it's either hardware or something else going on.
doctor15 Posted April 21, 2015 (Author)
I don't disagree that it might be a hardware issue, but I also have not had unRAID for that long, so I'm not sure it started out of the blue. Regardless, I'm very unsure how to troubleshoot given the current log situation. Any suggestions? The parity check runs nightly with no issues, and I pre-cleared all drives before setting up the array.
RobJ Posted April 21, 2015
Normal procedure would be to revert to a minimal baseline of hardware and software that works, removing everything possible apart from the basics, plus testing the basics to make sure they are reliable. Then, once you have a baseline set of software/hardware that works, start adding things back until it fails. So the first step should be a long memtest (from the boot menu), overnight if possible. Then check inside the system for heat issues (nothing too hot, good airflow to everything, fans all working) and loose cables/connections. If you can, check that CPU temps are reasonable, so you know the CPU-to-heatsink thermal paste is OK. Then turn off Docker and VM support, stop any parity checks from happening, remove all plugins, and test. If it still fails, keep removing more stuff until it doesn't. And keep that tail running on the monitor...
doctor15 Posted May 20, 2015 (Author)
So I'm still struggling with this issue, slowly removing one component at a time. This is time-consuming since it doesn't crash for several days. I ran the memtest overnight and had no issues. Every time after it crashes there is nothing in the syslog. Is there a way I can increase the log level or find a more hardware-focused log to tail? One thing I did notice is that the screen has a few random colored pixels in different areas after crashing. Does this sound meaningful to anyone, or is it just a side effect of the crash?
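[On the log question: unRAID keeps /var/log in a RAM filesystem, so a hard power-off wipes it. One common workaround is to periodically append the live syslog to a file on the flash drive, which survives reboots. A minimal sketch of the idea, using stand-in /tmp paths so it can run anywhere; the real paths on an unRAID box would be /var/log/syslog and somewhere under /boot:]

```shell
#!/bin/sh
# Sketch: preserve syslog entries across a hard power-off by appending
# them to persistent storage. Paths here are stand-ins for illustration;
# on a real server you would likely use:
#   LOG=/var/log/syslog  DEST=/boot/logs/syslog-persist.txt
LOG=/tmp/example-syslog
DEST=/tmp/example-flash-syslog.txt

# Simulate a log entry like the ones seen before the crash:
printf '%s\n' "Apr 21 12:00:00 tower kernel: spindown(0)" > "$LOG"

# Append the current log to the persistent copy (run periodically,
# e.g. from a cron job, so the copy is near-current when it hangs):
cat "$LOG" >> "$DEST"

# Show the last preserved line:
tail -n 1 "$DEST"
```

Raising the kernel console log level (e.g. `dmesg -n 7`) can also push more kernel messages onto the attached monitor before a hang, though a full hard lock may still print nothing.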
trurl Posted May 20, 2015
Does kind of make me wonder about memory. Is this the memory that came with your system? You don't have any overclocking or anything like that set in the BIOS?
doctor15 Posted May 20, 2015 (Author)
No overclocking or anything fancy. I have 8GB of ECC RAM: a 4GB stick that came with the system and a 4GB stick that I bought. I should note it's been running stable in this configuration for 1+ years (I used to be on FreeNAS), but I know RAM can still go bad. Should I try running memtest for a few days? Or just pull out one of the sticks?
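[Since the RAM is ECC, one more thing worth checking before pulling sticks: if the chipset is supported and the kernel's EDAC driver is loaded, corrected-error counts per memory controller are exposed in sysfs, and a rising count points straight at a failing DIMM. Availability and exact paths vary by platform, so this is a sketch that degrades gracefully when EDAC is absent:]

```shell
#!/bin/sh
# Sketch: read ECC corrected-error counters from the kernel's EDAC
# sysfs interface, if present. EDAC support depends on the chipset
# and on the appropriate edac module being loaded.
found=0
for f in /sys/devices/system/edac/mc/mc*/ce_count; do
  if [ -e "$f" ]; then
    found=1
    echo "$f: $(cat "$f") corrected errors"
  fi
done
if [ "$found" -eq 0 ]; then
  echo "EDAC counters not exposed on this system"
fi
```

A nonzero (and especially a growing) `ce_count` means ECC is quietly correcting bit errors, which often precedes uncorrectable ones; zero everywhere doesn't prove the RAM is good, for the same reasons a passing memtest doesn't.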
doctor15 Posted May 20, 2015 (Author)
Oh, I should also add that I'm on 6rc2 now.
doctor15 Posted June 6, 2015 (Author)
After LOTS of troubleshooting, it turned out it was the RAM. I'm now at 8+ days of uptime! It's odd that running memtest for 24 hours showed nothing, but the crashes stopped once I pulled one of the DIMMs. Thanks for the help!
jonp Posted June 6, 2015
memtest does a good job of checking the chips themselves, but there are other things that can occasionally be out of whack that memtest won't show. So while it's a decent tool, passing a memtest is never a definitive answer to "is my RAM OK?"