Random crashes in 6b14/15


doctor15

Recommended Posts

I've been experiencing an issue for several weeks now where unRAID crashes every 2-3 days.  It usually happens while idle, and the server becomes completely unresponsive both locally and over the network, so I have to force a restart via the power button.  The most frustrating part is that after the restart I can't view the logs to see what went wrong.  I finally left the log tailing on an attached monitor so I could watch it, and did not see any sign of activity during the crash.
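
For reference, the "log tailing" was just something along these lines on the attached console (a rough sketch; /var/log/syslog is where my build keeps the live log, adjust the path if yours differs):

    tail -f /var/log/syslog    # keep the live system log on screen so the last messages are visible after a hard lock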

 

Every ~20 minutes there was a log entry saying it could not communicate with the UPS (I have the server plugged in at my desk to troubleshoot, and the UPS is in the network closet).  Around 30 minutes before the last UPS entry there were entries like "spindown(0)", "spindown(1)", etc.
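
In case it helps, this is roughly how I pulled those entries back out of the saved log (just a sketch; my UPS is monitored by apcupsd, so adjust the pattern if yours logs under a different name):

    grep -iE "apcupsd|ups|spindown" /var/log/syslog | tail -n 50    # last 50 UPS/spindown related lines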

 

I was initially on beta 14b when this started, but recently upgraded to 15 and the issue has not been resolved.

 

I know this is not much to go on... Any suggestions on how I should continue troubleshooting?

Link to comment

Are you using any plugins?  What kind of hardware do you have?

Link to comment

For plugins I have powerdown (although it still can't save the logs when the system freezes).  I also have a few docker containers, and an Ubuntu VM running on KVM.

 

Hardware is a Dell T20 with a Pentium G3220 3.0 GHz and 8GB of ECC RAM.  For drives I have a 60GB SSD assigned as the cache drive for Docker/KVM, 2x 3TB drives in the array, and a 3TB parity disk.

 

I should also note I have e-mail alerts enabled and don't receive any warnings before it crashes.

Link to comment

OK.  You mentioned that you've had this issue for several weeks now.  Did it just start out of the blue?  It sounds like there could be a hardware problem.  If you had just upgraded to beta 14 or 15 when the issue started, that would point to software, but if it started out of the blue, that would lead me to believe it's hardware or something else going on.

Link to comment

I don't disagree that it might be a hardware issue, but I also have not had unRAID for that long, so I'm not sure it started out of the blue.

 

Regardless... I'm not sure how to troubleshoot given the current log situation.  Any suggestions?  The parity check runs nightly with no issues, and I pre-cleared all drives before setting up the array.

Link to comment

Normal procedure would be to revert to a minimal baseline of hardware and software that works, removing everything possible apart from the basics, and testing those basics to make sure they are reliable.  Then, once you have a baseline set of software and hardware that works, start adding things back until it fails.

 

So the first step should be a long memtest (from the boot menu), overnight if possible.

Then check inside the system for heat issues (nothing too hot, good airflow to everything, fans all working) and loose cables/connections.  If you can, check that CPU temps are reasonable, so you know the CPU-to-heatsink thermal paste is OK.
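
One way to spot-check temperatures from the command line (just a sketch; the lm-sensors package may need to be installed separately, while smartctl is usually already there):

    sensors                               # CPU/motherboard temperatures, if the lm-sensors drivers are loaded
    smartctl -A /dev/sdX | grep -i temp   # drive temperature from SMART; replace sdX with a real device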

Then turn off Docker and VM support, stop any parity checks from happening, remove all plugins, and test.  If it still fails, then you have to keep removing more stuff, until it doesn't.

 

And keep that tail running on the monitor ...

Link to comment
  • 4 weeks later...

So I'm still struggling with this issue, slowly removing one component at a time.  This is time consuming since it doesn't crash for several days.

 

I ran the memtest overnight and had no issues.  Every time it crashes there is nothing in the syslog afterwards.  Is there a way I can increase the log level or find a more hardware-focused log to tail?
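
One thing I'm considering is raising the kernel console log level and mirroring kernel messages to another machine, something like this (just a sketch, assuming a stock Linux kernel; 192.168.1.50 is a placeholder for a receiver on my LAN):

    dmesg -n 8                                          # push all kernel messages, including debug, to the attached console
    modprobe netconsole netconsole=@/,@192.168.1.50/    # mirror kernel messages over UDP to another box (placeholder address)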

 

One thing I did notice is that the screen has a few random colored pixels in different areas after crashing.  Does this sound meaningful to anyone, or is it just a side effect of the crash?

Link to comment

That does kind of make me wonder about memory.  Is this the memory that came with your system?  You don't have any overclocking or anything like that set in the BIOS?
Link to comment

No overclocking or anything fancy.  I have 8GB of ECC RAM: a 4GB stick that came with the system and a 4GB stick that I bought.  I should note it's been running stable in this configuration for 1+ years (I used to be on FreeNAS), but I know RAM can still go bad.

 

Should I try running memtest for a few days?  Or just pull out one of the sticks?
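
In the meantime, since it's ECC RAM, I'll also keep an eye on the corrected-error counters (a rough check, assuming the kernel's EDAC driver is loaded for this chipset):

    grep . /sys/devices/system/edac/mc/mc*/ce_count   # corrected ECC errors per memory controller
    grep . /sys/devices/system/edac/mc/mc*/ue_count   # uncorrected ECC errors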

Link to comment
  • 3 weeks later...

After LOTS of troubleshooting, it turned out it was the RAM.  I'm now on 8+ days of uptime!

 

It's odd that running memtest for 24 hours showed nothing, but the crashes stopped once I pulled one of the DIMMs.

 

Thanks for the help!

Link to comment

memtest does a good job of checking the chips themselves, but there are other things that can occasionally be out of whack that memtest won't show.  So while it's a decent tool, passing a memtest is never a definitive answer to "is my RAM OK?".

Link to comment

Archived

This topic is now archived and is closed to further replies.
