Unraid suddenly becoming unresponsive and crashing after hours or days


BurntOC

This is my first support post so I apologize if I miss anything.  I'm trying to follow the guidelines.

 

I've been running Unraid on an HP Z230 workstation while I wait to buy my new setup in the next couple of months. It ran rock solid for weeks at a time until recently. The only things I think have changed are that I've installed my GTX 1050 Ti, added a VM or two, and tweaked the networking a bit to segregate some of the containers and VMs using the VLANs I have on my network.

 

The issue is that after several hours to a day (often overnight) I can't access Unraid via the web UI or SSH. On some rare occasions I've been able to log in locally (I usually run headless and pass my GTX through to the Windows VM, using my iGPU when I need a local console for troubleshooting), but even then there is a delay of several seconds after the first keypress before Unraid acknowledges my login attempt, at which point it becomes responsive again. In these cases I've typically rebooted to restore full functionality. Often I don't catch it at this stage and it is altogether nonresponsive, with no display output at all, and I have to hard power down. If there is a message displayed, it's often about IRQ16, but I don't really have any other PCI slots easily accessible so I haven't been able to move cards around. In any case, this seems to have started with the GTX and the VM, though to be clear the VM and passthrough seem to work fine until whatever deterioration is setting in occurs.
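
For anyone chasing a similar IRQ16 message, one rough way to see which devices share that interrupt line is to read `/proc/interrupts` on the server. The sketch below is only an illustration and not something taken from the diagnostics: the function name `devices_on_irq` is made up here, and it assumes the common Linux layout where each row holds the IRQ number, one counter per CPU, a controller name, a trigger description, and then a comma-separated list of devices.

```python
def devices_on_irq(interrupts_text, irq):
    """Return the device names listed on one IRQ row of /proc/interrupts text."""
    lines = interrupts_text.splitlines()
    if not lines:
        return []
    # The header row has one column per CPU (CPU0, CPU1, ...).
    cpu_count = len(lines[0].split())
    for line in lines[1:]:
        label, sep, rest = line.partition(":")
        if not sep or label.strip() != str(irq):
            continue
        fields = rest.split()
        # Skip the per-CPU counters, then the controller name and the
        # trigger description (assumed to be two separate fields).
        tail = fields[cpu_count + 2:]
        # Devices sharing the line are comma-separated at the end of the row.
        return [name.strip() for name in " ".join(tail).split(",") if name.strip()]
    return []
```

Calling it with the contents of `/proc/interrupts` and `16` would list whatever is registered on that line. More than one entry means the IRQ is shared, which may or may not be related to the hangs.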

 

I've attached a diagnostics zip that I captured the other day.  If it shows the network going down you can ignore that as I was doing some work on the switch to which it is attached when I noticed this happening again and I had a chance to run the diagnostics script. 

 

Thanks for any help.

unraid-diagnostics-20200830-0953.zip

I have not. Did you see something in the logs or is that just a usual test? I'll look into it now. That is something I suspected could be an issue b/c I'm running 3 DIMMs: 2 that match plus another one (all the same speed, of course). I would've jumped on it sooner, but I believe I added that RAM fairly early on, and I'd had great stability, going weeks at a time between the reboots I was initiating as I worked on the server. Maybe I added it closer to when I threw the GTX in and forgot about that.

 

I'll do that now.

7 hours ago, trurl said:

Have you done memtest?

UPDATE: Memtest completed the first pass with no errors. Maybe I could run it overnight to see if something else crops up, but I'd have to think that's unlikely. I was a bit off about the memory config I have in it, though. Turns out it looks like this:

 

Slot 0 : 4096 MB DDR3-1600 Micron 1G6E1

Slot 1 : 8192 MB DDR3-1600 Patriot memory

Slot 2 : 4096 MB DDR3-1600 Micron 1G6E1

Slot 3 : 4096 MB DDR3-1600 Micron 1G6E1
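
To make the odd module out easy to spot in a listing like this, here is a small sketch that parses the slot lines and flags any module whose size, speed, or vendor string differs from the most common one. The function name `odd_modules` and the regular expression are made up for this example, and it assumes the listing keeps the "Slot N : SIZE MB SPEED VENDOR" shape shown above (with or without the colon).

```python
import re
from collections import Counter

# Matches e.g. "Slot 1 : 8192 MB DDR3-1600 Patriot memory" (colon optional).
SLOT_RE = re.compile(r"Slot\s+(\d+)\s*:?\s*(\d+)\s*MB\s+(\S+)\s+(.*)")

def odd_modules(listing):
    """Return (slot, spec) pairs for modules that differ from the majority spec."""
    modules = []
    for line in listing.splitlines():
        m = SLOT_RE.match(line.strip())
        if m:
            slot, size_mb, speed, vendor = m.groups()
            modules.append((int(slot), (int(size_mb), speed, vendor.strip())))
    if not modules:
        return []
    # The most frequent (size, speed, vendor) combination is treated as the baseline.
    common, _count = Counter(spec for _slot, spec in modules).most_common(1)[0]
    return [(slot, spec) for slot, spec in modules if spec != common]
```

Run against the four lines above, this flags only slot 1, the 8192 MB Patriot stick. It says nothing about whether that stick is faulty, only that it does not match the other three.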

 

I'll get a nice set of matched speed and capacity RAM when I buy my new rig soon, but I was hoping to limp along for a bit. If there's another config that looks more stable let me know - I can probably get away with less for now. I have 3-4 containers running and I was allotting 8GB to the VM when running, though.

 

Edited by BurntOC

So another update. It’s been running for 29 hours since I completed the memtest and all seems well in its current state with the GTX out of the chassis. If I pop it back in, like I said, it runs passed through to the VM without issue, but I imagine I’ll start seeing the crashes again. I’d really hoped someone here could interpret the diagnostics log well enough to figure out what’s up. Maybe it’s a bug. There is a line referencing a BUG in the log, so maybe it just doesn’t handle this combo well. I’ve opened up a bug report as well. Fingers crossed this thing stabilizes when I pop the GPU back in tomorrow.

So it ran for about 40 hours or so with the GTX out of the server and I thought it might be working without it at least.  Then I checked it this afternoon and it is completely non-responsive and I had to hard power off.  So basically I can't run any VMs on Unraid without it crashing at some point, which sucks royally.  Here's a screenshot of the local console, if that adds any value for the analysis.

IMG_20200903_154013.jpg

First of all - thanks for your response. I posted this and a BUG report, as the diagnostics file references a BUG as well, and 3 days later this is the first response I've received from anyone else. Setting up syslog to capture this is definitely in my plans after I knock out a couple of other things, but I thought that the diagnostics.zip was considered the key starting point. As best I can tell no one's even looked at that, so I don't know that sending a zipped log from my syslog server will help, as I'll still need assistance in understanding what it indicates with respect to Unraid and the crashes.

Final update: 2 threads (this and a BUG report), 12 days posted between the 2 of them, both with diagnostics.zip files and the other info requested, and 1 response pointing me to the FAQ.
 

EDIT - snipped my comments about support and process frustration /EDIT

 

Current uptime is 3 days, 17+ hours, well beyond what it had been of late. I believe the solution was pulling the 1 stick of RAM that didn't match the others. The single memtest pass found no errors, and maybe subsequent passes would have failed, but the stick has performed perfectly fine in other multifunction server installs. In the config I had here it just seems it was enough of a mismatch to cause errors.

 

I'll only update this if it looks like it WASN'T the RAM, but for now I'm considering it case closed.

Edited by BurntOC

Just because the error mentions BUG doesn't mean it's an Unraid bug; in fact, most likely it's not. When it's a hardware problem it's very difficult to diagnose remotely, especially if there's nothing logged in the syslog. If this happens again, make sure to use the syslog feature linked above and then post that syslog. It might catch the beginning of the error; the screenshots you posted only show the end.
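
For readers without a syslog server already on the network, the idea is simply to have another machine write down whatever the server sends before it locks up. The sketch below is a bare-bones illustration of that idea and not the Unraid feature itself: the function name `receive_syslog` is made up here, and it assumes the server has been set to forward its syslog over UDP to the machine running this script.

```python
import socket

def receive_syslog(sock, log_path, max_messages):
    """Append up to max_messages UDP datagrams received on sock to log_path."""
    with open(log_path, "a", encoding="utf-8") as log:
        for _ in range(max_messages):
            data, (host, _port) = sock.recvfrom(8192)
            text = data.decode("utf-8", errors="replace").rstrip()
            # Prefix each message with the sender's address.
            log.write(f"{host} {text}\n")
            # Flush per message so the file is current if this script is interrupted.
            log.flush()
```

A hypothetical way to run it would be to create a `socket.socket(socket.AF_INET, socket.SOCK_DGRAM)`, bind it to an unprivileged port of your choosing, and call `receive_syslog` with a large `max_messages`. The last lines in the file after a hang are then the ones worth posting. A proper syslog daemon is the better long-term choice; this only shows what is being captured.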

I didn’t think that it was a bug, but it reported one in the log and I was prepared to be wrong.  
 

Regarding the syslog - got it. I just wish that had been emphasized in addition to or in lieu of the diagnostics file from the beginning and I would’ve made it happen. The impression I had was that providing this detail and the diagnostics file was the way to get help, with a syslog requested if more info was needed.

Edited by BurntOC