Unraid suddenly becoming unresponsive and crashing after hours or days

BurntOC · September 1, 2020

This is my first support post so I apologize if I miss anything. I'm trying to follow the guidelines.

I've been running Unraid on an HP Z230 workstation while I wait to buy my new setup in the next couple of months. It ran rock solid for weeks at a time until recently. The only things I think have changed are that I installed my GTX 1050 Ti, added a VM or two, and tweaked the networking a bit to segregate some of the containers and VMs using the VLANs I have on my network.

The issue is that after several hours to a day (often overnight) I can't access Unraid via the web UI or SSH. On some rare occasions I've been able to log in locally (I usually run headless and pass the GTX through to the Windows VM, using the iGPU when I need local troubleshooting), but even then there is a delay of several seconds after the first keypress before Unraid begins to acknowledge my login attempt, at which point it becomes responsive again. In these cases I've typically rebooted to restore full functionality. Often I don't catch it at this stage and it is altogether nonresponsive, with no display output at all, and I have to hard power down. If there is a message displayed, it's often about IRQ16, but I don't really have any other PCI slots easily accessible, so I haven't been able to move cards around. In any case, this seems to have started with the GTX and the VM, though to be clear the VM and passthrough seem to work fine until whatever deterioration is setting in occurs.
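For anyone chasing a similar IRQ16 message: one quick check is to see which drivers are registered on that interrupt line. The sketch below is only an illustration, not something from the attached diagnostics. It assumes the modern `/proc/interrupts` layout (per-CPU counters, then a controller name, then a pin/trigger column, then a comma-separated device list), and the helper name `devices_on_irq` is made up for this example.

```python
from pathlib import Path
from typing import List


def devices_on_irq(interrupts_text: str, irq: int) -> List[str]:
    """Return the device/driver names registered on one IRQ line.

    Assumes the layout: "<irq>: <count per CPU>... <chip> <pin-type> <devices>".
    Older kernels that fuse chip and type into one column are not handled.
    """
    for line in interrupts_text.splitlines():
        fields = line.split()
        if not fields or fields[0] != f"{irq}:":
            continue
        rest = fields[1:]
        # Drop the per-CPU interrupt counters (purely numeric columns).
        while rest and rest[0].isdigit():
            rest = rest[1:]
        # Skip the controller name and the pin/trigger column.
        names = " ".join(rest[2:])
        return [n.strip() for n in names.split(",") if n.strip()]
    return []


if __name__ == "__main__":
    path = Path("/proc/interrupts")
    if path.exists():
        shared = devices_on_irq(path.read_text(), 16)
        print(f"IRQ 16 is used by {len(shared)} device(s): {shared}")
```

More than one name on the line means the interrupt is shared, which is worth knowing before deciding whether a card needs to move to another slot.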

I've attached a diagnostics zip that I captured the other day. If it shows the network going down you can ignore that as I was doing some work on the switch to which it is attached when I noticed this happening again and I had a chance to run the diagnostics script.

Thanks for any help.

unraid-diagnostics-20200830-0953.zip

trurl · September 1, 2020

Have you done memtest?

BurntOC · September 1, 2020

I have not. Did you see something in the logs, or is that just a usual test? That is something I suspected could be an issue b/c I'm running 3 DIMMs - 2 that match plus another one (all the same speed, of course). I would've jumped on it, but I believe I added that RAM fairly early on and I'd had great stability, going weeks at a time between the reboots I was initiating as I worked on the server. Maybe I added it closer to when I threw the GTX in and forgot about that.

I'll do that now.

BurntOC · September 1, 2020

7 hours ago, trurl said:

Have you done memtest?

UPDATE: Memtest completed the first pass with no errors. Maybe I could run it overnight to see if something else crops up, but I'd have to think that's unlikely. I was a bit off about the memory config I have in it, though. Turns out it looks like this:

Slot 0 : 4096 MB DDR3-1600 Micron 1G6E1

Slot 1 : 8192 MB DDR3-1600 Patriot memory

Slot 2 : 4096 MB DDR3-1600 Micron 1G6E1

Slot 3 : 4096 MB DDR3-1600 Micron 1G6E1
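A listing like this can be checked mechanically for the odd module out. The following is a rough sketch that assumes the exact "Slot N : SIZE MB TYPE PART" format shown above; the function names `parse_slots` and `odd_modules` are invented for illustration. It only flags a module that differs from the majority and says nothing about whether that module is actually faulty.

```python
import re
from collections import Counter

# Matches lines such as "Slot 1 : 8192 MB DDR3-1600 Patriot memory".
SLOT_RE = re.compile(r"Slot\s+(\d+)\s*:\s*(\d+)\s*MB\s+(\S+)\s+(.+)")


def parse_slots(text):
    """Turn the memtest-style slot listing into a list of dicts."""
    modules = []
    for line in text.splitlines():
        match = SLOT_RE.search(line)
        if match:
            modules.append({
                "slot": int(match.group(1)),
                "size_mb": int(match.group(2)),
                "type": match.group(3),
                "part": match.group(4).strip(),
            })
    return modules


def odd_modules(modules):
    """Return the modules that differ from the most common size/type/part."""
    if not modules:
        return []
    keys = [(m["size_mb"], m["type"], m["part"]) for m in modules]
    majority, _count = Counter(keys).most_common(1)[0]
    return [m for m, key in zip(modules, keys) if key != majority]


listing = """
Slot 0 : 4096 MB DDR3-1600 Micron 1G6E1
Slot 1 : 8192 MB DDR3-1600 Patriot memory
Slot 2 : 4096 MB DDR3-1600 Micron 1G6E1
Slot 3 : 4096 MB DDR3-1600 Micron 1G6E1
"""

modules = parse_slots(listing)
for module in odd_modules(modules):
    print(f"Slot {module['slot']} differs: {module['size_mb']} MB {module['part']}")
```

Run against the listing above, it singles out the module in slot 1, which is the natural first candidate to pull when testing for a mismatch.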

I'll get a nice set of RAM matched in speed and capacity when I buy my new rig soon, but I was hoping to limp along for a bit. If there's another config that looks more stable, let me know - I can probably get away with less for now. I have 3-4 containers running, though, and I was allotting 8GB to the VM when it's running.

Edited September 2, 2020 by BurntOC

BurntOC · September 3, 2020

So another update. It’s been running for 29 hours since I completed the memtest and all seems well in its current state with the GTX out of the chassis. If I pop it back in, it runs passed through to the VM without issue, as I said, but I imagine I’ll start seeing the crashes again. I’d really hoped someone here could interpret the diagnostics log well enough to figure out what’s up. Maybe it’s a bug. There is a line referencing a BUG in the log, so maybe it just doesn’t handle this combo well. I’ve opened up a bug report as well. Fingers crossed this thing stabilizes when I pop the GPU back in tomorrow.

BurntOC · September 3, 2020

So it ran for about 40 hours or so with the GTX out of the server and I thought it might be working without it at least. Then I checked it this afternoon and it is completely non-responsive and I had to hard power off. So basically I can't run any VMs on Unraid without it crashing at some point, which sucks royally. Here's a screenshot of the local console, if that adds any value for the analysis.

trurl · September 4, 2020

17 hours ago, BurntOC said:

Here's a screenshot of the local console, if that adds any value for the analysis.

Not for me, but this might:

BurntOC · September 4, 2020

First of all - thanks for your response. I posted this and a BUG report, as the diagnostics file references a BUG as well, and 3 days later this is the first response I've received from anyone else. Setting up syslog to capture this is definitely in my plans after I knock out a couple of other things, but I thought that the diagnostics.zip was considered the key starting point. As best I can tell no one's even looked at that, so I don't know that sending a zipped log from my syslog server will help, as I'll still need assistance in understanding what it indicates with respect to Unraid and the crashes.

BurntOC · September 7, 2020

Final update: 2 threads (this and a BUG report), 12 days posted between the 2 of them, both with diagnostics.zip files and other info requested - 1 response pointing me to the FAQ.

EDIT - snipped my comments about support and process frustration /EDIT

Current uptime is 3 days, 17+ hours, well beyond what it had been as of late. I believe the solution was that I pulled the 1 stick of RAM that didn't match the others out of the system. Even though the run of memtest had passed maybe subsequent ones would fail, but the stick has performed perfectly fine in other multifunction server installs. In the config I had here it just seems it was enough of a mismatch that it caused an error.

I'll only update this if it looks like it WASN'T the RAM, but for now I'm considering it case closed.

Edited September 7, 2020 by BurntOC

JorgeB · September 7, 2020

Just because the error mentions BUG doesn't mean it's an Unraid bug; in fact, most likely it's not. When it's a hardware problem, it's very difficult to diagnose remotely, especially if nothing is logged in the syslog. If this happens again, make sure to use the syslog feature linked above and then post that syslog. It might catch the beginning of the error; the screenshots you posted only show the end.
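Once a syslog has been captured, finding where the trouble starts can be scripted. The sketch below is a hedged example rather than an Unraid tool: the marker list is an assumption about strings that commonly appear in Linux kernel error reports, the helper `first_error` is a made-up name, and the sample lines in the usage are invented purely to show the call.

```python
import re

# Strings that often open a kernel error report (an assumed, non-exhaustive list).
MARKERS = re.compile(
    r"BUG:|Call Trace|Oops|kernel panic|Machine Check|soft lockup|hard LOCKUP",
    re.IGNORECASE,
)


def first_error(lines, context=3):
    """Return (index, preceding lines plus the hit) for the first marker found.

    Returns None when no line matches any marker.
    """
    lines = list(lines)
    for index, line in enumerate(lines):
        if MARKERS.search(line):
            start = max(0, index - context)
            return index, lines[start:index + 1]
    return None


# Invented sample lines, only to demonstrate usage.
sample = [
    "Sep  3 01:00:00 Tower kernel: eth0: link up",
    "Sep  3 02:10:11 Tower kernel: BUG: unable to handle kernel NULL pointer dereference",
    "Sep  3 02:10:11 Tower kernel: Call Trace:",
]

hit = first_error(sample, context=1)
if hit is not None:
    index, excerpt = hit
    print(f"First suspicious entry at line {index + 1}:")
    print("\n".join(excerpt))
```

The few lines before the first hit are usually the most useful part to post, since they show what the server was doing just before the error began.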

BurntOC · September 7, 2020

I didn’t think that it was a bug, but it reported one in the log and I was prepared to be wrong.

Regarding the syslog - got it. I just wish that had been emphasized in addition to or in lieu of the diagnostics file from the beginning, and I would’ve made it happen. The impression I had was that providing this detail and the diagnostics file was the way to get help, with a syslog requested if more info was needed.

Edited September 7, 2020 by BurntOC

JorgeB · September 7, 2020

Yes, but let me repeat that it's very difficult to diagnose most hardware issues remotely, especially when they lock up the server. If there's something on the syslog server, that might help, but most times the way to find the problem is for the user to start swapping some of the hardware around.
