Nightly hard crash- what does this screen mean?

January 26, 2018

I was having issues with rebooting VMs, where the whole server would become unresponsive: no shares, no GUI, no SSH. I never identified the real reason, but it went away with 6.4.

Now I am getting a nightly crash that I can't pinpoint. The server is just gone every morning. During the last round of this I replaced/reseated all SATA cables.

I can post diags, but they will be from after the hard reboot and my daily parity check. Will they show anything useful? After 9 months, I need to get this stable or abandon UNRAID; it's consuming too much time and energy.

Thanks

January 26, 2018

Community Expert

The screenshot indicates an I/O error on the loop2 device, which is either the docker image or libvirt.img. The real problem might be the device where the image is stored, though: e.g., if it's on the cache device and that device drops offline, those errors are expected.
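To confirm which image file actually sits behind loop2, `losetup -a` lists every loop device with its backing file. Below is a minimal sketch that pulls out one device's backing file from that listing. It assumes the classic `losetup -a` line format (`/dev/loopN: [dev]:inode (path)`), and the helper name `loop_backing_file` is made up for illustration.

```shell
#!/bin/bash
# loop_backing_file (hypothetical helper): read `losetup -a` style text on
# stdin and print the backing file of the loop device given as $1.
loop_backing_file() {
    local dev="$1"
    # Expected line shape: /dev/loop2: [2049]:131 (/path/to/image.img)
    # Capture whatever sits between the first pair of parentheses.
    sed -n "s|^${dev}: [^(]*(\([^)]*\)).*|\1|p"
}

# Example, run on the server itself:
#   losetup -a | loop_backing_file /dev/loop2
```

If the path printed is the docker image, that points at Docker; if it is libvirt.img, it points at the VM service. Either way the disk holding that file is still a suspect.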

January 26, 2018

Author

Ok. I have dual cache drives. Any way to know which one I should pull to see if it stabilizes?

January 26, 2018

Community Expert

Check the output of:

btrfs dev stats /mnt/cache

Any non-zero value means there's a problem.
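With two pool devices that is ten counters to eyeball, so a small filter can help. This is only a sketch: it assumes each line of the stats output has the form `[device].counter value`, and the function name `nonzero_btrfs_stats` is invented here.

```shell
#!/bin/bash
# nonzero_btrfs_stats (hypothetical helper): read `btrfs dev stats` output on
# stdin and print only the counters whose value is not zero.
nonzero_btrfs_stats() {
    # Each data line has two fields: "[/dev/sdX1].some_errs" and a number.
    awk 'NF == 2 && $2 != 0 { print }'
}

# Example, run on the server itself:
#   btrfs dev stats /mnt/cache | nonzero_btrfs_stats
# No output means every counter is zero.
```

The device name in any line that does get printed tells you which pool member logged the errors.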

January 26, 2018

Author

all zeroes

[/dev/sde1].write_io_errs 0
[/dev/sde1].read_io_errs 0
[/dev/sde1].flush_io_errs 0
[/dev/sde1].corruption_errs 0
[/dev/sde1].generation_errs 0
[/dev/sdd1].write_io_errs 0
[/dev/sdd1].read_io_errs 0
[/dev/sdd1].flush_io_errs 0
[/dev/sdd1].corruption_errs 0
[/dev/sdd1].generation_errs 0

January 26, 2018

Community Expert

Post your current diags, it might show something.

January 26, 2018

Author

unraid-diagnostics-20180126-0837.zip

January 26, 2018

Community Expert

Nothing jumps out. Since loop2 was most likely the docker image, and it's easy to do, I would recommend deleting and recreating it.

January 26, 2018

Community Expert

Another thing to do would be to shut down all VMs/dockers, then turn them on one by one, letting each run for some time, to try to find out if one of them is causing the issues.

January 26, 2018

Author

Ok, I will try those things over the next few days and report back. Would FCP troubleshooting mode be helpful here? It hasn't in the past when I was able to narrow down the crashes to VM reboots.

Also, I have for months been fighting with Code 43 on two separate video cards. They're still installed but I set that issue aside to deal with the crashes. Could the code 43 be the issue, (or part of)?

January 26, 2018

Community Expert

6 minutes ago, btrcp2000 said:

Would FCP troubleshooting mode be helpful here?

Possibly, might be worth a try.

7 minutes ago, btrcp2000 said:

Could the code 43 be the issue, (or part of)?

Not really much experience with that, but it's possible.

January 28, 2018

Author

Didn't have time to do anything until yesterday evening, when all I did was reinstall the Dropbox docker. Woke up this morning and unraid was unresponsive. The 0356 diags are the last ones before the crash, from FCP troubleshooting mode; hopefully they contain something useful. The 0927 diags are post-reboot, after which a machine check event was detected, which I don't see very often. mcelog is installed via NerdPack, so hopefully the diags explain the MCE. Parity check running now.

The Win7 VM has its own Windows Task Scheduler task scheduled to reboot it every night at 4am. That task shows as completed successfully this morning at 4, and the last troubleshooting log entry was at 3:56AM. There is an entry in the libvirt log about the VM domain being tainted. I have had issues for months with VM reboots that I thought went away with 6.4, but it's something to think about. This was occurring with multiple VMs.

Looking into your earlier suggestions now, thanks as always.

unraid-diagnostics-20180128-0356.zip

unraid-diagnostics-20180128-0927.zip

February 1, 2018

Author

Another mystery freeze last night. Nothing interesting on the monitor, but it wouldn't respond to the keyboard. Here is another post-reboot diag file. Is there anything at all in this one, or in the ones I posted last Friday, one message up?

unraid-diagnostics-20180201-0722.zip

Edited February 1, 2018 by btrcp2000

February 1, 2018

Community Expert

On 28/01/2018 at 2:50 PM, btrcp2000 said:

hopefully they contain something useful.

Nothing I can see.

On 28/01/2018 at 2:50 PM, btrcp2000 said:

after which a machine check event was detected, which I don't see very often

It's a hardware error. I'm not sure whether it's RAM or CPU, but it's not normal and it could explain your issues.
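To see how often these events show up, the kernel log can be filtered for machine check messages. A rough sketch follows; the pattern list is a best guess at how such lines are usually worded, not an exhaustive match, and `mce_lines` is a made-up name.

```shell
#!/bin/bash
# mce_lines (hypothetical helper): read kernel log text on stdin and print
# the lines that look like machine check reports. The patterns are a guess
# at typical wording and may miss or over-match on some systems.
mce_lines() {
    grep -iE 'machine check|mce:|hardware error'
}

# Example, run on the server itself:
#   dmesg | mce_lines
```

Comparing the timestamps of any hits against the times of the freezes may show whether the two are related.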

February 2, 2018

Author

I went ahead and RMA'd my RAM due to the errors that were occasionally happening. Ran 1 pass of memtest overnight just for giggles while I'm waiting for their . Had to run it from a different usb drive as the one on the unraid stick wouldn't start even after being redownloaded from the unraid site. No errors were found, but I saw this in the log after rebooting this morning. Hopefully another piece of the puzzle? Diags also attached

image.png.4d736f18543b6158b32bb178b1d9578b.png unraid-diagnostics-20180202-0849.zip

February 2, 2018

Community Expert

The first error happens to some users and is considered harmless if the VM is working normally. The second one is just about some temp files that have since been deleted, so it's not an issue.

February 5, 2018

Author

Removed the stick of RAM that had been generating occasional "correctable" errors. Hard crash again upon rebooting the VM. Not sure why I'm continuing to try at this point. I guess it has to be the motherboard or power supply, but apparently I will never know for sure without just dumping both. Unraid is getting a bit too expensive for what it's been worth if it's this difficult to pinpoint issues.

Attached are more diagnostics per the requested process, but these haven't been helpful so far so I'm sure there's nothing here either. Apologies for the negative attitude but I just don't know what to try anymore and this has consumed way more time than just leaving my whole setup on windows, warts and all. Next step is I will move everything I need up and running back to the old windows box, rip the unraid server down to nothing and just let it be a NAS for a while to see if it is a docker or plugin. Really running low on patience and time though.

unraid-diagnostics-20180205-1022 (2).zip

February 5, 2018

Community Expert

Nothing in the logs jumps out, but IIRC there are a few users with the same or similar boards (Supermicro X9DRi-LN4+/X9DR3-LN4+) reporting crashes/hangs; it may just be some incompatibility between the current kernel and those boards.

February 6, 2018

I would run unRAID on it without plugins, dockers or VMs and no cards not needed for NAS functionality (so just HBAs). If it works for several weeks that way then you can try adding plugins, dockers and VMs one at a time until you encounter another problem. Only once you have all the plugins dockers and VMs working that you want to use then add additional cards like USB, Network and Video cards. Once everything is working with your additional cards you can then try to pass through any to your VMs. This should let you discover a cause of the problems but will be time consuming.

I have an ASRock EP2C602-4L/D16 that is working well for me with unRAID with lots of Windows VMs. The server has no video out on it. I use IPMI to access the local unRAID console screen and RDP to access the VMs. When I set it up originally I ran it as an unRAID NAS only for several weeks to be sure it was stable before I added VMs or even plugins to it. I had another box that had problems running unRAID 6.4 with a Windows 7 VM. It appeared to work much better (weeks between issues instead of hours or days) when I disabled USB3, but it still wasn't stable enough that way to try to pass through the iGPU to the VM like I wanted. I'm sure if I had just used unRAID as a NAS on it with maybe only Dockers and no VMs it would have worked. I ended up abandoning the attempt and re-installing Windows 7 bare metal on that box like it was before. It's been working for several months now, and it worked for years with Windows 7 bare metal before the attempt to install unRAID with a Windows VM.

February 7, 2018

Hi there! Seems like you're having a heck of a time figuring out what's causing these problems, but I'm very curious if you've tried the approach suggested by Bob and johnnie.black:

On 2/5/2018 at 7:21 PM, BobPhoenix said:

I would run unRAID on it without plugins, dockers or VMs and no cards not needed for NAS functionality (so just HBAs). If it works for several weeks that way then you can try adding plugins, dockers and VMs one at a time until you encounter another problem. Only once you have all the plugins dockers and VMs working that you want to use then add additional cards like USB, Network and Video cards. Once everything is working with your additional cards you can then try to pass through any to your VMs. This should let you discover a cause of the problems but will be time consuming.

On 1/26/2018 at 7:51 AM, johnnie.black said:

Another thing to do would be to shutdown all VMs/docker and turn them on one by one and let it run some time to try and find if it's one of them causing the issues.

The basic suggestion here is to start running the system with no plugins, containers, or VMs installed/running. Just shut down those services from the Settings > VM Manager and Settings > Docker pages (set "enable" to no). Let the system run for a few days, maybe trigger a parity check or two to let it run through that process. See if the system remains stable using it like that. If so, next try adding your Docker containers / plugins. Let that run a few days with your containers / plugins. Next I would try adding a VM, but without passing through any PCI or USB devices. Just create a basic Windows or Linux VM and let it run for a few days. Lastly you can start adding PCI pass through and USB assignment if you desire, and see if that works. This is the best approach to take when trying to diagnose a faulty system. More than likely you have some hardware issues at hand here, but until we see what layer of the software is triggering those issues, we really don't know what steps to take.

The fact that you have Machine Check Errors in the log is not a good sign either, and very much indicative of hardware issues. Could be that you have insufficient cooling / airflow. Could be a faulty component. I previously used to get a TON of MCE errors on one of my systems that was directly the result of a CPU heatsink becoming slightly loosened. As soon as I tightened it back down, the errors went away. MCE is a general term that could mean LOTS of different things.

While I definitely appreciate the frustration because no one wants to deal with these kinds of problems, I do have a comment on this:

On 2/5/2018 at 9:26 AM, btrcp2000 said:

Unraid is getting a bit too expensive for what it's been worth if it's this difficult to pinpoint issues.

Whether you use unRAID, Windows, or another Linux distribution, if you have hardware specific issues like these, pinpointing the root cause will require the same approach as suggested by Bob and Johnnie. Operating systems can do a pretty good job highlighting when a component just flat out isn't working, but when something is buggy due to improper cabling, cooling, seating of the device in its slot, etc., software can only do so much.

Let us know how testing goes with the approach suggested by Bob and Johnnie and we'll see what we can do from there.

February 8, 2018

Author

Thanks Jon and others. Yes, I am frustrated, but I've seen it work enough that it's still worth the effort, so I appreciate your patience. I will be pursuing the suggestions above about shutting down VMs and docker and letting it run, but those are items that are currently performing functions that don't have a backup (i.e., no UNRAID, no TV in the entire house), so I need to solve for that first.

In the meantime, here's an update on the easier things I am trying, for no other reason than to rule them out. The case came with two hot-swappable PSUs; I'm only running one at a time, but crashes have occurred with both. MrRackables has gone radio silent on their RMA offer for now on the ECC RAM, so I still have the 4 sticks, one of which has generated a handful of errors since November. Pulled that one and its mate on the other CPU to see what happens. So far so good, but only one day in.

The crashes do seem to be related to the VMs rebooting again. That went away for a while with 6.4, but that may have been a coincidence. I am really starting to think that my passed-thru Hauppauge Colossus cards (Gen 1) are playing a role as they are well known problem children in SageTV world. Took forever to get them stable on a standalone windows box years ago, but I eventually did. They never generate visible errors, but I know them well enough that they are coming out, probably this weekend.

I just rebooted the VM, which almost always hard freezes everything, but this time it's only the VM tab that isn't working. That tab loads partially, doesn't display the problem VM or allow me to do anything, but otherwise UNRAID seems to be functioning. Grabbed some diags while I could, maybe there's something here that smarter people can discern?

unraid-diagnostics-20180208-1155.zip

February 8, 2018

1 hour ago, btrcp2000 said:

I am really starting to think that my passed-thru Hauppauge Colossus cards (Gen 1) are playing a role as they are well known problem children in SageTV world.

This wouldn't surprise me either and likely explains your issue with the VMs tab not loading. When you click to load the VMs tab, libvirt tries to "check" all the hardware that is being assigned to that VM. This includes the virtual disk(s) and any PCI/USB devices that you are trying to pass through or assign to the VM. If a device query fails to return a response, the tab may appear to hang or be blank, because VMs that have invalid hardware assigned to them won't load.

What is likely happening is that when the VM is first started, pass through works as intended and everyone is happy. At some point you want to reboot the VM for one reason or another (maybe a Windows update, maybe software crash, who knows) and when you reboot, the card may not be handling a function level reset properly. Some PCI devices were not designed with virtualization in mind, and therefore don't know how to deal with a scenario where an OS is rebooting without the BIOS itself on the host reinitializing. Unfortunately there is no way to predict what devices will work well or not with VMs until we try, and I've seen many users over the past few years mention lots of difficulty in getting these types of cards to work properly in VMs.
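One quick data point is whether the card even advertises Function Level Reset. In verbose `lspci` output, PCIe devices show an `FLReset+` or `FLReset-` flag among their device capabilities. The sketch below just reads that flag; `flr_support` is a made-up helper name, the address in the example is a placeholder, and a card that advertises FLR can still reset badly in practice.

```shell
#!/bin/bash
# flr_support (hypothetical helper): read `lspci -vv -s <address>` output on
# stdin and report whether the device advertises Function Level Reset.
flr_support() {
    local caps
    caps=$(cat)
    case "$caps" in
        *FLReset+*) echo "FLR advertised" ;;
        *FLReset-*) echo "FLR not advertised" ;;
        *)          echo "no FLReset flag found (not PCIe, or lspci not run as root)" ;;
    esac
}

# Example, run as root on the server (03:00.0 is a placeholder address):
#   lspci -vv -s 03:00.0 | flr_support
```

A device without FLR has to be reset some other way when the VM restarts, which is where this kind of hang on reboot tends to come from.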

Depending on your needs, you could simply opt to go with a HD Homerun type solution which now integrates nicely with Plex itself. These units can attach to your network as standalone boxes and can be managed via Plex (or other solutions).

February 8, 2018

Author

That makes a lot of sense, thank you for explaining it so clearly. I have an HDHomeRun and it is flawless, but I also need premium channels (cable card will not work in my area due to flags).

I will try newer versions of the Hauppauge cards. There are both PCIe and USB versions; would USB be better in terms of the rebooting? None have Linux drivers, so I have to have a Windows VM.

February 15, 2018

Author

I saw this in the VM log after attempting to reboot it. Does it mean anything? This was a hang again where the VM tab wouldn't fully load; I was able to recover by turning VMs off and on in VM settings and didn't have to hard restart the server. Diags also attached.

unraid-diagnostics-20180214-2059.zip
