Unraid - GUI and VM Offline Problems

mytech · July 17

Hi,

On a couple of occasions now, today being one of them, I've had an alert from my monitoring software to say that my VM has gone offline - I've confirmed this as I'm unable to access or ping the VM.

However, upon checking the UNRAID machine itself, I'm unable to access the UNRAID GUI (via any address/IP), nor can I SSH into it. It does, however, still respond to ping.

This has happened twice now from memory in the past month or so, and I'm at a loss. I saw another post about checking if rootfs was full, but that's not the case on this server:

The only disk capacity issue I've noticed is that I received around 150 emails yesterday from the UNRAID server saying that the docker-cache drive (sde) was 71% full. I didn't have time to look into that yesterday or today, but I doubt it's related.
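For anyone wanting to repeat the capacity checks mentioned above, here is a minimal sketch in Python. It assumes Python is available on the machine being checked, and the `/mnt/docker-cache` path is only a guess at where that pool is mounted. The 70% threshold and the function names are illustrative, not anything Unraid itself uses.

```python
import shutil


def percent_used(total_bytes: int, used_bytes: int) -> float:
    """Return used space as a percentage of total, rounded to one decimal."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return round(100.0 * used_bytes / total_bytes, 1)


def check_mounts(mounts, warn_at=70.0):
    """Return (mount, percent_used, over_threshold) for each readable mount."""
    report = []
    for mount in mounts:
        try:
            usage = shutil.disk_usage(mount)
        except OSError:
            continue  # mount point does not exist on this machine
        if usage.total == 0:
            continue  # pseudo-filesystem with no capacity to report
        pct = percent_used(usage.total, usage.used)
        report.append((mount, pct, pct >= warn_at))
    return report


if __name__ == "__main__":
    # "/" is rootfs; "/mnt/docker-cache" is an assumed mount path.
    for mount, pct, warn in check_mounts(["/", "/mnt/docker-cache"]):
        print(f"{mount}: {pct}% used{' (warning)' if warn else ''}")
```

The percentage here is simply used/total, so it may differ slightly from what `df` reports.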

I've attached my diagnostics in the hope that someone more skilled than I am can spot something wrong. The VM was flagged as offline at around 12:55.

Thank you in advance.

Kind Regards,

Callum

bwd-ccn-hv01-diagnostics-20240717-1834.zip

itimpi · July 17

The syslog in the diagnostics is the RAM copy and only shows what happened since the reboot, so it does not include the time you mention. It could be worth enabling the syslog server to get a log that survives a reboot so we can see what happened prior to the reboot. The mirror to flash option is the easiest to set up, but if you are worried about excessive wear on the flash drive you can put your server’s address into the Remote Server field.

mytech · July 17

1 minute ago, itimpi said:

The syslog in the diagnostics is the RAM copy and only shows what happened since the reboot, so it does not include the time you mention. It could be worth enabling the syslog server to get a log that survives a reboot so we can see what happened prior to the reboot. The mirror to flash option is the easiest to set up, but if you are worried about excessive wear on the flash drive you can put your server’s address into the Remote Server field.

Ah, okay. I did have a quick look inside and thought that might be the case.

So for now, looks like I'll need to enable syslog and wait until it happens again?

Thank you.

mytech · August 1

As requested, please see the attached syslog file.

The server had this issue again today (1st August), sometime after 04:00.

I've looked through the log but can't see anything obvious.
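For reference, a rough sketch of how a saved syslog could be scanned for lines worth a second look. The keyword list is an assumption on my part (common kernel and memory error phrases), not an official list of what Unraid logs before a hang, so an empty result doesn't rule anything out.

```python
import re

# Assumed phrases that often accompany hangs or crashes; adjust as needed.
DEFAULT_PATTERNS = (
    "out of memory",
    "oom-killer",
    "call trace",
    "kernel panic",
    "hung_task",
    "segfault",
    "i/o error",
)


def flag_lines(lines, patterns=DEFAULT_PATTERNS):
    """Return (line_number, text) for lines matching any pattern, ignoring case."""
    regex = re.compile("|".join(re.escape(p) for p in patterns), re.IGNORECASE)
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(lines, start=1)
        if regex.search(line)
    ]


def scan_file(path, patterns=DEFAULT_PATTERNS):
    """Scan a log file on disk, replacing any undecodable bytes."""
    with open(path, errors="replace") as handle:
        return flag_lines(handle, patterns)
```

Calling `scan_file("syslog-127.0.0.1.log")` on the attached file would list any matching lines with their line numbers.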

Thank you in advance.

syslog-127.0.0.1.log

JorgeB · August 2

Unfortunately there's nothing relevant logged, so this could be a hardware issue. One thing you can try is to boot the server in safe mode with all Docker containers/VMs disabled and let it run as a basic NAS for a few days. If it still crashes, it's likely a hardware problem; if it doesn't, start turning the other services back on one by one.

mytech · August 4

On 8/2/2024 at 7:39 AM, JorgeB said:

Unfortunately there's nothing relevant logged, so this could be a hardware issue. One thing you can try is to boot the server in safe mode with all Docker containers/VMs disabled and let it run as a basic NAS for a few days. If it still crashes, it's likely a hardware problem; if it doesn't, start turning the other services back on one by one.

It's really frustrating because the issue occurs randomly - it can sometimes take weeks for it to happen again, so I don't think I'd be able to put it in safe mode, as I'd have to wait weeks without VMs or dockers.

The only other thing I can think to try is removing a couple of CPUs from one of the VMs that seemed to have all of them assigned to it. I'm just wondering if at some point the CPU gets pegged at 100% causing things to crash maybe...

CleanShot2024-08-04at10_50_04.png

I've now removed 0/8 and 2/10.

JonathanM · August 4

10 hours ago, mytech said:

all of them assigned to it

The VMs will likely perform better with far fewer cores dedicated. Remove all except 1 pair, 7/15, and see how the VM feels. Add 6/14 and test again. Repeat adding from the high numbers down until the VM stops performing better, then back off one pair.

Always leave 0/8 unassigned, the host needs it.

If you have multiple VMs running concurrently, you may need to test different combos, but always try fewer cores rather than more. The more resources you can let the host use, the better it can serve the VM with I/O and other emulated services.

Same goes for RAM, even more than CPU cores. RAM dedicated to the VM is lost to the host, so you never want to allocate more than is absolutely necessary. RAM available to the host will be used to cache I/O, which really helps the VM feel snappy.
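The core testing order described above can be sketched as a small Python helper. It assumes an 8-core/16-thread CPU where core N is paired with thread N+8 (as in the 0/8 and 7/15 pairs mentioned here); other CPUs may number their sibling threads differently. The function names are made up for illustration.

```python
def thread_pairs(physical_cores: int):
    """Pair each core with its sibling thread, e.g. 0/8 ... 7/15 for 8 cores."""
    return [(core, core + physical_cores) for core in range(physical_cores)]


def pinning_steps(physical_cores: int = 8):
    """Return growing pin sets: highest pair first, never the first pair."""
    # Skip pair 0 (0/8 on this layout) so it stays reserved for the host.
    candidates = list(reversed(thread_pairs(physical_cores)[1:]))
    return [candidates[:count] for count in range(1, len(candidates) + 1)]


if __name__ == "__main__":
    for number, step in enumerate(pinning_steps(), start=1):
        print(f"test {number}:", ", ".join(f"{a}/{b}" for a, b in step))
```

Each step is one configuration to try; stop at the first step where the VM no longer feels faster and go back one.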

mytech · August 4

1 minute ago, JonathanM said:

The VMs will likely perform better with far fewer cores dedicated. Remove all except 1 pair, 7/15, and see how the VM feels. Add 6/14 and test again. Repeat adding from the high numbers down until the VM stops performing better, then back off one pair.

Always leave 0/8 unassigned, the host needs it.

If you have multiple VMs running concurrently, you may need to test different combos, but always try fewer cores rather than more. The more resources you can let the host use, the better it can serve the VM with I/O and other emulated services.

Same goes for RAM, even more than CPU cores. RAM dedicated to the VM is lost to the host, so you never want to allocate more than is absolutely necessary. RAM available to the host will be used to cache I/O, which really helps the VM feel snappy.

Thanks for this - it’s really useful.

The VM in question is a Windows Server 2022 machine running Plex/Jellyfin. It’s got 32GB of RAM (system has 64GB total) and an RTX 3060 passed through to it. I wanted to give it as many resources as possible because we were having problems with Plex transcoding stuttering, so I was hoping that would help.

I have seen Plex use 100% of the CPU when scanning library files, but I don’t mind that taking longer with fewer cores, given that it only does that from time to time.

I’m hopeful that reducing the cores to the VM will solve the stability issues without impacting Plex! 🤞

JonathanM · August 4

Just now, mytech said:

The VM in question is a Windows Server 2022 machine running Plex/Jellyfin.

Why? Both Plex and Jellyfin have Docker containers available, which use the system resources much more efficiently than piping them through a VM.

mytech · August 4

Just now, JonathanM said:

Why? Both Plex and Jellyfin have Docker containers available, which use the system resources much more efficiently than piping them through a VM.

I had it as a Docker container initially, and that’s where we were having problems. Transcoding was pegging the CPU and rarely touching the GPU, making 4K playback impossible. Moving to the VM and passing the GPU through solved that issue (mostly).
