Unraid 6.5.0 randomly/intermittently hangs every 24-96 hours



Hi everyone, I'm back again with trouble since my last topic and my remaining patience is just about gone.

 

tl;dr - Ever since updating to 6.4 back in January 2018, my server has had trouble ranging from VM stutter (now seemingly fixed after changing my append line) to hanging completely and not accepting any input every 24-72 hours. My record uptime in this state is ~four days.

 

My VM stutter was "fixed" by adding pti=off to my append line for startup. I'm aware of the risk involved, but I'd almost been pulling my hair out in frustration and this relieved the stutter. The hanging issue still remains. As before, I still need vfio_iommu_type1.allow_unsafe_interrupts=1 added for my passthrough devices to be visible to VMs. I've always left isolcpus=4,5,10,11 as is so UnRAID doesn't interfere with the VMs.
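For anyone wanting to confirm that these parameters actually reached the running kernel, here is a minimal Python sketch that compares /proc/cmdline against the three values quoted above. The helper names (`parse_cmdline`, `missing_params`) are made up for illustration, and the parser is deliberately simple: it does not handle quoted values containing spaces.

```python
from pathlib import Path

# Parameters taken from the post above; adjust to match your own append line.
EXPECTED = {
    "pti": "off",
    "vfio_iommu_type1.allow_unsafe_interrupts": "1",
    "isolcpus": "4,5,10,11",
}


def parse_cmdline(text):
    """Split a kernel command line into a {name: value} dict.

    Flags without '=' map to None. Quoted values with spaces are not handled.
    """
    params = {}
    for token in text.split():
        name, sep, value = token.partition("=")
        params[name] = value if sep else None
    return params


def missing_params(text, expected=EXPECTED):
    """Return the expected parameters that are absent or set to another value."""
    actual = parse_cmdline(text)
    return {k: v for k, v in expected.items() if actual.get(k) != v}


if __name__ == "__main__":
    cmdline_path = Path("/proc/cmdline")
    if cmdline_path.exists():  # only present on Linux
        problems = missing_params(cmdline_path.read_text())
        print(problems or "all expected parameters are active")
```

An empty result only shows that the parameters were passed at boot. It says nothing about whether they are related to the hang.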

 

Most of the hardware has not changed since the start of the year. The USB drive was changed during testing in my old topic; I was not able to continue using my old USB drive as it was blacklisted.

 

So, onto the "new" problem:

 

Specs:

  • Sandisk Facet 8GB USB drive for UnRAID 6.5 stable (Also tried 6.5.1-rc2 with the same outcome, but reverted back to 6.5 to reduce the complexity of the issue)
  • SuperMicro X8DTH-6F (latest BIOS update 2.1b installed) + 1x Xeon X5670
  • 32GB Hynix ECC RAM HMT31GR7BFR4C-H9 - on the Supermicro list of qualified memory
  • Onboard SAS2008 and 1x Dell H310 - both flashed to IT mode with 3x breakout SATA cables
  • 4x 4TB WD Red Drives - One of them as a single parity
  • 6x 3TB WD Red Drives
  • 1x 250GB Samsung Evo SSD for Cache
  • 1x 120GB Corsair Force SSD for VMs - 2 VMs, one called "HTPC" and the other "Haruna", both running Windows 10 x64 LTSB and up to date
  • GT710 passed through to HTPC VM (only this VM restarts at 6am every day to ensure updates are applied when it's unused to avoid interruptions during video playback later in the day)
  • 2x Hauppauge WinTV-HVR-2200 TV Tuners passed through to the "Haruna" VM

 

Dockers:

  • Duckdns (1GB of RAM assigned)
  • Jackett (1GB)
  • LetsEncrypt (autostart is off, only ran once every few months)
  • Plex (4GB)
  • qBittorrent (4GB)
  • Radarr (4GB)
  • Sonarr (4GB)

 

Changing the amount of RAM assigned to the dockers did not change the outcome of the server hanging. At first I believed the server was simply running out of memory, but that wasn't the case.

RAM usage is ~40% and average CPU usage is ~10% most of the time, obviously depending on load from Plex etc.
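One way to double-check the "not out of memory" conclusion is to search a saved syslog for the messages the Linux kernel usually writes when the OOM killer runs. The sketch below is only an illustration: the function name is hypothetical, and the two search strings are the commonly seen ones, so a clean result is a hint, not proof.

```python
import re

# Substrings the Linux kernel typically logs when the OOM killer runs.
# Assumption: these two cover the usual wording; other variants may exist.
OOM_PATTERN = re.compile(r"invoked oom-killer|Out of memory", re.IGNORECASE)


def find_oom_lines(log_text):
    """Return every log line that looks like an OOM-killer message."""
    return [line for line in log_text.splitlines() if OOM_PATTERN.search(line)]


# Example use against a saved log file (path is whatever you exported):
#   with open("FCPsyslog_tail.txt", errors="replace") as fh:
#       for line in find_oom_lines(fh.read()):
#           print(line)
```

If this returns nothing for the logs captured just before a hang, memory exhaustion becomes a less likely explanation, which matches the ~40% RAM usage reported above.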

 

Since my topic back in January, I've done:

  • Memtest, this comes back without any errors after ~24 hours of testing at which point I stop the test as I really need my server
  • Checked and changed multiple values in the BIOS relating to C-States and RAM (also tried failsafe and optimal defaults)
  • Forced the dockers to only have a set amount of memory each in case I was getting OOM errors (ranging from 1GB - 4GB (listed above))
  • Tips and Tweaks: vm.dirty_background_ratio and vm.dirty_ratio set to 2% and 3% respectively. I also tried 1% and 2% respectively. (This was set back in the 6.3.5 days.)
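To see which dirty-page values are actually in effect, the two settings can be read from /proc/sys/vm. This Python sketch does that and also gives a rough idea of what the percentages mean on a 32GB machine. The helper names are hypothetical, and the size estimate is an upper bound under the assumption that the percentage applies to all 32GB; the kernel really applies it to available memory, which is smaller.

```python
from pathlib import Path


def read_vm_sysctl(name, root="/proc/sys/vm"):
    """Return the integer value of a vm.* sysctl, or None if it cannot be read."""
    try:
        return int((Path(root) / name).read_text().strip())
    except (OSError, ValueError):
        return None


def approx_threshold_gib(ratio_percent, memory_gib):
    """Rough upper bound, in GiB, of dirty data allowed at a given ratio.

    Assumption: treats all of memory_gib as available memory.
    """
    return memory_gib * ratio_percent / 100


if __name__ == "__main__":
    for name in ("dirty_background_ratio", "dirty_ratio"):
        value = read_vm_sysctl(name)
        if value is None:
            print(f"vm.{name}: not readable on this system")
        else:
            print(f"vm.{name} = {value}% (at most ~{approx_threshold_gib(value, 32):.2f} GiB of 32)")
```

With the 2% and 3% values from the list above, that works out to roughly 0.64 GiB and 0.96 GiB at most.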

 

I can't effectively attempt to use the Safe Mode on UnRAID as I need the Unassigned Devices plugin to run my VMs from the Corsair SSD - but I'll attempt to run the VMs from the cache drive as troubleshooting if it's really needed.

 

This is the console view after the server has seized up (screenshot from the IPMI interface). I cannot input anything with the virtual keyboard, network shares are not accessible, and web management is not accessible and times out. The VMs appear to be running, but they also hang after a short time. My only "fix" for all of this is to hard reset the server via the IPMI management and let it start up again for another few days of use.

 

I was hoping it would hang at a consistent time so I could at least narrow it down, but that wasn't the case. The time at which the server hangs varies greatly, and it doesn't always seem to coincide with an event such as Plex usage, the Mover kicking in, or an update for dockers.

 

I've included the tail log and three of the last diagnostics before my latest hang (please let me know if more info is required). It seems the latest hang was around 6am this morning (which coincides with the HTPC VM restarting) - although, the two previous days, the VM restarted without issue. As noted above, the hang is not consistent.

 

I was attempting to fix the problem myself rather than bother people on the forums, but I'm now at a dead end after ~three months of trying. Please help me out before I really start regretting going with UnRAID :( During the 6.3.5 days, everything was rock solid with none of the above changed and the only time the server came down was due to an extended power outage that my UPS couldn't last for.

 

(You'll notice that the logs show my cache as being full around the 7th of April, this was the case during a day of heavy downloading and transfers to the cache drive. This does not have any bearing on the frequency of the hanging of UnRAID as it happens randomly during any state of the server)

FCPsyslog_tail.txt

yamato-diagnostics-20180411-0441.zip

yamato-diagnostics-20180411-0511.zip

yamato-diagnostics-20180411-0541.zip
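As a small aid for lining up the attachments with the hang, the timestamps can be read straight from the diagnostics file names. The sketch below assumes the names end in -diagnostics-YYYYMMDD-HHMM.zip in server local time (which is how the three files above appear to be named); the function names are hypothetical.

```python
import re
from datetime import datetime

# Assumed naming scheme: <server>-diagnostics-YYYYMMDD-HHMM.zip
NAME_RE = re.compile(r"-diagnostics-(\d{8})-(\d{4})\.zip$")


def diagnostics_time(filename):
    """Return the timestamp encoded in a diagnostics file name, or None."""
    match = NAME_RE.search(filename)
    if match is None:
        return None
    return datetime.strptime(match.group(1) + match.group(2), "%Y%m%d%H%M")


def last_known_alive(filenames):
    """Latest timestamp among the given diagnostics files, or None."""
    times = [t for t in map(diagnostics_time, filenames) if t is not None]
    return max(times) if times else None


if __name__ == "__main__":
    files = [
        "yamato-diagnostics-20180411-0441.zip",
        "yamato-diagnostics-20180411-0511.zip",
        "yamato-diagnostics-20180411-0541.zip",
    ]
    print("server was still responding at", last_known_alive(files))
```

The three files are 30 minutes apart and the last one is from 05:41, so the server was still alive then, which fits a hang at around 6am.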

Edited by Goldfire
Typo
7 hours ago, bonienl said:

I recommend you upgrade to unRAID 6.5.1-rc5 and retest. A couple of known hanging issues have been fixed in this release.

 

I just realised that I typed 6.5.1-rc5 in my original post, I meant rc2 (I used the numpad to type that and must've missed the 2 ¬¬ - I've edited my original post)

 

Alright, I'll give that a go and report back. Thank you for the reply.


Running 6.5.1-rc5 didn't work. It didn't even last 48 hours. I can't provide diagnostics for obvious reasons.

 

The only thing I can think of is the VM restarting each day, but if that was the case, I'd imagine it would be hanging the server every day the VM restarts.

Besides, I've had the server hang at various times throughout the day, even when the VMs aren't restarting.

 

I've also tried both i440fx and Q35 machine types, both with the same outcome.

 

If it is the VM restarting that's causing the issue... What good will it be if I can't simply restart VMs without the server hanging?

 

So, now I'm here with another ~2-3 days of runtime before it inevitably happens again :|

 

What are my other options?

Any other input from others?

 

EDIT: After a bit more searching, it seems that others who had this issue switched to SeaBIOS instead of OVMF and had better results. I'll attempt that when the VM isn't in use and follow up.

  • 2 weeks later...

Here's my follow up:

  • Changing the BIOS of my VM with the passthrough GPU from OVMF to SeaBIOS made no change at all
  • Using Q35 instead of i440fx made it feel a bit "snappier" - but it didn't improve the situation
  • Updating UnRAID to 6.5.1-rc6 didn't help
  • Removing the auto restart of the VM (a simple batch file on Task Scheduler in Windows 10) has stopped the occasional UnRAID hangs
  • Restarting the VM manually can also cause UnRAID to hang. Over ~10 restart attempts, UnRAID would hang after anywhere between 3 and 5 restarts

With the above info, it seems to be completely random which restart makes UnRAID hang. Having the VM automatically restart via the batch file at 6am was the main cause of me losing UnRAID overnight.
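As a rough sanity check on those numbers: if each VM restart is treated as an independent event with a fixed chance of hanging the host (an assumption, not something the logs prove), then "a hang after 3-5 restarts" corresponds to a per-restart chance of roughly 1/5 to 1/3. The short Python sketch below works through that arithmetic; the function names are made up for illustration.

```python
def expected_restarts_until_hang(p_hang):
    """Mean number of restarts up to and including the one that hangs.

    Assumption: every restart hangs independently with probability p_hang.
    """
    return 1.0 / p_hang


def survival_probability(p_hang, restarts):
    """Chance that `restarts` consecutive restarts all complete without a hang."""
    return (1.0 - p_hang) ** restarts


if __name__ == "__main__":
    for p in (1 / 5, 1 / 4, 1 / 3):
        print(
            f"p={p:.2f}: ~{expected_restarts_until_hang(p):.1f} restarts per hang, "
            f"{survival_probability(p, 4):.0%} chance of surviving 4 daily restarts"
        )
```

With one scheduled restart per day, that model gives a hang every 3 to 5 days on average, with plenty of shorter and longer runs, which is in the same range as the uptimes described in this thread. It does not account for hangs that happened at other times of day.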

 

Now I'm reluctant to even restart the VM. I've disabled Windows Update altogether so it can't automatically restart, and I'll manually update once a month for the cumulative updates. I'll just have to use it as is. At least I won't have to worry too much about waking up in the morning to find UnRAID dead in the water.

 

I'm disappointed that I can't reliably restart a VM (even manually), but I'm also disappointed in the lack of support from the forum when I'm basically left high and dry :(
