
Unraid stalls, crashes and corrupts the USB (Ryzen) (Threadripper) (fixed)


mdrodge


Unraid stalls and becomes unresponsive for 15 minutes or more at a time.

All VMs unresponsive.

CPU usage drops to 1-2%.

 

It often seems connected with file transfers.

 

At times it has crashed the Aquantia 10Gb NIC, and this seems to make it fail to boot properly. Once, though, it turned out that the network.cfg file on the USB had corrupted (cause or effect?). Other times just pressing reset a few times is enough.
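A corrupted file on the flash drive is much less scary with a backup. A minimal sketch of a dated flash-config backup (the paths and the function name are examples, not anything Unraid ships; adjust to your shares):

```shell
#!/bin/bash
# Sketch: copy the Unraid flash config to a dated folder on the array,
# so a corrupted file (like network.cfg) can be restored.
backup_flash() {
    local src="${1:-/boot/config}"             # flash config dir (default assumed)
    local dest_root="${2:-/mnt/user/backups}"  # backup share (example path)
    local dest="$dest_root/flash-$(date +%Y%m%d)"
    mkdir -p "$dest" && cp -r "$src/." "$dest" && echo "$dest"
}
```

Running it from a User Scripts entry once a day would have made the network.cfg question (cause or effect?) easy to answer by diffing against yesterday's copy.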

 

Update: it's now fixed, see the steps below.

Edited by mdrodge
Link to comment

It's just crashed and now won't boot properly. I put the same USB in a different PC... it boots just fine.

WTF

Swap it back to the server and it's exactly the same.

IPv4: my.ip.address

Servername login: _

But there's no response from the IP, and the VMs don't show up.

This has been happening for a while.


I can't start my VM: the VFIO group couldn't be claimed.

 

The VFIO group it was trying to claim was not one I wanted to pass through, but a multifunction counterpart of one that was.

 

I use PCIe ACS override set to Both, as that was necessary when I set it up.

 

Today I noticed that the USB controller I pass through to one of my VMs seems to be part of the chipset, and it has a multifunction counterpart that is the SATA controller, so I'm going to try a different USB controller for passthrough.

This will involve some rewiring of the internal USB connectors, but it lets the VM start without requiring a full restart and still gives it some USB.
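To spot a multifunction sibling like that USB/SATA pair, you can walk the IOMMU groups in sysfs from the console. A sketch (the function name is mine; it takes an alternate root only so it can be tested off-box; on a live system, piping each address through `lspci -nns` gives the full device names):

```shell
#!/bin/bash
# Sketch: print every IOMMU group and the PCI addresses of its devices.
# Devices that share a group must be passed through together (or split
# with the ACS override / BIOS ACS Enable).
list_iommu_groups() {
    local root="${1:-/sys/kernel/iommu_groups}"
    local g d
    for g in "$root"/*; do
        [ -d "$g" ] || continue
        echo "IOMMU group ${g##*/}:"
        for d in "$g"/devices/*; do
            [ -e "$d" ] || continue
            echo "  ${d##*/}"   # e.g. 0000:0a:00.3; feed to lspci -nns for names
        done
    done
}
```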

 

Edit: this turned out to be a different problem from the main one. I'll leave it here though; it might help someone.


Why would disabling C-states and idle power make one of my GPUs disappear?

VFIO and Devices both see it, and it's in its own group, not passed through or anything (it's the GPU for Plex), but under Settings > Unraid Nvidia it's missing from the info section.

 

(A VFIO-PCI Config beta bug, I guess.)

 

Update...

Rebooted: same.

Removed the GPU, rebooted, put it back, rebooted: same.

Moved the GPU to a different slot (from x16 to x8)... it's back.

Moved everything back to the original layout: still broken.

So this configuration leaves me with only one PCIe x16 slot, unless I swap that card for the one in the other x16 slot, which means moving every card around, changing the BIOS to display on a different slot, removing the VFIO config file from Unraid, setting it all up again, and then testing the new configuration.

 

What a bitch.

Update 2....

That is possible. I have two x16 slots again and everything is passed through; I just need to create a new template for the affected VM, then wait and see if it's fixed.

Pray for me.....


Mostly good, but now the VM (with an Nvidia 1050 Ti, USB and an NVMe) crashes.

The rest of the system seems unaffected, and I can stop the VM and start it again, but whatever I was doing at the time is lost.

I think there were several issues originally.

The system seems better, but I still have one bug left to fix.

I'll look for a log now.

 

Update: nothing in the VM's log.

And I can't find any reference to this issue elsewhere. It's crashed a few times today, and this is new behaviour.

 

(Perhaps I'll check whether I still need both options for PCIe ACS override.)

 

I was hoping to GPU-render on Windows, so I guess that'll have to wait.

The other VM (VNC, no passthrough at all) is not affected at all, so it seems like a passthrough issue, but I don't know.


AMD CBS - ACS Enable

I never noticed an ACS Enable setting in the BIOS before; maybe that'll help, so I'm enabling it.

Though now I'm wondering what PCIe ARI Support is? (Update: ARI made the VM crash on boot, but ACS Enable seems good.)


 

Update: this BIOS setting gives me the same level of group separation as the "Both" option in the VM settings, so I now have that set to Downstream and it works fine.
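For reference, the ACS override dropdown in the VM settings just adds a kernel parameter to the append line in syslinux.cfg on the flash drive; roughly like this (stock boot labels assumed, check your own file):

```
label Unraid OS
  kernel /bzimage
  append pcie_acs_override=downstream initrd=/bzroot
  # "Both" would instead be: pcie_acs_override=downstream,multifunction
```

So with ACS Enable doing the splitting in the BIOS, the multifunction half of the override becomes unnecessary.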


It still crashes. Two days in, the VM locked up. I rebooted the VM, and an hour later the whole system stalled and had to be reset.

I was importing a 50GB file into Windows when it crashed.

Turbo write is on, in case you're wondering.

It booted first time, though.

I didn't think to pull a log file before I reset.

I'll try to force a crash and pull a log tonight. (I'm now close to pinning down a cause.)

 


It didn't quite crash that time, but it got choppy for a bit.

I was moving files with unBALANCE (a fix from the last round of crashes) and it started a parity check, so I stopped both and it's fine now.

 

I don't see any reason why a busy array should make the VM and the Unraid GUI unresponsive, but that seems to be what just happened.

14 hours ago, mdrodge said:

I was moving files with unBALANCE (a fix from the last round of crashes) and it started a parity check

Nov  8 01:16:44 STUDIO-X emhttpd: unclean shutdown detected
...
Nov  8 01:17:07 STUDIO-X kernel: mdcmd (40): check nocorrect
Nov  8 01:17:07 STUDIO-X kernel: md: recovery thread: check P ...

As you can see, it started a parity check immediately after booting, because of the unclean shutdown; I guess you didn't notice until after you had started using unBALANCE.

 

You really need to let the parity check complete. Since it was an unclean shutdown, it's quite possible you have a few sync errors. Exactly zero sync errors is the only acceptable result, and until you get there you still have work to do.

 

I guess you can wait a while to see if it crashes again before bothering to fix parity though.


Update on new steps I've taken.

(The C-states change etc. seems to have fixed the crashes, and the stalls seem array-related, so:)

 

1*

I have an import folder on the cache; anything brought in from Windows is dropped into it by Resilio, and every minute I run a user script to fix permissions on that folder (newperms /mnt/user/folder, with a custom schedule of */1 * * * *).

This lets me avoid checking the whole array every hour like I was doing before, which was dumb and was indirectly causing all sorts of issues.

I now check the whole array at 1am with the same sort of script as above
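For anyone wanting to copy step 1, the whole User Scripts entry is just this (newperms ships with Unraid; /mnt/user/folder stands in for your share):

```
#!/bin/bash
# User Scripts body; custom schedule set to: */1 * * * *  (every minute)
newperms /mnt/user/folder
```

The nightly whole-array pass is the same script pointed at /mnt/user with a schedule of 0 1 * * *.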

 

2*

CA Auto Turbo Write (hopefully this gives a better balance than having turbo always on).
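As I understand it, that plugin just flips the md driver's write method depending on how many disks are spun up, the same thing the Disk Settings toggle does; done by hand it would look something like this (values as I understand them, so treat as an assumption and check against your release):

```
mdcmd set md_write_method 1   # reconstruct ("turbo") write
mdcmd set md_write_method 0   # read/modify/write
```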

 

3* 

Changed my mover schedule so it's nightly instead of hourly.

(I couldn't do this before because a full cache would cause Docker and the VMs to crash. A month ago I did step 4, and now I can do step 3 as a result.)

 

4*

I set up a second pool with just one SSD and moved appdata, system and isos to it, so that even if the cache fills up the rest of the system isn't affected.

 

All this puts much less work on the array, which should help with the stalling issue. And with less stress on the array it can spin down, which lets step 2 have an effect: it'll switch to read/modify/write unless I'm waking the whole array. (I keep at least one disk spun down, so most of the array would have to be awake, and I believe I'd need to be actively using it for that to happen normally.)

 

I'm hoping this is the end of this/these issues.

(It'll probably still have the odd little pauses from time to time but i might be able to live with that)

 

My future option is to build a second pool and add drives to it as a ZFS array.

This would sidestep all the array issues I have and give me back the speed I miss, but I'd need to replace two 10TB Red drives with two 12TB Gold drives so they all match, and that's expensive even at half what they cost in 2017.
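If I do go that way, creating the pool would look something like this (pool name and device names are placeholders; striped mirrors assumed, since that's where the speed comes from):

```
zpool create -o ashift=12 tank mirror /dev/sdX /dev/sdY mirror /dev/sdZ /dev/sdW
zfs set compression=lz4 tank
```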

