Rough weekend with Unraid - VM crashes with graphics set to Vega caused multiple disk failures + data loss (twice / repeatable)


danull

Recommended Posts

Running Unraid 6.7.2 with 7 drives in the array, dual parity. Gigabyte AX370 Gaming motherboard with the latest F42A BIOS as of July 2019, plus an AMD 2400G with Vega graphics. It had been stable for days. I created a Linux VM and it was working fine with VNC. This VM uses an mSATA drive in a PCIe conversion card as an "unassigned" device for all of its Linux Mint installation data, since it came from a NUC I chose to decommission in favor of an Unraid VM. My plan was to change the graphics card in Unraid's VM settings from VNC to the Vega graphics on the 2400G and see if that would pass the Linux VM console through to my KVM/normal monitor. I came to regret that decision.  EDIT: A friend asked about IOMMU - I didn't mess with any settings related to that, to my knowledge.
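For anyone who wants to look at the IOMMU side before trying this, here is a minimal sketch (run from the Unraid console) that lists which PCI devices share an IOMMU group. I'm assuming, not asserting, that the danger comes from the Vega iGPU sharing a group with the SATA controller on these boards - I never confirmed that on mine.

```bash
#!/bin/bash
# List every IOMMU group and the PCI devices in it.
# If the iGPU lands in the same group as the SATA controller,
# passing the GPU through can drag the disk controller along with it.
shopt -s nullglob
for group in /sys/kernel/iommu_groups/*; do
    echo "IOMMU group ${group##*/}:"
    for dev in "$group"/devices/*; do
        printf '    '
        lspci -nns "${dev##*/}"
    done
done
```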

 

Immediately upon restarting the VM with the graphics card set to the Vega graphics instead of VNC, boom... parity disk 2 fails and disk 1 fails. I *think* the attached logs (borracho-diagnostics-20191012-0443.zip) show this point in time. Unraid also seems to hang, so I reboot. I suffered a few more reboots because the VM was set to autostart and would hang each time I tried to start the VM service. Eventually the VM service was left down and I rebuilt from the remaining parity, which took 15 hours (stopped the array, moved the parity and data disk to unassigned, started the array, stopped the array, re-assigned them). After the rebuild I had to stop the array, start in maintenance mode, and run an xfs_repair because disk 1 was thrashed and inaccessible. Lots of stuff moved to lost+found, and lots of data loss, because I was unwilling to go through hundreds of unnamed files that had lost their file types to figure out what each one was... I was able to recover it from various backups and other sources. By the way, I think the wiki advice to run xfs_repair -v in the console is not a good thing: when xfs_repair -v ran in the console it just put out thousands and thousands of periods and seemed to run indefinitely. Starting the array in maintenance mode, selecting the data disk in the GUI, changing the default -n to -v, and then clicking refresh until done seemed to work MUCH better.
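For anyone else who ends up in this spot, this is roughly the console equivalent of what the GUI check/repair does - a sketch only, assuming the damaged drive is disk 1, which I believe maps to /dev/md1 once the array is started in Maintenance mode (using the md device rather than the raw /dev/sdX is what keeps parity in sync; double-check the device number for your own disk).

```bash
# Start the array in Maintenance mode from the GUI first, then from the console:
xfs_repair -n /dev/md1    # dry run: report problems, change nothing
xfs_repair -v /dev/md1    # actual repair, verbose output
# Only if it refuses to run because of a dirty log and the disk can't be
# mounted once to replay the log:
# xfs_repair -L /dev/md1  # zeroes the log; last resort, can lose recent writes
```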

 

So then things are recovered from the first crash. I thought I had successfully deleted the VM, and I started the VM service again in Unraid hoping to get back to a VM with VNC, which had worked fine. Boom - it autostarts, and this time parity disk 1 fails (a different parity disk this time) along with disk 1 again. You can imagine my reaction to that. Over the next 15 hours I rebuilt again, and did a better job of deleting the VM and preventing autostart the next time around. At least no xfs_repair was needed this second time. After this second recreation of the failure and rebuild, I was able to restart the VM with VNC as the graphics card setting, and everything has been fine and stable since. So I'm scared away from ever defining a VM with the Vega graphics output again.
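In case it helps anyone else: the "better job of deleting the VM and preventing autostart" was done through the GUI, but roughly the same thing can be done from the console with virsh. This is a sketch under assumptions - Unraid drives autostart from its own GUI flag, so the virsh autostart line may be redundant, and "LinuxMint" is just a placeholder for whatever the VM is actually named.

```bash
virsh list --all                      # confirm the VM's exact name and state
virsh destroy LinuxMint               # force-stop it if it's hung
virsh autostart --disable LinuxMint   # clear the libvirt autostart flag
virsh undefine LinuxMint              # remove the VM definition (vdisk files stay on disk)
```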

 

There is no way I want to intentionally recreate this a third time (as you can imagine!). I am posting this just in case the logs or the scenario I've described help a developer figure out what happened and fix it in code, to prevent this from happening to somebody else. Based on how I was able to recreate this twice with the exact same trigger, I do not believe my disks are bad or that I have flaky SATA cables, which seems to be blamed for most things around here. I don't think Unraid should be marking multiple disks failed and ending up with a corrupt XFS file system even if it is an AMD bug, and it seems like Unraid should do something to prevent this scenario if there are known issues with modern AMD boards/BIOSes.

 

Thanks for reading my harrowing tale,

-Dan 

 

PS - apparently I am not alone here; it looks like at least one other person has seen this disk breakage:

https://linustechtips.com/main/topic/1092004-unraid-amd-igpu-passthrough/

 

Edited by danull
extra info
Link to comment
2 hours ago, danull said:

I don't think Unraid should be marking multiple disks failed

unRaid has no choice there, because a write to the disk did indeed fail.
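If you want to confirm that for yourself, the write errors should be right there in the syslog inside your diagnostics zip. A rough sketch - the exact message wording and the path inside the zip are from memory, so adjust the patterns if they match nothing:

```bash
# Search the live syslog for the moment the disks were failed.
grep -iE 'write error|read error|disabled' /var/log/syslog
# Or search the copy bundled inside the diagnostics zip:
# unzip -p borracho-diagnostics-20191012-0443.zip '*syslog*' | grep -iE 'write error|disabled'
```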

2 hours ago, danull said:

having a corrupt xfs file system even if it is an AMD bug

The key is whether the file system was corrupted after you undid your attempt at passthrough.

 

2 hours ago, danull said:

By the way, I think the wiki advice to run xfs_repair -v in the console is not a good thing: when xfs_repair -v ran in the console it just put out thousands and thousands of periods and seemed to run indefinitely. Starting the array in maintenance mode, selecting the data disk in the GUI, changing the default -n to -v, and then clicking refresh until done seemed to work MUCH better.

Starting in maintenance mode is how you're supposed to run the commands.  No idea though if this contributed to the problems.

 

From the wiki:

Quote

Drives formatted with XFS

Note: for more info on the xfs_repair tool and its options, see xfs_repair.

The xfs_repair instructions here are designed to check and fix the integrity of the XFS file system of a data drive, while maintaining its parity info.

Preparing to run xfs_repair

Start the array in Maintenance mode, by clicking the Maintenance mode check box before clicking the Start button. This starts the unRAID driver but does not mount any of the drives.

 

Link to comment
1 minute ago, bonienl said:

Please explain here

I use Unraid primarily as a NAS; the VM and Docker stuff is secondary. If Unraid cannot maintain the integrity of my data and is marking perfectly good drives bad for no good reason, I don't believe that is reasonable - even if it is really my CPU/GPU's fault due to widely acknowledged bugs.

 

Unraid knows what kind of graphics I have, i.e. my AMD 2400G. It knows this because it shows Vega in the VM pull-down menu, etc. I suspect the fellow on Reddit also has an AMD iGPU, although he/she hasn't actually stated that yet.

 

If we have multiple events in the field where people's arrays go into failed states when they try GPU passthrough, I think Unraid should probably put up a big warning that attempting GPU passthrough - using anything other than VNC - may cause instability or loss of drives, especially with AMD hardware. I would never have done anything except VNC if I had known that every time I tried GPU passthrough I'd lose parity and a drive, and I suspect the person on Reddit is of the same mind. Maybe we'd have been OK if we had seen some kind of warning in Unraid on the VM settings page. Instead we just innocently change VNC to Vega to see if we can get it to work, and then whammo, we are in a nightmare where we have to recover drives... surely Unraid developers don't see that as reasonable? I'm willing to bet that if we searched the forums we'd find other people who have lost perfectly good working hard drives in Unraid simply because they tried GPU passthrough.

 

Anyway, that is my attempted explanation of why I think this should be seen as a bug, and how I think Unraid developers should protect us from the VNC -> GPU passthrough = drive failure nightmare. Thanks for asking.

 

-Dan

 

 

 

Link to comment
19 minutes ago, bonienl said:

I have never heard of that. Did you do a search?

You're right, maybe it is just the two of us so far. Or maybe I am very bad at searching. I guess I saw his post and was just sensitive to it and surprised to see somebody else had hit it too! Still, two occurrences of this failure seem like two too many ;-)

Link to comment
3 minutes ago, Squid said:

Not to minimize the effects here for you, but this is right on the VM Settings page (and presumably you did have to do one or both to pass through the iGPU). YMMV

[Screenshot of the VM Manager settings showing the PCIe ACS Override and VFIO allow unsafe interrupts options with their warnings]

 

Squid, FYI, I see neither of these options on my VM settings page. I just change VNC to "AMD Radeon Vega Series / Radeon Vega Mobile Series" in the "Graphics Card" pull-down, and if I start the VM like that: crash + disk failures.

Link to comment

I just received this message from ZenN on this Unraid forum, by the way - he also has a 2400G, exactly like mine.

 

"That was my post on Reddit that you found and linked. I'm using a 2400G as well. Seems we did see the exact same problem.

It won't let me post on the forums here because "mod approval". Is there any chance I could get you to give me more detailed steps as to how you resolved this? I'm terrified to try anything in case I mess it up more and lose my data."

Link to comment
2 minutes ago, danull said:

It won't let me post on the forums here because "mod approval".

Anti-spam measure. The first post from anyone has to be approved by a mod. It doesn't matter whether it's a simple "Hello", or "WTF", or critical of LT, etc. - as long as it's not spam, it'll get approved.

Link to comment
5 minutes ago, Squid said:

Switch to advanced view.  Are they currently disabled and set to No?

Wish I could. I have no idea what this means in the wiki, which I found while trying to switch to advanced view because I couldn't figure it out on my own:

If you wish to toggle other advanced settings for the VM, you can toggle from Basic to Advanced View (switch located on the Template Settings section bar from the Add VM page).

There is no such switch on either the UPDATE VM or ADD VM pages. I can switch from form view to XML view, but that seems to be something different.

 

Anyway, I never messed with any advanced VM settings, since I can't even figure out how to access the page - so they're going to be whatever the default is for the "Linux" VM template. I literally just change VNC to the 2400G Vega card under Graphics Card and start the VM, and kablamo, the party is over... I think this is also what ZenN did with his/her 2400G. Hopefully he/she is approved to weigh in soon ;-)

Link to comment
6 minutes ago, danull said:

Template

Settings - VM Manager

 

I'm just pointing out that unRaid does warn about the consequences of changing those settings under certain hardware combinations.  If yours are currently set to Disabled and No, then you've never changed them from the safe defaults.  Not trying to be argumentative or diminish the effects which you've noticed on your hardware.

Link to comment

Oh, I see - it is on the Settings -> VM Manager page, not on the VMS page.

 

PCIe ACS override is Disabled and VFIO allow unsafe interrupts is No, which I imagine are the defaults.
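For completeness, here is a quick way to double-check both from the console. This is a sketch under assumptions: it assumes the vfio_iommu_type1 module is loaded, and that the ACS override, when enabled, shows up as a pcie_acs_override= flag on the kernel command line.

```bash
# VFIO "allow unsafe interrupts" - N means the safe default is still in effect.
cat /sys/module/vfio_iommu_type1/parameters/allow_unsafe_interrupts
# ACS override - no output here means no override flag was passed at boot.
grep -o 'pcie_acs_override=[^ ]*' /proc/cmdline
```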

 

PS - I understand you're not being argumentative, just asking questions to narrow it down, which I appreciate. Just hoping this ends up helping somebody else in the future. I'm fine with VNC; there's no chance of me ever trying GPU passthrough again after what I went through ;-)

Edited by danull
the PS since Squid and I posted same time
Link to comment
