New to Unraid - Instability and disk issues


Hi,

 

I appreciate this is a long ramble without a lot of specific info, but I'm hoping to use this thread to keep track of further issues as they occur and then hopefully I can get to the bottom of what is wrong.

 

I've recently been trying out Unraid after hearing about it several years ago. I started with a Z170 Intel board and a crappy dual-core Celeron as a test, and once I was happy I moved my install over to new hardware.

 

I now have a Ryzen 2700 on an ASRock X370 Taichi board, with 32GB of Crucial memory and an RX 580.

 

I have 4x 4TB drives in the array, and 2x 800GB Intel SSDs as cache.

 

I've been having lots of issues, and every time I fix one I think everything's good, only to be hit with another problem. I haven't been keeping good notes so I apologise for not having better information. It feels like I've been having multiple separate issues, and think I have fixed quite a few.

 

Many of the issues revolve around my Windows 10 VM. I couldn't get the GPU working at all without upgrading to the latest X370 BIOS. This seems to have been a known issue, and I was relieved to find a simple solution. I still had issues installing Windows, but after switching to Q35 and SeaBIOS, I was able to get a working VM. I've still had some issues with stability, usually when first starting the VM. I was running my RAM at 3200 (which it's rated for) but have since reset it to defaults to see if that helps with stability.

 

My cache drives have been sporadically reporting a small number of CRC errors. I thought this might be the cables, but replacing them does not appear to have helped. The other likely culprit is the Icy Dock hot-swap bay they're in. I'm going to remove that and see if that fixes it. It's possible that I first saw the CRC errors when the drives were connected without the Icy Dock, but I can't be sure. A few days ago I woke to find my cache read-only, with some associated warnings. I think there may have been some warnings regarding the array as well. I wasn't able to remount the cache pool, but I was able to mount both disks manually, copy all the data to the array, reformat and copy it back. They've behaved OK since, apart from a small number of CRC errors.
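Since the error counts have been hard to keep track of, here is a rough sketch of how the CRC counter could be logged over time. It assumes smartctl (from smartmontools) is available on the server and that the drives report the counter as SMART attribute 199, which is common for SATA devices but not guaranteed. The function names and the device path in the usage note are made up for illustration.

```python
import subprocess
from typing import Optional

# Assumption: the drive exposes its interface CRC counter as SMART attribute 199
# (often named UDMA_CRC_Error_Count). Some drives use a different name or ID.
CRC_ATTRIBUTE_ID = "199"


def crc_error_count(smartctl_output: str) -> Optional[int]:
    """Return the raw value of attribute 199 from `smartctl -A` text, or None."""
    for line in smartctl_output.splitlines():
        fields = line.split()
        # Attribute rows have at least 10 columns; the raw value is the 10th.
        if len(fields) >= 10 and fields[0] == CRC_ATTRIBUTE_ID:
            try:
                return int(fields[9])
            except ValueError:
                return None  # raw value was not a plain integer
    return None


def read_crc_error_count(device: str) -> Optional[int]:
    """Run `smartctl -A <device>` and extract the CRC counter (needs root)."""
    result = subprocess.run(
        ["smartctl", "-A", device],
        capture_output=True,
        text=True,
        check=False,  # smartctl uses non-zero exit codes for warnings too
    )
    return crc_error_count(result.stdout)
```

Calling something like read_crc_error_count("/dev/sdX") before and after swapping a cable or removing the dock, and writing the numbers down, would show whether the count is still climbing. The raw value is cumulative, so only an increase matters.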

 

 

Sometimes when I start my Windows VM, I get an error on screen:

 

amd-vi completion-wait loop timed out

 

A reboot of the system seems to solve that, but it will often return on subsequent boots of the VM until I reboot the whole system. I added a BIOS ROM for my card, but this does not appear to have helped. When testing this last night, I got the error and noticed high CPU usage on some cores whilst stuck in this state. I killed the VM and tried again. I got the same error again, but after waiting, the screen went blank and was eventually replaced by a Windows startup error complaining 'an unexpected IO error has occurred'. I looked in the Unraid UI and expected to see issues with some disks, but everything looked good. I killed the VM and went to bed, leaving my CrashPlan backup running via Docker, backing up array files. The VM disk is on the cache pool.

 

I woke up today to find two of my array drives reporting exactly 3 errors each. One was the parity drive. Stupidly I've since rebooted which reset the error count. I've had no errors since, and I'm 20% through a parity check which I started. No errors there yet either.

 

I'm completely lost as to what is going on. Last night before the issues, I used my Windows VM for several hours with GPU passthrough without a single glitch or issue. No disk errors, CRC or otherwise. When it's working, it seems rock solid... Until suddenly it isn't. 

 

tower-diagnostics-20200225-0913.zip

Link to comment
3 minutes ago, johnnie.black said:

It might be but it's still an overclock and known to cause stability issues with some Ryzen servers, more info here.

Thanks, I'll have a read of that. I have 2/4 slots filled with 16GB dual rank modules, so it sounds like I should be at 2400. Something else to check when I get home, although I think that's what it was indeed running at. Part of this is that I haven't built a PC for 10-15 years and so I'm still learning the Ryzen platform.

Link to comment
3 hours ago, robwalker said:

I woke up today to find two of my array drives reporting exactly 3 errors each. One was the parity drive. Stupidly I've since rebooted which reset the error count. I've had no errors since, and I'm 20% through a parity check which I started. No errors there yet either.

 

This kind of points to RAM. @johnnie.black is pointing you in the right direction. I had two drives out of an array of 14 that would show read errors when doing a parity sync while running my RAM @ 3600 and 3200. Currently running at 2933 and no more read errors.

Link to comment
4 minutes ago, robwalker said:

Thanks, I'll have a read of that. I have 2/4 slots filled with 16GB dual rank modules, so it sounds like I should be at 2400. Something else to check when I get home, although I think that's what it was indeed running at. Part of this is that I haven't built a PC for 10-15 years and so I'm still learning the Ryzen platform.

 

Sounds like you have most of the other issues hammered out. If you still have trouble @ 2400, run a memtest to rule out a bad DIMM. Also, some Ryzen boards are picky about which RAM slots the memory is in. What board do you have? I suspect the manual points to which slots to use based on the number of DIMMs used.

Link to comment

Thanks everyone for your input. I'm going to check my memory speed, although several of the issues were experienced after I removed the overclock. I'll also check the RAM slots I used.

 

I still sometimes get the 'amd-vi completion-wait loop timed out' issue when starting the VM. My guess is that this could be an issue with the RX 580. I'm not set on keeping this card, so I may think about replacing it. Or maybe installing a cheap card for the host, as the RX 580 is currently my only GPU. My 'test' Intel setup had onboard graphics, which may have helped with the GPU passthrough stability.

 

I have 6 days until my trial ends, so maybe I'll go ahead and grab a licence after all!

Link to comment
36 minutes ago, robwalker said:

amd-vi completion-wait loop timed out

 

This could be BIOS related. There are some posts in other forums that reference this error, and they were cleared up with newer BIOS versions. Are you running the latest BIOS for this mainboard? 

 

Not sure, but LT might extend your trial for another 30 days if you ask, especially if you reference some of the issues you are troubleshooting.

 

Take a look at this link. There are a few suggestions there that seemed to help:

 

https://unix.stackexchange.com/questions/519758/amd-vi-completion-wait-loop-and-other-errors-messages-in-my-attempt-to-install-a

Edited by Chess
url
Link to comment
7 minutes ago, robwalker said:

Yes, it's the very latest BIOS for the ASRock x370 Taichi. Until I updated to that, I couldn't get GPU passthrough working at all. I think I read that same post this morning. None of the fixes sound too appealing, but it's worth investigating further.

 

I figured you had the latest BIOS. Maybe do a reset of the BIOS before doing anything else. Could be some setting is saved in the BIOS that is not set right. 

 

Are you booting unRaid in EFI or legacy? Switch to Legacy if you are booting EFI and see if the issue goes away. More of a shot in the dark, but...

 

Any chance you can get a different video card to test with? Hate to spend money on something else and only find out the issue is with the MB BIOS and a new card still does the same thing.

 

 

Link to comment

My parity check completed without any errors, so that, combined with the cache scrub, gives me some confidence in the disk side of things.

 

I've checked the memory speed and slots, everything looks good. I've set the power control to 'typical current idle'.

 

No issues so far today at all. I'll continue to feed back over the next few days - hopefully with good news!

 

Thanks for all your help.

Link to comment

I used the Windows VM all evening with no issues. Later on I shut it down and used SpaceInvaderOne's Docker container to experiment with a macOS VM. Twice while installing macOS, the whole Unraid system hung. I wasn't passing through any hardware to the VM, just using VNC in a browser.

 

I kicked off a memtest before leaving for work this morning, so we'll see if that reveals any problems.

Link to comment
2 hours ago, robwalker said:

I used the Windows VM all evening with no issues. Later on I shut it down and used SpaceInvaderOne's Docker container to experiment with a macOS VM. Twice while installing macOS, the whole Unraid system hung. I wasn't passing through any hardware to the VM, just using VNC in a browser.

 

I kicked off a memtest before leaving for work this morning, so we'll see if that reveals any problems.

 

Let us know what the memtest comes back with. I don't expect you to get any errors based on your previous post. I've not set up a macOS VM yet, so I don't have any experience there, but it should not have caused Unraid to freeze.

Link to comment

I'd suggest starting a new thread for the macOS VM issue from here, as you'll pull in people who have more experience with that. Also look at setting up the syslog server from inside Unraid (under the Syslog Server settings) to see if it can capture any errors right before it freezes. That will help with troubleshooting.
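Once the syslog is being saved somewhere that survives a hang, a small filter can pull out the lines worth posting. This is only a sketch: the patterns are illustrative guesses based on the errors mentioned in this thread, not a complete list, and the file path in the usage note is hypothetical.

```python
import re
from typing import Iterable, List, Sequence, Tuple

# Illustrative patterns only (assumptions, not an exhaustive list):
# the AMD-Vi message from this thread, generic disk I/O errors, RCU stalls.
SUSPECT_PATTERNS = (
    r"completion-wait loop timed out",
    r"I/O error",
    r"rcu.*stall",
)


def find_suspect_lines(
    lines: Iterable[str],
    patterns: Sequence[str] = SUSPECT_PATTERNS,
) -> List[Tuple[int, str]]:
    """Return (line_number, text) for every line matching any pattern."""
    compiled = [re.compile(p, re.IGNORECASE) for p in patterns]
    hits = []
    for number, line in enumerate(lines, start=1):
        if any(rx.search(line) for rx in compiled):
            hits.append((number, line.rstrip("\n")))
    return hits
```

For example, opening the saved log with open("/path/to/saved/syslog.log") and passing the file object to find_suspect_lines gives the matching lines with their line numbers, which makes it easier to see what was logged just before a freeze.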

Link to comment

I kicked off another MacOS install attempt after the memtest, and it locked up at pretty much the same place again.

 

However....

 

I then watched SpaceInvaderOne's 2017 video on Ryzen builds. From his suggestion I added:

 

rcu_nocbs=0-15

 

to my syslinux config and tried another install on MacOS....

 

Boom! It installed fine and works great.

 

So, given that the system crashed 3 times in a row when installing before, I think I may have solved it. I'm not going to relax just yet, but fingers crossed!
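For anyone wanting to make the same change, here is a rough sketch of what the edit amounts to. It assumes the syslinux config uses 'append' lines to carry the kernel parameters (mine is a single 'append initrd=/bzroot' line under the boot label) and, as a simplification, adds the parameter to every 'append' line it finds. The function names are made up for illustration. The 0-15 range covers the 16 threads of the Ryzen 2700, so other CPUs would need a different upper bound.

```python
def rcu_nocbs_param(thread_count: int) -> str:
    """Build an rcu_nocbs parameter covering CPUs 0..thread_count-1."""
    if thread_count < 1:
        raise ValueError("thread_count must be at least 1")
    return f"rcu_nocbs=0-{thread_count - 1}"


def add_kernel_param(cfg_text: str, param: str) -> str:
    """Insert `param` right after 'append' on each append line, once only."""
    out = []
    for line in cfg_text.splitlines():
        stripped = line.lstrip()
        parts = stripped.split()
        # Only touch 'append' lines that do not already carry the parameter.
        if parts and parts[0].lower() == "append" and param not in parts[1:]:
            indent = line[: len(line) - len(stripped)]
            line = indent + " ".join([parts[0], param] + parts[1:])
        out.append(line)
    # Preserve a trailing newline if the original text had one.
    return "\n".join(out) + ("\n" if cfg_text.endswith("\n") else "")
```

With a 16-thread CPU, add_kernel_param(text, rcu_nocbs_param(16)) turns 'append initrd=/bzroot' into 'append rcu_nocbs=0-15 initrd=/bzroot'. Running it a second time leaves the file unchanged. Keep a backup of the config before writing anything back, and note that the change only takes effect after a reboot.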

 

Thanks for all your help so far.

Link to comment
