UNRAID 6.6.6 - Soft Lockup - Ryzen 1800X - Asus X370 Pro

Marshalleq · January 17, 2019

All, while testing out Unraid (haven't paid yet) as a potential replacement for my QNAPs, I'm getting the below infrequently, but frequently enough that I won't be able to use this as a NAS / VM host or whatever else I need if I don't fix it. I have disabled C-states and am running the latest BIOS, 4207, with AGESA 1006.

I think it's a soft lockup as the keyboard lights were still on, but I didn't actually confirm it 100%. I did get a hard lockup the other day, but that was before I disabled C-states in the BIOS.

Can anybody help with any ideas / suggestions? If it's a hardware fault, I 'should' be able to return it - I assume the AMD warranties are fairly decent in length.

Thanks,

tower-diagnostics-20190117-2140.zip

Marshalleq · January 17, 2019

Wow, just posted and default filter pushes it behind all these other posts that are older. Doesn't give me much visibility does it.

JorgeB · January 17, 2019

In the BIOS, look for "Power Supply Idle Control" (or similar) and set it to "Typical Current Idle" (or similar).

Marshalleq · January 17, 2019

Thanks for that, mine was set to Auto, so have now changed it to typical current idle. Is this a known problem in general or specific to the error I posted? I haven't come across it before - googling now....

Thanks.

Marshalleq · January 17, 2019

I did find the below link, which suggests if I find the deep sleep setting and enable it I can then turn back on these other settings. Can't say I've noticed a deep sleep setting, but would be interested in your thoughts on this too. Sounds like you've figured out how to make yours nice and stable. I'll see how your suggestion works anyway. Thanks.

JorgeB · January 17, 2019

2 hours ago, Marshalleq said:

Sounds like you've figured out how to make yours nice and stable.

I don't have a Ryzen CPU, but that bios setting seems to fix the locking up issue with Unraid/Linux.

limetech · January 17, 2019

If you google 'ryzen linux' you will find numerous reports. Here is the granddaddy:

https://bugzilla.kernel.org/show_bug.cgi?id=196683

AMD replaced early chips:

https://www.extremetech.com/computing/254750-amd-replaces-ryzen-cpus-users-affected-rare-linux-bug

Marshalleq · January 21, 2019

Thanks for this - I've had it running for days now, with no issue. Though it hasn't exactly been idle, nevertheless, I think (hope? pray?!) it's sorted now!

Marshalleq · January 29, 2019

It just happened again overnight, when I suspect it was near idle. I've added rcu_nocbs=0-15 because I read this was still required, rebooted, and now it crashes before it even boots up, so I will have to remove that. But the question remains: what's crashing the system, and how do I fix it? Any other options you can think of?

Marshalleq · January 29, 2019

So I added the rcu_nocbs=0-15 parameter slightly wrong I think, which was causing the crash. Fixed now. Will see how that goes.
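
For anyone else adding this: as far as I can tell, Unraid takes its kernel parameters from the `append` line of `/boot/syslinux/syslinux.cfg` on the flash drive, and a misplaced token there is an easy way to end up with a box that won't boot. Below is a minimal sketch in Python of adding a parameter to one boot entry without duplicating it. The helper name `add_kernel_param`, the label text and the sample config are made up for illustration; check them against your own file before relying on any of it.

```python
def add_kernel_param(cfg_text: str, param: str, label: str) -> str:
    """Return cfg_text with `param` added to the append line of one boot entry.

    Only the first `append` line after `label <label>` is touched, and the
    parameter is not added a second time if it is already there.
    """
    out = []
    in_entry = False
    done = False
    for line in cfg_text.splitlines():
        stripped = line.strip()
        if stripped.lower().startswith("label "):
            # A new boot entry starts here; check whether it is the one we want.
            in_entry = stripped[len("label "):].strip() == label
        elif in_entry and not done and stripped.lower().startswith("append "):
            tokens = stripped.split()
            if param not in tokens:
                # Insert right after the "append" keyword, keeping the indentation.
                indent = line[: len(line) - len(line.lstrip())]
                tokens.insert(1, param)
                line = indent + " ".join(tokens)
            done = True
        out.append(line)
    return "\n".join(out) + "\n"


# Simplified, made-up config text (not copied from a real flash drive).
sample = """label Unraid OS
  kernel /bzimage
  append initrd=/bzroot
label Unraid OS Safe Mode
  kernel /bzimage
  append initrd=/bzroot unraidsafemode
"""

print(add_kernel_param(sample, "rcu_nocbs=0-15", "Unraid OS"))
```

The point of editing only one labelled entry is that the other entry stays untouched, so there is still something to boot from if the new parameter turns out to be the problem.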

juan11perez · February 1, 2019

Good day. New to Unraid and enjoying the learning. I built an Asrock x470itx with a Ryzen 2700x. My first build !

Picked the parts based on...... ordered them and then started reading about the lock ups

Anyway, took me 2 days to work out how to plug everything and get it up and running. Whilst updating the MOBO bios, I decided that it made sense to set up the RAM frequency, since it was stated in the package and I was there........

Plugged Unraid and started running; whilst executing pre-clearing the server would drop about every 4 hours.

To summarise, I tried every CPU setting in the bios with no success; until in the end I remembered fiddling with the RAM frequency...

Put it back in Auto, which I should have never touched and surprise, surprise..... server running without problem for 2 straight days.

Perhaps my outcome may have been obvious, but since no doubt there are many more newcomers like me, I thought worthwhile sharing this experience.

Marshalleq · February 2, 2019

Thanks, I have mine on Auto too - And it's been running fine since my last post. However yesterday and today I have again had lockups.

Haven't started googling yet, in case it's something new - though I don't see why it would be - any ideas? These Ryzens are seemingly quite a hassle. To recap, I've set the power supply idle control to typical current idle, disabled the C-states and added rcu_nocbs=0-15 (15 in my case) to the kernel boot parameters. Getting tired of having to do parity checks, which then find errors.

Marshalleq · February 2, 2019

Maybe I should run the new Unraid RC - I think I read it has some AMD stuff in it and definitely newer kernels...

Marshalleq · February 2, 2019

So since it didn't work anyway, and it appears to be unnecessary with my BIOS settings, I've removed rcu_nocbs=0-15 from my kernel boot params.

https://utcc.utoronto.ca/~cks/space/blog/linux/KernelRcuNocbsMeaning

This didn't happen when I was running Proxmox - and actually I didn't even have the BIOS settings set on that....

Think I'll try the RC for a few days.

Marshalleq · February 2, 2019

Updated to 6.7.0-rc2 and it has already crashed again. That's probably going to make it easier to track down - I'm suspicious about disk io now.... this can't be specific to Ryzen or it would have always been happening. I have added a Dell Perc H310 card though and probably now I'm actually beginning to write to those disks. Hmmm

Marshalleq · February 2, 2019

Call me suspicious, but have disabled the BTRFS cache. Will see how that goes. I thought there was a way to use XFS in the cache, however it doesn't seem to be available in RC2 - not that I saw anyway.

JorgeB · February 2, 2019

1 minute ago, Marshalleq said:

I thought there was a way to use XFS in the cache

There is, if it's a single-device cache; pools can only be btrfs, though I very much doubt btrfs has anything to do with those crashes.

Marshalleq · February 3, 2019

Yeah, even in single device, reformatted as nfs, it changed it to BTRFS. Even with the pool previously removed. I did see something about Btrfs caches being a problem before and one of those errors took me to an XFS issue which sounded similar. Stopping some large writes I was doing, it is now not crashing. Therefore I have removed the cache and restarted the writes. Will see what happens. If it goes away, I'll try to recreate it again I guess. I'd rather it was cache than something hardware related.

JorgeB · February 3, 2019

6 minutes ago, Marshalleq said:

Yeah, even in single device, reformatted as nfs, it changed it to BTRFS

You mean XFS? Then you missed a step, as XFS works for single cache.

Marshalleq · February 3, 2019

Yes, I formatted XFS first (confirmed that), but then it reformatted to Btrfs as soon as it was added. I recall last time I could dig into the settings and change it, but I couldn't find it this time - not as simple as I would have thought. And I AM on the RC. Not sure if it's still in the latest release candidate.

Marshalleq · February 3, 2019

Found it - put cache on XFS now.

Marshalleq · February 4, 2019

Sadly, two more. It always happens when I'm doing high IO. Maybe it's network. Guess I'm going to have to properly figure out how to read kernel exceptions.
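
As a starting point for reading the kernel messages, a rough sketch in Python that pulls out the lines which usually mark a lockup or stall in a syslog. The function name `find_lockup_lines` is made up for illustration, and the list of patterns covers only a few common kernel messages, so it is an assumption about what is worth looking for rather than a complete list.

```python
import re

# Substrings the Linux kernel commonly prints when it reports these problems.
PATTERNS = {
    "soft lockup": re.compile(r"BUG: soft lockup"),
    "hard lockup": re.compile(r"hard LOCKUP"),
    "rcu stall": re.compile(r"rcu_\w+ (self-)?detected stall"),
    "call trace": re.compile(r"Call Trace:"),
}


def find_lockup_lines(log_text):
    """Return (line number, pattern name, line) for every suspicious log line."""
    hits = []
    for lineno, line in enumerate(log_text.splitlines(), start=1):
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                hits.append((lineno, name, line.strip()))
                break  # one label per line is enough
    return hits
```

Calling `find_lockup_lines(open("syslog.txt").read())` on the syslog from a diagnostics zip (the file name here is only an example) gives the line numbers to start reading from. The lines following a "Call Trace:" hit name the kernel functions that were running, which is usually the useful part.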

Marshalleq · February 8, 2019

I also removed the 2nd NIC a few days ago. However, I woke up this morning and it had crashed. Maybe it's running on bitcoin lol.

I'm really struggling to get good logs on this one. It lasted for a few days longer, however as I had seen before this seems to happen more quickly at times of high I/O which has not been happening lately. Hmmm, I wonder if I have one of those Marvell controllers.

Marshalleq · February 8, 2019

Diagnostics attached. The crash happened somewhere between 10pm last night and 7.15 this morning.

obi-wan-diagnostics-20190209-0711-2.zip

Marshalleq · February 8, 2019

Seems the logs are wiped at every reboot. Surely not, but that's what they're telling me.
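
That would fit with Unraid running from RAM, in which case the log directory does not survive a reboot. One possible workaround is to copy the log somewhere persistent on a schedule. A minimal sketch in Python, assuming that is the cause: the function name `snapshot_log` and both paths in the usage note are hypothetical, and a copy only captures what was logged up to the last run, so the final moments before a crash can still be missing.

```python
import shutil
import time
from pathlib import Path


def snapshot_log(src, dest_dir, keep=10):
    """Copy `src` into `dest_dir` under a timestamped name.

    Only the newest `keep` copies are kept (keep should be at least 1).
    Returns the path of the new copy.
    """
    src = Path(src)
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    dest = dest_dir / f"{src.name}-{stamp}"
    shutil.copy2(src, dest)
    # Timestamped names sort chronologically, so the oldest copies come first.
    copies = sorted(dest_dir.glob(f"{src.name}-*"))
    for old in copies[:-keep]:
        old.unlink()
    return dest
```

Run from a scheduled job as something like `snapshot_log("/var/log/syslog", "/boot/logs")`, with both paths adjusted to your system. Frequent writes to a flash drive wear it out, so a longer interval or a small `keep` value is probably the safer choice.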
