Marshalleq Posted January 17, 2019
All, while testing out Unraid (haven't paid yet) as a potential replacement for my QNAPs, I'm getting the below infrequently, but often enough that I won't be able to use this as a NAS / VM host or whatever else I need to do if I don't fix it. I have disabled C-states and am running the latest BIOS 4207 with AGESA 1006. I think it's a soft lockup as the keyboard lights were still on, but I didn't 100% confirm that. I did get a hard lockup the other day, but that was before I disabled C-states in the BIOS. Can anybody help with any ideas / suggestions? If it's a hardware fault, I 'should' be able to return it - I assume the AMD warranties are fairly decent in length. Thanks,
tower-diagnostics-20190117-2140.zip
Marshalleq Posted January 17, 2019
Wow, just posted and the default filter pushes it behind all these other posts that are older. Doesn't give me much visibility, does it?
JorgeB Posted January 17, 2019
In the BIOS, look for "Power Supply Idle Control" (or similar) and set it to "Typical Current Idle" (or similar).
Marshalleq Posted January 17, 2019
Thanks for that. Mine was set to Auto, so I have now changed it to Typical Current Idle. Is this a known problem in general, or specific to the error I posted? I haven't come across it before - googling now.... Thanks.
Marshalleq Posted January 17, 2019
I did find the below link, which suggests that if I find the deep sleep setting and enable it, I can then turn these other settings back on. Can't say I've noticed a deep sleep setting, but I would be interested in your thoughts on this too. Sounds like you've figured out how to make yours nice and stable. I'll see how your suggestion works anyway. Thanks.
JorgeB Posted January 17, 2019
2 hours ago, Marshalleq said: "Sounds like you've figured out how to make yours nice and stable."
I don't have a Ryzen CPU, but that BIOS setting seems to fix the lockup issue with Unraid/Linux.
limetech Posted January 17, 2019
If you google 'ryzen linux' you will find numerous reports.
Here is the granddaddy: https://bugzilla.kernel.org/show_bug.cgi?id=196683
AMD replaced early chips: https://www.extremetech.com/computing/254750-amd-replaces-ryzen-cpus-users-affected-rare-linux-bug
Marshalleq Posted January 21, 2019
Thanks for this - I've had it running for days now with no issue. It hasn't exactly been idle, but nevertheless I think (hope? pray?!) it's sorted now!
Marshalleq Posted January 29, 2019
It just happened again overnight, when I suspect it was near idle. I added rcu_nocbs=0-15 because I read this was still required, then rebooted, and now it crashes before it even boots up. So I will have to remove that. But the question remains: what's crashing the system and how do I fix it? Any other options you can think of?
Marshalleq Posted January 29, 2019
So I think I added the rcu_nocbs=0-15 parameter slightly wrong, which was causing the crash. Fixed now. Will see how that goes.
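For anyone checking the same thing, here is a minimal sketch of how a kernel command line could be checked for a correctly spelled rcu_nocbs parameter that covers every CPU. The function names `parse_cpu_list` and `check_rcu_nocbs` are hypothetical helpers written for this example, and the check only looks at the text of the command line; it does not confirm that the kernel was built with support for the option.

```python
def parse_cpu_list(spec):
    """Expand a kernel-style CPU list such as "0-3,8" into a set of ints."""
    cpus = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            low, high = part.split("-", 1)
            cpus.update(range(int(low), int(high) + 1))
        else:
            cpus.add(int(part))
    return cpus


def check_rcu_nocbs(cmdline, cpu_count):
    """Return (ok, message) for a kernel command line string.

    ok is True only when rcu_nocbs is present and lists every CPU
    from 0 to cpu_count - 1.
    """
    params = {}
    for token in cmdline.split():
        key, _, value = token.partition("=")
        params[key] = value

    if "rcu_nocbs" not in params:
        # Point out look-alike keys, which usually mean a typo.
        similar = [key for key in params if key.endswith("_nocbs")]
        hint = f" (found similar key: {similar[0]})" if similar else ""
        return False, "rcu_nocbs is not set" + hint

    missing = set(range(cpu_count)) - parse_cpu_list(params["rcu_nocbs"])
    if missing:
        return False, f"rcu_nocbs does not cover CPUs {sorted(missing)}"
    return True, "rcu_nocbs covers all CPUs"
```

On a running Linux system the inputs would typically be the contents of /proc/cmdline and the value of `os.cpu_count()`; on a 16-thread CPU the range 0-15 covers every thread.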
juan11perez Posted February 1, 2019
Good day. New to Unraid and enjoying the learning. I built an Asrock x470itx with a Ryzen 2700x. My first build! Picked the parts based on...... ordered them and then started reading about the lockups. Anyway, it took me 2 days to work out how to plug everything in and get it up and running. Whilst updating the motherboard BIOS, I decided that it made sense to set the RAM frequency, since it was stated on the package and I was there........ Plugged in Unraid and started running; whilst pre-clearing, the server would drop about every 4 hours. To summarise, I tried every CPU setting in the BIOS with no success, until in the end I remembered fiddling with the RAM frequency... Put it back on Auto, which I should never have touched, and surprise, surprise..... the server has been running without problems for 2 straight days. Perhaps my outcome was obvious, but since no doubt there are many more newcomers like me, I thought it worthwhile sharing this experience.
Marshalleq Posted February 2, 2019
Thanks, I have mine on Auto too, and it had been running fine since my last post. However, yesterday and today I have again had lockups. Haven't started googling yet, in case it's something new - though I don't see why it would be - any ideas? These Ryzens are seemingly quite a hassle. To recap, I've set the power supply idle control, disabled the C-states and added rcu_nocbs=0-15 (15 in my case) to the kernel parameters. Getting tired of having to do parity checks, which then find errors.
Marshalleq Posted February 2, 2019
Maybe I should run the new Unraid RC - I think I read it has some AMD stuff in it and definitely newer kernels...
Marshalleq Posted February 2, 2019
So since it didn't work anyway, and it appears to be unnecessary with my BIOS settings, I've removed rcu_nocbs=0-15 from my kernel boot params. https://utcc.utoronto.ca/~cks/space/blog/linux/KernelRcuNocbsMeaning
This didn't happen when I was running Proxmox - and actually I didn't even have the BIOS settings set on that.... Think I'll try the RC for a few days.
Marshalleq Posted February 2, 2019
Updated to 6.7.0-rc2 and it has already crashed again. That's probably going to make it easier to track down - I'm suspicious about disk I/O now.... This can't be specific to Ryzen or it would always have been happening. I have added a Dell PERC H310 card, though, and I'm probably only now actually beginning to write to those disks. Hmmm
Marshalleq Posted February 2, 2019
Call me suspicious, but I have disabled the BTRFS cache. Will see how that goes. I thought there was a way to use XFS for the cache, however it doesn't seem to be available in RC2 - not that I saw anyway.
JorgeB Posted February 2, 2019
1 minute ago, Marshalleq said: "I thought there was a way to use XFS in the cache"
There is if it's a single-device cache; pools can only be btrfs. Though I very much doubt btrfs has anything to do with those crashes.
Marshalleq Posted February 3, 2019
Yeah, even in single device, reformatted as nfs, it changed it to BTRFS. Even with the pool previously removed. I did see something about Btrfs caches being a problem before, and one of those errors took me to an XFS issue which sounded similar. After I stopped some large writes I was doing, it is now not crashing. Therefore I have removed the cache and restarted the writes. Will see what happens. If it goes away, I'll try to recreate it again I guess. I'd rather it was the cache than something hardware related.
JorgeB Posted February 3, 2019
6 minutes ago, Marshalleq said: "Yeah, even in single device, reformatted as nfs, it changed it to BTRFS"
You mean XFS? Then you missed a step, as XFS works for a single-device cache.
Marshalleq Posted February 3, 2019
Yes, I formatted as XFS first (confirmed that), but then it reformatted to Btrfs as soon as it was added. I recall last time I could dig into the settings and change it, but I didn't find it this time - not as simple as I would have thought. And I AM on the RC. Not sure if it's still in the latest release candidate.
Marshalleq Posted February 3, 2019
Found it - put cache on XFS now.
Marshalleq Posted February 4, 2019
Sadly, two more. It always happens when I'm doing high I/O. Maybe it's the network. Guess I'm going to have to properly figure out how to read kernel exceptions.
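As a starting point for reading kernel logs, a small filter that pulls out the lines around lockups and crashes can narrow things down. The sketch below is an assumption-laden example: the pattern list is illustrative rather than exhaustive, and `find_lockup_lines` is a hypothetical helper, not part of Unraid or any library.

```python
import re

# Phrases that commonly appear in Linux kernel logs around lockups and
# crashes. This list is an illustrative assumption, not a complete set.
LOCKUP_PATTERNS = [
    r"soft lockup",
    r"hard lockup",
    r"self-detected stall",
    r"kernel panic",
    r"call trace",
]
_LOCKUP_RE = re.compile("|".join(LOCKUP_PATTERNS), re.IGNORECASE)


def find_lockup_lines(lines):
    """Return (line_number, text) pairs for lines matching a known pattern.

    lines can be any iterable of strings, for example an open log file.
    Line numbers start at 1.
    """
    hits = []
    for number, line in enumerate(lines, start=1):
        if _LOCKUP_RE.search(line):
            hits.append((number, line.rstrip("\n")))
    return hits
```

Passing an open syslog file to `find_lockup_lines` gives the line numbers to jump to; the lines just before each hit usually show what the system was doing at the time.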
Marshalleq Posted February 8, 2019
I also removed the 2nd NIC a few days ago. However, I woke up this morning and it had crashed. Maybe it's running on bitcoin lol. I'm really struggling to get good logs on this one. It lasted a few days longer this time; however, as I had seen before, this seems to happen more quickly at times of high I/O, which has not been happening lately. Hmmm, I wonder if I have one of those Marvell controllers.
Marshalleq Posted February 8, 2019
Diagnostics attached. The crash happened somewhere between 10pm last night and 7.15 this morning.
obi-wan-diagnostics-20190209-0711-2.zip
Marshalleq Posted February 8, 2019
Seems the logs are wiped at every reboot. Surely not, but that's what they're telling me.
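If the live log does not survive a reboot, one workaround is to copy new log data to persistent storage at intervals so that the last entries before a crash are kept. The following is only a sketch under stated assumptions: `copy_new_log_bytes` is a hypothetical helper, and the source and destination paths are left to the caller because the right locations depend on the system.

```python
import os


def copy_new_log_bytes(src_path, dst_path, offset=0):
    """Append bytes written to src_path since offset onto dst_path.

    Returns the new offset to pass on the next call. If the source file
    is now shorter than offset (rotated or truncated), copying restarts
    from the beginning of the source.
    """
    if os.path.getsize(src_path) < offset:
        offset = 0  # source was rotated or truncated

    with open(src_path, "rb") as src, open(dst_path, "ab") as dst:
        src.seek(offset)
        chunk = src.read()
        dst.write(chunk)
        dst.flush()
        # Ask the OS to push the data to the device, so it is more
        # likely to survive a lockup shortly afterwards.
        os.fsync(dst.fileno())

    return offset + len(chunk)
```

Calling this in a loop with a short sleep, keeping the returned offset between calls, keeps the persistent copy close to the live log. Frequent small writes add wear if the destination is a USB flash drive, so the interval is a trade-off.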