Anybody planning a Ryzen build?


Recommended Posts

33 minutes ago, Pauven said:

 

How do you do this?  Is this in your BIOS or in unRAID?

 

I'm talking about the GPU's BIOS in this case, and physically swapping the cards to put the 970 in the first slot.  I basically just followed:

 

http://lime-technology.com/wiki/index.php/UnRAID_6/VM_Management#Assigning_Graphics_Devices_to_Virtual_Machines_.28GPU_Pass_Through.29


 

In short, to get an Nvidia GPU that is also the boot GPU to pass through, you need to manually provide its BIOS via an XML edit.  You can try other people's dumps, but it's really not hard to grab a copy of the BIOS off your actual card, and that did the trick without any issues.  Conventional wisdom is that an AMD card doesn't need this, and I'd really RATHER be using the lower-power-consumption card for boot, but I just couldn't get that card to initialize at all if it was used to boot, with my own BIOS or a dumped one.  Might try again at some point, but for now I want to see what I can do in terms of providing some uptime data, now that my desktop environment is doing what I really NEED day-to-day and I've confirmed that my gaming performance is essentially identical to what it was with the 970 and a Haswell i7 running Windows on the metal.
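
For reference, the XML edit from that wiki page boils down to pointing the GPU's hostdev entry at the dumped ROM file.  A sketch only; the PCI address and file path below are placeholders, not my actual config:

```xml
<!-- Inside the VM's domain XML, in the GPU's <hostdev> entry.
     The bus/slot address and the ROM path are placeholder values;
     use your card's actual address and wherever you saved the dump. -->
<hostdev mode='subsystem' type='pci' managed='yes'>
  <source>
    <address domain='0x0000' bus='0x01' slot='0x00' function='0x0'/>
  </source>
  <rom file='/boot/vbios/gtx970.rom'/>
</hostdev>
```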

 

PS: sorry for the ninja edits

Edited by Bureaucromancer
  • Upvote 1
Link to comment
2 hours ago, Pauven said:

 

Haha!  Absolutely!  Grenades are expensive.  As they say, "Close only counts in horseshoes, hand grenades, and government work".  My clients don't get to question my methods, only my results.  Though I suppose I would have chosen my words more carefully for a paying client, perhaps something like "an Accelerated First Pass Test Algorithm designed to minimize total troubleshooting time and client expense."

 

 

Wow!  That's huge!  Thanks for sharing.  I think this is the first confirmed crash while running a Windows 10 VM with GPU passthrough.  This may confirm my suspicion that the Win10 VMs aren't preventing the crashes, merely delaying them, but the update activity throws a wrench into the mix.

 

When you say just as the OS was trying to update, which step of the update process was it performing?  Downloading, Verifying, Installing, Shutdown Installing, Rebooting, Booting Installing, or something else?  Or can you not tell?

 

Hmmm, I wonder if the Windows VM was resuming from a deep sleep at 4:30 AM in order to begin the update.  That would tie in well with power saving features being part of the issue.  Maybe it isn't going into a low power state that causes the crash, but rather coming out of one.  As the Windows VM woke up, it forced unRAID and the CPU to wake up too, so to speak.

 

Still doesn't make sense why this would only affect unRAID.  

 

Jon, care to comment on anything special Lime-Tech is doing related to power management in unRAID?

 

-Paul

 

I'm out on business till Sunday, but I recall the VM having an issue finding or saving to storage.  My initial thought was that the update was causing the crash, but I now believe unRAID crashed before the Win 10 VM went down.

 

The Win 10 VM has power management settings turned off: no screen saver, sleep, etc.  I.e., the crash was not caused by a resume or a sleep state.

 

I have another PC monitoring the unRAID log.  Hope to have more info shortly.

 

FWIW all of my crashes seem to take place in the AM.

Link to comment

Thanks Bureaucromancer for the guidance, that's very helpful.  I'll look into it further after I get my testing done.

 

Testing Update:

 

Thunderstorms rolled through Atlanta last night, and a multi-hour power outage exceeded my UPS's capacity, cutting my test short.  At last inspection, the server had been up for about 55-56 hours, 48 with a Win10 VM running, and an additional 7 to 8 hours with the VM stopped.  It appears to me that a BIOS setting is resolving the issue, and not necessarily running the Win10 VM.

 

This morning I cranked the server back up and reset the BIOS to UEFI defaults.  Unfortunately, to my eyes it appears that the defaults did not revert all the settings I tweaked; either that, or I am going cross-eyed.  I thought about manually enabling all the things I thought I disabled, but decided to run as-is.  Turns out it doesn't matter, as I've already got my first crash of the day and my testing is back on track.

 

So for my first test I re-disabled Cool 'n' Quiet and 'Deep Sleep' (which should control the C4 power state).  The system crashed within 90 minutes.  I didn't even have enough time to get into the GUI and start the array (coffee enjoyment takes priority), so no array and no VM and no nuttin' running.

 

I'm now running my second test.  I re-enabled Cool 'n' Quiet, left 'Deep Sleep' disabled, and disabled both 'C6 Mode' and 'Global C-state Control'.  That second one I had disabled in my earlier 'box-o-grenades' testing, but missed documenting it since it was not in the user manual and I was going from memory.  While I'm mentioning items I missed, I had also previously disabled 'Core Performance Boost' and failed to document it.  Both items are hidden pretty deep in my BIOS, under "Advanced\AMD CBS\Zen Common Options", the same panel where I had disabled SMT.

 

My gut is telling me that C6 Mode or Global C-state Control is where the magic is hiding.  I can't find any good info on what the Global C-state Control actually does, but these last two settings are the only ones left to play with that seem related to idle state power savings.

 

-Paul

Link to comment
1 minute ago, chadjj said:

Hey Guys,

 

I found some good news posted today at the following link.  We can look forward to a BIOS update for our motherboard to address the AMD CPU freezing issues early April.

 

https://www.bit-tech.net/news/hardware/2017/03/31/amd-agesa-update-april/1

 

Chad

 

The crashes that we are experiencing with unRAID are most likely NOT caused by the FMA3 bug that this microcode BIOS update is addressing.

 

Right now the leading candidate for the unRAID crashing on Ryzen problem is either the C6 Mode or Global C-state Control settings in the BIOS (may be called different things on different motherboards, this is ASRock terminology).  Completely unrelated to the FMA3 bug.  Though it may still be a bit early to say this for certain, disabling both has brought stability to my machine.  I plan to do more tests disabling just one or the other, to determine which feature is causing the problem and to make sure this is not a fluke.

 

I recommend others play with disabling these settings in their BIOS as well.

 

-Paul

Link to comment

I found this interesting info on AMD's AGESA code update, which AMD is releasing to mb partners for BIOS updates.  This includes the FMA3 fix, which won't fix unRAID, but also 3 other changes.  I especially like the DRAM latency improvements.

 

Here's the link to the source article:  https://community.amd.com/community/gaming/blog/2017/03/30/amd-ryzen-community-update-2

 

Quote

 

Let’s talk BIOS updates

 

Finally, we wanted to share with you our most recent work on the AMD Generic Encapsulated Software Architecture for AMD Ryzen™ processors. We call it the AGESA™ for short.

 

As a brief primer, the AGESA is responsible for initializing AMD x86-64 processors during boot time, acting as something of a “nucleus” for the BIOS updates you receive for your motherboard. Motherboard vendors take the baseline capabilities of our AGESA releases and build on that infrastructure to create the files you download and flash.

 

We will soon be distributing AGESA point release 1.0.0.4 to our motherboard partners. We expect BIOSes based on this AGESA to start hitting the public in early April, though specific dates will depend on the schedules and QA practices of your motherboard vendor.

 

BIOSes based on this new code will have four important improvements for you:

  1. We have reduced DRAM latency by approximately 6ns. This can result in higher performance for latency-sensitive applications.
  2. We resolved a condition where an unusual FMA3 code sequence could cause a system hang.
  3. We resolved the “overclock sleep bug” where an incorrect CPU frequency could be reported after resuming from S3 sleep.
  4. AMD Ryzen™ Master no longer requires the High-Precision Event Timer (HPET).

 

We will continue to update you on future AGESA releases when they’re complete, and we’re already working hard to bring you a May release that focuses on overclocked DDR4 memory.

 

 

Link to comment

I also wanted to chime in and thank the early adopters and developers for their testing. Ryzen seems great on paper but I value stability over anything, so I'll keep using my Xeon until we are sure that unRAID on Ryzen has reached parity in terms of reliability.

  • Upvote 1
Link to comment

So things are looking positive, though a long way from conclusive on my end.  I turned off global c-states (couldn't find anything explicitly about C6 on my C6H) and shut down my VM overnight last night.  Now at 18 hours up-time WITHOUT a vm.

 

Probably going to end up crashing it sometime today experimenting with an Ubuntu VM, but will report in if I get another idle crash or some really significant idle uptime.

Link to comment
1 hour ago, Bureaucromancer said:

So things are looking positive, though a long way from conclusive on my end.  I turned off global c-states (couldn't find anything explicitly about C6 on my C6H) and shut down my VM overnight last night.  Now at 18 hours up-time WITHOUT a vm.

 

Probably going to end up crashing it sometime today experimenting with an Ubuntu VM, but will report in if I get another idle crash or some really significant idle uptime.

Asus released a new BIOS for my Prime x370-Pro so they may have for your board as well.

 

Chad

Link to comment

I have some great news.  I've been watching the ASRock website for BIOS updates, and they finally released a new beta BIOS for my motherboard, Version 1.99X.  The notes were listed as:

 

Quote

1.Improve XMP DRAM compatibility.
2.Add experimental automatic overclocking feature.

 

Both sounded pretty good to me.  Flashing went a lot easier this time versus the nightmare I had last time.  I was able to flash from right inside the BIOS, didn't even need a thumb drive.

 

I booted into the UEFI, and the memory speeds and timings appeared to be unchanged, so no magic bullet there.  When I looked in the Overclocking tab, there was a new checkbox for "Release the Krak'n (Auto-X-Clock)".  I enabled it, saved and rebooted.  The motherboard kept rebooting (I could hear the fans spin up/down/up/down) and it would emit the boot beep sound every 20-30 seconds.  I thought it was caught in a loop, and was about to hit the CMOS clear button when I finally caught a message flashing briefly on my monitor between reboots.  It was too quick to read the whole message, but it seemed to say something about frequency optimization and "Auto-X-Clock", so I let it keep doing its thing.  It was really stressing me out, as it did this optimization routine for about 5 minutes; best guess is that it was testing different CPU speeds to figure out what the individual CPU's limits were.

 

Finally it finished this routine and displayed a UEFI screen with boot options, so I booted into Windows 10 (not a VM) to see if it was overclocked.  When I opened up CPU-Z I was shocked when it said my 1800X was now running at 4.48GHz!!!  What!!!  I had hopes, but nothing like that.  I couldn't believe it.  I had to open up AIDA-64 for more detail, and sure enough all 8 cores were hovering at 4.47GHz.  With excitement I started up 3DMark Time Spy.  I knew there was no way it would make it through without crashing, but sure enough it finished, with a score of 27203.  Seems pretty good to me.  I bet I could crack 30k with a 1080Ti, but I'm not complaining about my old GTX 670.  I ran the AIDA stability test and SuperPi, basically anything I could to get it to crash, but this thing is running solid.

 

My Noctua cooler seems to be doing good too, still keeping my temps below 45c at these speeds.  Normally I see temps in the low 30's, so it is definitely running warmer, but I think mid 40's is okay.  Even better is when I looked at my Kill-a-Watt power meter.  I figured it had to have gone up from my usual idle wattage of about 40-45w, but no, somehow it was hovering between 14w-17w at idle.  This is amazing!

 

I'm not sure if this is the new AGESA microcode AMD is releasing, or something special ASRock cooked up.  Hopefully this isn't an ASRock-only feature.  At these speeds, the Intel i7-7700K is toast!

 

Oh, almost forgot to mention... Happy April Fools everyone!

 

-Paul

Link to comment
9 minutes ago, chadjj said:

Asus released a new BIOS for my Prime x370-Pro so they may have for your board as well.

 

Chad

 

They have, and I've been on it for a couple days.  It's very much RAM-focused, however, and for all the issues people have had, mine has been a non-issue since I got off the BIOS that was on the board out of the box.

Link to comment

Just a couple years ago I was reading a fascinating article about the return of 3Dfx, and their new graphics card that was beating up on AMD and nVIDIA's latest and greatest.  I was in shock, I thought 3Dfx was long dead, and here they were with a surprise return out of nowhere, and were now in the performance lead.  I kept thinking this was impossible... but I kept believing what I was reading, with pure fascination and joy... until I got to the Minesweeper FPS charts!!!  I had completely forgotten it was April 1st.  The experience forever changed me,  to the point I can't wait for April 2nd to come and all the fake news to end.  I now have a really hard time believing anything I read on the 1st of April.

 

So, being April Fools and all, I had to get the big prank out of the way earlier before sharing this news.  Otherwise I was going to wait until tomorrow to share this development, in case someone thought that this was a prank.  This one is no joke.

 

It seems that my unRAID stability issues are resolved by disabling "Global C-state Control".  In the ASRock BIOS you can find it under:

 

  • Advanced  ->  AMD CBS  ->  Zen Common Options  ->  Global C-state Control.

 

I've read elsewhere that on other motherboards you should still find it in the Zen Common Options section, so look for it there in your own BIOS.

 

Here are my last 8 tests, the last 4 of which I've only been toggling just this one setting:

  • Enabled  = <4 hours, crash
  • Disabled = 55+ hours, System auto-shutdown when UPS battery drained during power outage.
  • Enabled  = <90 minutes, crash
  • Disabled = 10+ hours, got bored and rebooted
  • Enabled  = <30 minutes, crash
  • Disabled = 13+ hours, got bored and rebooted
  • Enabled  = <3.5 hours, crash
  • Disabled = 6+ hours, decided I was done testing

 

Most importantly, the server has never crashed with Global C-state Control disabled.  

 

The only real downside I've found so far is that idle wattage has definitely increased.  With Global C-State Control enabled, I was typically seeing idle wattage of 73w, with it disabled it's now idling at 87w.  That's a big 14w jump for an idle CPU.  (Note, I just tested this again with my 24-drive controller finally installed and all hard drives spun up.  Wattage went from 151w with it enabled, to 158w with it disabled, so a more palatable 7w difference.  Not sure why the better result.)

 

I've also found that, with it enabled, I can get the server to crash pretty quickly simply by ignoring the server (no, not emotionally, silly).  It seems that even bringing up the web page and monitoring the Dashboard creates just enough load to make it take longer to crash, much as running a VM gives the server more work and delays a crash, often by days.  If I completely ignore the server (by closing the web pages so they aren't updating in the background), the server typically crashes within an hour.  Probably because the server is going into lower power states much more readily with nothing to do.
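
Building on that observation, one crude stopgap (my own untested sketch, not something verified in this testing) would be a background keep-alive script that wakes the CPU briefly on a fixed interval, mimicking the light load of an open Dashboard page:

```shell
# keepalive.sh - periodically wake the CPU so it never lingers in deep
# idle.  A stopgap sketch only; the one-second interval is arbitrary.
keepalive() {
  n=$1                       # number of one-second wake-ups to perform
  i=0
  while [ "$i" -lt "$n" ]; do
    i=$((i + 1))             # trivial work, then back to sleep
    sleep 1
  done
  echo "completed $i wake-ups"
}
```

In practice you'd run an unbounded version of the loop in the background (e.g. launched from the go script on the flash drive) while testing BIOS options; it's a band-aid, not a fix.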

 

Since the problem really does seem to be with power management and the processor entering very low power states,  I've been trying to figure out which C-state exactly might be causing the problem.  I've done some research and matched it up with my testing, and here's what I've got so far:

 

  • According to AMD, if "Cool'N'Quiet" is disabled, then low power C-states will also be disabled.  Not sure if this is true, nor exactly which C-states get disabled, but disabling Cool'N'Quiet did not resolve the issue.
  • Disabling "Deep Sleep" (which is a variation on C3, not C4 like I thought earlier) also did not resolve the issue. 
  • Disabling "C6 Mode" (a.k.a. "Deep Power Down" mode) also did not resolve the issue.
  • Only disabling "Global C-state Control" resolved the issue.  I haven't been able to find any answers on exactly what this setting is doing.  I can only assume it is completely disabling all C-states, from C1-C6 (or higher, haven't been able to figure out what C-states Ryzen supports), leaving the CPU always operating in C0.

 

From what I've read, it seems that the processor has to increment through each C-state to reach the next, i.e. C1 -> C2 -> C3 -> C4 -> C6.  You can only get to C6 from C4, and only to C4 from C3.  Since C3 is also known as "Deep Sleep", when I disabled Deep Sleep in my BIOS, I think that means I effectively disabled C3, C4 and C6.  And since that didn't fix the problem, that might point the finger back at C1 or C2.  But that's purely a guess.

 

On the other hand, I've also read that the Linux kernel can at times override the BIOS with regard to C-states.  Perhaps the unRAID kernel has simply been compiled with an option that isn't compatible with Ryzen.  I checked the max C-state (by running cat /sys/module/intel_idle/parameters/max_cstate ) and it came back with a value of 9, for C9.  I have no idea if Ryzen supports C9, but if it does not, then perhaps unRAID is trying to enter C-states the processor doesn't support (the latest Intel CPUs can go up to C10).  Or maybe this means nothing, I really don't know.
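
One way to see what the kernel itself thinks is available, rather than guessing from BIOS labels, is to read the cpuidle entries in sysfs.  These are standard Linux paths, though the exact states listed (and whether the directory exists at all) will vary by machine and kernel:

```shell
# List the idle states the kernel's cpuidle framework exposes for CPU0,
# plus the active cpuidle driver.  With C-states globally disabled in
# the BIOS, fewer (or no) state directories should show up here.
for d in /sys/devices/system/cpu/cpu0/cpuidle/state*/; do
  [ -d "$d" ] || continue
  printf '%s: %s\n' "$(cat "${d}name")" "$(cat "${d}desc")"
done
printf 'driver: %s\n' \
  "$(cat /sys/devices/system/cpu/cpuidle/current_driver 2>/dev/null || echo unknown)"
```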

 

There are still some pretty big lingering questions:

  • Why is this C-state stability issue only happening in unRAID, and not other Linux distros? (I'm guessing a kernel compilation option that Lime-Tech is using).  
  • Why does it only affect Ryzen, and not the Bulldozer architecture?  (I'm guessing the compiled options are confused by Zen but not by Bulldozer).  
  • Is this fixable? (I'm thinking it is, but will require Lime-Tech's involvement).
  • Does this even need to be fixed, if we have a BIOS workaround? (To me, yes, because idle power consumption increased)
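
If the kernel side does turn out to matter, there's one kernel-level experiment that can be tried independently of the BIOS: capping the deepest allowed C-state with a boot parameter.  On unRAID that would mean editing the append line in syslinux.cfg on the flash drive.  A sketch only, assuming the stock boot label; I haven't tested this:

```
label unRAID OS
  menu default
  kernel /bzimage
  append processor.max_cstate=1 initrd=/bzroot
```

processor.max_cstate is a standard Linux kernel parameter that limits the ACPI idle driver; whether it actually helps on Ryzen is an open question.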

 

So at this point, I'm done with my testing.  I think I've pointed Lime-Tech in the direction of the C-states, and possibly the kernel compilation options related to C-states.  This has all now greatly exceeded my Linux skill level and I won't be of any additional help.  But in the mean time, at least we have a workaround, so my work here is done.

 

For the very first time, I plugged in my 24-drive controller card and brought my 42TB array up on Ryzen, and a parity check is chugging along nicely.  It's been down for weeks during all this testing, and I've got a lot of data to get copied over (yesterday was World Backup Day, after all).  Fingers crossed...

 

-Paul

 

P.S. - I'm also attaching two Diagnostics zip files, one with Global C-State Control enabled, and the other with it disabled.  I used Beyond Compare on the two log files, and didn't see much related to ACPI or power management, but perhaps they will prove useful to someone else.

 

Global_C-state_Control_OFF_tower-diagnostics-20170401-0641.zip

Global_C-state_Control_ON_tower-diagnostics-20170401-0641.zip

  • Upvote 5
Link to comment
11 hours ago, Pauven said:

It seems that my unRAID stability issues are resolved by disabling "Global C-state Control".

 

Hi Paul,

 

How's your uptime now that you're no longer testing?

 

Thanks,

Chad

Link to comment
27 minutes ago, chadjj said:

 

Hi Paul,

 

How's your uptime now that you're no longer testing?

 

Thanks,

Chad

 

13 hours and counting.  It's still early, but over the coming days I expect this number to become more substantial.

 

The parity check completed in 7:35, same exact time as on my previous server.  I wasn't expecting any performance improvements, though certainly nice to see no new performance issues cropped up in the platform change.  Dual Parity had increased my times from 7:20 to 7:35, and I had been curious to see if that was a performance issue on my G1610 that would improve with more horsepower... guess not.

 

Since the parity check completed, the server has been idle for the last 5+ hours.

 

I see that 6.3.3 came out a few days back, with kernel 4.9.19.  I was so busy testing I didn't even notice until last night.  So, should I go ahead and upgrade from 6.3.2, or just let it keep running to see how long it can go without crashing?  I didn't see anything in the release notes relevant to Ryzen, and nothing else that looks worth chasing for my needs.  I'm thinking let it run, as-is, on 6.3.2.

 

-Paul

Link to comment
5 hours ago, Pauven said:

 

13 hours and counting.  It's still early, but over the coming days I expect this number to become more substantial.

 

The parity check completed in 7:35, same exact time as on my previous server.  I wasn't expecting any performance improvements, though certainly nice to see no new performance issues cropped up in the platform change.  Dual Parity had increased my times from 7:20 to 7:35, and I had been curious to see if that was a performance issue on my G1610 that would improve with more horsepower... guess not.

 

Since the parity check completed, the server has been idle for the last 5+ hours.

 

I see that 6.3.3 came out a few days back, with kernel 4.9.19.  I was so busy testing I didn't even notice until last night.  So, should I go ahead and upgrade from 6.3.2, or just let it keep running to see how long it can go without crashing?  I didn't see anything in the release notes relevant to Ryzen, and nothing else that looks worth chasing for my needs.  I'm thinking let it run, as-is, on 6.3.2.

 

-Paul

 

Hi Paul,

 

I completed the update to 6.3.3 yesterday on my old Intel machine and, as usual, no issues so far.  You are correct that 4.9.19 doesn't seem to have any Ryzen-related updates.  I will move my drives back into my Ryzen system, make the C-state change, and keep you posted if I run into any crashes.  I looked to see if there were any other settings related to sleep on my board and didn't notice anything.  There is also a BIOS update that came out yesterday, so hopefully that cleaned some things up.

 

I'll keep you posted.

 

Thanks,

Chad

 

Edited by chadjj
Link to comment
4 hours ago, chadjj said:

 

Hi Paul,

 

I completed the update to 6.3.3 yesterday on my old Intel machine and, as usual, no issues so far.  You are correct that 4.9.19 doesn't seem to have any Ryzen-related updates.  I will move my drives back into my Ryzen system, make the C-state change, and keep you posted if I run into any crashes.  I looked to see if there were any other settings related to sleep on my board and didn't notice anything.  There is also a BIOS update that came out yesterday, so hopefully that cleaned some things up.

 

I'll keep you posted.

 

Thanks,

Chad

 

 

Hey Paul,

 

I was away from the house, but at some point the server became non-responsive again.  C-states were disabled, and the new ASUS Prime x370-PRO BIOS 0515 was installed.  I checked the console and it appeared responsive: you could type, but nothing happened when you did.  Checked the logs; nothing was captured.

 

So in my instance, disabling C-states did not result in increased uptime.  unRAID became unresponsive in under 3 hrs.

 

Chad

Link to comment
18 hours ago, chadjj said:

 

Hey Paul,

 

I was away from the house, but at some point the server became non-responsive again.  C-states were disabled, and the new ASUS Prime x370-PRO BIOS 0515 was installed.  I checked the console and it appeared responsive: you could type, but nothing happened when you did.  Checked the logs; nothing was captured.

 

So in my instance, disabling C-states did not result in increased uptime.  unRAID became unresponsive in under 3 hrs.

 

Chad

 

Unfortunately, things are not completely rosy on my end either.  I'm seeing some new behaviors.

 

I have experienced the same console issue on 3 separate occasions now.  The console is still alive, letting you type, but it never responds.  Interestingly enough, the server is not hung in this state: I could press the power button, and the server completed a successful shutdown, generating console output as it did so.  Also, the GUI still works and I can still telnet into the server.  So in this case, it seems the console, and only the console, is hanging.  Not sure if it was behaving this way in my earlier tests, as I don't typically go to the basement to check the console.

 

I am also seeing the following message already on the console when it is in this hung state; not sure if it is relevant:

basename: missing operand
Try 'basename --help' for more information.
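
For what it's worth, that message is simply what basename emits when it is invoked with no argument at all. A common way that happens (a guess about the cause here, not a confirmed one) is a startup script expanding an empty or unset variable unquoted:

```shell
# Sketch: reproduce "basename: missing operand". With FILE unset, the
# unquoted $FILE expands to nothing, so basename runs with zero arguments.
unset FILE
basename $FILE 2>&1 || true    # prints: basename: missing operand
```

So the message likely points at a script bug (or an unpopulated variable) rather than the hang itself.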

 

Additionally, I have been able to get the server to hard hang twice while trying to install a Linux VM.

 

Yesterday the server had been up for about 13 hours, and I wanted to start playing with VMs again.  I created a new Lubuntu VM, started it up, opened VNC, and selected the option to "Install Lubuntu".  Immediately VNC froze, the server began grinding to a halt, and error messages showed up in the log (see below).  I was able to move from the VM tab to the Main tab in the GUI, but it only partially displayed, and then the GUI became non-responsive to navigation.  As described above, the console allowed me to type but did not respond to my inputs.  I had a family gathering to go to, so I left it as-is, hoping it was just a temporary slowdown.  When I returned several hours later, the server and console were fully hung, so I restarted the server.

 

Here's the error messages that made it into the log:

Apr 2 12:02:27 Tower kernel: br0: port 2(vnet0) entered blocking state
Apr 2 12:02:27 Tower kernel: br0: port 2(vnet0) entered disabled state
Apr 2 12:02:27 Tower kernel: device vnet0 entered promiscuous mode
Apr 2 12:02:27 Tower kernel: br0: port 2(vnet0) entered blocking state
Apr 2 12:02:27 Tower kernel: br0: port 2(vnet0) entered forwarding state
Apr 2 12:02:27 Tower kernel: kvm: zapping shadow pages for mmio generation wraparound
Apr 2 12:02:27 Tower kernel: kvm: zapping shadow pages for mmio generation wraparound
Apr 2 12:04:04 Tower kernel: INFO: rcu_preempt detected stalls on CPUs/tasks:
Apr 2 12:04:04 Tower kernel: 10-...: (1 GPs behind) idle=431/140000000000000/0 softirq=2556796/2556796 fqs=14307 
Apr 2 12:04:04 Tower kernel: 11-...: (1 GPs behind) idle=cf5/140000000000000/0 softirq=1178180/1178180 fqs=14307 
Apr 2 12:04:04 Tower kernel: (detected by 7, t=60002 jiffies, g=1654671, c=1654670, q=58202)
Apr 2 12:04:04 Tower kernel: Task dump for CPU 10:
Apr 2 12:04:04 Tower kernel: CPU 2/KVM R running task 0 7881 1 0x00000008
Apr 2 12:04:04 Tower kernel: ffff88000fa3a640 ffff88101ee97a80 ffffc90013d5bcb0 ffffffffa040ecce
Apr 2 12:04:04 Tower kernel: ffffc90013d5bcc0 ffffffffa042cccd ffff880054fb0000 ffffc90013d5bcd8
Apr 2 12:04:04 Tower kernel: 0000000000208c78 ffff880054fb0000 000035b5337c5fa0 ffffc90013d5bcf8
Apr 2 12:04:04 Tower kernel: Call Trace:
Apr 2 12:04:04 Tower kernel: [<ffffffffa040ecce>] ? kvm_get_rflags+0x15/0x26 [kvm]
Apr 2 12:04:04 Tower kernel: [<ffffffffa042cccd>] ? kvm_apic_has_interrupt+0x3e/0x8f [kvm]
Apr 2 12:04:04 Tower kernel: [<ffffffffa042cd06>] ? kvm_apic_has_interrupt+0x77/0x8f [kvm]
Apr 2 12:04:04 Tower kernel: [<ffffffffa042cea8>] ? kvm_get_apic_interrupt+0xf3/0x1af [kvm]
Apr 2 12:04:04 Tower kernel: [<ffffffff813ad654>] ? __delay+0xa/0xc
Apr 2 12:04:04 Tower kernel: [<ffffffffa042c635>] ? wait_lapic_expire+0xdf/0xe4 [kvm]
Apr 2 12:04:04 Tower kernel: [<ffffffffa0418602>] ? kvm_arch_vcpu_ioctl_run+0x1e2/0x1165 [kvm]
Apr 2 12:04:04 Tower kernel: [<ffffffffa0450aa5>] ? svm_vcpu_load+0xe1/0xe8 [kvm_amd]
Apr 2 12:04:04 Tower kernel: [<ffffffffa0412fc8>] ? kvm_arch_vcpu_load+0xea/0x1a0 [kvm]
Apr 2 12:04:04 Tower kernel: [<ffffffffa0408d6a>] ? kvm_vcpu_ioctl+0x178/0x499 [kvm]
Apr 2 12:04:04 Tower kernel: [<ffffffff810ee247>] ? handle_mm_fault+0xbd8/0xf96
Apr 2 12:04:04 Tower kernel: [<ffffffff8112fe72>] ? vfs_ioctl+0x13/0x2f
Apr 2 12:04:04 Tower kernel: [<ffffffff811303a2>] ? do_vfs_ioctl+0x49c/0x50a
Apr 2 12:04:04 Tower kernel: [<ffffffff81138f7f>] ? __fget+0x72/0x7e
Apr 2 12:04:04 Tower kernel: [<ffffffff8113044e>] ? SyS_ioctl+0x3e/0x5c
Apr 2 12:04:04 Tower kernel: [<ffffffff8167d2b7>] ? entry_SYSCALL_64_fastpath+0x1a/0xa9
Apr 2 12:04:04 Tower kernel: Task dump for CPU 11:
Apr 2 12:04:04 Tower kernel: CPU 3/KVM R running task 0 7882 1 0x00000008
Apr 2 12:04:04 Tower kernel: ffff88000fa3b300 ffff88101eed7a80 ffffc90013d63cb0 ffffffffa040ecce
Apr 2 12:04:04 Tower kernel: ffffc90013d63cc0 ffffffffa042cccd ffff8802bb858000 ffffc90013d63cd8
Apr 2 12:04:04 Tower kernel: 000000000020a768 ffff8802bb858000 000035b5333f5df4 ffffc90013d63cf8
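
As an aside, the interesting lines in a dump like this are the RCU stall notice, the per-CPU task dumps, and the call traces; the rest is register/stack noise. A quick way to pull just those out of a saved syslog is a grep like the sketch below (the log text is inlined here for illustration; on the server you would point it at /var/log/syslog):

```shell
# Sketch: filter the RCU-stall, task-dump, and call-trace lines out of a
# kernel log. The here-document stands in for /var/log/syslog; the first
# three lines match, the vnet0 line does not.
grep -E 'rcu_preempt|Task dump for CPU|Call Trace' <<'EOF'
Apr 2 12:04:04 Tower kernel: INFO: rcu_preempt detected stalls on CPUs/tasks:
Apr 2 12:04:04 Tower kernel: Task dump for CPU 10:
Apr 2 12:04:04 Tower kernel: Call Trace:
Apr 2 12:02:27 Tower kernel: device vnet0 entered promiscuous mode
EOF
```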

Since I was having to restart anyway, I upgraded to 6.3.3.  That hasn't resolved the hung console issue.  As of right now, the server has been up for 17.5+ hours and appears to be working fine, except for the hung console.  I can browse the shares, read/write data, and telnet into the server.  CPU average load is 0-1%, memory is at 3%, temps are below 40C, fan speeds are good, and the GUI is responsive; the only issue seems to be the hung console.

 

So in this mostly-working state, I just tried to install Lubuntu again.  This time it got further in the install process, showing the language prompt, but I was unable to click Continue.  After another minute, the log showed the detected stalls/task dump/call trace info again.  I tried to kill the VM with a "Force Stop" in the GUI and got this:

 

Quote

Execution error

Failed to terminate process 902 with SIGKILL: Device or resource busy

I tried again with the same result. 
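
A process that shrugs off SIGKILL with "Device or resource busy" is almost always stuck in uninterruptible sleep ("D" state) inside the kernel, in this case presumably the qemu/KVM vCPU thread. A quick sketch to confirm (generic on purpose; PID 902 won't exist after a reboot, so this lists any D-state processes rather than checking that specific PID):

```shell
# Sketch: list every process currently in uninterruptible sleep. A task shown
# here cannot be killed, even with SIGKILL, until the kernel operation it is
# blocked on completes (which in a stalled-vCPU situation may be never).
ps -eo pid,stat,comm | awk '$2 ~ /^D/'
```

On a healthy system this usually prints nothing; a stuck KVM thread would show up here during the hang.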

 

The server then went into a slow death spiral.  Initially I had "top" running in a telnet window; the server wasn't hung, the GUI was still responsive, and CPU load was low.  Top indicated one "zombie" process, but otherwise things looked okay.  Over the next 10 minutes the GUI got slower and eventually stopped responding, though the telnet session and top were still okay.  Eventually the telnet session froze too.  The log output window showed 4 task dumps for CPU 12, plus another dump each for CPUs 13, 14, and 15.  At this point the server was fully hung, and I had to restart it again.

 

What's extra odd about all this is that I was able to test successfully with a Windows 10 VM while Global C-state Control was disabled, though at the time I had quite a few more settings disabled in the BIOS.  I also didn't test with a Linux VM at the time.

 

Disabling C-states seems to be a common practice, especially for overclocking, so I wouldn't expect it to be the root cause of these new issues.

 

At this point I'm feeling pretty beaten up and defeated.  

 

I might try a few more tests today, to see if the Lubuntu issue crops up on a fresh boot, and to see if it occurs on a Windows VM.  If it still has those issues, I may try disabling more stuff in the BIOS again to see if I can address the issue.  Beyond that, I'm just about ready to give up.

 

-Paul

Link to comment

Just tried to install Lubuntu in a VM again, and with the server having been up less than 5 minutes, the install process got much further.  I actually thought it was going to succeed.  Eventually I got an error message on the Lubuntu desktop that the "Installer crashed", and then VNC hung.

 

Then the unRAID log showed that stalls were detected on the CPUs and started outputting the Call Traces, the same error messages as before.

 

This time the server crashed pretty quickly, within a minute or two.  

 

I can't rule out a defective install ISO, so I'll try with a Windows VM next.

 

-Paul

Link to comment
