unRAID OS version 6.4.0 Stable Release Available


limetech


17 minutes ago, limetech said:

 

Try adding to your append line in syslinux:


video=efifb:off

From: https://www.redhat.com/archives/vfio-users/2016-April/msg00236.html

 

Please report if this solves the problem.

 

Thank you for the reply,

 

I tested with UEFI boot off, and that solved the issue.

 

With UEFI back on and video=efifb:off, I have no video output after grub (I suppose that parameter is the cause), but the VM works fine, with no errors in the log.

 

thank you all again for the help

7 minutes ago, zin105 said:

 

Yeah, go into /boot/config/ssl/certs. The self-signed one will be called "YOURUNRAIDNAME_unraid_bundle.pem"; the one without your unRAID name at the beginning is the Let's Encrypt one, AFAIK. You would just replace the self-signed one. Worked for me!

Thanks man!
Wow I'm dumb, I didn't even look at the hints for "Use SSL/TLS". Half the info I needed was in there already. 

So just to fact check myself:

Generate new pem in /boot/config/ssl/certs
Rename certificate_bundle.pem to .old for safe keeping - This should nix the Let's Encrypt cert that I generated earlier while using "auto"
Rename <server-name-here>_unraid_bundle.pem to .old for safe keeping - This should nix the self-signed cert with .local fqdn
Move generated cert with proper fqdn to /boot/config/ssl/certs, name it <server-name-here>_unraid_bundle.pem
Restart nginx service and the nginx startup script should see only my new cert and serve it out?
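
For anyone who wants to script the rename steps above, here is a minimal sketch. It assumes the two file names mentioned in this thread (certificate_bundle.pem and <server-name-here>_unraid_bundle.pem); the function name swap_in_certificate and the server name "Tower" are hypothetical, and the demo runs in a throwaway directory instead of /boot/config/ssl/certs. It does not restart nginx.

```python
import shutil
import tempfile
from pathlib import Path


def swap_in_certificate(cert_dir, server_name, new_cert):
    """Back up the existing bundles as .old and install new_cert.

    File names follow the ones discussed in this thread; the function
    itself is only an illustration, not part of unRAID.
    """
    cert_dir = Path(cert_dir)
    self_signed = cert_dir / f"{server_name}_unraid_bundle.pem"
    existing = [
        cert_dir / "certificate_bundle.pem",  # Let's Encrypt bundle
        self_signed,                          # self-signed bundle
    ]
    for bundle in existing:
        if bundle.exists():
            # Keep the old file around as <name>.old for safe keeping.
            bundle.rename(bundle.with_name(bundle.name + ".old"))
    shutil.copyfile(new_cert, self_signed)
    return self_signed


# Demo in a temporary directory (nothing on the flash drive is touched).
with tempfile.TemporaryDirectory() as tmp:
    certs = Path(tmp) / "certs"
    certs.mkdir()
    (certs / "certificate_bundle.pem").write_text("letsencrypt")
    (certs / "Tower_unraid_bundle.pem").write_text("old self-signed")
    new_cert = Path(tmp) / "new.pem"
    new_cert.write_text("new self-signed")

    swap_in_certificate(certs, "Tower", new_cert)
    print(sorted(p.name for p in certs.iterdir()))
```

After running something like this against the real directory, nginx would still need to be restarted (or SSL toggled) before the new cert is picked up.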

There is another section in the hints about nginx stapling that I am not familiar with, but I assume it's just used for prioritizing which cert is used, right?

Link to comment

I made the upgrade from 6.3.5 to 6.4.0 this morning, and upon reboot I'm getting nothing from my server.

I can login as root with my password ok.

The command line terminal and GUI load up, but the gui web browser states it can't connect to localhost. I can't ping anything in the terminal window either.

I've tried safe mode and there is no change.

 

Please help. I'm very confused, and not sure what to try next.

 

 

41 minutes ago, Pauven said:

I have disabled Global C-state Control again, so hopefully I can be stable on 6.4.0 while this is being addressed.

 

I'm a little confused by your post.  Without C6 disabled, either via bios or zenstates, and no special kernel options, I would expect Ryzen to hang eventually.

Are you saying that even when using zenstates to disable C6, it still crashes?

And now you are trying "Disable Global c-state Control"?  (which I presume is a bios option?)

 

45 minutes ago, Pauven said:

Here's what I don't understand:  If this is an AMD bug, and is AMD's responsibility to fix, why did the fixes that were put into rc7a work so well?  And why can't those same fixes be brought forward to 6.4.0?

 

In version -rc7a we were on the 4.12.3 kernel and we enabled these options:

CONFIG_RCU_NOCB_CPU: Offload RCU callback processing from boot-selected CPUs
CONFIG_RCU_NOCB_CPU_ALL: All CPUs are build_forced no-CBs CPUs

This seemed to work to stop Ryzen hangs.  But in one of the subsequent kernel releases these options were removed and replaced with:

CONFIG_RCU_NOCB_CPU

Which we have enabled.  There is now a kernel option that needs to be specified:

https://bugzilla.kernel.org/show_bug.cgi?id=196683#c13

 

As you have discovered, disabling C6 state seems to solve the problem as well, though you give up certain benefits of Ryzen by doing so (turbo burst, or whatever they're calling that).  Unfortunately, neither solution appears to be 100% for all users.  IMHO the original "fix" in -rc7a worked purely by accident.

 

We have been in contact with AMD support and all we got was hand waving.  Those jokers know there is a problem but they are keeping silent on the issue, blaming it on bad power supplies.

 

I did notice in 4.14.13 there was a kernel patch to permit microcode update (fam17h is Ryzen):

commit 46789641800ca2077acb66c6cbe8e2ce7575113c
Author: Tom Lendacky <[email protected]>
Date:   Thu Nov 30 16:46:40 2017 -0600

    x86/microcode/AMD: Add support for fam17h microcode loading
    
    commit f4e9b7af0cd58dd039a0fb2cd67d57cea4889abf upstream.
    
    The size for the Microcode Patch Block (MPB) for an AMD family 17h
    processor is 3200 bytes.  Add a #define for fam17h so that it does
    not default to 2048 bytes and fail a microcode load/update.
    
    Signed-off-by: Tom Lendacky <[email protected]>
    Signed-off-by: Thomas Gleixner <[email protected]>
    Reviewed-by: Borislav Petkov <[email protected]>
    Link: https://lkml.kernel.org/r/[email protected]
    Signed-off-by: Ingo Molnar <[email protected]>
    Cc: Alice Ferrazzi <[email protected]>
    Signed-off-by: Greg Kroah-Hartman <[email protected]>

Apparently, prior to this patch it was not possible for AMD to issue a Ryzen microcode update.

 

My theory: AMD will fix Ryzen hang issue in the greater context of providing microcode updates to fix the Spectre vulnerability.  This is just a hunch, I have no inside knowledge of the real issue.

1 minute ago, arombo said:

I made the upgrade from 6.3.5 to 6.4.0 this morning, and upon reboot I'm getting nothing from my server.

I can login as root with my password ok.

The command line terminal and GUI load up, but the gui web browser states it can't connect to localhost. I can't ping anything in the terminal window either.

I've tried safe mode and there is no change.

 

Please help. I'm very confused, and not sure what to try next.

 

 

From the terminal, enter in

diagnostics

and then upload the resulting file stored on your flash drive (/logs/...)  Or give it a couple of minutes and see what happens.

13 minutes ago, zin105 said:

 

Exactly! Instead of restarting the Nginx service I just set SSL to No and then Yes again, and I could verify that my new cert was used.

Awesome. That worked, and now I have the correct fqdn and info from the generated cert, but when I check out the cert in Firefox it still looks like it's using the Let's Encrypt cert. This is based on the common name and issued by. And I can still reach the GUI with the generated address from limetech dns, but I'm assuming that record will get scavenged eventually.

Did you see the same results?
EDIT: It looks like this should go away on reboot/nginx service restart. If you cat /etc/nginx/nginx.conf, it looks like the cert being served by nginx is still the Let's Encrypt cert that was generated:

    ssl_certificate          /etc/ssl/certs/unraid_bundle.pem;
    ssl_certificate_key      /etc/ssl/certs/unraid_bundle.pem;
    ssl_trusted_certificate  /etc/ssl/certs/unraid_bundle.pem;



It also looks like I may have piqued the interest of limetech by suggesting that functionality be added to edit the details of the cert that is generated with Let's Encrypt, so that we have the option to input our own details as well as opt in and out of limetech dns and autorenewal.

Edited by manofcolombia
2 hours ago, Squid said:

It's not just CA that's causing those errors.  So you do seem to have some other problems.  If I was going to take a guess, you have CA (along with Preclear Disk) open on another system / browser tab from before your reboot to install 6.4

 

 

I didn't know having multiple tabs of unRAID open was a problem, and I wasn't using Preclear. I saw Preclear show up in the logs, so I tried deleting it, but that didn't solve the problem. Then I saw Community Applications mentioned in the logs, so I tried deleting it, and that did stop the errors. That said, I reinstalled CA later and haven't seen any of the previous errors.

10 minutes ago, manofcolombia said:

It also looks like I may have piqued the interest of limetech by suggesting that functionality be added to edit the details of the cert that is generated with Let's Encrypt, so that we have the option to input our own details as well as opt in and out of limetech dns and autorenewal.

 

The change would be only to let you configure the TLD for the self-signed certificate.  The details of the Let's Encrypt SSL cert we probably will not change.

Just now, limetech said:

 

The change would be only to let you configure the TLD for the self-signed certificate.  The details of the Let's Encrypt SSL cert we probably will not change.

Understandable. TLD input would be very nice on its own. Would it be possible to implement toggling of limetech DNS without ruining the auto renewal or are those tied together in a way that I am not thinking of?

27 minutes ago, limetech said:
1 hour ago, Pauven said:

I have disabled Global C-state Control again, so hopefully I can be stable on 6.4.0 while this is being addressed.

 

I'm a little confused by your post.  Without C6 disabled, either via bios or zenstates, and no special kernel options, I would expect Ryzen to hang eventually.

Are you saying that even when using zenstates to disable C6, it still crashes?

And now you are trying "Disable Global c-state Control"?  (which I presume is a bios option?)

 

Sorry, I was not clear.

 

'Global C-state Control' is the BIOS setting that must be disabled to prevent Ryzen hangs.  To be even more clear, you will also see in the Ryzen BIOS various options to Disable C6 or other C-states - these do not help.  Only the 'Global C-state Control' setting has an impact.  It is not intuitive that the other c-state related settings don't accomplish the same thing.

 

So what I was writing in my post above was that I have now disabled 'Global C-state Control' in my BIOS for the first time since pre-rc7a.  And yes, using only ZenStates to disable C6 it still crashes.

 

Perhaps by researching what 'Global C-state Control' does, you can better understand the nature of the problem and solution.

 

 

27 minutes ago, limetech said:

CONFIG_RCU_NOCB_CPU_ALL: All CPUs are build_forced no-CBs CPUs

 

Seems like that was the magic sauce.  Is it possible to see what exactly this option accomplished by reviewing the old source code for 4.12.3?

 

 

27 minutes ago, limetech said:

I did notice in 4.14.13 there was a kernel patch to permit microcode update (fam17h is Ryzen):

Apparently, prior to this patch it was not possible for AMD to issue a Ryzen microcode update.

...

My theory: AMD will fix Ryzen hang issue in the greater context of providing microcode updates to fix the Spectre vulnerability.  This is just a hunch, I have no inside knowledge of the real issue.

 

That seems like a big leap of logic.  I would imagine that AMD would have eventually provided a mechanism for microcode updates from Linux, and Spectre probably forced their hand to get this done sooner than planned.

 

While it would be nice if AMD provided a Ryzen hang fix via microcode, to me there is not enough info to even speculate on this.  Besides, from my testing, Windows and other Linux distributions did not suffer from this issue, plus the CONFIG_RCU_NOCB_CPU_ALL option fixed this issue on unRAID without a microcode update.

 

I also want to point out that I tried extensively to get other linux distros to crash, including running the same slackware distro and kernel version that you were utilizing for unRAID.  I never had a crash outside unRAID.  I know that "one data-point does not a fact make", and I'm not trying to point fingers, just sharing my experiences with my particular Ryzen system, which has attained 'Canary in a Coal-Mine' status - if there is a problem, my system hangs quicker than the rest, and has only ever hung while running unRAID, and never hung with rc7a or with 'Global C-state Control' disabled.

 

Paul

Edited by Pauven
Just now, zin105 said:

 

I don't know, sorry. I don't allow DNS rebind in my router, so I never tried it after I replaced everything.

All good. I have a feeling that rebooting would resolve the issue. Or at least having the startup script that was mentioned in the SSL/TLS hint rerun to rebuild the nginx.conf would as well.

@limetech Is there a way to invoke the startup script that is mentioned in the SSL/TLS hint section, so that I can test whether my new self-signed cert is properly written into the nginx.conf?

Currently my nginx.conf has:

 

    ssl_certificate          /etc/ssl/certs/unraid_bundle.pem;
    ssl_certificate_key      /etc/ssl/certs/unraid_bundle.pem;
    ssl_trusted_certificate  /etc/ssl/certs/unraid_bundle.pem;

I know I can manually edit it, but I'd like to emulate the script running on boot to ensure that if a reboot ever occurs that it would work correctly.

1 minute ago, johnnie.black said:

You're likely mistaken about the filesystem; you have a non-standard unRAID partition, see here:

https://lime-technology.com/forums/topic/65494-unraid-os-version-640-stable-release-available/?do=findComment&comment=619388

 

Thanks, I used the unassigned devices to pull the data off as trurl mentioned above.

 

Loaded it back mounted as cache and went to format through the UI, got the following error.

Jan 14 14:01:57 Tower emhttpd: req (29): startState=STARTED&file=&cmdFormat=Format&unmountable_mask=1073741824&confirmFormat=OFF&optionCorrect=correct&csrf_token=****************
Jan 14 14:01:59 Tower emhttpd: shcmd (1127): /sbin/wipefs -a /dev/sdd1
Jan 14 14:01:59 Tower emhttpd: shcmd (1128): mkdir -p /mnt/cache
Jan 14 14:01:59 Tower emhttpd: shcmd (1129): mount -t ext4 -o noatime,nodiratime /dev/sdd1 /mnt/cache
Jan 14 14:01:59 Tower root: mount: /mnt/cache: wrong fs type, bad option, bad superblock on /dev/sdd1, missing codepage or helper program, or other error.
Jan 14 14:01:59 Tower emhttpd: shcmd (1129): exit status: 32
Jan 14 14:01:59 Tower emhttpd: /mnt/cache mount error: No file system
Jan 14 14:01:59 Tower emhttpd: shcmd (1130): umount /mnt/cache
Jan 14 14:01:59 Tower kernel: EXT4-fs (sdd1): VFS: Can't find ext4 filesystem
Jan 14 14:01:59 Tower root: umount: /mnt/cache: not mounted.
Jan 14 14:01:59 Tower emhttpd: shcmd (1130): exit status: 32
Jan 14 14:01:59 Tower emhttpd: shcmd (1131): rmdir /mnt/cache
Jan 14 14:01:59 Tower emhttpd: Starting services...
Jan 14 14:01:59 Tower emhttpd: no mountpoint along path: /mnt/cache

Should I just use the terminal instead of the UI or will it not make a difference?

Thanks again!

21 minutes ago, arombo said:

 

Ok.

 

Diagnostics ran successfully. Here's the file.

monolith-iii-diagnostics-20180114-1037.zip

It looks like you have some disks not found.

Jan 14 10:36:37 Tower kernel: mdcmd (1): import 0
Jan 14 10:36:37 Tower kernel: md: import_slot: 0 missing
Jan 14 10:36:37 Tower kernel: mdcmd (2): import 1
Jan 14 10:36:37 Tower kernel: md: import_slot: 1 missing
Jan 14 10:36:37 Tower kernel: mdcmd (3): import 2
Jan 14 10:36:37 Tower kernel: md: import_slot: 2 missing
Jan 14 10:36:37 Tower kernel: mdcmd (4): import 3
Jan 14 10:36:37 Tower kernel: md: import_slot: 3 missing
Jan 14 10:36:37 Tower kernel: mdcmd (5): import 4
Jan 14 10:36:37 Tower kernel: md: import_slot: 4 missing
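
If you want to pull the affected slots out of a longer syslog yourself, a small sketch like the following would do it. It only assumes the "md: import_slot: N missing" wording seen in the lines above; the helper name missing_slots is made up for this example.

```python
import re
from typing import List

# Matches kernel lines such as "md: import_slot: 3 missing".
MISSING_SLOT = re.compile(r"md: import_slot: (\d+) missing")


def missing_slots(log_text: str) -> List[int]:
    """Return the sorted, de-duplicated slot numbers reported as missing."""
    return sorted({int(m.group(1)) for m in MISSING_SLOT.finditer(log_text)})


sample = """\
Jan 14 10:36:37 Tower kernel: mdcmd (1): import 0
Jan 14 10:36:37 Tower kernel: md: import_slot: 0 missing
Jan 14 10:36:37 Tower kernel: mdcmd (2): import 1
Jan 14 10:36:37 Tower kernel: md: import_slot: 1 missing
"""
print(missing_slots(sample))  # [0, 1]
```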

 

14 minutes ago, Pauven said:
39 minutes ago, limetech said:

CONFIG_RCU_NOCB_CPU_ALL: All CPUs are build_forced no-CBs CPUs

 

Seems like that was the magic sauce.  Is it possible to see what exactly this option accomplished by reviewing the old source code for 4.12.3?

That *_ALL just says the option should operate on all processor cores.

If the specific range of cores your chip has is specified as an option to the kernel, then you can still get the same functionality.

 

It was just evil of the Kernel maintainers to remove that auto logic to have the option automatically operate on all cores.

5 minutes ago, Sean M. said:

Jan 14 14:01:59 Tower emhttpd: shcmd (1129): mount -t ext4 -o noatime,nodiratime /dev/sdd1 /mnt/cache

 

How come it's trying to mount the cache disk with ext4? That's not a supported filesystem. Stop the array, change the cache disk filesystem to auto, then reformat.

10 minutes ago, Pauven said:

Perhaps by researching what 'Global C-state Control' does, you can better understand the nature of the problem and solution.

Right, you think getting info out of AMD is bad, try getting m/b manufacturers to tell you what their bios is doing.

 

11 minutes ago, Pauven said:

Seems like that was the magic sauce.  Is it possible to see what exactly this option accomplished by reviewing the old source code for 4.12.3?

Like I said, I think it probably works by accident, meaning it masks the issue, but doesn't fundamentally solve it.  Meaning, a later kernel change can cause the hangs to come back, which is exactly what we see.  That said, did you try the 'rcu_nocbs=0-11' kernel option?

 

15 minutes ago, Pauven said:

I also want to point out that I tried extensively to get other linux distros to crash, including running the same slackware distro and kernel version that you were utilizing for unRAID.  I never had a crash outside unRAID. 

You're absolutely sure about that? Same kernel?  That's not the experience of people posting in the bugzilla topic.  Show me a distro that does not hang using 4.14.13 kernel and we can do a comparison of the kernel config.  Maybe there is some option we omit (or add) that others don't, but one would expect that magic option to be documented in the bugzilla topic, right?

 

As for AMD - they have explicitly told us there are bios "fixes" they are rolling out to motherboard manufacturers for this issue.  What do you think they are putting in the new bios?  Answer: updated microcode.

 

4 minutes ago, limetech said:

I thought I read somewhere it originated from a guy at Intel :ph34r:

It isn't impossible.

 

I just know the feature was removed in:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/init/Kconfig?h=v4.14-rc3&id=44c65ff2e3b0b48250a970183ab53b0602c25764

 

"Signed-off-by: Paul E. McKenney <[email protected]>"

 

But

rcu_nocbs=0-15

as a kernel option for a processor with 16 threads should result in the same action as if the kernel still had the CONFIG_RCU_NOCB_CPU_ALL parameter. It just adds the requirement that the system owner must know exactly what parameter value to supply for their processor.
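
As a rough illustration of how that value follows from the thread count (CPUs are numbered from 0, so 16 threads means 0-15), here is a minimal sketch. The helper name rcu_nocbs_arg is made up for this example, and os.cpu_count() is only meaningful when run on the server itself.

```python
import os


def rcu_nocbs_arg(thread_count: int) -> str:
    """Build an rcu_nocbs= kernel option covering every logical CPU.

    Logical CPUs are numbered from 0, so the range ends at thread_count - 1.
    """
    if thread_count < 1:
        raise ValueError("thread_count must be at least 1")
    if thread_count == 1:
        return "rcu_nocbs=0"
    return f"rcu_nocbs=0-{thread_count - 1}"


print(rcu_nocbs_arg(16))  # rcu_nocbs=0-15
print(rcu_nocbs_arg(12))  # rcu_nocbs=0-11

# On the machine in question, the logical CPU count can be read directly.
print(rcu_nocbs_arg(os.cpu_count() or 1))
```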

11 minutes ago, limetech said:

Right, you think getting info out of AMD is bad, try getting m/b manufacturers to tell you what their bios is doing.

 

Technically, I think the m/b manufacturers are just programming to AMD spec, so the problem is still getting AMD to say what this feature does on their processor.

 

 

12 minutes ago, limetech said:

Like I said, I think it probably works by accident, meaning it masks the issue, but doesn't fundamentally solve it.  Meaning, a later kernel change can cause the hangs to come back, which is exactly what we see.  That said, did you try the 'rcu_nocbs=0-11' kernel option?

 

No, I have never tried the rcu_nocbs kernel option.  Since I have a Ryzen R7-1800X, I would need 'rcu_nocbs=0-15', correct?  I don't know how to apply this, can you point me in the right direction please?

 

 

14 minutes ago, limetech said:

You're absolutely sure about that? Same kernel?  That's not the experience of people posting in the bugzilla topic. 

 

To the best of my ability, yes same kernel.  But this was 9 months ago, when Ryzen was first released.  I think my testing was on the 4.10 or 4.11 branches, and after rc7a I never tested again with any new kernels, as I thought this was a resolved issue.

 

 

16 minutes ago, limetech said:

Show me a distro that does not hang using 4.14.13 kernel and we can do a comparison of the kernel config.  Maybe there is some option we omit (or add) that others don't, but one would expect that magic option to be documented in the bugzilla topic, right?

 

I'll do ya one better:  Point me to a distro that does hang, and I will test it.  If it hangs, I'll shut up about it, but if it doesn't...

 

 

17 minutes ago, limetech said:

As for AMD - they have explicitly told us there are bios "fixes" they are rolling out to motherboard manufacturers for this issue.  What do you think they are putting in the new bios?  Answer: updated microcode.

 

Is this information public somewhere?  I would appreciate a pointer if it is.  I am a major AMD shareholder, and don't mind pursuing Investor Relations for an answer to this issue, but I need to know what I'm talking about.

 

Thanks,

Paul

Link to comment

I tried installing last night from 3.5 and the GUI took forever to come up. When it did my parity drive was not showing up and neither was my cache.  The array did not start so I shut it down and restarted and watched the CLI.  It got stuck on

rip: isci_task_abort_task+0x18/0x334 

for about 5 min.  It finally appeared to finish but I could never log back into the GUI even though it showed connected.  Had to reflash the flash drive with 3.5 and resetup my parity and array.  Luckily I knew what space each drive was before I flashed.  Had to resetup plex and lost all my other docker apps.  Has anyone dealt with the

rip: isci_task_abort_task+0x18/0x334 

or know how to fix it?  I saw people had this issue in the beta but can't see where or how they fixed it.  I do not see an option for a floppy on my motherboard to disable.  Thanks

13 minutes ago, Pauven said:

No, I have never tried the rcu_nocbs kernel option.  Since I have a Ryzen R7-1800X, I would need 'rcu_nocbs=0-15', correct?  I don't know how to apply this, can you point me in the right direction please?

 

You need to edit syslinux/syslinux.cfg on your usb flash boot device.  This can be done by clicking Flash device on Main page or directly via command line.

 

You need to find the section where you see "menu default" - that is the section used for default boot-up.  For example, suppose you are booting normally (not console-gui mode), you would change:

 

label unRAID OS
  menu default
  kernel /bzimage
  append initrd=/bzroot

to:

label unRAID OS
  menu default
  kernel /bzimage
  append initrd=/bzroot rcu_nocbs=0-15

That's it.  Then reboot.
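
If you would rather script the edit than do it by hand, the change above can be sketched roughly like this. It assumes "menu default" appears before the "append" line within its section, as in the example above; the function name add_kernel_param and the second boot entry in the sample are only illustrative. It works on text, so you would still need to read and write syslinux/syslinux.cfg yourself (and keep a backup).

```python
def add_kernel_param(cfg_text: str, param: str) -> str:
    """Add param to the append line of the section marked 'menu default'.

    Other boot entries are left untouched, and the parameter is not
    added twice if it is already present.
    """
    out = []
    in_default = False
    for line in cfg_text.splitlines():
        stripped = line.strip()
        lowered = stripped.lower()
        if lowered.startswith("label "):
            in_default = False  # a new boot entry starts here
        elif lowered == "menu default":
            in_default = True
        elif in_default and lowered.startswith("append "):
            if param not in stripped.split():
                line = f"{line.rstrip()} {param}"
        out.append(line)
    return "\n".join(out) + "\n"


sample = """\
label unRAID OS
  menu default
  kernel /bzimage
  append initrd=/bzroot
label unRAID OS GUI Mode
  kernel /bzimage
  append initrd=/bzroot,/bzroot-gui
"""
print(add_kernel_param(sample, "rcu_nocbs=0-15"))
```

Running it a second time on its own output leaves the file unchanged, which makes it safe to re-run.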

 

39 minutes ago, dlandon said:

It looks like you have some disks not found.


Jan 14 10:36:37 Tower kernel: mdcmd (1): import 0
Jan 14 10:36:37 Tower kernel: md: import_slot: 0 missing
Jan 14 10:36:37 Tower kernel: mdcmd (2): import 1
Jan 14 10:36:37 Tower kernel: md: import_slot: 1 missing
Jan 14 10:36:37 Tower kernel: mdcmd (3): import 2
Jan 14 10:36:37 Tower kernel: md: import_slot: 2 missing
Jan 14 10:36:37 Tower kernel: mdcmd (4): import 3
Jan 14 10:36:37 Tower kernel: md: import_slot: 3 missing
Jan 14 10:36:37 Tower kernel: mdcmd (5): import 4
Jan 14 10:36:37 Tower kernel: md: import_slot: 4 missing

 

 

Just double checked my connections. Don't think anything is off still.

 

It appears that I've lost my network interfaces. 

Any idea how to re-add them, or get them detected again?

 

Latest diagnostics log attached. 

 

monolith-iii-diagnostics-20180114-1142.zip

  • limetech unpinned and locked this topic
This topic is now closed to further replies.