francesco995 Posted January 14, 2018

17 minutes ago, limetech said:
Try adding to your append line in syslinux: video=efifb:off
From: https://www.redhat.com/archives/vfio-users/2016-April/msg00236.html
Please report if this solves the problem.

Thank you for the reply. I tested with UEFI boot off, and that solved the issue. With UEFI back on and video=efifb:off, I have no video output after grub (I suppose it's that parameter), but the VM works fine, with no errors in the log. Thank you all again for the help.
manofcolombia Posted January 14, 2018

7 minutes ago, zin105 said:
Yeah go into /boot/config/ssl/certs. The self-signed one will be called "YOURUNRAIDNAME_unraid_bundle.pem"; the one without your unRAID name at the beginning is the Let's Encrypt one AFAIK. You would just replace the self-signed one. Worked for me!

Thanks man! Wow I'm dumb, I didn't even look at the hints for "Use SSL/TLS". Half the info I needed was in there already. So just to fact-check myself:

1. Generate new pem in /boot/config/ssl/certs
2. Rename certificate_bundle.pem to .old for safekeeping - this should nix the Let's Encrypt cert that I generated earlier while using "auto"
3. Rename <server-name-here>_unraid_bundle.pem to .old for safekeeping - this should nix the self-signed cert with the .local FQDN
4. Move the generated cert with the proper FQDN to /boot/config/ssl/certs and name it <server-name-here>_unraid_bundle.pem
5. Restart the nginx service, and the nginx startup script should see only my new cert and serve it out?

There is another section in the hints about nginx stapling that I am not familiar with, but I assume it's just used for prioritizing which cert is used, right?
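The rename-and-replace steps listed in the post above could be scripted roughly as follows. This is only a sketch: the function name `install_custom_bundle` and its arguments are hypothetical, the `.old` suffix is simply appended to the existing file names, and the new bundle is assumed to be a single PEM file that already holds both the private key and the certificate. Restarting nginx is left out on purpose.

```shell
#!/bin/bash
# Hypothetical helper (not part of unRAID): back up the existing bundles in
# cert_dir and install a new one under the self-signed bundle's name.
# Assumption: new_bundle is a PEM file containing both key and certificate.
install_custom_bundle() {
  local cert_dir="$1" server_name="$2" new_bundle="$3"
  local f
  for f in "certificate_bundle.pem" "${server_name}_unraid_bundle.pem"; do
    # Keep the old file as <name>.old instead of deleting it.
    if [ -f "$cert_dir/$f" ]; then
      mv "$cert_dir/$f" "$cert_dir/$f.old"
    fi
  done
  # Install the new bundle under the name the self-signed one used.
  cp "$new_bundle" "$cert_dir/${server_name}_unraid_bundle.pem"
}

# Example call (server name and source path are placeholders):
#   install_custom_bundle /boot/config/ssl/certs Tower /tmp/my_bundle.pem
```

Keeping the old files as `.old` copies means the change can be undone by renaming them back.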
arombo Posted January 14, 2018

I upgraded from 6.3.5 to 6.4.0 this morning, and upon reboot I'm getting nothing from my server. I can log in as root with my password OK. The command line terminal and GUI load up, but the GUI web browser states it can't connect to localhost. I can't ping anything in the terminal window either. I've tried safe mode and there is no change. Please help. I'm very confused, and not sure what to try next.
limetech (Author) Posted January 14, 2018

41 minutes ago, Pauven said:
I have disabled Global C-state Control again, so hopefully I can be stable on 6.4.0 while this is being addressed.

I'm a little confused by your post. Without C6 disabled, either via BIOS or zenstates, and with no special kernel options, I would expect Ryzen to hang eventually. Are you saying that it still crashes when using zenstates to disable C6? And now you are trying "Disable Global C-state Control" (which I presume is a BIOS option)?

45 minutes ago, Pauven said:
Here's what I don't understand: If this is an AMD bug, and is AMD's responsibility to fix, why did the fixes that were put into rc7a work so well? And why can't those same fixes be brought forward to 6.4.0?

In version -rc7a we were on the 4.12.3 kernel and we enabled these options:
CONFIG_RCU_NOCB_CPU: Offload RCU callback processing from boot-selected CPUs
CONFIG_RCU_NOCB_CPU_ALL: All CPUs are build_forced no-CBs CPUs

This seemed to stop the Ryzen hangs. But in one of the subsequent kernel releases these options were removed and replaced with:
CONFIG_RCU_NOCB_CPU
which we have enabled. There is now a kernel option that needs to be specified: https://bugzilla.kernel.org/show_bug.cgi?id=196683#c13

As you have discovered, disabling the C6 state seems to solve the problem as well, though you give up certain benefits of Ryzen by doing so (turbo burst, or whatever they're calling that). Unfortunately, neither solution appears to be 100% for all users. IMHO the original "fix" in -rc7a worked purely by accident. We have been in contact with AMD support and all we got was hand waving. Those jokers know there is a problem but they are keeping silent on the issue, blaming it on bad power supplies.
I did notice in 4.14.13 there was a kernel patch to permit microcode update (fam17h is Ryzen):

commit 46789641800ca2077acb66c6cbe8e2ce7575113c
Author: Tom Lendacky <[email protected]>
Date: Thu Nov 30 16:46:40 2017 -0600

    x86/microcode/AMD: Add support for fam17h microcode loading

    commit f4e9b7af0cd58dd039a0fb2cd67d57cea4889abf upstream.

    The size for the Microcode Patch Block (MPB) for an AMD family 17h processor is 3200 bytes. Add a #define for fam17h so that it does not default to 2048 bytes and fail a microcode load/update.

    Signed-off-by: Tom Lendacky <[email protected]>
    Signed-off-by: Thomas Gleixner <[email protected]>
    Reviewed-by: Borislav Petkov <[email protected]>
    Link: https://lkml.kernel.org/r/[email protected]
    Signed-off-by: Ingo Molnar <[email protected]>
    Cc: Alice Ferrazzi <[email protected]>
    Signed-off-by: Greg Kroah-Hartman <[email protected]>

Prior to this patch, it has apparently not been possible for AMD to issue a Ryzen microcode update. My theory: AMD will fix the Ryzen hang issue in the greater context of providing microcode updates to fix the Spectre vulnerability. This is just a hunch; I have no inside knowledge of the real issue.
Squid Posted January 14, 2018

1 minute ago, arombo said:
I made the upgrade from 6.3.5 to 6.4.0 this morning, and upon reboot I'm getting nothing from my server. I can login as root with my password ok. The command line terminal and GUI load up, but the gui web browser states it can't connect to localhost. I can't ping anything in the terminal window either. I've tried safe mode and there is no change. Please help. I'm very confused, and not sure what to try next.

From the terminal, enter diagnostics and then upload the resulting file, which is stored on your flash drive (/logs/...). Or give it a couple of minutes and see what happens.
manofcolombia Posted January 14, 2018 (edited)

13 minutes ago, zin105 said:
Exactly! Instead of restarting the Nginx service I just set SSL to No and then Yes again, and I could verify that my new cert was used.

Awesome. That worked and now I have the correct FQDN and info from the generated cert, but when I check the cert in Firefox it still looks like it's using the Let's Encrypt cert, based on the common name and the issuer. I can also still reach the GUI with the generated address from limetech DNS, but I'm assuming that record will get scavenged eventually. Did you see the same results?

EDIT: It looks like this should go away on reboot or an nginx service restart. If you cat /etc/nginx/nginx.conf, it looks like the cert being served by nginx is still the Let's Encrypt cert that was generated:

ssl_certificate /etc/ssl/certs/unraid_bundle.pem;
ssl_certificate_key /etc/ssl/certs/unraid_bundle.pem;
ssl_trusted_certificate /etc/ssl/certs/unraid_bundle.pem;

It also looks like I may have piqued the interest of limetech by suggesting that functionality be added to edit the details of the cert that is generated with Let's Encrypt, so that we have the option to input our own details as well as opt in and out of limetech DNS and auto-renewal.

Edited January 14, 2018 by manofcolombia
Koshy Posted January 14, 2018

2 hours ago, Squid said:
It's not just CA that's causing those errors. So you do seem to have some other problems. If I was going to take a guess, you have CA (along with Preclear Disk) open on another system / browser tab from before your reboot to install 6.4

I didn't know having multiple tabs of unRAID open was a problem, and I wasn't using Preclear. I saw Preclear show up in the logs, so I tried deleting it, but that didn't solve the problem. Then I saw Community Applications mentioned in the logs, so I tried deleting it, and that did stop the errors. That said, I reinstalled CA later and haven't seen any of the previous errors.
arombo Posted January 14, 2018

16 minutes ago, Squid said:
From the terminal, enter in diagnostics and then upload the resulting file stored on your flash drive (/logs/...) Or give it a couple of minutes and see what happens.

Ok. Diagnostics ran successfully. Here's the file.
monolith-iii-diagnostics-20180114-1037.zip
limetech (Author) Posted January 14, 2018

10 minutes ago, manofcolombia said:
It also looks like I may have piqued the interest of limetech by suggesting that functionality be added to edit the details of the cert that is generated with letsencrypt so that we have the option to input our own details as well as opt in and out of limetech dns and autorenewal.

The change would only let you configure the TLD for the self-signed certificate. We probably will not change the details of the Let's Encrypt SSL cert.
manofcolombia Posted January 14, 2018

Just now, limetech said:
The change would be only to let you configure the TLD for the self-signed certificate. The details of the Let's Encrypt SSL cert we probably will not change.

Understandable. TLD input would be very nice on its own. Would it be possible to implement toggling of limetech DNS without ruining the auto-renewal, or are those tied together in a way that I am not thinking of?
Pauven Posted January 14, 2018 (edited)

27 minutes ago, limetech said:
1 hour ago, Pauven said:
I have disabled Global C-state Control again, so hopefully I can be stable on 6.4.0 while this is being addressed.
I'm a little confused by your post. Without C6 disabled, either via bios or zenstates, and no special kernel options, I would expect Ryzen to hang eventually. Are you saying that, with using zenstates to disable C6 it still crashes? And now you are trying "Disable Global c-state Control"? (which I presume is a bios option?)

Sorry, I was not clear. 'Global C-state Control' is the BIOS setting that must be disabled to prevent Ryzen hangs. To be even more clear, you will also see in the Ryzen BIOS various options to disable C6 or other C-states - these do not help. Only the 'Global C-state Control' setting has an impact. It is not intuitive that the other C-state related settings don't accomplish the same thing.

So what I was writing in my post above was that I have now disabled 'Global C-state Control' in my BIOS for the first time since pre-rc7a. And yes, using only ZenStates to disable C6, it still crashes.

Perhaps by researching what 'Global C-state Control' does, you can better understand the nature of the problem and solution.

27 minutes ago, limetech said:
CONFIG_RCU_NOCB_CPU_ALL: All CPUs are build_forced no-CBs CPUs

Seems like that was the magic sauce. Is it possible to see what exactly this option accomplished by reviewing the old source code for 4.12.3?

27 minutes ago, limetech said:
I did notice in 4.14.13 there was a kernel patch to permit microcode update (fam17h is Ryzen): Prior to this patch, it has not been possible for AMD to issue a Ryzen microcode update apparently. ... My theory: AMD will fix Ryzen hang issue in the greater context of providing microcode updates to fix the Spectre vulnerability. This is just a hunch, I have no inside knowledge of the real issue.

That seems like a big leap of logic.
I would imagine that AMD would have eventually provided a mechanism for microcode updates from Linux, and Spectre probably forced their hand to get this done sooner than planned. While it would be nice if AMD provided a Ryzen hang fix via microcode, to me there is not enough info to even speculate on this. Besides, from my testing, Windows and other Linux distributions did not suffer from this issue, and the CONFIG_RCU_NOCB_CPU_ALL option fixed this issue on unRAID without a microcode update.

I also want to point out that I tried extensively to get other Linux distros to crash, including running the same Slackware distro and kernel version that you were utilizing for unRAID. I never had a crash outside unRAID.

I know that "one data-point does not a fact make", and I'm not trying to point fingers, just sharing my experiences with my particular Ryzen system, which has attained 'Canary in a Coal-Mine' status: if there is a problem, my system hangs quicker than the rest. It has only ever hung while running unRAID, and never hung with rc7a or with 'Global C-state Control' disabled.

Paul

Edited January 14, 2018 by Pauven
manofcolombia Posted January 14, 2018

Just now, zin105 said:
I don't know, sorry. I don't allow DNS rebind in my router so I never tried it after I replaced everything.

All good. I have a feeling that rebooting would resolve the issue, or at least that rerunning the startup script mentioned in the SSL/TLS hint to rebuild the nginx.conf would.

@limetech Is there a way to invoke the startup script that is mentioned in the SSL/TLS hint section, so that I can test whether my new self-signed cert is properly put into the nginx.conf? Currently my nginx.conf has:

ssl_certificate /etc/ssl/certs/unraid_bundle.pem;
ssl_certificate_key /etc/ssl/certs/unraid_bundle.pem;
ssl_trusted_certificate /etc/ssl/certs/unraid_bundle.pem;

I know I can edit it manually, but I'd like to emulate the script running on boot to ensure that it would work correctly if a reboot ever occurs.
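One way to check which certificate file nginx is currently pointed at, as discussed in the post above, is to read the path from the config and then inspect that file. The following is a minimal sketch: `served_cert_path` is a hypothetical helper name, and it assumes each `ssl_certificate` directive sits on its own line, as in the excerpt quoted above.

```shell
#!/bin/bash
# Hypothetical helper: print the file named by the first ssl_certificate
# directive in an nginx config file, with the trailing ";" stripped.
# The exact match on the first field skips ssl_certificate_key and
# ssl_trusted_certificate.
served_cert_path() {
  awk '$1 == "ssl_certificate" { sub(/;$/, "", $2); print $2; exit }' "$1"
}

# Possible follow-up, assuming openssl is available on the server:
#   cert=$(served_cert_path /etc/nginx/nginx.conf)
#   openssl x509 -in "$cert" -noout -subject -issuer -enddate
```

Comparing the printed subject and issuer with what the browser shows should indicate whether the config and the running service agree.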
JorgeB Posted January 14, 2018

2 hours ago, marcosv said:
Any further suggestions on how to figure out where the incompatibility could be?

Try using a different flash drive, just to see if it completes the boot.
JorgeB Posted January 14, 2018

1 hour ago, Sean M. said:
I've never had an issue with this previously, so I guess my first question is, is there any way to get it back to allowing the cache SSD as ext4?

You're likely mistaken about the filesystem; you have a non-standard unRAID partition, see here: https://lime-technology.com/forums/topic/65494-unraid-os-version-640-stable-release-available/?do=findComment&comment=619388
Sean M. Posted January 14, 2018

1 minute ago, johnnie.black said:
You're likely mistaken about the filesystem, you have non standard unRAID partition, see here: https://lime-technology.com/forums/topic/65494-unraid-os-version-640-stable-release-available/?do=findComment&comment=619388

Thanks, I used Unassigned Devices to pull the data off, as trurl mentioned above. I loaded it back, mounted it as cache, went to format it through the UI, and got the following error:

Jan 14 14:01:57 Tower emhttpd: req (29): startState=STARTED&file=&cmdFormat=Format&unmountable_mask=1073741824&confirmFormat=OFF&optionCorrect=correct&csrf_token=****************
Jan 14 14:01:59 Tower emhttpd: shcmd (1127): /sbin/wipefs -a /dev/sdd1
Jan 14 14:01:59 Tower emhttpd: shcmd (1128): mkdir -p /mnt/cache
Jan 14 14:01:59 Tower emhttpd: shcmd (1129): mount -t ext4 -o noatime,nodiratime /dev/sdd1 /mnt/cache
Jan 14 14:01:59 Tower root: mount: /mnt/cache: wrong fs type, bad option, bad superblock on /dev/sdd1, missing codepage or helper program, or other error.
Jan 14 14:01:59 Tower emhttpd: shcmd (1129): exit status: 32
Jan 14 14:01:59 Tower emhttpd: /mnt/cache mount error: No file system
Jan 14 14:01:59 Tower emhttpd: shcmd (1130): umount /mnt/cache
Jan 14 14:01:59 Tower kernel: EXT4-fs (sdd1): VFS: Can't find ext4 filesystem
Jan 14 14:01:59 Tower root: umount: /mnt/cache: not mounted.
Jan 14 14:01:59 Tower emhttpd: shcmd (1130): exit status: 32
Jan 14 14:01:59 Tower emhttpd: shcmd (1131): rmdir /mnt/cache
Jan 14 14:01:59 Tower emhttpd: Starting services...
Jan 14 14:01:59 Tower emhttpd: no mountpoint along path: /mnt/cache

Should I just use the terminal instead of the UI or will it not make a difference? Thanks again!
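In a log like the one above, the failing steps are the `shcmd (N)` entries that are followed by a non-zero `exit status`. A rough sketch for pulling those out of a syslog file is shown below; `failed_shcmds` is a hypothetical helper, and it assumes the line layout seen in this excerpt (month, day, time, a one-word hostname, then `emhttpd: shcmd (N): ...`).

```shell
#!/bin/bash
# Hypothetical helper: list emhttpd shell commands that ended with a
# non-zero exit status. Assumes the log layout of the excerpt above, where
# the seventh whitespace-separated field is the command id, e.g. "(1129):".
failed_shcmds() {
  awk '
    /emhttpd: shcmd \([0-9]+\): exit status: / {
      id = $7
      # $NF is the numeric status; report it with the remembered command.
      if ($NF != 0 && (id in cmd)) print "exit " $NF ": " cmd[id]
      next
    }
    /emhttpd: shcmd \([0-9]+\): / {
      id = $7
      line = $0
      sub(/^.*shcmd \([0-9]+\): /, "", line)   # keep only the command text
      cmd[id] = line
    }
  ' "$1"
}

# Example call (the log path is an assumption):
#   failed_shcmds /var/log/syslog
```

For the excerpt above this would single out the `mount -t ext4 ...` and `umount /mnt/cache` commands, both of which exited with status 32.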
dlandon Posted January 14, 2018

21 minutes ago, arombo said:
Ok. Diagnostics ran successfully. Here's the file. monolith-iii-diagnostics-20180114-1037.zip

It looks like you have some disks not found.

Jan 14 10:36:37 Tower kernel: mdcmd (1): import 0
Jan 14 10:36:37 Tower kernel: md: import_slot: 0 missing
Jan 14 10:36:37 Tower kernel: mdcmd (2): import 1
Jan 14 10:36:37 Tower kernel: md: import_slot: 1 missing
Jan 14 10:36:37 Tower kernel: mdcmd (3): import 2
Jan 14 10:36:37 Tower kernel: md: import_slot: 2 missing
Jan 14 10:36:37 Tower kernel: mdcmd (4): import 3
Jan 14 10:36:37 Tower kernel: md: import_slot: 3 missing
Jan 14 10:36:37 Tower kernel: mdcmd (5): import 4
Jan 14 10:36:37 Tower kernel: md: import_slot: 4 missing
pwm Posted January 14, 2018

14 minutes ago, Pauven said:
39 minutes ago, limetech said:
CONFIG_RCU_NOCB_CPU_ALL: All CPUs are build_forced no-CBs CPUs
Seems like that was the magic sauce. Is it possible to see what exactly this option accomplished by reviewing the old source code for 4.12.3?

The *_ALL suffix just says the option should operate on all processor cores. If the specific range of cores your chip has is passed as an option to the kernel, you can still get the same functionality. It was just evil of the kernel maintainers to remove the logic that made the option automatically operate on all cores.
JorgeB Posted January 14, 2018

5 minutes ago, Sean M. said:
Jan 14 14:01:59 Tower emhttpd: shcmd (1129): mount -t ext4 -o noatime,nodiratime /dev/sdd1 /mnt/cache

How come it's trying to mount the cache disk as ext4? That's not a supported filesystem. Stop the array, change the cache disk filesystem to auto, then reformat.
limetech (Author) Posted January 14, 2018

10 minutes ago, Pauven said:
Perhaps by researching what 'Global C-state Control' does, you can better understand the nature of the problem and solution.

Right, you think getting info out of AMD is bad, try getting m/b manufacturers to tell you what their BIOS is doing.

11 minutes ago, Pauven said:
Seems like that was the magic sauce. Is it possible to see what exactly this option accomplished by reviewing the old source code for 4.12.3?

Like I said, I think it probably works by accident, meaning it masks the issue but doesn't fundamentally solve it. That means a later kernel change can cause the hangs to come back, which is exactly what we see. That said, did you try the 'rcu_nocbs=0-11' kernel option?

15 minutes ago, Pauven said:
I also want to point out that I tried extensively to get other linux distros to crash, including running the same slackware distro and kernel version that you were utilizing for unRAID. I never had a crash outside unRAID.

You're absolutely sure about that? Same kernel? That's not the experience of people posting in the bugzilla topic. Show me a distro that does not hang using the 4.14.13 kernel and we can do a comparison of the kernel config. Maybe there is some option we omit (or add) that others don't, but one would expect that magic option to be documented in the bugzilla topic, right?

As for AMD - they have explicitly told us there are BIOS "fixes" they are rolling out to motherboard manufacturers for this issue. What do you think they are putting in the new BIOS? Answer: updated microcode.
limetech (Author) Posted January 14, 2018

8 minutes ago, pwm said:
It was just evil of the Kernel maintainers to remove that auto logic to have the option automatically operate on all cores.

I thought I read somewhere it originated from a guy at Intel.
pwm Posted January 14, 2018

4 minutes ago, limetech said:
I thought I read somewhere it originated from a guy at Intel

It isn't impossible. I just know the feature was removed in: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/init/Kconfig?h=v4.14-rc3&id=44c65ff2e3b0b48250a970183ab53b0602c25764
"Signed-off-by: Paul E. McKenney <[email protected]>"

But rcu_nocbs=0-15 as a kernel option for a processor with 16 threads should result in the same behavior as if the kernel still had the CONFIG_RCU_NOCB_CPU_ALL parameter. It just adds the requirement that the system owner must know exactly what parameter value to supply for their processor.
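Since the range has to match the processor, the value can be derived from the thread count rather than typed by hand. A minimal sketch follows; `rcu_nocbs_option` is a hypothetical helper, and it assumes logical CPUs are numbered contiguously from 0.

```shell
#!/bin/bash
# Hypothetical helper: build an rcu_nocbs option covering every logical CPU,
# given the number of hardware threads. Assumes CPUs are numbered 0..N-1.
rcu_nocbs_option() {
  local threads="$1"
  echo "rcu_nocbs=0-$(( threads - 1 ))"
}

# Example calls:
#   rcu_nocbs_option 16           # a 16-thread processor
#   rcu_nocbs_option "$(nproc)"   # the machine this runs on
```

Note that `nproc` reports the processors available to the current process, which may be fewer than the machine has if CPU affinity is restricted.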
Pauven Posted January 14, 2018

11 minutes ago, limetech said:
Right, you think getting info out of AMD is bad, try getting m/b manufacturers to tell you what their bios is doing.

Technically, I think the m/b manufacturers are just programming to AMD's spec, so the problem is still getting AMD to say what this feature does on their processor.

12 minutes ago, limetech said:
Like I said, I think it probably works by accident, meaning it masks the issue, but doesn't fundamentally solve it. Meaning, a later kernel change can cause the hangs to come back, which is exactly what we see. That said, did you try the 'rcu_nocbs=0-11' kernel option?

No, I have never tried the rcu_nocbs kernel option. Since I have a Ryzen R7-1800X, I would need 'rcu_nocbs=0-15', correct? I don't know how to apply this; can you point me in the right direction please?

14 minutes ago, limetech said:
You're absolutely sure about that? Same kernel? That's not the experience of people posting in the bugzilla topic.

To the best of my ability, yes, same kernel. But this was 9 months ago, when Ryzen was first released. I think my testing was on the 4.10 or 4.11 branches, and after rc7a I never tested again with any new kernels, as I thought this was a resolved issue.

16 minutes ago, limetech said:
Show me a distro that does not hang using 4.14.13 kernel and we can do a comparison of the kernel config. Maybe there is some option we omit (or add) that others don't, but one would expect that magic option to be document in the bugzilla topic right?

I'll do ya one better: Point me to a distro that does hang, and I will test it. If it hangs, I'll shut up about it, but if it doesn't...

17 minutes ago, limetech said:
As for AMD - they have explicitly told us there are bios "fixes" they are rolling out to motherboard manufacturers for this issue. What do you think they are putting in the new bios? Answer: updated microcode.

Is this information public somewhere?
I would appreciate a pointer if it is. I am a major AMD shareholder, and don't mind pursuing Investor Relations for an answer to this issue, but I need to know what I'm talking about.

Thanks,
Paul
Joshewing02 Posted January 14, 2018

I tried installing last night from 3.5 and the GUI took forever to come up. When it did, my parity drive was not showing up and neither was my cache. The array did not start, so I shut it down, restarted, and watched the CLI. It got stuck on rip: isci_task_abort_task+0x18/0x334 for about 5 min. It finally appeared to finish, but I could never log back into the GUI even though it showed as connected. I had to reflash the flash drive with 3.5 and set up my parity and array again. Luckily I knew which slot each drive was in before I flashed. I had to set up Plex again and lost all my other Docker apps.

Has anyone dealt with rip: isci_task_abort_task+0x18/0x334 or know how to fix it? I saw people had this issue in the beta but can't see where or how they fixed it. I do not see an option to disable a floppy on my motherboard. Thanks
limetech (Author) Posted January 14, 2018

13 minutes ago, Pauven said:
No, I have never tried the rcu_nocbs kernel option. Since I have a Ryzen R7-1800X, I would need 'rcu_nocbs=0-15', correct? I don't know how to apply this, can you point me in the right direction please?

You need to edit syslinux/syslinux.cfg on your USB flash boot device. This can be done by clicking the Flash device on the Main page or directly via the command line. Find the section that contains "menu default" - that is the section used for default boot-up. For example, suppose you are booting normally (not console-GUI mode); you would change:

label unRAID OS
  menu default
  kernel /bzimage
  append initrd=/bzroot

to:

label unRAID OS
  menu default
  kernel /bzimage
  append initrd=/bzroot rcu_nocbs=0-15

That's it. Then reboot.
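The manual edit described above could also be previewed with a small script before touching the real file. This is a sketch under stated assumptions: `add_boot_param` is a hypothetical helper, it expects "menu default" to appear before the "append" line inside the boot entry (as in the example above), and it only prints the modified config rather than writing it back.

```shell
#!/bin/bash
# Hypothetical helper: print a copy of a syslinux.cfg with PARAM added to the
# "append" line of the boot entry marked "menu default".
# Assumes "menu default" precedes "append" within the entry.
# The input file itself is not modified.
add_boot_param() {
  local cfg="$1" param="$2"
  awk -v param="$param" '
    /^[ \t]*label[ \t]/          { in_default = 0 }   # a new boot entry starts
    /^[ \t]*menu[ \t]+default/   { in_default = 1 }   # this entry is the default
    in_default && /^[ \t]*append[ \t]/ {
      # Leave the line alone when the parameter is already present.
      if (index($0, param) == 0) $0 = $0 " " param
    }
    { print }
  ' "$cfg"
}

# Example call (the mount point of the flash device is an assumption):
#   add_boot_param /boot/syslinux/syslinux.cfg rcu_nocbs=0-15
```

Printing to standard output makes it easy to compare the result with the original before copying it into place and rebooting.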
arombo Posted January 14, 2018

39 minutes ago, dlandon said:
It looks like you have some disks not found.
Jan 14 10:36:37 Tower kernel: mdcmd (1): import 0
Jan 14 10:36:37 Tower kernel: md: import_slot: 0 missing
Jan 14 10:36:37 Tower kernel: mdcmd (2): import 1
Jan 14 10:36:37 Tower kernel: md: import_slot: 1 missing
Jan 14 10:36:37 Tower kernel: mdcmd (3): import 2
Jan 14 10:36:37 Tower kernel: md: import_slot: 2 missing
Jan 14 10:36:37 Tower kernel: mdcmd (4): import 3
Jan 14 10:36:37 Tower kernel: md: import_slot: 3 missing
Jan 14 10:36:37 Tower kernel: mdcmd (5): import 4
Jan 14 10:36:37 Tower kernel: md: import_slot: 4 missing

Just double-checked my connections. I don't think anything is off, but it still appears that I've lost my network interfaces. Any idea how to re-add them, or get them detected again? Latest diagnostics log attached.
monolith-iii-diagnostics-20180114-1142.zip