jonp Posted June 27, 2017

We found a thread on the Debian mailing list that documents an issue with Intel processors of both the Skylake and Kaby Lake families. You can read the thread yourself for a complete debrief on the issue, but here is the synopsis, as documented in the mailing list thread:

Quote
This advisory is about a processor/microcode defect recently identified on Intel Skylake and Intel Kaby Lake processors with hyper-threading enabled. This defect can, when triggered, cause unpredictable system behavior: it could cause spurious errors, such as application and system misbehavior, data corruption, and data loss. It was brought to the attention of the Debian project that this defect is known to directly affect some Debian stable users (refer to the end of this advisory for details), thus this advisory. Please note that the defect can potentially affect any operating system (it is not restricted to Debian, and it is not restricted to Linux-based systems). It can be either avoided (by disabling hyper-threading), or fixed (by updating the processor microcode). Due to the difficult detection of potentially affected software, and the unpredictable nature of the defect, all users of the affected Intel processors are strongly urged to take action as recommended by this advisory.

Due to the nature of this issue, we recommend that all affected users do the following:

- Read the Debian mailing list post regarding this issue to confirm your CPU is affected.
- Check to see if there is a BIOS update available for your hardware.
- If no BIOS update is available, disable Hyperthreading in your system BIOS immediately.

We are looking into providing a way for users to apply a microcode update as a workaround that temporarily patches this bug on a per-boot basis. Until then, users with these systems should consider it risky to continue using the Hyperthreading feature.
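For anyone who wants to collect the details the advisory asks about, here is a minimal sketch in Python. It assumes an x86 Linux system whose /proc/cpuinfo carries the usual "cpu family", "model", "stepping", "siblings" and "cpu cores" fields; the function name is made up for illustration and is not part of unRAID or Debian. It only reports the values. Which model and stepping numbers are affected has to be read from the advisory itself.

```python
def summarize_cpuinfo(text):
    """Summarize the first logical CPU block of /proc/cpuinfo-style text.

    Hypothetical helper for illustration only.
    """
    fields = {}
    for line in text.splitlines():
        if not line.strip():
            if fields:
                break  # a blank line ends the first processor block
            continue
        key, _, value = line.partition(":")
        fields[key.strip()] = value.strip()

    siblings = int(fields.get("siblings", "0"))   # logical CPUs per package
    cores = int(fields.get("cpu cores", "0"))     # physical cores per package
    return {
        "model_name": fields.get("model name", ""),
        "family": int(fields.get("cpu family", "0")),
        "model": int(fields.get("model", "0")),
        "stepping": int(fields.get("stepping", "0")),
        # More logical CPUs than physical cores means hyper-threading is active.
        "hyperthreading_active": cores > 0 and siblings > cores,
    }

# Usage on a Linux box (not run here):
#   with open("/proc/cpuinfo") as f:
#       print(summarize_cpuinfo(f.read()))
```

Comparing "siblings" against "cpu cores" is used here rather than the "ht" flag, because the flag alone does not tell you whether hyper-threading is currently switched on in the BIOS.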
giantkingsquid Posted June 27, 2017

Thanks for posting this. Please do keep us in the loop with any microcode updates that become available for UnRAID. My Skylake-based machine has been suffering from intermittent crashes for no discernible reason since the get-go, specifically when hyperthreading is enabled, so I'm looking forward to seeing whether a microcode update will fix it. I doubt my board manufacturer will release a BIOS update this late in the game.
SSD Posted June 27, 2017

10 minutes ago, giantkingsquid said:
Thanks for posting this. Please do keep us in the loop with any microcode updates that become available for UnRAID. My Skylake based machine has been suffering from intermittent crashes for no discernible reason since the get go, specifically when hyperthreading is enabled, so looking forward to seeing whether a microcode update will fix it. I doubt my board manufacturer will release a Bios update this late in the game

giant - Are you taking the advice to disable hyperthreading? If you were having issues and disabling hyperthreading resolves them, it would be overwhelmingly likely that you are being impacted by the bug. But if disabling hyperthreading does not resolve the issues, you similarly have pretty overwhelming evidence that the problem lies elsewhere.

I remember that early in the hyperthreading era, hyperthreading was actually detrimental to performance, and it was recommended to disable it. I just did another search, and whether hyperthreading is of value seems very dependent on the nature of the applications that are running. The more things you run, it seems, the more likely it will have a positive impact.

Clearly having 2 real cores gives 1x+1x=2x performance. Having 2 virtual cores per physical core nominally gives 0.5x+0.5x+0.5x+0.5x=2x performance. And you may get 0.6x+0.6x+0.6x+0.6x=2.4x, but you may also get 0.4x+0.4x+0.4x+0.4x=1.6x. You could also get 0.6x+0.6x+0.4x+0.4x=2x.

Generally I would say that one thread running 2x as fast is going to perform better than 2 threads running 1/2 as fast on a single task. It takes a lot more time and effort to code an application to efficiently use multiple threads, and the "overhead" of the threading logic could easily eat up the performance gains that hyperthreading might deliver. So even if the two HT cores deliver 1.2x the power, the app may run no faster, and could be slower.

That same app running on 2 physical cores might go quite a lot faster than it would on one real core and its virtual sibling. So what's the OS to do: give an app that wants two threads two different physical cores, or give it one physical core along with its virtual core? If it got 2 different cores, it might power through a processing-intensive task 2x as fast.

So the jury is still out on whether a particular user would lose anything, or in fact gain something, by disabling hyperthreading. Either way, it will likely not be night and day.
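The arithmetic in the post above can be written out as a small Python sketch. The per-thread speeds are the post's own illustrative numbers (1.0 means one full physical core), not measurements, and the simple model where every hardware thread runs at a fixed fraction of a core is an assumption carried over from the post.

```python
def total_throughput(per_thread_speeds):
    """Sum the relative speed of every hardware thread (1.0 = one full core)."""
    return round(sum(per_thread_speeds), 2)

# Scenarios taken from the post; the labels are added here for readability.
scenarios = {
    "2 physical cores, HT off": [1.0, 1.0],
    "HT on, break-even": [0.5, 0.5, 0.5, 0.5],
    "HT on, good case": [0.6, 0.6, 0.6, 0.6],
    "HT on, bad case": [0.4, 0.4, 0.4, 0.4],
    "HT on, mixed": [0.6, 0.6, 0.4, 0.4],
}

for name, speeds in scenarios.items():
    # A single-threaded task can only go as fast as one hardware thread.
    print(f"{name}: total {total_throughput(speeds)}x, "
          f"fastest single thread {max(speeds)}x")
```

Under this model the totals range from 1.6x to 2.4x, while the fastest single thread never exceeds 0.6x with HT on, which is the point being made about single tasks.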
giantkingsquid Posted June 27, 2017

I have disabled hyperthreading and the system has not crashed for several hours now. Then again, it has run for >40 days without crashing, and it has also crashed after only a few hours. Very difficult to test for. I honestly doubt that this bug is my problem, but it's another box to tick off, I suppose. Thanks for the interesting write-up.

A hypothetical for those in the know: if I had a Debian VM running on unRAID, and used the Debian microcode patch at VM boot, would the VM be patched, would both the host and the VM be patched, or would nothing be patched?
SSD Posted June 27, 2017

My guess would be it would not be affected. Breaking out of the virtual box to directly access the CPU should not be possible.
GHunter Posted June 27, 2017

I read about this issue yesterday here: http://techreport.com/news/32152/hyper-threading-erratum-rears-its-head-in-skylake-and-kaby-lake

I'm still running a Haswell CPU so I'm not affected by it, but it was an interesting read.
SSD Posted June 27, 2017

Just a further comment. This must be an edge case of enormous unlikelihood. Think of the number of computers based on those chips, and the millions upon millions of hours of testing and use. If a specific user were to randomly hit this issue once, it would be unlikely. But to hit it repeatedly within hours of booting - if that were the kind of symptom, it would have been found and fixed long ago IMO.

I remember in college there was an assignment to implement a very, very basic multi-process "OS". This was on a now ancient Z80 processor. The trick was to interrupt a process (easy to do), and then precisely remember the processor state. Since each thing you do changes the processor state, it was a bit tricky not to destroy the state of the registers as you stored things away, and then to put things back precisely so that the interrupted process was oblivious to the fact it was interrupted.

We were running something like 10 processes in parallel, giving each a time slice to run. The running processes were all the same - a mathematical calculation of some kind. It took probably 20 seconds, displayed the same answer, and looped to do the same thing over and over. I remember it being fun, because you could run several iterations and it would work, and then you'd get a weird answer displayed and you'd have to figure out what you were not preserving.

After several test runs I got it running consistently, and was letting it run and run, several minutes - perfect. I was the first one in the lab to get it running, and was chatting with the TA when I got a wrong answer displayed. Crap. It had run a long time. Went through the code line by line with the TA, and it was perfect. Others started finishing and ran theirs for several minutes without a problem, but left on for 15 minutes or more, each would show one wrong answer. The TA couldn't explain it, and we went home confused.

Next class the teacher explained what happened. There is a command DAA (decimal adjust accumulator, if memory serves) that is dependent on an invisible carry flag that the prior instruction could set. He then explained a method to preserve even that state, which until then we had no idea existed.

So maintaining state is tricky, and even if you think every nuance is covered, it may not be. I have to think this issue is similar to my college experience (with infinitely more complexity): some extremely nuanced situation where state is not properly preserved.
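To make the DAA story concrete, here is a small Python sketch of the failure mode. It is an illustration written for this thread, not the original assignment code. On the Z80, DAA consults the half-carry (H) flag (and the add/subtract N flag) left behind by the previous instruction, and no conditional jump can test those flags, which is probably why they felt "invisible". The sketch covers only the after-addition case and simulates a context switch that restores the accumulator but loses H.

```python
def add_with_flags(a, b):
    """8-bit addition returning (result, half_carry, carry)."""
    result = (a + b) & 0xFF
    half_carry = ((a & 0x0F) + (b & 0x0F)) > 0x0F  # carry out of bit 3
    carry = (a + b) > 0xFF                          # carry out of bit 7
    return result, half_carry, carry

def daa_after_add(value, half_carry, carry):
    """Decimal-adjust the accumulator after an addition (N flag clear)."""
    correction = 0
    if half_carry or (value & 0x0F) > 9:
        correction |= 0x06
    if carry or value > 0x99:
        correction |= 0x60
    return (value + correction) & 0xFF

# BCD addition 19 + 28 should give 47.
acc, h, c = add_with_flags(0x19, 0x28)     # acc = 0x41, H set, C clear

good = daa_after_add(acc, h, c)            # flags preserved across the switch
bad = daa_after_add(acc, False, c)         # H lost by a sloppy context switch

print(f"flags preserved: {good:02X}")
print(f"half-carry lost: {bad:02X}")
```

The wrong answer only shows up when the interrupt lands exactly between the addition and the DAA, and only for operands that set the half-carry, which fits the "perfect for minutes, then one wrong answer" behavior described above.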
HellDiverUK Posted June 27, 2017

Yay i5 - no Hyperthreading nonsense.
Nicktdot Posted June 28, 2017

Interesting. I'm running the upcoming Skylake Purley Xeon. I guess that's what they call the Xeon E5 v5 in the note; however, the nomenclature for this upcoming CPU has changed. The current chip reports this CPU info:

Quote
Intel(R) Xeon(R) Gold 6150 CPU @ 2.70GHz

And the CPU instruction set flags ( http://i.imgur.com/o6Y8LWp.png ) show that it definitely supports HyperThreading (ht), so it looks like it's affected by the bug. I've hammered the box pretty hard but have not encountered any stability issues... maybe I should run unRAID on it for a bit.
squirrellydw Posted June 28, 2017

Am I right that I don't have to worry about it with my board, the Supermicro X10SDV-7TP4F? https://www.supermicro.com/products/motherboard/Xeon/D/X10SDV-7TP4F.cfm
GHunter Posted June 28, 2017

55 minutes ago, squirrellydw said:
Am I right in that I don't have to worry about it with my board Supermicro X10SDV-7TP4F https://www.supermicro.com/products/motherboard/Xeon/D/X10SDV-7TP4F.cfm

It is the CPU family that you have to worry about, not the MB. Your CPU is a Broadwell CPU if the specs in your signature are current. So you're fine.
squirrellydw Posted June 29, 2017

10 hours ago, GHunter said:
It is the CPU family that you have to worry about not the MB. Your CPU is a Broadwell CPU if the specs in your signature are current. So you're fine.

Yeah, that's what I meant. Thanks.
jbartlett Posted June 30, 2017

Based on the following site, the June microcode update for Windows does not include this fix. https://arstechnica.com/information-technology/2017/06/skylake-kaby-lake-chips-have-a-crash-bug-with-hyperthreading-enabled/
tdallen Posted July 1, 2017

Hmm. If you were building/recommending a new system right now, what would you do?
SSD Posted July 2, 2017

3 hours ago, tdallen said:
Hmm. If you were building/recommending a new system right now, what would you do?

It will get fixed. And I believe the loss of hyperthreading is not much of a loss at all. I would buy the processor that made the most sense to me, and if it was impacted, I would disable hyperthreading for now.

Does anyone have any evidence that hyperthreading makes a significant difference in performance in transcoding, VM execution, or any of the heavy CPU-intensive operations people do on their unRAID arrays? I have not found anything more than theoretical. Eight hoses that each carry 1/2 the water of 4 larger ones - seems like we're not going to move much more water.
tdallen Posted July 2, 2017

17 hours ago, bjp999 said:
Does anyone have any evidence that hyperthreading makes a significant difference in performance in transcoding

Here's a data point. I ran a Handbrake encode on a 24GB .mkv file using the MP4/H.264 Normal profile under Windows 10. It ran on a Core i7-4790, first with Hyperthreading enabled, then with it disabled. Source and target were both on the local SSD. Time with 8 logical cores was 32:38; time with 4 cores was 40:17. All cores were maxed, and disk I/O and memory utilization were as low as expected.

Handbrake encoding logs show that it is aware of the cores it has to work with. So I retested both and looked in Process Explorer. It was interesting to note that HandbrakeCLI.exe spooled up 39 threads when it had 8 cores and only 29 threads when it had 4 cores. Could Handbrake have worked faster on 4 cores if it had spooled up 39 threads? I doubt it, but it's a variable.

So, here's a single data point on a Haswell chip - feel free to draw conclusions or not, your call: a roughly 23% longer encode time on a multi-threaded, CPU-intensive operation with Hyperthreading disabled.
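For anyone checking the 23% figure, here is the arithmetic as a short Python sketch. It uses only the two encode times quoted in the post; the helper name is made up for illustration. Note that the same data can be read two ways: about 23% more time with Hyperthreading off, or about 19% less throughput.

```python
def to_seconds(mmss):
    """Convert an 'MM:SS' string to a number of seconds."""
    minutes, seconds = mmss.split(":")
    return int(minutes) * 60 + int(seconds)

ht_on = to_seconds("32:38")    # Hyperthreading enabled, 8 logical cores
ht_off = to_seconds("40:17")   # Hyperthreading disabled, 4 cores

extra_time = (ht_off - ht_on) / ht_on     # how much longer the encode took
throughput_loss = 1 - ht_on / ht_off      # how much less work per unit time

print(f"HT on:  {ht_on} s")
print(f"HT off: {ht_off} s")
print(f"Extra encode time with HT off: {extra_time:.1%}")
print(f"Throughput lost with HT off:   {throughput_loss:.1%}")
```

This is still a single run on a single Haswell chip, so the percentages say nothing about run-to-run variation.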
SSD Posted July 2, 2017

Thanks for running the test @tdallen! It is clear that Handbrake, at least, benefits from the extra threads. Why did it pick 39 threads with 8 cores vs 29 with 4? Was there some intelligence? Would the 4-core version have been faster with more threads running on 4 cores? Don't know, and it doesn't really matter. Clearly, there is a pretty significant advantage with Handbrake, and probably with other apps too. Apps get optimized based on what runs in the real world, and the real world runs hyperthreaded! It certainly seems worth getting this fixed! And that's what our users really wanted to know.
Jeesieword Posted July 3, 2017

Handbrake encoding logs show that it is aware of the cores it has to work with.
Iormangund Posted July 13, 2017

Don't know if this has already been posted, but Intel does seem to have released a fix for Linux: https://downloadcenter.intel.com/download/26925/Linux-Processor-Microcode-Data-File?v=t

Can this be applied to unRAID without waiting for a new unRAID version?
richardsim7 Posted July 17, 2017

FYI, it looks like Gigabyte has released BIOS updates for lots of recent motherboards. Not sure how far back they're supporting the fix.
killeriq Posted July 19, 2017

Does this also include Apollo Lake? I've been having issues since the beginning...

Jul 19 04:40:08 unRAIDTower root: Fix Common Problems: Error: Machine Check Events detected on your server
Jul 19 04:40:08 unRAIDTower root: mcelog: Family 6 Model 92 CPU: only decoding architectural errors

Thanks.
yippy3000 Posted July 21, 2017

I saw Intel released a public fix for Linux. Is this something I can install directly, or does it need to be made part of unRAID for it to stay after updates? If it is OK to install, dumb question, but how?
richardsim7 Posted July 21, 2017

1 hour ago, yippy3000 said:
I saw Intel released a public fix for Linux, is this something I can install directly or does it need to be made part of Unraid for it to stay after updates? If it is ok to install, dumb question, but how?

Check your motherboard's site for a BIOS update if possible.
yippy3000 Posted July 21, 2017

There is an update, but there aren't any release notes, so I asked support and they said they did not think it included the microcode update. Is there any way to tell if I am running the fixed microcode for the CPU after I do the BIOS update?
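One way to approach the "how do I tell" question is to compare the microcode revision the kernel reports before and after the BIOS update. The Python sketch below assumes an x86 Linux system where /proc/cpuinfo includes a "microcode" field written in hex; the function names are invented for this example, and the required revision is deliberately left as a parameter because the correct value for a given CPU has to come from the advisory or from Intel's release notes.

```python
def microcode_revisions(cpuinfo_text):
    """Return the set of microcode revisions reported by all logical CPUs."""
    revisions = set()
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "microcode":
            revisions.add(int(value.strip(), 16))  # e.g. "0xba" -> 186
    return revisions

def is_at_least(cpuinfo_text, required_revision):
    """True if every logical CPU reports at least the required revision.

    required_revision is supplied by the caller; this sketch does not
    know which revision contains the fix for any particular CPU.
    """
    revisions = microcode_revisions(cpuinfo_text)
    return bool(revisions) and min(revisions) >= required_revision

# Usage on a Linux box (not run here):
#   with open("/proc/cpuinfo") as f:
#       text = f.read()
#   print([hex(r) for r in sorted(microcode_revisions(text))])
```

Noting the revision before flashing the BIOS and again afterwards shows at least whether the update changed the microcode at all, even when the vendor publishes no release notes.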
CrashnBrn Posted July 21, 2017

3 hours ago, yippy3000 said:
There is an update but there aren't any release notes so I asked support and they said they did not think it included the micro-code update. Is there anyway to tell if I am running the fixed micro-code for the CPU after I do the BIOS update?

Is it a Supermicro? I wish they had release notes for BIOS updates.