October 10, 2008
There may be one serious problem with it: http://www.phoronix.com/scan.php?page=news_item&px=Njc0Nw.
October 10, 2008
I like that site. Never seen it before, but it's very readable. There are some changes in this kernel that may benefit unRAID, so bring it on after that serious bug is squashed.
October 10, 2008
"There may be one serious problem with it:"
That's been fixed since rc7. http://www.phoronix.com/scan.php?page=news_item&px=Njc2OA
Intel Provides Temporary e1000e Fix
http://www.phoronix.com/scan.php?page=news_item&px=Njc1OQ
In the Linux 2.6.27 kernel code was a rather serious regression where a faulty driver is killing Intel network hardware. Specifically the e1000 and e1000e network adapters were getting their EEPROM corrupted by the driver, which renders the network interface permanently inoperable unless that non-volatile memory can be restored. The e1000 problem was patched but the Intel e1000e remains problematic.
Fortunately, Intel has now provided a workaround so that no further Intel network hardware is damaged. A patch was proposed by Intel last night on the Linux kernel mailing list that prevents the e1000e non-volatile memory (NVM) from being corrupted when the respective Linux driver is loaded. There is no proper fix yet to this situation but Intel is continuing to explore the problem.
Intel is also preparing patches that help users with damaged network hardware restore their EEPROM. For the Linux 2.6.28 kernel, Intel will push forward patches that clean up the network driver's use of the hardware/software semaphore.
October 10, 2008
We're going to release 4.4-beta3 with the 2.6.26.6 kernel and then probably wait until 2.6.27.1. We have been following the e1000 bug with great interest, and they claim it is fixed in 2.6.27, but so far I have not wanted to risk bricking one of our motherboards to test this.
October 10, 2008
"We're going to release 4.4-beta3 with 2.6.26.6 kernel & then probably wait until 2.6.27.1. We have been following the e1000 bug with great interest & they claim it is fixed in 2.6.27, but so far I have not wanted to risk bricking one of our motherboards to test this yet"
I think this is a wise decision; in fact, I was going to wait myself.
October 15, 2008 (Author)
If you are afraid of bricking an E1000, download the Intel patch first, then upgrade the kernel. If the E1000 gets hosed, you already have the EEPROM patch to fix it.
October 15, 2008
"If you are afraid of bricking an E1000, download the Intel patch first, then upgrade the kernel... if the E1000 gets hosed, you already have the EEPROM patch to fix it."
In my case, the e1000 driver is used on the original unRAID recommended Intel motherboard. I've no idea if it is as easy as you describe to update its EEPROM. I'm willing to wait a few days longer for a corrected 2.6.27 kernel.
October 21, 2008
They appear to have figured out the serious bug that was 'bricking' the EEPROMs of Intel adapters that use the e1000e driver; see this thread (very interesting): http://lkml.org/lkml/2008/10/15/337. Also very interesting are the author's comments on the FTRACE code at fault, found just past halfway down here: http://lwn.net/Articles/303390/. I have to agree with most that there is no way a memory corruption bug should have been able to overwrite an EEPROM. Seems like poor hardware design here...
October 23, 2008
Any news here? I see there's already a 2.6.27.3 kernel, but still no e1000 bug fixes...
October 23, 2008 (Author)
The E1000 bug was FIXED several releases ago in the 2.6.27 kernel. Actually, it was fixed in RC7; only older versions, prior to RC7, had the bug.
October 23, 2008
"The E1000 bug was FIXED several releases ago in the 2.6.27 kernel... actually it was fixed in RC7.... only older version, prior to RC7, had the bug."
I don't think so. Look at this:
commit bc5b8bb64a2dc740d8b99635931e689a8b13daf2
Author: Greg Kroah-Hartman <[email protected]>
Date: Wed Oct 15 16:02:53 2008 -0700
Linux 2.6.27.1
commit d23d43386311fde5f11e06c16d4185e94a8d6d06
Author: Steven Rostedt <[email protected]>
Date: Wed Oct 15 18:21:44 2008 -0400
disable CONFIG_DYNAMIC_FTRACE due to possible memory corruption on module unload
While debugging the e1000e corruption bug with Intel, we discovered today that the dynamic ftrace code in mainline is the likely source of this bug.
For the stable kernel we are providing the only viable fix patch: labeling CONFIG_DYNAMIC_FTRACE as broken. (see the patch below)
We will follow up with a backport patch that contains the fixes. But since the fixes are not a one liner, the safest approach for now is to disable the code in question.
The cause of the bug is due to the way the current code in mainline handles dynamic ftrace. When dynamic ftrace is turned on, it also turns on CONFIG_FTRACE which enables the -pg config in gcc that places a call to mcount at every function call. With just CONFIG_FTRACE this causes a noticeable overhead. CONFIG_DYNAMIC_FTRACE works to ease this overhead by dynamically updating the mcount call sites into nops.
The problem arises when we trace functions and modules are unloaded. The first time a function is called, it will call mcount and the mcount call will call ftrace_record_ip. This records the calling site and stores it in a preallocated hash table. Later on a daemon will wake up and call kstop_machine and convert any mcount callers into nops.
The evolution of this code first tried to do this without the kstop_machine and used cmpxchg to update the callers as they were called. But I was informed that this is dangerous to do on SMP machines if another CPU is running that same code. The solution was to do this with kstop_machine.
We still used cmpxchg to test if the code that we are modifying is indeed code that we expect to be before updating it - as a final line of defense.
But on 32bit machines, ioremapped memory and modules share the same address space. When a module would load its code into memory and execute some code, that would register the function.
On module unload, ftrace incorrectly did not zap these functions from its hash (this was the bug). The cmpxchg could have saved us in most cases (via luck) - but with ioremap-ed memory that was exactly the wrong thing to do - the results of cmpxchg on device memory are undefined. (and will likely result in a write)
The pending .28 ftrace tree does not have this bug anymore, as a general push towards more robustness of code patching, this is done differently: we do not use cmpxchg and we do a WARN_ON and turn the tracer off if anything deviates from its expected state. Furthermore, patch sites are statically identified during build time so there's no runtime discovery of dynamic code areas anymore, and no room for code unmaps to cause the hash to become out of date.
We believe the fragility of dynamic patching has been sufficiently addressed in the development code via the static patching method, but further suggestions to make it more robust are welcome.
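For anyone trying to picture the failure described in that commit message, here is a toy Python model of it. It is only a sketch: `AddressSpace`, `Tracer`, `run_scenario`, the string "instructions" and the example address are all made up for illustration and are not kernel data structures. It assumes, as the commit message suggests, that a cmpxchg on device memory ends up as a write, and it does not try to model what actually lands in the device. The `forget_on_unload=True` case models the missing cleanup on module unload; it is not the static patching approach of the .28 tree.

```python
MCOUNT_CALL = "call mcount"  # stand-in for the instruction at a traced call site
NOP = "nop"


class AddressSpace:
    """Module code and ioremap-ed device memory share one address space,
    as on the 32-bit machines mentioned in the commit message."""

    def __init__(self):
        self.mem = {}            # address -> (kind, value)
        self.device_writes = []  # device addresses hit by a write cycle

    def load_module(self, addr):
        self.mem[addr] = ("code", MCOUNT_CALL)

    def unload_module(self, addr):
        del self.mem[addr]

    def ioremap(self, addr):
        self.mem[addr] = ("device", "nvm")

    def cmpxchg(self, addr, expected, new):
        """Compare-and-swap. On device memory this model only records that a
        write happened (assumption: 'will likely result in a write')."""
        if addr not in self.mem:
            return False
        kind, value = self.mem[addr]
        if kind == "device":
            self.device_writes.append(addr)
            return False
        if value != expected:
            return False
        self.mem[addr] = (kind, new)
        return True


class Tracer:
    def __init__(self, space, forget_on_unload):
        self.space = space
        self.forget_on_unload = forget_on_unload
        self.recorded = set()    # stand-in for the preallocated hash table

    def record_ip(self, addr):
        # First call of a function: remember its call site.
        self.recorded.add(addr)

    def module_unloaded(self, addr):
        # The bug: this cleanup was missing, so entries went stale.
        if self.forget_on_unload:
            self.recorded.discard(addr)

    def convert_to_nops(self):
        # The daemon that later patches every recorded call site.
        for addr in sorted(self.recorded):
            self.space.cmpxchg(addr, MCOUNT_CALL, NOP)
        self.recorded.clear()


def run_scenario(forget_on_unload):
    space = AddressSpace()
    tracer = Tracer(space, forget_on_unload)
    addr = 0xF8000000            # arbitrary example address
    space.load_module(addr)
    tracer.record_ip(addr)
    space.unload_module(addr)
    tracer.module_unloaded(addr)
    space.ioremap(addr)          # same address reused for device memory
    tracer.convert_to_nops()
    return space.device_writes


if __name__ == "__main__":
    print(run_scenario(forget_on_unload=False))  # stale entry: device gets written
    print(run_scenario(forget_on_unload=True))   # entry dropped: device untouched
```

With the stale entry left in the table, the patching pass touches an address that now belongs to the device; with the entry dropped on unload, it never goes near it.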
October 23, 2008 (Author)
That references the underlying source of the problem, but a fix to prevent the problem from manifesting itself was already in place by RC7. One is the disease, one is the symptom.
October 23, 2008
Correct me if I'm wrong, but isn't the current fix like putting a bandaid on a wart?
October 23, 2008 (Author)
Actually, a bandaid with a drop of salicylic acid on it will cure a wart. It's more like putting a lock on the door, instead of shooting the burglar.
October 23, 2008
"correct me if I'm wrong, but isn't the current fix like putting a bandaid on a wart?"
The current "fix" disables the option that causes the problem until a correct fix can be designed and developed. Apparently, self-modifying code, intended to improve performance, ended up modifying part of the network card's EEPROM that had been mapped into a portion of memory. This caused the corruption. The self-modifying feature is simply not enabled in the current 2.6.27 release. If a custom kernel is compiled with the feature enabled, it will still be broken and could trash memory once again until a true fix is coded.
Joe L.
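If you build your own kernel and want to confirm that the option is off, something like the following Python sketch could help. It assumes you have the text of the kernel's `.config` at hand and that it uses the usual `CONFIG_X=y` / `# CONFIG_X is not set` lines; `config_option_state` and the sample text are hypothetical, written for this post only, and where the config file lives depends on your distribution.

```python
def config_option_state(config_text, option):
    """Return the value of `option` ('y', 'm', ...) from kernel .config text,
    or None if the option is absent or marked as not set."""
    for line in config_text.splitlines():
        line = line.strip()
        if line == f"# {option} is not set":
            return None
        if line.startswith(option + "="):
            return line.split("=", 1)[1]
    return None


# Made-up sample, not taken from any real kernel build.
SAMPLE_CONFIG = """\
CONFIG_FTRACE=y
# CONFIG_DYNAMIC_FTRACE is not set
"""

if __name__ == "__main__":
    state = config_option_state(SAMPLE_CONFIG, "CONFIG_DYNAMIC_FTRACE")
    if state is None:
        print("CONFIG_DYNAMIC_FTRACE is disabled")
    else:
        print("CONFIG_DYNAMIC_FTRACE is enabled:", state)
```

This only reports what the config text says; it tells you nothing about whether a given kernel source tree carries the later fixes.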
October 23, 2008
"and potentially trash memory once again until a true fix is coded."
That's what I thought. So it's not really fixed, just disabled/bypassed.