dgaschk Posted November 30, 2011

Does removing that one line fix the problem?
WeeboTech Posted November 30, 2011

To have the bug affect you, you would have to write to the exact same set of blocks (a stripe) that is being calculated at that specific moment. As mentioned in some other thread, this bug has been in the "md" driver in all versions of Linux for years.

It's pretty scary when you think about it silently corrupting something. It disturbs me because the last issue I had happened to occur in the superblock near the start of the drive. The superblock was also over 1GB in size because there were over 250,000 files on that drive. Any time you come back from an abnormal startup, writes occur when the filesystem is mounted and transactions are replayed.
Joe L. Posted November 30, 2011

Quote from WeeboTech:
> To have the bug affect you, you would have to write to the exact same set of blocks (a stripe) that is being calculated at that specific moment. As mentioned in some other thread, this bug has been in the "md" driver in all versions of Linux for years.
> It's pretty scary when you think about it silently corrupting something. It disturbs me because the last issue I had happened to occur in the superblock near the start of the drive. The superblock was also over 1GB in size because there were over 250,000 files on that drive. Any time you come back from an abnormal startup, writes occur when the filesystem is mounted and transactions are replayed.

I agree. It is exactly that set of simultaneous writes to the initial blocks that would trip the bug. The parity check that results from a non-clean shutdown is actually a re-construction of parity if parity is out of sync, so it could potentially clobber parity, but not data. A subsequent parity check should fix it. The bigger issue is when re-constructing a replacement data drive. It is there that you can get into trouble.

Joe L.
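To see why an out-of-sync stripe matters more during a rebuild than during a parity check, here is a toy sketch of single-parity XOR with one byte per "disk". It does not reproduce the md driver bug itself; the byte values and the `xor_all` helper are made up for illustration, and it only shows that a rebuild trusts parity, so stale parity for a stripe yields silently wrong data on the replacement disk.

```bash
#!/bin/bash
# Toy single-parity stripe: one byte per "disk". All values are hypothetical.
xor_all() {
    acc=0
    for v in "$@"; do
        acc=$(( acc ^ $v ))
    done
    echo "$acc"
}

d1=0x5A; d2=0x3C; d3=0xF0                 # data bytes on three imaginary disks
parity=$(xor_all "$d1" "$d2" "$d3")       # parity = d1 ^ d2 ^ d3

# Rebuild disk 2 from parity plus the surviving disks: this gives back d2.
rebuilt=$(xor_all "$parity" "$d1" "$d3")

# Now suppose disk 1 changes but parity for that stripe is NOT updated.
# The same rebuild formula then produces a wrong byte without any error.
d1_new=0x5B
rebuilt_stale=$(xor_all "$parity" "$d1_new" "$d3")

printf 'original d2:           0x%02X\n' "$d2"
printf 'rebuilt, parity good:  0x%02X\n' "$rebuilt"
printf 'rebuilt, parity stale: 0x%02X\n' "$rebuilt_stale"
```

A parity check can recompute parity from intact data disks, which is why clobbered parity is recoverable. A rebuilt data disk has no second copy to compare against.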
abs0lut.zer0 Posted November 30, 2011

Sooo??? As a novice, the global moderators are starting to scare me... Is there ANY way to avoid this error, or what are the best practices? Thanks.
Joe L. Posted November 30, 2011

Quote from abs0lut.zer0:
> Sooo??? As a novice, the global moderators are starting to scare me... Is there ANY way to avoid this error, or what are the best practices? Thanks.

Don't write to a disk you are re-constructing or replacing until the re-construction is complete.
lionelhutz Posted November 30, 2011

Quote from Joe L.:
> Don't write to a disk you are re-constructing or replacing until the re-construction is complete.

I didn't, and I still saw the bug in action: 3 parity errors during the parity check after rebuilding.

Peter
abs0lut.zer0 Posted November 30, 2011

Quote from Joe L.:
> Don't write to a disk you are re-constructing or replacing until the re-construction is complete.

Thanks, will do that then. Question: is a delete the same as a write?
WeeboTech Posted November 30, 2011

Quote from abs0lut.zer0:
>> Don't write to a disk you are re-constructing or replacing until the re-construction is complete.
> Thanks, will do that then.

lionelhutz points out that even that is not safe. It really needs to be resolved. The whole md5sum database idea I had now seems crucial for verifying your file integrity.
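The md5sum database idea can be approximated with standard tools. The following is only a minimal sketch, not WeeboTech's actual design: the function names and the example paths in the comments are assumptions, and it relies on GNU `find`, `sort`, `xargs` and `md5sum`. Keep the checksum file outside the tree being hashed.

```bash
#!/bin/bash
# Build a checksum list for every file under a directory.
# Arguments: <data_dir> <db_file>. Both are placeholders chosen by the caller.
build_md5_db() {
    data_dir="$1"; db_file="$2"
    # -print0 / -z / -0 keep filenames with spaces intact; -r skips md5sum if no files.
    ( cd "$data_dir" && find . -type f -print0 | sort -z | xargs -0 -r md5sum ) > "$db_file"
}

# Re-hash the files and compare against the stored list.
# --quiet prints only failures; the exit status is non-zero on any mismatch.
verify_md5_db() {
    data_dir="$1"; db_file="$2"
    ( cd "$data_dir" && md5sum --quiet -c - ) < "$db_file"
}

# Hypothetical usage around a rebuild:
#   build_md5_db  /mnt/disk1 /boot/md5/disk1.md5     # before replacing the disk
#   verify_md5_db /mnt/disk1 /boot/md5/disk1.md5     # after the rebuild completes
```

A list built before a rebuild only proves something if the files were not legitimately changed in between, so it fits best with data that is written once and then left alone.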
abs0lut.zer0 Posted November 30, 2011

Quote from WeeboTech:
> lionelhutz points out that even that is not safe. It really needs to be resolved. The whole md5sum database idea I had now seems crucial for verifying your file integrity.

So when can we expect it to be implemented?
MortenSchmidt Posted November 30, 2011

Quote from Joe L.:
> Don't write to a disk you are re-constructing or replacing until the re-construction is complete.

Like lionelhutz, I was also not deliberately writing to the disk. Sabnzbd, Transmission etc. were shut down (otherwise my array won't stop), and it was late at night so no one else in the house was accessing anything. I experienced this problem 2 separate times: the first rebuild failed in parity checks, and so did the second. It only started working when I stripped my go script, rebooted and did the rebuild again. To make things more confusing, the rebuild also worked one time with a full go script.

I had the errors in the first 0.1%, so it may be related to this superblock thing WeeboTech writes about, except I didn't have any 'abnormal' shutdown or startup. I just killed sabnzbd, transmission and twonkyserver, then stopped the array, selected the new disk and started up again.

Could this be provoked by any of the 'performance tweak' vm.dirty_xxx settings? I have these in my go script, partly inherited from Purko as I recall:

# Performance Tweaks
for i in /sys/block/[hs]d? ; do echo 128 > $i/queue/max_sectors_kb ; done 2>/dev/null
for i in /sys/block/[hs]d? ; do echo cfq > $i/queue/scheduler ; done 2>/dev/null
sysctl -w vm.min_free_kbytes=8192 # sl:2497
sysctl -w vm.dirty_expire_centisecs=900 # sl:3000 tm:100
sysctl -w vm.dirty_writeback_centisecs=300 # sl:500 tm:50
sysctl -w vm.dirty_ratio=20 # sl:10 tm:10
sysctl -w vm.dirty_background_ratio=10 # sl:5 tm:5

More details and syslog from the issues I had are here: http://lime-technology.com/forum/index.php?topic=12884.msg132178#msg132178
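Whether those settings have anything to do with the bug is an open question in this thread. For a comparison rebuild with and without the tweaks, it may help to record the values currently in effect first. A minimal sketch follows; the `dump_vm_tunables` helper is hypothetical, and it only reads the standard `/proc/sys/vm` entries for the tunables named in the go script above, changing nothing.

```bash
#!/bin/bash
# Print the current value of each VM tunable touched by the go script,
# in a form that can be replayed later with `sysctl -w`.
# The optional argument overrides the directory read (default /proc/sys/vm),
# which is only there so the function can be tried against a copy.
dump_vm_tunables() {
    proc_vm="${1:-/proc/sys/vm}"
    for name in min_free_kbytes dirty_expire_centisecs dirty_writeback_centisecs \
                dirty_ratio dirty_background_ratio; do
        if [ -r "$proc_vm/$name" ]; then
            printf 'sysctl -w vm.%s=%s\n' "$name" "$(cat "$proc_vm/$name")"
        fi
    done
}

dump_vm_tunables
```

Run it once on a freshly booted system without the tweaks and save the output; those lines can then be used to return to the defaults without a reboot.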
WeeboTech Posted November 30, 2011

As I mentioned before, if a disk is offline for any reason, the mere act of unmounting or mounting writes to the disk. Any journal transactions that are replayed also write to the disk (thus updating the superblock at the start of the disk).
lionelhutz Posted November 30, 2011

Isn't this bug new, or at least something else has been changed so it manifests now? I have done a number of disk upgrades on earlier versions and I never had any post parity issues. The first disk upgrade on 4.7 caused 3 parity errors in the post parity check. I am in the camp that says 4.7 as the "stable" release shouldn't have this bug.
dgaschk Posted November 30, 2011

This bug may be manifesting because the unRAID user base is growing.
bcbgboy13 Posted November 30, 2011

Or perhaps the statistics on the incidence of ECC errors are starting to show, due to:

1. the generally increased amount of memory used in newer systems;
2. the increased time needed to perform these critical operations, as hard drive sizes have grown tremendously, and with it the chance of bit-flips from either natural occurrence or power glitches on non-UPS-protected systems.

In that connection I should mention the very high percentage of bad hard drives experienced by some users, contrary to the limited snippets of industry info. If one preclears a 2TB HD 3 times, it will read 12TB of data (coincidentally, this is close to the statistical value for NRRE: one error per 10^14 bits read, or about 12.5TB, for consumer-level disks). And this procedure will take 3 and a half days, enough time for power glitches and bit-flips to manifest themselves...

Now imagine the people claiming to perform 6, 7 or more passes on their older hardware... without ECC and UPS...
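The arithmetic behind that comparison can be checked quickly. This is a rough sketch only: it assumes decimal terabytes and two full read passes per preclear cycle (a pre-read and a post-read), which is an assumption about the preclear script rather than something stated above.

```bash
#!/bin/bash
# One unrecoverable read error per 10^14 bits read, expressed in bytes.
nrre_bits=100000000000000                  # 10^14
nrre_bytes=$(( nrre_bits / 8 ))            # 12.5 TB (decimal)

# Data read while preclearing a 2 TB drive three times.
disk_bytes=2000000000000                   # 2 TB
cycles=3
reads_per_cycle=2                          # assumed: pre-read + post-read
read_bytes=$(( disk_bytes * reads_per_cycle * cycles ))

echo "NRRE interval:  $nrre_bytes bytes"
echo "Preclear reads: $read_bytes bytes"
```

Under those assumptions the three cycles read 12 TB against a 12.5 TB interval, so the two figures are close but not identical. The NRRE spec is also a statistical bound, not a guarantee of an error at that point.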
Joe L. Posted November 30, 2011

Quote from lionelhutz:
> Isn't this bug new, or at least something else has been changed so it manifests now? I have done a number of disk upgrades on earlier versions and I never had any post parity issues. The first disk upgrade on 4.7 caused 3 parity errors in the post parity check. I am in the camp that says 4.7 as the "stable" release shouldn't have this bug.

This bug has been in every version of Linux, in all the "md" drivers that have been in use for years, and was (apparently) only recently identified. I'm not even sure it is fixed in the stock Linux "md" driver in the most recent kernels. If you've run any version of the Linux "raid" driver in past years, you too had the same potential to hit this bug. unRAID has had this code (and the bug it inherited) from its very first 1.050930 release until it was fixed in the recent 5.0beta series.

I think it is showing itself more frequently because the hardware is faster, disks are bigger, the user base of unRAID is larger, and we are learning more about what to look for.
WeeboTech Posted November 30, 2011

Quote from Joe L.:
> I think it is showing itself more frequently because the hardware is faster, disks are bigger, the user base of unRAID is larger, and we are learning more about what to look for.

I would also add larger arrays, which bring a higher chance of a failure and of needing to rebuild a failed disk. Plus the automation of emhttp, in that it automounts the disks, thus creating a write immediately even if a parity sync or update is going on.

I would love to have "start/stop" array and "mount/unmount" array as separate options. If a disk is disabled, it would then require both a start and a mount, so you can decide how you want to handle it.
Zaxxan Posted December 4, 2011

Tom has disclosed a major bug in 4.7 http://lime-technology.com/forum/index.php?topic=13866.0 I believe I've encountered this bug see here: http://lime-technology.com/forum/index.php?topic=12884.msg132178#msg132178 So, where is 4.7.1? It's been 4½ months now, and honestly that's just about 4½ months too long to fix a bug of this severity in the "stable" release branch. I have a drive that's starting to reallocate sectors now and want to rebuild it. Not a happy camper here.

Tom hasn't posted in this thread since July 13th so it doesn't look like he has any interest in this version.
Joe L. Posted December 4, 2011

Quote from Zaxxan:
> Tom has disclosed a major bug in 4.7 http://lime-technology.com/forum/index.php?topic=13866.0 I believe I've encountered this bug see here: http://lime-technology.com/forum/index.php?topic=12884.msg132178#msg132178 So, where is 4.7.1? It's been 4½ months now, and honestly that's just about 4½ months too long to fix a bug of this severity in the "stable" release branch. I have a drive that's starting to reallocate sectors now and want to rebuild it. Not a happy camper here.
> Tom hasn't posted in this thread since July 13th so it doesn't look like he has any interest in this version.

I'd be willing to guess it is the one he is currently selling if you order a flash drive with unRAID installed.

I do agree though... The two known bugs should be fixed in a 4.7.1 patch release (and that should have occurred months ago, when the initial parity/disk-reconstruction bug was discovered and fixed in 5.0beta8).
JackBauer Posted December 4, 2011

A while ago I would have said we should wait it out: get 5 released, then fix 4.7.1. But with the Linux kernel problems, it might make sense to take a detour to 4.7.1 while the kernel problems work themselves out. (And I REALLY want 5. I have two 3TB drives waiting to be used, and don't want to use them until the 5.0 beta is generating very few issues.)
glave Posted December 5, 2011

Is it possible to downgrade to 4.7 from 5.0 beta (14)? NFS issues and drives not spinning down have me wanting to back out.
dgaschk Posted December 5, 2011

Yes it is. Just use the backup of your flash made before the upgrade. Or reverse the install instructions.
glave Posted December 5, 2011

Reverse the install instructions? When I upgraded, all I did was replace the bzimage and bzroot, reboot, and then do the permissions setup. Unfortunately, I had a very short-sighted moment and did not back up the original flash.
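For future upgrades, the two files named above can be copied aside first. This is just a sketch: the `backup_boot_files` helper and the example paths are hypothetical, and it assumes the flash is mounted at `/boot` as on a typical unRAID install. It copies only `bzimage` and `bzroot`, so it does not capture configuration changes such as the permissions setup.

```bash
#!/bin/bash
# Copy the boot files that an upgrade replaces into a dated backup folder.
# Arguments: <flash_dir> <backup_root>. Prints the backup folder on success.
backup_boot_files() {
    flash_dir="$1"; backup_root="$2"
    dest="$backup_root/flash-$(date +%Y%m%d-%H%M%S)"
    mkdir -p "$dest" || return 1
    # -p preserves timestamps so the backup shows which build it came from.
    cp -p "$flash_dir/bzimage" "$flash_dir/bzroot" "$dest/" || return 1
    echo "$dest"
}

# Hypothetical usage before upgrading:
#   backup_boot_files /boot /mnt/disk1/backups
```

Copying the whole flash to another machine is the more complete option, as dgaschk suggests; this only covers the files the upgrade steps described here overwrite.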
marcusone Posted January 24, 2012

Thought I'd bump this up and see if there are any updates on a 4.7.1 release?
abs0lut.zer0 Posted January 25, 2012

Quote from marcusone:
> Thought I'd bump this up and see if there are any updates on a 4.7.1 release?

+1