unRAID Server Release 4.7 "final" Available


limetech


To have the bug affect you, you would have to write to the exact same set of blocks (a stripe) that is being calculated at that specific moment.  As mentioned in another thread, this bug has been in the "md" driver in all versions of Linux for years.

 

It's pretty scary when you think about it silently corrupting something.

It disturbs me because the last issue I had happened to occur in the superblock near the start of the drive.

The superblock was also over 1GB in size because there were over 250,000 files on that drive.

Anytime you come back from an abnormal startup, writes occur when the filesystem is mounted and transactions are replayed.
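
You can usually see that replay in the syslog after an unclean shutdown - something along these lines, though the exact wording of the reiserfs message depends on the kernel version:

# look for journal replay messages from recent mounts
grep -i "replayed.*transactions" /var/log/syslog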

Link to comment


I agree.  It is exactly that set of simultaneous writes to the initial blocks that would trip the bug. 

The parity check that results from a non-clean shutdown is actually a re-construction of parity if parity is out of sync... It could potentially clobber parity, but not data.  A subsequent parity check should fix it.

 

The bigger issue is when re-constructing a replacement data drive.  It is there that you can get into trouble. 

 

Joe L.

Link to comment

Sooo... as a novice, the global moderators are starting to scare me...

Is there ANY way to avoid this error, or what are the best practices?

 

Thanks

Don't write to a disk you are re-constructing or replacing until the re-construction is complete.

 

Thanks, will do that then.

 

lionelhutz points out that even that is not safe. It really needs to be resolved.

 

The whole md5sum database idea I had seems to be crucial now for verifying your file integrity.
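
Doing it by hand with standard tools would be something along these lines (just a sketch - the paths and file names are placeholders, not anything built into unRAID):

# build a checksum database for one data disk while the array is healthy
find /mnt/disk1 -type f -print0 | xargs -0 md5sum > /boot/md5.disk1

# after a rebuild, verify the reconstructed disk and show only the failures
md5sum -c /boot/md5.disk1 | grep -v ': OK$'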

Link to comment


 

So when can we expect it implemented?  ;D

Link to comment

Don't write to a disk you are re-constructing or replacing until the re-construction is complete.

 

Like lionelhutz, I was also not deliberately writing to the disk. Sabnzbd, Transmission, etc. were shut down (otherwise my array won't stop), and it was late at night, so no one else in the house was accessing anything. I experienced this problem on 2 separate occasions: the first rebuild failed the subsequent parity check, and so did the 2nd rebuild. It only started working when I stripped my go script, rebooted, and did the rebuild again. To make things more confusing, the rebuild also worked one time with a full go script.

 

I had the errors in the first 0.1%, so it may be related to this superblock thing WeeboTech writes about - except I didn't have any 'abnormal' shutdown or startup. I just killed sabnzbd, transmission and twonkyserver, then stopped the array, selected the new disk and started it up again.

 

Could this be provoked by any of the 'performance tweak' vm.dirty_xxx settings? I have these in my go script, partly inherited from Purko as I recall:

# Performance Tweaks
# cap individual I/O requests at 128KB and use the CFQ scheduler on all hd*/sd* devices
for i in /sys/block/[hs]d? ; do echo 128 > $i/queue/max_sectors_kb ; done 2>/dev/null
for i in /sys/block/[hs]d? ; do echo cfq > $i/queue/scheduler ; done 2>/dev/null
# dirty-page writeback tuning
sysctl -w vm.min_free_kbytes=8192           # minimum free RAM the kernel keeps reserved          sl:2497
sysctl -w vm.dirty_expire_centisecs=900     # age at which dirty pages become eligible for flush   sl:3000  tm:100
sysctl -w vm.dirty_writeback_centisecs=300  # how often the writeback threads wake up              sl:500   tm:50
sysctl -w vm.dirty_ratio=20                 # % of RAM dirty before writers must flush inline      sl:10    tm:10
sysctl -w vm.dirty_background_ratio=10      # % of RAM dirty before background flushing starts     sl:5     tm:5
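
If it would help rule these in or out, the current values can be snapshotted before the go script changes them and compared afterwards (just a quick sketch - the output file name is arbitrary):

# record the current writeback-related settings so they can be compared or restored later
sysctl vm.min_free_kbytes vm.dirty_expire_centisecs vm.dirty_writeback_centisecs \
       vm.dirty_ratio vm.dirty_background_ratio | tee /boot/sysctl-defaults.txt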

 

More details and syslog from the issues I had are here: http://lime-technology.com/forum/index.php?topic=12884.msg132178#msg132178

 

Link to comment

Isn't this bug new, or has something else at least changed so that it manifests now? I have done a number of disk upgrades on earlier versions and never had any issues in the post-rebuild parity check. The first disk upgrade on 4.7 caused 3 parity errors in the post-rebuild parity check.

 

I am in the camp that says 4.7 as the "stable" release shouldn't have this bug.

Link to comment

Or perhaps memory errors of the kind ECC would catch are now statistically more likely to manifest, due to:

1. the generally larger amount of memory used in newer systems;

2. the increased time needed to perform these critical operations, as HD sizes have grown tremendously, which raises the chance of bit-flips from either natural causes or power glitches on systems without UPS protection.

 

In that connection I should mention the very high percentage of bad hard drives reported by some users, contrary to the limited snippets of industry info - but if one preclears a 2TB HD 3 times, it will read about 12TB of data (coincidentally, roughly the statistical interval for an NRRE spec of 10E14 bits on consumer-level disks). And this procedure will take 3 and a half days - enough time for power glitches and bit-flips to manifest themselves... Now imagine the people claiming to perform 6, 7 or more passes on their older hardware... without ECC and a UPS....
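
Just to put numbers on that (assuming each preclear pass does one full pre-read and one full post-read of the disk):

# 3 preclear passes x 2 full reads per pass on a 2TB drive
echo "$(( 2 * 2 * 3 )) TB read in total"       # 12 TB
# a 1-in-10E14-bit NRRE spec works out to roughly this many TB between unrecoverable read errors
echo "scale=1; 10^14 / 8 / 10^12" | bc         # ~12.5 TB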

Link to comment


This bug has been in every version of Linux, in all the "md" drivers that have been in use for years, and was (apparently) only recently identified.  I'm not even sure it is fixed in the stock Linux "md" driver in the most recent kernels.  If you've run any version of the Linux "raid" driver in past years, you too had the same potential to hit this bug.

 

unRAID has had this code (and the bug it inherited) from its very first 1.050930 release version until it was fixed in the recent 5.0beta series.    

 

I think it is showing itself more frequently because the hardware is faster, disks are bigger, the user base of unRAID is larger, and we are learning more about what to look for.

Link to comment


 

I would also add: larger arrays mean higher chances of a failure and of needing to rebuild a failed disk.

Plus the automation of emhttp, in that it automounts the disks, thus causing a write immediately even if a parity sync or update is going on.

 

I would love to have "start/stop array" and "mount/unmount array" as separate options.

If a disk is disabled, it would require a start and then a mount; that way you can decide how you want to handle it.
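
Something like this is the kind of manual "mount" step I have in mind (purely hypothetical - the device names, filesystem and read-only flag are my assumptions, and emhttp in 4.7 doesn't offer anything like it):

# hypothetical separate mount step: bring the array's data devices up read-only
# so nothing can write to a disk while it is being reconstructed
for n in 1 2 3; do
    mkdir -p /mnt/disk$n
    mount -o ro -t reiserfs /dev/md$n /mnt/disk$n
done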

 

 

 

Link to comment

Tom has disclosed a major bug in 4.7: http://lime-technology.com/forum/index.php?topic=13866.0

 

I believe I've encountered this bug; see here: http://lime-technology.com/forum/index.php?topic=12884.msg132178#msg132178

 

So, where is 4.7.1? It's been 4½ months now, and honestly that's just about 4½ months too long to fix a bug of this severity in the "stable" release branch. I have a drive that's starting to reallocate sectors now and I want to rebuild it. Not a happy camper here.

 

Tom hasn't posted in this thread since July 13th, so it doesn't look like he has any interest in this version.

Link to comment


I'd be willing to guess it is the one he is currently selling if you order a flash drive with unRAID installed.

 

I do agree though...  The two known bugs should be fixed in a 4.7.1 patch release.

(and that should have occurred months ago when the initial parity/disk-reconstruction bug was discovered and fixed in 5.0beta8)

Link to comment

A while ago I would have said we should wait it out - get 5 released, then fix 4.7.1.

 

But with the Linux kernel problems, it might make sense to take a detour to 4.7.1 while the kernel problems work themselves out.

 

(And I REALLY want 5 - I have two 3TB drives waiting to be used, and don't want to use them until the 5.0 beta is generating very few issues.)

Link to comment
