unRAID Server Release 5.0-rc6-r8168-test Available



Well that's quite a release name isn't it?  I guess you could say it's a "test" release of a "release candidate"  ::)

 

For that reason, I'm not posting this release via the normal "download" section of the website.  Instead, here's the link for all you forum readers:

http://download.lime-technology.com/download/unRAID%20Server%205.0-rc6-r8168-test%20AiO.zip

 

And here's the md5sum:

41fde527f0bdfa9d9fdded524d21ddda
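
For anyone who wants to check the download against the md5sum above, here is a minimal Python sketch. The function names are illustrative only and are not part of unRAID; the expected value is the checksum quoted in this post.

```python
import hashlib

# Checksum published in the post above.
EXPECTED_MD5 = "41fde527f0bdfa9d9fdded524d21ddda"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so a large zip need not fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_md5(path, expected=EXPECTED_MD5):
    """True when the file's MD5 equals the expected hex string."""
    return md5_of_file(path) == expected.strip().lower()
```

Calling `matches_published_md5()` with the path of the downloaded zip should return `True` for an intact download; a `False` result suggests the file is corrupt or incomplete.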

 

Here's the situation:

 

1. In this release we're going back to the latest Linux kernel (3.4.4 as of this post).  It has always been my "policy" to keep fairly up-to-date with Linux kernel development.  This is because all the latest h/w support and bug fixes always find their way into the "latest" kernels first.  It's generally up to the maintainers to decide what changes they want to retrofit, and there are changes I see in later change logs that are not in earlier maintained releases (and probably never will be integrated).

 

The motivation specifically for using this kernel is to solve the "hang" issues encountered with mvsas driver and also mpt2sas driver.  The latter is a big clue to this issue in that the core problem is not necessarily with the hardware driver (mvsas or mpt2sas), but rather with another higher level component called "libata".  There is an entry in the kernel change log that hints about this.

 

My own testing has involved this: I have a "magic" hdd that exhibits this error case: when you spin it down, anywhere from 80-120 seconds later it "glitches" something on the controller such that the controller thinks a hot-plug event has occurred.  This is treated as an "error case" in libata and after running through all its hoops, the result is the hdd is brought back online and spun up again.

 

Meanwhile, during this error processing, an unRaid-specific daemon is polling the drives to determine if they are spun up or not.  It's this "CHECK POWER MODE" command that sometimes gets "swallowed" during error recovery; and when it gets lost, the daemon never returns and everything grinds to a halt.
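
To see what such a spin-state poll looks like from userspace, here is a rough Python sketch. It is not the unRaid daemon's code: it assumes the `hdparm` utility is installed and uses `hdparm -C`, which issues the ATA CHECK POWER MODE command and prints a "drive state is:" line. The function names and the timeout value are illustrative.

```python
import re
import subprocess


def parse_drive_state(hdparm_output):
    """Extract the state word (e.g. 'standby', 'active/idle') from `hdparm -C` output."""
    match = re.search(r"drive state is:\s*(\S+)", hdparm_output)
    return match.group(1) if match else None


def is_spun_up(device, timeout_s=10.0):
    """Return True/False for the drive's spin state, or None if the query failed."""
    try:
        result = subprocess.run(
            ["hdparm", "-C", device],
            capture_output=True,
            text=True,
            timeout=timeout_s,
            check=False,
        )
    except (subprocess.TimeoutExpired, OSError):
        # Timed out, or hdparm is not available on this system.
        return None
    state = parse_drive_state(result.stdout)
    if state is None:
        return None
    return state == "active/idle"
```

Note that the timeout here is only a userspace guard. If the command is lost inside the kernel, as described above, the child process may be stuck in an uninterruptible wait and the call may still not return.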

 

So I'm able to get this to happen at will, and in testing with the latest kernel (3.4.4) the error recovery still happens, but the extra commands are never "lost" and everything unwinds correctly.  I'm not ready to say this is the solution to the problem; I need more data points from the user base to determine if it is.

 

The bad news is, those with LSI chipsets need to let me know if this latest kernel now causes issues for them again.

 

2. Next item in the list has to do with "NFS stale file handles".  I've spent a lot of time learning a great deal of the "fuse" and "nfs client" code in the kernel, and I'm fairly confident the fixes in this release will solve this problem.  Again, need some more testing for this now.

 

3. This release also keeps the Realtek driver, which does compile just fine with kernel 3.4.4.  So if there are no reports of problems, I'll just keep this driver in the build and drop the '-r8168'.

 

4. There's a bug fix in here having to do with certain file rename situations in the user share file system.

 

5. Oh, I also increased the number of supported array devices to 24.  This will be the limit for 5.0; more work needs to be done to get past this for 5.1.

 

I'm really sorry all these unresolved problems seem to still exist.  Probably it's not the majority of users who ever see most of these issues.  But once again, I want to thank everyone who supports unRaid.  I will keep trying to improve the code base as fast as possible.

Link to comment
Thanks for 24 drives in array (25 drives total with cache it seems). Now I'm going to have to choose between cache and another data drive with my 24 slot case. :P

 

So far everything seems to be working.. however, just like all other RC releases, I am still getting much slower parity speeds than I do on B14 with my SAS2-MV8 cards. I believe SAS-MV8 users have reported the same.

 

B14: 80-100MB/s limited by my drive's speed.

RC6: 40-45MB/s, this is 20 hours with 3TB parity, which is just too long for something I run monthly.

 

I will keep using RC6 for now, just to make sure I no longer get MV8 SAS errors, however I'll probably go back to B14 in a week or two.
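
The durations behind the speeds quoted above can be checked with simple arithmetic. This Python sketch assumes a constant transfer rate and decimal units (1 TB = 1,000,000 MB); real drives slow down toward the inner tracks, so treat the results as rough estimates. The function name is illustrative.

```python
def parity_check_hours(capacity_tb, speed_mb_s):
    """Hours to read capacity_tb (decimal TB) at a constant speed_mb_s (decimal MB/s)."""
    seconds = (capacity_tb * 1_000_000) / speed_mb_s
    return seconds / 3600


# Speeds quoted in the post above, for a 3TB parity drive.
for label, speed in [("RC6 low", 40), ("RC6 high", 45), ("B14 low", 80), ("B14 high", 100)]:
    print(f"{label}: {parity_check_hours(3, speed):.1f} h")
```

At 40-45MB/s a 3TB check works out to roughly 18.5 to 20.8 hours, consistent with the 20 hours reported; at 80-100MB/s it would be roughly 8.3 to 10.4 hours.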

Link to comment

Up and running on the rc6 release on my Supermicro C2SEE based array.  It has the R8168 chipset.

 

No issues at all.  It does fix the "file rename" issue.  It showed itself on my server when used with XBMC and .nfo files.

As they were updated on my older server, and "rsync'd" to the newer server, they are rsync'd as a temp name, then renamed.  If created on a different disk (through the user-file-system) and renamed, the original file remained causing duplicate files.

I never could figure out what caused it, but it all makes sense now.

 

I just tried the "rsync" once more, and it worked as expected this time.

 

Joe L.

Link to comment

Thanks Tom!

 

If all reported problems are gone with this release but the LSI issues pop back up, are you considering an additional (older kernel) release alongside 5.0-Final? I just figure that if it is only a problem with certain controllers, the majority of unRAID users may be better off with the later kernel rather than an older release which could cause other issues. This way the LSI guys can at least jump up to 5.0 as well until a better solution is found.

 

 


Link to comment

Okay, after spindown/spinup, lots of reiserfs and device not ready errors :(

 

Log file attached.

 

 

Edit:

I've reverted to rc5 for the time being, not wishing to keep disks spun up permanently when the server is running 24/7 in 30+ degrees C.

 

I may simply move drives off the LSI controller for now, and go back to testing rc6 since I am affected by the rename and the stale file handle bugs.

 

If I do that, I guess that I could run some debug tests on the LSI controller by sticking a gash drive on that and provoking the error on a drive outside of the array - would that be of any help Tom?

syslog-20120711-114720.txt.zip

Link to comment

Currently running 3x M1015 LSI cards and have exactly the same symptoms as "PeterB", have rolled back to RC5-r8168 and things are working smoothly again.

Link to comment

 

My own testing has involved this: I have a "magic" hdd that exhibits this error case: when you spin it down, anywhere from 80-120 seconds later it "glitches" something on the controller such that the controller thinks a hot-plug event has occurred.  This is treated as an "error case" in libata and after running through all its hoops, the result is the hdd is brought back online and spun up again.

 

WD Green? My 5.0 server has several drives with issues but the one that always seems to trip things up is a 2G-EARS that generates a hot-unplug/plug.

 

Anyway, been running the new version since just after you posted. I'm dropping my spin-down timer to 15min again to see how things fare.

Link to comment

Joe, do you have a link that explains the file rename issue? I run XBMC and want to ensure I'm not getting any issues.

 

thanks.

Mike.

Link to comment

I've now moved all the array drives off the LSI controller.  System is running fine on rc6.

 

My previous problem with stale file handle on a user share with cache disk enabled appears to be resolved.

 

The problem of duplicate files being created by rename also appears to be resolved.

 

If the LSI problem could be beaten into submission then I would be a very, very, happy user (as opposed to just a happy one!!!).

 

If there is anything I can do to assist with the LSI controller problem (using drives outside of the array), then I would be happy to do so.

Link to comment

Joe, do you have a link that explains the file rename issue? I run XBMC and want to ensure I'm not getting any issues.

 

If your setup encounters this problem, you will get 'duplicate file' messages in your system log.

 

good to know.. would still like to educate myself on the issue though :)

Link to comment

It has nothing to do with XBMC, other than that refreshing the scraped data on a movie in XBMC will often update the .nfo files. I have XBMC using a "Movies" user-share on my FIRST unRAID server.  When I then use "rsync" to mirror the entire "Movies" user share to my SECOND unRAID server, the issue occurs on the SECOND unRAID server.

 

It is an issue with ANY program that transfers a file under a temporary name while it is being created and then renames it to its true name once the entire file is in place.  The issue showed itself when there was an existing file on the server being updated, the existing file was on one disk, and the temporary file was created on another (because of the current available space and allocation method).  The rename does not overwrite the original file, since the original file is on a different disk.

 

Joe L.
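
A quick way to check whether a server has already collected such duplicates is to look for the same relative path on more than one data disk. The Python sketch below is only an illustration: it assumes the data disks are mounted at paths such as `/mnt/disk1` and `/mnt/disk2`, and the function name is made up for this example.

```python
import os
from collections import defaultdict


def find_duplicate_paths(disk_roots):
    """Map each relative file path found under more than one disk root to the roots holding it."""
    seen = defaultdict(list)
    for root in disk_roots:
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                # Compare files by their path relative to the disk root,
                # which is how they would collide inside a user share.
                rel = os.path.relpath(os.path.join(dirpath, name), root)
                seen[rel].append(root)
    return {rel: roots for rel, roots in seen.items() if len(roots) > 1}
```

For example, `find_duplicate_paths(["/mnt/disk1", "/mnt/disk2"])` would return an entry for `Movies/film.nfo` if that file exists on both disks, which is the situation left behind by a rename that did not replace the original.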

Link to comment

My parity check speed is down from around 80MB/sec with previous versions to 32MB/sec with RC6. I am using a different set of SATA ports - four on mobo, two on a Rocket 620 (PCIe 2.0 and SATA 6Gb/s).  I wouldn't expect the change of ports to make such a drastic difference in speed.

Link to comment

So parity check speed is halved on RC6 it seems. I thought this happened on earlier RC versions as well?

Not in all cases.

This and LSI controller issues are the main thing to fix outstanding...would that be fair to say?

A solution is on the horizon I think:

http://thread.gmane.org/gmane.linux.scsi/75915

 

Do you have any leads on what could be causing the parity speed to be drastically lower on some systems? It would suck for unRAID to have to branch into 2...3...4 versions to keep everyone happy, but it seems like it's almost needed in order for 5.0 final to happen.

 

Hope it all gets figured out and a working version for all is released.

Link to comment

My system has AOC-SASLP-MV8 with four 2TB green WD drives, another four on the Asrock H67M-ITX mobo.  Parity check took 5 hours 25 minutes (i.e. completely normal). Starts at around 120MB/sec, finishes at about 55MB/sec.  Has been the same with all previous RC builds for me.

Link to comment

 

A solution is on the horizon I think:

http://thread.gmane.org/gmane.linux.scsi/75915

 

 

The key part of this (confirmed by someone else who tested as well):

"Ok, with a refined test I've been able to reliably reproduce this and I bisected it back to commit 85ef06d1d252f6a2e73b678591ab71caad4667bb in Linus' tree (introduced between 3.0 and 3.1):

commit 85ef06d1d252f6a2e73b678591ab71caad4667bb
Author: Tejun Heo <tj <at> kernel.org>
Date:  Fri Jul 1 16:17:47 2011 +0200

    block: flush MEDIA_CHANGE from drivers on close(2)

Prior to the above commit, sleeping disks will spin up as a result of I/O sent to them.  With the above commit, they don't spin up and immediately return an I/O failure.

That's all the further I've gotten so far.  I'll be happy to test any patches or suggestions."

 

 

So can this be patched? How do things like this work when it affects all kernels going forward, since they will contain the old (broken) code? Does one make a suggestion to the authors and wait for them to respond and decide whether it warrants a change, based on how important it is to them and the community? If that is the case, then I'd be concerned that this will be ignored and it will be a long waiting game to get it officially fixed.

If it was patched however.... 5 final could be around the corner?

 

Great work on the other things though. :)

 

What are your thoughts on the future... forks and hardware dependent versions to get 5 Finalised, or stick to your guns and wait until you can release a universal 5 final?

 

Link to comment

I DO NOT want to see different versions of 5.0 final (as a mod or a user).

 

From the mod side it could/likely would create much more work for us and we have enough trouble getting the info we need when users are running only one version.

 

From the user side, multiple versions will likely only create confusion and therefore a lot of questions/emails/posts to limetech and us mods on "which version of unRAID should I be running".

Link to comment

Upgraded from Ver 5.0 beta14 (with Linux 3.3.1 core) to Ver 5.0 rc6 r8168-test  on both my test bed machine and production machine.  (See sig for description of each)

 

Ran a non-correcting parity check to approximately 5% completion on both systems just before and just after upgrading.  No appreciable change in parity speed or time to completion.  No other significant problems encountered to date on the production server, but it has only been in operation for a few hours.

 

I will continue to use rc6-r8168-test unless I have a problem which puts me into a situation where I am forced to downgrade.  Running Simple Features and apcupsd 3.14.3 currently.

 

 

Link to comment
