[Partially SOLVED] Is there an effort to solve the SAS2LP issue? (Tom Question)


TODDLT


It was an update to the unRAID driver, not the Marvell controller driver.

 

As far as release timeframe, I think we have been doing a pretty good job with follow up releases since 6 was pushed out (already at 6.1.3, actively working on 6.2 now).  So when I say "soon" here, take it in that context.

 

Will this affect other servers, which do not use Marvell based controllers?

Link to comment

It was an update to the unRAID driver, not the Marvell controller driver.

 

As far as release timeframe, I think we have been doing a pretty good job with follow up releases since 6 was pushed out (already at 6.1.3, actively working on 6.2 now).  So when I say "soon" here, take it in that context.

 

Will this affect other servers, which do not use Marvell based controllers?

Possibly.

Link to comment

We "may" push out a 6.1.4 if 6.2 internal testing takes too much longer.  I'm hoping it doesn't come to that though. 6.2 is looking mighty sweet. Bunch of features packed in that NO ONE saw coming. ;-). Sorry to tease.

 

If it's relatively easy to isolate just the driver changes and disk tunable settings for nr_requests, this may be a good idea anyway => you'd have a chance to get feedback from users who are having the issue and would know if this indeed resolves it.

 

Link to comment

We "may" push out a 6.1.4 if 6.2 internal testing takes too much longer.  I'm hoping it doesn't come to that though. 6.2 is looking mighty sweet. Bunch of features packed in that NO ONE saw coming. ;-). Sorry to tease.

 

If it's relatively easy to isolate just the driver changes and disk tunable settings for nr_requests, this may be a good idea anyway => you'd have a chance to get feedback from users who are having the issue and would know if this indeed resolves it.

 

+1. I would also like to see the outcome of this.

Link to comment

Can you or Tom see a mechanism to explain the rarer but much more serious issue of data corruption with the SAS2LP?  I'm wondering if the issue is not just a delay/backup in the queue processing, but a queue overflow or overwrite of an I/O, that might explain the more serious issues a few users have seen, like false (and repeatable) parity check errors.  Unfortunately, those users are the ones most likely to have sold off or trashed their SAS2LP cards, but they are the ones we most need to test this.

 

I was not affected by the parity check speed issue, but I was facing parity sync errors which I had never seen before (Test & Backup Server). Any advice from anyone, or even LT, on how to deal with this?

 

Test it!  Try the change and see if you still have the same issues.  I've been hoping to hear from those like you, with other issues using Marvell controllers, especially the SAS2LP.

 

My own test resulted in almost identical times, essentially 14 hours.  The change in nr_requests made the check about 3.5 minutes faster which, given so many other factors, is not statistically significant.  But it's one more data point showing there isn't a negative effect when using other controllers (SiI3132 and ASM1061).

 

I tested it and it is still not fixed. I ran two parity checks yesterday and the first reported 220 errors, the second 222 errors (Write corrections to parity disk unchecked).

 

I don't think that's a valid test, not without you posting more details for us to go over.

 

You need to make certain everything is matched up to begin with. You need to make certain none of your drives have any pending sector reallocations and that none of them are failing. You also need to make certain your memory, or the memory settings in the BIOS, are not at fault, so do two full passes of memtest.

 

So at the very beginning of the test you need to know your parity should be absolutely correct.

 

Only once you've ruled out everything else can you move on to the parity check. That means doing a correcting parity check at least once after all the drives have been examined in detail. Then, once that has finished, you can do a non-correcting parity check.
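
As a rough pre-flight check along those lines, something like this would surface pending or reallocated sectors before you start comparing parity runs. The sd[b-e] glob is just an example and needs adjusting to your own array devices (smartctl ships with unRAID 6, as far as I know):

for d in /dev/sd[b-e]; do
  echo "== $d =="
  # attributes 5 and 197: Reallocated_Sector_Ct and Current_Pending_Sector
  smartctl -A "$d" | grep -Ei 'reallocated_sector|current_pending'
done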

There was no issue with the server at all. It all started with a red-ball status on one disk and a rebuild after I upgraded to 6.x.

 

However, to avoid any unnecessary discussion, I ran another memory check yesterday (2 cycles) and it came out as expected: 0 errors. I also don't have any SMART errors or pending sector reallocations.

 

Another parity check yesterday reported 222 errors, the second time now. So I started the final parity check earlier today, this time with the "Write corrections to parity disk" option enabled.

Link to comment

Question for the Linux experts ...

 

Can the nr_requests commands ["echo 8 > /sys/block/sdX/queue/nr_requests"]  be put in the GO script, or do these need to be done after the array has started?

 

Since these completely resolve the slow parity check issue on my system, I thought I'd just add them to the GO script ... but want to be sure they'll have the desired impact before doing so.

 

Link to comment

Question for the Linux experts ...

 

Can the nr_requests commands ["echo 8 > /sys/block/sdX/queue/nr_requests"]  be put in the GO script, or do these need to be done after the array has started?

 

Since these completely resolve the slow parity check issue on my system, I thought I'd just add them to the GO script ... but want to be sure they'll have the desired impact before doing so.

You can add it to your go script; it should work fine regardless of whether the array is started before or after nr_requests is changed.
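
For anyone wanting to do the same, a minimal sketch of the go-script addition might look like this. The stock go file just starts emhttp, and the sd[b-h] glob is only an example that has to be adjusted to your own device names:

#!/bin/bash
# Start the Management Utility
/usr/local/sbin/emhttp &

# example only: lower the request queue depth on the array drives
for q in /sys/block/sd[b-h]/queue/nr_requests; do
  echo 8 > "$q"
done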

Link to comment

I promise not to clutter up this thread with help requests. I have read through the majority of this thread and I'm pretty certain that I am suffering from the Marvell-related red-ball and data corruption issue (not sure about the parity check speed issue). I would like to volunteer to be a guinea pig if you guys are still searching for a solution. I have posted logs and info in my own thread here: http://lime-technology.com/forum/index.php?topic=43746.0 but please let me know if there is anything I can provide or try that would help you guys out at all. Other than that, keep up the good work!

Link to comment

Just made an interesting discovery...  In *certain* cases, a firmware update on the drives may also help to eliminate the problem.

 

E.g.: the bulk of my drives are ST3000DM001 (firmware 1CH166).  However, I also have two that are 9YN166.  Those two report ATA8-ACS vs ACS-2 / ACS-3, and when those two drives are attached to my SAS2LP the issue appears.

 

Since I am able to upgrade those drives to 1CH166, doing so would probably solve the problem.  However, I am loath to upgrade the firmware on anything (particularly drives with data on them), so I will not be testing.
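
If anyone wants to check which ATA standard their drives report before going down the firmware route, smartctl's identify output includes that line (sd[b-h] is a placeholder glob here):

for d in /dev/sd[b-h]; do
  printf '%s: ' "$d"
  smartctl -i "$d" | grep -i 'ATA Version'
done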

Link to comment

While I've never had data loss due to upgrading the firmware on a drive, I can certainly understand your reluctance to do so.  [It's also true that I've never done it on a drive with data I didn't have backed up somewhere else ... so the risk was minimal.]

 

But it's nevertheless interesting that without those two drives attached you don't have the issue -- certainly seems likely that if they had the newer firmware the issue would disappear.

 

Link to comment

What's actually more interesting is that I've been one of the people stating for the longest time that "there's no problem at all - everything works fine", because I never saw the issue (parity checks always ~120 MB/s).  It's only after I rearranged my drives yesterday due to an upgrade that I finally saw the problem.

 

Upon investigating further, I have three drives using ATA8-ACS (the 2 Seagates, and 1 of my WD30EZRXs [for which a firmware upgrade may also fix the problem]) and they were all on my motherboard's ports.  Guess I should buy a lottery ticket now; luck's on my side.

Link to comment

Thought I would add my results. I actually have 3 M1015s with 13 drives plus parity.  My parity checks had been sitting at 86 MB/sec with 50% CPU usage since upgrading to 6. I ran

 

echo 8 > /sys/block/sdb/queue/nr_requests

 

for all my drives, and parity speed jumped back up to 120 MB/sec with 20% CPU usage. I do have older firmware on my cards that I have been meaning to update, but this does not seem to be restricted to the SAS2LP.  With the stats before the change I was seeing 1.1 GB/sec disk activity during the parity check; after the change this went up to 1.6-1.7 GB/sec.  During my upgrade I also added 3 disks and an extra M1015, so it was hard to work out what caused the slowdown, but I'm happy with the increase in speed.
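
In case it helps anyone compare numbers, the current value for every drive can be read back in one go with a purely read-only check:

grep . /sys/block/sd*/queue/nr_requests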

 

Link to comment

I wonder if it's going to help those of us who had crashes with SAS2LP doing parity check?

 

I have two SAS2LP cards and my server's been running rock solid for months on 5.0.6.  I decided to upgrade to 6.1.3 (before seeing this thread  :P) and started seeing the unexplained red-ball drive issue when doing a parity check.  Both cards are the "Device 9480" variety, and have "ASPM Disabled".  I attached the lspci output in case there are more details that are important.

 

What I found helpful was updating the firmware on the cards from 1808 to 1812.  You can download the latest firmware and drivers here:

 

http://www.supermicro.com/products/accessories/addon/AOC-SAS2LP-MV8.cfm

 

After updating the firmware, the cards ran more stably on 6.1.3-Pro ... however, my parity check speeds suffered and ultimately I got a red ball on a drive about 5-6 hours into the parity check.

 

My next step was to do the nr_requests fix, which brought my parity speeds back up to excellent levels.  For this test, I was able to complete the parity check without serious incident.  There are some questionable entries in the log, but the parity check returned 0 errors.  Details are in the second attachment.

 

Possibly luck, I'm not sure ... but it ran for just over 9 hours with an average speed of 122 MB/sec.  Maybe not hitting the hardware with so many requests has something to do with it, too.

 

Something else ... while on the Supermicro website, I noticed that a new Linux driver for the SAS2LP was released on November 3rd: version 4.0.0.1544.  Dunno if Limetech has had a chance to look at that yet; I don't think I saw it mentioned earlier in the thread.

 

I'm happy to run some other tests if needed.  Hardware summary is shown below.

cortex-lspci.txt

cortex-parity-check.txt

Link to comment

6.1.4 did not seem to fix my issue and I was not able to complete a parity check.  With 6.1.2 and manually setting nr_requests, though, I was finally able to complete a parity sync.

 

For what it's worth, the red balls were only ever thrown on my ReiserFS drives and never on drives attached to my SASLP controller.  Other than that, I could not find any patterns; I would receive the disconnects/read errors at seemingly random times during parity syncs/checks.

 

I've since swapped out my SAS2LP controller for an M1015 and everything is now working as expected.

Link to comment

Tom and I are still experimenting.  Setting the nr_requests to 8 makes a world of difference for parity checks but does this negatively affect performance for normal read/write operations for anyone?

 

I implemented the nr_requests change on my server with 7 data disks and a parity disk, plus a newer SSD cache drive.  The drives are all WD Reds.  I've never had any issues with parity speed.  I don't know if there is a change in parity speed; the monthly parity check will run tomorrow and I will check.

 

As far as normal read/write operations, there doesn't seem to be any degradation.  I haven't done anything very scientific though.
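
A crude way to spot-check sequential read throughput before and after the change is a direct dd read (not scientific either; /dev/sdX below is a placeholder, and iflag=direct bypasses the page cache so repeated runs stay comparable):

dd if=/dev/sdX of=/dev/null bs=1M count=4096 iflag=direct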

 

Same here. It usually takes my 3TB parity system 10 hours to complete. Let's see what happens.

 

My parity check finished with no noticeable difference in speed, 8 hours (10 hours was before I removed my 2TB drives, doh).

 

My monthly parity check just completed using unRAID 6.1.4 and there was no noticeable difference in time (which is a good thing).

Link to comment

I found this thread because I'm having this issue with random red-balled drives, either during parity checks or during drive rebuilds.  The drives in question are on a SAS2LP controller.  If I want to replace the controller so I can hopefully get rid of this issue for good, what are the recommended options that will support 4TB drives?

 

Thanks,

 

Doug

Link to comment
