[Partially SOLVED] Is there an effort to solve the SAS2LP issue? (Tom Question)


TODDLT

I also noticed that with controllers that show no improvement in speed the CPU utilization during a parity check is lower with the tweak, e.g.:

 

1430SA with 4 SSDs

Default - ~31%

reqs=8 - ~24%

 

Dell H310 with 8 SSDs

Default - ~47%

reqs=8 - ~40%

 

Speed was very similar with the 1430SA and identical with the H310, so I believe that servers running close to max CPU utilization can show a small speed improvement even if the controller used is not affected.

Link to comment

Tom and I are still experimenting.  Setting nr_requests to 8 makes a world of difference for parity checks, but does this negatively affect performance for normal read/write operations for anyone?
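For anyone who wants to try it, the tweak is just a sysfs write.  A minimal sketch, assuming standard Linux sysfs paths and that your array members show up as /dev/sd* (adjust the device list for your system):

# Show the current value, then drop it to 8, for every sd* disk.
# The sd* glob is an assumption -- restrict it to your array members if needed.
for q in /sys/block/sd*/queue/nr_requests; do
    echo "$q: $(cat "$q")"   # stock value is typically 128
    echo 8 > "$q"
done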

 

I implemented the nr_requests change on my server with 7 data disks plus parity and a newer SSD cache drive.  The drives are all WD Reds.  I've never had any issues with parity speed.  I don't know if there is a change in parity speed; the monthly parity check will run tomorrow and I will check.

 

As far as normal read/write operations, there doesn't seem to be any degradation.  I haven't done anything very scientific though.

 

Parity check speed was the same with and without the nr_requests modification.

Link to comment

I think the issue isn't just the nr_requests setting ... it's likely a combination of that AND the amount of CPU "horsepower" you have.    With a new system with a reasonably powerful CPU it probably doesn't matter.

 

For example, the system I've had the issue with is an old C2SEA based system with a dual-core Pentium E6300 (Passmark 1708).    When I upgraded this system to v6, the parity check took ~ 30% longer than on v5 ... and the CPU % was 90%+ for most of the parity check.    With the nr_requests modification the speed is back to what it was previously ... and the CPU utilization has dropped to the 50% range (and below).

 

I suspect that with high-end CPUs the higher number of simultaneous requests is simply not a problem.

 

That raises the question of why parity syncs and disk rebuilds aren't an issue ... but I suspect that's because of a difference in the way they're done relative to parity checks.    And as to why v5 doesn't have the issue -- I suspect that's simply because the 32-bit OS doesn't generate as many simultaneous requests.

 

Link to comment

Dumb question...  Do we think this setting will improve parity check speeds with the original SASLP card? (and V6)

 

No idea.  Looking at SuperMicro's specs, it IS based on a Marvell chip (Marvell 6480), but I don't know if that has the same issues as the 9480 chip used in the SAS2LP.

 

Very easy to test => just do a parity check (Just run it for perhaps 10 minutes and update the status);  then change the nr_requests values and repeat the process.

 

Link to comment

Dumb question...  Do we think this setting will improve parity check speeds with the original SASLP card? (and V6)

 

It does improve a little for me, back to v5 speed.

 

SASLP with 8 SSDs (MB/s):

 

V5.0.6: 80.6
V6.0.0: 72.4
V6.0.1: 73.3
V6.1.0: 70.1
V6.1.1: 71.0
V6.1.2: 71.3
V6.1.3: 71.3
V6.1.3 (nr_reqs=8): 80.4

 

Unfortunately the SASLP is somewhat bandwidth limited, so there’s not a big improvement, but in a big array you can gain 1 or 2 hours.
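For example, on a 4 TB parity disk (an illustrative size, not the SSD test above) the difference between those two speeds works out to roughly 4,000,000 MB ÷ 71.3 MB/s ≈ 15.6 hours versus 4,000,000 MB ÷ 80.4 MB/s ≈ 13.8 hours, i.e. just under two hours saved.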

Link to comment

Apparently the most common controllers that have caused this issue are the SAS2LP and the 1430SA ... but I'm sure there are others (perhaps some motherboards that use Marvell secondary controllers to add extra SATA ports).

 

I didn't see any appreciable difference when running the unRAID tunables tester with nr_requests set to 8 versus 128.  I have drives running on the M/B controllers and on 1430SA controllers on my C2SEA-based system.

Link to comment

 

I suspect with high-end CPU's the higher number of simultaneous requests are simply not a problem.

 

 

 

Are you saying someone with a more powerful CPU shouldn't see a parity speed decrease with or without the changes?

 

My average parity speed is pretty slow, but when I perform a speed test on the individual drives I get pretty good speeds, from the slower drives at 112 MB/s to the faster drives around 150 MB/s.  I would think a parity check on 4 TB would be a lot faster than 15 hours, but I guess the combination of drives I have is holding it back.

 

Last checked on Mon 02 Nov 2015 02:52:59 PM EST (today), finding 0 errors.

Duration: 15 hours, 12 minutes, 48 seconds. Average speed: 73.0 MB/sec

 

These are the drives. All drives sitting on 2 Dell H310 cards.

Parity        WDC 4001FAEX-00MJRA0_WD-WCC1F0127613 - 4 TB

Disk 1  WDC_WD2002FAEX-007BA0_WD-WMAWP0099064 - 2 TB

Disk 2  HGST_HDN724040ALE640_PK2338P4HE525C - 4 TB

Disk 3  Hitachi_HDS723020BLA642_MN5220F32HHMJK - 2 TB

Disk 4  HGST_HDN724040ALE640_PK2338P4HEPH8C - 4 TB

Disk 5  Hitachi_HDS723020BLA642_MN1220F30803DD - 2 TB

Disk 6  WDC_WD2002FAEX-007BA0_WD-WMAWP0284322 - 2 TB

Disk 7  HGST_HDN724040ALE640_PK1334PCJY9BRS - 4 TB

Disk 8  WDC_WD4001FAEX-00MJRA0_WD-WCC130263021 - 4 TB

Disk 9  WDC_WD4003FZEX-00Z4SA0_WD-WCC130966733 - 4 TB

Disk 10  HGST_HDN724040ALE640_PK2334PCGYD32B - 4 TB

Disk 11  WDC_WD2001FASS-00U0B0_WD-WMAUR0142521 - 2 TB

Disk 12  WDC_WD2002FAEX-007BA0_WD-WCAY01300087 - 2 TB

Link to comment

No, I'm just suggesting that this could be the reason some are seeing varying results.

 

I think there's clearly SOME issue with v6 and Marvell-based controllers ... but based on various users' reports, the slowdown doesn't seem to be consistent.    Clearly with various chipsets, CPUs, disks, etc. this is to be somewhat expected, but even with very similar systems there are differences.  It does seem that those with faster CPUs are experiencing less slowdown in general (but there are exceptions to that as well).

 

It's difficult to say exactly what the issue is => it's strange that this wasn't a problem with v5;  isn't manifested with parity syncs or drive rebuilds with v6; but has such an impact with v6 parity checks.    What seems clear is that setting nr_requests to 8 resolves the issue for those that are seeing the slowdowns; and doesn't have any negative impact on those that aren't (or on other array operations).

 

The mix of drives is absolutely a major factor in how long a parity check takes => the check can never proceed any faster than the slowest drive that's still involved in the check ... so with older, lower-density drives in the mix, this can be a significant delay.
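As a quick sanity check on the numbers above: 4 TB is roughly 4,000,000 MB, and 4,000,000 MB ÷ 73 MB/s ≈ 54,800 seconds ≈ 15.2 hours, so the reported duration and average speed are consistent.  Keep in mind that per-drive benchmark figures are taken on the fast outer tracks; drives slow down considerably toward their inner tracks, and the check is always paced by the slowest drive at any given point, which is why the average ends up well below the individual drive speeds.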

 

 

 

Link to comment

Does anyone remember that it's not only a slowdown issue?  In some configurations a parity check with the SAS2LP causes a drive to redball and the controller to lock up.

 

Yes, I've mentioned it a couple of times.  I've been hoping that users with the more serious problems will test this, and see if this fix makes a difference with those issues too.

Link to comment

Does anyone remember that it's not only a slowdown issue?  In some configurations a parity check with the SAS2LP causes a drive to redball and the controller to lock up.

 

Yes, I've mentioned it a couple of times.  I've been hoping that users with the more serious problems will test this, and see if this fix makes a difference with those issues too.

 

I haven't been able to get caught up on my own thread this past week until tonight.  Very excited to see there has been a workaround/fix found, as I just parked my SAS2LP card in its original box and gave it time.

 

I couldn't help noticing you mention this a couple of times and somehow missed it up until tonight.  What do we know about this issue?  Do we know what the problem configurations are?  Is there some way to figure out if you have this issue before it causes a data issue, or is there no way to know until it's too late?

 

If there is some way to "know" if I am safe or not, I'll try the SAS2LP again this weekend and see what happens.

 

Thanks.

Link to comment

I was doing some digging on what this meant... found this statement.

 

nr_requests (RW): This controls how many requests may be allocated in the block layer for read or write requests. Note that the total allocated number may be twice this amount, since it applies only to reads or writes (not the accumulated sum).

 

Here is the page I found it on.

http://www.monperrus.net/martin/scheduler+queue+size+and+resilience+to+heavy+IO

 

I don't fully understand all the implications, but thought I'd post it for anyone that would find it useful.  It at least gave me some idea what we were toying with. 

Does anyone see a down side whatsoever to changing these settings?

Link to comment

I don't see any downside in unRAID applications.    If we were a reservation system processing thousands of tickets, or some other high-volume transaction system, I suppose it might have a performance impact.    But I don't think anyone's in that boat with their unRAID systems.

 

In any event, I've tried several things -- multiple active streams;  copying several files at once (from multiple clients); and have seen NO performance impact.

 

But it sure fixes my parity check speeds !!  :)

Link to comment

... HOWEVER => the corruption issue with SAS2LP's certainly needs to be isolated.  It'd be nice if testing shows that this fix resolves that as well.    I don't have that card, so can't help ... and I don't know if those who have had that issue had found a "repeatable sequence" that can be used to confirm whether or not this change resolves it.

 

Link to comment

Can you or Tom see a mechanism to explain the rarer but much more serious issue of data corruption with the SAS2LP?  I'm wondering if the issue is not just a delay/backup in the queue processing, but a queue overflow or overwrite of an I/O, that might explain the more serious issues a few users have seen, like false (and repeatable) parity check errors.  Unfortunately, those users are the ones most likely to have sold off or trashed their SAS2LP cards, but they are the ones we most need to test this.

 

I was not affected by the parity check speed issue, but I was facing parity sync errors which I had never seen before (Test & Backup Server).  Any advice from anyone, or even LT, on how to deal with this?

 

Test it!  Try the change and see if you still have the same issues.  I've been hoping to hear from those like you, with other issues using Marvell controllers, especially the SAS2LP.

 

My own test resulted in almost identical times, essentially 14 hours.  The change in nr_requests made the check 3.5 minutes faster, which, given so many other factors, is not statistically significant.  But it's one more data point that there isn't a negative effect using other controllers (SiI3132 and ASM1061).

 

I tested it and it is still not fixed. I ran two parity checks yesterday and the first reported 220 errors, the second 222 errors (Write corrections to parity disk unchecked).

Link to comment

Can you or Tom see a mechanism to explain the rarer but much more serious issue of data corruption with the SAS2LP?  I'm wondering if the issue is not just a delay/backup in the queue processing, but a queue overflow or overwrite of an I/O, that might explain the more serious issues a few users have seen, like false (and repeatable) parity check errors.  Unfortunately, those users are the ones most likely to have sold off or trashed their SAS2LP cards, but they are the ones we most need to test this.

 

I was not affected by the parity check speed issue, but I was facing parity sync errors which I had never seen before (Test & Backup Server).  Any advice from anyone, or even LT, on how to deal with this?

 

Test it!  Try the change and see if you still have the same issues.  I've been hoping to hear from those like you, with other issues using Marvell controllers, especially the SAS2LP.

 

My own test resulted in almost identical times, essentially 14 hours.  The change in nr_requests made the check 3.5 minutes faster, which, given so many other factors, is not statistically significant.  But it's one more data point that there isn't a negative effect using other controllers (SiI3132 and ASM1061).

 

I tested it and it is still not fixed. I ran two parity checks yesterday and the first reported 220 errors, the second 222 errors (Write corrections to parity disk unchecked).

 

I don't think that's a valid test, not without you posting more details for us to go over.

 

You need to make certain everything is matched up to begin with.  You need to make certain none of your drives have any pending sector reallocations and that none of them are showing other signs of trouble.  You also need to make certain your memory or memory settings in the BIOS are not at fault, so do two full passes of memtest.

 

So at the very beginning of the test you need to know your parity is absolutely correct.

 

Only once you've ruled out everything else can you then move on to the parity check.  That means doing a correcting parity check at least once after all the drives are examined in detail.  Then once that is finished you can do a non-correcting parity check.
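A quick sketch of the drive-health part of that checklist, assuming smartmontools is available and the array members appear as /dev/sd* (adjust the device names for your system):

# List reallocated and pending sector counts for each disk before trusting
# any parity comparison.  Non-zero pending sectors mean the drive itself
# may be the source of mismatches.
for d in /dev/sd[a-z]; do
    echo "== $d =="
    smartctl -A "$d" | grep -E 'Reallocated_Sector_Ct|Current_Pending_Sector'
done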

Link to comment

Dumb question...  Do we think this setting will improve parity check speeds with the original SASLP card? (and V6)

 

No idea.  Looking at SuperMicro's specs, it IS based on a Marvell chip (Marvell 6480), but I don't know if that has the same issues as the 9480 chip used in the SAS2LP.

 

Very easy to test => just do a parity check (Just run it for perhaps 10 minutes and update the status);  then change the nr_requests values and repeat the process.

 

Ok..  With the default I was at 90-95 MB/s.  With the change I was at ~102-106 MB/s.

So there is a speed-up for the original SASLP card as well.

 

I guess I put this in the go script?
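For what it's worth, the usual way to make a tweak like this survive a reboot is to append it to the go script on the flash drive (/boot/config/go), which unRAID runs at boot.  A sketch, assuming the sd* glob covers your array disks:

# Lines appended to /boot/config/go -- runs once at boot.
# sd* is an assumption; restrict the glob to your array members if you prefer.
for q in /sys/block/sd*/queue/nr_requests; do
    echo 8 > "$q"
done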

 

Jim

Link to comment

I was doing some digging on what this meant... found this statement.

 

nr_requests (RW): This controls how many requests may be allocated in the block layer for read or write requests. Note that the total allocated number may be twice this amount, since it applies only to reads or writes (not the accumulated sum).

 

Here is the page I found it on.

http://www.monperrus.net/martin/scheduler+queue+size+and+resilience+to+heavy+IO

 

I don't fully understand all the implications, but thought I'd post it for anyone that would find it useful.  It at least gave me some idea what we were toying with. 

Does anyone see a down side whatsoever to changing these settings?

 

Is this the same queue size that was talked about before in another thread regarding the different firmwares flashed to the cards?  All I remember is someone said the Dell IT firmware only had a queue depth of 25 and the LSI firmware had a queue depth of 600, which changed the performance.  I did use both firmware versions and did not notice any difference in performance.

 

Queue depth

Queue size

Are these the same?
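They're related but not identical: nr_requests is the size of the block-layer scheduler queue that this thread's tweak changes, while the queue depth discussed in the firmware thread is how many commands the drive/HBA will accept at once (the NCQ/TCQ depth).  Both can be read from sysfs; sdb below is just a placeholder device:

cat /sys/block/sdb/queue/nr_requests     # block-layer scheduler queue (this thread's tweak)
cat /sys/block/sdb/device/queue_depth    # device/HBA command queue depth (NCQ territory)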

Link to comment

I couldn't help noticing you mention this a couple of times and somehow missed it up until tonight.  What do we know about this issue?  Do we know what the problem configurations are?  Is there some way to figure out if you have this issue before it causes a data issue, or is there no way to know until it's too late?

 

If there is some way to "know" if I am safe or not, I'll try the SAS2LP again this weekend and see what happens.

Certainly, we can't ask anyone to test something that could cause data issues, so ideally there's someone who can try it on a test system.  Or at least on a system with EVERYTHING backed up elsewhere.  If you do decide to test, we'll be grateful!  But it IS YOUR data, not ours we're risking!  On the other hand, if it has already caused parity issues, where you can't trust the current validity of your parity drive, then you're already in trouble, and have less to lose.  You NEED a parity drive you have confidence in, so this testing has clear benefits over the risks for you.  (not sure how clearly I worded that)

 

I was doing some digging on what this meant...

Here is the page I found it on.

http://www.monperrus.net/martin/scheduler+queue+size+and+resilience+to+heavy+IO

That's an interesting page, but the problem he was dealing with is almost exactly the opposite of ours.  He wanted better performance for high I/O random access drive requests, and found literature that indicated increasing nr_requests allowed NCQ/TCQ to better reorder the requests for optimized head movement.  Our parity and rebuild operations are essentially entirely sequential access, not random access, so there is NO advantage at all in an increase, or any reordering.  And besides, unRAID turns off NCQ, because we found no gains and in some cases significant slowdowns with it on.  (That was quite a while ago; it may have improved by now.)  But NCQ is still useless with sequential access.
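On the NCQ point: the exact mechanism unRAID uses isn't shown here, but the conventional way NCQ is disabled per device on Linux is by forcing the device queue depth to 1, which you can verify or set from sysfs (sdb is a placeholder device):

cat /sys/block/sdb/device/queue_depth        # 1 means no native command queuing
echo 1 > /sys/block/sdb/device/queue_depth   # sketch: disable NCQ for this device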

 

Our problem seems to be buggy queue handling in the Marvell code, and this 'fix' seems to get the Marvell code to work more safely.

 

In any event, I've tried several things -- multiple active streams;  copying several files at once (from multiple clients); and have seen NO performance impact.

Good test!  If we're going to find any negatives from this, it's probably going to be from this kind of testing: as many clients as possible sending simultaneous requests to the same drive.  A local Docker container or VM managing a database might be able to generate even more random requests.
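A crude sketch of that kind of concurrency test, with hypothetical file paths -- point several simultaneous readers at large files on the same array disk and compare throughput against a single-stream run:

# Hypothetical paths on one unRAID disk share; replace with real large files.
for f in /mnt/disk1/big1.mkv /mnt/disk1/big2.mkv /mnt/disk1/big3.mkv; do
    dd if="$f" of=/dev/null bs=1M iflag=direct &
done
wait   # let all readers finish before reading off the dd throughput figures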

 

Time to update the topic title, mods?

What would you suggest?

Link to comment
