avoiding sequential 2x reads and 2x writes on write?


eweitzman


First, an introduction. I've just started using unRAID Plus (not Pro) and I like it a lot. It has the right trade-offs for me, and is replacing a slow, dedicated RAID NAS box and some unprotected drives in a PC.

 

I'm a developer. I worked on various unixes (and other OSes) from the mid 80s to the mid 90s. Getting ps -aux, ls -lRt, top, and even vi back into L2 has been a trip. (vi doubly so for an emacs guy.) I dug up a 1989 spiral-bound O'Reilly Nutshell book on BSD 4.3 that I bought back in the day. I'm not a kernel programmer, driver programmer, or hardware guy, so the following thoughts may be naive. Clue me in if you can.

 

I've read that unRAID has to do two disk reads and two disk writes for each data chunk when writing a file. See http://lime-technology.com/forum/index.php?topic=4390.60 for Tom's description. That description, and Joe L.'s and bubbaQ's posts, make it sound like these operations must be done sequentially, with seeks between each op, waiting for the start sector to spin back under the heads, and so on. All this waiting seems unnecessary to me, except for files that only occupy part of a track, and you'll never get high throughput with those anyway because of all the directory activity. With large files, the parts don't necessarily have to be read, written, or processed sequentially along the length of the file.

 

Let me illustrate. Multithreaded code could issue 20 synchronous reads, each from a separate thread, at different locations in a file. The drive will optimize how it retrieves that data. When the IO requests complete in somewhat random order and the threads unblock, each can work with its chunk of data to compute or update parity. After this, the write commands can be issued in any order, and again the drive will reorder the commands to write with the fewest seeks and the least rotational latency. It seems to me that allowing out-of-order processing in the code, coupled with smart IO request reordering in the drive firmware, can keep the heads moving smoothly through a file for relatively long periods. Of course, there will be some limits imposed by memory and by interleaved block operations in the code, but if 20 tracks can be processed at a time this way, with only one or two seeks instead of 20, it's a big win.
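
To make the idea concrete, here is a minimal user-space sketch of what I mean (not unRAID code -- the path, thread count, and chunk size are made up, and the parity step is just a placeholder). Each thread issues a blocking pread() at its own offset, and the block layer and the drive's queue are free to service those requests in whatever order minimizes head movement. Compile with -lpthread.

#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define NTHREADS   20
#define CHUNK_SIZE (64 * 1024)

static int fd;                            /* opened read-only in main() */

static void *read_chunk(void *arg)
{
    long idx = (long)arg;                 /* which chunk this thread owns */
    char *buf = malloc(CHUNK_SIZE);
    off_t off = (off_t)idx * CHUNK_SIZE;

    if (buf && pread(fd, buf, CHUNK_SIZE, off) < 0)
        perror("pread");
    /* ...compute or update parity for this chunk here... */
    free(buf);
    return NULL;
}

int main(void)
{
    pthread_t tid[NTHREADS];

    fd = open("/mnt/disk1/bigfile", O_RDONLY);    /* made-up example path */
    if (fd < 0) { perror("open"); return 1; }

    for (long i = 0; i < NTHREADS; i++)
        pthread_create(&tid[i], NULL, read_chunk, (void *)i);
    for (long i = 0; i < NTHREADS; i++)
        pthread_join(tid[i], NULL);

    close(fd);
    return 0;
}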

 

I'm sure this has been investigated, or that there are underlying reasons in unRAID's architecture that make this infeasible. Anybody know the reasons?

 

Also, I'm very interested in reading an overview of unRAID's architecture: custom daemons, drivers, executables, and so on. Any pointers would be appreciated.

 

Thanks,

- Eric

 

 

Link to comment

IIRC, the code is multithreaded and blocks can be processed out of order -- and as you noted, the drive cache will reorder disk I/O to be more efficient ... but the read of a particular sector must take place before the write to that sector, and the reads of both the data disk and the parity disk must take place before the write to parity.  So there is a bundle of four operations done by the driver, driven by a single write from the OS.
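
In rough C, the logic of that bundle is something like this (a sketch only, not unRAID's actual driver code): the new parity is the old parity XORed with the old data and the new data, which is why both reads have to finish before either write can be issued.

#include <stdint.h>
#include <stddef.h>

/* Per-block read-modify-write: old_data and old_parity must be read
 * first, because the new parity is derived from them; only then can
 * the two writes go out. */
void parity_rmw(const uint8_t *old_data, const uint8_t *old_parity,
                const uint8_t *new_data, uint8_t *new_parity, size_t len)
{
    for (size_t i = 0; i < len; i++)
        new_parity[i] = old_parity[i] ^ old_data[i] ^ new_data[i];
    /* then: write new_data to the data disk, new_parity to the parity disk */
}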

 

What you are describing is a form of queuing and determinative read-ahead (knowing exactly what needs to be read ahead, rather than the more generic and simple non-determinative optimistic read-ahead of something like track buffering).  That requires breaking the bundle, and preventing subsequent writes until the proper reads have taken place.  That is a LOT of overhead, particularly in an IRQ-driven environment.

 

We already have simple non-determinative optimistic read-ahead... both in the O/S and in disk drives themselves.

 

The value in the system you propose will lie in the performance increase of complex determinative read-ahead versus simple non-determinative optimistic read-ahead.  With many small writes and fragmented disks, the performance increase will be larger than when dealing with large files and unfragmented disks.  Most unRAID users are in the latter category.

 

You will also see less of a performance increase with modern disk drives that have large on-board caches (e.g., WD EARS drives and the newer Samsungs with 64MB of on-drive cache).

 

The much simpler approach is a caching disk controller.  You get most of the benefits you describe, w/o having to rewrite any unRAID code.

 

I have been considering, since unRAID was made SMP safe, whether pinning the MD driver to one CPU might be faster, as it would eliminate expensive cache invalidations.  One way to test this would be to benchmark writes to a parity-protected disk with SMP, then reboot the same system limited to a single CPU and compare the benchmarks.

Link to comment

IIRC, the code is multithreaded and blocks can be processed out of order -- and as you noted, the drive cache will reorder disk I/O to be more efficient ... but the read of a particular sector must take place before the write to that sector, and the reads of both the data disk and the parity disk must take place before the write to parity.  So there is a bundle of four operations done by the driver, driven by a single write from the OS.

I see. I've been looking at this as if the unRAID code were higher up, i.e., not a driver, and had knowledge of what needed to be written beyond a single block or atomic disk operation. If each call to the driver by the OS has no knowledge of previous and forthcoming calls (that is, of the data to read/write), then it would be very gnarly to have the driver coordinate with other invocations of itself.

 

From what I've gleaned since last night, there are three main parts to unRAID:

 

md driver - unRAID-modified kernel disk driver

shfs - shared file system (user shares?) built on FUSE

emhttp - management utility

 

Any others?

 

- Eric

Link to comment

bubbaQ,

 

Browsing through the driver code, I see it can hold on to ~1200 "stripes" of data. This term must be a legacy from the days when this code was really a RAID driver, right?

 

Anyway, if the driver is aware of 1200 simultaneous IO requests, perhaps some of them can be grouped, reordered, and processed so that a large series of data reads on adjacent tracks is done in parallel with a similar series of parity-drive reads. That is, if the requested stripes have some sort of addressing that can be mapped to drive geometry, there is the possibility of disk-sequential, deterministic reading instead of "non-determinative optimistic read-ahead." After these complete -- with minimal seeks and no wasted rotations -- parity would be computed, and then both drives would be written in the same order as the reads.
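
Roughly, I'm picturing something like this inside the driver (purely illustrative -- the struct and function names below are made up, not the real stripe_head): gather the stripes waiting on the same kind of operation and sort them by sector before issuing the IO, so the requests go out in roughly platter order.

#include <stdlib.h>

/* Hypothetical pending-stripe descriptor, not the real md stripe_head. */
struct pending_stripe {
    unsigned long long sector;    /* starting sector of the stripe */
    int                state;     /* e.g. waiting-for-read, waiting-for-write */
};

static int by_sector(const void *a, const void *b)
{
    const struct pending_stripe *x = a, *y = b;
    return (x->sector > y->sector) - (x->sector < y->sector);
}

/* Sort a batch of stripes that are in the same state into ascending
 * sector order, then issue their reads (or writes) back to back. */
void order_batch(struct pending_stripe *batch, size_t n)
{
    qsort(batch, n, sizeof(*batch), by_sector);
    /* ...submit the IO for batch[0..n-1] in this order... */
}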

 

------

 

Can anyone recommend a pithy summary/overview of Linux driver architecture and programming? I'll look at this reordering and batching idea in more detail once I understand the overall picture better.

 

- Eric

Link to comment

... if the requested stripes have some sort of addressing that can be mapped to drive geometry...

 

With modern drives, you can't GET physical drive geometry.  It is translated by the drive.

 

Okay, that was sloppy. Let me rephrase it:

 

...if the requested stripes have some sort of addressing that can be used to order and group them so they can be retrieved sequentially in batches...

 

A cursory glance shows a large buffer (unsigned int memory) allocated for an array of MD_NUM_STRIPE stripe_head structs. stripe_head has sector and state members. state could be used to prepare a list of stripes waiting to be read or written (i.e., that are in the same state), while sector could be used to order them into batches that could be read or written sequentially.

 

At this point, of course, I don't understand the pattern of calls to the driver nor how such batching would be set up. :( One strategy could be to wait for a bit after the first IO request -- perhaps the approximate time of one drive rotation? -- and then process the first and subsequent calls as a single read/write request if the sectors are numbered sequentially. Overhead and delay like this may be verboten in a driver, though. I dunno. :(

 

- Eric

 

Link to comment

Problem is, the "md" driver still needs to go through reiserfs and the disk's own driver.  The IDE/SATA driver uses the buffer cache, and depending on which disk scheduler is currently enabled you'll get vastly different results.

 

Instead of the default "noop" (FIFO) disk scheduling, you might try "anticipatory" or "cfq" queuing mode.

See here: http://www.linux-mag.com/id/7564/1/

and here: http://www.linux-mag.com/id/7564/2/

and here: http://blog.carlosgomez.net/2009/10/io-scheduler-and-queue-on-linux.html

 

unRAID 4.5 by default has the "noop" queuing mode enabled.  Disk requests are first-in-first-out.  

 

You can also experiment with the queue size as shown here:

http://www.linux-mag.com/id/7564/1/

 

All of these settings are per disk and must be set each time you reboot.

If you wanted to set them for all of the disks, a small script like these two might work:

 

for i in /sys/block/[sh]d*/queue/scheduler
do
  echo anticipatory > $i
done

 

and

 

for i in /sys/block/[sh]d*/queue/nr_requests
do
  echo 5000 > $i
done

 

Joe L.

Link to comment

Problem is, the "md" driver still needs to go through reiserfs and the disk's own driver.

 

That really changes things... My ignorance of the big picture shows.

 

One could hope that if md batches small, apparently sequential requests and asks for one large transfer down the chain, it would be more efficient. That would need to be tested before spending time on serious coding.
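
The kind of merging I have in mind is roughly this, at whatever layer it would fit (a hypothetical sketch -- the struct and function are made up, and it assumes the pending requests are already sorted by sector):

#include <stddef.h>

/* Hypothetical block request: starting sector plus length in sectors. */
struct req {
    unsigned long long sector;
    unsigned int       nsectors;
};

/* Merge sector-contiguous requests into larger ones, in place.
 * Assumes reqs[] is sorted by sector.  Returns the new count. */
size_t coalesce(struct req *reqs, size_t n)
{
    size_t out = 0;
    for (size_t i = 0; i < n; i++) {
        if (out > 0 &&
            reqs[out - 1].sector + reqs[out - 1].nsectors == reqs[i].sector)
            reqs[out - 1].nsectors += reqs[i].nsectors;   /* extend the run */
        else
            reqs[out++] = reqs[i];                        /* start a new run */
    }
    return out;
}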

 

- Eric

Link to comment

I was just pointing out that there is a lot you can try without coding.  The "Settings" page in unRAID has a field that defaults to NCQ Disable=Yes.  You can enable the disk's native command queuing and see if it helps.

 

Again, the best choice when playing media or checking parity (mostly sequential reads) is very different from the best choice when writing to disks (interleaved read/write pairs on the involved disks).

 

The I/O pattern "might" also be different when calculating parity for the first time (or when it is invalidated): sequential reads from all the data disks and sequential writes to the parity disk.  I said "might" because I did not look in the code to see whether unRAID detects this special case and eliminates the read/write pair logic on the parity drive.  It could, since the value being written does not depend on the parity drive's existing contents, only on the data disks.
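
If unRAID does take that shortcut, the per-block math for a full sync would reduce to something like this (a sketch under that assumption, not the actual driver code): parity is just the XOR of the corresponding block from every data disk, so the old parity never needs to be read.

#include <stdint.h>
#include <string.h>

/* XOR the same block from every data disk to produce the parity block. */
void parity_full_sync(uint8_t *parity, uint8_t *const data[],
                      size_t ndisks, size_t len)
{
    memset(parity, 0, len);
    for (size_t d = 0; d < ndisks; d++)
        for (size_t i = 0; i < len; i++)
            parity[i] ^= data[d][i];
}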

 

The current 4.5 unRAID also allows you to tune the number of stripes, the write-limit, and the sync-window...  More tunables to play with.  No need to be coding at all...

 

Joe L.

Link to comment
I was just pointing out there is a lot you can try without coding.

 

Yeah, I got what you were saying about how tuning will change performance one way or the other for different scenarios. I will try some of these once the array is loaded. My (waning?) interest is still about the 2xR+2xW thing.

 

The I/O pattern "might" also be different when calculating parity for the first time (or when it is invalidated): sequential reads from all the data disks and sequential writes to the parity disk.

 

unraid.c's header talks about two state transitions specifically for computing parity, but I haven't read into the code to see if they handle first-time parity calcs differently. I will find out in a few days how it performs when I add a parity drive to my parity-less array, whose data disks are being loaded now. The disks will run at near full bandwidth during the parity calc, or a lot lower, depending.

 

The current 4.5 unRAID also allows you to tune the number of stripes, the write-limit, and the sync-window...  More tunables to play with.  No need to be coding at all...

 

I've searched the forum and other documentation for info about these and came up with nothing. Do you know of any references?

 

I'd appreciate it if you could tell me which (if any) of the settings you suggested tweaking earlier is controlled by the "Force NCQ disabled" setting on the unRAID Settings page.

 

Thanks,

- Eric

Link to comment

for i in /sys/block/[sh]d*/queue/nr_requests
do
  echo 5000 > $i
done

 

I did some tests with different nr_requests values.  Turns out, it makes no difference on unRAID.

 

As unRAID doesn't support any hardware RAID cards, the highest value for device/queue_depth we'll ever see is 31.

So increasing nr_requests to anything higher than the default 128 doesn't bring any improvement whatsoever.

 

Purko

 

 

Link to comment
