January 6, 2011 Are you up to discussing the general idea behind this feature (Driver Rewrite)? I'm interested in how you think the speeds could be improved at a conceptual level.
January 6, 2011 How familiar are you with the md-driver code and how it works?
January 7, 2011 Author It's been a while since I looked at it, but from what I can remember... Most of the interesting aspects of scheduling and dealing with I/O are in unraid.c. In the last rewrite you put into place a series of stripes that can be queued up, each specifying what's to be done: purely reading, purely writing, or syncing with possibly both read and write (reconstruction or parity error?). I think it was managed via two overall lists for reading and writing, while there were two other lists for active and inactive stripes. I remember seeing one buffer per disk, each larger than a single sector, where you manage buffer/cache coherency based on the buffer being uptodate and locked. I think it used one lock per active stripe as well as a device lock. It seemed like things were done in a FIFO manner. The array management aspects and the larger overall parity check functions are within md.c, or at least a portion of them are, while the functions are invoked from emhttp. Obviously there are far more details with several nuances involved. I don't recall whether, or how many, threads are invoked for performing the work beyond the ones that show up as 'unraidd' and 'mdrecoveryd'.
January 7, 2011 Ok, good enough. Right, the changes would be in unraid.c. This was originally adapted from linux md raid5.c, and the code is oriented toward raid5. In particular it is designed around the concept of a "stripe". Each stripe represents the set of 4K blocks at a common address on all the drives in the array. This driver thus issues I/O in 4K units and relies on the underlying disk drivers to coalesce the small I/O and send larger ones down to the drives (which it does). But then all the synchronization is done on a 4K unit as well. This causes unnecessary delays in the read/modify/write required to update parity. I have another idea in mind that would streamline this and would result in fewer disk revolutions lost, which means greater write performance. But... it's one of those "if it works, don't break it" kind of projects. But when it comes time to implement P+Q redundancy, I might look at this again.
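For readers unfamiliar with the read/modify/write cycle mentioned above, here is a minimal sketch of a single-parity update on one 4K block. It assumes plain XOR parity; the names `xor_blocks` and `update_parity` are illustrative only and are not taken from unraid.c.

```python
BLOCK_SIZE = 4096  # the post above says the driver works in 4K units


def xor_blocks(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equally sized blocks."""
    return bytes(x ^ y for x, y in zip(a, b))


def update_parity(old_data: bytes, old_parity: bytes, new_data: bytes) -> bytes:
    """Return the new parity block for a write to one data drive.

    Both old blocks must be READ before the two new blocks can be WRITTEN,
    which is why each drive sees a read followed by a write to the same spot.
    new_parity = old_parity XOR old_data XOR new_data
    """
    return xor_blocks(xor_blocks(old_parity, old_data), new_data)
```

The point of the sketch is only that the other data drives are never touched: the old data and old parity are enough to compute the new parity.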
January 7, 2011 Author Interesting information, thanks for sharing. It sure sounds like a larger back-burner kind of project indeed.
January 7, 2011 Have you done any testing with non-rotating media (i.e. SSD) to test the hypothesis that rotational latency has any significant impact on performance? The reason I ask is that with large caches and full-track buffering, rotational latency becomes less of an issue. For testing, I used cheap MTRON SSDs that have transfer rates on par with rotational media, but obviously no rotational or seek delays. I got no significant improvement in parity-protected writes. I also did some back-of-the-napkin calcs on the theoretical max throughput of parity-protected (read-then-write) writes, which is roughly (max-sequential-read-rate + max-sequential-write-rate)/4, and with the last kernel update unRAID is getting close to that theoretical max. So is there any bang for the buck in a rewrite? As for P+Q, are you still considering R/S? If so, how do you update parity w/o spinning up all disks to calc the polynomial?
January 8, 2011 Author "I also did some back-of-the-napkin calcs on theoretical max throughput on parity-protected writes, (read-then-write) which is roughly (max-sequential-read-rate + max-sequential-write-rate)/4 and with the last kernel update, unRAID is getting close to that theoretical max." Shouldn't the system be able to handle parallel operations on 2 different drives? If so, then the back-of-the-napkin calculation becomes (max-sequential-read-rate + max-sequential-write-rate)/2N, where N is a latency factor introduced by doing parallel operations, ranging between 1 and 2 and hopefully at the lower end, around 1.x.
January 8, 2011 "Shouldn't the system be able to handle parallel operations on 2 different drives?" It makes no difference. The theoretical max is based on ideal conditions for the I/O of one drive... either parity or the drive you are writing to. Parallel ops to the same drive won't help you unless you played some tricks with multithreading and drive I/O at the driver level was crap (like much slower in returning results than the drive actually processed the data). If you were writing to 2 data drives at the same time, both would be using the same parity drive, so parity is the bottleneck, and the max throughput of parity will be the theoretical limit. What you are talking about is applicable to striped RAID, where you split the same data among multiple disks. With unRAID, it would only happen if you were writing to the same sectors on two different data disks at the same time... and how likely is that? I wish there was some way to look at cache hits and REAL disk writes behind the drive's HW cache... that's what you really need to tweak the MD driver, but even if you had that, the tweaking might actually make performance worse on another make/model of drive.
January 8, 2011 Author I mean on the data drive and the parity drive. Why can't the 2 reads be done in parallel, and why can't the 2 writes be done in parallel? That is where you got the number 4 from, having all the I/O operations done in lock-step one after the other, correct? I'm saying:

At time X:
1a) Read Data Drive
1b) Read Parity Drive

Then after that, at time X+1:
2a) Write Data Drive
2b) Write Parity Drive

Given perfect scaling to do those operations at the same time, wouldn't the formula be (max-sequential-read-rate + max-sequential-write-rate)/2, and surely not divided by 4?
January 8, 2011 "I mean on the data drive and the parity drive. Why can't the 2 reads be done in parallel and why can't the 2 writes be done in parallel?" I'm already assuming they will be done in parallel! "That is where you got the number 4 from having all the IO operations being done in lock-step one after the other, correct?" Nope. It is an approximation that is good when read and write speeds are within 50% of each other. You want to write 100MB. You have a drive that reads at 100 MB/sec and writes at 75 MB/sec. To do an unRAID OP, you have to read 100MB (takes 1 sec) and write 100MB (takes 1.33 sec), so the total time to do an unRAID OP on 100MB is 2.33 sec. That works out to 42.9 MB/sec, and it assumes the data and parity OPs take place in perfect parallel.
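The arithmetic in this post can be checked with a short sketch. The function names are made up for illustration; the only inputs are the 100 MB/sec and 75 MB/sec figures quoted above, and the model assumes each drive simply reads the whole amount and then writes it, with the data and parity drives working in perfect parallel.

```python
def rmw_throughput(read_rate: float, write_rate: float) -> float:
    """Exact rate when every megabyte is read once and then written once.

    Time per MB is 1/read_rate + 1/write_rate, so the rate is its inverse.
    """
    return 1.0 / (1.0 / read_rate + 1.0 / write_rate)


def napkin_estimate(read_rate: float, write_rate: float) -> float:
    """The (read + write) / 4 approximation from the thread."""
    return (read_rate + write_rate) / 4.0


exact = rmw_throughput(100.0, 75.0)
approx = napkin_estimate(100.0, 75.0)
print(f"{exact:.2f} MB/sec exact")    # 42.86 MB/sec exact
print(f"{approx:.2f} MB/sec napkin")  # 43.75 MB/sec napkin
```

The two results stay close as long as the read and write rates are similar, which is the condition stated in the post; they drift apart as the rates diverge.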
January 8, 2011 Author Ah okay, thanks for explaining it. I was mistaken about what the 4 referred to; I was unaware it was empirically derived. The rest of your math makes perfect sense.
January 8, 2011 Consider this problem: Consider a slight hill, where the road going up is 1 mile long, and the road going down the other side is also 1 mile long. You do exactly 30 MPH going up the hill. What speed do you have to do going down the other side to average 60 MPH for the whole trip?
January 8, 2011 To get an idea of the issues involved in read/modify/write on rotating storage, we can simplify.

Task 1: Suppose your task is to write a sequential series of 512-byte sectors with these rules:
a) you are going to write one sector at a time, no combining a series of 512B writes into a single longer write.
b) before you write the sector, you have to first read it
c) no caching or buffering allowed, no disk read ahead or write behind
d) the disk spins at 7200RPM

If you start writing a disk using these rules, what is the maximum rate you will be writing at?

Extra credit: if the areal density of the disk increases, does the transfer rate increase?

For fun: what's the rate if the sector size is 4K?

Hey, this might make a good interview question.
January 8, 2011 Hey Boss, you must give me the number of sectors per track to give the precise answer. And for the 4K part of the question, you must give us the interleave. Also, what state is the disk in? There is latency waiting for the right sector to come under the head to read it... unless the initial state is given, all we can do is assume the target sector is at the read head at t=0. And an increase in raw areal density does not change the write time of 1 sector... unless it has a concomitant increase in SPT (which it logically does, but not always). Increase the SPT and you will shave some time off the write. But it can slow multiple read/write ops if you have an interleave too low, so that the overhead of the next read delays things enough to have to wait for another revolution. But that's why drives do track buffering.
January 8, 2011 I'm interested in throughput as you start at sector 0 and march through the drive sector-by-sector. All the information needed to calculate the transfer rate is given. As for interleave, I guess you're talking about track-to-track interleave? If so, assume zero head switch time.
January 8, 2011 Okay - so it takes 2 revolutions to do a 512b write (one to read and one to write). At 7200RPM (120RPS), that means you are writing 120/2 * 0.5K = 30 KB/sec = 1.8 MB/min.

But you can do better if you are writing a long series of sequential sectors. After the write, assume you can instantly do the next read. That means that you are really able to do a write on every revolution. At 7200RPM (120RPS), that means you are writing 120 * 0.5K = 60 KB/sec = 3.6 MB/min.

Higher areal density does not help, as this calculation already assumes infinite areal density. In the real world, you would lose one write per track. So if there are 200 sectors per track, you'd lose 1 write in 200, or about 0.5%. Doubling the areal density would mean you'd only lose 1 write in 400, or about 0.25%. This is a far cry from the performance boost of 2x you'd get in normal sequential write performance. (So rotational speed is much more important to unRAID performance than areal density. Hence the Hitachi drives with 7200 RPM speed but lower areal density (compared to the much more expensive WD Black, for example) are a pretty good combination for good unRAID write performance.)

Now if the sector size is 4K, performance would increase. At 7200RPM (120RPS), you are writing 120 * 4K = 480 KB/sec = 28.8 MB/min. Compare that to regular write performance, which would yield infinite speed at infinite areal density.

Now if the drive is spinning at 5400 RPM, the maximums decrease. With 512-byte sectors at 5400RPM (90RPS), you are writing 90 * 0.5K = 45 KB/sec = 2.7 MB/min. With 4K sectors at 5400RPM (90RPS), you are writing 90 * 4K = 360 KB/sec = 21.6 MB/min.

Hope my math is right.
January 8, 2011 "Okay - so it takes 2 revolutions to do a 512b write (one to read and one to write). At 7200RPM (120RPS), that means you are writing 120/2 * 0.5K = 30 KB/sec = 1.8MB/min." Correct! You're hired! It's 30720 Bytes/sec to be precise, and it doesn't matter how dense the data is (density determines the burst rate). The rate is entirely determined by the rotational rate of the drive. So it's easy to understand that on the read pass, after you've read the sector, you have to wait for the sector to rotate underneath the r/w head until you can start writing. But then you might think, "Ok, I've written sector N, can't I immediately read sector N+1?" Unfortunately you can't, because by the time the disk r/w head switches from write mode to read mode, the next sector has already rotated past.

"But you can do better if you are writing a long series of sequential sectors. After the write, assume you can instantly do the next read. That means that you are really able to do a write on every revolution." See above, you can't switch instantly from writing one sector to reading the next (or from reading one sector to writing the next).

Ok, the point of this little exercise is to illustrate how the geometry of the disk, and the fact that it's rotating, dominates any kind of design in a read/modify/write scenario. (Which is also why Advanced Format drives really suffer a write performance penalty when you write data to them that is not 4K-aligned.)

Exercise 2: Using the same rules as exercise 1, clearly if we write 2 sectors in this manner we would double the throughput. By increasing the number of sectors we read-then-write at a time, what is the maximum throughput we can achieve, and when does increasing the sector count reach a point of "diminishing returns", and why is that? Another assumption: assume all tracks store the same number of sectors.
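The confirmed answer follows directly from the two-revolutions-per-sector model (one revolution to read, one to write). The sketch below just restates that model; the function name is hypothetical, and the 4K and 5400 RPM figures are what the same model yields, not numbers given in this post.

```python
def sector_rmw_rate(rpm: float, sector_bytes: int) -> float:
    """Bytes/sec when each sector costs two full revolutions (read, then write)."""
    revs_per_sec = rpm / 60.0
    sectors_per_sec = revs_per_sec / 2.0  # one read revolution + one write revolution
    return sectors_per_sec * sector_bytes


print(sector_rmw_rate(7200, 512))   # 30720.0
print(sector_rmw_rate(7200, 4096))  # 245760.0
print(sector_rmw_rate(5400, 512))   # 23040.0
```

Note that sectors per track never appears in the calculation, which is why areal density has no effect under these rules.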
January 8, 2011 "Hey Boss, you must give me the number of sectors per track to give the precise answer And for the 4K part of the question, you must give us the interleave. Also what state is the disk in... there is latency waiting for the right sector to come under the head to read it.... unless the initial state is given, all we can do is assume the target sector is at the read head at t=0." True, but that assumption is only good for the first 512-byte sector being written. To read the next one, we must either assume the next sector to be read and subsequently written is in the on-disk cache, in which case t=0, or the disk must rotate to get to it, in which case the interleave of sectors comes into play. If it is immediately adjacent, the disk might have rotated past the point where the read can be accomplished, and an additional rotation might be needed. If it is halfway across the disk, then 1/2 rotation is needed between every sector being written.

"And an increase in raw areal density does not change the write time of 1 sector... unless it has a concomitant increase in SPT (which it logically does, but not always). Increase the SPT and you will shave some time on the write. But it can slow multiple read/write ops if you have an interleave too low so the overhead of the next read delays things enough to have to wait for another revolution. But that's why drives do track buffering." True... but the real question is, are full tracks buffered? I'm not sure this is possible with most drives these days.

"I'm interested in throughput as you start at sector 0 and march through the drive sector-by-sector. All the information needed to calculate the transfer rate is given." Respectfully, it is, but only if the entire track is buffered and you assume no delay in reading the subsequent sector immediately after writing the prior one.
January 8, 2011 "I'm interested in throughput as you start at sector 0 and march through the drive sector-by-sector. All the information needed to calculate the transfer rate is given." Not really. You are making a bunch of assumptions. As I said, there is not enough information to give a precise result. 7200rpm is 8.3333 ms/revolution. If you have 40 sectors per track (including spares), then it takes 0.2083 ms to read sector 0. Then it takes 8.125 ms waiting for the drive to spin back to sector 0 for the write. TOTAL: 8.3333 ms. Then you ASSUME that during the rotational latency, you have time to prepare what you want to write to sector 0 and all necessary drive overhead is completed. Writing sector 0 takes 0.2083 ms. TOTAL: 8.5417 ms, or 1 + 1/SPT revolutions.

I know you want to illustrate that the system cannot then read sector 1, because it comes too soon, but that is based on a host of assumptions. You ASSUME that you have a 1:1 interleave, or if you are changing heads, no sector offset to the next head. So with your assumptions, there is rotational latency of a full track of 40 sectors after the write (skip the next 39 on the track, then also skip sector 0, then read sector 1). But now you are at the end of sector 1 (having just written it). So total rotations is 2 + 1/SPT... not 2... and again assuming 1:1 interleave... 16.875 ms to do the OP on 1 sector and be ready for the next. 30,340 bytes per second.

The generic formula where N is the number of sequential sectors to be written: (2N + ((N-1)/SPT)) * 8.33333 ms (for a 7200 RPM drive and any sector size). Note that areal density and head improvements can mean more (narrower) tracks, but fewer SPT.
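The single-sector numbers in this post can be reproduced with a few lines. This sketch covers only the 2 + 1/SPT case worked through above (one sector per operation, 1:1 interleave, 40 sectors per track); the function name is invented for illustration, and it does not attempt the generic N-sector formula.

```python
def rmw_time_ms(rpm: float, spt: int) -> float:
    """Milliseconds per read-then-write of one sector, at 2 + 1/SPT revolutions."""
    rev_ms = 60_000.0 / rpm  # 8.3333 ms per revolution at 7200 RPM
    return (2.0 + 1.0 / spt) * rev_ms


t_ms = rmw_time_ms(7200, 40)
rate = 512 / (t_ms / 1000.0)  # bytes per second for 512-byte sectors
print(f"{t_ms:.3f} ms")       # 16.875 ms
print(f"{rate:.1f} bytes/s")  # 30340.7 bytes/s
```

Compared with the 30720 bytes/sec figure from the pure two-revolution model, accounting for the sector's own transit time costs a little over 1% at 40 sectors per track.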
January 8, 2011 "when does increasing the sector count reach a point of 'diminishing returns', and why is that?" The sweet spot is when the time needed to switch the head mode is equal to your latency. So if the head switch takes 11 ms (this is a REALLY crappy controller, since typically you see head switches in the 1 to 2 ms range or less), that is 1.32 sectors of rotation on a 40 SPT drive, so your block size should be 38 sectors (SPT-2), since 1.32 sectors (round it up to 2) of rotation have to happen for the switch. Any integer factor of SPT-2 such as 4*SPT-2 will also work. Note that if the head switch time going from read to write is different from the time going from write to read, you have to use the larger one. So in real life, with a top end of 2 ms switch time, you only need to lose 1 sector of rotation, until you get to the 160 SPT range.
January 8, 2011 "You have a drive that reads at 100 MB/sec and writes at 75MB/sec." Why would a drive write slower than it reads? Rotational speed is the same in either case.
January 8, 2011 Perhaps reads can take advantage of the buffer cache on the disk... but good question.
January 8, 2011 "a) you are going to write one sector at a time, no combining a series of 512B writes into a single longer write." Item a) is the worst case. Wouldn't the common case allow sequential writes?
January 8, 2011 I think the point of Tom's question is to illustrate how rotational latency plays a role in the unique type of read-then-write-same-sector operation that unRAID does. But you can approach it from the other direction: assume no rotational latency, and assume all writes are giant, sequential, unfragmented writes. Assume everything is perfect in terms of efficiency, and you can rip things at the fastest sequential rate the drive has. That analysis tells you the best result you can hope to get with a drive: the theoretical max. Now if your real-world implementation (i.e. unRAID) is giving you 40MB/sec of sustained write speed, and the theoretical max for that drive is 43MB/sec, is it worth the resources to try to rewrite the code, when the MOST you can possibly get is a measly 8% improvement, and likely not even close to that?
January 9, 2011 OK, so the worst case is 30720 Bytes/sec. People are getting about 38 MB/sec. The best case is 50% of the average of sustained read and write speed, which with a new drive should be about 50 MB/sec. Newer drives will have the same worst case because it is based on RPM. But the best case should increase with density. Does the average observed speed increase linearly with drive speed? Even if it does, the difference between observed speed and potential will grow larger as drive speed increases.