April 21, 200818 yr So what do you all make of this? I am having easily reproducable performance problems writing from a vista machine to my unraid server. Write performance throughput will be good (10-30MB/sec) for a few seconds and then stall for some amt of time before resuming. The stalls last anywhere from less than a second to 30+ seconds at a time. Suspecting networking problems I took an ethereal/wireshark trace from the vista machine. (I've attached a section which shows a 49 second long stall, I find it easiest to read using the 'smb' display filter) No packets seem to be getting lost at all. The vista writer (x.x.x.107) seems to be writing SMB chunks (32k each) rather quickly and the unraid server (x.x.x.118) seems to be acking them pretty quickly (more or less) when all of a sudden (starting at frame 758) the unraid box just stops acking the SMB write operations. The vista machine gets 16 write operations ahead of the file server and then seems to wait. 36.73 seconds after not hearing from the unraid machine, the vista machine sends an smb echo request (at frame 1116). 11.95 more seconds go by before the unraid machine snaps out of it and acks the 16 write jobs and finally the echo request (frames 1118 through 1150). I had smbd logging level 4 turned on during this experiment and have lined this trace up with the syslog to see what I could see. Please note that the two machines clocks are actually 3 seconds off, in case you look at the pcap file using non-relative timing. No errors seem to be in the log, just a big gap. There is no mention of the echo request. Finally, if you look at frame 1115 in the pcap, you will see that it is a tcp ack coming from the unraid machine acknowleding the data sent up through frame 1114. IOW - the low level network stack on the unraid machine has seen and acked all the data sent by the vista machine while it was stacking up those 16 jobs. 36 seconds later, when the echo request comes, the unraid machine acks the tcp data in 145 microseconds (frame 1117) but still does not respond at the SMB level. It really feels like something wierd is causing the unraid box to stall for a little bit, but there is not real indication of anything wrong in the logs. CPU load during the event never goes above 25-30%. Both the parity drive and the drive being written two are nearly brand new WD 1TB drives. Read operations seem pretty smooth, no stutters or anything. Anyone have any thoughts as to what data I should collect next? I'm pretty stumped right now. below is a link to the pcap file, it at 2MB it was too large to attach on these forums. write_stall.zip here is the relevent section of my syslog. Apr 20 19:14:30 mrpink smbd[3306]: [2008/04/20 19:14:30, 3] smbd/process.c:switch_message(926) Apr 20 19:14:30 mrpink smbd[3306]: switch message SMBwriteX (pid 3306) conn 0x803f0350 Apr 20 19:14:30 mrpink smbd[3306]: [2008/04/20 19:14:30, 3] smbd/reply.c:reply_write_and_X(3154) Apr 20 19:14:30 mrpink smbd[3306]: writeX fnum=12752 num=32768 wrote=32768 Apr 20 19:14:30 mrpink smbd[3306]: [2008/04/20 19:14:30, 3] smbd/process.c:process_smb(1068) Apr 20 19:14:30 mrpink smbd[3306]: Transaction 3087 of length 32836 Apr 20 19:14:30 mrpink smbd[3306]: [2008/04/20 19:14:30, 3] smbd/process.c:switch_message(926) Apr 20 19:14:30 mrpink smbd[3306]: switch message SMBwriteX (pid 3306) conn 0x803f0350 Apr 20 19:14:30 mrpink smbd[3306]: [2008/04/20 19:14:30, 3] smbd/reply.c:reply_write_and_X(3154) Apr 20 19:14:30 mrpink smbd[3306]: writeX fnum=12752 num=32768 wrote=32768 Apr 20 19:14:30 mrpink smbd[3306]: [2008/04/20 19:14:30, 3] smbd/process.c:process_smb(1068) Apr 20 19:14:30 mrpink smbd[3306]: Transaction 3088 of length 32836 Apr 20 19:14:30 mrpink smbd[3306]: [2008/04/20 19:14:30, 3] smbd/process.c:switch_message(926) Apr 20 19:14:30 mrpink smbd[3306]: switch message SMBwriteX (pid 3306) conn 0x803f0350 Apr 20 19:14:30 mrpink smbd[3306]: [2008/04/20 19:14:30, 3] smbd/reply.c:reply_write_and_X(3154) Apr 20 19:14:30 mrpink smbd[3306]: writeX fnum=12752 num=32768 wrote=32768 Apr 20 19:14:30 mrpink smbd[3306]: [2008/04/20 19:14:30, 3] smbd/process.c:process_smb(1068) Apr 20 19:14:30 mrpink smbd[3306]: Transaction 3089 of length 32836 Apr 20 19:14:30 mrpink smbd[3306]: [2008/04/20 19:14:30, 3] smbd/process.c:switch_message(926) Apr 20 19:14:30 mrpink smbd[3306]: switch message SMBwriteX (pid 3306) conn 0x803f0350 Apr 20 19:14:30 mrpink smbd[3306]: [2008/04/20 19:14:30, 3] smbd/reply.c:reply_write_and_X(3154) Apr 20 19:14:30 mrpink smbd[3306]: writeX fnum=12752 num=32768 wrote=32768 Apr 20 19:14:30 mrpink smbd[3306]: [2008/04/20 19:14:30, 3] smbd/process.c:process_smb(1068) Apr 20 19:14:30 mrpink smbd[3306]: Transaction 3090 of length 32836 Apr 20 19:14:30 mrpink smbd[3306]: [2008/04/20 19:14:30, 3] smbd/process.c:switch_message(926) Apr 20 19:14:30 mrpink smbd[3306]: switch message SMBwriteX (pid 3306) conn 0x803f0350 --- NOTE THE GAP --- Apr 20 19:15:19 mrpink smbd[3306]: [2008/04/20 19:15:19, 3] smbd/reply.c:reply_write_and_X(3154) Apr 20 19:15:19 mrpink smbd[3306]: writeX fnum=12752 num=32768 wrote=32768 Apr 20 19:15:19 mrpink smbd[3306]: [2008/04/20 19:15:19, 3] smbd/process.c:process_smb(1068) Apr 20 19:15:19 mrpink smbd[3306]: Transaction 3091 of length 32836 Apr 20 19:15:19 mrpink smbd[3306]: [2008/04/20 19:15:19, 3] smbd/process.c:switch_message(926) Apr 20 19:15:19 mrpink smbd[3306]: switch message SMBwriteX (pid 3306) conn 0x803f0350 Apr 20 19:15:19 mrpink smbd[3306]: [2008/04/20 19:15:19, 3] smbd/reply.c:reply_write_and_X(3154) Apr 20 19:15:19 mrpink smbd[3306]: writeX fnum=12752 num=32768 wrote=32768 Apr 20 19:15:19 mrpink smbd[3306]: [2008/04/20 19:15:19, 3] smbd/process.c:process_smb(1068) Apr 20 19:15:19 mrpink smbd[3306]: Transaction 3092 of length 32836 Apr 20 19:15:19 mrpink smbd[3306]: [2008/04/20 19:15:19, 3] smbd/process.c:switch_message(926) Apr 20 19:15:19 mrpink smbd[3306]: switch message SMBwriteX (pid 3306) conn 0x803f0350 Apr 20 19:15:19 mrpink smbd[3306]: [2008/04/20 19:15:19, 3] smbd/reply.c:reply_write_and_X(3154) Apr 20 19:15:19 mrpink smbd[3306]: writeX fnum=12752 num=32768 wrote=32768 Apr 20 19:15:19 mrpink smbd[3306]: [2008/04/20 19:15:19, 3] smbd/process.c:process_smb(1068) Apr 20 19:15:19 mrpink smbd[3306]: Transaction 3093 of length 32836 Apr 20 19:15:19 mrpink smbd[3306]: [2008/04/20 19:15:19, 3] smbd/process.c:switch_message(926) Apr 20 19:15:19 mrpink smbd[3306]: switch message SMBwriteX (pid 3306) conn 0x803f0350 Apr 20 19:15:19 mrpink smbd[3306]: [2008/04/20 19:15:19, 3] smbd/reply.c:reply_write_and_X(3154) Apr 20 19:15:19 mrpink smbd[3306]: writeX fnum=12752 num=32768 wrote=32768 Apr 20 19:15:19 mrpink smbd[3306]: [2008/04/20 19:15:19, 3] smbd/process.c:process_smb(1068) Apr 20 19:15:19 mrpink smbd[3306]: Transaction 3094 of length 32836 Apr 20 19:15:19 mrpink smbd[3306]: [2008/04/20 19:15:19, 3] smbd/process.c:switch_message(926) Apr 20 19:15:19 mrpink smbd[3306]: switch message SMBwriteX (pid 3306) conn 0x803f0350 Apr 20 19:15:19 mrpink smbd[3306]: [2008/04/20 19:15:19, 3] smbd/reply.c:reply_write_and_X(3154)
April 21, 200818 yr Author some additional info: So I took user shares out of the picture and have started to test using disk shares. No change in behavior, but I'm trying to keep things simple. I'm testing using my new disk (disk9). I can read large files at a sustained 50MB/sec with no trouble. When I write to the same disk I start to see a stuttering pattern. When I copy a large file, the initial transfer rate will burst up to 50MB/sec or so, but it will only hold that rate for a second at the most. Network activity will then basically cut out to 0 for a long time. Later, the transfer will start to move in little bursts, or sometimes there will be a larger bust, more like the initial one. CPU load on the NAS is about 35% when the burst is happening, but drops back down to almost nothing when the burst stops. Top reports 906.4MB of RAM, with 9MB free and ~840MB under Swap: cached. This is starting to feel like there is a buffer somewhere in the unraid system which is filling up and causing (probably indirectly) smbd to delay its write acks until the buffer has been drained. Judging by the bursts I am seeing, it seems to be something like 50-100MB in size. Whenever the buffer does fill, it takes a very long time to recover, either because it is taking a long time for the buffer to drain or because there is a glitch somewhere when starting the flow up again. Working under this theory, I throttled back my send side by forcing my sending NIC down to 100mbps and then 10mbps. At 100mbps, the initial burst was much longer, 5-6 seconds at least and the drop outs lasted less time in general as well. At 10mbps, the system seemed to keep up without any trouble, the transfer rate maxed out and stayed that way for the duration. Does anyone with a good low level technical knowledge of unraid have an idea of where my bottleneck might be or how I can collect more data to help me isolate the problem?
April 21, 200818 yr In this thread http://lime-technology.com/forum/index.php?topic=1530.msg10308#msg10308 I think I had isolated the issue somewhat to "pdflush" blocking all network I/O. Check out that thread to read what I found with the interaction with "pdflush" It might give you an idea or two of something you can check based on your knowledge. It might even be file-system-type specific with some kind of locking present in reiserfs that is not present in other file system types. Some things to try. Un-assign the parity drive. (eliminates it from being a bottleneck) copy system.dat beforehand so you can put things back without having to completely re-calculate parity. If you still see the periodic blocking, then parity has nothing to do with the issue. Try a different file-system type. See my entry in the wiki about mounting an NTFS drive as read/write and sharing it via samba. If writing to it is smooth, then perhaps Tom should investigate using (or allowing) something other than reiserfs. Joe L.
April 21, 200818 yr where my bottleneck might be or how I can collect more data to help me isolate the problem? You haven't found a problem unique to your system, but a common systemic issue with writing to an unRAID parity-protected data drive. We've all had to work around it. It seems to have to do with the parity update penalty, that while reading from a data drive is just a single read, a write to a parity-protected drive involves a read of both the data and parity drives, followed by a write to both the data and parity drives. That's 4 disk I/O's instead of 1 I/O. There have been several threads concerning this, the one mentioned above by Joe, and another older one linked in that thread, and there are others too. Joe has done some of the most interesting analysis concerning it. I read from my unRAID server at close to 30MB/s, and write to it at about 10MB/s, averaged over the spikes. I can throttle the transfer to the server in Total Commander, and if I set the speed to a little less than 10MB/s, say about 9000KB/s, I get a very even flow. My experiences have led me to about the same conclusions, that there is a buffer that is not draining fast enough to keep up with the network inflow. I have assumed it was related to the extra disk I/O for parity update. If a parity drive is not in use, then writes perform at normal speeds. This is why Tom has added the new Cache drive feature in the latest versions, so writes can postpone the parity penalty.
April 21, 200818 yr where my bottleneck might be or how I can collect more data to help me isolate the problem? You haven't found a problem unique to your system, but a common systemic issue with writing to an unRAID parity-protected data drive. We've all had to work around it. It seems to have to do with the parity update penalty, that while reading from a data drive is just a single read, a write to a parity-protected drive involves a read of both the data and parity drives, followed by a write to both the data and parity drives. That's 4 disk I/O's instead of 1 I/O. There have been several threads concerning this, the one mentioned above by Joe, and another older one linked in that thread, and there are others too. Joe has done some of the most interesting analysis concerning it. I'm not convinced it is the parity drive I/O itself that is the issue. It would not explain the multi-second pauses. There is something blocking the normal network I/O. As I said, I suspect that pdflush is doing an exclusive lock of some kind that unintentionally causes the network I/O to be blocked as well. Now, it might just be blocking at the network level after initially filling the disk I/O buffer. In either case, until we can learn where the blocked I/O is occurring we cannot truly address the underlying issue. Tom's cache drive is a work-around. It is simply another "buffer" that is a bit harder to fill. I wonder, if it has a lot of files to transfer to the parity protected drives, and if you manually initiate a move, will its throughput have the same start/stop buffering as it does the copy across the network? Another test...Initiate a manual copy from a large file on the cache drive to the parity protected disks, does network traffic then also stop/start and occur in bursts? My experiment was to set the nice value of the existing pdflush processes slightly lower to allow other I/O to occur. Tom warned about possible side effects, and anything I did was only to the two or so existing pdflush processes... (they die and get re-spawned as they are used to drain the buffer pool to the disks) Interestingly, I read that the newest kernel, just released, has a new set of features to track just this type of thing... you can more easily monitor where processes are spending their time. Perhaps when Tom migrates to it, we will have an easier time. Until then, kernel tracing and debugging anyone? Joe L.
April 21, 200818 yr I'm not convinced it is the parity drive I/O itself that is the issue. It would not explain the multi-second pauses. There is something blocking the normal network I/O. I think you are right. Plus, John did not see any NAK's above, just bottled up ACK's (if I remember my TCP stuff correctly). That seems consistent with what you are saying.
April 21, 200818 yr Both of these describe pdflush lock contention with reiserfs. Interesting reading. http://oss.sgi.com/projects/xfs/papers/ols2006/ols-2006-presentation.pdf http://lists.opensuse.org/opensuse-factory/2006-09/msg00542.html Joe L.
April 22, 200818 yr Author I'm not convinced it is the parity drive I/O itself that is the issue. It would not explain the multi-second pauses. There is something blocking the normal network I/O. I think you are right. Plus, John did not see any NAK's above, just bottled up ACK's (if I remember my TCP stuff correctly). That seems consistent with what you are saying. Just to be perfectly accurate, the network trace I posed shows the unraid machine's tcp/ip stack acking the received data almost instantly. The delay comes one protocol level up at the SMB level where it takes a large amt of time to acknowledge the write jobs. This implies that neither the NIC driver, nor the TCP/IP stack is actually being held off. Rather, smbd seems to be held up waiting for its write jobs to drain through the disk subsystem. I'm not convinced it is the parity drive I/O itself that is the issue. It would not explain the multi-second pauses. There is something blocking the normal network I/O. As I said, I suspect that pdflush is doing an exclusive lock of some kind that unintentionally causes the network I/O to be blocked as well. Now, it might just be blocking at the network level after initially filling the disk I/O buffer. In either case, until we can learn where the blocked I/O is occurring we cannot truly address the underlying issue. Just as a quick update, I have been following up on your suggestion to look into pdflush's behavior. This does seem to be the general source of the stalling, although it seems hard to isolate exact nature. I also upgraded to 4.3.beta6 just in case there were some linux behavior issues which might have been cleared up by the new kernel. As far as I can tell, the only think which has changed with the kernel change is that the kernel now recognizes both gigs of RAM I have in the system instead of just one. I have been running tests doing file copies to the array over the network and watching /proc/meminfo using 'watch -n 0.3 /proc/meminfo' as I do so. In particular, I am paying attn to the Dirty and the Writeback fields. When a large file transfer starts, the amt of dirty data increases quite quickly as the network fills up the cache. At this point in time, I would expect the pdflush system to drain the data from the dirty cache just about as fast as the unraid disk block system can drain the data. In general, it does seem to do this. Writing to the fastest disk in my system, I seem to be able to achieve just about 13-15 MB/sec write throughput. What I have discovered, however, is that my newest disk addition to my array is not performing nearly as quickly. It is achieving something more like 4-6 MB/sec on write, even though its read is nice and quick. The interesting thing here is that these are all identical disks. The parity drive, disk1 (the fast one) and disk9 (the slow one) are all 1TB western digital drives with identical firmware. Device Model: WDC WD10EACS-00ZJB0 Firmware Version: 01.01B01 hdparm -v seems to indicate that the drives are all using identical IDE settings. smartctrl seems to indicate that the drives are all perfectly happy. Running the following test, however, show large difference in write performance. time `dd if=/dev/zero of=/dev/diskXXX/zerofile bs=1K count = 400K ; sync` where XXX is the number of the drive you want to test. This will basically write 400MB of zeros to disk and then sync the page cache, timing how long the whole operation takes in the process. Adding the sync to the end makes certain we time how long it takes for the data to make it all the way to the disk, and not just how long it took to fill up the page cache. running this on the two disks (disk1 is the fast one, disk9 is the slow one) gives me root@mrpink:~# time `dd if=/dev/zero of=/mnt/disk1/zerofile bs=1K count=400K ; sync` 409600+0 records in 409600+0 records out 419430400 bytes (419 MB) copied, 3.65931 s, 115 MB/s real 0m29.982s user 0m0.230s sys 0m3.300s root@mrpink:~# time `dd if=/dev/zero of=/mnt/disk9/zerofile bs=1K count=400K ; sync` 409600+0 records in 409600+0 records out 419430400 bytes (419 MB) copied, 3.1262 s, 134 MB/s real 1m0.097s user 0m0.170s sys 0m2.910s root@mrpink:~# Notice that it takes pretty much 2 times as long on disk9 to do the writes as it did on disk1. Its this lack of write throughput which produces the jaggy network performance. I have played with the vm sysctl settings to the point where have pdflush buffering a lot of data and being agressive about flushing it, but fundamentally it seems to be this drive's slow writes which are at the heart of the problem. I'm not really certain where to go next with this... I suppose bypassing the parity drive as suggested by Joe earlier might provide some insight into the situation. I may also try to move the drive to a different port on the controller. Currently it is hanging off port 6 of one of those promise TX2 8 port SATA-II cards while the parity and quick drives are hanging off ports 0 and 1. I could swap the fast and the slow drives to see if the port makes any difference, or I could move the slow drive to one of the motherboard's SATA ports (neither is being used right now) and see if that makes any difference either.
April 22, 200818 yr Just as a quick update, I have been following up on your suggestion to look into pdflush's behavior. <snip> I have been running tests doing file copies to the array over the network and watching /proc/meminfo using 'watch -n 0.3 /proc/meminfo' as I do so. In particular, I am paying attn to the Dirty and the Writeback fields. When a large file transfer starts, the amt of dirty data increases quite quickly as the network fills up the cache. At this point in time, I would expect the pdflush system to drain the data from the dirty cache just about as fast as the unraid disk block system can drain the data. In general, it does seem to do this. Writing to the fastest disk in my system, I seem to be able to achieve just about 13-15 MB/sec write throughput. What I have discovered, however, is that my newest disk addition to my array is not performing nearly as quickly. It is achieving something more like 4-6 MB/sec on write, even though its read is nice and quick. Interesting about your widely different performance with identical disks. I'd suspect the port they are hanging on. As far as pdflush, I would expect it to empty the buffers to the physical disks as fast as they can... BUT I would not expect all samba I/O to stop until pdflush is through writing all the buffers. That is what I suspect is happening. As an experiment, I did a re-nice of the pdflush process to "1" (normally they are at nice "0") and this seemed to smooth things out for a couple of seconds. Smbd could then get a bit of the CPU's time. Of course, as soon as pdflush had finished draining the buffer it dies and was replaced with a new pdflush at nice=0 and performance was choppy again. Basically, pdflush should be a "background" process... not one that has the highest priority... at least not until the buffer cache is full. Joe L.
April 22, 200818 yr Author Interesting about your widely different performance with identical disks. I'd suspect the port they are hanging on. As far as pdflush, I would expect it to empty the buffers to the physical disks as fast as they can... BUT I would not expect all samba I/O to stop until pdflush is through writing all the buffers. That is what I suspect is happening. As an experiment, I did a re-nice of the pdflush process to "1" (normally they are at nice "0") and this seemed to smooth things out for a couple of seconds. Smbd could then get a bit of the CPU's time. Of course, as soon as pdflush had finished draining the buffer it dies and was replaced with a new pdflush at nice=0 and performance was choppy again. Basically, pdflush should be a "background" process... not one that has the highest priority... at least not until the buffer cache is full. Joe L. Yeah, I'm going to bed now, but I'm going to try moving the slow disk to a different SATA port in the morning to see if it helps. So far, I have broken this down into 2 different problems. 1) My new disk and its slow writes. After spending even more time with it, I can say that sometimes its slow and sometimes fast, but usually more slow than fast. When it is slow, it makes the next problem worse. 2) When the page cache buffers fill up to more than the amt indicated by vm.dirty_ratio, jobs back up all the way to the samba level and network activity drops. WRT #2 - I'm not certain what is up with this yet, but I am pretty sure that this is not a situation of being CPU bound. I have tried to lower the pdflush priority but I have not really noticed and change in behavior. I've been watching with vmstat, and when the page buffers fill up and the network dies, the interrupt count does way down as does the number of context switches. CPU time shows as being almost entirely idle, none waiting on I/O, none in user code and very little (a few percent) in kernel code. If a process or thread wants to run here, it should be getting all the time it needs. This feels more like something is holding onto a lock for some reason while it waits a long time for something else to happen. One possibility (although I have to make some pretty big guesses here) is that when the page buffer fills up, smbd's next write job goes directly to the disk device and bypasses the page buffer cache. Now. smbd and all of the instances of pdflush are waiting for their disk jobs to complete. If they are holding a lock on the disk device (/dev/mdXXX in this case) while doing their write, then when they release the lock, it will be a free for all so see who can get the lock first (smbd or one of the pdflushs) after it is released. If the pdflush processes keep winning, then the smbd job will stall and the network will back up until it can retire its write job. By the time it can retire the job, the page buffer has drained some and the next burst of writes will fill it back up again. Depending on the exact nature of the lock (spin lock, kernel assisted semaphore, etc... Hard to say since I don't even know if it exists, I'm just guessing) changing the relative process priority for pdflush and smbd could cause smbd to win the lock contest each time. If this were the case, you should see the page buffer fill up, and then most of the jobs come straight from smbd and bypass the page buffer. The only time pdflush would get a chance to do some IO would be when smbd went back to drain the socket buffers. Does anyone know where I can get a binary for the blktrace package which will work in the unraid 4.3.beta6 environment? I have been reading some about its capabilities and it seems like it would be a good tool to use, both to investigate my slow disk writes as well as to see what exactly is going on when pdflush and smbd are competing for scarce IO resources.
April 22, 200818 yr Author ok, now I'm really going nuts. I picked up the slackware package for sysstat and started to poke around a little with iostat (part of the package). I found that the fast 1TB drives were executing jobs in about 3-5mSec, while the slow drive was taking more like 20-30 and sometimes even 50mSec (the svctm column given by iostat -x). After running a bunch of different tests, I finally shut down the system and swapped the slow and fast drives (I'm using 5in3 front loading drive bays, so it is easy to do). They are identical drives, so if the problem was the port/cable/etc it should stay there. After bringing the box back up and remapping the switched disks with the http gui I ran my tests again. This time, both disks were fast (as was the parity drive, which is also identical to the other 2 and has always been fast). I wondered if the slow drive had not been plugged in all the way, so I shut the system down again and switched the disks back to the way they were. Now the slow drive is slow again. Shutdown again, this time move the slow drive to the mobo's second SATA port (it only has two, I had never used either). Bring the system back up and retest, fast drive is still fast (it hadn't moved) and the slow drive is fast again. Interestingly enough, the job sizes making it to the drive seem to be larger now some of the time resulting in a longer svctm sometimes although the throughput (easily examined with iostat -m) is still quite quick at 12-15MB/sec. I don't know what it is about this disk and this SATA port, but one thing is for sure. When your disk is draining the page buffer at 15MB/sec instead of 5, everything else seems to get a lot better. Its only when the page buffer cache gets saturated that you start to see uneven network performance, and even then it dosn't seem to result in samba ignoring the network for 30-50 seconds at a time. I'm tired this time I really am going to bed
Archived
This topic is now archived and is closed to further replies.