unRAID Server release 4.5-beta11 available


limetech


But while media is playing on the Mac Mini, if you browse directories or access another disk that has to be spun up from another computer, does the Mac Mini media playback pause/stutter?

 

Yes it does. I can't remember a time when it didn't do this. Playing a media file on my Mini and browsing the file structure (and spinning up another disk) on my laptop will often cause the Mini to stutter/pause playback (due to its buffer being drained). Both the mini and the unraid server are connected via GigE.

 

I should note..

1) It only seems to do this when streaming HD content (SD content is fine)

2) Even when streaming HD content, it doesn't ALWAYS happen.

 

 

Link to comment

The change I made, oddly enough, was to use a different USB flash drive. I recalled that the only time I had problems with beta 6 negotiating a gigabit connection was when I had the flash drive plugged into a port on a USB controller that shared an IRQ with the primary LAN port. So I thought perhaps the flash drive was marginal (it's an old 512MB CF card in a card reader). Apparently there was something to that theory, since the NIC appears to be working significantly better with the new flash drive installed. Why, exactly, I can only guess. Maybe someone with more knowledge will have a theory.

 

Possibly there are several factors at play causing all this weirdness: marginal chip or circuit on my motherboard, combined with marginal drivers, and who knows what else, perhaps something with the later linux kernels.

 

Interesting. Could it be an issue with IRQ routing/sharing? Perhaps review the output of

cat /proc/interrupts

and/or move the USB flash adapter around to different ports.

 

Odd situation.

 

Here's how the interrupts look currently with eth0 connected at 1000Mb/s:

Tower login: root
Linux 2.6.31.6-unRAID.
root@Tower:~# cat /proc/interrupts
           CPU0       CPU1
  0:       4982          0   IO-APIC-edge      timer
  1:         12         12   IO-APIC-edge      i8042
  6:          0          3   IO-APIC-edge      floppy
  9:          0          0   IO-APIC-fasteoi   acpi
 14:          0          0   IO-APIC-edge      ide0
 15:          0          0   IO-APIC-edge      ide1
 17:          0          0   IO-APIC-fasteoi   ohci_hcd:usb2, sata_sil24
 18:          0          0   IO-APIC-fasteoi   ohci_hcd:usb3
 19:          0          0   IO-APIC-fasteoi   ohci_hcd:usb4
 21:          0          0   IO-APIC-fasteoi   eth1
 23:        645       1257   IO-APIC-fasteoi   ehci_hcd:usb1
 27:      79275      27062   PCI-MSI-edge      eth0
 28:       7684       3170   PCI-MSI-edge      ahci
NMI:          0          0   Non-maskable interrupts
LOC:     703324     703292   Local timer interrupts
SPU:          0          0   Spurious interrupts
CNT:          0          0   Performance counter interrupts
PND:          0          0   Performance pending work
RES:      37566      40155   Rescheduling interrupts
CAL:         50         16   Function call interrupts
TLB:        217        358   TLB shootdowns
TRM:          0          0   Thermal event interrupts
THR:          0          0   Threshold APIC interrupts
MCE:          0          0   Machine check exceptions
MCP:         25         24   Machine check polls
ERR:          0
MIS:          0
root@Tower:~#

 

Next time I have a problem, I'll check again and compare.

 

Can anyone tell me if linux assigns IRQs itself? I ask because the IRQ assignments are different from what the Asus M/b manual states. It says that usb2 and eth0 share an interrupt, but that is not the case here. If the IRQs are shuffled differently with each start-up that might help explain what's going on.
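
To make that comparison easier, something like the following could list only the IRQs that more than one device shares. This is just a rough sketch: it assumes the /proc/interrupts layout shown above (shared devices are comma-separated at the end of the line), and the function name shared_irqs is made up for illustration.

```shell
#!/bin/sh
# shared_irqs: read /proc/interrupts-style text on stdin and print each
# numbered IRQ that has more than one device attached.
# (Hypothetical helper name; not part of unRAID.)
shared_irqs() {
    awk '
        # Only numbered IRQ lines that list several devices
        # (skips the CPU header and the NMI/LOC/... summary lines).
        $1 ~ /^[0-9]+:$/ && /,/ {
            irq = $1
            sub(/:$/, "", irq)
            # Strip everything up to and including the interrupt type;
            # what remains is the comma-separated device list.
            devs = $0
            sub(/^.*(edge|fasteoi|level)[ \t]+/, "", devs)
            print "IRQ " irq " shared by: " devs
        }
    '
}

# Usage on a live system:
#   shared_irqs < /proc/interrupts
```

For the listing above this should report only IRQ 17 (ohci_hcd:usb2 and sata_sil24), with eth0 on its own interrupt. Saving the output of cat /proc/interrupts to a file while things are working and running diff against a later capture would show whether the assignments move between boots.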

Link to comment

Maybe this could be helped with some form of read-ahead optimization and/or extra RAM.

 

I have 4GB ram in my unraid server.  It seems that unraid stops streaming all files, even recently accessed data it should have in its cache, until spin-up completes.  Also, the unraid http server appears to lock during spin-up.

 

I think part of what is needed is determination if it's a shfs or kernel issue.

 

Not exactly sure how to determine this.

 

 

Edit:

Out of curiosity I wrote a script that simply cats a text file on one of my disks over and over on the local unraid server (via telnet).  I spun down, then spun up the drives.   No interruption in the display of the text file was noticed.  Repeated on both user share (/mnt/user) and direct (/mnt/disk1).  So small files, accessed locally on the machine, do not appear to be affected by spin-up.  I'm not sure what to think of this, if anything.

 

Link to comment
Out of curiosity I wrote a script that simply cats a text file on one of my disks over and over on the local unraid server (via telnet).  I spun down, then spun up the drives.  No interruption in the display of the text file was noticed.  Repeated on both user share (/mnt/user) and direct (/mnt/disk1).  So small files, accessed locally on the machine do not appear to be affected by spin-up.  I'm not sure what to think of this.

If you are continually reading the file, it is in the buffer cache.

This is not a good test.

 

Do this instead:

1) Open two telnet sessions. cd /mnt/disk1 in one session and cd /mnt/disk2 in the other.

2) Drop the caches: echo 3 > /proc/sys/vm/drop_caches

3) In the disk1 session, run ls -l . and, after the disk spins up, choose a decent-sized file to cat, but do not cat it yet.

4) In the disk2 session, run ls -l . or ls -lR . This should cause disk2 to spin up.

5) Switch back to the disk1 session and press Enter a few times to see whether the session hangs (probably not).

6) Then do the cat. If it pauses or waits until the other disk spins up, then that is the answer.
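
The last step can be made a little more objective by timing the read instead of watching for a pause. A rough sketch, assuming a shell where date +%s works (so only whole-second resolution); the helper name timed_read and the file name are hypothetical.

```shell
#!/bin/sh
# timed_read FILE: read FILE once, discard the data, and print how many
# whole seconds the read took. (Hypothetical helper for illustration.)
timed_read() {
    start=$(date +%s)
    cat "$1" > /dev/null || return 1
    end=$(date +%s)
    echo $((end - start))
}

# Intended use, following the steps above (run as root):
#   echo 3 > /proc/sys/vm/drop_caches          # start with cold caches
#   session 1: cd /mnt/disk1; ls -l .          # spin up disk1, pick a file
#   session 2: cd /mnt/disk2; ls -lR .         # spin up disk2
#   session 1: timed_read ./some_large_file    # hypothetical file name
```

A result of several seconds for a file that normally reads in well under a second would point to the read waiting on the other disk's spin-up.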

 

 

Link to comment

The change I made, oddly enough, was to use a different USB flash drive. I recalled that the only time I had problems with beta 6 negotiating a gigabit connection was when I had the flash drive plugged into a port on a USB controller that shared an IRQ with the primary LAN port. So I thought perhaps the flash drive was marginal (it's an old 512MB CF card in a card reader). Apparently there was something to that theory, since the NIC appears to be working significantly better with the new flash drive installed. Why, exactly, I can only guess. Maybe someone with more knowledge will have a theory.

 

Possibly there are several factors at play causing all this weirdness: marginal chip or circuit on my motherboard, combined with marginal drivers, and who knows what else, perhaps something with the later linux kernels.

 

Interesting. Could it be an issue with IRQ routing/sharing? Perhaps review the output of

cat /proc/interrupts

and/or move the USB flash adapter around to different ports.

 

Odd situation.

 

Here's how the interrupts look currently with eth0 connected at 1000Mb/s:

Tower login: root
Linux 2.6.31.6-unRAID.
root@Tower:~# cat /proc/interrupts
           CPU0       CPU1
  0:       4982          0   IO-APIC-edge      timer
  1:         12         12   IO-APIC-edge      i8042
  6:          0          3   IO-APIC-edge      floppy
  9:          0          0   IO-APIC-fasteoi   acpi
 14:          0          0   IO-APIC-edge      ide0
 15:          0          0   IO-APIC-edge      ide1
 17:          0          0   IO-APIC-fasteoi   ohci_hcd:usb2, sata_sil24
 18:          0          0   IO-APIC-fasteoi   ohci_hcd:usb3
 19:          0          0   IO-APIC-fasteoi   ohci_hcd:usb4
 21:          0          0   IO-APIC-fasteoi   eth1
 23:        645       1257   IO-APIC-fasteoi   ehci_hcd:usb1
 27:      79275      27062   PCI-MSI-edge      eth0
 28:       7684       3170   PCI-MSI-edge      ahci
NMI:          0          0   Non-maskable interrupts
LOC:     703324     703292   Local timer interrupts
SPU:          0          0   Spurious interrupts
CNT:          0          0   Performance counter interrupts
PND:          0          0   Performance pending work
RES:      37566      40155   Rescheduling interrupts
CAL:         50         16   Function call interrupts
TLB:        217        358   TLB shootdowns
TRM:          0          0   Thermal event interrupts
THR:          0          0   Threshold APIC interrupts
MCE:          0          0   Machine check exceptions
MCP:         25         24   Machine check polls
ERR:          0
MIS:          0
root@Tower:~#

 

Next time I have a problem, I'll check again and compare.

 

Can anyone tell me if linux assigns IRQs itself? I ask because the IRQ assignments are different from what the Asus M/b manual states. It says that usb2 and eth0 share an interrupt, but that is not the case here. If the IRQs are shuffled differently with each start-up that might help explain what's going on.

 

It does not look like a shared IRQ issue.

As far as IRQs being shuffled, anything above 15 (I believe) is programmed in software,

although I don't know what PCI-MSI-edge is.

 

APIC http://en.wikipedia.org/wiki/Advanced_Programmable_Interrupt_Controller

 

There are some boot time parameters you can try if inclined.

Try each one individually and in sets to see how you fare.

 

bzroot rootdelay=10 acpi=off noapic nolapic irqpoll
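
When trying these one at a time, it is easy to lose track of which options the running kernel actually booted with. Here is a small sketch that checks a kernel command line for the options listed above. The function names are made up, and on a live system the command line would come from /proc/cmdline.

```shell
#!/bin/sh
# has_boot_param PARAM CMDLINE: succeed if PARAM appears as a whole
# word (or as PARAM=value) in the kernel command line CMDLINE.
# (Hypothetical helper names, for illustration only.)
has_boot_param() {
    param=$1
    for word in $2; do
        case "$word" in
            "$param"|"$param"=*) return 0 ;;
        esac
    done
    return 1
}

# report_boot_params CMDLINE: say which of the suggested options are set.
report_boot_params() {
    for p in rootdelay acpi noapic nolapic irqpoll; do
        if has_boot_param "$p" "$1"; then
            echo "$p: set"
        else
            echo "$p: not set"
        fi
    done
}

# Usage on a live system:
#   report_boot_params "$(cat /proc/cmdline)"
```

Noting the result next to each test run makes it easier to tell afterwards which combination was actually in effect.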

Link to comment

If you are continually reading the file, it is in the buffer cache.

This is not a good test.

 

do

have two telnet sessions open

cd /mnt/disk1 in one session

cd /mnt/disk2 in another session.

 

echo 3 > /proc/sys/vm/drop_caches

 

on disk 1  ls -l .

after it spins up choose a decent sized file to cat but do not cat it yet.

 

on telnet session with disk 2

do an ls -l . or an ls -lR .

This should cause the disk 2 to spin up.

 

Switch back to disk 1 session

press enter a few times to see if the session hangs or not (probably not).

 

Then do the cat

if it pauses or waits until the other disk spins up, then that is the answer.

 

Would adding "echo 3 > /proc/sys/vm/drop_caches" before the cat file in my loop accomplish the same?

 

Link to comment

Out of curiosity I wrote a script that simply cats a text file on one of my disks over and over on the local unraid server (via telnet).  I spun down, then spun up the drives.   No interruption in the display of the text file was noticed.  Repeated on both user share (/mnt/user) and direct (/mnt/disk1).  So small files, accessed locally on the machine do not appear to be affected by spin-up.  I'm not sure what to think of this.

If you are continually reading the file, it is in the buffer cache.

This is not a good test.

 

do

have two telnet sessions open

cd /mnt/disk1 in one session

cd /mnt/disk2 in another session.

 

echo 3 > /proc/sys/vm/drop_caches

 

on disk 1  ls -l .

after it spins up choose a decent sized file to cat but do not cat it yet.

 

on telnet session with disk 2

do an ls -l . or an ls -lR .

This should cause the disk 2 to spin up.

 

Switch back to disk 1 session

press enter a few times to see if the session hangs or not (probably not).

 

Then do the cat

if it pauses or waits until the other disk spins up, then that is the answer.

 

Ok.  I repeated what you describe above, but I added a "sync" before "echo 3 > /proc/sys/vm/drop_caches" to make sure the cache was completely flushed.

 

This time, I do get a delay while another disk is spinning up when trying to cat a large text file off a disk that is already spun up.

 

Link to comment

Exact same issue here, and this is my only true gripe with unRAID. Write performance is nice, but this is extremely annoying. Tom, I would love it if you could investigate this further.

 

Same issue here. I was not sure whether it was an unRaid issue or an issue with my media player, which is also updated several times a week with new builds.

Link to comment

Out of curiosity I wrote a script that simply cats a text file on one of my disks over and over on the local unraid server (via telnet).  I spun down, then spun up the drives.   No interruption in the display of the text file was noticed.  Repeated on both user share (/mnt/user) and direct (/mnt/disk1).  So small files, accessed locally on the machine do not appear to be affected by spin-up.  I'm not sure what to think of this.

If you are continually reading the file, it is in the buffer cache.

This is not a good test.

 

do

have two telnet sessions open

cd /mnt/disk1 in one session

cd /mnt/disk2 in another session.

 

echo 3 > /proc/sys/vm/drop_caches

 

on disk 1  ls -l .

after it spins up choose a decent sized file to cat but do not cat it yet.

 

on telnet session with disk 2

do an ls -l . or an ls -lR .

This should cause the disk 2 to spin up.

 

Switch back to disk 1 session

press enter a few times to see if the session hangs or not (probably not).

 

Then do the cat

if it pauses or waits until the other disk spins up, then that is the answer.

 

Ok.  I repeated what you describe above, but I added a "sync" before "echo 3 > /proc/sys/vm/drop_caches" to make sure the cache was completely flushed.

 

This time, I do get a delay while another disk is spinning up when trying to cat a large text file off a disk that is already spun up.

 

It might be some kind of shared lock used by both the disk driver and the network I/O.

 

While doing a bit of research I found this site describing using hdparm for tuning disk I/O. http://events.oreilly.com/pub/a/linux/2000/06/29/hdparm.html

 

# unmaskirq: Turning this on will allow Linux to unmask other interrupts while processing a disk interrupt.

What does that mean? It lets Linux attend to other interrupt-related tasks (i.e., network traffic) while waiting for your disk to return with the data it asked for. It should improve overall system response time, but be warned: Not all hardware configurations will be able to handle it.

It could easily be that a disk being spun up has masked ALL interrupts until it returns with the requested data, which only happens after it has spun up. Until then, no network traffic.

 

For PATA drives, the hdparm manual shows this option:

-u Get/set interrupt-unmask flag for the drive. A setting of 1 permits the driver to unmask other interrupts during processing of a disk interrupt, which greatly improves Linux’s responsiveness and eliminates "serial port overrun" errors. Use this feature with caution: some drive/controller combinations do not tolerate the increased I/O latencies possible when this feature is enabled, resulting in massive filesystem corruption. In particular, CMD-640B and RZ1000 (E)IDE interfaces can be unreliable (due to a hardware flaw) when this option is used with kernel versions earlier than 2.0.13. Disabling the IDE prefetch feature of these interfaces (usually a BIOS/CMOS setting) provides a safe fix for the problem for use with earlier kernels.

(Note: the warning is for a much earlier version of Linux, as we are on 2.6.31.6, and for very old hardware disk controllers. Also note that the responsiveness they are talking about is with serial ports, from well before network I/O was common on Linux, but masking interrupts caused data loss back then too.)

 

If you have PATA drives you can try issuing

hdparm  -u1 /dev/hda

(using of course, instead of "hda", the correct disk devices storing the files you are using for testing)

and see if that makes a difference.

 

I don't think the same command is available for SATA drives... not even sure if they use the same interrupt masking, but this is certainly worth a try if you are using PATA drives.

 

You can see the current interrupt mask mode for your disk with

hdparm /dev/hdb
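
To check several drives at once, the unmaskirq line can be pulled out of that output. A sketch only, assuming the usual "name = value (state)" layout that hdparm prints for IDE devices; unmaskirq_state is a hypothetical helper name.

```shell
#!/bin/sh
# unmaskirq_state: read `hdparm /dev/hdX` output on stdin and print the
# state of the interrupt-unmask flag ("on" or "off"), or "unsupported"
# when the output contains no unmaskirq line.
unmaskirq_state() {
    awk '
        $1 == "unmaskirq" {
            state = $NF              # last field looks like "(on)"
            gsub(/[()]/, "", state)  # strip the parentheses
            found = 1
            print state
        }
        END { if (!found) print "unsupported" }
    '
}

# Usage on a live system (device names are examples only):
#   for d in /dev/hda /dev/hdb; do
#       echo "$d: $(hdparm "$d" 2>/dev/null | unmaskirq_state)"
#   done
```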

 

Joe L.

Link to comment

If you have PATA drives you can try issuing

hdparm  -u1 /dev/hda

(using of course, instead of "hda", the correct disk devices storing the files you are using for testing)

and see if that makes a difference.

 

I don't think the same command is available for SATA drives... not even sure if they use the same interrupt masking, but this is certainly worth a try if you are using PATA drives.

 

You can see the current interrupt mask mode for your disk with

hdparm /dev/hdb

 

I only have a single PATA (my cache drive).  I issued a "hdparm  -u1 /dev/hdb" on it, but see no change.  Spin-up of any drive still delays I/O to any file on an already spun-up device (though, if the file is already buffered in ram, it may stream uninterrupted -- but I'm not certain).

 

Link to comment

 

You can see the current interrupt mask mode for your disk with

hdparm /dev/hdb

 

Joe L.

 

I get the following response to that command on my SATA drives:

 

/dev/sdh:
IO_support    =  1 (32-bit)
HDIO_GET_UNMASKINTR failed: Inappropriate ioctl for device
HDIO_GET_DMA failed: Inappropriate ioctl for device
HDIO_GET_KEEPSETTINGS failed: Inappropriate ioctl for device
readonly      =  0 (off)
readahead     = 256 (on)
geometry      = 56065/255/63, sectors = 1953525168, start = 0

 

Does this mean ioctl is not working properly on those drives, or is this a gap in the command?

 

BTW, there were postings saying that the global lock situation has been improved in newer Linux kernels to avoid just this kind of behaviour. Do we already have those improvements in the unRAID kernel?

Link to comment

Here is the cause of the (infamous?) "different disk spinning up causes my media stream to stall" problem.  It turns out this is a disk controller hardware limitation, but I have an idea for a workaround.

 

Some background: most of you will recall when IDE drives ruled the PC world and each IDE "controller" could have two hard drives attached, a master and a slave.  What is less common knowledge is that, at any one time, you could execute an I/O request on either the master drive or the slave drive, but not both simultaneously.  This is a low-level "detail" which, for a variety of reasons, doesn't normally have much impact on the applications using the hard drives.

 

Now SATA rules the PC world, but many early SATA controllers are simply IDE controllers in disguise.  For example, the Promise SATA300 TX4 controller has 4 SATA ports, but internally, there are actually two SATA controllers, each controlling two SATA ports.  This h/w has the same limitation: if you execute an I/O operation on one SATA port, and then try to execute another SATA operation on the other port of the same internal controller, that second I/O operation can not be executed until the first one finishes.

 

So how is this causing media stream stalls?  Below is the device ID info for one of my test servers:

 

parity device: pci-0000:00:1f.2-scsi-1:0:0:0 (sdh) ata-Hitachi_HDS721010KLA330_GTH000PAGXXUXH 
disk1 device: pci-0000:00:1f.2-scsi-1:0:1:0 (sdi) ata-Hitachi_HDS721010KLA330_GTH000PAH00LLH 
disk2 device: pci-0000:00:1f.2-scsi-0:0:0:0 (sdf) ata-Hitachi_HDS721010KLA330_GTH000PAH01B7H 
disk3 device: pci-0000:00:1f.2-scsi-0:0:1:0 (sdg) ata-Hitachi_HDS721010KLA330_GTH000PAH01RYH 
disk4 device: pci-0000:06:00.0-scsi-0:0:0:0 (sda) ata-Hitachi_HDT721010SLA360_STF607MH12GR6B 
disk5 device: pci-0000:06:00.0-scsi-1:0:0:0 (sdb) ata-Hitachi_HDT721010SLA360_STF607MH1GAB2W 
disk6 device: pci-0000:06:00.0-scsi-2:0:0:0 (sdc) ata-ST31000340AS_5QJ0R3PG 
disk7 device: pci-0000:06:00.0-scsi-3:0:0:0 (sdd) ata-ST31000340AS_5QJ0R55H 

 

Take a look at the 4 digits following "scsi".  Notice the first digit for parity and disk1 is the same (a 1), and the third digit is 0 for parity and 1 for disk1.  Similarly for disk2 and disk3, the first digit is the same (a 0) and the third digit again is 0 for disk2 and 1 for disk3.  This is telling us that parity/disk1 share a common controller, as do disk2/disk3.  Looking at disk4/5/6/7, notice the first digits are 0,1,2,3 and the third digits are all 0.  This tells us each of those disks is on a different controller.
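
For anyone who wants to check their own device list the same way, here is a rough sketch that groups disks by PCI address plus the first scsi digit and prints only the groups with more than one disk. It assumes the list is in exactly the format shown above; the function name shared_channels and the file name devices.txt are made up.

```shell
#!/bin/sh
# shared_channels: read "<name> device: pci-<addr>-scsi-<a>:<b>:<c>:<d> ..."
# lines on stdin and print the groups of disks whose PCI address and
# first scsi digit match, i.e. the disks that share one controller
# according to the rule described above. (Hypothetical helper name.)
shared_channels() {
    awk '
        $2 == "device:" && $3 ~ /^pci-.*-scsi-[0-9]+:/ {
            key = $3
            sub(/^pci-/, "", key)                   # drop the "pci-" prefix
            sub(/:[0-9]+:[0-9]+:[0-9]+$/, "", key)  # keep address + first digit
            if (!(key in members)) order[++n] = key # remember first-seen order
            members[key] = members[key] (members[key] ? " " : "") $1
            count[key]++
        }
        END {
            for (i = 1; i <= n; i++)
                if (count[order[i]] > 1)
                    print order[i] ": " members[order[i]]
        }
    '
}

# Usage, with the device list saved to a file:
#   shared_channels < devices.txt
```

Fed the eight lines above, this should print two groups, parity with disk1 and disk2 with disk3, and nothing for disk4 through disk7.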

 

Here is the implication.  Suppose we start a media stream off, say, disk2 while all other disks are spun down.  If you then try to access disk3, the media stream will grind to a halt: as soon as the linux disk driver starts an I/O request for disk3, that I/O request will just sit there waiting for the disk to spin up; meantime, I/O requests coming for disk2 (i.e., additional media file reads) get backed up until the disk3 request finishes.

 

However, again assuming media is being read off disk2 and all other disks are spun down: if we try to access any disk other than disk3, all will be well with the media stream.  We will not see it stall, because any disk other than disk3 is on a different controller.

 

So this explains quite a bit about why some people see this problem and some don't.  It depends on your hardware, which disk the media is streaming from, and which disk(s) you try to access while the media is streaming.

 

Here is my proposed workaround: I can change the code so that if an I/O request is directed toward a drive that is spun down, then if that drive has a brother on the same controller which is also spun down, we will spin up both drives.  The effect of this is that there might be two drives spun up when a media stream is accessed on an array where all drives are spun down.

 

I would appreciate those who have seen this problem to verify for me that this is the cause, since I don't have access to all the h/w out there to test with.  The way I test this is as follows.  First find a good VOB file to use - I use a music video since if the music stops I know the stream has stalled.  Next, put this VOB file on each disk, e.g., /mnt/disk1/test.vob, /mnt/disk2/test.vob, etc.

 

Now go to webGui and click 'Spin down all'.  Then via Network, drill down to say disk1/test.vob, click on it, notice spin up delay, and then video should start.  Now from console type a command like this:

 

dd of=/dev/null bs=512K count=1K if=/dev/sda

 

Replace 'sda' with whichever disk you want to cause to spin up and read.  You should be able to verify that the media stream stalls only if a) the media streaming drive is on a controller that controls two disks, and b) you try to spin-up/read the "other" drive on that same controller - and spin-up/read of any other drive does not stall the media stream.

 

One more unrelated thing I noticed - if you click 'Spin down all', and then click 'Spin down all' again, the second spin down will take a long time.  This is because just before doing a spin-down, the code executes a 'sync'.  Well somewhere along the way, the linux kernel was changed so that a sync re-writes file system superblocks whether they are 'dirty' or not - this results in sync causing all the drives to spin back up, some writes take place, then finally they are all spun down again  :o  Anyway I'll figure out a solution to this one too...

 

Link to comment

Here is the implication.  If we start a media stream off say disk2, and all other disks are spun down, if you try to access disk3, then the media stream will grind to a halt because as soon as the linux disk driver starts an I/O request for disk3, that I/O request will just sit there waiting for the disk to spin up; meantime, I/O requests coming for disk2 (ie, additional media file reads), get backed up until the disk3 request finishes.

 

Yes.   This does appear to be the cause.  I can confirm that my motherboard and add-on SATA controller are doing this.  I didn't think that any SATA controller acted this way... something I'll need to look into now when purchasing new controllers.

 

Here is my proposed workaround: I can change the code so that if an I/O request is directed toward a drive that is spun down, then if that drive has a brother on the same controller which is also spun down, we will spin up both drives.  The effect of this is that there might be two drives spun up when a media stream is accessed on an array where all drives are spun down.

 

I assume that when spinning down due to inactivity, you'll also need to check the activity of a drive's siblings, and only spin it down if all siblings have >= timeout of inactivity.
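
Roughly, the check I have in mind looks like this. It is only an illustration in shell, not how the unRAID driver would implement it; the function name and the idea of passing idle times in seconds are my assumptions.

```shell
#!/bin/sh
# may_spin_down TIMEOUT IDLE...: succeed only if every idle time given
# (the drive's own and each sibling's, in seconds) is >= TIMEOUT.
# (Hypothetical helper, for illustration only.)
may_spin_down() {
    timeout=$1
    shift
    [ "$#" -gt 0 ] || return 1      # no idle times given: refuse
    for idle in "$@"; do
        [ "$idle" -ge "$timeout" ] || return 1
    done
    return 0
}

# Example: 1-hour timeout, drive idle for 2 hours, sibling idle 10 minutes.
#   may_spin_down 3600 7200 600 || echo "keep both drives spinning"
```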

 

Edit:

Are users of SATA port multipliers affected by this as well?

Link to comment

I can confirm that I'm using two Promise TX4 controllers, as well as a PCIe Silicon Image eSATA card with port multiplier. I am looking forward to support for the 8 port PCIe SATA card from SuperMicro to replace the two TX4s.

 

Following the instructions posted by Tom, if you are streaming a video off a drive connected to your port multiplier, and then read data from a spun down drive also on your port multiplier, is the video stream interrupted?

 

Note: I needed to use higher bit rate HD Mpeg2 streams for testing, as SD VOBs didn't always glitch.

 

Link to comment

Are users of SATA port multipliers affected by this as well?

 

Probably not, but that is a good question.

In theory, when using FIS-based switching, you should not be affected; I am not sure whether command-based switching includes those requests.

Edit: I did a quick test and also had a short freeze on a running HD-stream. Using Sil3132 and Sil3726 PM chips.

Link to comment

Are users of SATA port multipliers affected by this as well?

 

Probably not, but that is a good question.

In theory, when using FIS-based switching, you should not be affected; I am not sure whether command-based switching includes those requests.

Edit: I did a quick test and also had a short freeze on a running HD-stream. Using Sil3132 and Sil3726 PM chips.

 

Yeah probably command based switching will have that problem.  Do you know if 3132 is cmd-based or FIS-based?

Link to comment

Here is the cause of the (infamous?) "different disk spinning up causes my media stream to stall" problem.  It turns out this is a disk controller hardware limitation, but I have an idea for a workaround.

 

<snip>

Here is my proposed workaround: I can change the code so that if an I/O request is directed toward a drive that is spun down, then if that drive has a brother on the same controller which is also spun down, we will spin up both drives.  The effect of this is that there might be two drives spun up when a media stream is accessed on an array where all drives are spun down.

<snip>

Tom: Perhaps you can make the "pair-spin-up" a tunable feature, to satisfy both those who want their movie watching experience to not suffer, and to please those who care about every "watt" consumed and could care less about the spin-up-delay pause of their data stream.

 

brainbone: Good point about the spin-down logic needing to take account of the pairing of the spin-up. Otherwise, an hour into a movie, the same pause could occur when the paired drive, which has spun down in the meantime, is accessed and has to spin up again.

 

Joe L.

Link to comment
