
Any idea how to handle this?  Performance grinds to a halt and I see this in the logs:

 

hdk: dma_timer_expiry: dma status == 0x60
hdk: timeout waiting for DMA
PDC202XX: Secondary channel reset.
hdk: timeout waiting for DMA
hdk: (__ide_dma_test_irq) called while not waiting
hdk: status error: status=0x58 { DriveReady SeekComplete DataRequest }

hdk: drive not ready for command
hdk: status timeout: status=0xd1 { Busy }

hdl: DMA disabled
PDC202XX: Secondary channel reset.
hdk: drive not ready for command
ide5: reset: success
blk: queue c0326c44, I/O limit 4095Mb (mask 0xffffffff)
hdk: dma_timer_expiry: dma status == 0x20
hdk: timeout waiting for DMA
PDC202XX: Secondary channel reset.
hdk: timeout waiting for DMA
hdk: (__ide_dma_test_irq) called while not waiting
hdk: status error: status=0x58 { DriveReady SeekComplete DataRequest }

hdk: drive not ready for command
hdk: status timeout: status=0xd1 { Busy }

PDC202XX: Secondary channel reset.
hdk: drive not ready for command
ide5: reset: success
hdk: dma_timer_expiry: dma status == 0x20
hdk: timeout waiting for DMA
PDC202XX: Secondary channel reset.
hdk: timeout waiting for DMA
hdk: (__ide_dma_test_irq) called while not waiting
hdk: status error: status=0x58 { DriveReady SeekComplete DataRequest }

hdk: drive not ready for command
hdk: status timeout: status=0xd1 { Busy }

PDC202XX: Secondary channel reset.
hdk: drive not ready for command
ide5: reset: success
hdk: dma_timer_expiry: dma status == 0x20
hdk: timeout waiting for DMA
PDC202XX: Secondary channel reset.
hdk: timeout waiting for DMA
hdk: (__ide_dma_test_irq) called while not waiting
hdk: status error: status=0x58 { DriveReady SeekComplete DataRequest }

hdk: drive not ready for command
hdk: status timeout: status=0xd1 { Busy }

PDC202XX: Secondary channel reset.
hdk: drive not ready for command
ide5: reset: success

 

I'm using Promise 133 controllers in this, could that be the problem?  I have four Unraid boxes running and this is the first one I've seen this on.

 

Link to comment

smeehrrr,

 

Make sure that your Promise 133 TX2 PATA PCI cards have the latest firmware installed.

 

You can check for and get the latest firmware off of the Promise website.

 

Yup, they're current.  I have a floppy I use when building these machines that updates the BIOS and then updates the IDE firmware.

 

Link to comment

This is the DMA issue that I and many others have suffered with; have you read the threads about it?  Basically, the UnRaid OS in its current state does not handle DMA errors well at all.  The grind to a halt you describe looked to many of us like a total lockup.  You have one or more drives that UnRaid does not like at all, and until Tom provides a fix your only option is to pull the suspect drive(s) out of the system.  Looking at the log, it appears hdk is the culprit, which is disk6, or the 7th drive counting the parity drive as #1.  For reference:

 

    hda = parity

    hdb = disk1

    hdc = disk2

    hdd = disk3

    hdi = disk4

    hdj = disk5

    hdk = disk6

    hdl = disk7

    hde = disk8

    hdf = disk9

    hdg = disk10

    hdh = disk11

 

If I were you, I would pull that drive out, reset the array, and see if the problem goes away.
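
If it helps, here's a rough sketch (purely illustrative, and it assumes the syslog lives at /var/log/syslog, which may not match your box) that scans the log for the DMA timeouts shown above and reports which devices are affected; map each hit to its slot with the table:

for dev in hda hdb hdc hdd hde hdf hdg hdh hdi hdj hdk hdl; do
    # flag any device that has logged a DMA timeout
    if grep -q "${dev}: timeout waiting for DMA" /var/log/syslog; then
        echo "DMA timeouts on /dev/${dev} -- check the table above for its slot"
    fi
done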

Link to comment

DeathtoToasters,

 

The version of the firmware on the Promise card is usually on a label attached to, or stenciled onto, the LSI controller chip.

 

As smeehrrr said in a previous post, you will need a machine that has a floppy drive, as updating the Promise controller firmware requires booting into DOS.

 

Regards,

TCIII

Link to comment

I had some success moving the drive with DMA errors from the Promise controller to the motherboard controller.  If you happen to have a slot open on the motherboard controller, you could try moving it.

 

You are supposed to be able to transparently swap drives around the UnRaid system, and have it recognize them and continue, but I don't think this works currently.

 

 

Link to comment

...

 

You are supposed to be able to transparently swap drives around the UnRaid system, and have it recognize them and continue, but I don't think this works currently.

 

 

This works in the current s/w as long as the data disks occupy the same set of slots.  For example, suppose you have parity, disk1, disk2, and disk3.  You can swap the drives around among the disk1, disk2, and disk3 slots, but you cannot put, say, disk1 in the disk4 slot (leaving the disk1 slot empty).  This will be fixed in a future release.

Link to comment

Tom,

 

Not sure if these links will give you any clues, but I did a bit of searching with google and perhaps one of them might help.

 

The following message from a Linux kernel mailing list seems to describe the issue some are having with lockups.  It specifically mentions the 20265 and 20267 chipsets on the Promise controller, but the file the patch applies to is named pdc202xx, so it may also apply to the 20268 chipset on the current Promise cards.

 

http://marc.theaimsgroup.com/?l=linux-kernel&m=104250818527780&w=2

 

The following link is to a very long thread of messages describing a similar patch for the 20265 through 20270 chipsets.

http://kerneltrap.org/node/3040  It contains links to various patches, including proposed patches to the 2.4 and 2.6 kernels.

 

Another thread describes how the Promise card ends up in an infinite loop, waiting for an interrupt that never occurs because interrupts have been disabled.

http://www.uwsg.iu.edu/hypermail/linux/kernel/0102.0/1334.html

 

It suggests replacing a "while" loop with a fixed delay call.

This

unsigned long timeout = jiffies + ((HZ + 19)/20) + 1;  /* roughly 50 ms from "now" */
while (0 < (signed long)(timeout - jiffies));          /* spins until the timer interrupt advances jiffies */

 

gets replaced with this

mdelay(50);  /* busy-waits 50 ms via a calibrated loop; no interrupt needed to finish */

Notice that nothing inside the "while" loop changes the exit condition.  The jiffies counter is advanced by the timer interrupt handler, so if that interrupt never occurs, the loop runs forever.  (These new 3GHz machines run infinite loops reallllllly fast, but they still seem to take forever to finish  :( :()

 

It mentions that if you have the NMI Watchdog enabled, you will get nice "oopses."

 

The NMI (non-maskable interrupt) watchdog can be enabled as shown at this link: http://slacksite.com/slackware/nmi.html

by adding "nmi_watchdog=1" to the boot parameters in the grub menu.
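
For example, a grub (menu.lst) boot entry might look something like this; the kernel path and root device are illustrative placeholders, not taken from the link:

title Linux
        kernel /boot/vmlinuz root=/dev/hda1 ro nmi_watchdog=1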

 

This does not fix the Promise controller bug, but it will apparently reboot the system if it gets hung.  Might this be something an unRaid user could try?  Do we have support for the NMI watchdog in our kernels?
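
A rough way to check whether the watchdog is actually running after boot (just a guess at a sanity check, not from the links above): watch the NMI count in /proc/interrupts, which should climb steadily while the watchdog is firing:

grep NMI /proc/interrupts
sleep 10
grep NMI /proc/interrupts    # the count should have increased if the watchdog is active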

 

Joe L.

Link to comment

Thanks Joe.  I've seen those threads and am investigating.  Some of the patches don't apply to our kernel, some are slightly different, and some have already been applied  ???

 

Using the NMI watchdog to at least reboot is an interesting idea.  It's currently not enabled, but I'll look into this.

Link to comment
  • 4 weeks later...

I *may* have stumbled across a potential "fix" for the UDMA errors!!! Mind you, I'm not seeing them myself, but while reading up on another Linux-based NAS I stumbled across a thread where these guys had the same sorts of issues. They figured out a way to stop it by reducing the DMA access level.

 

http://sourceforge.net/forum/forum.php?thread_id=1458967&forum_id=507589

 

Apparently there's a way to reduce the DMA access level to solve this using utilities. The drive manufacturers have utilities, and there's a utility in BSD too, so maybe something exists in Linux as well. The manufacturer tools are better, though, since they set the mode once and it sticks.

 

Hope that helps some!

Link to comment

One of the issues we've had in nailing down this problem is that we don't have any disks which consistently exhibit it.  However, we have an identical set of 12 Hitachi HDS722525VLAT80's, of which one occasionally fails with a DMA error.  After several weeks of testing, here are our conclusions:

 

1. The suspect disk never fails if connected to either connector of the on-board IDE controllers.

2. The disk will only fail if on the Secondary connector of a Promise controller.

3. Doesn't matter if using a "short" (18") or "long" (24") cable.

4. When it does fail, it hangs the system completely.

 

It does appear, however, that "slowing the disk down" solves the problem (or at least works around it).  The disk normally wants to operate in "Ultra DMA mode 5", but if forced to run in "Ultra DMA mode 4" it no longer fails with a DMA error, and the system is only very slightly slowed down.

 

To set a drive to a specific Ultra DMA mode, you need to edit the "go" file on the Flash.  You will see a series of "hdparm" commands used to set up each disk.  For example, for disk9, there's:

 

hdparm -c1d1a0m8A1W1u1  /dev/hdf

 

to force this disk to use Ultra DMA mode 4, change the line to this:

 

hdparm -c1d1a0m8A1W1u1 -Xudma4  /dev/hdf
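
To double-check that the mode actually changed, "hdparm -i" lists the drive's UDMA modes and marks the active one with an asterisk; the output should look roughly like this (exact format varies by drive and hdparm version):

hdparm -i /dev/hdf | grep -i udma
 UDMA modes: udma0 udma1 udma2 udma3 *udma4 udma5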

 

We haven't given up on solving this problem "correctly" however.

 

 

Link to comment

Tom,

 

Didn't you say that the new kernel was less DMA error prone than the last one, and that the 324 upgrade would fix the DMA issue...?  My 2 drives that were killing me have now been removed, so I can't tell whether running 324 (which I am) has solved the problem or not.  I still have one of the 2 drives; maybe I should stick it back in and see for myself.

Link to comment
