Still getting DMA errors



Well, I believe it is better than the previous release, but I'm still getting DMA errors on the one disk attached to the Promise controller.  Here is what the syslog has to say.

 

Apr 22 10:27:16 Tower kernel: hdi: dma_intr: error=0x40 { UncorrectableError }, LBAsect=61166819, high=3, low=10835171, sector=61166752

Apr 22 10:27:16 Tower kernel: end_request: I/O error, dev 38:01 (hdi), sector 61166752

Apr 22 10:27:16 Tower kernel: md4: read error!

Apr 22 10:27:16 Tower kernel: end_read_request 61166752/2, count: 1, uptodate 0.

Apr 22 10:27:17 Tower kernel: get_token: status

Link to comment

As this drive is only 3 months old and has a 5 year warranty, I may have to look into sending it back to Seagate for replacement.

 

My drives never get above 35C; they typically run around 30C, so I don't think heat is a problem.  I only have three drives on the one power supply, so I don't think that is an issue either.

 

Any ideas on what can be done, or does this mean it is time to get a new drive?

 

thanks,

 

Link to comment

There are some other things you can try.

 

1. Swap it with a drive connected to the on-board IDE controllers, e.g., swap disk4 with disk2.

 

or

 

2. Try forcing it to use a slower DMA mode:

 

To set a drive to a specific Ultra DMA mode, you need to edit the "go" file on the Flash.  You will see a series of "hdparm" commands used to set up each disk.  For example, for disk9, there's:

 

hdparm -c1d1a0m8A1W1u1  /dev/hdf

 

To force this disk to use Ultra DMA mode 4, change the line to this:

 

hdparm -c1d1a0m8A1W1u1 -Xudma4  /dev/hdf
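Once the edited "go" file has run, one way to check which mode the drive actually ended up in is the identify output from hdparm, which marks the active transfer mode with an asterisk.  A minimal sketch, assuming the usual "hdparm -i" layout with a "UDMA modes:" line; the helper name active_udma_mode is only for illustration:

```shell
# Print the UDMA mode marked with '*' in `hdparm -i` output read on stdin.
# Prints nothing if no UDMA mode is marked active.
active_udma_mode() {
    grep 'UDMA modes:' | tr ' ' '\n' | grep '^\*' | tr -d '*'
}

# Typical use on the server (needs root and the real device):
#   hdparm -i /dev/hdf | active_udma_mode
```

If this prints udma4 for the disk in question, the -Xudma4 setting took effect.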

 

Link to comment

So I took the drive out and installed it in one of my XP machines and ran the SMART check on it.  It says there was nothing wrong with the drive.  Turns out I was wrong, it is a WD 320GB drive, not a Seagate.

 

I put it back into the tower and set the hdi drive to have the -Xudma4 setting.  Where can I look to verify this happened?  I did not see anything obvious in messages or syslog.

 

thanks,

 

Link to comment

If you are doing a parity-check and a disk reports a read error, the UnRaid software will recalculate the data that could not be read and then write it back to the disk which reported the read error.  If this write subsequently also fails, then the disk is disabled.

 

In your case, each time you see a read error during parity-check you should also see the Write counter increment.
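To make the "recalculate and write back" step concrete: in single-parity schemes the parity byte is commonly the XOR of the corresponding byte on every data disk, so a byte that cannot be read can be recovered by XOR-ing the parity byte with the bytes from the surviving disks.  The sketch below assumes plain XOR parity (this thread does not spell out the exact scheme) and uses made-up byte values; xor_bytes is a hypothetical helper.

```shell
# XOR all arguments together and print the result as a two-digit hex byte.
# Arguments may be decimal or 0x-prefixed hex.
xor_bytes() {
    acc=0
    for b in "$@"; do
        acc=$(( acc ^ $b ))
    done
    printf '0x%02X\n' "$acc"
}

# Parity over two data disks holding 0x5A and 0x3C at the same offset:
#   xor_bytes 0x5A 0x3C
# Recover the second disk's byte from parity and the first disk:
#   xor_bytes 0x66 0x5A
```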

Link to comment

I guess that is where the writes were coming from.

 

I just installed my new disk and am rebuilding the array.

 

I'm still getting errors during the rebuild on the brand new disk.

 

Apr 26 18:13:54 Tower kernel: hdi: lost interrupt

Apr 26 18:15:30 Tower kernel: hdi: lost interrupt

Apr 26 18:40:27 Tower kernel: hda: dma_intr: status=0x51 { DriveReady SeekComplete Error }

Apr 26 18:40:27 Tower kernel: hda: dma_intr: error=0x84 { DriveStatusError BadCRC }

 

The rebuild time looks to be about 26 hours again.

 

Something is just not right.  I can't believe that the hard drive is really bad; it is brand new!

 

I think the Promise controller support is just lacking.  I currently have a total of 3 drives in my system, so I would like to move the drives around so that none of them use the Promise controllers.

 

parity, master on MB

<empty> slave on MB

drive 2 master on secondary MB

<empty> slave on secondary MB

drive 4 master on 1st Promise

 

drive 4 is rebuilding.

 

Can I stop the rebuild, remove drive 4, stick it in as drive 1 or 3 and then rebuild?  Or do I have to wait the 25 hours until the rebuild finishes?

 

thanks,

 

Link to comment

If there is data on disk4 that you want to keep,

then you must let the rebuild wholly complete.  (with apologies to Johnnie Cochran)

 

The 'lost interrupt' you see on hdi (disk4) is not a concern; it is informational only.

 

The 'BadCRC' error being reported for hda (parity) is a bigger concern.  Note that this is the m/b primary controller, not the Promise controller.  Check your cables and cable seating.  When data is transferred between the disk and the m/b via UltraDMA, a CRC is appended to each transfer over the interface.  That CRC check is what is failing here, which indicates a transmission problem.  It could also be a weak power supply.
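Since it is easy to miss which device a given error belongs to, a quick tally of the "dma_intr: error=" lines per device can help.  This is only a sketch: tally_dma_errors is a made-up helper name, and it assumes each kernel message sits on a single unwrapped line (a syslog pasted from an 80-column telnet window would need the wrapped lines joined first).

```shell
# Read syslog text on stdin and print "<count> <device> <error flags>"
# for every distinct device/flag combination in dma_intr error lines.
tally_dma_errors() {
    awk '/dma_intr: error=/ {
        dev = ""
        # The device name is the field right after "kernel:", e.g. "hda:"
        for (i = 1; i < NF; i++) if ($i == "kernel:") { dev = $(i + 1); break }
        sub(/:$/, "", dev)
        # Keep only the text between the braces, e.g. "DriveStatusError BadCRC"
        flags = $0
        sub(/^.*[{] */, "", flags)
        sub(/ *[}].*$/, "", flags)
        count[dev " " flags]++
    }
    END { for (k in count) print count[k], k }' | sort -k2
}

# Typical use on the server:
#   tally_dma_errors < /var/log/syslog
```

BadCRC counts point at the cable, tray or power side of things, while UncorrectableError counts point at the disk surface itself.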

Link to comment

Wow, I did not even see that it was hda this time.

 

The problem I was having is that the rebuild would never finish.  The system always hung due to a DMA error.

 

I switched the two Promise cards and the array did rebuild.

 

Currently I'm using a Sparkle FSP300-60THN 300W LGA775 power supply.  It is powering the MB and the three drives.

 

I have a second power supply, should I use it to power the drives, and have the first one only power the MB?  I thought 300W would have been plenty for the MB and 3 drives.

 

 

Link to comment

Looking at the specs of this power supply from newegg, we see this is a dual-rail power supply:

+12V1 is rated at 8A

+12V2 is rated at 14A

 

The +12V2 is supposed to be the rail for the 4-pin connector to the motherboard for CPU power, and the +12V1 is for all the other components (as I understand it).  I would not be comfortable with only 8A for the disk drives, but in reality, once they are spun up, it should probably work.
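A rough way to see why 8A feels tight is to compare the rail rating with the combined spin-up draw of the drives.  The sketch below is only back-of-the-envelope arithmetic: rail_headroom is a made-up helper, and the per-drive spin-up current is a placeholder you would need to replace with the figure from your drive's datasheet.

```shell
# usage: rail_headroom RAIL_AMPS DRIVES AMPS_PER_DRIVE
# Prints the amps left on the rail if every drive spins up at once.
# A negative result means the rail would be overloaded during spin-up.
rail_headroom() {
    awk -v rail="$1" -v n="$2" -v each="$3" \
        'BEGIN { printf "%.1f\n", rail - n * each }'
}

# Example with an ASSUMED 2.5A spin-up current per drive on the 8A +12V1 rail:
#   rail_headroom 8 3 2.5
```

Anything else hanging off the same rail (fans, the controller cards) eats into whatever headroom is left.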

 

Here is a page which explains a bit about dual-rail supplies (discussion starts about half way down the page).

 

Probably with only 3 drives this should work fine, but since you have another supply, it's an easy thing to try out.

 

Another quick check is to go to the BIOS Advanced -> Hardware Monitoring page and look at the voltages displayed there.  I once had a flaky PC which would fail in all sorts of ways, and a quick check here showed my +5V was reading +4.1V.  I changed the power supply and the system has been rock solid ever since.
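As a rule of thumb, ATX supplies are expected to hold the main rails within roughly 5% of nominal, so a reading can be sanity-checked with a little arithmetic.  A small sketch; rail_in_tolerance is a made-up helper, and the 5% default is that rule of thumb rather than a figure from this supply's spec sheet:

```shell
# usage: rail_in_tolerance NOMINAL_VOLTS MEASURED_VOLTS [PERCENT]
# Reports how far the reading is from nominal and whether that is within PERCENT.
rail_in_tolerance() {
    awk -v nom="$1" -v meas="$2" -v pct="${3:-5}" 'BEGIN {
        dev = (meas - nom) / nom * 100
        if (dev < 0) dev = -dev
        printf "%.1f%% off nominal: %s\n", dev, (dev <= pct ? "ok" : "out of tolerance")
    }'
}

# The flaky PC mentioned above, +5V rail reading 4.1V:
#   rail_in_tolerance 5 4.1
```

Keep in mind that BIOS readings come from the motherboard's own sensors and can be a little off, so treat a borderline result as a hint rather than proof.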

 

Another thing to try is to bypass any mobile racks you might be using by simply connecting drives directly to cables & see if this decreases or eliminates the error rate.

 

 

 

 

 

Link to comment
  • 3 weeks later...

Tom

 

I contacted you offline shortly after buying my system about 1 of my bays not working, but thought I would share the problem here in case anyone else has any ideas.  I am not an advanced user, so it took me a while to telnet into the server and find out that it appears I have the now famous DMA errors!

 

Here's the interesting bit though - I always have disk 5 showing as disabled in the main screen - even though I have done the following:

1. Tried different drives - I have put at least 4 different (all new) drives in that bay to no avail

2. I have swapped cases, in case it was a bad connection in one of the cases causing the problem

3. Swapped power cables in case it was a low voltage (or other power supply) problem to that bay

4. Tried the Ultra DMA 4 solution that Tom has included in a couple of posts here

5. I have powered down and pulled out the IDE controller card and reseated it as well as all cables

 

I therefore know it is not a problem with my drives - they each worked (and all continue to work) when put into a different bay and I had the array re-built.  For some reason, anything I put in bay number 5 (drive j) does not work.  Everything else continues to work fine - just not this damn bay 5.

 

I finally got into telnet and looked at the syslog, and I see a whole list of what appear to my untrained eye to be DMA errors.

 

So, I am at a loss as to what to do next.  I am running version 1.050930 and understand that the later beta version might help, but can someone indicate whether they have seen this same issue?  Most other posts seem to point to a particular drive, not a particular bay.  At this point, I would prefer to only change something that I know is likely to fix the problem.  I have a lot to lose by experimenting (1.5TB to be exact!).

 

Any offers of help??

 

Here is an excerpt from my syslog in case that helps point me in the right direction:

 

te Error }

May 12 12:02:12 Tower kernel: hdj: dma_intr: error=0x84 { DriveStatusError BadCRC }

May 12 12:02:12 Tower kernel: hdj: dma_intr: status=0x51 { DriveReady SeekComplete Error }

May 12 12:02:12 Tower kernel: hdj: dma_intr: error=0x84 { DriveStatusError BadCRC }

May 12 12:02:12 Tower kernel: hdj: dma_intr: status=0x51 { DriveReady SeekComplete Error }

May 12 12:02:12 Tower kernel: hdj: dma_intr: error=0x84 { DriveStatusError BadCRC }

May 12 12:02:12 Tower kernel: PDC202XX: Primary channel reset.

May 12 12:02:12 Tower kernel: ide4: reset: success

May 12 12:02:12 Tower kernel: hdj: dma_intr: status=0x51 { DriveReady SeekComplete Error }

May 12 12:02:12 Tower kernel: hdj: dma_intr: error=0x84 { DriveStatusError BadCRC }

May 12 12:02:12 Tower kernel: end_request: I/O error, dev 38:40 (hdj), sector 0

May 12 12:02:12 Tower kernel: hdj: dma_intr: status=0x51 { DriveReady SeekComplete Error }

May 12 12:02:12 Tower kernel: hdj: dma_intr: error=0x84 { DriveStatusError BadCRC }

May 12 12:02:12 Tower kernel: hdj: dma_intr: status=0x51 { DriveReady SeekComplete Error }

May 12 12:02:12 Tower kernel: hdj: dma_intr: error=0x84 { DriveStatusError BadCRC }

May 12 12:02:12 Tower kernel: hdj: dma_intr: status=0x51 { DriveReady SeekComplete Error }

May 12 12:02:12 Tower kernel: hdj: dma_intr: error=0x84 { DriveStatusError BadCRC }

May 12 12:02:12 Tower kernel: hdj: dma_intr: status=0x51 { DriveReady SeekComplete Error }

May 12 12:02:12 Tower kernel: hdj: dma_intr: error=0x84 { DriveStatusError BadCRC }

May 12 12:02:12 Tower kernel: PDC202XX: Primary channel reset.

 

Link to comment

That's not exactly the same kind of DMA error that has been cropping up.  The clue is this:

 

DriveStatusError BadCRC

 

"BadCRC" usually indicates a cabling problem between the disk controller and the disk itself.  ["CRC" is a "check code" appended to each transfer between disk and controller, and was introduced to the ATA standard when UDMA came along.]

 

It looks like you have tried reasonable steps.  At this point I would suspect the drive tray itself.  So here is something to try.  You can connect the disk to the cable directly, bypassing the disk tray.  The easiest way to do this is probably to get another cable, connect it to the controller card, then connect it to your drive and "borrow" the power connector from the bottom bay.

 

If you don't want to screw around with this (and I wouldn't blame you), send me an email and we'll make it right (we can send you another system, and once you're satisfied it's working, you can send the original back - we'll also send you a FedEx shipping label for it).

 

 

Link to comment
