Still getting DMA errors

April 22, 2006

Well, I believe it is better than the previous release, but I'm still getting DMA errors on the one disk run by the Promise controller. Here is what the syslog has to say:

Apr 22 10:27:16 Tower kernel: hdi: dma_intr: error=0x40 { UncorrectableError }, LBAsect=61166819, high=3, low=10835171, sector=61166752

Apr 22 10:27:16 Tower kernel: end_request: I/O error, dev 38:01 (hdi), sector 61166752

Apr 22 10:27:16 Tower kernel: md4: read error!

Apr 22 10:27:16 Tower kernel: end_read_request 61166752/2, count: 1, uptodate 0.

Apr 22 10:27:17 Tower kernel: get_token: status

April 22, 2006

That looks like a "legitimate" read error.

FYI, here's one old thread (of many) in the linux-kernel archive which talks about this. Click on the "Next in thread" link to follow the discussion. Note: I'm not advocating trying any of the things talked about in the thread.
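On reading the first log line: the `high` and `low` values look like the upper and lower 24 bits of the failing sector's LBA, and the numbers in the log are consistent with that. A minimal Python sketch, assuming that layout (the helper names `parse_fields` and `lba_from_high_low` are made up for illustration):

```python
import re

# First kernel line from the syslog excerpt, joined onto one line.
LOG_LINE = (
    "Apr 22 10:27:16 Tower kernel: hdi: dma_intr: error=0x40 "
    "{ UncorrectableError }, LBAsect=61166819, high=3, low=10835171, "
    "sector=61166752"
)


def parse_fields(line: str) -> dict:
    """Collect the name=decimal pairs (LBAsect, high, low, sector) from a log line.

    The trailing \\b skips hex values such as error=0x40.
    """
    return {name: int(value) for name, value in re.findall(r"(\w+)=(\d+)\b", line)}


def lba_from_high_low(high: int, low: int) -> int:
    """Combine two 24-bit halves into one LBA (assumed layout, not confirmed in the thread)."""
    return (high << 24) | low


fields = parse_fields(LOG_LINE)
lba = lba_from_high_low(fields["high"], fields["low"])
print(lba == fields["LBAsect"])  # True: 3 * 2**24 + 10835171 == 61166819
```

The check only shows that the three numbers agree with each other; it says nothing about why the sector could not be read.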

April 23, 2006

Author

As this drive is only 3 months old and has a 5 year warranty, I may have to look into sending it back to Seagate for replacement.

My drives never get above 35C; they typically run around 30C, so I don't think heat is a problem. I only have three drives on the one power supply, so I don't think that is an issue either.

Any ideas on what can be done, or does this mean it is time to get a new drive?

thanks,

April 23, 2006

There are some other things you can try.

1. Swap it with a drive connected to the on-board IDE controllers, e.g., swap disk4 with disk2.

or

2. Try forcing it to use a slower DMA mode:

To set a drive to a specific Ultra DMA mode, you need to edit the "go" file on the Flash. You will see a series of "hdparm" commands used to set up each disk. For example, for disk9, there's:

hdparm -c1d1a0m8A1W1u1  /dev/hdf

To force this disk to use Ultra DMA mode 4, change the line to this:

hdparm -c1d1a0m8A1W1u1 -Xudma4  /dev/hdf
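The same edit can be scripted if several disks need it. A minimal Python sketch, assuming the "go" file contains hdparm lines in exactly the form quoted above; `add_udma_mode` is a hypothetical helper name, and the sketch works on the file's text only, not on any particular path:

```python
import re


def add_udma_mode(go_text: str, device: str, mode: str = "udma4") -> str:
    """Insert -X<mode> before the device name on that device's hdparm line.

    Lines that already carry an extra option (such as -Xudma4) no longer
    match the pattern, so running this twice changes nothing.
    """
    pattern = re.compile(
        r"^(hdparm[ \t]+\S+)([ \t]+)(" + re.escape(device) + r")[ \t]*$",
        re.MULTILINE,
    )
    return pattern.sub(
        lambda m: f"{m.group(1)} -X{mode}{m.group(2)}{m.group(3)}", go_text
    )


# The disk9 line quoted in the post.
original = "hdparm -c1d1a0m8A1W1u1  /dev/hdf\n"
print(add_udma_mode(original, "/dev/hdf"), end="")
# hdparm -c1d1a0m8A1W1u1 -Xudma4  /dev/hdf
```

Keep a copy of the original file before changing it, since a bad line in "go" affects every boot.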

April 24, 2006

Author

So I took the drive out and installed it in one of my XP machines and ran the SMART check on it. It says there was nothing wrong with the drive. Turns out I was wrong, it is a WD 320GB drive, not a Seagate.

I put it back into the tower and set the hdi drive to have the -Xudma4 setting. Where can I look to verify this happened? I did not see anything obvious in messages or syslog.

thanks,

April 24, 2006

Author

I'm now running a parity check and getting loads of read errors. What does that mean? I assume I need to replace the drive, but does this mean that UnRaid has fixed the read errors, so my data is still intact?

I'll be processing my warranty now.

April 26, 2006

If you are doing a parity-check and a disk reports a read error, the UnRaid software will reconstruct the data that could not be read and then write it back to the disk which reported the read error. If this write also fails, then the disk is disabled.

In your case, each time you see a read error during parity-check you should also see the Write counter increment.
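To illustrate the reconstruct-and-write-back idea: a minimal Python sketch, assuming simple single-disk XOR parity. The thread does not show the actual UnRaid code, so the function names and the sample bytes are invented for illustration:

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def reconstruct(parity, surviving_blocks):
    """Recover the one unreadable block from parity plus every other data block."""
    return xor_blocks([parity, *surviving_blocks])


# Invented sample data: one block from each of two data disks.
disk2 = bytes([0x12, 0x34, 0x56, 0x78])
disk4 = bytes([0xDE, 0xAD, 0xBE, 0xEF])
parity = xor_blocks([disk2, disk4])

# Pretend disk4 returned a read error: rebuild its block from the rest.
rebuilt = reconstruct(parity, [disk2])
print(rebuilt == disk4)  # True
```

The rebuilt block is what would be written back to the failing disk, which is why the Write counter goes up alongside the read errors.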

April 26, 2006

Author

I guess that is where the writes were coming from.

I just installed my new disk and am rebuilding the array.

I'm still getting errors during the rebuild on the brand new disk.

Apr 26 18:13:54 Tower kernel: hdi: lost interrupt

Apr 26 18:15:30 Tower kernel: hdi: lost interrupt

Apr 26 18:40:27 Tower kernel: hda: dma_intr: status=0x51 { DriveReady SeekComplete Error }

Apr 26 18:40:27 Tower kernel: hda: dma_intr: error=0x84 { DriveStatusError BadCRC }

The rebuild time looks to be about 26 hours again.

Something is just not right at all. I can't believe that the hard drive is really bad; it is brand new!

I think the Promise controller support is just lacking. I currently have a total of 3 drives in my system, so I would like to move the drives around to not use the Promise controllers.

parity, master on MB

<empty> slave on MB

drive 2 master on secondary MB

<empty> slave on secondary MB

drive 4 master on 1st Promise

drive 4 is rebuilding.

Can I stop the rebuild, remove drive 4, stick it in as drive 1 or 3 and then rebuild? Or do I have to wait the 25 hours until the rebuild finishes?

thanks,

April 26, 2006

If there is data on disk4 that you want to keep,

then you must let the rebuild wholly complete. (with apologies to Johnnie Cochran)

The 'lost interrupt' you see on hdi (disk4) is not a concern; it is informational only.

The 'BadCRC' error being reported for hda (parity) is a bigger concern. Note that this is the m/b primary controller, not the Promise controller. Check your cables and cable seating. When data is transferred from the disk to the m/b via UltraDMA, a CRC is appended to each transfer over the interface. That CRC check is what is failing here, and it indicates a transmission problem. It could also be a weak power supply.
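To show why a CRC mismatch points at the cable or interface and not the platter: a minimal Python sketch using CRC-CCITT from the standard library as a stand-in. This is not the exact CRC that UltraDMA uses, and the sector contents are invented:

```python
import binascii


def crc16(data: bytes) -> int:
    """CRC-CCITT via binascii.crc_hqx; a stand-in, NOT the real UltraDMA CRC."""
    return binascii.crc_hqx(data, 0xFFFF)


# Invented 512-byte sector as it leaves the drive.
sector = bytes(range(256)) * 2
sent_crc = crc16(sector)

# Simulate one bit flipped on the cable between drive and controller.
corrupted = bytearray(sector)
corrupted[100] ^= 0x04
received_crc = crc16(bytes(corrupted))

print(sent_crc != received_crc)  # True: the receiver sees a CRC mismatch
```

The data on the disk is fine in this scenario; it is damaged in transit, which is why reseating or replacing the cable is the first thing to try.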

April 27, 2006

Author

Wow, I did not even see that it was hda this time.

The problem I was having is that the rebuild would never finish. The system always hung due to a DMA error.

I switched the two Promise cards and the array did rebuild.

Currently I'm using a Sparkle FSP300-60THN 300W LGA775 power supply. It is powering the MB and the three drives.

I have a second power supply, should I use it to power the drives, and have the first one only power the MB? I thought 300W would have been plenty for the MB and 3 drives.

April 27, 2006

Looking at the specs of this power supply from Newegg, we see this is a dual-rail power supply:

+12V1 is rated at 8A

+12V2 is rated at 14A

The +12V2 is supposed to be the rail for the 4-pin connector to the motherboard for CPU power; and, the +12V1 is for all the other components (as I understand it). I would not be comfortable with only 8A for the disk drives, but in reality, once they are spun up, probably should work.

Here is a page which explains a bit about dual-rail supplies (discussion starts about half way down the page).

Probably with only 3 drives this should work fine, but since you have another supply, it's easy to try it out.
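A rough budget for the +12V1 rail, as a Python sketch. The 8 A rating comes from the specs above; the 2 A per-drive spin-up figure is only an assumption for illustration, so check the drive's datasheet for the real number:

```python
RAIL_12V1_AMPS = 8.0                 # rating quoted from the power supply specs
RAIL_VOLTS = 12.0
ASSUMED_SPINUP_AMPS_PER_DRIVE = 2.0  # assumption, not from the thread


def spinup_headroom(drives: int,
                    per_drive_amps: float = ASSUMED_SPINUP_AMPS_PER_DRIVE,
                    rail_amps: float = RAIL_12V1_AMPS) -> float:
    """Amps left on the rail if every drive spins up at the same moment."""
    return rail_amps - drives * per_drive_amps


print(RAIL_12V1_AMPS * RAIL_VOLTS)  # 96.0 watts available on +12V1
print(spinup_headroom(3))           # 2.0 amps to spare under the assumed figure
print(spinup_headroom(5))           # -2.0: five such drives would exceed the rail
```

Fans and other +12V loads share the same rail, so the real margin is smaller than this shows.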

Another quick check is to go to the BIOS Advanced -> Hardware Monitoring page and look at the voltages displayed there. I once had a flaky PC which would fail in all sorts of ways, and a quick check here showed my +5V was reading +4.1V. I changed the power supply and the system has been rock solid ever since.
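For judging the readings on that page: a minimal Python sketch, assuming the commonly cited ATX tolerance of plus or minus 5% on the positive rails. The function name and the 11.8 V sample reading are illustrative only:

```python
ASSUMED_TOLERANCE = 0.05  # +/-5%, the commonly cited ATX figure for +3.3V, +5V, +12V


def within_tolerance(nominal: float, measured: float,
                     tolerance: float = ASSUMED_TOLERANCE) -> bool:
    """True if the measured voltage is inside nominal +/- tolerance."""
    return abs(measured - nominal) <= nominal * tolerance


print(within_tolerance(5.0, 4.1))    # False: 4.1V is 18% below nominal
print(within_tolerance(12.0, 11.8))  # True: inside the 11.4V to 12.6V window
```

BIOS sensor readings are not very accurate, so a reading near the edge of the window is worth confirming with a multimeter.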

Another thing to try is to bypass any mobile racks you might be using by connecting the drives directly to the cables, and see if this decreases or eliminates the error rate.

May 12, 2006

Tom

I contacted you offline shortly after buying my system about one of my bays not working, but thought I would share the problem here in case anyone else has any ideas. I am not an advanced user, so it took me a while to telnet into the server and find out that it appears I have the now famous DMA errors!

Here's the interesting bit though - I always have disk 5 showing as disabled in the main screen - even though I have done the following:

1. Tried different drives - I have put at least 4 different (all new) drives in that bay to no avail

2. I have swapped cases, in case it was a bad connection in one of the cases causing the problem

3. Swapped power cables in case it was a low voltage (or other power supply) problem to that bay

4. Tried the Ultra DMA 4 solution that Tom has included in a couple of posts here

5. I have powered down and pulled out the IDE controller card and reseated it as well as all cables

I therefore know it is not a problem with my drives - they each worked (and all continue to work) when put into a different bay and I had the array re-built. For some reason, anything I put in bay number 5 (drive j) does not work. Everything else continues to work fine - just not this damn bay 5.

I finally got into telnet and looked at the syslog, and I see a whole list of what appear to my untrained eye to be DMA errors.

So, I am at a loss as to what to do next. I am running version 1.050930 and understand that the later beta version might help. Can someone indicate whether they have seen this same issue? Most other posts seem to point to a particular drive, not a particular bay. At this point, I would prefer to only change something that I know is likely to fix the problem. I have a lot to lose by experimenting (1.5TB to be exact!).

Any offers of help??

Here is an excerpt from my syslog in case that helps point me in the right direction:

te Error }