dalben

ATA or Hard Disk Errors - Help [SOLVED]


The system  (5.0 b12) is running a Parity Check and these errors are being thrown up every second:

 

Aug 30 07:28:11 tdm kernel: ata4: SATA link up 1.5 Gbps (SStatus 113 SControl 310) (Drive related)
Aug 30 07:28:11 tdm kernel: ata4.00: configured for UDMA/33 (Drive related)
Aug 30 07:28:11 tdm kernel: ata4: EH complete (Drive related)
Aug 30 07:28:11 tdm kernel: ata4.00: exception Emask 0x10 SAct 0x0 SErr 0x280100 action 0x6 frozen (Errors)
Aug 30 07:28:11 tdm kernel: ata4.00: irq_stat 0x08000000, interface fatal error (Errors)
Aug 30 07:28:11 tdm kernel: ata4: SError: { UnrecovData 10B8B BadCRC } (Errors)
Aug 30 07:28:11 tdm kernel: ata4.00: failed command: READ DMA EXT (Minor Issues)
Aug 30 07:28:11 tdm kernel: ata4.00: cmd 25/00:00:b8:65:1c/00:04:81:00:00/e0 tag 0 dma 524288 in (Drive related)
Aug 30 07:28:11 tdm kernel:          res 50/00:00:af:88:e0/00:00:e8:00:00/e0 Emask 0x10 (ATA bus error) (Errors)
Aug 30 07:28:11 tdm kernel: ata4.00: status: { DRDY } (Drive related)
Aug 30 07:28:11 tdm kernel: ata4: hard resetting link (Minor Issues)

 

and every now and again I see these:

 

Aug 30 07:27:45 tdm kernel: handle_stripe read error: 2166068552/3, count: 1 (Errors)
Aug 30 07:27:45 tdm kernel: md: disk3 read error (Errors)
Aug 30 07:27:45 tdm kernel: handle_stripe read error: 2166068560/3, count: 1 (Errors)
Aug 30 07:27:45 tdm kernel: md: disk3 read error (Errors)
Aug 30 07:27:45 tdm kernel: handle_stripe read error: 2166068568/3, count: 1 (Errors)
Aug 30 07:27:45 tdm kernel: md: disk3 read error (Errors)
Aug 30 07:27:45 tdm kernel: handle_stripe read error: 2166068576/3, count: 1 (Errors)
Aug 30 07:27:45 tdm kernel: ata4.00: exception Emask 0x10 SAct 0x0 SErr 0x280100 action 0x6 frozen (Errors)
Aug 30 07:27:45 tdm kernel: ata4.00: irq_stat 0x08000000, interface fatal error (Errors)
Aug 30 07:27:45 tdm kernel: ata4: SError: { UnrecovData 10B8B BadCRC } (Errors)

 

Obviously the parity check is treacle slow.

 

On the Disk 3 errors, there is no data on the drive at the moment.

 

Is it safe to restart the array to try to get some response back, or should I do something else?

 

Zipped syslog attached, at least the last 5k lines of it.

tdm_syslog_20110830.zip


Does it make sense to shut down and restart the server?  Or should I stop the parity check?


OK, it looks like the drive is dead.  It doesn't show up when the server POSTs.  They don't make drives like they used to ......


OK, it looks like the drive is dead.  It doesn't show up when the server POSTs.  They don't make drives like they used to ......

or the power/data cable to the drive is loose, or intermittent, or has failed.


Yeah, good point.  Opened the box, replugged the drive, turned it on and it's there again.

 

It's doing a Data-Rebuild at the moment so we'll see how that goes.

 

Are there any drive test scripts in unRaid that I can use to test the drive, or should a data rebuild be enough to flush out any errors ?


Well, the disk died on me again.  Syslog attached.  The errors start appearing at around 10:39.

 

Can anyone tell if it's a trashed drive that needs to be replaced or whether it's salvageable ?

tdm_drive3_failed.txt


It is most likely a dead drive. But you could try to pre-clear it 2 or 3 times and see how it does.


Luckily, while part of the array, it was empty.  I've RMA'd it and bought a new drive to use for emergencies.  It's going through preclear now.

 

But, on another thread Tom suggests it may not be the drive but cable/controller/PSU issues.  I'd much prefer if it was just a dud drive.  Mini-ITX towers crammed with disks are a bitch to work on.


As others have said, it is almost certainly either a cable or power or controller issue, and that drive is almost certainly completely fine.  One of your syslogs (tdm array starting.txt) included (among other error flags) the BadCRC flag, which immediately points to a bad SATA cable, which is probably the easiest fix possible.  But your other syslog (tdm drive3 failed.txt) does not include BadCRC, so the cable becomes only a minor suspect, and power issues become a bigger suspect.
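One way to compare two syslogs on this point is to tally the SError flags in each.  The following is only a rough Python sketch, assuming the syslog has been read into a list of text lines in the same format as the excerpts earlier in this thread; the function name `serror_flag_counts` is made up for illustration and is not part of unRAID.

```python
import re
from collections import Counter

# Matches kernel lines such as:
#   ata4: SError: { UnrecovData 10B8B BadCRC }
SERROR_RE = re.compile(r"\b(ata\d+): SError: \{([^}]*)\}")


def serror_flag_counts(lines):
    """Count how often each SError flag appears, per ATA port.

    Hypothetical helper: it only reads the flag names the kernel
    already printed, it does not decode the raw SErr register value.
    """
    counts = {}
    for line in lines:
        match = SERROR_RE.search(line)
        if match:
            port = match.group(1)
            flags = match.group(2).split()
            counts.setdefault(port, Counter()).update(flags)
    return counts


# Sample lines copied from the first post in this thread.
sample = [
    "Aug 30 07:28:11 tdm kernel: ata4: SError: { UnrecovData 10B8B BadCRC } (Errors)",
    "Aug 30 07:28:11 tdm kernel: ata4: hard resetting link (Minor Issues)",
    "Aug 30 07:27:45 tdm kernel: ata4: SError: { UnrecovData 10B8B BadCRC } (Errors)",
]

counts = serror_flag_counts(sample)
for port, flags in counts.items():
    print(port, dict(flags))
```

A log where BadCRC never shows up for the port would, by the reasoning above, make the cable a weaker suspect.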

 

Both syslogs include the following line:

 tdm kernel: ata4.00: disabled

 

Once the kernel has marked the drive as disabled, because it has given up trying to communicate with it, you can completely ignore ALL subsequent errors related to that drive!  In this case, that includes all error handling involving ata4, all errors related to disk3 or sdc, and all of the handle_stripe errors.  Any attempt by unRAID to access that drive is going to cause an error to be reported, because the kernel no longer considers that drive to be present.  Only a reboot will allow that drive to reappear.
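If you want to trim a long syslog down to the part worth reading, a small script can cut it at the first "disabled" line.  This is just a sketch under the assumption that the syslog is a list of text lines and that the marker looks like the `ata4.00: disabled` line quoted above; `split_at_disabled` is a hypothetical helper name, and the order of the sample lines is illustrative only.

```python
import re

# Matches the kernel's give-up marker, e.g. "ata4.00: disabled".
DISABLED_RE = re.compile(r"\b(ata\d+)\.\d+: disabled\b")


def split_at_disabled(lines):
    """Split syslog lines at the first 'ataN.NN: disabled' line.

    Returns (before, after, port).  'before' includes the disabled
    line itself; 'port' is None if no drive was ever disabled.
    """
    lines = list(lines)
    for index, line in enumerate(lines):
        match = DISABLED_RE.search(line)
        if match:
            return lines[:index + 1], lines[index + 1:], match.group(1)
    return lines, [], None


# Illustrative sample built from lines quoted in this thread.
sample = [
    "tdm kernel: ata4: SError: { UnrecovData 10B8B BadCRC }",
    "tdm kernel: ata4: hard resetting link",
    "tdm kernel: ata4.00: disabled",
    "tdm kernel: md: disk3 read error",
    "tdm kernel: handle_stripe read error: 2166068552/3, count: 1",
]

before, after, port = split_at_disabled(sample)
print(f"{port} was disabled; {len(before)} lines worth reading, "
      f"{len(after)} later lines")
```

Errors in the later part that mention the disabled port or its disk are the ones that can be skipped; unrelated messages after that point may still matter.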

 

The particular error flags you got are unfortunately hard to interpret; that is, they don't point to just one suspect.  In both syslogs, the drive was correctly identified and set up initially, with a SATA link speed of 3.0 Gbps.  But fairly quickly, communications to the drive degraded badly, quicker in one syslog than the other, and it did not take long at all before the kernel gave up even trying to reconnect.  All of the error flags look like interface issues to the drive; none look like drive issues.  Unfortunately, many perfectly good drives have been RMA'd because of errors due to interface issues (bad or loose cables/power/controller/drivers).

 

I would try another SATA cable (because that is the easiest thing to replace), then carefully check the power connections to the drive, especially any power splitter in the path.  Try connecting it to a different drive controller, with different cables, power and SATA.  If you are short of cables at hand, try swapping cables with another good drive, and see if the problems shift to that drive.  You don't have many drives, so it's hard to see how other power issues could apply, such as too many drives on a power rail, overloaded power supply, etc.


Thanks for that Rob.

 

I have a couple of SATA cables lying around so I will try swapping it out as a first step.  I am using the cables that came with the Mobo at the moment.

 

The PSU is an FSP300-60GHS-R 80plus jobby.  It didn't have enough SATA power plugs so I am using 2 x 1 to 3 splitters to get power to everything.  I can't avoid splitters unless I change PSU, something I really don't want to do unless I have to.  I might move the drive up/down the splitter to see if it's a dodgy plug.

 

I actually ran the system for about 3 days before the errors appeared so I assumed it had to be a decaying drive as opposed to cable/power.  Quick question, would the errors I experienced appear during a preclear ?  I am running it now on a new drive.


According to your signature line you have 5 disk drives.

 

According to this manual for your power supply, it is a 2 rail supply, with a 14 Amp 12V1 rail and a 16 Amp 12V2 rail.

http://www.fspgroupusa.com/fsp30060ghs80/p/606.html

 

According to the wiring diagram here:

http://www.fspgroupusa.com/download/specs/FSP300-60GHS_DIA.pdf

The 12V2 rail powers the P3 connector  (the 4 pin one used to power the CPU on the motherboard)  

 

The 24 pin motherboard connector and all the SATA and MOLEX connectors share the 12V1 rail rated for 14 Amps.  

 

5 disks, each with approx 3 ampere draw when spinning up, plus the current used by the motherboard, let's guess it is 3 amps, plus that used by your case fans, let's say 1 more amp....  15+3+1 = 19 Amps, on a supply rail rated for 14.  

 

Do you see an issue?   Do you think it might just have an effect?
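For anyone who wants to redo that estimate for their own box, here it is as a few lines of Python.  It is only a back-of-the-envelope sketch: the per-disk, motherboard and fan currents are the guesses from this post, not measured values, and `rail_load_amps` is a made-up helper name.

```python
def rail_load_amps(disks, amps_per_disk, motherboard_amps, fan_amps):
    """Estimate the worst-case 12V draw when all disks spin up at once."""
    return disks * amps_per_disk + motherboard_amps + fan_amps


# Rating of the 12V1 rail as stated above.
RAIL_RATING_AMPS = 14

# Guessed figures from the post: 5 disks at about 3 A each during
# spin-up, about 3 A for the motherboard, about 1 A for case fans.
load = rail_load_amps(disks=5, amps_per_disk=3, motherboard_amps=3, fan_amps=1)
overshoot = load - RAIL_RATING_AMPS

print(f"Estimated spin-up load: {load} A on a {RAIL_RATING_AMPS} A rail")
if overshoot > 0:
    print(f"That is {overshoot} A over the rail rating")
```

Plug in your own disk count and rail rating to see how much headroom you have.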

 

I strongly suggest you invest in a single 12V rail power supply with at least a 35 to 40 Amp capacity.

 

You are just going to be pulling your hair out otherwise.

 

Joe L.


FSP power supplies are generally lower cost, but good quality, so should normally be fine.  But only 300W?  That seems really iffy to me, and *may* be the source of your problems, especially since you are having to use power splitters on the same power rail.  If nothing else turns up, I would strongly consider upgrading it.

 

Errors could have occurred during a preclear, but if the drive was able to recover communications and continue, the only way you would know is if you had happened to check the syslog, or perhaps noticed a slow down.

 

Edit: Aah, I see Joe noticed that PSU too!


hmmmm, ok.  Time to get dirty with screwdrivers it seems.

 

I have a Seasonic X-400 Fanless in my desktop (overkill as I have an i5 and an SSD in it plugged into an H67 Mobo) so I'll swap them over and see how it goes.  I hope the Seasonic fanless can cope with the final resting place of the server, which I am hoping will be a well ventilated cupboard.

 

Thanks for the advice guys.  Much appreciated.


I replaced the cables and installed an FSP Aurum Gold 400W PSU inside the box and I haven't seen the errors reappear.

 

I just hope the RMA'd disk I got back that is labelled "certified repaired" is actually a disk that was sent in like mine, i.e., with nothing wrong with it.
