Lost 1 drive, havoc ensued



Doh, I did not copy the syslog when I did the parity check.  That was dumb.  But here's the smart report for the parity drive.  Bold entries are the interesting ones.  Reported_Uncorrect is up to 56 from 0 while Current_Pending_Sector and Offline_Uncorrectable went from 10 to 0.
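For anyone who wants to track this kind of before/after change, here is a minimal Python sketch. The raw values are the ones quoted in this post; the dictionary layout and the helper name `diff_raw` are assumptions made up for illustration and are not part of smartctl.

```python
# Raw SMART values quoted in the post, before and after the parity check.
# The dict layout and helper name are illustrative assumptions.
before = {"Reported_Uncorrect": 0, "Current_Pending_Sector": 10, "Offline_Uncorrectable": 10}
after = {"Reported_Uncorrect": 56, "Current_Pending_Sector": 0, "Offline_Uncorrectable": 0}


def diff_raw(old, new):
    """Return {attribute: change} for every attribute whose raw value changed."""
    return {name: new[name] - old[name] for name in old if new[name] != old[name]}


changes = diff_raw(before, after)
for name, delta in changes.items():
    print(f"{name}: {delta:+d}")
```

Saving a report before and after each parity check and comparing them this way makes it easier to see which counters are still moving.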

 

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   114   100   006    Pre-fail  Always       -       76607144
  3 Spin_Up_Time            0x0003   094   091   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       380
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   068   060   030    Pre-fail  Always       -       6830037
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       558
 10 Spin_Retry_Count        0x0013   100   099   097    Pre-fail  Always       -       26
 12 Power_Cycle_Count       0x0032   100   037   020    Old_age   Always       -       242
184 Unknown_Attribute       0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   044   044   000    Old_age   Always       -       56
188 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       0
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   074   051   045    Old_age   Always       -       26 (Lifetime Min/Max 24/26)
194 Temperature_Celsius     0x0022   026   049   000    Old_age   Always       -       26 (0 15 0 0)
195 Hardware_ECC_Recovered  0x001a   051   032   000    Old_age   Always       -       76607144
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
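
To pull the raw values out of a report like this one, a rough Python sketch follows. It assumes the whitespace-separated ten-column layout shown above. The choice of rows (the sector and CRC counters) and the function name `parse_attributes` are my own picks for illustration.

```python
# A few rows copied from the report above (whitespace-separated columns).
REPORT = """\
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   044   044   000    Old_age   Always       -       56
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
"""


def parse_attributes(text):
    """Map attribute ID -> (name, normalized value, threshold, raw value)."""
    attrs = {}
    for line in text.splitlines():
        fields = line.split()
        # A data row starts with a numeric ID and has at least 10 columns.
        if len(fields) >= 10 and fields[0].isdigit():
            attrs[int(fields[0])] = (fields[1], int(fields[3]), int(fields[5]), int(fields[9]))
    return attrs


attrs = parse_attributes(REPORT)
# Keep only the selected counters whose raw value is not zero.
nonzero = {name: raw for name, _, _, raw in attrs.values() if raw != 0}
print(nonzero)
```

Of the rows selected here, only Reported_Uncorrect has a nonzero raw value.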

 

Based on the SMART report, I am pretty sure that the data cable from your SATA port to your drive is bad.  If the drive were bad, all of those disk errors found during the parity check would have shown up as reallocated sectors.  But when the cable is not making a good connection, commands sent from the computer to the drive get garbled.  Often the commands appear invalid to the drive, and it logs errors in the SMART log (notice you have 56 such error logs).

 

Although I say it is most likely the cable, it could be other things in the signal path.  For example, if you are using a backplane or other type of disk docking mechanism, the problem could be inside of that.  In rare situations the physical port may be bad (I had a finicky port that needed a locking SATA cable to create a good, consistent connection).

 

I recommend a new data cable.  If you have a spare port I would suggest trying that as well.  If a new port and new data cable solves the problem, you at least know it isn't the drive.  Cables are cheap, just throw the suspect one out.  If the port is the problem, get a locking cable.

 

I do note that it seems odd that your pending sectors corrected themselves.  Normally pending bad sectors become reallocated sectors - but occasionally when they get tested the final time and work they are put back in service.  I can't explain why that happened, but would recommend keeping a close eye on the parity drive to see if bad sectors start cropping up on subsequent parity checks.

Link to comment

I would also suggest a SMART long test on the parity drive, see the bottom of the Obtaining a SMART report section.  First, hook it up using your best 80-wire flat IDE ribbon cable, no longer than 24 inches.

 

I'm pretty sure the parity disk is a SATA drive, not IDE.

 

The ATA Error Count of 56 does not account for the 961 errors reported by unRAID.  I really wish we had that syslog.

 

I noticed that too.  Bit of a math problem.  But since there is no indication of actual disk errors, I have to believe it is a bad cable connection.  Perhaps some of the read requests never registered with the drive?

Link to comment
I'm pretty sure the parity disk is a SATA drive, not IDE.

 

You are right, guess I was thinking of hda.  Disregard my cable comment.

 

But since there is no indication of actual disk errors, I have to believe it is a bad cable connection.  Perhaps some of the read requests never registered with the drive?

 

I don't know, it is not behaving normally at all.  I am not sure what those Reported_Uncorrect errors are; the docs aren't very helpful.  And I'm still somewhat uncomfortable concluding that this is a cable problem, because I have not seen a single one of the usual errors associated with cabling issues.  Perhaps they were in the syslog ...  These Uncorrectable errors and the DMA timeout errors are not really specific enough, or documented well enough, to indicate what the real problem is.  Another thing troubling me is that I don't see how any of this could be related to the crashes or freezes.  The DMA errors certainly could be, if they are actually happening on the system side, not the drive side.  If the DMA errors were occurring on the drives, or because of cabling issues, then I don't see how they can be related to the crashes AT ALL.

Link to comment


You may be right.  Normally those disk error logs are indicators of cabling issues, but each one seems to be failing on the same command ...

 

  27 00 00 00 00 00 e0 00      01:44:52.943  READ NATIVE MAX ADDRESS EXT
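
For reference, a line like that can be split into its register bytes with a short Python sketch. The column names are taken from the header that, as far as I recall, smartctl prints above these rows in its error log (CR FR SC SN CL CH DH DC, then Powered_Up_Time and Command/Feature_Name); treat that mapping and the function name `parse_command_line` as assumptions.

```python
# Register column names, assumed to match the smartctl error log header:
# CR FR SC SN CL CH DH DC  Powered_Up_Time  Command/Feature_Name
REGISTERS = ("CR", "FR", "SC", "SN", "CL", "CH", "DH", "DC")

# The line quoted above, copied from the drive's error log.
line = "27 00 00 00 00 00 e0 00      01:44:52.943  READ NATIVE MAX ADDRESS EXT"


def parse_command_line(text):
    """Split one error-log command row into registers, timestamp, and name."""
    parts = text.split()
    # The first eight fields are hexadecimal register bytes.
    regs = {name: int(value, 16) for name, value in zip(REGISTERS, parts[:8])}
    return {
        "registers": regs,
        "powered_up_time": parts[8],
        "command": " ".join(parts[9:]),
    }


entry = parse_command_line(line)
print(hex(entry["registers"]["CR"]), entry["command"])
```

Grouping the logged entries by the command register would show quickly whether every error really is on the same command.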

 

I don't know too much about DMA and whether that would mean a motherboard issue or a drive issue.

 

If it were me, I'd rule out the cabling problem first.  If these problems continued after trying new cables I'd start to dig deeper.  Capturing a syslog while doing a parity check would be a good place to start.

Link to comment

I recommend a new data cable.  If you have a spare port I would suggest trying that as well.  If a new port and new data cable solves the problem, you at least know it isn't the drive. 

 

I have new cables coming in: one short 10-inch cable and one 18-inch locking cable.  What would constitute "new data cable solves the problem"?  Would the Reported_Uncorrect stay the same or go back to 0?

 

Since we looked at so many different possible scenarios, I'm not sure what is actually at fault now.  I heard a drive power down and back up within a split second, saw possible sector errors that cleared up, and got sync and write errors on the parity drive.  If it's just the cable for the parity drive, then replacing that would mean happily ever after?  Going forward, what else should I look out for?

 

Another parity check just finished, 0 errors.  I'm attaching the syslog after I ran the second parity check.  Probably not useful, since there were no errors.

 

Currently running a long SMART test.

Link to comment

Actually, it seems the long SMART test already finished?  The report says status completed.  Did I read that right?  It only took minutes.

 

I ran: 

 

root@unRAID:/boot# smartctl -d ata -tlong /dev/sda

smartctl version 5.38 [i486-slackware-linux-gnu] Copyright © 2002-8 Bruce Allen

Home page is http://smartmontools.sourceforge.net/

 

=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===

Sending command: "Execute SMART Extended self-test routine immediately in off-line mode".

Drive command "Execute SMART Extended self-test routine immediately in off-line mode" successful.

Testing has begun.

Please wait 120 minutes for test to complete.

Test will complete after Thu May 21 23:15:39 2009

 

Use smartctl -X to abort test.

root@unRAID:/boot# smartctl  -a  -d  ata  /dev/sda > smart.txt

 

Link to comment

Your smart report contains this:

 

[pre]
Self-test execution status:      (  89) The previous self-test completed having
                                        the electrical element of the test
                                        failed.
[/pre]

Normally it would have taken 120 minutes or so to complete, as it printed when you invoked the long test.  Instead, it failed and aborted the test.
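
The number in parentheses can be decoded with a small Python sketch. It assumes the value is the raw self-test execution status byte printed in decimal, with the upper four bits holding the status code and the lower four bits holding the percentage of the test remaining in steps of 10, as the ATA SMART specification describes. The status wording is abbreviated and the function name `decode_self_test_status` is made up for illustration.

```python
# Status codes for the upper four bits of the self-test execution status byte
# (abbreviated wording; assumed to follow the ATA SMART specification).
STATUS = {
    0: "completed without error",
    1: "aborted by host",
    2: "interrupted by host reset",
    3: "fatal or unknown error, test could not complete",
    4: "completed, unknown element failed",
    5: "completed, electrical element failed",
    6: "completed, servo/seek element failed",
    7: "completed, read element failed",
    8: "completed, suspected handling damage",
    15: "in progress",
}


def decode_self_test_status(value):
    """Split the status byte into (status text, percent of test remaining)."""
    code = value >> 4                  # upper nibble: status code
    remaining = (value & 0x0F) * 10    # lower nibble: tenths of the test left
    return STATUS.get(code, "reserved"), remaining


print(decode_self_test_status(89))
```

Under those assumptions, 89 is 0x59: status code 5 (electrical element) with 90% of the test still remaining, which would fit a test that stopped within minutes instead of running the full 120.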

 

I don't have the experience to know whether this indicates a failure internal to the drive, or whether cabling or the power supply could still be the cause...

 

 

Link to comment

At last we have forced something serious to the surface.  This could easily explain many or all of the issues you have been having.  As Joe said though, we don't yet know *where* the problem is, but at least we have a good idea *what* the problem is.  Some possibilities:

* defective drive

* defective or loose power cable to the drive

* defective or loose power splitter or Molex-to-SATA adapter

* bad power supply

* power problems on the motherboard?

Link to comment

Or... it could be internal to the drive.

 

I found this quote here

The criteria for the short self-test are that it has one or more segments and completes in two minutes or less. The criteria for the extended self-test are that it has one or more segments and that the completion time is vendor specific. Any tests performed in the segments are vendor specific.

 

The following are examples of segments:

• An electrical segment wherein the logical unit tests its own electronics. The tests in this segment are vendor specific, but some examples of tests that may be included are: a buffer RAM test, a read/write circuitry test, and/or a test of the read/write head elements.

• A seek/servo segment wherein a device tests its capability to find and servo on data tracks.

• A read/verify scan segment wherein a device performs read scanning of some or all of the medium surface.

 

 

Joe L.

Link to comment

I did some searching on the electrical failure self-test error.  There are a number of hits where people have encountered this message, but I found no one that could explain the error.  The closest I found was this ...

 

The smart status Completed: electrical failure was received. This is currently not understood and therefore no operator alarm will have been raised. It may indicate problems but no concrete actions have yet been defined

 

Not exactly helpful, but it seems to question whether this situation is really a serious error condition or not.  I did see that this electrical error frequently occurred during a SHORT SMART test as opposed to a long test.  Perhaps the OP could try and see if a short test causes the problem on his disk, or if only long tests cause the problem.

 

I did notice that the smart reports I saw where this error was happening seemed to have quite a few other self-test problems.  I never read about anyone getting this to go away by changing PSU or cables.

 

If the cable is bad, replacing it may help with some of the disk errors and sync errors.  If signals between the computer and hard disk are being garbled, there is no telling what symptoms it might present.  But given this newest issue of the electrical failure during a self test (and a self test does NOT use the data cable), this is looking less likely to be the problem.  Even if a new cable seemed to solve the problem, if the self test is failing, I'd have a hard time trusting the disk.

 

If the PSU, motherboard, SATA port, or something else besides the drive is causing the problem, moving the parity disk to another PSU connector / port / cable that is known to work reliably with another drive, and rerunning the SMART tests, would seem to be a good thing to try.  If you can get it to reliably work in one setup, but fail in another, the drive starts to look okay and something else in the system problematic.  But if the problem follows the drive, the drive is looking like the culprit.
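
That isolation logic can be written down as a rough Python sketch. The setup names and test results below are hypothetical placeholders, not results from this system, and the "inconclusive" verdict for a problem that never reproduces is my own assumption.

```python
def likely_culprit(results):
    """results: list of (setup_name, self_test_passed) for the SAME drive
    tested in different known-good setups. Returns a rough verdict."""
    outcomes = {passed for _, passed in results}
    if outcomes == {False}:
        return "drive"          # the problem follows the drive everywhere
    if outcomes == {True, False}:
        return "setup"          # works in one setup, fails in another
    if outcomes == {True}:
        return "inconclusive"   # assumption: the problem was never reproduced
    return "no data"


# Hypothetical results, for illustration only.
example = [
    ("original port, cable and PSU connector", False),
    ("known-good port, cable and PSU connector", False),
]
print(likely_culprit(example))
```

The important part is that the second setup is one already known to work reliably with another drive, otherwise a failure there proves nothing.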

 

If I were to guess, though, I think the parity disk has got an obscure problem and needs to be replaced.  Although it is possible that the electrical failure message is PSU related, I see no evidence of that based on my searches.  Normally if a drive loses power, it crashes the machine and doesn't log anything.  And if drives logged (even during self tests) an electrical failure after losing power, I'd think someone would have figured that out by now.  I also think it would be reported as a "power failure" not an "electrical failure".  This is all opinion and conjecture, but my conclusion is that this electrical failure is likely some voltage that the drive is monitoring about itself and represents a failing capacitor or resistor in the electronics of the drive.  (UPDATE:  I just saw Joe L.'s post while writing this one.  This seems consistent with his findings).

 

I suggest trying to isolate the drive as described above as the next step.

Link to comment

It's still reporting the electrical fault.  I'm going to agree with Joe and Brian that it is probably an internal thing.  It is a little surprising that the SMART report indicates "PASSED", but this is a very rare condition, the first I think any of us has ever seen.  I don't think it got to the point of updating the SMART health status before reporting the electrical failure.

 

I would start looking into replacing the drive, start the RMA process.

Link to comment

I do have another 500GB that was just RMAed to me from Seagate.  The previous one had the dreaded firmware bug that I did not take care of before it was too late.  It's a refurb though, so I was hesitant to put it in.  Looks like I have no choice now.

Link to comment
