Lost 1 drive, havoc ensued

May 22, 2009

Author

more smart report

May 22, 2009

Doh, I did not copy the syslog when I did the parity check. That was dumb. But here's the SMART report for the parity drive. Bold entries are the interesting ones. Reported_Uncorrect is up to 56 from 0, while Current_Pending_Sector and Offline_Uncorrectable went from 10 to 0.

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   114   100   006    Pre-fail  Always       -       76607144
  3 Spin_Up_Time            0x0003   094   091   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       380
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   068   060   030    Pre-fail  Always       -       6830037
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       558
 10 Spin_Retry_Count        0x0013   100   099   097    Pre-fail  Always       -       26
 12 Power_Cycle_Count       0x0032   100   037   020    Old_age   Always       -       242
184 Unknown_Attribute       0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   044   044   000    Old_age   Always       -       56
188 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       0
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   074   051   045    Old_age   Always       -       26 (Lifetime Min/Max 24/26)
194 Temperature_Celsius     0x0022   026   049   000    Old_age   Always       -       26 (0 15 0 0)
195 Hardware_ECC_Recovered  0x001a   051   032   000    Old_age   Always       -       76607144
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0

Based on the SMART reports, I am pretty sure that the data cable from your SATA port to your drive is bad. If the drive were bad, all of those disk errors found during the parity check would have resulted in reallocated sectors. But when the cable is not making a good connection, commands sent from the computer to the drive get garbled. Often the commands appear invalid to the drive, and it logs errors in the SMART log (notice you have 56 such error log entries).

Although I say it is most likely the cable, it could be other things in the signal path. For example, if you are using a backplane or other type of disk docking mechanism, the problem could be inside of that. In rare situations the physical port may be bad (I had a finicky port that needed a locking SATA cable to make a good, consistent connection).

I recommend a new data cable. If you have a spare port I would suggest trying that as well. If a new port and new data cable solve the problem, you at least know it isn't the drive. Cables are cheap, so just throw the suspect one out. If the port is the problem, get a locking cable.

I do note that it seems odd that your pending sectors corrected themselves. Normally pending bad sectors become reallocated sectors - but occasionally when they get tested the final time and work they are put back in service. I can't explain why that happened, but would recommend keeping a close eye on the parity drive to see if bad sectors start cropping up on subsequent parity checks.
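For keeping an eye on the drive between parity checks, the comparison can be scripted. What follows is only a rough Python sketch: the function names `parse_raw_values` and `raw_deltas` are made up for illustration, the rows are copied from the report above, and the "earlier" numbers are just the three values mentioned in the post. As far as I know the lifetime counters do not reset, so the thing to watch is whether a raw value keeps climbing from one check to the next.

```python
import re

# Rows copied from the SMART report above (only the attributes under discussion).
REPORT = """\
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   044   044   000    Old_age   Always       -       56
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
"""

# ID, NAME, FLAG, VALUE, WORST, THRESH, TYPE, UPDATED, WHEN_FAILED, then the
# leading integer of RAW_VALUE (anything after it, e.g. "(Lifetime ...)", is ignored).
ROW = re.compile(
    r"^\s*(\d+)\s+(\S+)\s+0x[0-9a-fA-F]{4}\s+\d+\s+\d+\s+\d+\s+\S+\s+\S+\s+\S+\s+(\d+)"
)


def parse_raw_values(text):
    """Return {attribute_name: raw_value} for every attribute row found in text."""
    values = {}
    for line in text.splitlines():
        match = ROW.match(line)
        if match:
            values[match.group(2)] = int(match.group(3))
    return values


def raw_deltas(before, after):
    """Return {name: after - before} for attributes present in both reports."""
    return {name: after[name] - before[name] for name in before if name in after}


# Earlier values as stated in the post; only these three were given.
earlier = {
    "Reported_Uncorrect": 0,
    "Current_Pending_Sector": 10,
    "Offline_Uncorrectable": 10,
}
current = parse_raw_values(REPORT)

for name, delta in raw_deltas(earlier, current).items():
    print(f"{name}: {delta:+d}")
```

Saving the output of each report to a file and feeding two of them through `parse_raw_values` would give the same comparison without copying rows by hand.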

May 22, 2009

I would also suggest a SMART long test on the parity drive; see the bottom of the Obtaining a SMART report section. First, hook it up using your best 80-wire flat IDE ribbon cable, no more than 24 inches long.

The ATA Error Count of 56 does not account for the 961 errors reported by unRAID. I really wish we had that syslog.

May 22, 2009

I would also suggest a SMART long test on the parity drive; see the bottom of the Obtaining a SMART report section. First, hook it up using your best 80-wire flat IDE ribbon cable, no more than 24 inches long.

I'm pretty sure the parity disk is a SATA drive, not IDE.

The ATA Error Count of 56 does not account for the 961 errors reported by unRAID. I really wish we had that syslog.

I noticed that too. Bit of a math problem. But since there is no indication of actual disk errors, I have to believe it is a bad cable connection. Perhaps some of the read requests never registered with the drive?

May 22, 2009

I'm pretty sure the parity disk is a SATA drive, not IDE.

You are right, guess I was thinking of hda. Disregard my cable comment.

But since there is no indication of actual disk errors, I have to believe it is a bad cable connection. Perhaps some of the read requests never registered with the drive?

I don't know, it is not behaving normally at all. I am not sure what those Reported_Uncorrect errors are; the docs aren't very helpful. And I'm still somewhat uncomfortable concluding that this is a cable problem, because I have not seen a single one of the usual errors associated with cabling issues. Perhaps they were in the syslog ... These Uncorrectable errors and the DMA timeout errors are not really specific enough, or documented well enough, to indicate what the real problem is. Another thing that is troubling me is that I don't see how any of this could be related to the crashes or freezes. The DMA errors certainly could be, if they are actually happening on the system side, not the drive side. If the DMA errors were occurring on the drives, or because of cabling issues, then I don't see how they can be related to the crashes AT ALL.

May 22, 2009

I don't know, it is not behaving normally at all. I am not sure what those Reported_Uncorrect errors are; the docs aren't very helpful. And I'm still somewhat uncomfortable concluding that this is a cable problem, because I have not seen a single one of the usual errors associated with cabling issues. Perhaps they were in the syslog ... These Uncorrectable errors and the DMA timeout errors are not really specific enough, or documented well enough, to indicate what the real problem is. Another thing that is troubling me is that I don't see how any of this could be related to the crashes or freezes. The DMA errors certainly could be, if they are actually happening on the system side, not the drive side. If the DMA errors were occurring on the drives, or because of cabling issues, then I don't see how they can be related to the crashes AT ALL.

You may be right. Normally those disk error logs are indicators of cabling issues, but each one seems to be failing on the same command ...

27 00 00 00 00 00 e0 00 01:44:52.943 READ NATIVE MAX ADDRESS EXT

I don't know too much about DMA, or whether that would mean a motherboard issue or a drive issue.

If it were me, I'd rule out the cabling problem first. If these problems continued after trying new cables I'd start to dig deeper. Capturing a syslog while doing a parity check would be a good place to start.

May 22, 2009

Author

I recommend a new data cable. If you have a spare port I would suggest trying that as well. If a new port and new data cable solve the problem, you at least know it isn't the drive.

I have new cables coming in: one short 10-inch cable and one 18-inch locking type. What would count as "new data cable solves the problem"? Would Reported_Uncorrect stay the same or go back to 0?

Since we looked at so many different possible scenarios, I'm not sure what is actually at fault now. I heard a drive power down and back up within a split second, there were possible sector errors that cleared up, and there were sync and write errors on the parity drive. If it's just the cable for the parity drive, would replacing it mean happily ever after? Going forward, what else should I look out for?

Another parity check just finished, 0 errors. I'm attaching the syslog after I ran the second parity check. Probably not useful, since there were no errors.

Currently running a long SMART test.

May 22, 2009

Author

Actually, it seems the long SMART test already finished? The report says the status is completed. Did I read that right? It only took minutes.

I ran:

root@unRAID:/boot# smartctl -d ata -tlong /dev/sda
Home page is http://smartmontools.sourceforge.net/

=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: "Execute SMART Extended self-test routine immediately in off-line mode".
Drive command "Execute SMART Extended self-test routine immediately in off-line mode" successful.
Testing has begun.
Please wait 120 minutes for test to complete.
Test will complete after Thu May 21 23:15:39 2009

Use smartctl -X to abort test.

root@unRAID:/boot# smartctl -a -d ata /dev/sda > smart.txt

May 22, 2009

Your smart report contains this:

[pre]
Self-test execution status:      (  89) The previous self-test completed having
                                        the electrical element of the test
                                        failed.
[/pre]

Normally it would have taken 120 minutes or so to complete as it printed when you invoked the long test.

Instead, it failed and aborted the test.

I don't have enough experience to know whether this indicates a failure internal to the drive, or whether cabling or the power supply could still be the cause...
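The number in parentheses can be decoded, which may help explain why the test ended within minutes. As I understand the ATA SMART data layout, the high four bits of that status byte are the result code and the low four bits are the tenths of the test still remaining. A minimal Python sketch follows. The function name `decode_self_test_status` is made up, and the status descriptions are my paraphrase of the standard's wording, so treat them as approximate.

```python
# Result codes carried in the high nibble of the self-test execution status byte.
# Descriptions are paraphrased, not quoted; codes not listed here are reserved.
STATUS = {
    0: "completed without error (or never run)",
    1: "aborted by host",
    2: "interrupted by host reset",
    3: "fatal or unknown error, test could not complete",
    4: "failed, element unknown",
    5: "electrical element failed",
    6: "servo/seek element failed",
    7: "read element failed",
    8: "failed, suspected handling damage",
    15: "in progress",
}


def decode_self_test_status(byte):
    """Split the status byte into (code, description, percent_remaining)."""
    code = byte >> 4                   # high nibble: result code
    remaining = (byte & 0x0F) * 10     # low nibble: tenths of the test remaining
    return code, STATUS.get(code, "reserved"), remaining


print(decode_self_test_status(89))
# -> (5, 'electrical element failed', 90)
```

If that reading is right, a value of 89 means the test stopped in the electrical segment with about 90% of the test still to run, which would fit a long test that finished in minutes instead of 120.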

May 22, 2009

At last we have forced something serious to the surface. This could easily explain many or all of the issues you have been having. As Joe said though, we don't yet know *where* the problem is, but at least we have a good idea *what* the problem is. Some possibilities:

* defective drive

* defective or loose power cable to the drive

* defective or loose power splitter or Molex-to-SATA adapter

* bad power supply

* power problems on the motherboard?

May 22, 2009

At last we have forced something serious to the surface. This could easily explain many or all of the issues you have been having. As Joe said though, we don't yet know *where* the problem is, but at least we have a good idea *what* the problem is. Some possibilities:

* defective drive

* defective or loose power cable to the drive

* defective or loose power splitter or Molex-to-SATA adapter

* bad power supply

* power problems on the motherboard?

Or... it could be internal to the drive.

I found this quote here

The criteria for the short self-test are that it has one or more segments and completes in two minutes or less. The criteria for the extended self-test are that it has one or more segments and that the completion time is vendor specific. Any tests performed in the segments are vendor specific.

The following are examples of segments:

• An electrical segment wherein the logical unit tests its own electronics. The tests in this segment are vendor specific, but some examples of tests that may be included are: a buffer RAM test, a read/write circuitry test, and/or a test of the read/write head elements.

• A seek/servo segment wherein a device tests its capability to find and servo on data tracks.

• A read/verify scan segment wherein a device performs read scanning of some or all of the medium surface.

Joe L.

May 22, 2009

I did some searching on the electrical failure self-test error. There are a number of hits where people have encountered this message, but I found no one that could explain the error. The closest I found was this ...

The smart status Completed: electrical failure was received. This is currently not understood and therefore no operator alarm will have been raised. It may indicate problems but no concrete actions have yet been defined

Not exactly helpful, but it seems to question whether this situation is really a serious error condition or not. I did see that this electrical error frequently occurred during a SHORT SMART test as opposed to a long test. Perhaps the OP could try and see if a short test causes the problem on his disk, or if only long tests cause the problem.

I did notice that the smart reports I saw where this error was happening seemed to have quite a few other self-test problems. I never read about anyone getting this to go away by changing PSU or cables.

If the cable is bad, replacing it may help with some of the disk errors and sync errors. If signals between the computer and hard disk are being garbled, there is no telling what symptoms it might present. But given this newest issue of the electrical failure during a self test (and a self test does NOT use the data cable), this is looking less likely to be the problem. Even if a new cable seemed to solve the problem, if the self test is failing, I'd have a hard time trusting the disk.

If the PSU, motherboard, SATA port, or something else besides the drive is causing the problem, moving the parity disk to another PSU connector / port / cable that is known to work reliably with another drive, and rerunning the SMART tests, would seem to be a good thing to try. If you can get it to reliably work in one setup, but fail in another, the drive starts to look okay and something else in the system problematic. But if the problem follows the drive, the drive is looking like the culprit.

If I were to guess, though, I think the parity disk has got an obscure problem and needs to be replaced. Although it is possible that the electrical failure message is PSU related, I see no evidence of that based on my searches. Normally if a drive loses power, it crashes the machine and doesn't log anything. And if drives logged (even during self tests) an electrical failure after losing power, I'd think someone would have figured that out by now. I also think it would be reported as a "power failure" not an "electrical failure". This is all opinion and conjecture, but my conclusion is that this electrical failure is likely some voltage that the drive is monitoring about itself and represents a failing capacitor or resistor in the electronics of the drive. (UPDATE: I just saw Joe L.'s post while writing this one. This seems consistent with his findings).

I suggest trying to isolate the drive as described above as the next step.

May 24, 2009

Author

I put in a new locking cable, and the power is coming off a Y splitter that works on another drive, so I'm ruling out the power adapter. After replacing the cable, I ran another long SMART test, and it ended really early, just like before.

May 24, 2009

It's still reporting the electrical fault. I'm going to agree with Joe and Brian that it is probably an internal thing. It is a little surprising that the SMART report indicates "PASSED", but this is a very rare condition, the first I think any of us has ever seen. I don't think it got to the point of updating the SMART health status before reporting the electrical failure.

I would start looking into replacing the drive and begin the RMA process.
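The isolation reasoning suggested earlier in the thread (same drive, different cable/port/power, see whether the failure follows the drive) can be written down as a tiny Python sketch. The function name `verdict` and the wording of the results are made up for illustration; the two trials are the ones reported in this thread. It is only a rule of thumb, not a diagnosis.

```python
def verdict(trials):
    """Summarize self-test results for the SAME drive across different setups.

    trials: list of (setup_description, self_test_passed) tuples.
    """
    results = [passed for _, passed in trials]
    if not results:
        return "no data"
    if all(results):
        return "no fault reproduced"
    if not any(results):
        return "problem follows the drive: suspect the drive"
    return "passes in some setups: suspect cable, port, or power in the failing ones"


# The two long self-test attempts reported in this thread, both ending early
# with the electrical failure status.
trials = [
    ("original cable", False),
    ("new locking cable, power from a Y that works on another drive", False),
]
print(verdict(trials))
# -> problem follows the drive: suspect the drive
```

Trying a different SATA port or a different machine would add more rows and make the conclusion firmer.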

May 24, 2009

Author

I do have another 500GB that was just RMAed to me from Seagate. The previous one had the dreaded firmware bug that I did not take care of before it was too late. It's a refurb though, so I was hesitant to put it in. Looks like I have no choice now.
