hardware issue?


Posted

Please help with this. For now I've posted only the section of my log that I'm worried about (every now and again I get a bunch of errors like these); the entire syslog is attached.

 

Parity checks are all clean.

 

Dec 22 00:13:37 RCNAS kernel: ata11.01: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0 (Errors)

Dec 22 00:13:37 RCNAS kernel: ata11.01: BMDMA stat 0x64 (Drive related)

Dec 22 00:13:37 RCNAS kernel: ata11.01: failed command: READ DMA EXT (Minor Issues)

Dec 22 00:13:37 RCNAS kernel: ata11.01: cmd 25/00:00:17:ef:d2/00:02:2b:00:00/f0 tag 0 dma 262144 in (Drive related)

Dec 22 00:13:37 RCNAS kernel:          res 51/40:00:2d:ef:d2/40:00:2b:00:00/f0 Emask 0x9 (media error) (Errors)

Dec 22 00:13:37 RCNAS kernel: ata11.01: status: { DRDY ERR } (Drive related)

Dec 22 00:13:37 RCNAS kernel: ata11.01: error: { UNC } (Errors)

Dec 22 00:13:38 RCNAS kernel: ata11.00: configured for UDMA/133 (Drive related)

Dec 22 00:13:38 RCNAS kernel: ata11.01: configured for UDMA/133 (Drive related)

and on and on... see the attachment for the full syslog. (Note: the missing lines in the syslog are just the mover script logs; you don't need to see the types of files I keep, do you?)
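For anyone reading along, the `cmd` and `res` lines can be decoded by hand. The sketch below assumes the field order printed by the Linux libata error handler (command or status first, then the low-order and high-order taskfile bytes, then the device register). The function names are made up for illustration. It pulls out the starting LBA of the request and the LBA at which the drive reported the error.

```python
import re

# ASSUMPTION: field order of libata "cmd"/"res" lines is
# first/second:nsect:lbal:lbam:lbah/hob_feature:hob_nsect:hob_lbal:hob_lbam:hob_lbah/device
TASKFILE_RE = re.compile(
    r"\b(cmd|res)\s+([0-9a-f]{2})/((?:[0-9a-f]{2}:){4}[0-9a-f]{2})"
    r"/((?:[0-9a-f]{2}:){4}[0-9a-f]{2})/([0-9a-f]{2})",
    re.IGNORECASE,
)

# Standard ATA status and error register bits (subset).
STATUS_BITS = {0x80: "BSY", 0x40: "DRDY", 0x20: "DF", 0x08: "DRQ", 0x01: "ERR"}
ERROR_BITS = {0x80: "ICRC", 0x40: "UNC", 0x10: "IDNF", 0x04: "ABRT"}


def decode_taskfile(line):
    """Return the decoded fields of a cmd/res line, or None if it does not match."""
    m = TASKFILE_RE.search(line)
    if m is None:
        return None
    kind, first, low, high, _device = m.groups()
    second, nsect, lbal, lbam, lbah = (int(x, 16) for x in low.split(":"))
    _hob_feat, hob_nsect, hob_lbal, hob_lbam, hob_lbah = (
        int(x, 16) for x in high.split(":")
    )
    # 48-bit LBA: high-order bytes come from the second group.
    lba = (
        (hob_lbah << 40) | (hob_lbam << 32) | (hob_lbal << 24)
        | (lbah << 16) | (lbam << 8) | lbal
    )
    return {
        "kind": kind,
        "first": int(first, 16),   # command opcode (cmd) or status register (res)
        "second": second,          # feature (cmd) or error register (res)
        "sectors": (hob_nsect << 8) | nsect,
        "lba": lba,
    }


def flag_names(value, table):
    """List the names of the bits set in a register value."""
    return [name for bit, name in table.items() if value & bit]


cmd = decode_taskfile("ata11.01: cmd 25/00:00:17:ef:d2/00:02:2b:00:00/f0 tag 0 dma 262144 in")
res = decode_taskfile("res 51/40:00:2d:ef:d2/40:00:2b:00:00/f0 Emask 0x9 (media error)")
print(cmd["lba"], cmd["sectors"] * 512)
print(res["lba"], flag_names(res["first"], STATUS_BITS), flag_names(res["second"], ERROR_BITS))
```

For the two lines above this gives a 512-sector (262144-byte) read starting at LBA 735244055, with the uncorrectable error reported at LBA 735244077, 22 sectors into the request.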

 

I believe that ata11.01 is the cache drive.

 

Should I be worried?  The cache drive is on the motherboard controller.

Disk devices

parity device: pci-0000:00:1f.2-scsi-1:0:1:0 host12 (sdj) WDC_WD20EARS-00MVWB0_WD-WMAZA3407269

disk1 device: pci-0000:01:00.0-scsi-1:0:0:0 host1 (sdb) WDC_WD10EACS-00D6B0_WD-WCAU40384147

disk2 device: pci-0000:00:1f.2-scsi-0:0:0:0 host11 (sdh) WDC_WD10EAVS-00D7B1_WD-WCAU46190122

disk3 device: pci-0000:01:00.0-scsi-2:0:0:0 host2 (sdc) WDC_WD10EADS-00L5B1_WD-WCAU46192923

disk4 device: pci-0000:01:00.0-scsi-3:0:0:0 host3 (sdd) WDC_WD10EADS-00M2B0_WD-WMAV50454466

disk5 device: pci-0000:02:00.0-scsi-1:0:0:0 host6 (sde) WDC_WD10EADS-00M2B0_WD-WMAV50297857

disk6 device: pci-0000:04:02.0-scsi-3:0:0:0 host10 (sdg) WDC_WD15EARS-00MVWB0_WD-WCAZA2550600

disk7 device: pci-0000:01:00.0-scsi-0:0:0:0 host0 (sda) WDC_WD20EARS-00MVWB0_WD-WMAZA3269017

disk8 device: unassigned

disk9 device: unassigned

disk10 device: unassigned

disk11 device: unassigned

disk12 device: unassigned

disk13 device: unassigned

disk14 device: unassigned

disk15 device: unassigned

disk16 device: unassigned

disk17 device: unassigned

disk18 device: unassigned

disk19 device: unassigned

disk20 device: unassigned

cache device: pci-0000:00:1f.2-scsi-0:0:1:0 host11 (sdi) WDC_WD1001FALS-00J7B0_WD-WMATV0910106
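One way to check which disk `ata11.01` is: a minimal sketch that searches the device listing above, assuming (and it is only an assumption) that the `ataN` port number matches the `hostN` number in this listing and that the digits after the dot match the third number in the `scsi-…` path. The helper name `find_device` is hypothetical.

```python
import re

# Matches assigned lines of the device listing; "unassigned" lines are skipped.
DEVICE_RE = re.compile(
    r"^(?P<slot>\S+) device: \S+-scsi-\d+:\d+:(?P<target>\d+):\d+ "
    r"host(?P<host>\d+) \((?P<dev>\w+)\) (?P<model>\S+)$"
)


def find_device(listing, ata_name):
    """Return (slot, device, model) for an ata name such as 'ata11.01', or None."""
    m = re.fullmatch(r"ata(\d+)\.(\d+)", ata_name)
    if m is None:
        raise ValueError("expected a name like ata11.01")
    port, devno = int(m.group(1)), int(m.group(2))
    for line in listing.splitlines():
        d = DEVICE_RE.match(line.strip())
        # ASSUMPTION: ata port number == host number, device number == SCSI target.
        if d and int(d["host"]) == port and int(d["target"]) == devno:
            return d["slot"], d["dev"], d["model"]
    return None


listing = """\
parity device: pci-0000:00:1f.2-scsi-1:0:1:0 host12 (sdj) WDC_WD20EARS-00MVWB0_WD-WMAZA3407269
disk2 device: pci-0000:00:1f.2-scsi-0:0:0:0 host11 (sdh) WDC_WD10EAVS-00D7B1_WD-WCAU46190122
disk8 device: unassigned
cache device: pci-0000:00:1f.2-scsi-0:0:1:0 host11 (sdi) WDC_WD1001FALS-00J7B0_WD-WMATV0910106
"""
match = find_device(listing, "ata11.01")
print(match)
```

Under those assumptions the lookup lands on the cache drive (sdi), and `ata11.00` would be disk2 (sdh) on the same port.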

 

Thanks for your help!

syslog-2011-12-26.zip

Posted

Thought I'd add the SMART report for the cache drive (the one with the possible issue, I assume?):

SMART status Info for /dev/sdi

 

smartctl version 5.38 [i486-slackware-linux-gnu] Copyright © 2002-8 Bruce Allen

Home page is http://smartmontools.sourceforge.net/

 

=== START OF INFORMATION SECTION ===

Device Model:    WDC WD1001FALS-00J7B0

Serial Number:    WD-WMATV0910106

Firmware Version: 05.00K05

User Capacity:    1,000,204,886,016 bytes

Device is:        Not in smartctl database [for details use: -P showall]

ATA Version is:  8

ATA Standard is:  Exact ATA specification draft version not indicated

Local Time is:    Mon Dec 26 16:09:08 2011 EST

SMART support is: Available - device has SMART capability.

SMART support is: Enabled

 

=== START OF READ SMART DATA SECTION ===

SMART overall-health self-assessment test result: PASSED

 

General SMART Values:

Offline data collection status:  (0x82) Offline data collection activity

was completed without error.

Auto Offline Data Collection: Enabled.

Self-test execution status:      (  0) The previous self-test routine completed

without error or no self-test has ever

been run.

Total time to complete Offline

data collection: (19200) seconds.

Offline data collection

capabilities: (0x7b) SMART execute Offline immediate.

Auto Offline data collection on/off support.

Suspend Offline collection upon new

command.

Offline surface scan supported.

Self-test supported.

Conveyance Self-test supported.

Selective Self-test supported.

SMART capabilities:            (0x0003) Saves SMART data before entering

power-saving mode.

Supports SMART auto save timer.

Error logging capability:        (0x01) Error logging supported.

General Purpose Logging supported.

Short self-test routine

recommended polling time: (  2) minutes.

Extended self-test routine

recommended polling time: ( 221) minutes.

Conveyance self-test routine

recommended polling time: (  5) minutes.

SCT capabilities:       (0x303f) SCT Status supported.

SCT Feature Control supported.

SCT Data Table supported.

 

SMART Attributes Data Structure revision number: 16

Vendor Specific SMART Attributes with Thresholds:

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       65
  3 Spin_Up_Time            0x0027   236   232   021    Pre-fail  Always       -       8200
  4 Start_Stop_Count        0x0032   097   097   000    Old_age   Always       -       3146
  5 Reallocated_Sector_Ct   0x0033   199   199   140    Pre-fail  Always       -       3
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   068   068   000    Old_age   Always       -       23866
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       77
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       13
193 Load_Cycle_Count        0x0032   199   199   000    Old_age   Always       -       3146
194 Temperature_Celsius     0x0022   117   109   000    Old_age   Always       -       33
196 Reallocated_Event_Count 0x0032   197   197   000    Old_age   Always       -       3
197 Current_Pending_Sector  0x0032   195   195   000    Old_age   Always       -       852
198 Offline_Uncorrectable   0x0030   200   197   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   174   000    Old_age   Offline      -       0

 

SMART Error Log Version: 1

No Errors Logged

 

SMART Self-test log structure revision number 1

Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error

# 1  Short offline      Completed without error      00%    16746        -

 

SMART Selective self-test log data structure revision number 1

SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS

    1        0        0  Not_testing

    2        0        0  Not_testing

    3        0        0  Not_testing

    4        0        0  Not_testing

    5        0        0  Not_testing

Selective self-test flags (0x0):

  After scanning selected spans, do NOT read-scan remainder of disk.

If Selective self-test is pending on power-up, resume after 0 minute delay.
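If you want to pick the worrying numbers out of a report like this without reading every row, here is a rough sketch. It assumes the ten-column attribute layout shown above. The watch list (IDs 5, 196, 197, 198) is a personal choice of sector-health counters, not an official threshold.

```python
# ASSUMPTION: attribute rows have ten whitespace-separated columns, as in the report above.
WATCHED_IDS = {5, 196, 197, 198}  # reallocated, reallocation events, pending, offline uncorrectable


def parse_smart_attributes(report):
    """Parse the attribute table of a smartctl report into {id: fields}."""
    attrs = {}
    for line in report.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID followed by a name and a hex flag.
        if len(fields) >= 10 and fields[0].isdigit() and fields[2].startswith("0x"):
            attrs[int(fields[0])] = {
                "name": fields[1],
                "value": int(fields[3]),
                "worst": int(fields[4]),
                "thresh": int(fields[5]),
                "raw": int(fields[9]),
            }
    return attrs


def sector_warnings(attrs):
    """Return (id, name, raw) for watched attributes with a non-zero raw value."""
    return [
        (attr_id, a["name"], a["raw"])
        for attr_id, a in sorted(attrs.items())
        if attr_id in WATCHED_IDS and a["raw"] > 0
    ]


report = """\
ID# ATTRIBUTE_NAME          FLAG    VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct  0x0033  199  199  140    Pre-fail  Always      -      3
197 Current_Pending_Sector  0x0032  195  195  000    Old_age  Always      -      852
198 Offline_Uncorrectable  0x0030  200  197  000    Old_age  Offline      -      0
199 UDMA_CRC_Error_Count    0x0032  200  200  000    Old_age  Always      -      0
"""
warnings = sector_warnings(parse_smart_attributes(report))
for attr_id, name, raw in warnings:
    print(attr_id, name, raw)
```

On the rows copied from this report it flags attribute 5 (raw 3) and attribute 197 (raw 852), and ignores the zero counters.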

 

Posted

Your drive is dying:

 

197 Current_Pending_Sector  0x0032  195  195  000    Old_age  Always      -      852

There are 852 unreadable sectors, pending re-allocation when next written to.

 

Time to RMA it.  (those are the "media errors" in your first post)

Posted

Forget my last post... it's out of warranty (I don't have the receipt, so they use the manufacture date).

I see Joe's rep on here is immaculate, so he isn't likely to be disagreed with ;) so off to the store I go :)

 

Thanks again!

Posted

One more question Joe (or anyone) please...

 

So I replaced the drive with a spare I had so that I could RMA it (it turned out to be under warranty after all). Before sending it in, I decided to run a pre-clear on it to see how many of the pending sectors would switch to reallocated.

Zero did.

 

After the pre-clear it said:

0 sectors are pending re-allocation at the end of the preclear, a change of -852 in the number of sectors pending re-allocation.

 

Would this indicate the drive is actually ok?

 

What should I tell WD if I RMA it?

 

Thanks!

Posted

Were they re-allocated?  Or, were they successfully re-written in place?

 

If re-written in place, then I would suspect the drive OR the power supply.  If simply re-allocated, then yes, RMA it.

 

You need to look now at a current SMART report for that drive.

Posted

Just sorta jumping in here (following this thread for educational purposes), but isn't this something the pre-clear script should pick up?  As in seeing that the reallocated sector count has now gone up?  Or is the logic too difficult to script, thus requiring a human to look at it?  In which case, it might be a good idea to tell the user, "something changed, here are the possibilities, go check the SMART report."  Or something to clue them in?

Posted

It would have... but marcusone elected to post only one line from the final report, not the entire report.

 

Therefore, we cannot tell, as our psychic skills are a bit rusty this late in the year. ;)

 

I really have no way to tell how a manufacturer reacts when a specific drive is returned.  I've seen people return a drive with only a few re-allocated sectors.  I honestly doubt the manufacturers have the time to verify the returned drives when in an RMA process.  They would just rather you not return a working drive.

 

If in doubt, RMA the drive, especially since it had over 800 sectors that were apparently either re-allocated because they could not be read, or re-written in place after being unreadable as first written.  (800 sectors would probably not cause a SMART failure, as most drives have several thousand spare sectors, but it is a strong clue that more sectors will fail early in the drive's life.)

Posted

Fair enough, I just figured if there had been a blinking, flashing, screaming, bolded, airplane-towed banner in the report he would have included it. As such I assumed it was either not there, or just slightly more subtle ;)

Posted

So how do I determine if it's the power supply or the hard drive?

 

I'm using the same power supply as the LimeTech built rigs have. "Corsair CMPSU-650TX 650W ATX12V / EPS12V" which I put in not even a year ago.

Posted

These lines summed it up:

No SMART attributes are FAILING_NOW

852 sectors were pending re-allocation before the start of the preclear.

852 sectors were pending re-allocation after pre-read in cycle 1 of 1.

0 sectors were pending re-allocation after zero of disk in cycle 1 of 1.

0 sectors are pending re-allocation at the end of the preclear,

    a change of -852 in the number of sectors pending re-allocation.

3 sectors had been re-allocated before the start of the preclear.

3 sectors are re-allocated at the end of the preclear,

    the number of sectors re-allocated did not change.

 

So, every sector that could not be read and was pending re-allocation was readable once re-written in place.
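The reasoning can be written down as a small check on the before and after counters. This is only a sketch of the logic described here, not what the preclear script itself does; the function name and the wording of the results are made up.

```python
def classify_pending_outcome(pending_before, pending_after, realloc_before, realloc_after):
    """Say what happened to pending sectors, given SMART counters before and after a full write."""
    cleared = pending_before - pending_after
    remapped = realloc_after - realloc_before
    if cleared <= 0:
        return "no pending sectors were cleared"
    if remapped == 0:
        # Pending count dropped but nothing was remapped to spare sectors.
        return "re-written in place"
    if remapped >= cleared:
        return "re-allocated"
    return "mixed: some re-allocated, some re-written in place"


# Counters from the preclear summary quoted above.
outcome = classify_pending_outcome(
    pending_before=852, pending_after=0, realloc_before=3, realloc_after=3
)
print(outcome)
```

With 852 pending sectors going to 0 and the re-allocated count staying at 3, this reports "re-written in place".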

 

Your power supply has a single 52 Ampere 12 volt rail, so its capacity should be OK.  That leaves temperature, vibration, poor quality voltage regulation (bad power supply splitters, back-plane, etc) or a disk sensitive to environmental factors. 
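As a rough back-of-the-envelope check on the rail capacity: the sketch below takes the nine assigned drives from the device list earlier in the thread and assumes about 2 A at 12 V per 3.5-inch drive during spin-up. That per-drive figure is an assumed typical value, not taken from these drives' datasheets, and the motherboard and CPU also draw from the same rail.

```python
RAIL_AMPS = 52.0             # single 12 V rail rating mentioned above
RAIL_VOLTS = 12.0
DRIVE_COUNT = 9              # parity + disk1..disk7 + cache from the device list
SPINUP_AMPS_PER_DRIVE = 2.0  # ASSUMPTION: typical 3.5-inch spin-up draw, check the datasheet

rail_watts = RAIL_AMPS * RAIL_VOLTS
spinup_amps = DRIVE_COUNT * SPINUP_AMPS_PER_DRIVE
headroom_amps = RAIL_AMPS - spinup_amps

print(f"12 V rail capacity: {rail_watts:.0f} W")
print(f"Worst-case simultaneous spin-up: {spinup_amps:.0f} A")
print(f"Headroom for everything else on 12 V: {headroom_amps:.0f} A")
```

Under that assumption all nine drives spinning up at once would need about 18 A of the 52 A available, which is consistent with the capacity being adequate.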

 

Was the disk used in another PC first?  How did it get 852 unreadable sectors?  It appears as if they were marked as un-readable in a prior use?  Perhaps the disk is fine in the unRAID server, but horrible in its prior use?

 

Joe L.

 

 

Posted

It has been the cache drive in the unRAID box for 6+ months (I did a preclear before I put it in, and it didn't have those 800+ pending sectors then)... I think it always had the 3 "bad" sectors it still reports.

 

Can dust cause an issue?  It was a little dusty when I pulled it out (I cleaned it and all the filters in the case before doing the preclear that you now have the reports for).

 

I'll check my power splitters, but if I remember correctly, I don't use any (everything runs directly from the power supply to a drive or hotswap cage).

The drive I replaced it with, which I'm now using as the cache drive, is in the same hotswap bay (so if it's the backplane of the hotswap bay, that drive should have issues too... in theory, anyway?).

 

Temp never goes above 33 °C in the case that the drive is normally in (basement, with fans running over all the hard drives).

 

Thanks for your input Joe... I love how active you are with unraid!

Archived

This topic is now archived and is closed to further replies.
