marcusone Posted December 26, 2011

Please help with this... For now I've just posted the section of my log that I'm worried about (every now and again I get a bunch of this kind of error showing); the entire syslog is attached. Parity checks are all clean.

Dec 22 00:13:37 RCNAS kernel: ata11.01: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0 (Errors)
Dec 22 00:13:37 RCNAS kernel: ata11.01: BMDMA stat 0x64 (Drive related)
Dec 22 00:13:37 RCNAS kernel: ata11.01: failed command: READ DMA EXT (Minor Issues)
Dec 22 00:13:37 RCNAS kernel: ata11.01: cmd 25/00:00:17:ef:d2/00:02:2b:00:00/f0 tag 0 dma 262144 in (Drive related)
Dec 22 00:13:37 RCNAS kernel: res 51/40:00:2d:ef:d2/40:00:2b:00:00/f0 Emask 0x9 (media error) (Errors)
Dec 22 00:13:37 RCNAS kernel: ata11.01: status: { DRDY ERR } (Drive related)
Dec 22 00:13:37 RCNAS kernel: ata11.01: error: { UNC } (Errors)
Dec 22 00:13:38 RCNAS kernel: ata11.00: configured for UDMA/133 (Drive related)
Dec 22 00:13:38 RCNAS kernel: ata11.01: configured for UDMA/133 (Drive related)

and on and on... see the attachment for the full syslog (note the missing lines in the syslog are just the mover script logs; you don't need to see the types of files I keep, do you?).

I believe that ata11.01 is the cache drive. Should I be worried? The cache drive is on the motherboard controller.
Disk devices
parity device: pci-0000:00:1f.2-scsi-1:0:1:0 host12 (sdj) WDC_WD20EARS-00MVWB0_WD-WMAZA3407269
disk1 device: pci-0000:01:00.0-scsi-1:0:0:0 host1 (sdb) WDC_WD10EACS-00D6B0_WD-WCAU40384147
disk2 device: pci-0000:00:1f.2-scsi-0:0:0:0 host11 (sdh) WDC_WD10EAVS-00D7B1_WD-WCAU46190122
disk3 device: pci-0000:01:00.0-scsi-2:0:0:0 host2 (sdc) WDC_WD10EADS-00L5B1_WD-WCAU46192923
disk4 device: pci-0000:01:00.0-scsi-3:0:0:0 host3 (sdd) WDC_WD10EADS-00M2B0_WD-WMAV50454466
disk5 device: pci-0000:02:00.0-scsi-1:0:0:0 host6 (sde) WDC_WD10EADS-00M2B0_WD-WMAV50297857
disk6 device: pci-0000:04:02.0-scsi-3:0:0:0 host10 (sdg) WDC_WD15EARS-00MVWB0_WD-WCAZA2550600
disk7 device: pci-0000:01:00.0-scsi-0:0:0:0 host0 (sda) WDC_WD20EARS-00MVWB0_WD-WMAZA3269017
disk8 device: unassigned
disk9 device: unassigned
disk10 device: unassigned
disk11 device: unassigned
disk12 device: unassigned
disk13 device: unassigned
disk14 device: unassigned
disk15 device: unassigned
disk16 device: unassigned
disk17 device: unassigned
disk18 device: unassigned
disk19 device: unassigned
disk20 device: unassigned
cache device: pci-0000:00:1f.2-scsi-0:0:1:0 host11 (sdi) WDC_WD1001FALS-00J7B0_WD-WMATV0910106

Thanks for your help!

syslog-2011-12-26.zip
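For readers decoding the `cmd` and `res` lines in the log excerpt, here is a minimal Python sketch. It assumes the bytes follow the libata error-report order (command or status / feature or error : sector count : three low LBA bytes / the matching high-order bytes / device). That field order is an assumption, not something stated in this thread, and `decode_taskfile` is a hypothetical helper name.

```python
import re

# Assumed field order (libata error printout):
#   first/second:nsect:lbal:lbam:lbah/hob_second:hob_nsect:hob_lbal:hob_lbam:hob_lbah/device
# where first/second are command/feature on a "cmd" line and status/error on a "res" line.
TASKFILE = re.compile(
    r"(?:cmd|res) ([0-9a-f]{2})/"
    r"((?:[0-9a-f]{2}:){4}[0-9a-f]{2})/"
    r"((?:[0-9a-f]{2}:){4}[0-9a-f]{2})/"
    r"([0-9a-f]{2})"
)

def decode_taskfile(line):
    """Hypothetical helper: extract the 48-bit LBA and sector count from a cmd/res line."""
    m = TASKFILE.search(line)
    if m is None:
        raise ValueError("no cmd/res taskfile found in line")
    first = int(m.group(1), 16)
    second, nsect, lbal, lbam, lbah = (int(x, 16) for x in m.group(2).split(":"))
    _hob_second, hob_nsect, hob_lbal, hob_lbam, hob_lbah = (
        int(x, 16) for x in m.group(3).split(":")
    )
    # LBA48: the high-order ("hob") bytes sit above the three low-order bytes.
    lba = (
        (hob_lbah << 40) | (hob_lbam << 32) | (hob_lbal << 24)
        | (lbah << 16) | (lbam << 8) | lbal
    )
    sectors = (hob_nsect << 8) | nsect
    return {"first": first, "second": second, "lba": lba, "sectors": sectors}

cmd = decode_taskfile("cmd 25/00:00:17:ef:d2/00:02:2b:00:00/f0 tag 0 dma 262144 in")
res = decode_taskfile("res 51/40:00:2d:ef:d2/40:00:2b:00:00/f0 Emask 0x9 (media error)")

print(hex(cmd["lba"]), cmd["sectors"] * 512)   # 0x2bd2ef17 262144
print(hex(res["lba"]), res["lba"] - cmd["lba"])  # 0x2bd2ef2d 22
```

Under that assumption the request starts at LBA 0x2bd2ef17 and covers 512 sectors (512 × 512 = 262144 bytes, which agrees with the `dma 262144 in` in the log), and the drive reported the unreadable sector 22 sectors into the request.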
marcusone (Author) Posted December 26, 2011

Thought I'd add the SMART report for the cache drive (the one with the possible issue, I assume?):

SMART status Info for /dev/sdi

smartctl version 5.38 [i486-slackware-linux-gnu] Copyright © 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF INFORMATION SECTION ===
Device Model: WDC WD1001FALS-00J7B0
Serial Number: WD-WMATV0910106
Firmware Version: 05.00K05
User Capacity: 1,000,204,886,016 bytes
Device is: Not in smartctl database [for details use: -P showall]
ATA Version is: 8
ATA Standard is: Exact ATA specification draft version not indicated
Local Time is: Mon Dec 26 16:09:08 2011 EST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status: (0x82) Offline data collection activity was completed without error. Auto Offline Data Collection: Enabled.
Self-test execution status: ( 0) The previous self-test routine completed without error or no self-test has ever been run.
Total time to complete Offline data collection: (19200) seconds.
Offline data collection capabilities: (0x7b) SMART execute Offline immediate. Auto Offline data collection on/off support. Suspend Offline collection upon new command. Offline surface scan supported. Self-test supported. Conveyance Self-test supported. Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering power-saving mode. Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported. General Purpose Logging supported.
Short self-test routine recommended polling time: ( 2) minutes.
Extended self-test routine recommended polling time: ( 221) minutes.
Conveyance self-test routine recommended polling time: ( 5) minutes.
SCT capabilities: (0x303f) SCT Status supported. SCT Feature Control supported. SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID#  ATTRIBUTE_NAME           FLAG    VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1  Raw_Read_Error_Rate      0x002f  200   200   051    Pre-fail  Always   -           65
  3  Spin_Up_Time             0x0027  236   232   021    Pre-fail  Always   -           8200
  4  Start_Stop_Count         0x0032  097   097   000    Old_age   Always   -           3146
  5  Reallocated_Sector_Ct    0x0033  199   199   140    Pre-fail  Always   -           3
  7  Seek_Error_Rate          0x002e  200   200   000    Old_age   Always   -           0
  9  Power_On_Hours           0x0032  068   068   000    Old_age   Always   -           23866
 10  Spin_Retry_Count         0x0032  100   100   000    Old_age   Always   -           0
 11  Calibration_Retry_Count  0x0032  100   253   000    Old_age   Always   -           0
 12  Power_Cycle_Count        0x0032  100   100   000    Old_age   Always   -           77
192  Power-Off_Retract_Count  0x0032  200   200   000    Old_age   Always   -           13
193  Load_Cycle_Count         0x0032  199   199   000    Old_age   Always   -           3146
194  Temperature_Celsius      0x0022  117   109   000    Old_age   Always   -           33
196  Reallocated_Event_Count  0x0032  197   197   000    Old_age   Always   -           3
197  Current_Pending_Sector   0x0032  195   195   000    Old_age   Always   -           852
198  Offline_Uncorrectable    0x0030  200   197   000    Old_age   Offline  -           0
199  UDMA_CRC_Error_Count     0x0032  200   200   000    Old_age   Always   -           0
200  Multi_Zone_Error_Rate    0x0008  200   174   000    Old_age   Offline  -           0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description  Status                   Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline     Completed without error  00%        16746            -

SMART Selective self-test log data structure revision number 1
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
   1        0        0  Not_testing
   2        0        0  Not_testing
   3        0        0  Not_testing
   4        0        0  Not_testing
   5        0        0  Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
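The diagnosis in the next reply rests on the raw value of attribute 197. Below is a rough Python sketch of pulling raw values out of attribute rows like the ones in this report. It assumes the ten-column layout shown here; the `WATCHED` list is an illustrative choice rather than a smartctl feature, and both function names are hypothetical.

```python
def parse_smart_attributes(text):
    """Return {attribute_name: raw_value} for rows in the ten-column attribute table."""
    attrs = {}
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least ten columns;
        # the header row ("ID# ...") and other report lines are skipped.
        if len(fields) >= 10 and fields[0].isdigit():
            try:
                attrs[fields[1]] = int(fields[9])
            except ValueError:
                continue  # raw value is not a plain integer; ignore this row
    return attrs

# Illustrative assumption: attributes worth a second look when their raw value is non-zero.
WATCHED = ("Reallocated_Sector_Ct", "Current_Pending_Sector", "Offline_Uncorrectable")

def nonzero_watched(attrs):
    """Return only the watched attributes whose raw value is above zero."""
    return {name: attrs[name] for name in WATCHED if attrs.get(name, 0) > 0}

sample = """\
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct 0x0033 199 199 140 Pre-fail Always - 3
197 Current_Pending_Sector 0x0032 195 195 000 Old_age Always - 852
198 Offline_Uncorrectable 0x0030 200 197 000 Old_age Offline - 0
"""

flagged = nonzero_watched(parse_smart_attributes(sample))
print(flagged)  # {'Reallocated_Sector_Ct': 3, 'Current_Pending_Sector': 852}
```

Note that the drive still reports an overall PASSED health result; the sketch only surfaces the raw counts, and deciding what they mean is left to a person.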
Joe L. Posted December 26, 2011

Your drive is dying.

197 Current_Pending_Sector 0x0032 195 195 000 Old_age Always - 852

There are 852 unreadable sectors, pending re-allocation when next written to. Time to RMA it. (Those are the "media errors" in your first post.)
marcusone (Author) Posted December 27, 2011

What I was afraid you would say... Is WD good to deal with for RMA? Or should I just buy a new one and save myself the hassle?
marcusone (Author) Posted December 27, 2011

Forget my last post... it's out of warranty (I don't have the receipt, so they use the manufacture date). I see Joe's rep on here is immaculate, so he isn't likely to be disagreed with. Off to the store I go.

Thanks again!
marcusone (Author) Posted December 29, 2011

One more question Joe (or anyone), please... So I replaced the drive with another spare I had so I could RMA it (turned out to be under warranty after all)... Before sending it in I decided to do a pre-clear on it to see how many of the pending sectors would switch to reallocated. Zero did. After the pre-clear it said:

0 sectors are pending re-allocation at the end of the preclear, a change of -852 in the number of sectors pending re-allocation.

Would this indicate the drive is actually OK? What should I tell WD if I RMA it?

Thanks!
Joe L. Posted December 29, 2011

Were they re-allocated? Or were they successfully re-written in place? If re-written in place, then I would suspect the drive OR the power supply. If simply re-allocated, then yes, RMA it. You need to look now at a current SMART report for that drive.
jumperalex Posted December 29, 2011

Just sorta jumping in here (following this thread for educational purposes), but isn't this something the pre-clear script should pick up? As in, seeing that the reallocated sector count has now gone up? Or is the logic too difficult to script, thus requiring a human to look at it? In which case, it might be a good idea to tell the user, "something changed, here are the possibilities, go check the SMART report." Or something to clue them in?
Joe L. Posted December 29, 2011

It would have... but marcusone elected to only post one line from the final report, and not the entire report. Therefore, we cannot tell, as our psychic skills are a bit rusty this late in the year.

I really have no way to tell how a manufacturer reacts when a specific drive is returned. I've seen people return a drive with only a few re-allocated sectors. I honestly doubt the manufacturers have the time to verify the returned drives when in an RMA process. They would just rather you not return a working drive.

If you have doubt, RMA the drive, especially if it had over 800 sectors that it apparently either re-allocated because they could not be read, or re-wrote in place because they could not be read as originally written. (800 sectors would probably not cause a SMART failure, as most drives have several thousand spare sectors, but it is a certain clue that more sectors will fail early in the drive's life.)
jumperalex Posted December 29, 2011

Fair enough, I just figured if there had been a blinking, flashing, screaming, bolded, airplane-towed banner in the report he would have included it. As such I assumed it was either not there, or just slightly more subtle.
marcusone (Author) Posted December 29, 2011

Sorry, here are the preclear reports:

preclear_reports.zip
marcusone (Author) Posted December 29, 2011

So how do I determine if it's the power supply or the hard drive? I'm using the same power supply as the LimeTech-built rigs have, a "Corsair CMPSU-650TX 650W ATX12V / EPS12V", which I put in not even a year ago.
Joe L. Posted December 29, 2011

These lines summed it up:

No SMART attributes are FAILING_NOW
852 sectors were pending re-allocation before the start of the preclear.
852 sectors were pending re-allocation after pre-read in cycle 1 of 1.
0 sectors were pending re-allocation after zero of disk in cycle 1 of 1.
0 sectors are pending re-allocation at the end of the preclear, a change of -852 in the number of sectors pending re-allocation.
3 sectors had been re-allocated before the start of the preclear.
3 sectors are re-allocated at the end of the preclear, the number of sectors re-allocated did not change.

So, every sector that could not be read and was pending re-allocation was able to be read once re-written in place.

Your power supply has a single 52 Ampere 12 volt rail, so its capacity should be OK. That leaves temperature, vibration, poor quality voltage regulation (bad power supply splitters, back-plane, etc.) or a disk sensitive to environmental factors.

Was the disk used in another PC first? How did it get 852 unreadable sectors? It appears as if they were marked as unreadable in a prior use. Perhaps the disk is fine in the unRAID server, but was horrible in its prior use?

Joe L.
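The reasoning in this reply (the pending count drops while the re-allocated count stays the same, so the sectors were re-written in place) can be sketched in Python. This is not how the preclear script itself works; `classify_pending_change` and its labels are hypothetical, made up for illustration.

```python
def classify_pending_change(pending_before, pending_after, realloc_before, realloc_after):
    """Hypothetical helper: label what happened to pending sectors across a full-disk write."""
    cleared = pending_before - pending_after            # pending sectors that went away
    newly_reallocated = realloc_after - realloc_before  # spares consumed during the run
    if cleared <= 0:
        return "pending count did not drop"
    if newly_reallocated == 0:
        # Every cleared sector accepted the new data at its original location.
        return "re-written in place"
    if newly_reallocated >= cleared:
        # The drive remapped the cleared sectors to spares.
        return "re-allocated"
    return "mixed: some re-written in place, some re-allocated"

# Figures from the preclear report quoted above: 852 -> 0 pending, 3 -> 3 re-allocated.
print(classify_pending_change(852, 0, 3, 3))  # re-written in place
```

The label only describes what the counters did. Whether to blame the drive, the power supply, or the environment is the judgment call discussed in the rest of the thread.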
marcusone (Author) Posted December 29, 2011

It has been the cache drive in the unRAID box for 6+ months (I did a preclear before I put it in, and it didn't have those 800+ pending sectors then)... I think it always had the 3 "bad" sectors it still reports.

Can dust cause an issue? It was a little dusty when I pulled it out (I cleaned it and all the filters in the case before doing the preclear that you now have the reports for).

I'll check my power splitters, but if I remember correctly, I don't use any (all direct from power supply to drive or hotswap cage). The drive I replaced it with, which I'm now using as the cache drive, is in the same hotswap bay (so if it's the back plane of the hotswap bay, it should cause that drive to have issues too... in theory anyway?).

Temp never goes above 33 in the case that the drive is normally in (basement, with fans running over all the hard drives).

Thanks for your input Joe... I love how active you are with unRAID!
This topic is now archived and is closed to further replies.