February 2, 2013

All right, this is a bit strange: the monthly parity check this morning showed about 50,000 errors on DRIVE 1, but no red ball, and parity is still valid...

I don't have the full log easily available (you really want a 169 day log?) but this is the significant part...

Feb 1 17:01:42 PINEWOOD kernel: ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Feb 1 17:01:42 PINEWOOD kernel: ata2.00: irq_stat 0x40000001
Feb 1 17:01:42 PINEWOOD kernel: ata2.00: failed command: READ DMA EXT
Feb 1 17:01:42 PINEWOOD kernel: ata2.00: cmd 25/00:00:ff:9b:d1/00:04:1a:00:00/e0 tag 0 dma 524288 in
Feb 1 17:01:42 PINEWOOD kernel: res 51/40:00:ff:9b:d1/00:04:1a:00:00/e0 Emask 0x9 (media error)
Feb 1 17:01:42 PINEWOOD kernel: ata2.00: status: { DRDY ERR }
Feb 1 17:01:42 PINEWOOD kernel: ata2.00: error: { UNC }
Feb 1 17:01:42 PINEWOOD kernel: ata2.00: configured for UDMA/133
Feb 1 17:01:42 PINEWOOD kernel: sd 1:0:0:0: [sdb] Unhandled sense code
Feb 1 17:01:42 PINEWOOD kernel: sd 1:0:0:0: [sdb] Result: hostbyte=0x00 driverbyte=0x08
Feb 1 17:01:42 PINEWOOD kernel: sd 1:0:0:0: [sdb] Sense Key : 0x3 [current] [descriptor]
Feb 1 17:01:42 PINEWOOD kernel: Descriptor sense data with sense descriptors (in hex):
Feb 1 17:01:42 PINEWOOD kernel: 72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00
Feb 1 17:01:42 PINEWOOD kernel: 1a d1 9b ff
Feb 1 17:01:42 PINEWOOD kernel: sd 1:0:0:0: [sdb] ASC=0x11 ASCQ=0x4
Feb 1 17:01:42 PINEWOOD kernel: sd 1:0:0:0: [sdb] CDB: cdb[0]=0x28: 28 00 1a d1 9b ff 00 04 00 00
Feb 1 17:01:42 PINEWOOD kernel: end_request: I/O error, dev sdb, sector 449944575
Feb 1 17:01:42 PINEWOOD kernel: md: disk1 read error
Feb 1 17:01:42 PINEWOOD kernel: handle_stripe read error: 449944512/1, count: 1
Feb 1 17:01:42 PINEWOOD kernel: ata2: EH complete
Feb 1 17:01:42 PINEWOOD kernel: md: disk1 read error
Feb 1 17:01:42 PINEWOOD kernel: handle_stripe read error: 449944520/1, count: 1
Feb 1 17:01:42 PINEWOOD kernel: md: disk1 read error
Feb 1 17:01:42 PINEWOOD kernel: handle_stripe read error: 449944528/1, count: 1
Feb 1 17:01:42 PINEWOOD kernel: md: disk1 read error
Feb 1 17:01:42 PINEWOOD kernel: handle_stripe read error: 449944536/1, count: 1
Feb 1 17:01:42 PINEWOOD kernel: md: disk1 read error
Feb 1 17:01:42 PINEWOOD kernel: handle_stripe read error: 449944544/1, count: 1
Feb 1 17:01:42 PINEWOOD kernel: md: disk1 read error
Feb 1 17:01:42 PINEWOOD kernel: handle_stripe read error: 449944552/1, count: 1
Feb 1 17:01:42 PINEWOOD kernel: md: disk1 read error
Feb 1 17:01:42 PINEWOOD kernel: handle_stripe read error: 449944560/1, count: 1
Feb 1 17:01:42 PINEWOOD kernel: md: disk1 read error
Feb 1 17:01:42 PINEWOOD kernel: handle_stripe read error: 449944568/1, count: 1
Feb 1 17:01:42 PINEWOOD kernel: md: disk1 read error
(this goes on for about 200 lines)
Feb 1 17:01:46 PINEWOOD kernel: ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Feb 1 17:01:46 PINEWOOD kernel: ata2.00: irq_stat 0x40000001
Feb 1 17:01:46 PINEWOOD kernel: ata2.00: failed command: READ DMA EXT
Feb 1 17:01:46 PINEWOOD kernel: ata2.00: cmd 25/00:00:ff:9f:d1/00:04:1a:00:00/e0 tag 0 dma 524288 in
Feb 1 17:01:46 PINEWOOD kernel: res 51/40:bf:37:a2:d1/00:01:1a:00:00/e0 Emask 0x9 (media error)
Feb 1 17:01:46 PINEWOOD kernel: ata2.00: status: { DRDY ERR }
Feb 1 17:01:46 PINEWOOD kernel: ata2.00: error: { UNC }
Feb 1 17:01:46 PINEWOOD kernel: ata2.00: configured for UDMA/133
Feb 1 17:01:46 PINEWOOD kernel: ata2: EH complete
Feb 1 17:01:49 PINEWOOD kernel: mdcmd (55): spindown 1
Feb 1 17:01:54 PINEWOOD kernel: ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Feb 1 17:01:54 PINEWOOD kernel: ata2.00: irq_stat 0x40000001
Feb 1 17:01:54 PINEWOOD kernel: ata2.00: failed command: READ DMA EXT
Feb 1 17:01:54 PINEWOOD kernel: ata2.00: cmd 25/00:00:ff:ab:d1/00:04:1a:00:00/e0 tag 0 dma 524288 in
Feb 1 17:01:54 PINEWOOD kernel: res 51/40:af:47:ad:d1/00:02:1a:00:00/e0 Emask 0x9 (media error)
Feb 1 17:01:54 PINEWOOD kernel: ata2.00: status: { DRDY ERR }
Feb 1 17:01:54 PINEWOOD kernel: ata2.00: error: { UNC }
Feb 1 17:01:54 PINEWOOD kernel: ata2.00: configured for UDMA/133
Feb 1 17:01:54 PINEWOOD kernel: ata2: EH complete
Feb 1 17:02:10 PINEWOOD kernel: ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Feb 1 17:02:10 PINEWOOD kernel: ata2.00: irq_stat 0x40000001
Feb 1 17:02:10 PINEWOOD kernel: ata2.00: failed command: READ DMA EXT
Feb 1 17:02:10 PINEWOOD kernel: ata2.00: cmd 25/00:08:3f:07:5a/00:00:c6:00:00/e0 tag 0 dma 4096 in
Feb 1 17:02:10 PINEWOOD kernel: res 51/40:08:3f:07:5a/00:00:c6:00:00/e0 Emask 0x9 (media error)
Feb 1 17:02:10 PINEWOOD kernel: ata2.00: status: { DRDY ERR }
Feb 1 17:02:10 PINEWOOD kernel: ata2.00: error: { UNC }
Feb 1 17:02:10 PINEWOOD kernel: ata2.00: configured for UDMA/133

I looked over in Disk Health: "no errors logged".

Smart Info
smartctl -a -d ata /dev/sdb (disk1)

smartctl 5.40 2010-10-16 r3189 [i486-slackware-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Caviar Green (Adv. Format) family
Device Model:     WDC WD20EARS-00MVWB0
Serial Number:    WD-WMAZA3266419
Firmware Version: 51.0AB51
User Capacity:    2,000,398,934,016 bytes
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   8
ATA Standard is:  Exact ATA specification draft version not indicated
Local Time is:    Fri Feb 1 20:16:32 2013 MST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
--
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   157   157   051    Pre-fail  Always       -       50932
  3 Spin_Up_Time            0x0027   253   172   021    Pre-fail  Always       -       1741
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       408
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       4
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   078   078   000    Old_age   Always       -       16642
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       57
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       37
193 Load_Cycle_Count        0x0032   191   191   000    Old_age   Always       -       28866
194 Temperature_Celsius     0x0022   121   099   000    Old_age   Always       -       29
196 Reallocated_Event_Count 0x0032   196   196   000    Old_age   Always       -       4
197 Current_Pending_Sector  0x0032   199   199   000    Old_age   Always       -       384
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   199   197   000    Old_age   Offline      -       287

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                   Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error  00%        16642            -
# 2  Extended offline    Completed without error  00%        10107            -

((points at drive age and current pending sector count))

SO, can someone refresh my poor memory on the swap-out procedure for this?

Unassign Bad Drive
Stop / Remove
Start
Stop / Put New Drive in
Start
Assign New Drive
Watch the rebuild for a few hours

(I knew something was wrong when the parity check took 18 hours)
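For anyone cross-checking the log excerpt, here is a minimal Python sketch that decodes the CDB bytes from the `[sdb] CDB:` line. It assumes the command is a standard 10-byte SCSI READ(10) (opcode 0x28) and that the drive reports 512-byte logical sectors; the function name `decode_read10_cdb` is made up for illustration. The decoded LBA matches the `sector 449944575` in the `end_request` line, and the transfer size matches the `dma 524288` in the failed ATA command.

```python
def decode_read10_cdb(cdb_hex: str, sector_size: int = 512) -> dict:
    """Decode a SCSI READ(10) CDB given as space-separated hex bytes.

    sector_size is an assumption (512-byte logical sectors).
    """
    cdb = bytes.fromhex(cdb_hex)
    if len(cdb) != 10 or cdb[0] != 0x28:
        raise ValueError("not a 10-byte READ(10) CDB")
    # Bytes 2..5 hold the starting LBA, big-endian.
    lba = int.from_bytes(cdb[2:6], "big")
    # Bytes 7..8 hold the transfer length in blocks, big-endian.
    blocks = int.from_bytes(cdb[7:9], "big")
    return {"lba": lba, "blocks": blocks, "bytes": blocks * sector_size}


# CDB copied from the log line "[sdb] CDB: cdb[0]=0x28: ..."
info = decode_read10_cdb("28 00 1a d1 9b ff 00 04 00 00")
print(info)  # {'lba': 449944575, 'blocks': 1024, 'bytes': 524288}
```

The same four bytes (`1a d1 9b ff`) also appear at the end of the descriptor sense data, which is where the kernel gets the failing sector number it reports.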
February 2, 2013

You have an extra start and stop in the middle. This is only needed to use the same drive as its own replacement.
February 3, 2013

I would say...

Stop - (power down) / Remove (failing drive, or one to be otherwise replaced) - install the new drive at this point too (if you have one on hand)
Start - (power back up) - unRAID can now run in a limited mode, running from PARITY with full access to the original data - the removed drive will show as missing...
- - At this point you can either use unRAID until a replacement drive arrives, if you did not have one,
- - - or - continue with the following steps now...
* * Now is a good time to run preclear_disk.sh on the replacement drive, if not already done.
Stop - (ARRAY ONLY, do not power down again here)
Assign New Drive
Start - (hopefully you precleared the drive first... right?)
Watch the rebuild for a few hours - or a long forced preclear process begins FIRST, with the unRAID array unavailable on the network, before the rebuild can start. :-(

Looking at your log and SMART data, I would say you have media problems on the drive, and I suspect the errors in the log were the result of the drive timing out while spending excessive time on the bad sectors it has been encountering and fixing...

I would not trust the drive for anything that is not replaceable at this point. I would suggest running the WD diagnostic tools on the drive and doing an advanced test battery on it, if you are interested in details on the drive, and then trying to get an RMA approval and a swap-out replacement from Western Digital.

NOTE: I love the preclear_disk.sh script! It gives me much reassurance and peace of mind, knowing that a new (or used) drive looks really healthy before I start to rely on it in unRAID. Then of course the monthly parity checks also help me and my DATA stay happy! :-)
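As a rough illustration of the "media problems" reading of the SMART data, here is a small Python sketch that scans smartctl attribute rows and flags non-zero raw values for reallocated, pending, and offline-uncorrectable sectors. Treating those three attribute IDs as the ones to watch is a common rule of thumb and an assumption here, not something smartctl itself decides; the names `WATCHED` and `flag_smart_rows` are made up for the example. The sample rows are copied from the first post.

```python
# Attribute IDs commonly watched for media trouble (an assumption, not a standard).
WATCHED = {5, 197, 198}


def flag_smart_rows(report: str) -> dict:
    """Return {attribute_name: raw_value} for watched attributes with raw > 0."""
    flagged = {}
    for line in report.splitlines():
        fields = line.split()
        # An attribute row has 10 columns and starts with a numeric ID.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        attr_id, name, raw = int(fields[0]), fields[1], fields[9]
        if attr_id in WATCHED and raw.isdigit() and int(raw) > 0:
            flagged[name] = int(raw)
    return flagged


SAMPLE = """\
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       4
197 Current_Pending_Sector  0x0032   199   199   000    Old_age   Always       -       384
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
"""

result = flag_smart_rows(SAMPLE)
print(result)  # {'Reallocated_Sector_Ct': 4, 'Current_Pending_Sector': 384}
```

Note that the drive's own overall-health result was still PASSED with these values, so a check like this is only a prompt to look closer, not a verdict.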
February 3, 2013

"Start - (power back up) - unRAID can run in a limited mode now running from PARITY with full access to original data - removed drive will show as missing..."

"Watch the rebuild for a few hours - or long forced preclear process begins FIRST with an unavailable unRAID array on the network... before rebuild can start. :-("

The array should be available throughout the rebuild process, drive precleared or not. The lengthy array unavailability only happens when adding an unprecleared drive to an additional slot in an already parity-protected array. When rebuilding, the array is simply writing the data being emulated from parity back to the drive; it doesn't care whether the drive is already full of zeros or not.
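The point about the rebuild not caring what is already on the replacement can be sketched in a few lines of Python. This is a simplified toy model that assumes single XOR parity across the data drives; it is not unRAID's actual driver code, and `xor_blocks` and `rebuild` are names invented for the example.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, position by position."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Three toy "data drives" and the parity computed across them.
data = [b"\x11\x22\x33\x44", b"\xa0\xb1\xc2\xd3", b"\x0f\x1e\x2d\x3c"]
parity = xor_blocks(data)


def rebuild(surviving, parity_block, replacement_contents):
    """Reconstruct the missing drive's data.

    replacement_contents is deliberately unused: the rebuild overwrites the
    new drive with data derived only from parity and the surviving drives.
    """
    return xor_blocks(surviving + [parity_block])


survivors = [data[0], data[2]]  # the drive at index 1 has been pulled
from_zeroed = rebuild(survivors, parity, b"\x00" * 4)  # precleared replacement
from_dirty = rebuild(survivors, parity, b"\xff" * 4)   # non-precleared replacement
assert from_zeroed == from_dirty == data[1]
```

Adding a drive to a new slot is different in this model: parity was computed without it, so its contents must be all zeros (or parity must be updated) for the XOR to stay valid, which is consistent with the clearing step described above.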
February 3, 2013

Thanks for the info... I guess I just have not looked into the specifics of the rebuild process for a LONG time now. I was thinking that the drive needed to be cleared/zeroed and have the signature written before a rebuild, so unRAID would force a clear if the drive was not already precleared.

It has been FAR too long for me to remember what happened when I was doing my original tests of how unRAID behaves under drive failures and rebuilds, back when I was deciding whether to switch my machines over to unRAID and make it my primary, secondary, and tertiary storage method of choice. :-)
February 4, 2013 (Author)

All better. The rebuild took about 5 hours, and all the other drives look good... my luck must be holding. The one I pulled is still in warranty.