[SORTED] RC8, Read Errors, NO REDBALL, replace the drive?

February 2, 2013

All right, this is a bit strange: the monthly parity check this morning showed about 50,000 errors on DRIVE 1, but no redball, and parity is still valid... I don't have the full log easily available (you really want a 169 day log?) but this is the significant part...

Feb  1 17:01:42 PINEWOOD kernel: ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Feb  1 17:01:42 PINEWOOD kernel: ata2.00: irq_stat 0x40000001
Feb  1 17:01:42 PINEWOOD kernel: ata2.00: failed command: READ DMA EXT
Feb  1 17:01:42 PINEWOOD kernel: ata2.00: cmd 25/00:00:ff:9b:d1/00:04:1a:00:00/e0 tag 0 dma 524288 in
Feb  1 17:01:42 PINEWOOD kernel:          res 51/40:00:ff:9b:d1/00:04:1a:00:00/e0 Emask 0x9 (media error)
Feb  1 17:01:42 PINEWOOD kernel: ata2.00: status: { DRDY ERR }
Feb  1 17:01:42 PINEWOOD kernel: ata2.00: error: { UNC }
Feb  1 17:01:42 PINEWOOD kernel: ata2.00: configured for UDMA/133
Feb  1 17:01:42 PINEWOOD kernel: sd 1:0:0:0: [sdb] Unhandled sense code
Feb  1 17:01:42 PINEWOOD kernel: sd 1:0:0:0: [sdb]  Result: hostbyte=0x00 driverbyte=0x08
Feb  1 17:01:42 PINEWOOD kernel: sd 1:0:0:0: [sdb]  Sense Key : 0x3 [current] [descriptor]
Feb  1 17:01:42 PINEWOOD kernel: Descriptor sense data with sense descriptors (in hex):
Feb  1 17:01:42 PINEWOOD kernel:         72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00 
Feb  1 17:01:42 PINEWOOD kernel:         1a d1 9b ff 
Feb  1 17:01:42 PINEWOOD kernel: sd 1:0:0:0: [sdb]  ASC=0x11 ASCQ=0x4
Feb  1 17:01:42 PINEWOOD kernel: sd 1:0:0:0: [sdb] CDB: cdb[0]=0x28: 28 00 1a d1 9b ff 00 04 00 00
Feb  1 17:01:42 PINEWOOD kernel: end_request: I/O error, dev sdb, sector 449944575
Feb  1 17:01:42 PINEWOOD kernel: md: disk1 read error
Feb  1 17:01:42 PINEWOOD kernel: handle_stripe read error: 449944512/1, count: 1
Feb  1 17:01:42 PINEWOOD kernel: ata2: EH complete
Feb  1 17:01:42 PINEWOOD kernel: md: disk1 read error
Feb  1 17:01:42 PINEWOOD kernel: handle_stripe read error: 449944520/1, count: 1
Feb  1 17:01:42 PINEWOOD kernel: md: disk1 read error
Feb  1 17:01:42 PINEWOOD kernel: handle_stripe read error: 449944528/1, count: 1
Feb  1 17:01:42 PINEWOOD kernel: md: disk1 read error
Feb  1 17:01:42 PINEWOOD kernel: handle_stripe read error: 449944536/1, count: 1
Feb  1 17:01:42 PINEWOOD kernel: md: disk1 read error
Feb  1 17:01:42 PINEWOOD kernel: handle_stripe read error: 449944544/1, count: 1
Feb  1 17:01:42 PINEWOOD kernel: md: disk1 read error
Feb  1 17:01:42 PINEWOOD kernel: handle_stripe read error: 449944552/1, count: 1
Feb  1 17:01:42 PINEWOOD kernel: md: disk1 read error
Feb  1 17:01:42 PINEWOOD kernel: handle_stripe read error: 449944560/1, count: 1
Feb  1 17:01:42 PINEWOOD kernel: md: disk1 read error
Feb  1 17:01:42 PINEWOOD kernel: handle_stripe read error: 449944568/1, count: 1
Feb  1 17:01:42 PINEWOOD kernel: md: disk1 read error

(this goes on for about 200 lines)

Feb  1 17:01:46 PINEWOOD kernel: ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Feb  1 17:01:46 PINEWOOD kernel: ata2.00: irq_stat 0x40000001
Feb  1 17:01:46 PINEWOOD kernel: ata2.00: failed command: READ DMA EXT
Feb  1 17:01:46 PINEWOOD kernel: ata2.00: cmd 25/00:00:ff:9f:d1/00:04:1a:00:00/e0 tag 0 dma 524288 in
Feb  1 17:01:46 PINEWOOD kernel:          res 51/40:bf:37:a2:d1/00:01:1a:00:00/e0 Emask 0x9 (media error)
Feb  1 17:01:46 PINEWOOD kernel: ata2.00: status: { DRDY ERR }
Feb  1 17:01:46 PINEWOOD kernel: ata2.00: error: { UNC }
Feb  1 17:01:46 PINEWOOD kernel: ata2.00: configured for UDMA/133
Feb  1 17:01:46 PINEWOOD kernel: ata2: EH complete
Feb  1 17:01:49 PINEWOOD kernel: mdcmd (55): spindown 1
Feb  1 17:01:54 PINEWOOD kernel: ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Feb  1 17:01:54 PINEWOOD kernel: ata2.00: irq_stat 0x40000001
Feb  1 17:01:54 PINEWOOD kernel: ata2.00: failed command: READ DMA EXT
Feb  1 17:01:54 PINEWOOD kernel: ata2.00: cmd 25/00:00:ff:ab:d1/00:04:1a:00:00/e0 tag 0 dma 524288 in
Feb  1 17:01:54 PINEWOOD kernel:          res 51/40:af:47:ad:d1/00:02:1a:00:00/e0 Emask 0x9 (media error)
Feb  1 17:01:54 PINEWOOD kernel: ata2.00: status: { DRDY ERR }
Feb  1 17:01:54 PINEWOOD kernel: ata2.00: error: { UNC }
Feb  1 17:01:54 PINEWOOD kernel: ata2.00: configured for UDMA/133
Feb  1 17:01:54 PINEWOOD kernel: ata2: EH complete
Feb  1 17:02:10 PINEWOOD kernel: ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Feb  1 17:02:10 PINEWOOD kernel: ata2.00: irq_stat 0x40000001
Feb  1 17:02:10 PINEWOOD kernel: ata2.00: failed command: READ DMA EXT
Feb  1 17:02:10 PINEWOOD kernel: ata2.00: cmd 25/00:08:3f:07:5a/00:00:c6:00:00/e0 tag 0 dma 4096 in
Feb  1 17:02:10 PINEWOOD kernel:          res 51/40:08:3f:07:5a/00:00:c6:00:00/e0 Emask 0x9 (media error)
Feb  1 17:02:10 PINEWOOD kernel: ata2.00: status: { DRDY ERR }
Feb  1 17:02:10 PINEWOOD kernel: ata2.00: error: { UNC }
Feb  1 17:02:10 PINEWOOD kernel: ata2.00: configured for UDMA/133
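
For anyone reading along, the `cmd` and `res` lines above can be decoded by hand. A minimal sketch, assuming the usual libata field order (command/feature:nsect:lbal:lbam:lbah/hob_feature:hob_nsect:hob_lbal:hob_lbam:hob_lbah/device); the function name `decode_taskfile` is made up for illustration. It shows that the first failed command starts at the same sector the kernel reports on the `end_request` line.

```python
import re

# One two-digit hex byte; the taskfile is twelve of these separated by / and :
B = r"([0-9a-f]{2})"
TASKFILE = re.compile(f"{B}/{B}:{B}:{B}:{B}:{B}/{B}:{B}:{B}:{B}:{B}/{B}")

def decode_taskfile(text):
    """Decode a libata 'cmd' or 'res' taskfile string (hypothetical helper)."""
    m = TASKFILE.search(text)
    if m is None:
        raise ValueError("no taskfile found in: " + text)
    (command, feature, nsect, lbal, lbam, lbah,
     hob_feature, hob_nsect, hob_lbal, hob_lbam, hob_lbah,
     device) = (int(g, 16) for g in m.groups())
    # 48-bit LBA: the "hob" (high order byte) registers hold the upper 24 bits.
    lba = ((hob_lbah << 40) | (hob_lbam << 32) | (hob_lbal << 24)
           | (lbah << 16) | (lbam << 8) | lbal)
    # Raw 16-bit sector-count field.
    sectors = (hob_nsect << 8) | nsect
    return {"command": command, "lba": lba, "sectors": sectors}

cmd = decode_taskfile("cmd 25/00:00:ff:9b:d1/00:04:1a:00:00/e0 tag 0 dma 524288 in")
print(cmd["lba"])            # 449944575, the sector in the end_request line
print(cmd["sectors"] * 512)  # 524288, the "dma 524288" transfer size
```

Command 0x25 is READ DMA EXT, so each of the big failures is a single 1024-sector read that hit an uncorrectable (UNC) sector somewhere inside it.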

looked over in Disk Health, "no errors logged"

Smart Info

smartctl -a -d ata /dev/sdb (disk1)

smartctl 5.40 2010-10-16 r3189 [i486-slackware-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Caviar Green (Adv. Format) family
Device Model:     WDC WD20EARS-00MVWB0
Serial Number:    WD-WMAZA3266419
Firmware Version: 51.0AB51
User Capacity:    2,000,398,934,016 bytes
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   8
ATA Standard is:  Exact ATA specification draft version not indicated
Local Time is:    Fri Feb  1 20:16:32 2013 MST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

--

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   157   157   051    Pre-fail  Always       -       50932
  3 Spin_Up_Time            0x0027   253   172   021    Pre-fail  Always       -       1741
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       408
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       4
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   078   078   000    Old_age   Always       -       16642
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       57
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       37
193 Load_Cycle_Count        0x0032   191   191   000    Old_age   Always       -       28866
194 Temperature_Celsius     0x0022   121   099   000    Old_age   Always       -       29
196 Reallocated_Event_Count 0x0032   196   196   000    Old_age   Always       -       4
197 Current_Pending_Sector  0x0032   199   199   000    Old_age   Always       -       384
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   199   197   000    Old_age   Offline      -       287

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%     16642         -
# 2  Extended offline    Completed without error       00%     10107         -

((points at drive age and current pending sector count))
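
A rough sketch of how the same check could be scripted against saved `smartctl -a` output. The helper names and the "any non-zero raw value is worth a look" rule are assumptions for illustration, not anything smartctl or unRAID does on its own; the sample rows are copied from the report above.

```python
# Attributes watched here (an assumed rule of thumb: any raw value above 0).
WATCH = {"Reallocated_Sector_Ct", "Current_Pending_Sector", "Offline_Uncorrectable"}

def parse_attributes(smart_text):
    """Return {attribute_name: raw_value} from smartctl attribute rows."""
    attrs = {}
    for line in smart_text.splitlines():
        fields = line.split()
        # Attribute rows have at least 10 columns and start with a numeric ID.
        if len(fields) >= 10 and fields[0].isdigit():
            try:
                attrs[fields[1]] = int(fields[9])
            except ValueError:
                continue  # raw value in a format this sketch does not handle
    return attrs

def flagged(attrs):
    """Names of watched attributes whose raw value is above zero."""
    return sorted(name for name in WATCH if attrs.get(name, 0) > 0)

sample = """
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       4
  9 Power_On_Hours          0x0032   078   078   000    Old_age   Always       -       16642
197 Current_Pending_Sector  0x0032   199   199   000    Old_age   Always       -       384
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
"""

attrs = parse_attributes(sample)
print(flagged(attrs))                            # ['Current_Pending_Sector', 'Reallocated_Sector_Ct']
print(round(attrs["Power_On_Hours"] / 8760, 1))  # 1.9 (years powered on)
```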

SO, refresh my poor memory on the swap out procedure on this?

Unassign Bad Drive

Stop / Remove

Start

Stop / Put New Drive in

Start

Assign New Drive

Watch the rebuild for a few hours

(I knew something was wrong when the parity check took 18 hours)


February 2, 2013

You have an extra start and stop in the middle. This is only needed to use the same drive as its own replacement.


February 3, 2013

I would say...

Stop - (power down) / Remove (failing drive, or one to be otherwise replaced)

- install new drive also at this point (if available on-hand)

Start - (power back up)

- unRAID can run in a limited mode now running from PARITY with full access to original data

- removed drive will show as missing...

- - At this point you can either use unRAID till a replacement drive arrives if you did not have one,

- - - or- continue with the following steps now...

* * now is a good time to use the preclear_disk.sh if not already done, on the replacement drive.

Stop - (ARRAY ONLY do not power down again here)

Assign New Drive

Start - (hopefully you precleared the drive first... right?)

Watch the rebuild for a few hours

- or long forced preclear process begins FIRST with an unavailable unRAID

- array on the network... before rebuild can start. :-(

Looking at your log and SMART data, I would say you have media problems on the drive, and I suspect the errors in the log resulted from the drive timing out while it spent excessive time on the bad sectors it has been encountering and fixing. I would not trust the drive with anything that is not replaceable at this point. If you are interested in details on the drive, I would suggest running the WD diagnostic tools and doing an advanced test battery on it, then try to get an RMA approval and a swap-out replacement from Western Digital.

NOTE:

I love the preclear_disk.sh script! It gives me much reassurance and peace of mind knowing

that a new (or used) drive looks like it is really healthy before I start to rely on it in unRAID.

Then of course the monthly parity checks also help me and my DATA stay happy! :-)


February 3, 2013

Start - (power back up)

- unRAID can run in a limited mode now running from PARITY with full access to original data

- removed drive will show as missing...

Watch the rebuild for a few hours

- or long forced preclear process begins FIRST with an unavailable unRAID

- array on the network... before rebuild can start. :-(

The array should be available throughout the rebuild process, drive precleared or not. The lengthy array unavailability only happens when adding an unprecleared drive to an additional slot in an already parity-protected array. When rebuilding, the array is simply writing the data being emulated from parity back to the drive; it doesn't care whether the drive is already full of zeros or not.
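
To illustrate the point, here is a toy sketch of single XOR parity over a few 4-byte "disks". The data and the helper name `xor_blocks` are made up, and this is a model of the idea rather than unRAID's actual md driver code. It shows why a rebuild just overwrites whatever is on the replacement drive, while a drive added to a brand-new slot has to be all zeros so the existing parity stays valid.

```python
from functools import reduce

def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# Three toy data disks and the parity computed across them.
data = [b"\x11\x22\x33\x44", b"\xaa\xbb\xcc\xdd", b"\x01\x02\x03\x04"]
parity = xor_blocks(data)

# disk1 is pulled: its contents are emulated from parity plus the survivors,
# and the rebuild writes exactly these bytes to the replacement drive,
# regardless of what that drive held before.
rebuilt = xor_blocks([data[1], data[2], parity])
print(rebuilt == data[0])   # True

# Adding a drive to a NEW slot: only an all-zero drive leaves parity unchanged,
# which is why an unprecleared drive has to be cleared first in that case.
zeros = bytes(4)
print(xor_blocks(data + [zeros]) == parity)   # True
```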


February 3, 2013

Thanks for the info... I guess I just have not looked into the specifics of the rebuild process for a LONG time now. I was thinking that the drive needed to be cleared/zeroed and have the signature written before a rebuild, so unRAID would force that if the drive was not already precleared. It has been FAR too long for me to remember what happened when I was doing my original tests to determine how unRAID behaves under drive failures and rebuilds, back when I was deciding whether to switch machines over to unRAID and make it my primary, secondary, and tertiary storage method of choice.

:-)


February 4, 2013

Author

all better, took about 5 hours to rebuild, all other drives look good.. my luck must be holding.. the one I pulled... still in warranty
