My first read errors from my array



I'm starting to feel the age of my array (which isn't all that old - 2 years). I am seeing multiple read errors, which I've never had before on any of these drives: the parity drive has 976, and two separate data drives have 113 and 492. I literally have until March 11th to RMA them if I need to go that route, but I can't be without all three at the same time, and I don't particularly want to go purchase three new drives either  :-\

 

Things to note:

  • These drives are part of a 5 disk array + parity + cache drive (7 total)
  • sdg, sdd, and sdf are affected
  • sdf and sdg do not show Temp (just a *)
  • About a month ago I put in a new CPU/mobo/RAM and upgraded to unRAID 6
  • The last parity check was performed 03-04-2015 with 0 errors and averaged 71.7 MB/sec

 

SMART logs:

 

# cat smart-sdf-03082015.txt

smartctl 6.2 2013-07-26 r3841 [x86_64-linux-3.17.4-unRAID] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor:               /6:0:0:0
Product:              
User Capacity:        600,332,565,813,390,450 bytes [600 PB]
Logical block size:   774843950 bytes
scsiModePageOffset: response length too short, resp_len=47 offset=50 bd_len=46
scsiModePageOffset: response length too short, resp_len=47 offset=50 bd_len=46
>> Terminate command early due to bad response to IEC mode page
A mandatory SMART command failed: exiting. To continue, add one or more '-T permissive' options.

 

# cat smart-sdg-03082015.txt

smartctl 6.2 2013-07-26 r3841 [x86_64-linux-3.17.4-unRAID] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor:               /6:0:1:0
Product:              
User Capacity:        600,332,565,813,390,450 bytes [600 PB]
Logical block size:   774843950 bytes
Physical block size:  1549687900 bytes
Lowest aligned LBA:   14896
scsiModePageOffset: response length too short, resp_len=47 offset=50 bd_len=46
scsiModePageOffset: response length too short, resp_len=47 offset=50 bd_len=46
>> Terminate command early due to bad response to IEC mode page
A mandatory SMART command failed: exiting. To continue, add one or more '-T permissive' options.
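Per smartctl's own suggestion, a permissive retry will sometimes pull a report out of a drive in this state, and forcing the SAT (SCSI-to-ATA translation) device type can help when a controller is mangling the translation. A sketch, using the device name from the filenames above (and likewise for sdg):

# smartctl -a -T permissive /dev/sdf
# smartctl -a -d sat /dev/sdf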

 

# cat smart-sdd-03082015.txt

smartctl 6.2 2013-07-26 r3841 [x86_64-linux-3.17.4-unRAID] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Seagate Barracuda 7200.14 (AF)
Device Model:     ST3000DM001-1CH166
Serial Number:    Z1F2269Z
LU WWN Device Id: 5 000c50 04f673cfa
Firmware Version: CC24
User Capacity:    3,000,592,982,016 bytes [3.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    7200 rpm
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS T13/1699-D revision 4
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Sun Mar  8 23:19:37 2015 CDT

==> WARNING: A firmware update for this drive may be available,
see the following Seagate web pages:
http://knowledge.seagate.com/articles/en_US/FAQ/207931en
http://knowledge.seagate.com/articles/en_US/FAQ/223651en

SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00)	Offline data collection activity
				was never started.
				Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0)	The previous self-test routine completed
				without error or no self-test has ever 
				been run.
Total time to complete Offline 
data collection: 		(  575) seconds.
Offline data collection
capabilities: 			 (0x73) SMART execute Offline immediate.
				Auto Offline data collection on/off support.
				Suspend Offline collection upon new
				command.
				No Offline surface scan supported.
				Self-test supported.
				Conveyance Self-test supported.
				Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
				power-saving mode.
				Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
				General Purpose Logging supported.
Short self-test routine 
recommended polling time: 	 (   1) minutes.
Extended self-test routine
recommended polling time: 	 ( 334) minutes.
Conveyance self-test routine
recommended polling time: 	 (   2) minutes.
SCT capabilities: 	       (0x3085)	SCT Status supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   107   079   006    Pre-fail  Always       -       129433296
  3 Spin_Up_Time            0x0003   094   094   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   094   094   020    Old_age   Always       -       6155
  5 Reallocated_Sector_Ct   0x0033   085   085   010    Pre-fail  Always       -       19592
  7 Seek_Error_Rate         0x000f   069   060   030    Pre-fail  Always       -       9662636
  9 Power_On_Hours          0x0032   081   081   000    Old_age   Always       -       16784
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   098   098   020    Old_age   Always       -       2467
183 Runtime_Bad_Block       0x0032   061   061   000    Old_age   Always       -       39
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   098   098   000    Old_age   Always       -       2
188 Command_Timeout         0x0032   100   097   000    Old_age   Always       -       9 9 11
189 High_Fly_Writes         0x003a   098   098   000    Old_age   Always       -       2
190 Airflow_Temperature_Cel 0x0022   070   058   045    Old_age   Always       -       30 (Min/Max 21/35)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   099   099   000    Old_age   Always       -       2439
193 Load_Cycle_Count        0x0032   086   086   000    Old_age   Always       -       28902
194 Temperature_Celsius     0x0022   030   042   000    Old_age   Always       -       30 (0 14 0 0 0)
197 Current_Pending_Sector  0x0012   099   099   000    Old_age   Always       -       168
198 Offline_Uncorrectable   0x0010   099   099   000    Old_age   Offline      -       168
199 UDMA_CRC_Error_Count    0x003e   200   193   000    Old_age   Always       -       179
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       1481h+36m+09.376s
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       9092801864
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       210336716809

SMART Error Log Version: 1
ATA Error Count: 2
CR = Command Register [HEX]
FR = Features Register [HEX]
SC = Sector Count Register [HEX]
SN = Sector Number Register [HEX]
CL = Cylinder Low Register [HEX]
CH = Cylinder High Register [HEX]
DH = Device/Head Register [HEX]
DC = Device Command Register [HEX]
ER = Error register [HEX]
ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 2 occurred at disk power-on lifetime: 16675 hours (694 days + 19 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 00 ff ff ff ef 00      07:03:05.956  READ DMA EXT
  25 00 00 ff ff ff ef 00      07:03:04.851  READ DMA EXT
  25 00 00 ff ff ff ef 00      07:03:04.738  READ DMA EXT
  25 00 00 ff ff ff ef 00      07:03:04.702  READ DMA EXT
  25 00 00 ff ff ff ef 00      07:03:04.648  READ DMA EXT

Error 1 occurred at disk power-on lifetime: 16675 hours (694 days + 19 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 00 ff ff ff ef 00      07:01:47.102  READ DMA EXT
  25 00 00 ff ff ff ef 00      07:01:47.057  READ DMA EXT
  25 00 00 ff ff ff ef 00      07:01:46.728  READ DMA EXT
  25 00 00 ff ff ff ef 00      07:01:46.447  READ DMA EXT
  25 00 00 ff ff ff ef 00      07:01:45.946  READ DMA EXT

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%      9630         -

SMART Selective self-test log data structure revision number 1
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
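Worth noting: the self-test log above shows the last short test finished at 9,630 power-on hours, and the drive is now at 16,784, so a fresh test would say more. A sketch of kicking off an extended test, which per the report takes about 334 minutes:

# smartctl -t long /dev/sdd
# smartctl -l selftest /dev/sdd

The second command reads back the self-test log once it finishes.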

 

I just received this error:

unRAID Disk 3 SMART message: 08-03-2015 23:36

Notice: Disk 3 passed SMART health check

ST3000DM001-1CH166_Z1F1X5AR (sdf)

 

And immediately after that:

unRAID Disk 3 SMART failure: 08-03-2015 23:37

Alert: Disk 3 failed SMART health check

ST3000DM001-1CH166_Z1F1X5AR (sdf)

 

Which way should I jump? I don't want to lose any data, but I also don't want to pay extra for more drives. I don't have any red balls, and this was literally found tonight thanks to my server emailing me  :D



Haha, somehow I missed that one.. maybe I should've mentioned I was beta-testing some new drives  ::)

 

So, when stopping the array I was alerted "Too many wrong and/or missing disks!"

 

No device is showing for Parity0 and Disk3; this is a little worrisome since they are both showing in a fault state under the dashboard as well.

This suggests the drives have dropped offline.  That would be consistent with being unable to get sensible SMART reports for them.
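One quick way to confirm a dropout (commands assumed, not part of the original reply) is the kernel log, which records the link resets and disconnects as they happen:

# dmesg | grep -iE 'ata[0-9]+|sd[dfg]' | tail -n 50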

 

I'm shutting it down now and will check all the cabling, then start it back up to see if they're detected again :(

While you are at it, you might want to clear out any dust and check the fans are OK.

Reseated the SATA cables; there were a couple that didn't feel solid when unseating/reseating on the mobo. I'm ordering new cables to rule that out in the future. The drives were then seen, SMART status looks good, and short self-tests on the affected drives passed.
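For reference, a short self-test like that can be run and read back as follows (device name assumed):

# smartctl -t short /dev/sdf
# smartctl -l selftest /dev/sdf

The first command returns immediately; the test runs inside the drive, and the second command shows the result a couple of minutes later.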

 

I started a parity check and after about 5 minutes I got the bad email again :( sdf and sdg aren't seen anymore. I then saw on the main page that there were "946304" errors on both of these drives, for a grand total of 1,892,608. I'm shutting down the server until the new cables come in.

 

sdg popped up as being found, then was gone again about a minute later. I will shuffle some SATA cables around and see if the problem follows the cable, without mounting the drives.

 

While you are at it, you might want to clear out any dust and check the fans are OK.

 

There was a very fine layer of dust (below normal levels, I clean the filters often), and my drives never go above 32C (they only get this high when a parity check is in progress). All fans are working great.

=== START OF INFORMATION SECTION ===
Model Family:     Seagate Barracuda 7200.14 (AF)
Device Model:     ST3000DM001-1CH166
Serial Number:    Z1F2269Z
...
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
...
5 Reallocated_Sector_Ct   0x0033   085   085   010    Pre-fail  Always       -       19592
...
197 Current_Pending_Sector  0x0012   099   099   000    Old_age   Always       -       168
198 Offline_Uncorrectable   0x0010   099   099   000    Old_age   Offline      -       168
...

Definitely toast on this one.


How are the problem drives connected?  If it is to a daughter board rather than the motherboard you might want to check the board is properly seated.

 

These are connected directly to the mobo.

 

ST3000DM001 - those are the drives Backblaze reported a nearly 50% annual failure rate on.

 

Don't waste your time, just get those drives swapped with something that isn't crap as soon as you can.

 

Unfortunately, I didn't know how bad Seagate was when I bought these drives a couple years ago; they were considered a decent model then  :-\ I don't have the funds to replace 6 HDDs right now, but may have to start phasing them out. I wish this issue had come up sooner, when I could've RMA'd one drive at a time, gotten new stock in without losing anything, then sold the old drives.

 

I swapped SATA cables out today on sdf and sdg then performed short SMART tests on them.

 

sdg passes with 0 reallocated sectors, 0 pending and 0 offline_uncorrectable.

sdf.. has 4328 reallocated sectors, 4992 pending and 4992 offline_uncorrectable.

sdd won't show the values now - "Read SMART Data failed: Input/output error"
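As an aside, those three counters can be pulled for all the suspect drives in one go with something like the loop below; on sdd it will simply echo the same I/O error:

# for d in /dev/sdd /dev/sdf /dev/sdg; do echo "== $d =="; smartctl -A $d | grep -E 'Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable'; done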

 

I've had to work long hours recently, and got off "early" (6pm) to get the RMA done today, but Seagate's warranty department closed at 5pm CST  >:( I'm at a loss as to what to do now. Seeing the Backblaze data for those Seagate drives was surprising; I work at a rather large tech company's data center and now know not to buy Seagates, but... those are still some stupidly high numbers.

 

I'll call Seagate first thing in the morning and see if I can get an RMA set up. I'll buy a couple of 4TB drives (suggestions at a decent price point and availability?) and try to move data over before shipping out the dying drives.

 

When putting in the new (4TB) drives, would ZFS be recommended over ReiserFS?


... got off "early" (6pm) to get the RMA done today, but Seagate closed at 5pm CST for their warranty department ...

 

You can do the RMA entirely online -- no need for a phone call and it works 24/7  :)

http://www.seagate.com/support/warranty-and-replacements/

 

Thanks, I called them yesterday morning and moved forward on the RMA for the drives. I purchased a couple 4TB drives for replacements, and am in the process of making a 4TB drive my parity drive. After that is complete, I will work on formatting the other one as ZFS and copying data from the failing drive over to it before sending off to RMA land. When the "new" replacement drives come in I will start transferring data over and upgrade to ZFS using the method in the sticky.
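For the copy itself, an rsync along these lines is the usual approach; the mount points below are hypothetical and depend on which slots the old and new disks occupy:

# rsync -avP /mnt/disk3/ /mnt/disk4/

The -a flag preserves permissions and timestamps, -P shows progress and allows resuming, and the trailing slashes copy the directory contents rather than the directory itself.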

  • 3 weeks later...

I was thinking XFS, sorry about the confusion there.

 

I've got a scary situation going on now. I have the 2 new drives in (upgraded to 4TB for the 2 failing drives), ran preclear, and did the whole bit on creating a new parity drive; then, once that was complete, the 3TB data drive died. After I put in the new drive to replace the data drive, it was rebuilt from parity (which I hope is correct, but it's all I have to go on now) and worked smoothly.

 

Now that I've completed that, parity is valid, yet in Main > Array Devices I see that the drive has reiserfs but shows Unformatted under "Free", and under Array Operation it shows this drive as "Unformatted disk present". What's going on here? I had to start the array in order to perform the Data Rebuild, and now I can't stop the array and take it off-line; if I check the box and select Stop, the page just refreshes, ready for me to do it again.

 

Just noticed something else interesting: when checking Array Devices, the other disks are showing 0B Used and 0B Free. Dang, and my shares aren't showing up. I think I just lost everything...

[Screenshot: Main > Array Devices, 2015-03-31]


A disk showing as 'unformatted' does not necessarily mean that is the case - it is a generalised message that unRAID displays if it cannot mount the disk.  Often the cause is corruption of the underlying file system (which is not unlikely if a write to the disk failed) and this is nearly always recoverable using the appropriate recovery tools.
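One way to see the real reason (rather than the generic GUI message) is to attempt the mount by hand and read the kernel's complaint; the md device number here is hypothetical:

# mkdir -p /mnt/test
# mount -o ro /dev/md1 /mnt/test
# dmesg | tail

If the read-only mount fails, the last few kernel messages usually name the actual file system error.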

 

Also, what version of unRAID are you using?  I thought the message had changed to 'unmountable' in the latest beta release.

 



Also, what version of unRAID are you using?  I thought the message had changed to 'unmountable' in the latest beta release.

 

I'm running 6.0-beta12.

 

Sorry for the late response, I haven't been home enough to work on the server. What would the appropriate recovery tools be? Will I need to take out the drives, mount new ones, and copy data over? What's the best route to go here?

OK - any reason you are not on beta 14b (which is about to be superseded anyway)?

 

The recovery tools depend on what file system you are using on the disks.  For Reiserfs it would be reiserfsck and for XFS it would be xfs_repair.  Not sure about BTRFS as I am not using that.
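For reference, both tools are normally run against the md device (so parity stays in sync) with the array started in maintenance mode; the device number below is hypothetical, and the non-destructive check passes come first:

# reiserfsck --check /dev/md1
# xfs_repair -n /dev/md1

reiserfsck --check only reports problems and tells you which repair option to run next; xfs_repair -n is a no-modify dry run.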

