SMART errors

SP67 · March 10, 2022

Hi,

This morning the server reported some SMART errors for a 2 TB drive I use as a torrent download cache before moving data to the array (I've read that this reduces wear on the array).

The errors are:

187 Reported uncorrect 0x0032 096 096 000 Old age Always Never 4

197 Current pending sector 0x0012 100 100 000 Old age Always Never 8

198 Offline uncorrectable 0x0010 100 100 000 Old age Offline Never 8

I've read online that I might be able to ignore the errors as the drive will just stop using those sectors, but there didn't seem to be much consensus about it.

Any suggestions?

Thanks

Full smart report:

smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.14.15-Unraid] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Seagate Barracuda 7200.14 (AF)
Device Model:     ST2000DM001-9YN164
Serial Number:    
LU WWN Device Id: 
Firmware Version: CC4B
User Capacity:    2,000,398,934,016 bytes [2.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    7200 rpm
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS T13/1699-D revision 4
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Thu Mar 10 12:32:32 2022 CET

==> WARNING: A firmware update for this drive may be available,
see the following Seagate web pages:
http://knowledge.seagate.com/articles/en_US/FAQ/207931en
http://knowledge.seagate.com/articles/en_US/FAQ/223651en

SMART support is: Available - device has SMART capability.
SMART support is: Enabled
AAM feature is:   Unavailable
APM level is:     128 (minimum power consumption without standby)
Rd look-ahead is: Enabled
Write cache is:   Enabled
DSN feature is:   Unavailable
ATA Security is:  Disabled, frozen [SEC2]
Wt Cache Reorder: Unavailable

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00)	Offline data collection activity
					was never started.
					Auto Offline Data Collection: Disabled.
Self-test execution status:      ( 121)	The previous self-test completed having
					the read element of the test failed.
Total time to complete Offline 
data collection: 		(  592) seconds.
Offline data collection
capabilities: 			 (0x73) SMART execute Offline immediate.
					Auto Offline data collection on/off support.
					Suspend Offline collection upon new
					command.
					No Offline surface scan supported.
					Self-test supported.
					Conveyance Self-test supported.
					Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
					power-saving mode.
					Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
					General Purpose Logging supported.
Short self-test routine 
recommended polling time: 	 (   1) minutes.
Extended self-test routine
recommended polling time: 	 ( 247) minutes.
Conveyance self-test routine
recommended polling time: 	 (   2) minutes.
SCT capabilities: 	       (0x3085)	SCT Status supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
  1 Raw_Read_Error_Rate     POSR--   118   099   006    -    170591376
  3 Spin_Up_Time            PO----   093   092   000    -    0
  4 Start_Stop_Count        -O--CK   097   097   020    -    3850
  5 Reallocated_Sector_Ct   PO--CK   100   100   036    -    0
  7 Seek_Error_Rate         POSR--   076   060   030    -    45291769
  9 Power_On_Hours          -O--CK   091   091   000    -    8254
 10 Spin_Retry_Count        PO--C-   100   100   097    -    0
 12 Power_Cycle_Count       -O--CK   097   097   020    -    3531
183 Runtime_Bad_Block       -O--CK   100   100   000    -    0
184 End-to-End_Error        -O--CK   100   100   099    -    0
187 Reported_Uncorrect      -O--CK   096   096   000    -    4
188 Command_Timeout         -O--CK   100   099   000    -    1 3 3
189 High_Fly_Writes         -O-RCK   096   096   000    -    4
190 Airflow_Temperature_Cel -O---K   062   051   045    -    38 (Min/Max 21/45 #1)
191 G-Sense_Error_Rate      -O--CK   100   100   000    -    0
192 Power-Off_Retract_Count -O--CK   100   100   000    -    191
193 Load_Cycle_Count        -O--CK   001   001   000    -    242985
194 Temperature_Celsius     -O---K   038   049   000    -    38 (128 0 0 0 0)
197 Current_Pending_Sector  -O--C-   100   100   000    -    8
198 Offline_Uncorrectable   ----C-   100   100   000    -    8
199 UDMA_CRC_Error_Count    -OSRCK   200   200   000    -    0
240 Head_Flying_Hours       ------   100   253   000    -    4183h+31m+57.974s
241 Total_LBAs_Written      ------   100   253   000    -    183471166860639
242 Total_LBAs_Read         ------   100   253   000    -    81944486528248
                            ||||||_ K auto-keep
                            |||||__ C event count
                            ||||___ R error rate
                            |||____ S speed/performance
                            ||_____ O updated online
                            |______ P prefailure warning

General Purpose Log Directory Version 1
SMART           Log Directory Version 1 [multi-sector log support]
Address    Access  R/W   Size  Description
0x00       GPL,SL  R/O      1  Log Directory
0x01           SL  R/O      1  Summary SMART error log
0x02           SL  R/O      5  Comprehensive SMART error log
0x03       GPL     R/O      5  Ext. Comprehensive SMART error log
0x06           SL  R/O      1  SMART self-test log
0x07       GPL     R/O      1  Extended self-test log
0x09           SL  R/W      1  Selective self-test log
0x10       GPL     R/O      1  NCQ Command Error log
0x11       GPL     R/O      1  SATA Phy Event Counters log
0x21       GPL     R/O      1  Write stream error log
0x22       GPL     R/O      1  Read stream error log
0x80-0x9f  GPL,SL  R/W     16  Host vendor specific log
0xa1       GPL,SL  VS      20  Device vendor specific log
0xa2       GPL     VS    4496  Device vendor specific log
0xa8       GPL,SL  VS      20  Device vendor specific log
0xa9       GPL,SL  VS       1  Device vendor specific log
0xab       GPL     VS       1  Device vendor specific log
0xb0       GPL     VS    5067  Device vendor specific log
0xbd       GPL     VS     512  Device vendor specific log
0xbe-0xbf  GPL     VS   65535  Device vendor specific log
0xc0       GPL,SL  VS       1  Device vendor specific log
0xe0       GPL,SL  R/W      1  SCT Command/Status
0xe1       GPL,SL  R/W      1  SCT Data Transfer

SMART Extended Comprehensive Error Log Version: 1 (5 sectors)
Device Error Count: 4
	CR     = Command Register
	FEATR  = Features Register
	COUNT  = Count (was: Sector Count) Register
	LBA_48 = Upper bytes of LBA High/Mid/Low Registers ]  ATA-8
	LH     = LBA High (was: Cylinder High) Register    ]   LBA
	LM     = LBA Mid (was: Cylinder Low) Register      ] Register
	LL     = LBA Low (was: Sector Number) Register     ]
	DV     = Device (was: Device/Head) Register
	DC     = Device Control Register
	ER     = Error register
	ST     = Status register
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 4 [3] occurred at disk power-on lifetime: 8245 hours (343 days + 13 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  40 -- 51 00 00 00 00 2e 36 f7 38 00 00  Error: WP at LBA = 0x2e36f738 = 775354168

  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  61 00 00 00 08 00 00 3c a1 01 60 40 00  1d+05:44:31.301  WRITE FPDMA QUEUED
  61 00 00 00 08 00 00 6f 00 d0 68 40 00  1d+05:44:31.083  WRITE FPDMA QUEUED
  61 00 00 05 20 00 00 6e f4 25 f8 40 00  1d+05:44:31.081  WRITE FPDMA QUEUED
  61 00 00 00 08 00 00 3c a1 01 58 40 00  1d+05:44:31.081  WRITE FPDMA QUEUED
  61 00 00 00 48 00 00 3c 93 15 10 40 00  1d+05:44:31.081  WRITE FPDMA QUEUED

Error 3 [2] occurred at disk power-on lifetime: 8245 hours (343 days + 13 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  40 -- 51 00 00 00 00 2e 36 f7 38 00 00  Error: WP at LBA = 0x2e36f738 = 775354168

  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  61 00 00 08 40 00 00 3f 2a 19 80 40 00  1d+05:44:28.089  WRITE FPDMA QUEUED
  61 00 00 00 08 00 00 3f 40 d9 a8 40 00  1d+05:44:28.089  WRITE FPDMA QUEUED
  61 00 00 04 60 00 00 3c 93 0b 38 40 00  1d+05:44:28.088  WRITE FPDMA QUEUED
  61 00 00 00 08 00 00 3c a1 01 50 40 00  1d+05:44:28.088  WRITE FPDMA QUEUED
  60 00 00 00 08 00 00 2e 36 f7 38 40 00  1d+05:44:28.085  READ FPDMA QUEUED

Error 2 [1] occurred at disk power-on lifetime: 8245 hours (343 days + 13 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  40 -- 51 00 00 00 00 2e 36 f7 38 00 00  Error: WP at LBA = 0x2e36f738 = 775354168

  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  61 00 00 04 c0 00 00 16 e6 3c 48 40 00  1d+05:44:25.346  WRITE FPDMA QUEUED
  61 00 00 00 08 00 00 17 1d a5 78 40 00  1d+05:44:25.346  WRITE FPDMA QUEUED
  61 00 00 04 00 00 00 6e f4 1a f8 40 00  1d+05:44:25.346  WRITE FPDMA QUEUED
  61 00 00 00 08 00 00 6f 00 d0 58 40 00  1d+05:44:25.346  WRITE FPDMA QUEUED
  60 00 00 00 08 00 00 2e 36 f7 38 40 00  1d+05:44:25.119  READ FPDMA QUEUED

Error 1 [0] occurred at disk power-on lifetime: 8245 hours (343 days + 13 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  40 -- 51 00 00 00 00 2e 36 f7 38 00 00  Error: UNC at LBA = 0x2e36f738 = 775354168

  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  60 00 00 08 78 00 00 2e 37 1f 30 40 00  1d+05:44:20.856  READ FPDMA QUEUED
  60 00 00 0a 00 00 00 2e 37 15 30 40 00  1d+05:44:20.855  READ FPDMA QUEUED
  60 00 00 00 08 00 00 2e 3f 31 20 40 00  1d+05:44:20.855  READ FPDMA QUEUED
  60 00 00 03 80 00 00 2e 37 11 a8 40 00  1d+05:44:20.855  READ FPDMA QUEUED
  60 00 00 0a 00 00 00 2e 37 07 a8 40 00  1d+05:44:20.855  READ FPDMA QUEUED

SMART Extended Self-test Log Version: 1 (1 sectors)
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure       90%      8249         775354168
# 2  Short offline       Completed without error       00%      1536         -
# 3  Short offline       Completed without error       00%       641         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

SCT Status Version:                  3
SCT Version (vendor specific):       522 (0x020a)
Device State:                        Active (0)
Current Temperature:                    37 Celsius
Power Cycle Min/Max Temperature:     21/45 Celsius
Lifetime    Min/Max Temperature:      5/49 Celsius
Under/Over Temperature Limit Count:   0/0

SCT Data Table command not supported

SCT Error Recovery Control command not supported

Device Statistics (GP/SMART Log 0x04) not supported

Pending Defects log (GP Log 0x0c) not supported

SATA Phy Event Counters (GP Log 0x11)
ID      Size     Value  Description
0x000a  2            3  Device-to-host register FISes sent due to a COMRESET
0x0001  2            0  Command failed due to ICRC error
0x0003  2            0  R_ERR response for device-to-host data FIS
0x0004  2            0  R_ERR response for host-to-device data FIS
0x0006  2            0  R_ERR response for device-to-host non-data FIS
0x0007  2            0  R_ERR response for host-to-device non-data FIS
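The arithmetic in the report above can be re-checked with a short sketch. It rebuilds the failing LBA from the six LBA register bytes in the error log rows, confirms the power-on-hours conversion, and tests whether that LBA falls on a 4096-byte physical sector boundary. The helper name `lba_from_registers` is made up for illustration, and reading the pending count of 8 as a single physical sector is only an assumption that happens to fit the numbers, not something the drive reports.

```python
# Values copied from the smartctl report above.
LOGICAL_SECTOR_BYTES = 512
PHYSICAL_SECTOR_BYTES = 4096
PENDING_SECTORS = 8          # raw value of attribute 197
SELF_TEST_FIRST_ERROR_LBA = 775354168


def lba_from_registers(hex_bytes):
    """Join LBA register bytes (most significant first) into one integer.

    Hypothetical helper: it assumes the six bytes are listed in the order
    the error log prints them (LBA_48 bytes, then LH, LM, LL).
    """
    return int("".join(hex_bytes), 16)


# Register bytes from the "After command completion" rows: 00 00 2e 36 f7 38
error_lba = lba_from_registers(["00", "00", "2e", "36", "f7", "38"])

# Logical sectors per physical sector on this drive (4096 / 512).
sectors_per_physical = PHYSICAL_SECTOR_BYTES // LOGICAL_SECTOR_BYTES

# Power-on hours at the time of the errors, split into days and hours.
days, hours = divmod(8245, 24)

print("error LBA:", error_lba)
print("same LBA as the failed self-test:", error_lba == SELF_TEST_FIRST_ERROR_LBA)
print("aligned to a physical sector:", error_lba % sectors_per_physical == 0)
print("pending count equals one physical sector:", PENDING_SECTORS == sectors_per_physical)
print("power-on time:", days, "days +", hours, "hours")
```

If the checks hold, all four logged errors and the failed extended self-test point at the same spot on the disk, which would be consistent with one bad physical sector rather than eight scattered ones.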

JorgeB · March 10, 2022

Pending sectors usually can't be ignored unless they are false positives, and these don't appear to be, since the SMART self-test failed. You can do a full disk write to see whether they return to zero and don't show up again soon after.
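For anyone who wants to watch these counters over time, here is a minimal Python sketch that pulls the raw values out of attribute rows like the ones in the report and flags the ones that are non-zero. The watch list (5, 187, 197, 198) and the function names are assumptions chosen for illustration, not an official rule from smartmontools or Unraid, and the sample text is just a few rows copied from the report.

```python
import re

# Attribute IDs to watch. This list is an assumption for illustration only.
WATCHED_IDS = {5, 187, 197, 198}

# Matches rows such as:
# 197 Current_Pending_Sector  -O--C-   100   100   000    -    8
# Only the leading integer of the raw field is captured.
ROW = re.compile(
    r"^\s*(\d+)\s+(\S+)\s+([A-Z-]{6})\s+\d+\s+\d+\s+\d+\s+\S+\s+(\d+)"
)

SAMPLE = """\
  5 Reallocated_Sector_Ct   PO--CK   100   100   036    -    0
187 Reported_Uncorrect      -O--CK   096   096   000    -    4
188 Command_Timeout         -O--CK   100   099   000    -    1 3 3
197 Current_Pending_Sector  -O--C-   100   100   000    -    8
198 Offline_Uncorrectable   ----C-   100   100   000    -    8
"""


def parse_attributes(report):
    """Return {attribute id: (name, raw value)} for rows that look like attributes."""
    attrs = {}
    for line in report.splitlines():
        match = ROW.match(line)
        if match:
            attrs[int(match.group(1))] = (match.group(2), int(match.group(4)))
    return attrs


def flag_attributes(attrs):
    """Keep only watched attributes whose raw value is above zero."""
    return {
        attr_id: value
        for attr_id, value in attrs.items()
        if attr_id in WATCHED_IDS and value[1] > 0
    }


flagged = flag_attributes(parse_attributes(SAMPLE))
for attr_id in sorted(flagged):
    name, raw = flagged[attr_id]
    print(f"{attr_id} {name} raw={raw}")
```

Running the same check again after a full disk write would show whether the pending count really dropped back to zero.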

SP67 · March 10, 2022

Thanks! Can I do that directly on unRAID?

Although it seems I might be looking at buying another drive...

JorgeB · March 10, 2022

7 minutes ago, SP67 said:

Can I do that directly on unRAID?

You can, with the pre-clear plugin or docker. The disk must be unassigned, and any data on it will be deleted.

SP67 · March 11, 2022

Reported uncorrect has grown from 4 to 10 in less than 24h. The drive is probably on its last legs...

For what it's worth, I've found that this drive is from the 7200.14 series from Seagate, which had early-death problems that were supposedly fixed with a later firmware. I never saw this update so the drive has been using the factory firmware since I bought it.

trurl · March 11, 2022

I would replace it ASAP just because of the pending sectors; then you can work on seeing whether it is worth keeping.

SP67 · March 11, 2022

Can I stop the array, remove the drive and add a new one? Or do I need to shut down the server?

JonathanM · March 11, 2022

5 minutes ago, SP67 said:

Can I stop the array, remove the drive and add a new one? Or do I need to shut down the server?

Depends on your hardware. If everything is compatible and working properly, stopping the array should be enough.

However, it's much safer to power down, and it doesn't really take that much more time.

Your call, but I'd power down, even if I was sure my hardware could handle a hot swap.

trurl · March 11, 2022

2 hours ago, SP67 said:

remove the drive and add a new one

Just to make sure there is no confusion about "adding" disks. You will be replacing a disk not adding one. You will assign the replacement disk to the same slot as the disk you are replacing.

SP67 · March 11, 2022

Ok, so I shut the server down, added a 4 TB disk, moved the contents of the failing drive to the new one and added the new disk to the cache pool. Then I shut the server down again and removed the old drive.

So far so good, everything is going well.

thanks!

trurl · March 12, 2022

14 hours ago, SP67 said:

Ok, so I shut the server down, added a 4 TB disk, moved the contents of the failing drive to the new one and added the new disk to the cache pool. Then I shut the server down again and removed the old drive.

@SP67

Not entirely clear and in any case not what I was recommending.

Do you mean you moved the data from the failing drive to an Unassigned new drive, then assigned that Unassigned drive to cache? And then you shrunk the array by removing the old drive with New Config and rebuilt parity? Seems needlessly complicated but if this is what you did then maybe everything is OK. If this is not what you did then please explain in more detail because it's not clear that everything is OK.

What I had in mind was simply replacing the failing drive with a new drive, assigning that new drive to the slot of the failing drive, and letting it rebuild from parity. Parity can rebuild the contents of a failing drive to a new drive even if you have already thrown the failing drive away. This is the whole reason you have parity.

trurl · March 12, 2022

1 hour ago, trurl said:

shrunk the array by removing the old drive with New Config and rebuilt parity

@SP67

If you removed a drive instead of rebuilding it, and then didn't rebuild parity without the removed drive, then your parity is invalid.

Diagnostics might clear up some of my concerns.

SP67 · March 12, 2022

Yeah, but the failing drive was part of a cache pool (I have one SSD for app data and one HDD for torrent downloads). So AFAIK the parity would not have worked in this case.

Copying the data from the old drive was just to avoid having to download what hadn’t already moved to the array. If this is not the proper way to do it, please correct me as I’m still learning.

trurl · March 12, 2022

OK, I assumed you were working with array disks since I didn't have any diagnostics to go on and it was an HDD.

Still unclear about this part though

1 minute ago, SP67 said:

part of a cache pool (I have one SSD for app data and one HDD for torrent downloads)

Do you really mean these are separate pools? Because having both in the same pool wouldn't allow you to put different things on each and the SSD could only work at the speed of the HDD if these were in the same pool.

trurl · March 12, 2022

OK, I dug up some of your old diagnostics and it looks like these are separate, single-disk pools. Might have been better to make them XFS if you don't plan to have a multidisk pool.

SP67 · March 12, 2022

I’m attaching a capture of my array to see if it helps clarify things. Thanks for the interest.

Edited March 12, 2022 by SP67

SP67 · March 12, 2022

1 minute ago, trurl said:

OK, I dug up some of your old diagnostics and it looks like these are separate, single-disk pools. Might have been better to make them XFS if you don't plan to have a multidisk pool.

How should I do that? Or is it too late?
