Do I need to replace my Cache-Drive?

December 26, 2022

If this is the wrong sub-forum, please point me in the right direction. Over a year ago, Unraid reported a SMART error for one of the SSDs in my cache pool. I acknowledged the warning, and no new errors have been reported since. Now I have noticed that I can no longer run manual short or extended SMART tests (when I click the button, nothing happens).

There is no really important data on my cache pool, which runs as a btrfs raid5. Still, I would like to know whether I should replace that drive. I was planning to upgrade my cache soon.

smartctl 7.3 2022-02-28 r5338 [x86_64-linux-5.19.17-Unraid] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Crucial/Micron RealSSD m4/C400/P400
Device Model:     M4-CT256M4SSD2
Serial Number:    0000000013010925333F
LU WWN Device Id: 5 00a075 10925333f
Firmware Version: 040H
User Capacity:    256,060,514,304 bytes [256 GB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    Solid State Device
Form Factor:      2.5 inches
TRIM Command:     Available, deterministic
Device is:        In smartctl database 7.3/5417
ATA Version is:   ACS-2, ATA8-ACS T13/1699-D revision 6
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Mon Dec 26 16:19:15 2022 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
AAM feature is:   Unavailable
APM level is:     254 (maximum performance)
Rd look-ahead is: Enabled
Write cache is:   Enabled
DSN feature is:   Unavailable
ATA Security is:  Disabled, NOT FROZEN [SEC1]
Wt Cache Reorder: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: FAILED!
Drive failure expected in less than 24 hours. SAVE ALL DATA.
See vendor-specific Attribute list for failed Attributes.

General SMART Values:
Offline data collection status:  (0x80)	Offline data collection activity
					was never started.
					Auto Offline Data Collection: Enabled.
Self-test execution status:      (  64)	The previous self-test completed having
					a test element that failed and the test
					element that failed is not known.
Total time to complete Offline 
data collection: 		( 1190) seconds.
Offline data collection
capabilities: 			 (0x7b) SMART execute Offline immediate.
					Auto Offline data collection on/off support.
					Suspend Offline collection upon new
					command.
					Offline surface scan supported.
					Self-test supported.
					Conveyance Self-test supported.
					Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
					power-saving mode.
					Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
					General Purpose Logging supported.
Short self-test routine 
recommended polling time: 	 (   2) minutes.
Extended self-test routine
recommended polling time: 	 (  19) minutes.
Conveyance self-test routine
recommended polling time: 	 (   3) minutes.
SCT capabilities: 	       (0x003d)	SCT Status supported.
					SCT Error Recovery Control supported.
					SCT Feature Control supported.
					SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
  1 Raw_Read_Error_Rate     POSR-K   001   001   050    NOW  310
  5 Reallocated_Sector_Ct   PO--CK   100   100   010    -    12288
  9 Power_On_Hours          -O--CK   100   100   001    -    38257
 12 Power_Cycle_Count       -O--CK   100   100   001    -    3089
170 Grown_Failing_Block_Ct  PO--CK   100   100   010    -    3
171 Program_Fail_Count      -O--CK   100   100   001    -    78
172 Erase_Fail_Count        -O--CK   100   100   001    -    0
173 Wear_Leveling_Count     PO--CK   082   082   010    -    563
174 Unexpect_Power_Loss_Ct  -O--CK   100   100   001    -    76
181 Non4k_Aligned_Access    -O---K   100   100   001    -    1477 1182 295
183 SATA_Iface_Downshift    -O--CK   100   100   001    -    0
184 End-to-End_Error        PO--CK   100   100   050    -    0
187 Reported_Uncorrect      -O--CK   100   100   001    -    0
188 Command_Timeout         -O--CK   100   100   001    -    0
189 Factory_Bad_Block_Ct    -OSR--   100   100   001    -    82
194 Temperature_Celsius     -O---K   100   100   000    -    0
195 Hardware_ECC_Recovered  -O-RCK   100   100   001    -    93333
196 Reallocated_Event_Count -O--CK   100   100   001    -    3
197 Current_Pending_Sector  -O--CK   100   100   001    -    0
198 Offline_Uncorrectable   ----CK   100   100   001    -    0
199 UDMA_CRC_Error_Count    -O--CK   100   100   001    -    0
202 Perc_Rated_Life_Used    ---RC-   082   082   001    -    18
206 Write_Error_Rate        -OSR--   100   100   001    -    78
                            ||||||_ K auto-keep
                            |||||__ C event count
                            ||||___ R error rate
                            |||____ S speed/performance
                            ||_____ O updated online
                            |______ P prefailure warning

General Purpose Log Directory Version 1
SMART           Log Directory Version 1 [multi-sector log support]
Address    Access  R/W   Size  Description
0x00       GPL,SL  R/O      1  Log Directory
0x01           SL  R/O      1  Summary SMART error log
0x02           SL  R/O     51  Comprehensive SMART error log
0x03       GPL     R/O  16383  Ext. Comprehensive SMART error log
0x04       GPL,SL  R/O    255  Device Statistics log
0x06           SL  R/O      1  SMART self-test log
0x07       GPL     R/O   3449  Extended self-test log
0x09           SL  R/W      1  Selective self-test log
0x10       GPL     R/O      1  NCQ Command Error log
0x11       GPL     R/O      1  SATA Phy Event Counters log
0x80-0x9f  GPL,SL  R/W     16  Host vendor specific log
0xa0       GPL     VS    2000  Device vendor specific log
0xa0       SL      VS     208  Device vendor specific log
0xa1-0xbf  GPL,SL  VS       1  Device vendor specific log
0xc0       GPL     VS      80  Device vendor specific log
0xc1-0xdf  GPL,SL  VS       1  Device vendor specific log
0xe0       GPL,SL  R/W      1  SCT Command/Status
0xe1       GPL,SL  R/W      1  SCT Data Transfer

SMART Extended Comprehensive Error Log Version: 1 (16383 sectors)
No Errors Logged

SMART Extended Self-test Log size 3449 not supported

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%     38257         -
# 2  Short offline       Completed without error       00%     37923         -
# 3  Short offline       Completed without error       00%     37923         -
# 4  Short offline       Completed without error       00%     37923         -
# 5  Extended offline    Completed without error       00%     37898         -
# 6  Extended offline    Completed without error       00%     37897         -
# 7  Extended offline    Completed without error       00%     37897         -
# 8  Extended offline    Completed without error       00%     37897         -
# 9  Short offline       Completed without error       00%     37897         -
#10  Extended offline    Completed without error       00%     33020         -
#11  Short offline       Completed without error       00%     33016         -
#12  Extended offline    Completed without error       00%     33016         -
#13  Extended offline    Completed without error       00%     33016         -
#14  Short offline       Completed without error       00%     29870         -
#15  Vendor (0xff)       Completed without error       00%     27430         -
#16  Short offline       Completed without error       00%     26310         -
#17  Vendor (0xff)       Completed without error       00%     26284         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

SCT Status Version:                  3
SCT Version (vendor specific):       1 (0x0001)
Device State:                        Active (0)
Current Temperature:                     0 Celsius
Power Cycle Min/Max Temperature:     --/ 0 Celsius
Lifetime    Min/Max Temperature:     --/ 0 Celsius

SCT Temperature History Version:     2
Temperature Sampling Period:         10 minutes
Temperature Logging Interval:        10 minutes
Min/Max recommended Temperature:      0/70 Celsius
Min/Max Temperature Limit:           -5/75 Celsius
Temperature History Size (Index):    478 (73)

Index    Estimated Time   Temperature Celsius
  74    2022-12-23 08:40     0  -
 ...    ..(476 skipped).    ..  -
  73    2022-12-26 16:10     0  -

SCT Error Recovery Control:
           Read:     40 (4.0 seconds)
          Write:     40 (4.0 seconds)

Device Statistics (GP Log 0x04)
Page  Offset Size        Value Flags Description
0x01  =====  =               =  ===  == General Statistics (rev 2) ==
0x01  0x008  4            3089  ---  Lifetime Power-On Resets
0x01  0x010  4           38257  ---  Power-on Hours
0x01  0x018  6    104764793610  ---  Logical Sectors Written
0x01  0x020  6       930655381  ---  Number of Write Commands
0x01  0x028  6    186135985241  ---  Logical Sectors Read
0x01  0x030  6      1119502745  ---  Number of Read Commands
0x04  =====  =               =  ===  == General Errors Statistics (rev 1) ==
0x04  0x008  4               0  ---  Number of Reported Uncorrectable Errors
0x04  0x010  4               0  ---  Resets Between Cmd Acceptance and Completion
0x05  =====  =               =  ===  == Temperature Statistics (rev 1) ==
0x05  0x008  1               0  ---  Current Temperature
0x05  0x010  1               0  ---  Average Short Term Temperature
0x05  0x018  1               0  ---  Average Long Term Temperature
0x05  0x020  1               0  ---  Highest Temperature
0x05  0x028  1               0  ---  Lowest Temperature
0x05  0x030  1               0  ---  Highest Average Short Term Temperature
0x05  0x038  1               0  ---  Lowest Average Short Term Temperature
0x05  0x040  1               0  ---  Highest Average Long Term Temperature
0x05  0x048  1               0  ---  Lowest Average Long Term Temperature
0x05  0x050  4               -  ---  Time in Over-Temperature
0x05  0x058  1              70  ---  Specified Maximum Operating Temperature
0x05  0x060  4               -  ---  Time in Under-Temperature
0x05  0x068  1               0  ---  Specified Minimum Operating Temperature
0x06  =====  =               =  ===  == Transport Statistics (rev 1) ==
0x06  0x008  4               0  ---  Number of Hardware Resets
0x06  0x010  4               0  ---  Number of ASR Events
0x06  0x018  4               0  ---  Number of Interface CRC Errors
0x07  =====  =               =  ===  == Solid State Device Statistics (rev 1) ==
0x07  0x008  1               2  N--  Percentage Used Endurance Indicator
                                |||_ C monitored condition met
                                ||__ D supports DSN
                                |___ N normalized value

Pending Defects log (GP Log 0x0c) not supported

SATA Phy Event Counters (GP Log 0x11)
ID      Size     Value  Description
0x0001  4            0  Command failed due to ICRC error
0x000a  4           22  Device-to-host register FISes sent due to a COMRESET


December 27, 2022

Community Expert

Not a good sign, I would replace it.


December 27, 2022

Community Expert
Solution

  5 Reallocated_Sector_Ct   PO--CK   100   100   010    -    12288

I am surprised that you have not had issues previously with this drive. A small number of reallocated sectors is normally OK as long as the number remains stable, but a number this large is definitely a sign of trouble.
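For anyone who wants to check an attribute table like the one above without reading it by eye, here is a minimal Python sketch. It assumes the column layout printed by `smartctl -x` in this thread (ID, name, flags, VALUE, WORST, THRESH, FAIL, RAW_VALUE); the names `SmartAttr`, `parse_attr_line` and `SAMPLE` are made up for illustration. It flags attributes whose normalized value has dropped to or below the threshold, and separately lists non-zero raw counts for the reallocated and pending sector attributes. On this drive the normalized value of Reallocated_Sector_Ct is still 100, so only the raw count shows the problem; the FAILED health status comes from attribute 1.

```python
from typing import List, NamedTuple


class SmartAttr(NamedTuple):
    attr_id: int
    name: str
    value: int   # normalized value
    worst: int
    thresh: int
    fail: str    # "-" or e.g. "NOW"
    raw: str     # kept as text, some raw values contain several numbers


def parse_attr_line(line: str) -> SmartAttr:
    # Split into at most 8 fields so a multi-number raw value stays together.
    parts = line.split(None, 7)
    return SmartAttr(
        attr_id=int(parts[0]),
        name=parts[1],
        value=int(parts[3]),
        worst=int(parts[4]),
        thresh=int(parts[5]),
        fail=parts[6],
        raw=parts[7].strip(),
    )


# A few lines copied from the smartctl output in the first post.
SAMPLE = """\
  1 Raw_Read_Error_Rate     POSR-K   001   001   050    NOW  310
  5 Reallocated_Sector_Ct   PO--CK   100   100   010    -    12288
173 Wear_Leveling_Count     PO--CK   082   082   010    -    563
181 Non4k_Aligned_Access    -O---K   100   100   001    -    1477 1182 295
197 Current_Pending_Sector  -O--CK   100   100   001    -    0
"""

attrs: List[SmartAttr] = [
    parse_attr_line(line) for line in SAMPLE.splitlines() if line.strip()
]

# An attribute counts as failed when its normalized value is <= its threshold.
failed = [a.name for a in attrs if a.value <= a.thresh]

# Raw counts worth watching even while the normalized value still looks fine.
raw_watch = {a.name: a.raw for a in attrs if a.attr_id in (5, 197) and a.raw != "0"}

if __name__ == "__main__":
    print("failed:", failed)
    print("non-zero raw counts:", raw_watch)
```

The raw values are kept as strings on purpose, because some attributes (181 here) pack several numbers into one field.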


January 16, 2023

Author

Just to let you know, in case anyone reads this later: I have replaced the faulty drive by now, but I had to use a backup to rebuild the contents of my cache pool, which is entirely my own fault.

I started with a btrfs raid5 (5 drives) and wanted to remove 2 drives (the faulty one and another small, old SSD; only one at a time, of course) and replace them with a newer, bigger one.

This is how not to do it: I think I messed things up when I replaced the small, healthy drive with the larger SSD (instead of getting rid of the faulty one first). I performed a balance operation (I think it started automatically; the UI does not tell you what Unraid is doing at this point!) and then a scrub, which threw a few hundred errors at me. I performed a second, error-correcting scrub, which left around 100 errors uncorrected. Then I removed the faulty SSD (which had just been involved in a full balance operation, with lots of data not only read off it but also rewritten to it...) and performed a balance again. I was now at my desired configuration (4-drive raid5). The next scrub threw over 9,000,000 errors at me, all uncorrectable. In the logs I found only around 300 faulty files, but even after deleting them I still got around 5 to 6 million errors. At that point I decided that I had seriously screwed things up and started from scratch.

Now my new 4-disk raid5 pool is fine (no more SMART errors, no scrub errors), and thanks to full backups I did not suffer any data loss. (I would not have tried that procedure otherwise.)
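The ordering lesson from this post (deal with the failing drive first, then verify, and only then touch healthy drives) could be written down roughly as below. This is only a sketch under assumptions, not what Unraid does internally and not what was actually run here: the mount point `/mnt/cache` and the device names `/dev/sdX` and `/dev/sdY` are placeholders, `replacement_plan` is a made-up helper, and the script only prints the btrfs commands instead of executing them. On Unraid the pool is normally changed through the GUI, so treat this purely as an illustration of the order.

```python
from typing import List


def replacement_plan(mount: str, failing_dev: str, new_dev: str) -> List[List[str]]:
    """Return the commands, in order, for swapping out one failing pool member."""
    return [
        # 1. Record the per-device error counters before changing anything.
        ["btrfs", "device", "stats", mount],
        # 2. Replace the failing device directly; -r reads from it only
        #    when no other good copy of the data exists.
        ["btrfs", "replace", "start", "-r", failing_dev, new_dev, mount],
        # 3. Check on the progress of the replace operation.
        ["btrfs", "replace", "status", mount],
        # 4. Verify the pool with a scrub (-B keeps it in the foreground)
        #    before removing or swapping any further drive.
        ["btrfs", "scrub", "start", "-B", mount],
    ]


if __name__ == "__main__":
    # Placeholder paths; nothing is executed, the commands are only printed.
    for cmd in replacement_plan("/mnt/cache", "/dev/sdX", "/dev/sdY"):
        print(" ".join(cmd))
```

The point of the order is that the failing SSD is never asked to take part in a full balance, which is what went wrong above.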


Solved by itimpi
