SP67 Posted March 10, 2022

Hi,

This morning the server returned some SMART errors for a 2 TB drive I use as a torrent download cache before moving data to the array (I've read that this reduces wear on the array). The errors are:

187 Reported uncorrect      0x0032  096  096  000  Old age  Always   Never  4
197 Current pending sector  0x0012  100  100  000  Old age  Always   Never  8
198 Offline uncorrectable   0x0010  100  100  000  Old age  Offline  Never  8

I've read online that I might be able to ignore the errors, as the drive will just stop using those sectors, but there didn't seem to be much consensus about it. Any suggestions?

Thanks

Full SMART report:

smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.14.15-Unraid] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Seagate Barracuda 7200.14 (AF)
Device Model:     ST2000DM001-9YN164
Serial Number:
LU WWN Device Id:
Firmware Version: CC4B
User Capacity:    2,000,398,934,016 bytes [2.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    7200 rpm
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS T13/1699-D revision 4
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Thu Mar 10 12:32:32 2022 CET

==> WARNING: A firmware update for this drive may be available,
see the following Seagate web pages:
http://knowledge.seagate.com/articles/en_US/FAQ/207931en
http://knowledge.seagate.com/articles/en_US/FAQ/223651en

SMART support is: Available - device has SMART capability.
SMART support is: Enabled
AAM feature is:   Unavailable
APM level is:     128 (minimum power consumption without standby)
Rd look-ahead is: Enabled
Write cache is:   Enabled
DSN feature is:   Unavailable
ATA Security is:  Disabled, frozen [SEC2]
Wt Cache Reorder: Unavailable

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      ( 121) The previous self-test completed having
                                        the read element of the test failed.
Total time to complete Offline
data collection:                 ( 592) seconds.
Offline data collection
capabilities:                    (0x73) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new command.
                                        No Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        ( 247) minutes.
Conveyance self-test routine
recommended polling time:        (   2) minutes.
SCT capabilities:              (0x3085) SCT Status supported.
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
  1 Raw_Read_Error_Rate     POSR--   118   099   006    -    170591376
  3 Spin_Up_Time            PO----   093   092   000    -    0
  4 Start_Stop_Count        -O--CK   097   097   020    -    3850
  5 Reallocated_Sector_Ct   PO--CK   100   100   036    -    0
  7 Seek_Error_Rate         POSR--   076   060   030    -    45291769
  9 Power_On_Hours          -O--CK   091   091   000    -    8254
 10 Spin_Retry_Count        PO--C-   100   100   097    -    0
 12 Power_Cycle_Count       -O--CK   097   097   020    -    3531
183 Runtime_Bad_Block       -O--CK   100   100   000    -    0
184 End-to-End_Error        -O--CK   100   100   099    -    0
187 Reported_Uncorrect      -O--CK   096   096   000    -    4
188 Command_Timeout         -O--CK   100   099   000    -    1 3 3
189 High_Fly_Writes         -O-RCK   096   096   000    -    4
190 Airflow_Temperature_Cel -O---K   062   051   045    -    38 (Min/Max 21/45 #1)
191 G-Sense_Error_Rate      -O--CK   100   100   000    -    0
192 Power-Off_Retract_Count -O--CK   100   100   000    -    191
193 Load_Cycle_Count        -O--CK   001   001   000    -    242985
194 Temperature_Celsius     -O---K   038   049   000    -    38 (128 0 0 0 0)
197 Current_Pending_Sector  -O--C-   100   100   000    -    8
198 Offline_Uncorrectable   ----C-   100   100   000    -    8
199 UDMA_CRC_Error_Count    -OSRCK   200   200   000    -    0
240 Head_Flying_Hours       ------   100   253   000    -    4183h+31m+57.974s
241 Total_LBAs_Written      ------   100   253   000    -    183471166860639
242 Total_LBAs_Read         ------   100   253   000    -    81944486528248
                            ||||||_ K auto-keep
                            |||||__ C event count
                            ||||___ R error rate
                            |||____ S speed/performance
                            ||_____ O updated online
                            |______ P prefailure warning

General Purpose Log Directory Version 1
SMART           Log Directory Version 1 [multi-sector log support]
Address    Access  R/W   Size  Description
0x00       GPL,SL  R/O      1  Log Directory
0x01           SL  R/O      1  Summary SMART error log
0x02           SL  R/O      5  Comprehensive SMART error log
0x03       GPL     R/O      5  Ext. Comprehensive SMART error log
0x06           SL  R/O      1  SMART self-test log
0x07       GPL     R/O      1  Extended self-test log
0x09           SL  R/W      1  Selective self-test log
0x10       GPL     R/O      1  NCQ Command Error log
0x11       GPL     R/O      1  SATA Phy Event Counters log
0x21       GPL     R/O      1  Write stream error log
0x22       GPL     R/O      1  Read stream error log
0x80-0x9f  GPL,SL  R/W     16  Host vendor specific log
0xa1       GPL,SL  VS      20  Device vendor specific log
0xa2       GPL     VS    4496  Device vendor specific log
0xa8       GPL,SL  VS      20  Device vendor specific log
0xa9       GPL,SL  VS       1  Device vendor specific log
0xab       GPL     VS       1  Device vendor specific log
0xb0       GPL     VS    5067  Device vendor specific log
0xbd       GPL     VS     512  Device vendor specific log
0xbe-0xbf  GPL     VS   65535  Device vendor specific log
0xc0       GPL,SL  VS       1  Device vendor specific log
0xe0       GPL,SL  R/W      1  SCT Command/Status
0xe1       GPL,SL  R/W      1  SCT Data Transfer

SMART Extended Comprehensive Error Log Version: 1 (5 sectors)
Device Error Count: 4
        CR     = Command Register
        FEATR  = Features Register
        COUNT  = Count (was: Sector Count) Register
        LBA_48 = Upper bytes of LBA High/Mid/Low Registers ]  ATA-8
        LH     = LBA High (was: Cylinder High) Register    ]   LBA
        LM     = LBA Mid (was: Cylinder Low) Register      ] Register
        LL     = LBA Low (was: Sector Number) Register     ]
        DV     = Device (was: Device/Head) Register
        DC     = Device Control Register
        ER     = Error register
        ST     = Status register
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 4 [3] occurred at disk power-on lifetime: 8245 hours (343 days + 13 hours)
  When the command that caused the error occurred, the device was active or idle.
  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  40 -- 51 00 00 00 00 2e 36 f7 38 00 00  Error: WP at LBA = 0x2e36f738 = 775354168

  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  61 00 00 00 08 00 00 3c a1 01 60 40 00  1d+05:44:31.301  WRITE FPDMA QUEUED
  61 00 00 00 08 00 00 6f 00 d0 68 40 00  1d+05:44:31.083  WRITE FPDMA QUEUED
  61 00 00 05 20 00 00 6e f4 25 f8 40 00  1d+05:44:31.081  WRITE FPDMA QUEUED
  61 00 00 00 08 00 00 3c a1 01 58 40 00  1d+05:44:31.081  WRITE FPDMA QUEUED
  61 00 00 00 48 00 00 3c 93 15 10 40 00  1d+05:44:31.081  WRITE FPDMA QUEUED

Error 3 [2] occurred at disk power-on lifetime: 8245 hours (343 days + 13 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  40 -- 51 00 00 00 00 2e 36 f7 38 00 00  Error: WP at LBA = 0x2e36f738 = 775354168

  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  61 00 00 08 40 00 00 3f 2a 19 80 40 00  1d+05:44:28.089  WRITE FPDMA QUEUED
  61 00 00 00 08 00 00 3f 40 d9 a8 40 00  1d+05:44:28.089  WRITE FPDMA QUEUED
  61 00 00 04 60 00 00 3c 93 0b 38 40 00  1d+05:44:28.088  WRITE FPDMA QUEUED
  61 00 00 00 08 00 00 3c a1 01 50 40 00  1d+05:44:28.088  WRITE FPDMA QUEUED
  60 00 00 00 08 00 00 2e 36 f7 38 40 00  1d+05:44:28.085  READ FPDMA QUEUED

Error 2 [1] occurred at disk power-on lifetime: 8245 hours (343 days + 13 hours)
  When the command that caused the error occurred, the device was active or idle.
  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  40 -- 51 00 00 00 00 2e 36 f7 38 00 00  Error: WP at LBA = 0x2e36f738 = 775354168

  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  61 00 00 04 c0 00 00 16 e6 3c 48 40 00  1d+05:44:25.346  WRITE FPDMA QUEUED
  61 00 00 00 08 00 00 17 1d a5 78 40 00  1d+05:44:25.346  WRITE FPDMA QUEUED
  61 00 00 04 00 00 00 6e f4 1a f8 40 00  1d+05:44:25.346  WRITE FPDMA QUEUED
  61 00 00 00 08 00 00 6f 00 d0 58 40 00  1d+05:44:25.346  WRITE FPDMA QUEUED
  60 00 00 00 08 00 00 2e 36 f7 38 40 00  1d+05:44:25.119  READ FPDMA QUEUED

Error 1 [0] occurred at disk power-on lifetime: 8245 hours (343 days + 13 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  40 -- 51 00 00 00 00 2e 36 f7 38 00 00  Error: UNC at LBA = 0x2e36f738 = 775354168

  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  60 00 00 08 78 00 00 2e 37 1f 30 40 00  1d+05:44:20.856  READ FPDMA QUEUED
  60 00 00 0a 00 00 00 2e 37 15 30 40 00  1d+05:44:20.855  READ FPDMA QUEUED
  60 00 00 00 08 00 00 2e 3f 31 20 40 00  1d+05:44:20.855  READ FPDMA QUEUED
  60 00 00 03 80 00 00 2e 37 11 a8 40 00  1d+05:44:20.855  READ FPDMA QUEUED
  60 00 00 0a 00 00 00 2e 37 07 a8 40 00  1d+05:44:20.855  READ FPDMA QUEUED

SMART Extended Self-test Log Version: 1 (1 sectors)
Num  Test_Description  Status                   Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline  Completed: read failure  90%        8249             775354168
# 2  Short offline     Completed without error  00%        1536             -
# 3  Short offline     Completed without error  00%        641              -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

SCT Status Version:                  3
SCT Version (vendor specific):       522 (0x020a)
Device State:                        Active (0)
Current Temperature:                 37 Celsius
Power Cycle Min/Max Temperature:     21/45 Celsius
Lifetime    Min/Max Temperature:     5/49 Celsius
Under/Over Temperature Limit Count:  0/0

SCT Data Table command not supported
SCT Error Recovery Control command not supported

Device Statistics (GP/SMART Log 0x04) not supported
Pending Defects log (GP Log 0x0c) not supported

SATA Phy Event Counters (GP Log 0x11)
ID      Size  Value  Description
0x000a  2     3      Device-to-host register FISes sent due to a COMRESET
0x0001  2     0      Command failed due to ICRC error
0x0003  2     0      R_ERR response for device-to-host data FIS
0x0004  2     0      R_ERR response for host-to-device data FIS
0x0006  2     0      R_ERR response for device-to-host non-data FIS
0x0007  2     0      R_ERR response for host-to-device non-data FIS
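[For readers skimming the report: the attributes that matter here can be pulled out of smartctl's output with a short pipeline. A minimal sketch — the heredoc stands in for live `smartctl -A` output, and `/dev/sdX` is a placeholder device name:]

```shell
# Extract the raw values of the failure-related SMART attributes from a
# smartctl -A report. The heredoc below holds sample rows copied from the
# report in this thread; on a live system you would instead use:
#   smart_output=$(smartctl -A /dev/sdX)
smart_output=$(cat <<'EOF'
  5 Reallocated_Sector_Ct   PO--CK   100   100   036    -    0
187 Reported_Uncorrect      -O--CK   096   096   000    -    4
197 Current_Pending_Sector  -O--C-   100   100   000    -    8
198 Offline_Uncorrectable   ----C-   100   100   000    -    8
EOF
)
# Print attribute ID, name, and raw value for IDs 5, 187, 197, 198.
echo "$smart_output" | awk '$1 == 5 || $1 == 187 || $1 == 197 || $1 == 198 {print $1, $2, "raw=" $NF}'
# → 5 Reallocated_Sector_Ct raw=0
# → 187 Reported_Uncorrect raw=4
# → 197 Current_Pending_Sector raw=8
# → 198 Offline_Uncorrectable raw=8
```

Raw values of 197/198 above zero with a non-zero 187 are exactly the pattern discussed below.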
JorgeB Posted March 10, 2022

Pending sectors usually can't be ignored unless they are false positives, and these don't appear to be, since the SMART test failed. You can do a full disk write to see if they return to zero and don't show up again soon after.
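[The "full disk write" JorgeB describes is usually done with the preclear plugin, but it can also be done by hand. A destructive sketch, demonstrated here against a scratch file so nothing real is harmed — on the actual drive the target would be the block device, e.g. `/dev/sdX` (an assumed name), the disk must be unassigned first, and every byte on it is destroyed:]

```shell
# Full write pass, demonstrated on a 1 MiB scratch file.
# On the real (unassigned!) disk the target would be /dev/sdX and
# ALL DATA ON IT WOULD BE DESTROYED.
target=/tmp/scratch.img

# Stand-in for a disk that currently holds data:
dd if=/dev/urandom of="$target" bs=1M count=1 status=none

# The overwrite pass itself (conv=notrunc keeps the file size fixed,
# mimicking writing over a fixed-size block device):
dd if=/dev/zero of="$target" bs=1M count=1 conv=notrunc status=none

# Afterwards, re-check whether the pending count returned to zero:
#   smartctl -A /dev/sdX | grep Current_Pending_Sector
```

Writing every sector forces the drive to either remap the pending sectors or clear them; if 197/198 drop to zero and stay there, the disk may be usable, though attribute 5 (Reallocated_Sector_Ct) should then be watched.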
SP67 Posted March 10, 2022 (Author)

Thanks! Can I do that directly on unRAID? Although it seems I might be looking at buying another drive...
JorgeB Posted March 10, 2022

7 minutes ago, SP67 said:
    Can I do that directly on unRAID?

You can with the pre-clear plugin/docker; the disk must be unassigned, and any data on it will be deleted.
SP67 Posted March 11, 2022 (Author)

Reported_Uncorrect has grown from 4 to 10 in less than 24 h. The drive is probably on its last legs...

For what it's worth, I've found that this drive is from Seagate's 7200.14 series, which had early-death problems that were supposedly fixed with a later firmware. I never saw this update, so the drive has been running the factory firmware since I bought it.
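[If anyone else with one of these drives wants to check which firmware they're on, smartctl reports it directly. A quick sketch — `/dev/sdX` is a placeholder, and here a sample of the report earlier in the thread stands in for live output:]

```shell
# Check the drive's firmware revision. On a live system:
#   smartctl -i /dev/sdX | grep 'Firmware Version'
# Here two lines sampled from the report above stand in for the output.
info=$(cat <<'EOF'
Device Model:     ST2000DM001-9YN164
Firmware Version: CC4B
EOF
)
echo "$info" | grep 'Firmware Version'
# → Firmware Version: CC4B
```

Comparing that revision against Seagate's pages linked in the report's warning tells you whether a newer firmware was ever offered for the drive.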
trurl Posted March 11, 2022

I would replace it ASAP just because of the pending sectors; then you can work on seeing if it is worth keeping.
SP67 Posted March 11, 2022 (Author)

Can I stop the array, remove the drive and add a new one? Or do I need to shut down the server?
JonathanM Posted March 11, 2022

5 minutes ago, SP67 said:
    Can I stop the array, remove the drive and add a new one? Or do I need to shut down the server?

Depends on your hardware. If everything is compatible and working properly, stopping the array should be enough. However, it's much safer to power down, and it doesn't really take that much more time. Your call, but I'd power down, even if I was sure my hardware could handle a hot swap.
trurl Posted March 11, 2022

2 hours ago, SP67 said:
    remove the drive and add a new one

Just to make sure there is no confusion about "adding" disks: you will be replacing a disk, not adding one. You will assign the replacement disk to the same slot as the disk you are replacing.
SP67 Posted March 11, 2022 (Author)

Ok, so I shut the server down, added a 4 TB disk, moved the contents of the failing drive to the new one, and added the new disk to the cache pool. Then I shut the server down again and removed the old drive. So far so good, everything is going well. Thanks!
trurl Posted March 12, 2022

14 hours ago, SP67 said:
    Ok, so I shut the server down, added a 4 TB disk, moved the contents of the failing drive to the new one, and added the new disk to the cache pool. Then I shut the server down again and removed the old drive.

@SP67 Not entirely clear, and in any case not what I was recommending. Do you mean you moved the data from the failing drive to an Unassigned new drive, then assigned that Unassigned drive to cache? And then you shrunk the array by removing the old drive with New Config and rebuilt parity? Seems needlessly complicated, but if this is what you did then maybe everything is OK. If this is not what you did, please explain in more detail, because it's not clear that everything is OK.

What I had in mind was simply replacing the failing drive with a new drive, assigning that new drive to the slot of the failing drive, and letting it rebuild from parity. Parity can rebuild the contents of a failing drive to a new drive even if you have already thrown the failing drive away. This is the whole reason you have parity.
trurl Posted March 12, 2022

1 hour ago, trurl said:
    shrunk the array by removing the old drive with New Config and rebuilt parity

@SP67 If you removed a drive instead of rebuilding it, and then didn't rebuild parity without the removed drive, then your parity is invalid. Diagnostics might clear up some of my concerns.
SP67 Posted March 12, 2022 (Author)

Yeah, but the failing drive was part of a cache pool (I have one SSD for app data and one HDD for torrent downloads), so AFAIK parity would not have helped in this case. Copying the data from the old drive was just to avoid having to re-download what hadn't already moved to the array. If this is not the proper way to do it, please correct me, as I'm still learning.
trurl Posted March 12, 2022

OK, I assumed you were working with array disks, since I didn't have any diagnostics to go on and it was an HDD. Still unclear about this part, though:

1 minute ago, SP67 said:
    part of a cache pool (I have one SSD for app data and one HDD for torrent downloads)

Do you really mean these are separate pools? Having both in the same pool wouldn't allow you to put different things on each, and the SSD could only work at the speed of the HDD if they were in the same pool.
trurl Posted March 12, 2022

OK, I dug up some of your old diagnostics, and it looks like these are separate, single-disk pools. Might have been better to make them XFS if you don't plan to have a multi-disk pool.
SP67 Posted March 12, 2022 (Author)

I'm attaching a capture of my array to see if it helps clarify things. Thanks for the interest.
SP67 Posted March 12, 2022 (Author)

1 minute ago, trurl said:
    OK, I dug up some of your old diagnostics, and it looks like these are separate, single-disk pools. Might have been better to make them XFS if you don't plan to have a multi-disk pool.

How should I do that? Or is it too late?