January 1, 2017 As usual, my monthly parity check began today. Shortly after the check started, I got a series of 2029 read errors on Parity 2, my newest WD 4TB Red drive (maybe 3-4 months old). The check is still ongoing, and despite the probable slowdown to the parity check, I kicked off a long SMART test on the drive as well (which is still in progress as I type this). Any thoughts on the best path forward? I have a warm spare of suitable size, so I could quickly swap it out without taking any chances, but I'm not sure whether this was just a fluke. Maybe I should pull the server out, check all of my connections, and re-run the parity check? My power supply (Seasonic X-750 Gold) has been absolutely rock solid to this point, but it is getting up there in age. Diagnostics including logs attached. Thanks for the help! hyperion-diagnostics-20170101-1021.zip
January 1, 2017 Community Expert SMART looks OK. I would cancel both the parity check and the SMART test, shut down, check connections, then run a long SMART test and, if that passes, run the parity check.
January 1, 2017 Author Thanks. Looks like I'm about 60 minutes away from completion anyway, so I guess I'll let it run and then do what you suggest. Thanks for taking a look!
January 5, 2017 Author So I re-ran the extended test. This was the result:

smartctl 6.5 2016-05-07 r4318 [x86_64-linux-4.4.30-unRAID] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Red
Device Model:     WDC WD40EFRX-68WT0N0
Serial Number:    WD-WCC4E0XD509S
LU WWN Device Id: 5 0014ee 20d3b9f1b
Firmware Version: 82.00A82
User Capacity:    4,000,787,030,016 bytes [4.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5400 rpm
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2 (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Wed Jan 4 22:40:11 2017 PST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      ( 121) The previous self-test completed having
                                        the read element of the test failed.
Total time to complete Offline
data collection:                (52980) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 530) minutes.
Conveyance self-test routine
recommended polling time:        (   5) minutes.
SCT capabilities:              (0x703d) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       36
  3 Spin_Up_Time            0x0027   182   182   021    Pre-fail  Always       -       7866
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       23
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   097   097   000    Old_age   Always       -       2624
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       3
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       0
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       2736
194 Temperature_Celsius     0x0022   120   116   000    Old_age   Always       -       32
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       3

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure       90%      2622         126120736
# 2  Extended offline    Completed: read failure       90%      2601         126117304
# 3  Extended offline    Aborted by host               90%      2601         -
# 4  Extended offline    Completed: read failure       90%      2545         126117304

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

I'm not entirely sure what to make of it. It looks like the test completed with a read failure? It's under warranty, so I suspect it's time to send it back unless you have any other suggestions. Strangely, unRAID hasn't reported any read errors on the drive since I powered down, checked the cabling, and started the SMART test. I guess the SMART test's read/writes aren't included in the totals on the unRAID Main page.
January 5, 2017 I'd be inclined to get it replaced, simply because of the read errors and the fact that you've had issues with it.
January 5, 2017 Community Expert Quote: "I'm not entirely sure what to make of it. It looks like the test completed with a read failure? It's under warranty, so I suspect it's time to send it back unless you have any other suggestions. Strangely, unRAID hasn't reported any read errors on the drive since I powered down, checked the cabling, and started the SMART test. I guess the SMART test's read/writes aren't included in the totals on the unRAID Main page." The disk failed the extended SMART test, so it needs to be replaced. unRAID's read error counts are reset by a reboot; you'd see more of them on the next parity check, if not before that.
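For anyone puzzling over the "Self-test execution status: ( 121)" line in the report above, here is a minimal Python sketch of how that byte is commonly unpacked. It assumes the usual ATA layout (high four bits are a result code, low four bits are the remaining portion in tens of percent). The function names and the wording of the status table are my own paraphrase for illustration, not smartctl's code, so check them against the smartmontools documentation before relying on them.

```python
from typing import Tuple

# Result codes for the high nibble, paraphrased (assumption: standard ATA
# SMART layout; the exact wording printed by smartctl differs).
_STATUS = {
    0: "completed without error, or no self-test has been run",
    1: "aborted by host",
    2: "interrupted by host reset",
    3: "fatal or unknown error, test could not complete",
    4: "completed, unknown element failed",
    5: "completed, electrical element failed",
    6: "completed, servo/seek element failed",
    7: "completed, read element failed",
    8: "completed, handling damage suspected",
    15: "self-test in progress",
}


def decode_self_test_status(value: int) -> Tuple[int, int]:
    """Split the status byte into (result code, percent remaining)."""
    if not 0 <= value <= 0xFF:
        raise ValueError("status must fit in one byte")
    return value >> 4, (value & 0x0F) * 10


def describe(value: int) -> str:
    """Return a short human-readable summary of the status byte."""
    code, remaining = decode_self_test_status(value)
    text = _STATUS.get(code, "reserved or vendor-specific")
    return f"{text} ({remaining}% remaining)"


if __name__ == "__main__":
    # 121 is the value shown in the report posted earlier in this thread.
    print(describe(121))
```

With the value from the report, 121 is 0x79, so the result code is 7 (read element failed) with 90% of the test remaining, which matches the "Completed: read failure 90%" entries in the self-test log.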
January 5, 2017 I had similar errors on the test and I just decided to buy a new drive, although my drive was older than yours... On the bright side you still have the warranty, right?
January 5, 2017 Quote: "The check is still ongoing, and despite the probable slowdown to the parity check, I kicked off a long SMART test on the drive as well (which is still in progress as I type this)." Just a comment on this: you can't run an extended SMART test at the same time as *anything* else. The long SMART test would have aborted as soon as anything else tried to access the drive. The long test also aborts as soon as it finds the first read error, and the test result then shows as 'Completed' along with the approximate percentage left untested (a multiple of 10) and the LBA of that first error. Most tests that stop quickly show 90% remaining (they stopped within the first 10%). The word 'Completed' is usually misleading: it only means the entire surface was tested if there were no read errors (LBA is blank) and 'Remaining' is 0%. You would think that by now someone would have improved the extended SMART test so that it avoids aborting, tests the entire surface, and returns the total number of LBAs with read errors! Edit: correcting my very wrong understanding of it
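The way of reading the self-test log described in this post can be sketched in Python. This is only an illustration: the regular expression assumes each entry looks like the four rows in the report posted earlier (a two-word test description, a status, a percentage, the lifetime hours, and an LBA or "-"), and the names `SelfTestEntry`, `parse_self_test_log`, and `full_surface_pass` are made up for this example.

```python
import re
from typing import List, NamedTuple, Optional


class SelfTestEntry(NamedTuple):
    num: int
    description: str
    status: str
    remaining_pct: int
    lifetime_hours: int
    first_error_lba: Optional[int]  # None when the log shows "-"


# Assumed row shape: "# N  <two-word description>  <status>  NN%  <hours>  <LBA|->"
_ROW = re.compile(r"^#\s*(\d+)\s+(\w+\s\w+)\s+(.+?)\s+(\d+)%\s+(\d+)\s+(\d+|-)\s*$")


def parse_self_test_log(text: str) -> List[SelfTestEntry]:
    """Extract self-test entries from smartctl-style log text."""
    entries = []
    for line in text.splitlines():
        match = _ROW.match(line.strip())
        if not match:
            continue  # headers and unrelated lines are skipped
        num, desc, status, remaining, hours, lba = match.groups()
        entries.append(SelfTestEntry(
            int(num), desc, status, int(remaining), int(hours),
            None if lba == "-" else int(lba),
        ))
    return entries


def full_surface_pass(entry: SelfTestEntry) -> bool:
    """True only if nothing was left untested and no error LBA was recorded."""
    return entry.remaining_pct == 0 and entry.first_error_lba is None


# The four rows from the report in this thread.
SAMPLE = """\
# 1  Extended offline    Completed: read failure       90%      2622         126120736
# 2  Extended offline    Completed: read failure       90%      2601         126117304
# 3  Extended offline    Aborted by host               90%      2601         -
# 4  Extended offline    Completed: read failure       90%      2545         126117304
"""

if __name__ == "__main__":
    for entry in parse_self_test_log(SAMPLE):
        verdict = "full pass" if full_surface_pass(entry) else "NOT a full pass"
        print(entry.num, entry.status, f"{entry.remaining_pct}%", verdict)
```

None of the four logged runs counts as a full pass under this rule: three stopped at a read error with 90% remaining, and one was aborted by the host.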
January 5, 2017 Quote: "you can't run an extended SMART test at the same time as *anything* else. The long SMART test would have aborted as soon as anything else tried to access the drive." Interesting. I didn't know that. I assumed it just carries on from where it left off when the other thing accessing the disk has finished. The thing about SMART is that it was probably just about good enough when it was first introduced but it hasn't really been developed since then. It could be so much better.
January 5, 2017 Author Quote: "I assumed it just carries on from where it left off when the other thing accessing the disk has finished. The thing about SMART is that it was probably just about good enough when it was first introduced but it hasn't really been developed since then. It could be so much better." Same here. That initial extended SMART test ran for about 8 hours before completing. I kept checking on the status and it always said 10%, then eventually I came back and it was no longer in progress. This was while the parity check was running. So I started it again after the parity check completed, but realized I hadn't checked the cables, so I aborted it. I adjusted the cabling, powered on, and ran it twice more, and it failed both times. Based on the steps I took, the SMART history (read failure, aborted, read failure, read failure), and the report from my first post that shows the test in progress while a parity check was running, it looks like the test can interleave with reads, but I'm no expert. The drive has been RMA'd. I'll swap in my warm spare this evening while I wait for the replacement drive.
January 6, 2017 Quote: "Based on the steps I took and the SMART history (read failure, aborted, read failure, read failure), and the report from my first post that shows the test being in process while a parity check is running, it looks like the test can interleave with reads, but I'm no expert." I decided I had better go recheck my assumptions here, and while I didn't find anything authoritative, I have to admit I definitely 'jumped to an assumption'! I must have misunderstood someone way back and come away believing that the polling time was how long you had to wait before you could grab the results, and that if you didn't wait you could interrupt the test. Apparently that's not true. The polling time is just helpful info: the earliest time the test could finish if there's no other I/O to the drive. If there's concurrent I/O, the test will take longer.
I also knew that the test could be inadvertently interrupted. I thought it was aborted because of other I/O, but the only known cause is actually a spin down command, which I believe could explain the unexpected aborts of the long test that I've seen. Some recommend setting up a loop to read a block from the drive every so often, to keep the OS from issuing a spin down. So it should be safe to run other things at the same time. Apparently the extended test runs in the background at a lower priority, so that any regular I/O can proceed as quickly as possible. I was wrong!
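The "read a block every so often" idea could look something like the Python sketch below. It is an assumption-laden illustration rather than a tested recipe. The function name `keep_drive_awake` and its parameters are invented here, the real target would be a block device such as a hypothetical `/dev/sdX` (which normally needs root), and the reads walk across the device in large strides because re-reading the same block may be served from the OS cache and never reach the disk.

```python
import os
import tempfile
import time
from typing import Callable, Optional


def keep_drive_awake(path: str,
                     interval_s: float = 60.0,
                     block_size: int = 4096,
                     max_reads: Optional[int] = None,
                     sleep: Callable[[float], None] = time.sleep) -> int:
    """Read one block from `path` every `interval_s` seconds.

    Runs forever when max_reads is None; returns the number of reads done.
    """
    fd = os.open(path, os.O_RDONLY)
    try:
        size = os.lseek(fd, 0, os.SEEK_END)
        n_blocks = max(size // block_size, 1)
        reads = 0
        while max_reads is None or reads < max_reads:
            # Large stride so consecutive reads land on different blocks and
            # are less likely to be answered from the page cache.
            block = (reads * 1_000_003) % n_blocks
            os.pread(fd, block_size, block * block_size)
            reads += 1
            sleep(interval_s)
        return reads
    finally:
        os.close(fd)


if __name__ == "__main__":
    # Demo against a temporary file standing in for the drive.
    with tempfile.NamedTemporaryFile() as stand_in:
        stand_in.write(b"\0" * 1_048_576)
        stand_in.flush()
        done = keep_drive_awake(stand_in.name, interval_s=0.1, max_reads=3)
        print(f"performed {done} reads")
```

In real use the interval would need to be shorter than the configured spin-down delay, and the loop would be stopped once the self-test finishes.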
January 6, 2017 That's how I had assumed it was, through running extended SMART tests on active system drives (Mac OS X and Linux) while using the computers normally and seeing the tests run to completion, though I've never seen it documented anywhere. Thanks for correcting your error, Rob, and respect for admitting you were wrong.
January 6, 2017 Community Expert Quote: "Some recommend setting up a loop to read a block from the drive every so often, to keep the OS from issuing a spin down." Not sure when it was implemented, but I've recently done an extended test under 6.3rc6 and it automatically disabled spindown for the drive while the test ran.
January 6, 2017 Community Expert Quote: "Not sure when it was implemented, but I've recently done an extended test under 6.3rc6 and it automatically disabled spindown for the drive while the test ran." IIRC this was done on v6.1.something.