Read errors on newish parity drive

January 1, 2017

As usual, my monthly parity check began today. Shortly after the check started, I got a series of 2029 read errors on Parity 2, my newest WD 4TB Red drive (maybe 3-4 months old). The check is still ongoing, and despite the probable slowdown to the parity check, I kicked off a long SMART test on the drive as well (which is still in progress as I type this).

Any thoughts on the best path forward? I have a warm spare of suitable size so could quickly swap it out without taking any chances, but I'm not sure if it was just a fluke. Maybe I should pull the server out, check all of my connections, and re-run parity check? My power supply (Seasonic X-750 Gold) has been absolutely rock solid to this point, but it is getting up there in age.

Diagnostics including logs attached. Thanks for the help!

hyperion-diagnostics-20170101-1021.zip

January 1, 2017

Community Expert

SMART looks OK. I would cancel both the parity check and the SMART test, shut down, check the connections, then run a long SMART test and, if that passes, run the parity check.

January 1, 2017

Author

Thanks. Looks like I'm about 60 minutes away from completion anyway, so I guess I'll let it run and then do what you suggest. Thanks for taking a look!

January 5, 2017

Author

So I re-ran the extended test. This was the result:

smartctl 6.5 2016-05-07 r4318 [x86_64-linux-4.4.30-unRAID] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Red
Device Model:     WDC WD40EFRX-68WT0N0
Serial Number:    WD-WCC4E0XD509S
LU WWN Device Id: 5 0014ee 20d3b9f1b
Firmware Version: 82.00A82
User Capacity:    4,000,787,030,016 bytes [4.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5400 rpm
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2 (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Wed Jan  4 22:40:11 2017 PST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00)	Offline data collection activity
				was never started.
				Auto Offline Data Collection: Disabled.
Self-test execution status:      ( 121)	The previous self-test completed having
				the read element of the test failed.
Total time to complete Offline 
data collection: 		(52980) seconds.
Offline data collection
capabilities: 			 (0x7b) SMART execute Offline immediate.
				Auto Offline data collection on/off support.
				Suspend Offline collection upon new
				command.
				Offline surface scan supported.
				Self-test supported.
				Conveyance Self-test supported.
				Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
				power-saving mode.
				Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
				General Purpose Logging supported.
Short self-test routine 
recommended polling time: 	 (   2) minutes.
Extended self-test routine
recommended polling time: 	 ( 530) minutes.
Conveyance self-test routine
recommended polling time: 	 (   5) minutes.
SCT capabilities: 	       (0x703d)	SCT Status supported.
				SCT Error Recovery Control supported.
				SCT Feature Control supported.
				SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       36
  3 Spin_Up_Time            0x0027   182   182   021    Pre-fail  Always       -       7866
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       23
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   097   097   000    Old_age   Always       -       2624
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       3
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       0
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       2736
194 Temperature_Celsius     0x0022   120   116   000    Old_age   Always       -       32
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       3

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure       90%      2622         126120736
# 2  Extended offline    Completed: read failure       90%      2601         126117304
# 3  Extended offline    Aborted by host               90%      2601         -
# 4  Extended offline    Completed: read failure       90%      2545         126117304

SMART Selective self-test log data structure revision number 1
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

I'm not entirely sure what to make of it. It looks like the test completed with a read failure? It's under warranty, so I suspect it's time to send it back unless you have any other suggestions. Strangely, unRAID hasn't reported any read errors on the drive since I powered down, checked the cabling, and started the SMART test. I guess the SMART test's reads aren't included in the totals on the unRAID Main page.

January 5, 2017

I'd be inclined to get it replaced, simply due to the read errors and the fact you've had issues with it.

January 5, 2017

Community Expert

The disk failed the extended SMART test, so it needs to be replaced. unRAID's read error counts are reset after a reboot; you would see more on the next parity check, if not sooner.

January 5, 2017

I had similar errors on the test and I just decided to buy a new drive, although my drive was older than yours. On the bright side, you still have the warranty, right?

January 5, 2017

"The check is still ongoing, and despite the probable slowdown to the parity check, I kicked off a long SMART test on the drive as well (which is still in progress as I type this)."

Just a comment on this: you can't run an extended SMART test at the same time as *anything* else. The long SMART test would have aborted as soon as anything else tried to access the drive. The long test also aborts as soon as it finds the first read error. The result then shows as 'Completed', along with the approximate percentage of the test still remaining when it stopped (a multiple of 10) and the LBA of that first error. Most tests that stop quickly show 90% remaining (they stopped within the first 10%). The word 'Completed' is usually misleading: it only means the entire surface was tested if there were no read errors (the LBA is blank) and 'Remaining' is 0%.
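As a rough illustration of the reading rule described above, here is a minimal Python sketch that parses one line of the self-test log and applies it. The regular expression assumes the column layout of the smartctl 6.5 output pasted earlier in this thread, the function and field names are made up for the example, and the 512-byte logical sector size and the capacity default are taken from that same output. Treat it as a sketch, not a robust parser.

```python
import re

# Assumed layout: "# N  <test>  <status>  <remaining>%  <hours>  <LBA or ->"
LOG_LINE = re.compile(
    r"^#\s*(?P<num>\d+)\s+"
    r"(?P<test>.+?)\s{2,}"
    r"(?P<status>.+?)\s+"
    r"(?P<remaining>\d+)%\s+"
    r"(?P<hours>\d+)\s+"
    r"(?P<lba>\d+|-)\s*$"
)


def parse_selftest_line(line, sector_bytes=512, capacity_bytes=4_000_787_030_016):
    """Parse one self-test log line (hypothetical helper for illustration)."""
    m = LOG_LINE.match(line.strip())
    if m is None:
        return None
    lba = None if m["lba"] == "-" else int(m["lba"])
    remaining = int(m["remaining"])
    status = m["status"]
    first_error_byte = None if lba is None else lba * sector_bytes
    return {
        "status": status,
        "remaining_pct": remaining,
        "lifetime_hours": int(m["hours"]),
        "first_error_lba": lba,
        # Byte offset of the first failing sector, from the logical sector size.
        "first_error_byte": first_error_byte,
        # How far into the disk the first error sits, as a fraction of capacity.
        "first_error_fraction": (
            None if first_error_byte is None else first_error_byte / capacity_bytes
        ),
        # Rule from the post: only a 'Completed' test with no LBA and 0% remaining
        # actually covered the whole surface.
        "full_surface_tested": (
            status.startswith("Completed") and lba is None and remaining == 0
        ),
    }


if __name__ == "__main__":
    line = "# 1  Extended offline    Completed: read failure       90%      2622         126120736"
    print(parse_selftest_line(line))
```

For the most recent entry in the log above, the first failing LBA works out to a byte offset of roughly 64.6 GB, or under 2% of the way into the drive, which is consistent with the test stopping with 90% remaining.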

You would think that by now someone would have improved the extended SMART test to avoid aborting, and to test the entire surface, returning the total number of LBAs with read errors!

Edit: correcting my very wrong understanding of it

January 5, 2017

"you can't run an extended SMART test at the same time as *anything* else. The long SMART test would have aborted as soon as anything else tried to access the drive."

Interesting. I didn't know that. I assumed it just carries on from where it left off when the other thing accessing the disk has finished. The thing about SMART is that it was probably just about good enough when it was first introduced but it hasn't really been developed since then. It could be so much better.

January 5, 2017

Author

Same here. That initial SMART Extended test ran for about 8 hours before completing. I kept checking on the status and it always said 10%, then eventually I came back and it was no longer in progress. This was while the parity check was running.

So I started it again after the parity check completed, but realized I hadn't checked the cables, so I aborted it. I adjusted the cabling, powered on, and ran it twice more, both failing.

Based on the steps I took and the SMART history (read failure, aborted, read failure, read failure), and the report from my first post that shows the test in progress while a parity check was running, it looks like the test can interleave with reads, but I'm no expert.

The drive has been RMA'd. I'll swap in my warm spare this evening while I wait for the replacement drive.

January 6, 2017

I decided I'd better go recheck my assumptions here, and while I didn't find anything authoritative, I have to admit I definitely 'jumped to an assumption'! I must have misunderstood someone way back and come away thinking that the polling time was the time before you could grab the results, and that if you didn't wait you could interrupt the test. Apparently that's not true. The polling time is just helpful info: the earliest time the test could finish if there's no other I/O to the drive. If there's concurrent I/O, the test will take longer. I also knew that the test could be inadvertently interrupted. I thought it was aborted by other I/O, but the only known cause is actually a spin-down command, which I believe could explain the unexpected aborts of the long test that I've seen. Some recommend setting up a loop that reads a block from the drive every so often, to keep the OS from issuing a spin down.

So it should be safe to run other things at the same time. Apparently the extended test is run in the background at a lower priority, so that any regular I/O can proceed as quickly as possible. I was wrong!
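For anyone curious what the "read a block every so often" loop mentioned above might look like, here is a minimal Python sketch. Everything in it is an assumption for illustration: the function names, the 4096-byte block size, the 5-minute interval, and the `/dev/sdX` placeholder path are not from any unRAID tool. Reads can also be satisfied from the OS page cache without touching the disk, which is why the sketch picks a random offset each time, and even that is no guarantee.

```python
import os
import random
import time


def read_one_block(path, block_size=4096, rng=random):
    """Read one block at a random block-aligned offset and return the bytes."""
    fd = os.open(path, os.O_RDONLY)
    try:
        size = os.lseek(fd, 0, os.SEEK_END)   # size of the file or device
        blocks = max(size // block_size, 1)
        offset = rng.randrange(blocks) * block_size
        return os.pread(fd, block_size, offset)
    finally:
        os.close(fd)


def keep_awake(path, interval_s=300, rounds=None, sleep=time.sleep):
    """Touch the drive every interval_s seconds; rounds=None loops forever."""
    done = 0
    while rounds is None or done < rounds:
        read_one_block(path)
        done += 1
        if rounds is None or done < rounds:
            sleep(interval_s)
    return done


if __name__ == "__main__":
    # Placeholder device path; needs read permission on the device (usually root).
    keep_awake("/dev/sdX", interval_s=300)
```

You would start this before kicking off the long test and stop it once the self-test log shows a result. The interval just needs to be shorter than the spin-down delay configured for the drive.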

January 6, 2017

That's how I had assumed it was, through running extended SMART tests on active system drives (Mac OS X and Linux) while using the computers normally and seeing the tests run to completion, though I've never seen it documented anywhere. Thanks for correcting your error, Rob, and respect for admitting you were wrong.

January 6, 2017

Community Expert

Not sure when it was implemented, but I've recently done an extended test under 6.3rc6 and it automatically disabled spindown for the drive while the test ran.

January 6, 2017

Community Expert

IIRC this was done on v6.1.something.
