Disk errors not detected in pre-clear or SMART



First off, there is no data at risk.

When this originally occurred, I ran a non-correcting parity check to confirm there were zero errors, then I replaced the drive, which rebuilt successfully.

 

I am now deciding what to do with the removed drive and whether I need to change my preclear approach to include a review of the server logs for unreported errors.

 

Details:

 

A new shucked WD He10 8TB drive passed 1 x WD test then 1 x pre clear cycle.

After 6 months as a hot spare it was added to the array, where it generated errors once more than 10% of the drive's capacity was in use.

The errors were shown on the main Unraid page and logged as SMART 001 'Raw read error rate'. The 'Value / Worst' SMART reading actually increased from 099/099 to 100/100 after the non-correcting parity check completed.

The Smart 101 Error count reset on reboot

Swapping to a different controller port did not eliminate the errors.

Extended smart passed with no errors

SMART data is clean, no pending or reallocated sectors etc.

Moved the drive to a different server, different controller, cables etc.

Ran another preclear cycle (read/write/read) and got errors between 10% and 20% of the disk during the initial read.

No errors in the write or second read.

No errors in SMART

 

############################################################################################################################
#                                                                                                                          #
#                                         unRAID Server Preclear of disk 2SGAWHDJ                                          #
#                                       Cycle 1 of 1, partition start on sector 64.                                        #
#                                                                                                                          #
#                                                                                                                          #
#   Step 1 of 5 - Pre-read verification:                                                  [17:03:16 @ 130 MB/s] SUCCESS    #
#   Step 2 of 5 - Zeroing the disk:                                                       [14:53:06 @ 149 MB/s] SUCCESS    #
#   Step 3 of 5 - Writing unRAID's Preclear signature:                                                          SUCCESS    #
#   Step 4 of 5 - Verifying unRAID's Preclear signature:                                                        SUCCESS    #
#   Step 5 of 5 - Post-Read verification:                                                 [16:04:22 @ 138 MB/s] SUCCESS    #
#                                                                                                                          #
#                                                                                                                          #
#                                                                                                                          #
#                                                                                                                          #
#                                                                                                                          #
#                                                                                                                          #
#                                                                                                                          #
############################################################################################################################
#                              Cycle elapsed time: 48:00:48 | Total elapsed time: 48:00:48                                 #
############################################################################################################################


############################################################################################################################
#                                                                                                                          #
#                                        S.M.A.R.T. Status (device type: default)                                          #
#                                                                                                                          #
#                                                                                                                          #
#   ATTRIBUTE                    INITIAL  CYCLE 1  STATUS                                                                  #
#   5-Reallocated_Sector_Ct      0        0        -                                                                       #
#   9-Power_On_Hours             5749     5798     Up 49                                                                   #
#   194-Temperature_Celsius      32       32       -                                                                       #
#   196-Reallocated_Event_Count  0        0        -                                                                       #
#   197-Current_Pending_Sector   0        0        -                                                                       #
#   198-Offline_Uncorrectable    0        0        -                                                                       #
#   199-UDMA_CRC_Error_Count     0        0        -                                                                       #
#                                                                                                                          #
#                                                                                                                          #
#                                                                                                                          #
#                                                                                                                          #
#                                                                                                                          #
############################################################################################################################
#   SMART overall-health self-assessment test result: PASSED                                                               #
############################################################################################################################


--> ATTENTION: Please take a look into the SMART report above for drive health issues.

--> RESULT: Preclear Finished Successfully!.

 

The log is spammed with these errors between 10% and 20% of drive capacity

The drive is sdb / ata1.00

 

[Screenshot: syslog excerpt showing the repeated errors for ata1.00]

 

The pre-read was slow in the 10-20% zone due to these errors.

Oct 17 21:23:15 MyNas preclear_disk_2SGAWHDJ[18758]: Pre-Read: progress - 20% read @ 98 MB/s

 

Whereas the post-read ran as expected, with no errors and at the expected speed:

Oct 19 03:35:05 MyNas preclear_disk_2SGAWHDJ[18758]: Post-Read: progress - 10% verified @ 187 MB/s
Oct 19 04:49:52 MyNas preclear_disk_2SGAWHDJ[18758]: Post-Read: progress - 20% verified @ 167 MB/s
Oct 19 06:07:32 MyNas preclear_disk_2SGAWHDJ[18758]: Post-Read: progress - 30% verified @ 161 MB/s
Oct 19 07:29:11 MyNas preclear_disk_2SGAWHDJ[18758]: Post-Read: progress - 40% verified @ 156 MB/s
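
To put a number on the difference between the two passes, the progress lines could be parsed and compared. This is only a minimal Python sketch, assuming the syslog line format shown in this post. The function names and the 0.7 ratio are my own illustrative choices and not part of preclear, and I am assuming the MB/s figures are comparable between the pre-read and post-read.

```python
import re

# Matches preclear progress lines in the format of the syslog excerpts in this post.
PROGRESS_RE = re.compile(
    r"(?P<phase>Pre-Read|Post-Read): progress - (?P<pct>\d+)% "
    r"(?:read|verified) @ (?P<speed>\d+) MB/s"
)


def parse_progress(lines):
    """Return {phase: {percent: speed_mb_s}} for every progress line found."""
    result = {}
    for line in lines:
        m = PROGRESS_RE.search(line)
        if m:
            result.setdefault(m["phase"], {})[int(m["pct"])] = int(m["speed"])
    return result


def slow_zones(progress, ratio=0.7):
    """List (percent, pre_speed, post_speed) where the pre-read speed
    was below `ratio` times the post-read speed at the same point."""
    pre = progress.get("Pre-Read", {})
    post = progress.get("Post-Read", {})
    return [
        (pct, pre[pct], post[pct])
        for pct in sorted(pre)
        if pct in post and pre[pct] < ratio * post[pct]
    ]


# Lines copied from the log excerpts in this post.
log = [
    "Oct 17 21:23:15 MyNas preclear_disk_2SGAWHDJ[18758]: Pre-Read: progress - 20% read @ 98 MB/s",
    "Oct 19 03:35:05 MyNas preclear_disk_2SGAWHDJ[18758]: Post-Read: progress - 10% verified @ 187 MB/s",
    "Oct 19 04:49:52 MyNas preclear_disk_2SGAWHDJ[18758]: Post-Read: progress - 20% verified @ 167 MB/s",
]

print(slow_zones(parse_progress(log)))
```

With the three lines above, only the 20% mark is flagged (98 MB/s on the pre-read against 167 MB/s on the post-read).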

 

 

 

So, based on this happening in two totally independent servers, the errors must come from the drive. The questions I have are:

 

1) Is the disk actually good? The errors seem to reduce with usage. Is this part of some early-life adaptation process?

If this was just some early-life behaviour then I'd rather keep it, as it's a He10, than junk it or attempt a return.

Currently there are no 'faults' to return it for. 

 

2) Why do these errors not show up in SMART / Preclear? The same errors occur on both servers at the same points.

Given there were originally SMART 001 errors when the drive was part of the array, I am surprised not to see them now after the read failures.

 

3) I have some more drives to pre-clear, and it feels like I need to keep a close eye on the logs rather than just rely on SMART/Preclear.

Does anyone have any experience of anything similar?
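
For the log check in question 3, something like the following could be run after each preclear. It is a rough Python sketch, not a tested tool: the function name, the keyword list and the sample lines are all hypothetical. The sample lines are written in the general style of kernel ATA messages for illustration only and are not copied from my logs, apart from the preclear progress line.

```python
import re
from collections import Counter


def count_device_errors(lines, ata_port="ata1.00", dev="sdb"):
    """Count syslog lines that mention the drive and look like an error."""
    # Assumption: error lines name the ATA port or the block device and
    # contain one of these keywords; adjust to whatever your logs show.
    device_re = re.compile(rf"\b({re.escape(ata_port)}|{re.escape(dev)})\b")
    keyword_re = re.compile(r"error|exception|failed", re.IGNORECASE)
    hits = Counter()
    for line in lines:
        m = device_re.search(line)
        if m and keyword_re.search(line):
            hits[m.group(1)] += 1
    return hits


# Illustrative input only: the kernel lines are made up, not real log output.
sample = [
    "Oct 17 20:01:02 MyNas kernel: ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0",
    "Oct 17 20:01:02 MyNas kernel: ata1.00: failed command: READ DMA EXT",
    "Oct 17 21:23:15 MyNas preclear_disk_2SGAWHDJ[18758]: Pre-Read: progress - 20% read @ 98 MB/s",
    "Oct 17 20:01:05 MyNas kernel: ata2.00: failed command: READ DMA EXT",
]

print(count_device_errors(sample))
```

In practice the lines would come from the server's syslog rather than a hard-coded list, and any non-zero count for the drive being precleared would be a prompt to look at the log by hand.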

 

Interested in your thoughts.

 

Edit:

I don't have the logs for the original error on the other server or the extended SMART test I ran.

Latest logs attached.

 

 

 

 

mynas-diagnostics-20201019-1943.zip

Edited by Decto

If it is still under warranty, I'd start that process.

I don't have any helpful advice on your current issue.  If you don't get anything directly helpful to your question about if the drive is good or not - I'd run a 3x preclear on it and see what the logs say. In my limited experience the only drives I've had that failed preclear (4 of them),  failed after the first cycle, usually during the 2nd cycle (and once in the 3rd cycle). Every drive I have that passed 3 preclear cycles was retired by choice rather than necessity.

3 hours ago, whipdancer said:

If it is still under warranty, I'd start that process.

I don't have any helpful advice on your current issue.  If you don't get anything directly helpful to your question about if the drive is good or not - I'd run a 3x preclear on it and see what the logs say. In my limited experience the only drives I've had that failed preclear (4 of them),  failed after the first cycle, usually during the 2nd cycle (and once in the 3rd cycle). Every drive I have that passed 3 preclear cycles was retired by choice rather than necessity.

 

RMA is a sticky point as all the SMART data is good. Usually you need some sort of logged failure like reallocated sectors etc.

So I am a bit stuck until it really fails.

 

Currently running another set of preclear cycles.

It's already passed through the 'issue' area on the first pass; interestingly, it also seems to be running faster than previously.

Oct 19 22:11:08 preclear_disk_2SGAWHDJ_28904: Pre-Read: progress - 10% read @ 192 MB/s
Oct 19 23:24:22 preclear_disk_2SGAWHDJ_28904: Pre-Read: progress - 20% read @ 186 MB/s

 

Current plan is to run another cycle or two and see what happens. If it holds up I'll throw it in as parity 2 to give it a good workout and get some data for interest.

 

Had I run the full three cycles up front, it may well have 'worked through' this issue since nothing gets reported.

 

That's also another reason for discussing here. These errors may have been in the first clearance I ran many months ago, however as they don't report and I wasn't looking for them in the logs, I could easily have been sitting on a precleared but defective drive which would have fallen over in a rebuild.

 

 

 

17 hours ago, Decto said:

 

RMA is a sticky point as all the SMART data is good. Usually you need some sort of logged failure like reallocated sectors etc.

So I am a bit stuck until it really fails.

Depends on how you approach it. I've never been denied an RMA, regardless of reason. Performance issues, regardless of SMART status, are a valid reason to RMA.


So an update.

 

Summary of initial post

When originally added to an array, the drive had read errors in the 10-20% zone. These were not sufficient to trigger SMART or a pending reallocation, and a non-correcting parity check did not find any errors. These errors seemed to occur during writes, as if the drive was re-reading what it had just written.

 

A pre clear (Read/Zero/Read) in a completely different system had read errors on the pre-read in the 10-20% zone but no errors in the Zero or post read.

 

 

Update

 

The drive has since had two additional full pre-clear cycles (Read,Zero,Read) which completed at the expected speed and without any errors in the logs.

I then added it as parity 2, let the parity rebuild and then ran a non correcting check.

Not a single error in the logs and smart data remains clean.

 

So I guess for now I'll keep it in the parity 2 slot.

I'm about to add a 5th array drive with a 6th already pre-cleared for future expansion (prime day deals)

 

There is little information about how drives behave, so I wonder if the sectors were initially a little lazy or if the drive has in some way compensated.

I have read that all drives are effectively read and written to in the factory post assembly and during this any substandard sectors are excluded. I was curious if a few sectors would get remapped and then the drive would behave normally, however it seems to have returned to normal without any remapped sectors.

 

Let's see how it gets on.

 

 

 

 
