Failed/Offline Disk


Netbug

UnRaid Version 6.9.2

 

I noticed that the system was responding very sluggishly for some reason so I had a quick look, and disk 8 was showing offline. I stopped the array, removed the drive from the array, started array, stopped array, re-assigned, started, and waited for rebuild. Everything seemed fine. Then the next day, the same thing happened.

 

This is where I screwed up. I completely removed the drive and attempted to pre-clear it (data wasn't super important). Left it for about 30 hours, and came back to a message of "Error encountered, please verify the log".

 

I now have a replacement drive, which I am installing now, but I've got a few of these drives that seem to have failed, and I'm not knowledgeable enough to tell whether they are actually dead. I've tried reading through the "Understanding SMART Reports" article but I'm just not smart enough to get it.

 

I'm attaching logs here. My questions are:

 

1. Is there any way to know, from the logs and information, what happened?

2. Is there something I can purchase (like an external SATA connector) for my Windows PC so I can use the Windows machine to check these supposedly failed drives?

 

Thanks.

preclear log.png

Preclear Overview.PNG

tower-diagnostics-20211118-0549.zip tower-syslog-20211118-1048.zip


Drive Z5029NBR has only a very old SMART test on record, and the attributes do not look good (in particular #197 and #198, but also #5):

Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
  1 Raw_Read_Error_Rate     POSR--   086   083   006    -    202923257
  3 Spin_Up_Time            PO----   095   094   000    -    0
  4 Start_Stop_Count        -O--CK   100   100   020    -    179
  5 Reallocated_Sector_Ct   PO--CK   082   082   010    -    22944
  7 Seek_Error_Rate         POSR--   075   060   030    -    36487737
  9 Power_On_Hours          -O--CK   044   044   000    -    49544
 10 Spin_Retry_Count        PO--C-   100   100   097    -    0
 12 Power_Cycle_Count       -O--CK   100   100   020    -    177
183 Runtime_Bad_Block       -O--CK   100   100   000    -    0
184 End-to-End_Error        -O--CK   100   100   099    -    0
187 Reported_Uncorrect      -O--CK   001   001   000    -    1152
188 Command_Timeout         -O--CK   097   092   000    -    14 14 31
189 High_Fly_Writes         -O-RCK   088   088   000    -    12
190 Airflow_Temperature_Cel -O---K   071   064   045    -    29 (Min/Max 25/36)
191 G-Sense_Error_Rate      -O--CK   100   100   000    -    0
192 Power-Off_Retract_Count -O--CK   100   100   000    -    30
193 Load_Cycle_Count        -O--CK   001   001   000    -    221440
194 Temperature_Celsius     -O---K   029   040   000    -    29 (0 13 0 0 0)
197 Current_Pending_Sector  -O--C-   099   089   000    -    280
198 Offline_Uncorrectable   ----C-   099   089   000    -    280
199 UDMA_CRC_Error_Count    -OSRCK   200   200   000    -    0
240 Head_Flying_Hours       ------   100   253   000    -    6600h+36m+56.005s
241 Total_LBAs_Written      ------   100   253   000    -    121723793480
242 Total_LBAs_Read         ------   100   253   000    -    1372755437944
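
If you would rather check a table like this with a script than by eye, here is a minimal Python sketch. It assumes the report text has the same column layout as the table pasted above. The set of watched IDs (5, 187, 197, 198) is just an illustrative choice, and the function names are made up for this example.

```python
import re

# Attribute IDs worth watching (an illustrative choice, not an official list).
WATCHED_IDS = {5, 187, 197, 198}

# One table row: ID, name, flags, value, worst, threshold, fail marker, raw value.
ROW = re.compile(
    r"^\s*(\d+)\s+(\S+)\s+(\S+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\S+)\s+(.+)$"
)


def parse_attributes(text):
    """Return one dict per attribute row; header and other lines are skipped."""
    rows = []
    for line in text.splitlines():
        match = ROW.match(line)
        if match is None:
            continue
        ident, name, _flags, value, worst, thresh, _fail, raw = match.groups()
        rows.append({
            "id": int(ident),
            "name": name,
            "value": int(value),
            "worst": int(worst),
            "thresh": int(thresh),
            "raw": raw.strip(),
        })
    return rows


def leading_int(raw):
    """Raw values can carry extra text (e.g. '29 (Min/Max 25/36)'); keep the first number."""
    match = re.match(r"\d+", raw)
    return int(match.group()) if match else 0


def nonzero_watched(rows):
    """List (id, name, raw count) for watched attributes whose raw count is above 0."""
    return [
        (r["id"], r["name"], leading_int(r["raw"]))
        for r in rows
        if r["id"] in WATCHED_IDS and leading_int(r["raw"]) > 0
    ]


# A few rows copied from the table above.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
  5 Reallocated_Sector_Ct   PO--CK   082   082   010    -    22944
187 Reported_Uncorrect      -O--CK   001   001   000    -    1152
197 Current_Pending_Sector  -O--C-   099   089   000    -    280
198 Offline_Uncorrectable   ----C-   099   089   000    -    280
199 UDMA_CRC_Error_Count    -OSRCK   200   200   000    -    0
"""

for ident, name, count in nonzero_watched(parse_attributes(SAMPLE)):
    print(f"#{ident} {name}: raw value {count}")
```

On the sample rows this reports #5, #187, #197 and #198, and leaves out #199 because its raw value is 0.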

 

You could try to run an extended SMART test, but don't get your hopes up.

 

As for what happened, I'm not sure myself; maybe another user with more knowledge can chime in.

I'd guess that it is just an old drive (over 49,500 power-on hours).
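
To put the power-on hours in perspective, here is a quick conversion sketch in Python. It assumes the raw value of attribute 9 is a plain hour count, which is how it is read in this thread, and it ignores leap years.

```python
HOURS_PER_YEAR = 24 * 365  # leap years ignored for a rough figure


def hours_to_years(hours):
    """Convert a power-on hour count into years of continuous running."""
    return hours / HOURS_PER_YEAR


power_on_hours = 49544  # raw value of attribute 9 in the report above
print(f"{hours_to_years(power_on_hours):.1f} years powered on")
```

That works out to roughly 5.7 years of spinning time.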


Thank you for the replies. I still don't understand quite how to interpret those results. I'll have to dig in to what Current_Pending_Sector and Offline_Uncorrectable mean and what thresholds are acceptable.

 

Any recommendations for a Windows utility to test drives (I know it's slightly off-topic)?

4 minutes ago, Netbug said:

I'll have to dig in to what Current_Pending_Sector and Offline_Uncorrectable mean and what thresholds are acceptable.

You can start here: https://en.wikipedia.org/wiki/S.M.A.R.T.

 

For #198, the only acceptable raw value is 0.

For #197, it should not stay above 0 for too long. The sectors should go from Pending to Reallocated (#5), but there is only a limited number of reserve sectors the drive can use.
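
The two rules above can be sketched as a comparison between two SMART snapshots taken some time apart. This is only a rough Python illustration: the function name is made up, the "current" numbers come from the report in this thread, and the "earlier" numbers are invented purely to show the comparison.

```python
def assess_sectors(earlier, current):
    """Compare raw values (attribute ID -> count) from two SMART snapshots."""
    notes = []
    # Rule for #198: anything above 0 is a problem.
    if current.get(198, 0) > 0:
        notes.append("#198 is above 0")
    # Rule for #197: a non-zero count that persists between checks is a problem.
    if earlier.get(197, 0) > 0 and current.get(197, 0) > 0:
        notes.append("#197 has stayed above 0 between checks")
    # Pending sectors that get remapped show up as growth in #5.
    growth = current.get(5, 0) - earlier.get(5, 0)
    if growth > 0:
        notes.append(f"#5 grew by {growth}, using up reserve sectors")
    return notes


# HYPOTHETICAL earlier snapshot, invented for illustration only.
earlier = {5: 22900, 197: 300, 198: 280}
# Raw values from the report posted in this thread.
current = {5: 22944, 197: 280, 198: 280}

for note in assess_sectors(earlier, current):
    print(note)
```

A healthy drive would produce no notes at all from a check like this.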

1 hour ago, Netbug said:

I'll have to dig in to what Current_Pending_Sector and Offline_Uncorrectable mean and what thresholds are acceptable.

The thresholds in SMART are the points at which the manufacturer outright states that the drive is failing.  However, they tend to be very skewed towards the manufacturer's best interests on some attributes.

 

In particular, for attribute 5 the value vs. the threshold (082 vs. 010) suggests that the drive is nowhere near failing.  However, roughly 23,000 reallocated sectors already (and more coming) shows that the drive is basically toast.  Many users think that a single reallocated sector is grounds to replace a drive.  I can deal with ~100 before I start to get worried.  Over a hundred and I'll order a new drive online.  At 23,000, I'd be going to the nearest brick and mortar store.
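
Here is a small Python sketch of that contrast, using the attribute 5 numbers from the report. It assumes the usual SMART convention that an attribute counts as failed once its normalized value drops to or below the threshold. The 100-sector limit is only my personal rule of thumb from above, not a standard, and the function names are made up.

```python
def manufacturer_verdict(value, thresh):
    """SMART's own view: failed only when the normalized value reaches the threshold."""
    return "failing" if value <= thresh else "passing"


def rule_of_thumb(raw_reallocated, worry_limit=100):
    """Personal rule of thumb based on the raw reallocated sector count."""
    if raw_reallocated == 0:
        return "fine"
    if raw_reallocated <= worry_limit:
        return "watch it"
    return "replace the drive"


# Attribute 5 from the report: VALUE 082, THRESH 010, RAW_VALUE 22944.
print("SMART threshold says:", manufacturer_verdict(82, 10))
print("Raw count says:", rule_of_thumb(22944))
```

The first line comes out as "passing" and the second as "replace the drive", which is exactly the gap between the manufacturer's threshold and what the raw count is telling you.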

 

Do you have notifications set up?  The OS would presumably have warned you about this a long time ago.
