same drive number failed... twice


tiwing


Hi all.

 

My server emailed me at 1:30 this morning: drive 6 went out on me, again. The first failure was an old 4TB Red, and I just assumed it had gone bad, so I replaced it with a 10TB Red. Less than 85 hours later that one has also failed, after a full rebuild (24 hours) plus a couple of days of usage. I swapped cables between two drives and it still looked bad (would it self-correct during power on in Unraid, or once it's got a red X it requires manual intervention?). Considering Raw_Read_Error_Rate is zero, I've unassigned the drive and reassigned it, and it's currently rebuilding. (All critical data is backed up twice in my house and offsite. Non-critical data is backed up only in my house, and the Plex stuff I just don't care about isn't backed up, but is on the array.)

 

Since the drive didn't come back to life after I swapped cables, I think I've eliminated the expansion card and the cable. (Right?)

 

I've had 2 Unraid boxes running 24/7 for 4+ years, and have had 10 drives running in one of them for over a year. Nothing else has changed in the last 6 months, and this is the first failure ever. Basic server info: ThinkStation S20 in a new tower with a Xeon 3550 CPU, 24GB ECC, plugged into a UPS. All drives are WD Reds of various ages and sizes.

Question: based on the SMART report below (do you need anything else to help analyze?), should I get in touch with Western Digital, go back to the store for a replacement, or is this just a *shrug* "weird s*&% happens" kind of thing where I see how it goes for another week?

 

Thanks for your help.

tiwing.

 

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
  1 Raw_Read_Error_Rate     PO-R--   100   100   016    -    0
  2 Throughput_Performance  --S---   130   130   054    -    108
  3 Spin_Up_Time            POS---   100   100   024    -    0
  4 Start_Stop_Count        -O--C-   100   100   000    -    3
  5 Reallocated_Sector_Ct   PO--CK   100   100   005    -    0
  7 Seek_Error_Rate         -O-R--   100   100   067    -    0
  8 Seek_Time_Performance   --S---   128   128   020    -    18
  9 Power_On_Hours          -O--C-   100   100   000    -    85
 10 Spin_Retry_Count        -O--C-   100   100   060    -    0
 12 Power_Cycle_Count       -O--CK   100   100   000    -    3
 22 Helium_Level            PO---K   100   100   025    -    100
192 Power-Off_Retract_Count -O--CK   100   100   000    -    140
193 Load_Cycle_Count        -O--C-   100   100   000    -    140
194 Temperature_Celsius     -O----   191   191   000    -    34 (Min/Max 23/41)
196 Reallocated_Event_Count -O--CK   100   100   000    -    0
197 Current_Pending_Sector  -O---K   100   100   000    -    0
198 Offline_Uncorrectable   ---R--   100   100   000    -    0
199 UDMA_CRC_Error_Count    -O-R--   200   200   000    -    0
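
For anyone who wants to scan a table like this programmatically, here is a minimal Python sketch. It assumes the `smartctl -x` column layout shown above (ID, name, flags, value, worst, threshold, fail, raw). The `WATCH` set and the helper names `parse_attributes` and `suspicious` are illustrative choices, not part of smartmontools or Unraid. It reports watched attributes with a nonzero raw value, plus any attribute whose normalized value has dropped to a nonzero threshold.

```python
import re

# A few rows copied from the report above.
REPORT = """\
  5 Reallocated_Sector_Ct   PO--CK   100   100   005    -    0
192 Power-Off_Retract_Count -O--CK   100   100   000    -    140
194 Temperature_Celsius     -O----   191   191   000    -    34 (Min/Max 23/41)
197 Current_Pending_Sector  -O---K   100   100   000    -    0
198 Offline_Uncorrectable   ---R--   100   100   000    -    0
199 UDMA_CRC_Error_Count    -O-R--   200   200   000    -    0
"""

# ID, name, flags, value, worst, thresh, fail, then the leading integer of RAW_VALUE.
ROW = re.compile(
    r"^\s*(?P<id>\d+)\s+(?P<name>\S+)\s+(?P<flags>\S+)\s+"
    r"(?P<value>\d+)\s+(?P<worst>\d+)\s+(?P<thresh>\d+)\s+"
    r"(?P<fail>\S+)\s+(?P<raw>\d+)"
)

# Assumption: an illustrative shortlist (reallocated, pending, uncorrectable, CRC).
WATCH = {5, 197, 198, 199}


def parse_attributes(text):
    """Map attribute ID to its name, normalized value, threshold and raw value."""
    rows = {}
    for line in text.splitlines():
        m = ROW.match(line)
        if m:
            rows[int(m["id"])] = {
                "name": m["name"],
                "value": int(m["value"]),
                "thresh": int(m["thresh"]),
                "raw": int(m["raw"]),
            }
    return rows


def suspicious(rows):
    """Names of watched attributes with raw > 0, or any attribute at/below a nonzero threshold."""
    flagged = []
    for attr_id, row in sorted(rows.items()):
        below = row["thresh"] > 0 and row["value"] <= row["thresh"]
        if (attr_id in WATCH and row["raw"] > 0) or below:
            flagged.append(row["name"])
    return flagged


rows = parse_attributes(REPORT)
print(suspicious(rows))  # nothing is flagged for the rows above
```

An empty result only means these particular counters are clean. It does not explain why the slot was disabled.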

 


Device Statistics (GP Log 0x04)
Page  Offset Size        Value Flags Description
0x01  =====  =               =  ===  == General Statistics (rev 1) ==
0x01  0x008  4               3  ---  Lifetime Power-On Resets
0x01  0x010  4              85  ---  Power-on Hours
0x01  0x018  6     20935128920  ---  Logical Sectors Written
0x01  0x020  6        20521877  ---  Number of Write Commands
0x01  0x028  6      1958493232  ---  Logical Sectors Read
0x01  0x030  6         4981555  ---  Number of Read Commands
0x01  0x038  6       307138550  ---  Date and Time TimeStamp
0x03  =====  =               =  ===  == Rotating Media Statistics (rev 1) ==
0x03  0x008  4              51  ---  Spindle Motor Power-on Hours
0x03  0x010  4              51  ---  Head Flying Hours
0x03  0x018  4             140  ---  Head Load Events
0x03  0x020  4               0  ---  Number of Reallocated Logical Sectors
0x03  0x028  4               0  ---  Read Recovery Attempts
0x03  0x030  4               0  ---  Number of Mechanical Start Failures
0x04  =====  =               =  ===  == General Errors Statistics (rev 1) ==
0x04  0x008  4               0  ---  Number of Reported Uncorrectable Errors
0x04  0x010  4               0  ---  Resets Between Cmd Acceptance and Completion
0x05  =====  =               =  ===  == Temperature Statistics (rev 1) ==
0x05  0x008  1              34  ---  Current Temperature
0x05  0x010  1              35  N--  Average Short Term Temperature
0x05  0x018  1               -  N--  Average Long Term Temperature
0x05  0x020  1              41  ---  Highest Temperature
0x05  0x028  1              23  ---  Lowest Temperature
0x05  0x030  1              40  N--  Highest Average Short Term Temperature
0x05  0x038  1              25  N--  Lowest Average Short Term Temperature
0x05  0x040  1               -  N--  Highest Average Long Term Temperature
0x05  0x048  1               -  N--  Lowest Average Long Term Temperature
0x05  0x050  4               0  ---  Time in Over-Temperature
0x05  0x058  1              65  ---  Specified Maximum Operating Temperature
0x05  0x060  4               0  ---  Time in Under-Temperature
0x05  0x068  1               0  ---  Specified Minimum Operating Temperature
0x06  =====  =               =  ===  == Transport Statistics (rev 1) ==
0x06  0x008  4               5  ---  Number of Hardware Resets
0x06  0x010  4               0  ---  Number of ASR Events
0x06  0x018  4               0  ---  Number of Interface CRC Errors
0xff  =====  =               =  ===  == Vendor Specific Statistics (rev 1) ==
                                |||_ C monitored condition met
                                ||__ D supports DSN
                                |___ N normalized value

 

Pending Defects log (GP Log 0x0c)
No Defects Logged

 

SATA Phy Event Counters (GP Log 0x11)
ID      Size     Value  Description
0x0001  2            0  Command failed due to ICRC error
0x0002  2            0  R_ERR response for data FIS
0x0003  2            0  R_ERR response for device-to-host data FIS
0x0004  2            0  R_ERR response for host-to-device data FIS
0x0005  2            0  R_ERR response for non-data FIS
0x0006  2            0  R_ERR response for device-to-host non-data FIS
0x0007  2            0  R_ERR response for host-to-device non-data FIS
0x0008  2            0  Device-to-host non-data FIS retries
0x0009  2            3  Transition from drive PhyRdy to drive PhyNRdy
0x000a  2            2  Device-to-host register FISes sent due to a COMRESET
0x000b  2            0  CRC errors within host-to-device FIS
0x000d  2            0  Non-CRC errors within host-to-device FIS
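
A quick way to pull the nonzero entries out of a Phy event counter log like the one above, as a rough Python sketch. It assumes the four-column layout shown (ID, size, value, description), and `nonzero_counters` is a made-up helper name. In this report the CRC-related counters are all zero and only the link-transition and COMRESET counters are nonzero, which by itself does not say why the link dropped.

```python
import re

# A subset of the counters copied from the log above.
PHY_LOG = """\
0x0001  2            0  Command failed due to ICRC error
0x0009  2            3  Transition from drive PhyRdy to drive PhyNRdy
0x000a  2            2  Device-to-host register FISes sent due to a COMRESET
0x000b  2            0  CRC errors within host-to-device FIS
"""

# Columns: counter ID, size in bytes, value, free-text description.
LINE = re.compile(r"^(0x[0-9a-fA-F]{4})\s+(\d+)\s+(\d+)\s+(.+)$")


def nonzero_counters(text):
    """Return (id, value, description) for every counter whose value is above zero."""
    hits = []
    for line in text.splitlines():
        m = LINE.match(line.strip())
        if m and int(m.group(3)) > 0:
            hits.append((m.group(1), int(m.group(3)), m.group(4)))
    return hits


for counter_id, value, description in nonzero_counters(PHY_LOG):
    print(counter_id, value, description)
```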

14 minutes ago, tiwing said:

once it's got a red X it requires manual intervention

This. When a write fails, unraid disables the drive and all further writes to that slot are done to the emulated drive calculated from all the other drives. What was on the disabled drive is no longer used at all, and to put the drive back in the array requires rebuilding the entire drive.
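
To make the "emulated drive" idea concrete, here is a toy Python sketch. It assumes a single parity disk holding the bytewise XOR of all data disks. The three tiny "disks" and the `xor_blocks` helper are hypothetical, and this is only an illustration of the principle, not Unraid's actual code.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte strings together, position by position."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Three hypothetical data disks, each holding a three-byte "stripe".
disks = [b"\x10\x20\x30", b"\x0f\x0e\x0d", b"\xaa\xbb\xcc"]
parity = xor_blocks(disks)

# Disk index 1 gets disabled: its contents are emulated from parity plus the survivors.
survivors = [d for i, d in enumerate(disks) if i != 1]
emulated = xor_blocks(survivors + [parity])
assert emulated == disks[1]

# A write to the disabled slot only updates parity; the physical disk is never touched.
new_data = b"\x01\x02\x03"
parity = xor_blocks(survivors + [new_data])
assert xor_blocks(survivors + [parity]) == new_data

# The physical disk still holds the old bytes, so it is now stale.
assert disks[1] != new_data
```

The last assertion is the reason a rebuild is needed: after any emulated write, what is physically on the disabled disk no longer matches what the array considers to be in that slot.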

 

If you want any kind of informed advice on how to proceed, you need to attach the diagnostics zip file to your next post to this thread.
