Parity Drive Disabled Due to Errors, Best Option to Re-enable and Correct Parity?



I'm new to setting up and using unRAID, so please bear with me. I initially had a 4TB data drive and a 6TB drive for parity. I added two 8TB drives as parity drives to expand to 10TB of storage with additional redundancy, and to allow for storage expansion using 8TB drives.

 

The steps I followed were:

1) Added an 8TB drive (sde) to the array as Parity 2, and had the system build parity on it.

2) Unassigned the 6TB drive (sdb), reassigned it as Disk 2, and had the array preclear it.

3) Added the second 8TB drive (sdd) to the array and started building parity on it as well.

 

About halfway through the parity build on sdd, the system posted a notification that there were errors on sde and disabled the drive. After the parity build on sdd completed, I ran an extended self-test on sde. The SMART test results (following this post) give a health status of PASSED, and I believe the issue to be a single bad sector. I plan to shut down and clean/re-seat all connections just in case.

 

sdd (Parity), sdb (Disk 2), and sdc (Disk 1) currently show normal operation/active, while sde shows disabled. From what I can tell, my options are either:

 

a) unassign and reassign the sde drive as Parity 2, which will then be entirely rebuilt; or

b) perform a New Config and run a parity check with "Write corrections to parity" enabled, which will fix the problems on sde (Parity 2).

 

Are there any other options or steps to be aware of? What are the relative pros and cons of the foregoing options? I'd prefer not to have to rebuild parity, but if there is a distinct advantage to it, I'm happy to do so. Thanks!

 

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00)    Offline data collection activity
                    was never started.
                    Auto Offline Data Collection: Disabled.
Self-test execution status:      ( 121)    The previous self-test completed having
                    the read element of the test failed.
Total time to complete Offline 
data collection:         (    0) seconds.
Offline data collection
capabilities:              (0x73) SMART execute Offline immediate.
                    Auto Offline data collection on/off support.
                    Suspend Offline collection upon new
                    command.
                    No Offline surface scan supported.
                    Self-test supported.
                    Conveyance Self-test supported.
                    Selective Self-test supported.
SMART capabilities:            (0x0003)    Saves SMART data before entering
                    power-saving mode.
                    Supports SMART auto save timer.
Error logging capability:        (0x01)    Error logging supported.
                    General Purpose Logging supported.
Short self-test routine 
recommended polling time:      (   1) minutes.
Extended self-test routine
recommended polling time:      ( 991) minutes.
Conveyance self-test routine
recommended polling time:      (   2) minutes.
SCT capabilities:            (0x30a5)    SCT Status supported.
                    SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
  1 Raw_Read_Error_Rate     POSR--   100   064   006    -    5750
  3 Spin_Up_Time            PO----   092   091   000    -    0
  4 Start_Stop_Count        -O--CK   098   098   020    -    2650
  5 Reallocated_Sector_Ct   PO--CK   100   100   010    -    0
  7 Seek_Error_Rate         POSR--   081   060   045    -    141031912
  9 Power_On_Hours          -O--CK   087   087   000    -    11851 (133 112 0)
 10 Spin_Retry_Count        PO--C-   100   100   097    -    0
 12 Power_Cycle_Count       -O--CK   100   100   020    -    673
183 Runtime_Bad_Block       -O--CK   100   100   000    -    0
184 End-to-End_Error        -O--CK   100   100   099    -    0
187 Reported_Uncorrect      -O--CK   099   099   000    -    1
188 Command_Timeout         -O--CK   099   099   000    -    0 0 3
189 High_Fly_Writes         -O-RCK   100   100   000    -    0
190 Airflow_Temperature_Cel -O---K   073   055   040    -    27 (Min/Max 23/35)
191 G-Sense_Error_Rate      -O--CK   100   100   000    -    0
192 Power-Off_Retract_Count -O--CK   100   100   000    -    388
193 Load_Cycle_Count        -O--CK   097   097   000    -    6260
194 Temperature_Celsius     -O---K   027   045   000    -    27 (0 21 0 0 0)
195 Hardware_ECC_Recovered  -O-RC-   100   064   000    -    5750
197 Current_Pending_Sector  -O--C-   100   100   000    -    8
198 Offline_Uncorrectable   ----C-   100   100   000    -    8
199 UDMA_CRC_Error_Count    -OSRCK   200   200   000    -    0
240 Head_Flying_Hours       ------   100   253   000    -    7958h+18m+28.883s
241 Total_LBAs_Written      ------   100   253   000    -    82290957455
242 Total_LBAs_Read         ------   100   253   000    -    68402183133
                            ||||||_ K auto-keep
                            |||||__ C event count
                            ||||___ R error rate
                            |||____ S speed/performance
                            ||_____ O updated online
                            |______ P prefailure warning

General Purpose Log Directory Version 1
SMART           Log Directory Version 1 [multi-sector log support]
Address    Access  R/W   Size  Description
0x00       GPL,SL  R/O      1  Log Directory
0x01           SL  R/O      1  Summary SMART error log
0x02           SL  R/O      5  Comprehensive SMART error log
0x03       GPL     R/O      5  Ext. Comprehensive SMART error log
0x04       GPL,SL  R/O      8  Device Statistics log
0x06           SL  R/O      1  SMART self-test log
0x07       GPL     R/O      1  Extended self-test log
0x08       GPL     R/O      2  Power Conditions log
0x09           SL  R/W      1  Selective self-test log
0x0c       GPL     R/O   2048  Pending Defects log
0x10       GPL     R/O      1  NCQ Command Error log
0x11       GPL     R/O      1  SATA Phy Event Counters log
0x21       GPL     R/O      1  Write stream error log
0x22       GPL     R/O      1  Read stream error log
0x24       GPL     R/O    512  Current Device Internal Status Data log
0x30       GPL,SL  R/O      9  IDENTIFY DEVICE data log
0x80-0x9f  GPL,SL  R/W     16  Host vendor specific log
0xa1       GPL,SL  VS      24  Device vendor specific log
0xa2       GPL     VS    8160  Device vendor specific log
0xa6       GPL     VS     192  Device vendor specific log
0xa8-0xa9  GPL,SL  VS     136  Device vendor specific log
0xab       GPL     VS       1  Device vendor specific log
0xb0       GPL     VS    9048  Device vendor specific log
0xbd       GPL     VS       8  Device vendor specific log
0xbe-0xbf  GPL     VS   65535  Device vendor specific log
0xc0       GPL,SL  VS       1  Device vendor specific log
0xc1       GPL,SL  VS      16  Device vendor specific log
0xc3       GPL,SL  VS       8  Device vendor specific log
0xc4       GPL,SL  VS      24  Device vendor specific log
0xd1       GPL     VS     264  Device vendor specific log
0xd3       GPL     VS    1920  Device vendor specific log
0xe0       GPL,SL  R/W      1  SCT Command/Status
0xe1       GPL,SL  R/W      1  SCT Data Transfer

SMART Extended Comprehensive Error Log Version: 1 (5 sectors)
Device Error Count: 1
    CR     = Command Register
    FEATR  = Features Register
    COUNT  = Count (was: Sector Count) Register
    LBA_48 = Upper bytes of LBA High/Mid/Low Registers ]  ATA-8
    LH     = LBA High (was: Cylinder High) Register    ]   LBA
    LM     = LBA Mid (was: Cylinder Low) Register      ] Register
    LL     = LBA Low (was: Sector Number) Register     ]
    DV     = Device (was: Device/Head) Register
    DC     = Device Control Register
    ER     = Error register
    ST     = Status register
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 1 [0] occurred at disk power-on lifetime: 11840 hours (493 days + 8 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  40 -- 51 00 00 00 01 7a 54 10 08 00 00  Error: UNC at LBA = 0x17a541008 = 6347296776

  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  60 00 00 04 00 00 01 7a 54 0e c0 40 00  1d+07:28:33.300  READ FPDMA QUEUED
  60 00 00 04 00 00 01 7a 54 0a c0 40 00  1d+07:28:33.296  READ FPDMA QUEUED
  60 00 00 03 b8 00 01 7a 3f b6 c0 40 00  1d+07:28:29.005  READ FPDMA QUEUED
  60 00 00 04 00 00 01 7a 3f b2 c0 40 00  1d+07:28:29.002  READ FPDMA QUEUED
  60 00 00 04 00 00 01 7a 3f ae c0 40 00  1d+07:28:28.998  READ FPDMA QUEUED

SMART Extended Self-test Log Version: 1 (1 sectors)
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure       90%     11851         6347296776
# 2  Short offline       Aborted by host               90%     11841         -
# 3  Short offline       Completed without error       00%     11835         -
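
For anyone reading the report above, here is a minimal Python sketch that pulls out the attributes usually watched for media problems and double-checks the LBA conversion from the error log. The choice of attribute IDs (5, 187, 197, 198) and the helper name `nonzero_watched` are assumptions for illustration, not anything unRAID or smartctl defines. On the rows copied from this report, Current_Pending_Sector has a raw value of 8 rather than 1, which may be worth keeping in mind when deciding whether this is a single bad sector.

```python
import re

# Attribute IDs chosen for illustration; names are as printed in the report above.
WATCHED = {5, 187, 197, 198}

# ID, name, six-character flag field, VALUE, WORST, THRESH, FAIL, first raw number.
ROW = re.compile(
    r"^\s*(\d+)\s+(\S+)\s+([A-Z-]{6})\s+(\d+)\s+(\d+)\s+(\d+)\s+(\S+)\s+(\d+)"
)

def nonzero_watched(report: str) -> dict[str, int]:
    """Return {attribute name: raw value} for watched attributes with a raw value above 0."""
    found = {}
    for line in report.splitlines():
        m = ROW.match(line)
        if not m:
            continue  # not an attribute row
        attr_id, name, raw = int(m.group(1)), m.group(2), int(m.group(8))
        if attr_id in WATCHED and raw > 0:
            found[name] = raw
    return found

# Rows copied verbatim from the report above.
sample = """\
  5 Reallocated_Sector_Ct   PO--CK   100   100   010    -    0
187 Reported_Uncorrect      -O--CK   099   099   000    -    1
197 Current_Pending_Sector  -O--C-   100   100   000    -    8
198 Offline_Uncorrectable   ----C-   100   100   000    -    8
199 UDMA_CRC_Error_Count    -OSRCK   200   200   000    -    0
"""

print(nonzero_watched(sample))

# The error log reports "UNC at LBA = 0x17a541008"; convert it to decimal
# to compare against LBA_of_first_error in the self-test log.
print(int("17a541008", 16))
```

The decimal LBA printed by the last line matches the LBA_of_first_error in the extended self-test log, so the error that was logged during the parity build and the self-test failure point at the same location on the disk.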

 


There is not much difference between your options, as both of them require every sector on the disabled parity disk to be accessed. The only difference is that in one you are reading every sector and in the other you are writing them. Personally, I would go with rebuilding parity: since a write to the drive has already failed (which is why it was disabled in the first place), you now want to know whether you can reliably write to that drive without errors. If you cannot, the drive will need replacing.

Just now, rsbonini said:

So from what I can tell, it was a failure on read as it occurred while building the other parity drive.  I don't think this changes your point (and barring any other input I'll take your advice), but wanted to clarify. 

No, it was a write failure, as that is the only time unRAID disables a drive. The write failure could have been triggered by a read failure: unRAID would then have tried to correct it by rewriting the sector it had just failed to read, and that write failed.
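
The sequence described in this reply can be written down as a tiny decision function. This is only a toy model of the explanation above, not unRAID's actual code; the function name `handle_read` and its two boolean inputs are made up for illustration.

```python
def handle_read(read_ok: bool, rewrite_ok: bool) -> str:
    """Toy model: what happens to a drive after one sector read, per the reply above."""
    if read_ok:
        return "ok"
    # A failed read is followed by an attempt to rewrite the sector
    # that could not be read.
    if rewrite_ok:
        return "read error corrected"
    # Only a failed write results in the drive being disabled.
    return "disabled"

for read_ok, rewrite_ok in [(True, True), (False, True), (False, False)]:
    print(read_ok, rewrite_ok, "->", handle_read(read_ok, rewrite_ok))
```

In this model a read error alone never disables the drive. The drive ends up disabled only when the follow-up write also fails, which is consistent with a read error showing up in the SMART log for a drive that was disabled on a write.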
