rsbonini Posted April 19, 2021

New to setting up and using unRAID, so please bear with me. I initially had a 4TB data drive and a 6TB drive for parity. I added two 8TB drives as parity drives to expand to 10TB of storage with additional redundancy, and to allow for future expansion using 8TB drives. The steps I followed were:

1) Added an 8TB drive (sde) to the array as Parity 2, and had the system build parity on it.
2) Unassigned the 6TB drive (sdb), reassigned it as Disk 2, and had the array preclear it.
3) Added the second 8TB drive (sdd) to the array and started building parity on it as well.

About halfway through the parity build on sdd, the system posted a notification that there were errors on sde and disabled the drive. After the parity build on sdd completed, I ran an extended self-test on sde. The SMART results (below) give a health status of PASSED, and I believe the issue to be a single bad sector. I plan to shut down and clean/re-seat all connections just in case. sdd (Parity), sdb (Disk 2), and sdc (Disk 1) currently show normal operation/active, while sde shows disabled.

From what I can tell, my options are either:

a) unassign/reassign the sde drive to Parity 2, which will cause it to be entirely rebuilt; or
b) perform a New Config and run a parity check with "Write corrections to parity" enabled, which will fix the problems on sde (Parity 2).

Are there any other options or steps to be aware of? What are the relative pros and cons of the foregoing options? I'd prefer not to have to rebuild parity, but if there is a distinct advantage to it, I'm happy to do so. Thanks!

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00)  Offline data collection activity
                                         was never started.
                                         Auto Offline Data Collection: Disabled.
Self-test execution status:      ( 121)  The previous self-test completed having
                                         the read element of the test failed.
Total time to complete Offline
data collection:                 (    0) seconds.
Offline data collection
capabilities:                    (0x73)  SMART execute Offline immediate.
                                         Auto Offline data collection on/off support.
                                         Suspend Offline collection upon new command.
                                         No Offline surface scan supported.
                                         Self-test supported.
                                         Conveyance Self-test supported.
                                         Selective Self-test supported.
SMART capabilities:              (0x0003) Saves SMART data before entering
                                         power-saving mode.
                                         Supports SMART auto save timer.
Error logging capability:        (0x01)  Error logging supported.
                                         General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        ( 991) minutes.
Conveyance self-test routine
recommended polling time:        (   2) minutes.
SCT capabilities:                (0x30a5) SCT Status supported.
                                         SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
  1 Raw_Read_Error_Rate     POSR--   100   064   006    -    5750
  3 Spin_Up_Time            PO----   092   091   000    -    0
  4 Start_Stop_Count        -O--CK   098   098   020    -    2650
  5 Reallocated_Sector_Ct   PO--CK   100   100   010    -    0
  7 Seek_Error_Rate         POSR--   081   060   045    -    141031912
  9 Power_On_Hours          -O--CK   087   087   000    -    11851 (133 112 0)
 10 Spin_Retry_Count        PO--C-   100   100   097    -    0
 12 Power_Cycle_Count       -O--CK   100   100   020    -    673
183 Runtime_Bad_Block       -O--CK   100   100   000    -    0
184 End-to-End_Error        -O--CK   100   100   099    -    0
187 Reported_Uncorrect      -O--CK   099   099   000    -    1
188 Command_Timeout         -O--CK   099   099   000    -    0 0 3
189 High_Fly_Writes         -O-RCK   100   100   000    -    0
190 Airflow_Temperature_Cel -O---K   073   055   040    -    27 (Min/Max 23/35)
191 G-Sense_Error_Rate      -O--CK   100   100   000    -    0
192 Power-Off_Retract_Count -O--CK   100   100   000    -    388
193 Load_Cycle_Count        -O--CK   097   097   000    -    6260
194 Temperature_Celsius     -O---K   027   045   000    -    27 (0 21 0 0 0)
195 Hardware_ECC_Recovered  -O-RC-   100   064   000    -    5750
197 Current_Pending_Sector  -O--C-   100   100   000    -    8
198 Offline_Uncorrectable   ----C-   100   100   000    -    8
199 UDMA_CRC_Error_Count    -OSRCK   200   200   000    -    0
240 Head_Flying_Hours       ------   100   253   000    -    7958h+18m+28.883s
241 Total_LBAs_Written      ------   100   253   000    -    82290957455
242 Total_LBAs_Read         ------   100   253   000    -    68402183133
                            ||||||_ K auto-keep
                            |||||__ C event count
                            ||||___ R error rate
                            |||____ S speed/performance
                            ||_____ O updated online
                            |______ P prefailure warning

General Purpose Log Directory Version 1
SMART           Log Directory Version 1 [multi-sector log support]
Address    Access  R/W   Size  Description
0x00       GPL,SL  R/O      1  Log Directory
0x01           SL  R/O      1  Summary SMART error log
0x02           SL  R/O      5  Comprehensive SMART error log
0x03       GPL     R/O      5  Ext. Comprehensive SMART error log
0x04       GPL,SL  R/O      8  Device Statistics log
0x06           SL  R/O      1  SMART self-test log
0x07       GPL     R/O      1  Extended self-test log
0x08       GPL     R/O      2  Power Conditions log
0x09           SL  R/W      1  Selective self-test log
0x0c       GPL     R/O   2048  Pending Defects log
0x10       GPL     R/O      1  NCQ Command Error log
0x11       GPL     R/O      1  SATA Phy Event Counters log
0x21       GPL     R/O      1  Write stream error log
0x22       GPL     R/O      1  Read stream error log
0x24       GPL     R/O    512  Current Device Internal Status Data log
0x30       GPL,SL  R/O      9  IDENTIFY DEVICE data log
0x80-0x9f  GPL,SL  R/W     16  Host vendor specific log
0xa1       GPL,SL  VS      24  Device vendor specific log
0xa2       GPL     VS    8160  Device vendor specific log
0xa6       GPL     VS     192  Device vendor specific log
0xa8-0xa9  GPL,SL  VS     136  Device vendor specific log
0xab       GPL     VS       1  Device vendor specific log
0xb0       GPL     VS    9048  Device vendor specific log
0xbd       GPL     VS       8  Device vendor specific log
0xbe-0xbf  GPL     VS   65535  Device vendor specific log
0xc0       GPL,SL  VS       1  Device vendor specific log
0xc1       GPL,SL  VS      16  Device vendor specific log
0xc3       GPL,SL  VS       8  Device vendor specific log
0xc4       GPL,SL  VS      24  Device vendor specific log
0xd1       GPL     VS     264  Device vendor specific log
0xd3       GPL     VS    1920  Device vendor specific log
0xe0       GPL,SL  R/W      1  SCT Command/Status
0xe1       GPL,SL  R/W      1  SCT Data Transfer

SMART Extended Comprehensive Error Log Version: 1 (5 sectors)
Device Error Count: 1
    CR     = Command Register
    FEATR  = Features Register
    COUNT  = Count (was: Sector Count) Register
    LBA_48 = Upper bytes of LBA High/Mid/Low Registers ]  ATA-8
    LH     = LBA High (was: Cylinder High) Register    ]   LBA
    LM     = LBA Mid (was: Cylinder Low) Register      ] Register
    LL     = LBA Low (was: Sector Number) Register     ]
    DV     = Device (was: Device/Head) Register
    DC     = Device Control Register
    ER     = Error register
    ST     = Status register
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 1 [0] occurred at disk power-on lifetime: 11840 hours (493 days + 8 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  40 -- 51 00 00 00 01 7a 54 10 08 00 00  Error: UNC at LBA = 0x17a541008 = 6347296776

  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  60 00 00 04 00 00 01 7a 54 0e c0 40 00  1d+07:28:33.300  READ FPDMA QUEUED
  60 00 00 04 00 00 01 7a 54 0a c0 40 00  1d+07:28:33.296  READ FPDMA QUEUED
  60 00 00 03 b8 00 01 7a 3f b6 c0 40 00  1d+07:28:29.005  READ FPDMA QUEUED
  60 00 00 04 00 00 01 7a 3f b2 c0 40 00  1d+07:28:29.002  READ FPDMA QUEUED
  60 00 00 04 00 00 01 7a 3f ae c0 40 00  1d+07:28:28.998  READ FPDMA QUEUED

SMART Extended Self-test Log Version: 1 (1 sectors)
Num  Test_Description    Status                   Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure        90%  11851            6347296776
# 2  Short offline       Aborted by host                90%  11841            -
# 3  Short offline       Completed without error        00%  11835            -
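Whichever option is chosen, the Current_Pending_Sector raw value (8 in the report above) is the number worth watching before and after re-seating the cables. A minimal sketch of pulling that value out with awk; here it is fed from the line in the report above, whereas on the live system you would pipe `smartctl -A /dev/sde` (device name from this thread) into the same command:

```shell
# Extract the raw Current_Pending_Sector count from smartctl attribute output.
# Sample input is copied verbatim from the SMART report in this post; on a
# running server, replace the here-document with: smartctl -A /dev/sde
pending=$(awk '/Current_Pending_Sector/ {print $NF}' <<'EOF'
197 Current_Pending_Sector  -O--C-   100   100   000    -    8
EOF
)
echo "Pending sectors: $pending"
```

If the count drops to 0 after the pending sectors are rewritten, the sectors were recovered in place; if Reallocated_Sector_Ct rises instead, the drive has remapped them to spares.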
itimpi Posted April 19, 2021

There is not much difference between your options, as both require every sector on the disabled parity disk to be accessed. The only difference is that in one you are reading every sector, and in the other you are writing them. I would personally go with rebuilding parity: since you have already had a write to the drive fail (which is why it was disabled in the first place), you now want to know whether you can reliably write to that drive without errors; if not, the drive will need replacing.
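The point that both options traverse the whole disk can be put in rough numbers. This is only a sketch; the ~150 MB/s average throughput is an assumption for a typical 8TB drive across the whole platter, not a measurement of this one:

```shell
# Back-of-envelope duration for either option: a correcting check reads
# every sector of the 8 TB parity disk, a rebuild writes every sector,
# so both take on the order of the disk size divided by sustained speed.
disk_bytes=8000000000000    # 8 TB parity disk
avg_rate=150000000          # bytes/sec -- ASSUMED average, not measured
hours=$(( disk_bytes / avg_rate / 3600 ))
echo "roughly ${hours}+ hours either way"
```

Either way the wall-clock cost is comparable, which is why the deciding factor is what each option tells you about the drive, not how long it takes.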
rsbonini Posted April 19, 2021

From what I can tell, it was a failure on read, as it occurred while building the other parity drive. I don't think this changes your point (and, barring any other input, I'll take your advice), but I wanted to clarify.
itimpi Posted April 19, 2021

Just now, rsbonini said:
"From what I can tell, it was a failure on read, as it occurred while building the other parity drive. I don't think this changes your point (and, barring any other input, I'll take your advice), but I wanted to clarify."

No — it was a write failure, as that is the only time unRAID disables a drive. The write failure could have been triggered by a read failure: when a read fails, unRAID tries to correct it by recomputing the sector's contents and rewriting the sector it had just failed to read, and it was that write that failed.
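For the first parity disk (P), each parity sector is simply the XOR of the corresponding sectors on the data disks, which is what lets unRAID recompute and rewrite a parity sector it failed to read. A toy sketch of that recomputation; the byte values are made up, and note that Parity 2 (Q) uses a different, Reed-Solomon-style code rather than plain XOR:

```shell
# Toy illustration: the P-parity byte is the XOR of the data-disk bytes,
# so a parity sector that cannot be read can be regenerated from the data
# disks and written back. If that write-back fails, the drive is disabled.
d1=0xA5                      # hypothetical byte from Disk 1
d2=0x3C                      # hypothetical byte from Disk 2
parity=$(( d1 ^ d2 ))        # what should be stored on the P parity disk
printf 'recomputed parity byte: 0x%02X\n' "$parity"
```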
rsbonini Posted April 19, 2021

OK, interesting. Thank you for the info, much appreciated.