rick.p Posted August 4, 2015
OK, not gonna get deep in this yet, just some general thoughts please.
6.0.1 in a 10-drive array. Just replaced the parity drive (3tb>4tb, precleared x3). Did the parity rebuild and all was good. Then I kicked off the parity check, and when I checked this morning it said it found 80 errors. There are NO errors showing on the MAIN tab. This setup has been running for several months with 0 errors every month.
DATA POINT: The array was being written to at the time with new files.
a) Could writing to the array while the check was running be the culprit?
b) UNFORTUNATELY the 'write corrections to disk' box was checked... BY DEFAULT. OPINION: DESTRUCTIVE (writing corrections) should NEVER be a DEFAULT OPTION.
c) I've kicked off a new check, WITHOUT write checked.
d) Am I in deep stuff?
Replaced the parity because I planned on swapping out some 1tb and 2tb drives for 4's.
The SMART for the parity appears clean (said like he knows what he's talking about):
smartctl 6.2 2013-07-26 r3841 [x86_64-linux-4.0.4-unRAID] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Device Model: WDC WD40EZRX-00SPEB0
Serial Number: WD-WCC4E5RDX2N5
LU WWN Device Id: 5 0014ee 26006bcf0
Firmware Version: 80.00A80
User Capacity: 4,000,753,476,096 bytes [4.00 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Rotation Rate: 5400 rpm
Device is: Not in smartctl database [for details use: -P showall]
ATA Version is: ACS-2 (minor revision not indicated)
SATA Version is: SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Mon Aug 3 21:16:25 2015 MDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status: (0x80) Offline data collection activity was never started. Auto Offline Data Collection: Enabled.
Self-test execution status: ( 0) The previous self-test routine completed without error or no self-test has ever been run.
Total time to complete Offline data collection: (51120) seconds.
Offline data collection capabilities: (0x7b) SMART execute Offline immediate. Auto Offline data collection on/off support. Suspend Offline collection upon new command. Offline surface scan supported. Self-test supported. Conveyance Self-test supported. Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering power-saving mode. Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported. General Purpose Logging supported.
Short self-test routine recommended polling time: ( 2) minutes.
Extended self-test routine recommended polling time: ( 512) minutes.
Conveyance self-test routine recommended polling time: ( 5) minutes.
SCT capabilities: (0x7035) SCT Status supported. SCT Feature Control supported. SCT Data Table supported.
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   182   180   021    Pre-fail  Always       -       7883
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       12
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   100   253   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       109
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       9
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       5
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       465
194 Temperature_Celsius     0x0022   123   105   000    Old_age   Always       -       29
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   100   253   000    Old_age   Offline      -       0
SMART Error Log Version: 1
No Errors Logged
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%       103         -
SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
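For anyone who would rather not eyeball every row of a report like the one above, here is a minimal Python sketch that flags the attributes people usually look at first. It assumes the attribute table has been saved as plain text in the column layout smartctl prints; the function name `worrying_attributes` and the watch list of IDs (5, 197, 198, 199) are illustrative choices, not anything unRAID or smartmontools provides.

```python
# Hypothetical helper: flag watched SMART attributes whose raw value is non-zero.
# The watch list is an assumption (reallocated, pending, uncorrectable, CRC errors).
WATCH_IDS = {5, 197, 198, 199}

# A few rows copied from the report in the post above.
SAMPLE = """\
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       109
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
"""

def worrying_attributes(table_text, watch_ids=WATCH_IDS):
    """Return {attribute_name: raw_value} for watched attributes with raw value > 0."""
    flagged = {}
    for line in table_text.splitlines():
        fields = line.split()
        # A data row has at least 10 columns and starts with a numeric ID.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        attr_id, name, raw = int(fields[0]), fields[1], fields[9]
        if attr_id in watch_ids and raw.isdigit() and int(raw) > 0:
            flagged[name] = int(raw)
    return flagged

print(worrying_attributes(SAMPLE))  # {} for the rows above: nothing flagged
```

An empty result only says these particular counters are zero. It does not prove the drive is healthy, and it says nothing about the data disks.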
SSD Posted August 4, 2015
I think you can sleep easy. I'm assuming when you say 20 errors - you are talking about 20 sync errors? After the parity build, parity should be perfectly aligned with data.
Typically I/Os from normal write operations would not be happening in the same band that unRAID is doing its checking (seems enormously unlikely), but if it did happen, I suppose it is possible that you could get some contention, and without some sort of semaphore locking accesses, be able to corrupt parity in that way. Only Tom could confirm if this is mildly possible. But corrupting parity is no big deal if all of the data drives are good. I would stop your non-correction check and run a correction check. Then run a non-correction check and confirm 0 errors. It is still a bit of a mystery and I suppose there is a possibility that one of your data disks is flaking out. MD5s would be helpful in this sort of situation. But I'd say odds are well in your favor.
rick.p Posted August 4, 2015 Author
Well, the first one WAS a correcting check (see the part about not having that checked by default). I still have some writes going on (would be a problem to stop), but I did stop the non-correcting check till they finish. But the damage (if any) is done? Or were the writes just to the parity drive? When the system quiesces I'll run another non-correcting check and report back.
rick.p Posted August 4, 2015 Author
And it was 80 errors; the message was "check completed finding 80 errors".
garycase Posted August 4, 2015
UnRAID makes all corrections to the parity disk ... the vast majority of sync errors are in fact errors in the parity bit, so it's almost certainly the right thing to do. If you completed the correcting check (with 80 sync errors), a follow-up check (whether correcting or not) SHOULD show zero sync errors, assuming all is working well with your system [since the initial sync errors should have already been corrected]. As for whether or not the system should default to a correcting check ... that's often an area of contention, and you'll get a variety of opinions on that. I always run correcting checks (if there are errors, I want them fixed) EXCEPT after a drive rebuild, when I run a non-correcting check to confirm the rebuild went okay [if it didn't, I want to be able to redo it, and if you change parity at that point you won't be able to].
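To make the "corrections go to the parity disk" point concrete, here is a toy Python sketch. It assumes single parity behaves like plain bytewise XOR across the data disks, which is the usual single-parity scheme; the disk contents and the helper name `xor_blocks` are made up for illustration and this is not unRAID's actual code.

```python
from functools import reduce

def xor_blocks(blocks):
    """Bytewise XOR of equal-length byte strings."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# Three toy "data disks", four bytes each (made-up values).
data_disks = [b"\x01\x02\x03\x04", b"\x10\x20\x30\x40", b"\xaa\xbb\xcc\xdd"]
parity = xor_blocks(data_disks)

# Simulate one wrong byte on the parity disk.
stale_parity = bytes([parity[0] ^ 0xFF]) + parity[1:]

# A parity check recomputes parity from the data and compares it with the stored copy.
recomputed = xor_blocks(data_disks)
sync_errors = sum(1 for a, b in zip(recomputed, stale_parity) if a != b)

# A correcting check overwrites the stored parity with the recomputed value;
# the data disks are never modified.
corrected_parity = recomputed

# Rebuilding "disk 1" from parity plus the surviving disks.
good_rebuild = xor_blocks([corrected_parity, data_disks[0], data_disks[2]])
bad_rebuild = xor_blocks([stale_parity, data_disks[0], data_disks[2]])

print(sync_errors)                    # 1
print(good_rebuild == data_disks[1])  # True
print(bad_rebuild == data_disks[1])   # False
```

The sketch also shows the catch: if the mismatch was really caused by a bad data disk, recomputing parity from that data bakes the bad data into parity, which is why a non-correcting check is the cautious choice right after a rebuild.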
rick.p Posted August 4, 2015 Author
Well, a simple thought, if I can't undo it, don't do it unless I say to :-) DESTRUCTIVE things should be DELIBERATE :-)
garycase Posted August 4, 2015
It's debatable whether fixing a parity error is "destructive" => but clearly that's how you feel, so you should only run non-correcting checks. ... which begs the question: What are you going to do if the non-correcting check shows that you have a few sync errors?
rick.p Posted August 4, 2015 Author
Come back here and panic? :-)
rick.p Posted August 4, 2015 Author
And in this case, destructive means something I can't undo, whatever the reason.
garycase Posted August 4, 2015
Depends on how you view correcting a bad parity bit => I'd view that as a constructive change, not a destructive one. Doing nothing simply leaves the array in a defective state that means you can't reliably emulate or rebuild a failed disk.
rick.p Posted August 4, 2015 Author
Let me rephrase that then, a NON-REVERSIBLE operation. Default to NOT doing the correction write, then an extra step/question when you kick it off: "you know this will only check not fix? this is not reversible".. IAEF, a configuration option under Settings > Disk Settings, "Default Parity Check Operation:" ---- "Check only, no corrections" ----- "Check and correct errors" .. if the latter the extra "you sure you want to spend 14 hours doing this and not actually fix anything?"
bkastner Posted August 4, 2015
When taken in the context of the masses this logic doesn't really work, and is not what people are paying for. The majority of people who buy/use UnRAID are expecting it to just work, and to protect their environments. In the event of a disk failure users don't want to hear that even though they've been running UnRAID for 2 years, and sync errors have been reported, they were never fixed because users didn't know they had to change a default setting. Educated customers can choose to change default behavior because they either understand the risk, or are willing to take on the risk. The general public.... not so much. UnRAID needs to be configured to protect users against themselves by default, and given that the vast majority of parity sync issues appear to be incorrect writes to the parity disk, and not the data disk, it only makes sense to correct parity by default to give it the best chance of being valid and able to assist in the event of a disk failure.
garycase Posted August 4, 2015
Definitely agree. That's why it's always defaulted to that ... and indeed originally there wasn't even an option to change it to non-correcting. The last thing a user wants is for a drive to fail and discover that they can't successfully rebuild it because their parity isn't up-to-date.
rick.p Posted August 5, 2015 Author
Well back to the original issue. The story so far....
1. Replaced my original 3tb parity drive with a 4tb one, yes it was precleared 3 passes.
2. Kicked off Parity rebuild. Ran fine.
3. Started a Parity Check with correct errors enabled.
4. Parity Check finished with 80 sync errors.
5. During the time the Rebuild and Check were running there were multiple file operations running (copying data INTO the array and moving files between drives), so 80 errors that the Check in #3 should have corrected.
6. All my file operations have finished. So I kicked off a NON CORRECTING Parity check (it corrected them last time, right?)
7. It is 8 1/2 hours in (6 to go....) and.... it shows >>>> 80 Sync Errors <<<<<
Parity-Check in progress. Cancel will stop the Parity-Check. Write corrections to parity disk
Total size: 4 TB
Elapsed time: 8 hours, 38 minutes
Current position: 2.21 TB (55.2 %)
Estimated speed: 83.8 MB/sec
Estimated finish: 5 hours, 56 minutes
Sync errors detected: 80
This is from the tail of the active log....
Aug 5 01:25:41 PINEWOOD kernel: md: parity incorrect, sector=3521058840
Aug 5 01:25:41 PINEWOOD kernel: md: parity incorrect, sector=3521058848
Aug 5 01:25:41 PINEWOOD kernel: md: parity incorrect, sector=3521058856
Aug 5 01:25:41 PINEWOOD kernel: md: parity incorrect, sector=3521058864
Aug 5 01:25:41 PINEWOOD kernel: md: parity incorrect, sector=3521058872
Aug 5 01:25:41 PINEWOOD kernel: md: parity incorrect, sector=3521058880
Aug 5 01:25:41 PINEWOOD kernel: md: parity incorrect, sector=3521058888
Aug 5 01:25:41 PINEWOOD kernel: md: parity incorrect, sector=3521058896
Aug 5 01:25:41 PINEWOOD kernel: md: parity incorrect, sector=3521058904
Aug 5 01:25:41 PINEWOOD kernel: md: parity incorrect, sector=3521058912
Aug 5 01:25:41 PINEWOOD kernel: md: parity incorrect, sector=3521058920
Aug 5 01:25:41 PINEWOOD kernel: md: parity incorrect, sector=3521058928
Aug 5 01:25:41 PINEWOOD kernel: md: parity incorrect, sector=3521058936
Aug 5 01:25:41 PINEWOOD kernel: md: parity incorrect, sector=3521058944
Aug 5 01:25:41 PINEWOOD kernel: md: parity incorrect, sector=3521059224
Aug 5 01:25:41 PINEWOOD kernel: md: parity incorrect, sector=3521059232
Aug 5 01:25:41 PINEWOOD kernel: md: parity incorrect, sector=3521059240
Aug 5 01:25:41 PINEWOOD kernel: md: parity incorrect, sector=3521059248
Aug 5 01:25:41 PINEWOOD kernel: md: parity incorrect, sector=3521059256
Aug 5 01:25:41 PINEWOOD kernel: md: parity incorrect, sector=3521059264
Aug 5 01:25:41 PINEWOOD kernel: md: parity incorrect, sector=3521059344
Aug 5 01:25:41 PINEWOOD kernel: md: parity incorrect, sector=3521059352
Aug 5 01:25:41 PINEWOOD kernel: md: parity incorrect, sector=3521059360
Aug 5 01:25:41 PINEWOOD kernel: md: parity incorrect, sector=3521059368
Aug 5 01:25:41 PINEWOOD kernel: md: parity incorrect, sector=3521059376
Aug 5 01:25:41 PINEWOOD kernel: md: parity incorrect, sector=3521059384
Aug 5 01:25:41 PINEWOOD kernel: md: parity incorrect, sector=3521059392
Aug 5 01:25:41 PINEWOOD kernel: md: parity incorrect, sector=3521059400
Aug 5 01:25:41 PINEWOOD kernel: md: parity incorrect, sector=3521059408
Aug 5 01:25:41 PINEWOOD kernel: md: parity incorrect, sector=3521059416
Aug 5 01:25:41 PINEWOOD kernel: md: parity incorrect, sector=3521059424
Aug 5 01:25:41 PINEWOOD kernel: md: parity incorrect, sector=3521059432
Aug 5 01:25:41 PINEWOOD kernel: md: parity incorrect, sector=3521059440
Aug 5 01:25:41 PINEWOOD kernel: md: parity incorrect, sector=3521059448
Aug 5 01:25:41 PINEWOOD kernel: md: parity incorrect, sector=3521059456
Aug 5 01:25:41 PINEWOOD kernel: md: parity incorrect, sector=3521059464
Aug 5 01:25:41 PINEWOOD kernel: md: parity incorrect, sector=3521059480
Aug 5 01:25:41 PINEWOOD kernel: md: parity incorrect, sector=3521059488
Aug 5 02:08:23 PINEWOOD sSMTP[9957]: Creating SSL connection to host
Aug 5 02:08:23 PINEWOOD sSMTP[9957]: SSL connection using ECDHE-RSA-AES128-GCM-SHA256
Aug 5 02:08:26 PINEWOOD sSMTP[9957]: Sent mail for *******@gmail.com (221 2.0.0 closing connection u81sm1294126oie.11 - gsmtp) uid=0 username=root outbytes=1010
Aug 5 03:12:07 PINEWOOD kernel: kvm: no hardware support
Aug 5 03:12:07 PINEWOOD kernel: kvm: Nested Virtualization enabled
Aug 5 03:12:07 PINEWOOD kernel: kvm: Nested Paging enabled
Aug 5 03:14:48 PINEWOOD kernel: kvm: already loaded the other module
Aug 5 03:15:00 PINEWOOD emhttp: /usr/bin/tail -n 42 -f /var/log/syslog 2>&1
The setup is bog simple, no docker, no kvm, no nada.... only a few plugins
plugin: checking unassigned.devices.plg ...
plugin: checking preclear.disk.plg ...
plugin: checking dynamix.system.temp.plg ...
plugin: checking dynamix.system.stats.plg ...
plugin: checking dynamix.system.info.plg ...
plugin: checking NerdPack.plg ...
plugin: checking unRAIDServer.plg ...
all up to date
Now, the ONLY other incongruity that I just remembered is that on BOTH the correcting and Non-Correcting Check...
I had a pre-clear running on a drive connected via USB3 and the Unassigned Devices plugin (hey, a data point is a data point).
This is the SMART report for the Parity drive, as of 2 minutes ago:
smartctl 6.2 2013-07-26 r3841 [x86_64-linux-4.0.4-unRAID] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Device Model: WDC WD40EZRX-00SPEB0
Serial Number: WD-WCC4E5RDX2N5
LU WWN Device Id: 5 0014ee 26006bcf0
Firmware Version: 80.00A80
User Capacity: 4,000,753,476,096 bytes [4.00 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Rotation Rate: 5400 rpm
Device is: Not in smartctl database [for details use: -P showall]
ATA Version is: ACS-2 (minor revision not indicated)
SATA Version is: SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Wed Aug 5 03:07:48 2015 MDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status: (0x80) Offline data collection activity was never started. Auto Offline Data Collection: Enabled.
Self-test execution status: ( 0) The previous self-test routine completed without error or no self-test has ever been run.
Total time to complete Offline data collection: (51120) seconds.
Offline data collection capabilities: (0x7b) SMART execute Offline immediate. Auto Offline data collection on/off support. Suspend Offline collection upon new command. Offline surface scan supported. Self-test supported. Conveyance Self-test supported. Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering power-saving mode. Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported. General Purpose Logging supported.
Short self-test routine recommended polling time: ( 2) minutes.
Extended self-test routine recommended polling time: ( 512) minutes.
Conveyance self-test routine recommended polling time: ( 5) minutes.
SCT capabilities: (0x7035) SCT Status supported. SCT Feature Control supported. SCT Data Table supported.
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   182   180   021    Pre-fail  Always       -       7883
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       12
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   100   253   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       139
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       9
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       5
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       2687
194 Temperature_Celsius     0x0022   122   105   000    Old_age   Always       -       30
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   100   253   000    Old_age   Offline      -       0
SMART Error Log Version: 1
No Errors Logged
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%       103         -
SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
Data Point: the Parity and one data drive are on the M'Board controller, the other 8 drives are on an AOC-SASLP-MV8.
Now what? I can assume that this should now be (the 1st check was a correcting check, so it should have corrected?)
Replace the 4tb parity with another one?
A data drive is flakey? (mind you it's had a dozen parity checks run before, no errors ever)
Re-seat all the cables? (again)
Wait till the pre-clear is done, remove the drive from the system (Data Point: AFAIK, I've never had it running a preclear with these plugins running before) and try again?
This is giving me serious heartburn.....
garycase Posted August 5, 2015
This is a known problem -- can't find the previous thread about it right now (very tired ... heading to bed), but it's been seen before. The sync "errors" are NOT actual errors (that's the good news) ... but clearly there's a problem that's causing the false sync error count. I don't recall what was done (if anything) to resolve this in the previous thread -- you may want to search for it. Otherwise I'll see if I can find it tomorrow afternoon.
garycase Posted August 5, 2015
Found it -- the issue is reported in a couple of threads. This is one of them: http://lime-technology.com/forum/index.php?topic=38359.0 Note that the issue apparently "goes away" if you disable spin-down for your parity drive 8) You might try that to see if that's true for your specific case as well.
rick.p Posted August 5, 2015 Author
ok, just disabled spindown and am restarting the (non-correcting) parity check. It sorta makes sense: why it's 80 (apparently always) errors, all consecutive (drive has to spin down/up), why now (bigger drive takes longer to get where it's going).... However, leaving the parity drive spinning all the time is a PITA (or having to do it manually before the auto check)... ATTENTION POWERS THAT BE, maybe the parity check should AUTOMAGICALLY disable drive spin down while a parity check is running? Will report back in 13-14 hours.
rick.p Posted August 5, 2015 Author
Well didn't have to wait long.... parity drive set to not spin down, same thing....
Total size: 4 TB
Elapsed time: 12 hours, 6 minutes
Current position: 2.78 TB (69.5 %)
Estimated speed: 72.2 MB/sec
Estimated finish: 4 hours, 41 minutes
Sync errors detected: 80
errors in the same place
Aug 5 13:07:22 PINEWOOD kernel: md: parity incorrect, sector=3521058888
Aug 5 13:07:22 PINEWOOD kernel: md: parity incorrect, sector=3521058896
Aug 5 13:07:22 PINEWOOD kernel: md: parity incorrect, sector=3521058904
Aug 5 13:07:22 PINEWOOD kernel: md: parity incorrect, sector=3521058912
Aug 5 13:07:22 PINEWOOD kernel: md: parity incorrect, sector=3521058920
Aug 5 13:07:22 PINEWOOD kernel: md: parity incorrect, sector=3521058928
Aug 5 13:07:22 PINEWOOD kernel: md: parity incorrect, sector=3521058936
Aug 5 13:07:22 PINEWOOD kernel: md: parity incorrect, sector=3521058944
Aug 5 13:07:22 PINEWOOD kernel: md: parity incorrect, sector=3521059224
Aug 5 13:07:22 PINEWOOD kernel: md: parity incorrect, sector=3521059232
Aug 5 13:07:22 PINEWOOD kernel: md: parity incorrect, sector=3521059240
Aug 5 13:07:22 PINEWOOD kernel: md: parity incorrect, sector=3521059248
Aug 5 13:07:22 PINEWOOD kernel: md: parity incorrect, sector=3521059256
Aug 5 13:07:22 PINEWOOD kernel: md: parity incorrect, sector=3521059264
Aug 5 13:07:22 PINEWOOD kernel: md: parity incorrect, sector=3521059344
Aug 5 13:07:22 PINEWOOD kernel: md: parity incorrect, sector=3521059352
Aug 5 13:07:22 PINEWOOD kernel: md: parity incorrect, sector=3521059360
Aug 5 13:07:22 PINEWOOD kernel: md: parity incorrect, sector=3521059368
Aug 5 13:07:22 PINEWOOD kernel: md: parity incorrect, sector=3521059376
Aug 5 13:07:22 PINEWOOD kernel: md: parity incorrect, sector=3521059384
Aug 5 13:07:22 PINEWOOD kernel: md: parity incorrect, sector=3521059392
Aug 5 13:07:22 PINEWOOD kernel: md: parity incorrect, sector=3521059400
Aug 5 13:07:22 PINEWOOD kernel: md: parity incorrect, sector=3521059408
Aug 5 13:07:22 PINEWOOD kernel: md: parity incorrect, sector=3521059416
Aug 5 13:07:22 PINEWOOD kernel: md: parity incorrect, sector=3521059424
Aug 5 13:07:22 PINEWOOD kernel: md: parity incorrect, sector=3521059432
Aug 5 13:07:22 PINEWOOD kernel: md: parity incorrect, sector=3521059440
Aug 5 13:07:22 PINEWOOD kernel: md: parity incorrect, sector=3521059448
Aug 5 13:07:22 PINEWOOD kernel: md: parity incorrect, sector=3521059456
Aug 5 13:07:22 PINEWOOD kernel: md: parity incorrect, sector=3521059464
Aug 5 13:07:22 PINEWOOD kernel: md: parity incorrect, sector=3521059480
Aug 5 13:07:22 PINEWOOD kernel: md: parity incorrect, sector=3521059488
Should I put in a new 4tb and see what happens? This is starting to get on my nerves.
addendum.... just noticed this on the drive page
Name: Parity
Partition size: 3,906,985,768 KB (K=1024)
Partition format: GPT: 4K-aligned
3tb? it's a 4tb drive...
Device Model: WDC WD40EZRX-00SPEB0
Serial Number: WD-WCC4E5RDX2N5
LU WWN Device Id: 5 0014ee 26006bcf0
Firmware Version: 80.00A80
User Capacity: 4,000,753,476,096 bytes [4.00 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Rotation Rate: 5400 rpm
Device: Not in smartctl database [for details use: -P showall]
ATA Version: ACS-2 (minor revision not indicated)
SATA Version: SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time: Wed Aug 5 16:58:24 2015 MDT
SMART support: Available - device has SMART capability.
SMART support: Enabled
SMART overall-health: Passed
and it looks like it's getting the errors just about the time it hits the 3tb mark???
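A quick way to see where these errors sit is to pull the sector numbers out of the syslog lines and convert them to a byte offset. The Python sketch below does that for four of the lines quoted above. It assumes the md driver counts 512-byte sectors (consistent with the step of 8 between entries being one 4 KiB block, but still an assumption), and the helper names `parity_error_sectors` and `group_runs` are made up for illustration.

```python
import re

SECTOR_BYTES = 512  # assumption: the md driver reports 512-byte sectors

def parity_error_sectors(log_text):
    """Extract sector numbers from 'md: parity incorrect, sector=N' messages."""
    return [int(n) for n in re.findall(r"parity incorrect, sector=(\d+)", log_text)]

def group_runs(sectors, step=8):
    """Group sector numbers into (first, last, count) runs spaced `step` apart."""
    runs = []
    for s in sorted(sectors):
        if runs and s - runs[-1][1] == step:
            first, _, count = runs[-1]
            runs[-1] = (first, s, count + 1)
        else:
            runs.append((s, s, 1))
    return runs

# Four lines copied from the log above.
log = """\
Aug 5 13:07:22 PINEWOOD kernel: md: parity incorrect, sector=3521058936
Aug 5 13:07:22 PINEWOOD kernel: md: parity incorrect, sector=3521058944
Aug 5 13:07:22 PINEWOOD kernel: md: parity incorrect, sector=3521059224
Aug 5 13:07:22 PINEWOOD kernel: md: parity incorrect, sector=3521059232
"""

sectors = parity_error_sectors(log)
print(group_runs(sectors))
print(f"{sectors[0] * SECTOR_BYTES / 1e12:.2f} TB into the disk")  # 1.80 TB into the disk
```

On that assumption the logged sectors sit about 1.80 TB into the disk rather than near the 3 TB mark, and the grouping shows they fall in a few short contiguous runs instead of being scattered.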
RobJ Posted August 6, 2015
3,906,985,768 KB is 4TB.
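The arithmetic behind that is short enough to spell out. This Python snippet uses only the two figures quoted in the thread (the partition size in KB with K=1024, and the "User Capacity" from the smartctl report); the variable names are just for illustration.

```python
partition_kib = 3_906_985_768      # "Partition size" from the drive page, K = 1024
smart_bytes = 4_000_753_476_096    # "User Capacity" from the smartctl report

partition_bytes = partition_kib * 1024
print(partition_bytes)                    # 4000753426432
print(round(partition_bytes / 1e12, 2))   # 4.0  (decimal terabytes)
print(smart_bytes - partition_bytes)      # 49664 bytes not covered by the partition
```

So the partition covers all but about 48 KiB of the drive; the leading "3,9" in the KB figure is just what 4 TB looks like when counted in units of 1024 bytes.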
rick.p Posted August 6, 2015 Author
Oh duh, it's a 9 :-( like I said this is making me a little squirrely
rick.p Posted August 6, 2015 Author
So, back to the question. If this is a known error, is it safe to rebuild a drive? This whole exercise started when I upped the parity to 4tb so I could swap out some 1tb to 4tb.
garycase Posted August 6, 2015
First you need to confirm it's not real. The number you're seeing (80) is different from what others have reported for similar issues in the other threads ... but since you're doing non-correcting checks it's not absolutely clear that this is the same issue. I'd (a) disable spin-down for the parity drive; and (b) run two correcting checks in a row and see if you get the same results each time. If so, then it's reasonably certain this is the same issue that was reported in the thread I referenced earlier. If not, you may find that the first check actually resolves the errors and you no longer get them on the second one.
rick.p Posted August 6, 2015 Author
Actually the 1st check was a correcting check (the one right after I did the parity rebuild) but spindown was still active... so ok, spindown off, two correcting checks. It DOES APPEAR to be the same spot on each check (same block of sectors). Will report back in 30+ hours.... I will add that this system has been the same hardware for over a year, almost 2. The ONLY thing I did was change out the 3tb parity for that 4tb.. but onward....
rick.p Posted August 7, 2015 Author
Small update, 7 hours in and....
Parity-Check in progress. Cancel will stop the Parity-Check. Write corrections to parity disk
Total size: 4 TB
Elapsed time: 7 hours
Current position: 1.83 TB (45.8 %)
Estimated speed: 68.0 MB/sec
Estimated finish: 8 hours, 51 minutes
Sync errors corrected: 80
It's already found the 'spot'; checked the log and it's the same batch of sectors. Noticed it when it was 1.83tb in (so that blows one of my theories) but no idea where it hit 'it'. Will let it finish and then start pass 2. I was reasonably smart, rebooted the system before starting this so the log file will have NOTHING but this in it. 9 hours till pass 2 starts.
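As a sanity check on the status panel, the "Estimated finish" figure can be reproduced from the other numbers. The Python sketch below assumes the estimate is simply the remaining size divided by the current speed, with decimal units (1 TB = 10^12 bytes, 1 MB = 10^6 bytes). That is a guess that happens to match the figures posted in this thread, not a documented unRAID formula, and `remaining` is a made-up helper name.

```python
def remaining(total_tb, position_tb, speed_mb_s):
    """Return (hours, minutes) left, assuming remaining bytes / current speed."""
    seconds = (total_tb - position_tb) * 1e12 / (speed_mb_s * 1e6)
    hours, rem = divmod(int(seconds), 3600)
    return hours, rem // 60

# Figures from the status above: 4 TB total, 1.83 TB done, 68.0 MB/sec.
print(remaining(4, 1.83, 68.0))  # (8, 51), matching "8 hours, 51 minutes"
```

Because the estimate tracks the instantaneous speed, it will drift as the check moves toward the slower inner tracks of the drives.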
garycase Posted August 7, 2015
I'd say it's virtually certain that these are not actual errors ... but unless you have either MD5's or a complete set of backups you can compare your data against, it's not possible to be absolutely certain. When you disabled spindown, did you do it for the DRIVE (the parity drive) ... not just the global setting? That's what seemed to resolve this in the other thread.
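For anyone without checksums yet, here is a minimal Python sketch of the MD5 idea: record a digest for every file once, then compare later to see whether any file's contents changed. The function names (`md5_of`, `build_manifest`, `changed_files`) and the in-memory manifest are illustrative assumptions; on a real array you would point it at a share or disk path and save the manifest somewhere off the array.

```python
import hashlib
import tempfile
from pathlib import Path

def md5_of(path, chunk_size=1024 * 1024):
    """MD5 hex digest of a file, read in chunks so large files fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def build_manifest(root):
    """Map each file's path (relative to root) to its MD5 digest."""
    root = Path(root)
    return {str(p.relative_to(root)): md5_of(p)
            for p in sorted(root.rglob("*")) if p.is_file()}

def changed_files(old_manifest, new_manifest):
    """Files from the old manifest that are missing or have a different digest now."""
    return sorted(name for name, digest in old_manifest.items()
                  if new_manifest.get(name) != digest)

# Demo on a throwaway directory with one made-up file.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "movie.bin"
    sample.write_bytes(b"original contents")
    before = build_manifest(tmp)
    sample.write_bytes(b"silently corrupted")
    after = build_manifest(tmp)
    print(changed_files(before, after))  # ['movie.bin']
```

A manifest like this cannot say whether parity is right, but it can say whether the data disks still hold what they held when the manifest was made, which is the missing piece when sync errors show up.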