I may have a problem, but I may have done it to myself



OK, not gonna get deep in this yet, just some general thoughts please.

 

6.0.1 in a 10 drive array. Just replaced the Parity Drive (3tb>4tb, precleared x3). Did the parity rebuild and all was good.

Then I kicked off the parity check and when I checked this morning, it said it found 80 errors.

There are NO errors showing on MAIN tab. This setup has been running for several months with 0 errors every month.

 

DATA POINT: The array was being written to at the time with new files.

 

a) Could writing to the array while the check was running be the culprit?

b) UNFORTUNATELY the 'write corrections to disk' was checked... BY DEFAULT.

    OPINION: a DESTRUCTIVE operation (writing corrections) should NEVER be a DEFAULT OPTION.

c) I've kicked off a new check, WITHOUT write checked.

d) am I in deep stuff? Replaced the parity because planned on swapping out some 1tb and 2tb drives for 4's

 

The SMART report for the parity drive appears clean (said like he knows what he's talking about)
smartctl 6.2 2013-07-26 r3841 [x86_64-linux-4.0.4-unRAID] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model:     WDC WD40EZRX-00SPEB0
Serial Number:    WD-WCC4E5RDX2N5
LU WWN Device Id: 5 0014ee 26006bcf0
Firmware Version: 80.00A80
User Capacity:    4,000,753,476,096 bytes [4.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5400 rpm
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   ACS-2 (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Mon Aug  3 21:16:25 2015 MDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x80)	Offline data collection activity
				was never started.
				Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0)	The previous self-test routine completed
				without error or no self-test has ever 
				been run.
Total time to complete Offline 
data collection: 		(51120) seconds.
Offline data collection
capabilities: 			 (0x7b) SMART execute Offline immediate.
				Auto Offline data collection on/off support.
				Suspend Offline collection upon new
				command.
				Offline surface scan supported.
				Self-test supported.
				Conveyance Self-test supported.
				Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
				power-saving mode.
				Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
				General Purpose Logging supported.
Short self-test routine 
recommended polling time: 	 (   2) minutes.
Extended self-test routine
recommended polling time: 	 ( 512) minutes.
Conveyance self-test routine
recommended polling time: 	 (   5) minutes.
SCT capabilities: 	       (0x7035)	SCT Status supported.
				SCT Feature Control supported.
				SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   182   180   021    Pre-fail  Always       -       7883
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       12
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   100   253   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       109
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       9
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       5
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       465
194 Temperature_Celsius     0x0022   123   105   000    Old_age   Always       -       29
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   100   253   000    Old_age   Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%       103         -

SMART Selective self-test log data structure revision number 1
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

 

Link to comment

I think you can sleep easy.

 

I'm assuming when you say 80 errors - you are talking about 80 sync errors?

 

After the parity build, parity should be perfectly aligned with data. Typically I/Os from normal write operations would not be happening in the same band that unRAID is doing its checking (seems enormously unlikely), but if it did happen, I suppose it is possible that you could get some contention, and without some sort of semaphore locking accesses, be able to corrupt parity in that way. Only Tom could confirm if this is mildly possible.

 

But corrupting parity is no big deal if all of the data drives are good. I would stop your non-correction check and run a correction check. Then run a non-correction check and confirm 0 errors.

 

It is still a bit of a mystery and I suppose there is a possibility that one of your data disks is flaking out. MD5s would be helpful in this sort of situation. But I'd say odds are well in your favor.
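
For anyone wanting the MD5 safety net mentioned above, here is a minimal sketch of one way to do it in Python. The function names and the idea of keeping a per-disk manifest are assumptions made for illustration; this is not an unRAID feature. Build a manifest while the data is known to be good, rebuild it later, and compare the two.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file's path (relative to root) to its MD5 digest."""
    root = Path(root)
    return {
        str(p.relative_to(root)): md5_of_file(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def changed_files(old, new):
    """Return paths present in both manifests whose digests differ."""
    return sorted(p for p in old.keys() & new.keys() if old[p] != new[p])
```

A file that shows up in `changed_files` without having been edited on purpose would point at a data disk rather than at parity.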

Link to comment

Well, the first one WAS a correcting check (see the part about not having that checked by default). I still have some writes going on (it would be a problem to stop them), but I did stop the non-correcting check until they finish. But is the damage (if any) already done? Or were the writes only to the parity drive? When the system quiesces I'll run another non-correcting check and report back.

Link to comment

UnRAID makes all corrections to the parity disk ... the vast majority of sync errors are in fact errors in the parity bit, so it's almost certainly the right thing to do.

 

If you completed the correcting check (with 80 sync errors), a follow-up check (whether correcting or not) SHOULD show zero sync errors, assuming all is working well with your system [since the initial sync errors should have already been corrected].
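
To make the reasoning concrete, here is a toy sketch of what a correcting versus a non-correcting check does. It assumes single parity computed as a bytewise XOR across the data disks; the function names are hypothetical and this is not unRAID's actual driver code.

```python
from functools import reduce


def xor_parity(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def parity_check(data_blocks, stored_parity, correct=False):
    """Compare stored parity with parity recomputed from the data.

    Returns (sync_error, parity_to_keep). With correct=True a mismatch is
    resolved by trusting the data and replacing the stored parity.
    """
    expected = xor_parity(data_blocks)
    sync_error = expected != stored_parity
    if sync_error and correct:
        return True, expected
    return sync_error, stored_parity
```

In this model a correcting pass only ever rewrites parity, never data, and a second pass over the same unchanged data has nothing left to report. A mismatch that comes back at the same place after a completed correcting pass therefore suggests something other than stale parity.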

 

As for whether or not the system should default to a correcting check ... that's often an area of contention, and you'll get a variety of opinions on that.    I always run correcting checks (if there are errors, I want them fixed)  EXCEPT after a drive rebuild, when I run a non-correcting check to confirm the rebuild went okay [if it didn't, I want to be able to redo it, and if you change parity at that point you won't be able to].

 

Link to comment

Depends on how you view correcting a bad parity bit => I'd view that as a constructive change, not a destructive one.  Doing nothing simply leaves the array in a defective state that means you can't reliably emulate or rebuild a failed disk.

 

Let me rephrase that then: a NON-REVERSIBLE operation. Default to NOT doing the correction write, then add an extra step/question when you kick it off: "you know this will only check not fix? this is not reversible"..

 

IAEF, a configuration option under Settings > Disk Settings

"Default Parity Check Operation:"

----  "Check only, no corrections"

----- "Check and correct errors"

 

.. if the former, the extra "you sure you want to spend 14 hours doing this and not actually fix anything?"

Link to comment

When taken in the context of the masses this logic doesn't really work, and is not what people are paying for. The majority of people who buy/use UnRAID are expecting it to just work, and to protect their environments. In the event of a disk failure users don't want to hear that even though they've been running UnRAID for 2 years, and sync errors have been reported they were never fixed because users didn't know they had to change a default setting.

 

Educated customers can choose to change the default behavior because they either understand the risk or are willing to take it on. The general public.... not so much.

 

UnRAID needs to be configured to protect users against themselves by default and given that the vast majority of parity sync issues appear to be incorrect writes to the parity disk, and not the data disk, it only makes sense to correct parity by default to give it the best chance of being valid and able to assist in the event of a disk failure.

Link to comment

Definitely agree.  That's why it's always defaulted to that ... and indeed originally there wasn't even an option to change it to non-correcting.    The last thing a user wants is for a drive to fail and discover that they can't successfully rebuild it because their parity isn't up-to-date.

 

Link to comment

Well back to the original issue.

 

The story so far....

 

1. Replaced my original 3tb parity drive with a 4tb one, yes it was precleared 3 passes.

2. Kicked off Parity rebuild. Ran fine.

3. Started a Parity Check with correct errors enabled.

4. Parity Check finished with 80 sync errors.

5. During the time the Rebuild and Check were running there were multiple file operations running (copying data INTO the array and moving files between drives)

 

so 80 errors that the Check in #3 should have corrected.

 

6. All my file operations have finished. So I kicked off a NON CORRECTING Parity check (it corrected them last time, right?)

7. It is 8 1/2 hours in (6 to go....) and.... it shows >>>> 80 Sync Errors <<<<<

 

Parity-Check in progress.

Cancel will stop the Parity-Check.

Write corrections to parity disk

Total size: 4 TB

Elapsed time: 8 hours, 38 minutes

Current position: 2.21 TB (55.2 %)

Estimated speed: 83.8 MB/sec

Estimated finish: 5 hours, 56 minutes

Sync errors detected: 80

 

This is from the tail of the active log....

 

Aug 5 01:25:41 PINEWOOD kernel: md: parity incorrect, sector=3521058840
Aug 5 01:25:41 PINEWOOD kernel: md: parity incorrect, sector=3521058848
Aug 5 01:25:41 PINEWOOD kernel: md: parity incorrect, sector=3521058856
Aug 5 01:25:41 PINEWOOD kernel: md: parity incorrect, sector=3521058864
Aug 5 01:25:41 PINEWOOD kernel: md: parity incorrect, sector=3521058872
Aug 5 01:25:41 PINEWOOD kernel: md: parity incorrect, sector=3521058880
Aug 5 01:25:41 PINEWOOD kernel: md: parity incorrect, sector=3521058888
Aug 5 01:25:41 PINEWOOD kernel: md: parity incorrect, sector=3521058896
Aug 5 01:25:41 PINEWOOD kernel: md: parity incorrect, sector=3521058904
Aug 5 01:25:41 PINEWOOD kernel: md: parity incorrect, sector=3521058912
Aug 5 01:25:41 PINEWOOD kernel: md: parity incorrect, sector=3521058920
Aug 5 01:25:41 PINEWOOD kernel: md: parity incorrect, sector=3521058928
Aug 5 01:25:41 PINEWOOD kernel: md: parity incorrect, sector=3521058936
Aug 5 01:25:41 PINEWOOD kernel: md: parity incorrect, sector=3521058944
Aug 5 01:25:41 PINEWOOD kernel: md: parity incorrect, sector=3521059224
Aug 5 01:25:41 PINEWOOD kernel: md: parity incorrect, sector=3521059232
Aug 5 01:25:41 PINEWOOD kernel: md: parity incorrect, sector=3521059240
Aug 5 01:25:41 PINEWOOD kernel: md: parity incorrect, sector=3521059248
Aug 5 01:25:41 PINEWOOD kernel: md: parity incorrect, sector=3521059256
Aug 5 01:25:41 PINEWOOD kernel: md: parity incorrect, sector=3521059264
Aug 5 01:25:41 PINEWOOD kernel: md: parity incorrect, sector=3521059344
Aug 5 01:25:41 PINEWOOD kernel: md: parity incorrect, sector=3521059352
Aug 5 01:25:41 PINEWOOD kernel: md: parity incorrect, sector=3521059360
Aug 5 01:25:41 PINEWOOD kernel: md: parity incorrect, sector=3521059368
Aug 5 01:25:41 PINEWOOD kernel: md: parity incorrect, sector=3521059376
Aug 5 01:25:41 PINEWOOD kernel: md: parity incorrect, sector=3521059384
Aug 5 01:25:41 PINEWOOD kernel: md: parity incorrect, sector=3521059392
Aug 5 01:25:41 PINEWOOD kernel: md: parity incorrect, sector=3521059400
Aug 5 01:25:41 PINEWOOD kernel: md: parity incorrect, sector=3521059408
Aug 5 01:25:41 PINEWOOD kernel: md: parity incorrect, sector=3521059416
Aug 5 01:25:41 PINEWOOD kernel: md: parity incorrect, sector=3521059424
Aug 5 01:25:41 PINEWOOD kernel: md: parity incorrect, sector=3521059432
Aug 5 01:25:41 PINEWOOD kernel: md: parity incorrect, sector=3521059440
Aug 5 01:25:41 PINEWOOD kernel: md: parity incorrect, sector=3521059448
Aug 5 01:25:41 PINEWOOD kernel: md: parity incorrect, sector=3521059456
Aug 5 01:25:41 PINEWOOD kernel: md: parity incorrect, sector=3521059464
Aug 5 01:25:41 PINEWOOD kernel: md: parity incorrect, sector=3521059480
Aug 5 01:25:41 PINEWOOD kernel: md: parity incorrect, sector=3521059488
Aug 5 02:08:23 PINEWOOD sSMTP[9957]: Creating SSL connection to host
Aug 5 02:08:23 PINEWOOD sSMTP[9957]: SSL connection using ECDHE-RSA-AES128-GCM-SHA256
Aug 5 02:08:26 PINEWOOD sSMTP[9957]: Sent mail for *******@gmail.com (221 2.0.0 closing connection u81sm1294126oie.11 - gsmtp) uid=0 username=root outbytes=1010
Aug 5 03:12:07 PINEWOOD kernel: kvm: no hardware support
Aug 5 03:12:07 PINEWOOD kernel: kvm: Nested Virtualization enabled
Aug 5 03:12:07 PINEWOOD kernel: kvm: Nested Paging enabled
Aug 5 03:14:48 PINEWOOD kernel: kvm: already loaded the other module
Aug 5 03:15:00 PINEWOOD emhttp: /usr/bin/tail -n 42 -f /var/log/syslog 2>&1
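
Reading a wall of "parity incorrect" lines by eye is error-prone, so here is a small sketch that pulls the sector numbers out of syslog text and groups them into contiguous runs. The helper names are made up for illustration, and the step of 8 is taken from the spacing of the sector numbers in the excerpt above.

```python
import re

SECTOR_RE = re.compile(r"md: parity incorrect, sector=(\d+)")


def parity_error_sectors(log_lines):
    """Extract the sector number from every 'parity incorrect' line."""
    return [int(m.group(1)) for line in log_lines if (m := SECTOR_RE.search(line))]


def contiguous_runs(sectors, step=8):
    """Group sectors into runs whose members are exactly `step` apart.

    Returns a list of (first_sector, last_sector, count) tuples.
    """
    runs = []
    for sector in sorted(sectors):
        if runs and sector - runs[-1][1] == step:
            runs[-1][1] = sector
        else:
            runs.append([sector, sector])
    return [(first, last, (last - first) // step + 1) for first, last in runs]
```

Feeding it the lines of `/var/log/syslog` (the file the tail command above is reading) would show whether the errors form a few tight clusters or are scattered across the disk.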

 

The setup is bog simple, no docker, no kvm, no nada.... only a few plugins

 

plugin: checking unassigned.devices.plg ...
plugin: checking preclear.disk.plg ...
plugin: checking dynamix.system.temp.plg ...
plugin: checking dynamix.system.stats.plg ...
plugin: checking dynamix.system.info.plg ...
plugin: checking NerdPack.plg ...
plugin: checking unRAIDServer.plg ...

 

all up to date

 

Now, the ONLY other incongruity that I just remembered is that during BOTH the correcting and the non-correcting check... I had a pre-clear running on a drive connected via USB3 and the Unassigned Devices plugin (hey, a data point is a data point).

 

this is the smart report for the Parity drive, as of 2 minutes ago

 

smartctl 6.2 2013-07-26 r3841 [x86_64-linux-4.0.4-unRAID] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model:     WDC WD40EZRX-00SPEB0
Serial Number:    WD-WCC4E5RDX2N5
LU WWN Device Id: 5 0014ee 26006bcf0
Firmware Version: 80.00A80
User Capacity:    4,000,753,476,096 bytes [4.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5400 rpm
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   ACS-2 (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Wed Aug  5 03:07:48 2015 MDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x80)	Offline data collection activity
				was never started.
				Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0)	The previous self-test routine completed
				without error or no self-test has ever 
				been run.
Total time to complete Offline 
data collection: 		(51120) seconds.
Offline data collection
capabilities: 			 (0x7b) SMART execute Offline immediate.
				Auto Offline data collection on/off support.
				Suspend Offline collection upon new
				command.
				Offline surface scan supported.
				Self-test supported.
				Conveyance Self-test supported.
				Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
				power-saving mode.
				Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
				General Purpose Logging supported.
Short self-test routine 
recommended polling time: 	 (   2) minutes.
Extended self-test routine
recommended polling time: 	 ( 512) minutes.
Conveyance self-test routine
recommended polling time: 	 (   5) minutes.
SCT capabilities: 	       (0x7035)	SCT Status supported.
				SCT Feature Control supported.
				SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   182   180   021    Pre-fail  Always       -       7883
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       12
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   100   253   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       139
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       9
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       5
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       2687
194 Temperature_Celsius     0x0022   122   105   000    Old_age   Always       -       30
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   100   253   000    Old_age   Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%       103         -

SMART Selective self-test log data structure revision number 1
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
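
Comparing two SMART reports line by line is tedious, so here is a rough sketch that parses the attribute table from saved `smartctl` output and lists the raw values that changed. The function names are hypothetical, and the parser assumes the ten-column table layout shown in this report. Between the first report in this thread and this one, the only raw values that differ are Power_On_Hours (109 to 139), Load_Cycle_Count (465 to 2687) and Temperature_Celsius (29 to 30).

```python
def parse_smart_attributes(report):
    """Return {attribute_name: raw_value} from a smartctl attribute table."""
    values = {}
    for line in report.splitlines():
        fields = line.split()
        # Attribute rows have at least 10 columns: a numeric ID, a name,
        # a hex FLAG, and the raw value in the tenth column.
        if len(fields) >= 10 and fields[0].isdigit() and fields[2].startswith("0x"):
            try:
                values[fields[1]] = int(fields[9])
            except ValueError:
                continue  # skip raw values that are not plain integers
    return values


def changed_attributes(before, after):
    """Return {name: (old_raw, new_raw)} for attributes whose raw value changed."""
    return {
        name: (before[name], after[name])
        for name in before
        if name in after and before[name] != after[name]
    }
```

The attributes that usually matter for a suspected bad disk (Reallocated_Sector_Ct, Current_Pending_Sector, Offline_Uncorrectable, UDMA_CRC_Error_Count) are all still zero in both reports.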

 

Data Point: the Parity and one data drive are on the M'Board controller, the other 8 drives are on a AOC-SASLP-MV8.

 

Now what? I would have assumed the count should now be zero (the 1st check was a correcting check, so it should have corrected them?)

Replace the 4tb parity with another one?

A data drive is flakey? (mind you it's had a dozen parity checks run before, no errors ever)

Re-seat all the cables? (again)

Wait till the pre-clear is done, remove the drive from the system (Data Point: AFAIK, I've never had it running a preclear with these plugins running before) and try again?

 

This is giving me serious heartburn.....

 

 

 

 

Link to comment

This is a known problem -- can't find the previous thread about it right now (very tired ... heading to bed), but it's been seen before.    The sync "errors" are NOT actual errors (that's the good news) ... but clearly there's a problem that's causing the false sync error count.    I don't recall what was done (if anything) to resolve this in the previous thread -- you may want to search for it.  Otherwise I'll see if I can find it tomorrow afternoon.

 

Link to comment

OK, I just disabled spindown and am restarting the (non-correcting) parity check. It sorta makes sense: why it's 80 (apparently always) errors, all consecutive (drive has to spin down/up), and why now (bigger drive takes longer to get where it's going :) )....

 

However leaving the parity drive spinning all the time is a PITA (or having to do it manually before the auto check)...

 

ATTENTION POWERS THAT BE, maybe the parity check should AUTOMAGICALLY disable drive spin down while a parity check is running????

 

will report back in 13-14 hours

Link to comment

Well didn't have to wait long.... parity drive set to not spin down, same thing....

 

Total size: 	4 TB 	
Elapsed time: 	12 hours, 6 minutes 	
Current position: 	2.78 TB (69.5 %) 	
Estimated speed: 	72.2 MB/sec 	
Estimated finish: 	4 hours, 41 minutes 	
Sync errors detected: 	80

 

errors in the same place

 

Aug 5 13:07:22 PINEWOOD kernel: md: parity incorrect, sector=3521058888
Aug 5 13:07:22 PINEWOOD kernel: md: parity incorrect, sector=3521058896
Aug 5 13:07:22 PINEWOOD kernel: md: parity incorrect, sector=3521058904
Aug 5 13:07:22 PINEWOOD kernel: md: parity incorrect, sector=3521058912
Aug 5 13:07:22 PINEWOOD kernel: md: parity incorrect, sector=3521058920
Aug 5 13:07:22 PINEWOOD kernel: md: parity incorrect, sector=3521058928
Aug 5 13:07:22 PINEWOOD kernel: md: parity incorrect, sector=3521058936
Aug 5 13:07:22 PINEWOOD kernel: md: parity incorrect, sector=3521058944
Aug 5 13:07:22 PINEWOOD kernel: md: parity incorrect, sector=3521059224
Aug 5 13:07:22 PINEWOOD kernel: md: parity incorrect, sector=3521059232
Aug 5 13:07:22 PINEWOOD kernel: md: parity incorrect, sector=3521059240
Aug 5 13:07:22 PINEWOOD kernel: md: parity incorrect, sector=3521059248
Aug 5 13:07:22 PINEWOOD kernel: md: parity incorrect, sector=3521059256
Aug 5 13:07:22 PINEWOOD kernel: md: parity incorrect, sector=3521059264
Aug 5 13:07:22 PINEWOOD kernel: md: parity incorrect, sector=3521059344
Aug 5 13:07:22 PINEWOOD kernel: md: parity incorrect, sector=3521059352
Aug 5 13:07:22 PINEWOOD kernel: md: parity incorrect, sector=3521059360
Aug 5 13:07:22 PINEWOOD kernel: md: parity incorrect, sector=3521059368
Aug 5 13:07:22 PINEWOOD kernel: md: parity incorrect, sector=3521059376
Aug 5 13:07:22 PINEWOOD kernel: md: parity incorrect, sector=3521059384
Aug 5 13:07:22 PINEWOOD kernel: md: parity incorrect, sector=3521059392
Aug 5 13:07:22 PINEWOOD kernel: md: parity incorrect, sector=3521059400
Aug 5 13:07:22 PINEWOOD kernel: md: parity incorrect, sector=3521059408
Aug 5 13:07:22 PINEWOOD kernel: md: parity incorrect, sector=3521059416
Aug 5 13:07:22 PINEWOOD kernel: md: parity incorrect, sector=3521059424
Aug 5 13:07:22 PINEWOOD kernel: md: parity incorrect, sector=3521059432
Aug 5 13:07:22 PINEWOOD kernel: md: parity incorrect, sector=3521059440
Aug 5 13:07:22 PINEWOOD kernel: md: parity incorrect, sector=3521059448
Aug 5 13:07:22 PINEWOOD kernel: md: parity incorrect, sector=3521059456
Aug 5 13:07:22 PINEWOOD kernel: md: parity incorrect, sector=3521059464
Aug 5 13:07:22 PINEWOOD kernel: md: parity incorrect, sector=3521059480
Aug 5 13:07:22 PINEWOOD kernel: md: parity incorrect, sector=3521059488
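
To confirm "same place" without eyeballing it, the sector numbers from two runs can be compared as sets. This is a small sketch with made-up function names; it expects the syslog text from each run as a string. For what it's worth, all 32 sectors in the excerpt above also appear in the earlier excerpt, which simply starts a few lines earlier.

```python
import re


def sectors_in(log_text):
    """Collect the set of sectors reported as 'parity incorrect' in a log."""
    return {int(n) for n in re.findall(r"parity incorrect, sector=(\d+)", log_text)}


def compare_runs(first_log, second_log):
    """Split reported sectors into those common to both runs and those unique to one."""
    first, second = sectors_in(first_log), sectors_in(second_log)
    return {
        "both": sorted(first & second),
        "only_first": sorted(first - second),
        "only_second": sorted(second - first),
    }
```

Identical sets across runs point to something repeatable at fixed sectors; sets that move around would look more like a transient read problem.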

 

should I put in a new 4tb and see what happens? this is starting to get on my nerves

 

addendum....

 

just noticed this on the drive page

 

Name:    Parity

Partition size:    3,906,985,768 KB (K=1024)

Partition format:    GPT: 4K-aligned

 

3tb???? it's a 4tb drive...

 

Device Model:	WDC WD40EZRX-00SPEB0
Serial Number:	WD-WCC4E5RDX2N5
LU WWN Device Id:	5 0014ee 26006bcf0
Firmware Version:	80.00A80
User Capacity:	4,000,753,476,096 bytes [4.00 TB]
Sector Sizes:	512 bytes logical, 4096 bytes physical
Rotation Rate:	5400 rpm
Device:	Not in smartctl database [for details use: -P showall]
ATA Version:	ACS-2 (minor revision not indicated)
SATA Version:	SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time:	Wed Aug 5 16:58:24 2015 MDT
SMART support:	Available - device has SMART capability.
SMART support:	Enabled
SMART overall-health:	Passed

 

and it looks like it's getting the errors just about the time it hits the 3tb mark??????
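
A quick unit check helps here. The partition size is shown in KB with K=1024, so 3,906,985,768 KB works out to about 4.00 TB in decimal units, not 3 TB. Likewise, assuming the md driver counts sectors in 512-byte units (an assumption, not something the log states), the first bad sector sits about 1.80 TB into the disk rather than near the 3 TB mark. A minimal sketch of the arithmetic, using the numbers from the drive page and the syslog:

```python
KIB = 1024
SECTOR_BYTES = 512  # assumption: the log reports sectors in 512-byte units

partition_kib = 3_906_985_768      # "Partition size" from the drive page
drive_bytes = 4_000_753_476_096    # "User Capacity" from smartctl
first_bad_sector = 3_521_058_840   # first "parity incorrect" sector in the log

partition_bytes = partition_kib * KIB
error_offset_bytes = first_bad_sector * SECTOR_BYTES


def to_tb(num_bytes):
    """Convert bytes to decimal terabytes (1 TB = 10**12 bytes)."""
    return num_bytes / 1e12


print(f"partition size : {to_tb(partition_bytes):.2f} TB")
print(f"drive capacity : {to_tb(drive_bytes):.2f} TB")
print(f"first bad sector offset: {to_tb(error_offset_bytes):.2f} TB "
      f"({100 * error_offset_bytes / partition_bytes:.1f}% of the partition)")
```

Under that assumption the error block lies a little under halfway through the check.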

Link to comment

First you need to confirm it's not real.    The number you're seeing (80) is different than the ones who have reported similar issues in the other threads ... but since you're doing non-correcting checks it's not absolutely clear that this is the same issue.

 

I'd (a) disable spin-down for the parity drive; and (b) run two correcting checks in a row and see if you get the same results each time.    If so, then it's reasonably certain this is the same issue that was reported in the thread I referenced earlier.    If not, you may find that the first check actually resolves the errors and you no longer get them on the second one.

 

Link to comment

Actually the 1st check was a correcting check (the one right after I did the parity rebuild) but spindown was still active... so OK, spindown off, two correcting checks. It DOES APPEAR to be the same spot on each check (same block of sectors). Will report back in 30+ hours....

 

I will add that this system has been the same hardware for over a year, almost 2. The ONLY thing I did was change out the 3tb parity for that 4tb.. but onward....

Link to comment

Small update, 7 hours in and....

 

Parity-Check in progress.

Cancel will stop the Parity-Check.

Write corrections to parity disk

Total size: 4 TB

Elapsed time: 7 hours

Current position: 1.83 TB (45.8 %)

Estimated speed: 68.0 MB/sec

Estimated finish: 8 hours, 51 minutes

Sync errors corrected: 80

 

It's already found the 'spot'. I checked the log and it's the same batch of sectors. I noticed it when the check was 1.83tb in (so that blows one of my theories), but I have no idea exactly where it hit 'it'. Will let it finish and then start pass 2. I was reasonably smart and rebooted the system before starting this, so the log file will have NOTHING but this in it.

 

9 hours till pass 2 starts.

Link to comment

I'd say it's virtually certain that these are not actual errors ... but unless you have either MD5's or a complete set of backups you can compare your data against, it's not possible to be absolutely certain.

 

When you disabled spindown, did you do it for the DRIVE (the parity drive) ... not just the global setting? That's what seemed to resolve this in the other thread.

 

 

Link to comment
