Found 2 parity errors twice (in a row)



After my computer shut down randomly and I booted up again, the parity check found 2 errors.  I ran another parity check a few days later, and it still showed 2 errors.  This is my log.

 

Using v4.6 AIO

 

 

Jan 19 08:48:17 Tower emhttp: Spinning up all drives...
Jan 19 08:48:17 Tower kernel: mdcmd (18): spinup 0
Jan 19 08:48:17 Tower kernel: mdcmd (19): spinup 1
Jan 19 08:48:17 Tower kernel: mdcmd (20): spinup 2
Jan 19 08:51:14 Tower kernel: mdcmd (21): check CORRECT
Jan 19 08:51:14 Tower kernel: md: recovery thread woken up ...
Jan 19 08:51:14 Tower kernel: md: recovery thread checking parity...
Jan 19 08:51:14 Tower kernel: md: using 1152k window, over a total of 1953514552 blocks.
Jan 19 09:30:41 Tower kernel: md: parity incorrect: 495961960
Jan 19 09:33:28 Tower kernel: md: parity incorrect: 529594768
Jan 19 13:08:44 Tower kernel: mdcmd (22): spindown 2
Jan 19 13:42:28 Tower ntpd[1804]: time reset -0.169712 s
Jan 19 13:42:51 Tower ntpd[1804]: synchronized to 208.83.212.8, stratum 2
Jan 19 15:48:58 Tower kernel: md: sync done. time=25065sec rate=77937K/sec
Jan 19 15:48:58 Tower kernel: md: recovery thread sync completion status: 0

 


smartctl -t short /dev/[sh]d?

For instance:

smartctl -t short /dev/sda
smartctl -t short /dev/sdb
smartctl -t short /dev/sdc
smartctl -t short /dev/sdd

Then wait 2 to 3 minutes and retrieve the results using:

smartctl -a /dev/sda
smartctl -a /dev/sdb
smartctl -a /dev/sdc
smartctl -a /dev/sdd
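
If you'd rather not run each command by hand, the same thing works as a small loop (a sketch, assuming bash and that the drives really are sda through sdd; adjust the glob to match your system.  smartctl -l selftest prints just the self-test log rather than the full -a output):

# kick off a short self-test on each drive
for d in /dev/sd[a-d]; do smartctl -t short "$d"; done
sleep 180    # the short test usually completes in about 2 minutes
# then read back just the self-test results
for d in /dev/sd[a-d]; do smartctl -l selftest "$d"; done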

 

 

 


So it only takes a few minutes to complete?  Do I have to stop the server while doing the tests?


No.


I remember a long time ago, a person ran a parity check and got a bunch of parity errors, then reran it and got EXACTLY the same number of parity errors.  It turned out one of the drives had returned bad data during the first parity check, and the second parity check basically corrected the parity data.
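
To see why a transient misread produces exactly this "same count twice" pattern, here is a toy model in shell (the byte values are made up; real parity is computed per 512-byte block across all data disks):

# Parity is the XOR of the data; these byte values are hypothetical.
d1=0x5A; d2=0x3C
p=$(( d1 ^ d2 ))             # parity as originally written: correct
d1_bad=$(( d1 ^ 0x01 ))      # first check: drive misreads d1 (one bit flipped)
p_new=$(( d1_bad ^ d2 ))     # the CORRECT pass rewrites parity from the bad read
# Second check: the drive reads d1 correctly again, so the freshly
# "corrected" parity no longer matches and the same block is flagged.
[ $p_new -ne $(( d1 ^ d2 )) ] && echo "parity incorrect (again)"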


smartctl 5.39.1 2010-01-28 r3054 [i486-slackware-linux-gnu] (local build)
Copyright © 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Device Model:     WDC WD20EARS-00MVWB0
Serial Number:    WD-WMAZA1594200
Firmware Version: 51.0AB51
User Capacity:    2,000,398,934,016 bytes
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   8
ATA Standard is:  Exact ATA specification draft version not indicated
Local Time is:    Wed Jan 19 22:11:36 2011 EST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x84) Offline data collection activity
                                        was suspended by an interrupting command from host.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                 (37980) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 255) minutes.
Conveyance self-test routine
recommended polling time:        (   5) minutes.
SCT capabilities:              (0x3035) SCT Status supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   209   190   021    Pre-fail  Always       -       4525
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       46
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       164
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       32
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       25
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       145
194 Temperature_Celsius     0x0022   113   109   000    Old_age   Always       -       37
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%       164         -

SMART Selective self-test log data structure revision number 1
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
   1        0        0  Not_testing
   2        0        0  Not_testing
   3        0        0  Not_testing
   4        0        0  Not_testing
   5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

 

That was the parity drive; now this is the second data drive.

 

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x84) Offline data collection activity
                                        was suspended by an interrupting command from host.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                 (37260) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 255) minutes.
Conveyance self-test routine
recommended polling time:        (   5) minutes.
SCT capabilities:              (0x3035) SCT Status supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   182   167   021    Pre-fail  Always       -       5858
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       38
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       164
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       25
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       17
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       182
194 Temperature_Celsius     0x0022   113   110   000    Old_age   Always       -       37
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%       164         -

SMART Selective self-test log data structure revision number 1
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
   1        0        0  Not_testing
   2        0        0  Not_testing
   3        0        0  Not_testing
   4        0        0  Not_testing
   5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

 

So what is the problem?

 

 


Do all the drives.

 

Personally, I'm curious if the Reallocated_Sector_Ct or Current_Pending_Sector lines read anything but 0.

 

Peter
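
(As a quick way to pull just those two attributes from every drive at once, something like this works; a sketch, assuming bash and drives sda through sdd:)

for d in /dev/sd[a-d]; do
  echo "== $d =="
  smartctl -a "$d" | grep -E 'Reallocated_Sector_Ct|Current_Pending_Sector'
done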

 

This is the drive I added in *AFTER* the parity errors.  I don't know much about these tests, but I'm guessing this drive isn't the best to be using.

 

=== START OF INFORMATION SECTION ===
Model Family:     Seagate Barracuda 7200.11 family
Device Model:     ST31000340AS
Serial Number:    5QJ0X94K
Firmware Version: SD15
User Capacity:    1,000,204,886,016 bytes
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   8
ATA Standard is:  ATA-8-ACS revision 4
Local Time is:    Wed Jan 19 22:15:57 2011 EST

==> WARNING: There are known problems with these drives,
AND THIS FIRMWARE VERSION IS AFFECTED,
see the following Seagate web pages:
http://seagate.custkb.com/seagate/crm/selfservice/search.jsp?DocId=207931
http://seagate.custkb.com/seagate/crm/selfservice/search.jsp?DocId=207951

SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (  25) The self-test routine was aborted by
                                        the host.
Total time to complete Offline
data collection:                 ( 642) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        ( 237) minutes.
Conveyance self-test routine
recommended polling time:        (   2) minutes.
SCT capabilities:              (0x103b) SCT Status supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   117   099   006    Pre-fail  Always       -       123578872
  3 Spin_Up_Time            0x0003   094   091   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       947
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   075   060   030    Pre-fail  Always       -       4326860246
  9 Power_On_Hours          0x0032   092   092   000    Old_age   Always       -       7437
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       2
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       966
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   099   000    Old_age   Always       -       8590065666
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   074   050   045    Old_age   Always       -       26 (Lifetime Min/Max 24/39)
194 Temperature_Celsius     0x0022   026   050   000    Old_age   Always       -       26 (0 17 0 0)
195 Hardware_ECC_Recovered  0x001a   049   013   000    Old_age   Always       -       123578872
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   193   000    Old_age   Always       -       19

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Aborted by host               90%      7437         -

SMART Selective self-test log data structure revision number 1
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
   1        0        0  Not_testing
   2        0        0  Not_testing
   3        0        0  Not_testing
   4        0        0  Not_testing
   5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

 

Should I install the new firmware or just not even use it?


I see nothing wrong; what do you see?  (other than the warning that you need to update the firmware)

 

Joe L.


  1 Raw_Read_Error_Rate     0x000f   117   099   006    Pre-fail  Always       -       123578872

 

The others had 0 as the value.  Regardless, why am I getting parity errors?


Absolutely nothing wrong.  The VALUE of 117 is well above the failure THRESHOLD of 6. The worst it has ever been is 99, still well above the failure threshold.

 

The raw value has meaning ONLY to the manufacturer.  Some model drives show a raw value, some do not and show 0.


So should I do another parity check?

 

I would say yes.  And I would expect that this time you'll get no errors. 

 

That said, you might be seeing the same sort of issue that plagued me for at least 6 months last year.  I was getting an occasional parity check error; the next run of parity checking would show the same number of errors, then all would be fine for a few more parity checks, until eventually I would get another error or two.

 

This went on for some time until unRAID 4.5.5 came out, which added logging of the block numbers of the first 20 parity errors detected.  Once I saw this I started recording the block numbers and found that the errors in the second parity pass were being reported on the same blocks as the first (well, usually; sometimes an extra one would show up...).  So I could now see that running the parity check-and-correct was causing the parity blocks to get flipped between good and bad.  At this point I switched to using unMENU's "parity check but do not correct" function for all my parity checks.  With this I would see an error or two appear on one pass and then not appear again on subsequent passes.

 

Over time (and many parity checks) I started to notice that all the block numbers were within a limited band of about 200,000,000 blocks, or about 100GB of disk surface.  I also recorded a number of blocks reappearing several times: out of 45 blocks that reported problems, 13 were bad twice, 4 were bad 3 times, and one was bad 5 times.

 

I replaced cables, the controller card, motherboard, RAM, and CPU, but not the case or PSU.  All without any change to the problem.

 

During this time there was very little indication of anything bad in the SMART reports; I logged SMART reports for all my drives after each parity check and then used kdiff to compare them for differences.

 

I replaced two drives which had shown a few SMART errors without affecting this issue.

 

Eventually I decided that the problem had to be a bad drive, but it must have been manifesting in a way that eluded the SMART system.  So, given the range of blocks involved, I figured I could reasonably take an MD5SUM of that block range on each disk, record the first value, and then repeat the test a number of times until I found a drive which returned a different value.

 

This approach worked quite well.  Initially I ran this script (it exercised the flaky range on all my drives including the parity):

dd if=/dev/sdf skip=1471919504 count=204597073 | md5sum -b > sdf.log
dd if=/dev/sdg skip=1471919504 count=204597073 | md5sum -b > sdg.log
dd if=/dev/sdi skip=1471919504 count=204597073 | md5sum -b > sdi.log
dd if=/dev/sdd skip=1471919504 count=204597073 | md5sum -b > sdd.log
dd if=/dev/sdb skip=1471919504 count=204597073 | md5sum -b > sdb.log
dd if=/dev/sdh skip=1471919504 count=204597073 | md5sum -b > sdh.log
dd if=/dev/sde skip=1471919504 count=204597073 | md5sum -b > sde.log
dd if=/dev/sdf skip=1471919504 count=204597073 | md5sum -b >> sdf.log
dd if=/dev/sdg skip=1471919504 count=204597073 | md5sum -b >> sdg.log
dd if=/dev/sdi skip=1471919504 count=204597073 | md5sum -b >> sdi.log
dd if=/dev/sdd skip=1471919504 count=204597073 | md5sum -b >> sdd.log
dd if=/dev/sdb skip=1471919504 count=204597073 | md5sum -b >> sdb.log
dd if=/dev/sdh skip=1471919504 count=204597073 | md5sum -b >> sdh.log
dd if=/dev/sde skip=1471919504 count=204597073 | md5sum -b >> sde.log
dd if=/dev/sdf skip=1471919504 count=204597073 | md5sum -b >> sdf.log
dd if=/dev/sdg skip=1471919504 count=204597073 | md5sum -b >> sdg.log
dd if=/dev/sdi skip=1471919504 count=204597073 | md5sum -b >> sdi.log
dd if=/dev/sdd skip=1471919504 count=204597073 | md5sum -b >> sdd.log
dd if=/dev/sdb skip=1471919504 count=204597073 | md5sum -b >> sdb.log
dd if=/dev/sdh skip=1471919504 count=204597073 | md5sum -b >> sdh.log
dd if=/dev/sde skip=1471919504 count=204597073 | md5sum -b >> sde.log
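
The same thing can be written more compactly as a loop; a sketch, assuming bash and the same device names and block range as the script above (note this version always appends, so start with empty .log files):

# Three read-only passes over the suspect ~100GB band of each drive;
# a drive whose three hashes differ is returning inconsistent data.
for pass in 1 2 3; do
  for d in sdf sdg sdi sdd sdb sdh sde; do
    dd if=/dev/$d skip=1471919504 count=204597073 2>/dev/null | md5sum -b >> "$d.log"
  done
done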

 

Just be very careful about using "dd" as it can be used to wipe out a drive...

 

The skip tells dd to skip over the first 1471919504 blocks (as I only had a couple of errors with block numbers less than this) and the count tells dd to copy the 204597073 blocks (about 100GB) that follow.
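
A quick sanity check of those numbers (dd uses a 512-byte block size by default):

echo $(( 204597073 * 512 ))     # 104753701376 bytes: roughly 100GB scanned
echo $(( 1471919504 * 512 ))    # 753622786048 bytes: the scan starts about 750GB in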

 

In the first pass I repeated the dd|md5sum test three times on each drive and then I looked at the results.  On one drive the results looked like:

 

fe1804307062eeb93261b18bb63036bf *-
740bc5c5ef4eab169f627d5c2ae45dfa *-
73c0b467088be422c9999f415e81adab *-

 

BINGO! A different md5 hash each time the drive was tested (and I was pretty certain nothing was writing to it during the test).

 

Since the parity drive's md5 sums were constant through the test, I realized that nothing on the array was being written to during the test, so these changing values must indicate a drive problem.

 

I did some more tests, limiting my runs to the bad drive and the parity drive, and got this sdc log file (for the parity drive):

e45e3454fd371b8e907f515f085e95e2 *-
e45e3454fd371b8e907f515f085e95e2 *-
e45e3454fd371b8e907f515f085e95e2 *-
e45e3454fd371b8e907f515f085e95e2 *-
e45e3454fd371b8e907f515f085e95e2 *-
e45e3454fd371b8e907f515f085e95e2 *-
e45e3454fd371b8e907f515f085e95e2 *-

so the parity did not change, and yet sdb2.log looked like:

73c0b467088be422c9999f415e81adab *-
73c0b467088be422c9999f415e81adab *-
600cb1e4adc284ca913c5ef4d83b9d20 *-
8978f191ba809134207e5f385997aac1 *-
c7a4ed6c9c4dd1b71794ce5d94644e72 *-
3cb91ce1e320d7de8e0e302a9d70d2f9 *-
93c82b4ff2b98172ffe9a48266a43361 *-

 

Now note that the magic number 73c0b467088be422c9999f415e81adab has shown up a number of times in the results from the bad drive, so I suspected this was the value returned when the drive decided to work correctly.

 

Next I replaced the suspect drive with a fresh (precleared!) drive and rebuilt based on parity, and then ran the test script and again got the "good" md5 value of 73c0b467088be422c9999f415e81adab.

 

Since then I have done 18 parity checks without a single error, whereas before I was typically getting only 2 or 3 (and at most 7) parity checks between errors.

 

 

I also tried a long SMART test on the bad drive (several, in fact) and these did not change anything.

 

Once I did a preclear of the bad drive (after having removed it from the array), I found the drive behaved correctly through the dd|md5 tests, so it looks like the preclearing forced the drive to correct the issues, but I'm not putting it back into the array!

 

Here's the last (before preclearing) SMART log from the bad drive:

 

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   182   181   021    Pre-fail  Always       -       5891
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       438
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   089   089   000    Old_age   Always       -       8566
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       34
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       8
193 Load_Cycle_Count        0x0032   186   186   000    Old_age   Always       -       43744
194 Temperature_Celsius     0x0022   122   115   000    Old_age   Always       -       28
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

 

It all looks just fine.  The only errors I ever noticed (and these were reported on various drives in the array, not just the bad drive) were the occasional Multi_Zone_Error_Rate, Raw_Read_Error_Rate, and a couple of Current_Pending_Sector errors, all of which went away after a few parity check/SMART report cycles.  Sometimes one of these errors would appear during a parity check with errors, other times during a parity check without any errors; they were not well correlated.

 

Regards,

 

Stephen

 

 

 

 

 


Thanks for the info.  I think I will probably just wait until the new version of unRAID becomes the stable release, and then advanced-format my drives.  I only have around 1TB of data so far, so I can just copy it to my other drives, format all of them, and then put it back on.  Seems like a lot of work, but I guess it has to be done!  Where can I buy jumpers for these HDs?


At this point, run a parity check again using the nocorrect option.

 

mdcmd check NOCORRECT

 

Check where any errors occur and see if it's a common spot.
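
The flagged blocks are logged the same way as in the log at the top of the thread, so you can watch for them while the check runs; for instance (assuming a stock install where the system log lands in /var/log/syslog):

tail -f /var/log/syslog | grep "parity incorrect"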

 

Your drives are already advanced format.  "Advanced format" refers to how the data is stored internally on the drive; no amount of formatting or a jumper will change how the drive works internally.  If you have formatted EARS drives without a jumper, then you can delete the partition and let unRAID rebuild the drive with the correct partition.
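
If you want to see how an EARS drive was partitioned, you can list the partition's starting sector; a sketch (replace sdX with the actual device; a start sector divisible by 8, such as 64, is 4K-aligned, while the traditional 63 is not):

fdisk -lu /dev/sdX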

 

Peter

 
