Tijuana

Found 2 parity errors twice (in a row)

18 posts in this topic


After my computer shut down randomly, the parity check on the next boot found 2 errors.  I ran another parity check a few days later, and it still showed 2 errors.  This is my log.

 

Using v4.6 AIO

 

 

Jan 19 08:48:17 Tower emhttp: Spinning up all drives...

Jan 19 08:48:17 Tower kernel: mdcmd (18): spinup 0

Jan 19 08:48:17 Tower kernel: mdcmd (19): spinup 1

Jan 19 08:48:17 Tower kernel: mdcmd (20): spinup 2

Jan 19 08:51:14 Tower kernel: mdcmd (21): check CORRECT

Jan 19 08:51:14 Tower kernel: md: recovery thread woken up ...

Jan 19 08:51:14 Tower kernel: md: recovery thread checking parity...

Jan 19 08:51:14 Tower kernel: md: using 1152k window, over a total of 1953514552 blocks.

Jan 19 09:30:41 Tower kernel: md: parity incorrect: 495961960

Jan 19 09:33:28 Tower kernel: md: parity incorrect: 529594768

Jan 19 13:08:44 Tower kernel: mdcmd (22): spindown 2

Jan 19 13:42:28 Tower ntpd[1804]: time reset -0.169712 s

Jan 19 13:42:51 Tower ntpd[1804]: synchronized to 208.83.212.8, stratum 2

Jan 19 15:48:58 Tower kernel: md: sync done. time=25065sec rate=77937K/sec

Jan 19 15:48:58 Tower kernel: md: recovery thread sync completion status: 0

 


Try a SMART test on each drive and see if one is showing bad sectors or sectors pending reallocation.

 

Peter

 

 

And I do this how...


smartctl -t short /dev/[sh]d?

 

For instance:

smartctl -t short /dev/sda

smartctl -t short /dev/sdb

smartctl -t short /dev/sdc

smartctl -t short /dev/sdd

 

Then wait 2 to 3 minutes, then retrieve the results using:

 

smartctl -a /dev/sda

smartctl -a /dev/sdb

smartctl -a /dev/sdc

smartctl -a /dev/sdd
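Once the reports come back, it can help to filter each one down to just the sector-health attributes rather than reading the whole dump. A minimal sketch, assuming each report was first saved to a file (e.g. smartctl -a /dev/sda > sda.txt) — the filenames are just examples:

```shell
# Filter saved SMART reports down to the sector-health attributes.
# Assumes each report was saved first, e.g.: smartctl -a /dev/sda > sda.txt
for report in sd?.txt; do
  [ -e "$report" ] || continue      # skip if no reports saved yet
  echo "== $report =="
  grep -E 'Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable' "$report"
done
```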

 

 

 



So it only takes a few minutes to complete?  Do I have to stop the server while doing the tests?


No.



I remember a long time ago, a person ran a parity check and got a bunch of parity errors, and then reran it and got EXACTLY the same number of parity errors.  It turned out one of the drives in the first parity check returned bad data, and the second parity check basically corrected the parity data.


smartctl 5.39.1 2010-01-28 r3054 [i486-slackware-linux-gnu] (local build)

Copyright © 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

 

=== START OF INFORMATION SECTION ===

Device Model:     WDC WD20EARS-00MVWB0

Serial Number:    WD-WMAZA1594200

Firmware Version: 51.0AB51

User Capacity:    2,000,398,934,016 bytes

Device is:        Not in smartctl database [for details use: -P showall]

ATA Version is:   8

ATA Standard is:  Exact ATA specification draft version not indicated

Local Time is:    Wed Jan 19 22:11:36 2011 EST

SMART support is: Available - device has SMART capability.

SMART support is: Enabled

 

=== START OF READ SMART DATA SECTION ===

SMART overall-health self-assessment test result: PASSED

 

General SMART Values:

Offline data collection status:  (0x84) Offline data collection activity
                                       was suspended by an interrupting command from host.
                                       Auto Offline Data Collection: Enabled.

Self-test execution status:      (   0) The previous self-test routine completed

                                       without error or no self-test has ever

                                       been run.

Total time to complete Offline

data collection:                 (37980) seconds.

Offline data collection

capabilities:                    (0x7b) SMART execute Offline immediate.
                                       Auto Offline data collection on/off support.

                                       Suspend Offline collection upon new

                                       command.

                                       Offline surface scan supported.

                                       Self-test supported.

                                       Conveyance Self-test supported.

                                       Selective Self-test supported.

SMART capabilities:            (0x0003) Saves SMART data before entering

                                       power-saving mode.

                                       Supports SMART auto save timer.

Error logging capability:        (0x01) Error logging supported.

                                       General Purpose Logging supported.

Short self-test routine

recommended polling time:        (   2) minutes.

Extended self-test routine

recommended polling time:        ( 255) minutes.

Conveyance self-test routine

recommended polling time:        (   5) minutes.

SCT capabilities:              (0x3035) SCT Status supported.

                                       SCT Feature Control supported.

                                       SCT Data Table supported.

 

SMART Attributes Data Structure revision number: 16

Vendor Specific SMART Attributes with Thresholds:

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   209   190   021    Pre-fail  Always       -       4525
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       46
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       164
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       32
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       25
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       145
194 Temperature_Celsius     0x0022   113   109   000    Old_age   Always       -       37
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

 

SMART Error Log Version: 1

No Errors Logged

 

SMART Self-test log structure revision number 1

Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%       164         -

 

SMART Selective self-test log data structure revision number 1

SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS

   1        0        0  Not_testing

   2        0        0  Not_testing

   3        0        0  Not_testing

   4        0        0  Not_testing

   5        0        0  Not_testing

Selective self-test flags (0x0):

 After scanning selected spans, do NOT read-scan remainder of disk.

If Selective self-test is pending on power-up, resume after 0 minute delay.

 

That was the parity drive; this is the second data drive:

 

=== START OF READ SMART DATA SECTION ===

SMART overall-health self-assessment test result: PASSED

 

General SMART Values:

Offline data collection status:  (0x84) Offline data collection activity
                                       was suspended by an interrupting command from host.
                                       Auto Offline Data Collection: Enabled.

Self-test execution status:      (   0) The previous self-test routine completed

                                       without error or no self-test has ever

                                       been run.

Total time to complete Offline

data collection:                 (37260) seconds.

Offline data collection

capabilities:                    (0x7b) SMART execute Offline immediate.
                                       Auto Offline data collection on/off support.

                                       Suspend Offline collection upon new

                                       command.

                                       Offline surface scan supported.

                                       Self-test supported.

                                       Conveyance Self-test supported.

                                       Selective Self-test supported.

SMART capabilities:            (0x0003) Saves SMART data before entering

                                       power-saving mode.

                                       Supports SMART auto save timer.

Error logging capability:        (0x01) Error logging supported.

                                       General Purpose Logging supported.

Short self-test routine

recommended polling time:        (   2) minutes.

Extended self-test routine

recommended polling time:        ( 255) minutes.

Conveyance self-test routine

recommended polling time:        (   5) minutes.

SCT capabilities:              (0x3035) SCT Status supported.

                                       SCT Feature Control supported.

                                       SCT Data Table supported.

 

SMART Attributes Data Structure revision number: 16

Vendor Specific SMART Attributes with Thresholds:

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   182   167   021    Pre-fail  Always       -       5858
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       38
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       164
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       25
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       17
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       182
194 Temperature_Celsius     0x0022   113   110   000    Old_age   Always       -       37
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

 

SMART Error Log Version: 1

No Errors Logged

 

SMART Self-test log structure revision number 1

Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%       164         -

 

SMART Selective self-test log data structure revision number 1

SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS

   1        0        0  Not_testing

   2        0        0  Not_testing

   3        0        0  Not_testing

   4        0        0  Not_testing

   5        0        0  Not_testing

Selective self-test flags (0x0):

 After scanning selected spans, do NOT read-scan remainder of disk.

If Selective self-test is pending on power-up, resume after 0 minute delay.

 

So what is the problem?

 

 


Do all the drives.

 

Personally, I'm curious if the Reallocated_Sector_Ct or Current_Pending_Sector lines read anything but 0.

 

Peter



This is the drive I added in *AFTER* the parity errors.  I don't know much about these tests, but I'm guessing this drive isn't the best to be using.

 

=== START OF INFORMATION SECTION ===

Model Family:     Seagate Barracuda 7200.11 family

Device Model:     ST31000340AS

Serial Number:    5QJ0X94K

Firmware Version: SD15

User Capacity:    1,000,204,886,016 bytes

Device is:        In smartctl database [for details use: -P show]

ATA Version is:   8

ATA Standard is:  ATA-8-ACS revision 4

Local Time is:    Wed Jan 19 22:15:57 2011 EST

 

==> WARNING: There are known problems with these drives,

AND THIS FIRMWARE VERSION IS AFFECTED,

see the following Seagate web pages:

http://seagate.custkb.com/seagate/crm/selfservice/search.jsp?DocId=207931

http://seagate.custkb.com/seagate/crm/selfservice/search.jsp?DocId=207951

 

SMART support is: Available - device has SMART capability.

SMART support is: Enabled

 

=== START OF READ SMART DATA SECTION ===

SMART overall-health self-assessment test result: PASSED

 

General SMART Values:

Offline data collection status:  (0x82) Offline data collection activity

                                       was completed without error.

                                       Auto Offline Data Collection: Enabled.

Self-test execution status:      (  25) The self-test routine was aborted by

                                       the host.

Total time to complete Offline

data collection:                 ( 642) seconds.

Offline data collection

capabilities:                    (0x7b) SMART execute Offline immediate.
                                       Auto Offline data collection on/off support.

                                       Suspend Offline collection upon new

                                       command.

                                       Offline surface scan supported.

                                       Self-test supported.

                                       Conveyance Self-test supported.

                                       Selective Self-test supported.

SMART capabilities:            (0x0003) Saves SMART data before entering

                                       power-saving mode.

                                       Supports SMART auto save timer.

Error logging capability:        (0x01) Error logging supported.

                                       General Purpose Logging supported.

Short self-test routine

recommended polling time:        (   1) minutes.

Extended self-test routine

recommended polling time:        ( 237) minutes.

Conveyance self-test routine

recommended polling time:        (   2) minutes.

SCT capabilities:              (0x103b) SCT Status supported.

                                       SCT Feature Control supported.

                                       SCT Data Table supported.

 

SMART Attributes Data Structure revision number: 10

Vendor Specific SMART Attributes with Thresholds:

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   117   099   006    Pre-fail  Always       -       123578872
  3 Spin_Up_Time            0x0003   094   091   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       947
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   075   060   030    Pre-fail  Always       -       4326860246
  9 Power_On_Hours          0x0032   092   092   000    Old_age   Always       -       7437
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       2
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       966
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   099   000    Old_age   Always       -       8590065666
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   074   050   045    Old_age   Always       -       26 (Lifetime Min/Max 24/39)
194 Temperature_Celsius     0x0022   026   050   000    Old_age   Always       -       26 (0 17 0 0)
195 Hardware_ECC_Recovered  0x001a   049   013   000    Old_age   Always       -       123578872
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   193   000    Old_age   Always       -       19

 

SMART Error Log Version: 1

No Errors Logged

 

SMART Self-test log structure revision number 1

Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Aborted by host               90%      7437         -

 

SMART Selective self-test log data structure revision number 1

SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS

   1        0        0  Not_testing

   2        0        0  Not_testing

   3        0        0  Not_testing

   4        0        0  Not_testing

   5        0        0  Not_testing

Selective self-test flags (0x0):

 After scanning selected spans, do NOT read-scan remainder of disk.

If Selective self-test is pending on power-up, resume after 0 minute delay.

 

Should I install the new firmware or just not even use it?



I see nothing wrong, what do you see?  (other than the warning you need to update the firmware)

 

Joe L.



 1 Raw_Read_Error_Rate     0x000f   117   099   006    Pre-fail  Always       -       123578872

The others had 0 as the value.  Regardless, why am I getting parity errors?



Absolutely nothing wrong.  The VALUE of 117 is well above the failure THRESHOLD of 6. The worst it has ever been is 99, still well above the failure threshold.

 

The raw value has meaning ONLY to the manufacturer.  Some model drives show a raw value, some do not and show 0.
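The VALUE-versus-THRESH reading described above can also be done mechanically. A small sketch that parses the normalized columns out of one attribute line (the line itself is copied from the report above):

```shell
# Compare the normalized VALUE (column 4) against THRESH (column 6).
# An attribute is only failing when VALUE drops to or below THRESH.
line=' 1 Raw_Read_Error_Rate     0x000f   117   099   006    Pre-fail  Always       -       123578872'
value=$(echo "$line" | awk '{print $4}')
thresh=$(echo "$line" | awk '{print $6}')
if [ "$value" -gt "$thresh" ]; then
  echo "healthy: VALUE $value is above THRESH $thresh"
else
  echo "FAILING: VALUE $value is at or below THRESH $thresh"
fi
```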


So should I do another parity check?

 

I would say yes.  And I would expect that this time you'll get no errors. 

 

That said, you might be seeing the same sort of issue that plagued me for at least 6 months last year.  I was getting an occasional parity check error and then the next run of parity checking would show the same number of errors and then for a few more parity checks all would be fine until eventually I would get another error or two.

 

This went on for some time until unRAID 4.5.5 came out which added logging of the block numbers of the first 20 parity errors that were detected.  Once I saw this I started recording the block numbers and found that the errors in the second parity pass were being reported on the same blocks as the first (well usually, sometimes an extra one would show up...).  So I could now see that running the parity check and correct was causing the parity blocks to get flipped between good and bad.  At this point I switched to using unMENU's "parity check but do not correct" function for all my parity checks.  With this I would see an error or two appear on one pass and then not appear again on subsequent passes.

 

Over time (and many parity checks) I started to notice that all the block numbers were within a limited band of about 200,000,000 blocks, or about 100GB of disk surface.  I also recorded a number of blocks reappearing several times: out of 45 blocks that reported problems, 13 were bad twice, 4 were bad 3 times and one was bad 5 times.

 

I replaced cables, controller card, motherboard, RAM and CPU, but not the case or PSU.  All without any change to the problem.

 

During this time there was very little indication of anything bad in the SMART reports, and I logged smart reports for all my drives after each parity test and then used kdiff to compare them for differences. 

 

I replaced two drives which had shown a few SMART errors without affecting this issue.

 

Eventually I decided that the problem had to be a bad drive, but it must be manifesting in a way that was eluding the SMART system.  So given the range of blocks that were involved I figured that I could reasonably do an MD5SUM of that block range on each disk, record the first value and then repeat the test a number of times until I found a drive which returned a different value.

 

This approach worked quite well.  Initially I ran this script (it exercised the flaky range on all my drives including the parity):

dd if=/dev/sdf skip=1471919504 count=204597073 | md5sum -b > sdf.log
dd if=/dev/sdg skip=1471919504 count=204597073 | md5sum -b > sdg.log
dd if=/dev/sdi skip=1471919504 count=204597073 | md5sum -b > sdi.log
dd if=/dev/sdd skip=1471919504 count=204597073 | md5sum -b > sdd.log
dd if=/dev/sdb skip=1471919504 count=204597073 | md5sum -b > sdb.log
dd if=/dev/sdh skip=1471919504 count=204597073 | md5sum -b > sdh.log
dd if=/dev/sde skip=1471919504 count=204597073 | md5sum -b > sde.log
dd if=/dev/sdf skip=1471919504 count=204597073 | md5sum -b >> sdf.log
dd if=/dev/sdg skip=1471919504 count=204597073 | md5sum -b >> sdg.log
dd if=/dev/sdi skip=1471919504 count=204597073 | md5sum -b >> sdi.log
dd if=/dev/sdd skip=1471919504 count=204597073 | md5sum -b >> sdd.log
dd if=/dev/sdb skip=1471919504 count=204597073 | md5sum -b >> sdb.log
dd if=/dev/sdh skip=1471919504 count=204597073 | md5sum -b >> sdh.log
dd if=/dev/sde skip=1471919504 count=204597073 | md5sum -b >> sde.log
dd if=/dev/sdf skip=1471919504 count=204597073 | md5sum -b >> sdf.log
dd if=/dev/sdg skip=1471919504 count=204597073 | md5sum -b >> sdg.log
dd if=/dev/sdi skip=1471919504 count=204597073 | md5sum -b >> sdi.log
dd if=/dev/sdd skip=1471919504 count=204597073 | md5sum -b >> sdd.log
dd if=/dev/sdb skip=1471919504 count=204597073 | md5sum -b >> sdb.log
dd if=/dev/sdh skip=1471919504 count=204597073 | md5sum -b >> sdh.log
dd if=/dev/sde skip=1471919504 count=204597073 | md5sum -b >> sde.log

 

Just be very careful about using "dd" as it can be used to wipe out a drive...
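The repeated invocations in the script above can also be collapsed into a loop. This is just a sketch using the same device names and offsets from the post (adjust them for your own system), and like the original it only ever reads (dd with if= only, never of=):

```shell
# Three read-only md5 passes over the suspect block range on every drive.
# Device list, skip and count are taken from the script above - adjust
# them for your own system before running.
drives="sdf sdg sdi sdd sdb sdh sde"
for pass in 1 2 3; do
  for d in $drives; do
    dd if=/dev/$d skip=1471919504 count=204597073 2>/dev/null | md5sum -b >> "$d.log"
  done
done
```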

 

The skip tells dd to skip over the first 1471919504 blocks (as I only had a couple of errors with block numbers less than this) and then the count tells dd to copy the 204597073 blocks (about 100GB) that follow.
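For reference, the ~100GB figure follows from dd's default block size of 512 bytes; the arithmetic can be checked in the shell:

```shell
# count=204597073 blocks at dd's default 512-byte block size:
bytes=$(( 204597073 * 512 ))
echo "$bytes bytes"                  # 104753701376 bytes
echo "$(( bytes / 1000000000 )) GB"  # 104 GB (decimal), i.e. roughly 100GB
```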

 

In the first pass I repeated the dd|md5sum test three times on each drive and then I looked at the results.  On one drive the results looked like:

 

fe1804307062eeb93261b18bb63036bf *-
740bc5c5ef4eab169f627d5c2ae45dfa *-
73c0b467088be422c9999f415e81adab *-

 

BINGO!  A different md5 hash each time the drive was tested (and I was pretty certain nothing was writing to it during the test).

 

Since the parity drive's md5 sums were constant through the test, I realized that nothing on the array was being written to during the test, so these changing values must indicate a drive problem.
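With a few passes logged per drive, spotting the flaky one can be automated rather than eyeballed. A sketch, assuming each drive's hashes were appended to sdX.log files as in the script above:

```shell
# A drive whose log contains more than one distinct hash returned
# different data on identical reads - the prime suspect.
for log in sd?.log; do
  [ -e "$log" ] || continue   # skip if no logs exist yet
  n=$(cut -d' ' -f1 "$log" | sort -u | wc -l | tr -d ' ')
  if [ "$n" -gt 1 ]; then
    echo "$log: $n distinct hashes - suspect drive"
  fi
done
```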

 

I did some more tests, limiting my runs to just the bad drive and the parity drive.  The sdc log file (for the parity drive) looked like:

e45e3454fd371b8e907f515f085e95e2 *-
e45e3454fd371b8e907f515f085e95e2 *-
e45e3454fd371b8e907f515f085e95e2 *-
e45e3454fd371b8e907f515f085e95e2 *-
e45e3454fd371b8e907f515f085e95e2 *-
e45e3454fd371b8e907f515f085e95e2 *-
e45e3454fd371b8e907f515f085e95e2 *-

so the parity did not change, and yet sdb2.log looked like:

73c0b467088be422c9999f415e81adab *-
73c0b467088be422c9999f415e81adab *-
600cb1e4adc284ca913c5ef4d83b9d20 *-
8978f191ba809134207e5f385997aac1 *-
c7a4ed6c9c4dd1b71794ce5d94644e72 *-
3cb91ce1e320d7de8e0e302a9d70d2f9 *-
93c82b4ff2b98172ffe9a48266a43361 *-

 

Now note that the magic number 73c0b467088be422c9999f415e81adab has shown up a number of times in the result from the bad drive, so I suspected this was returned when the drive decided to work correctly. 

 

Next I replaced the suspect drive with a fresh (precleared!) drive and rebuilt based on parity, and then ran the test script and again got the "good" md5 value of 73c0b467088be422c9999f415e81adab.

 

Since this point in time I have done 18 parity checks without a single error, when before I was typically only getting 2 or 3 (and at most 7) parity checks between errors.

 

 

I also tried doing a long smart test on the bad drive (several in fact) and these did not change anything.

 

Once I did a preclear of the bad drive (after having removed it from the array) I then found the drive was behaving correctly through dd|md5 tests, so it looks like the preclearing forced the drive to correct the issues, but I'm not putting it back into the array!

 

Here's the last (before preclearing) smart log from the bad drive:

 

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   182   181   021    Pre-fail  Always       -       5891
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       438
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   089   089   000    Old_age   Always       -       8566
10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       34
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       8
193 Load_Cycle_Count        0x0032   186   186   000    Old_age   Always       -       43744
194 Temperature_Celsius     0x0022   122   115   000    Old_age   Always       -       28
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

 

It all looks just fine.  The only errors I ever noticed (and these were reported on various drives in the array - not just the bad drive) were the occasional Multi_Zone_Error_Rate, Raw_Read_Error_Rate and a couple of Current_Pending_Sector errors, all of which went away after a few parity check/smart report cycles.  Sometimes one of these errors would appear during a parity check with errors, other times they would appear during a parity check without any errors - they were not well correlated.

 

Regards,

 

Stephen

 

 

 

 

 

Share this post


Link to post

So should I do another parity check?

 

I would say yes.  And I would expect that this time you'll get no errors. 

 

That said, you might be seeing the same sort of issue that plagued me for at least 6 months last year.  I was getting an occasional parity check error and then the next run of parity checking would show the same number of errors and then for a few more parity checks all would be fine until eventually I would get another error or two.

 

This went on for some time until unRAID 4.5.5 came out which added logging of the block numbers of the first 20 parity errors that were detected.  Once I saw this I started recording the block numbers and found that the errors in the second parity pass were being reported on the same blocks as the first (well usually, sometimes an extra one would show up...).  So I could now see that running the parity check and correct was causing the parity blocks to get flipped between good and bad.  At this point I switched to using unMENU's "parity check but do not correct" function for all my parity checks.  With this I would see an error or two appear on one pass and then not appear again on subsequent passes.

 

Over time (and many parity checks) I started to notice that all the block numbers were within limited band of about 200,000,000 blocks, or about 100GB of disk surface.  I also recorded a number of blocks reappearing several times, out of 45 blocks that reported problems 13 were bad twice, 4 were bad 3 times and one was bad 5 times.

 

I replaced cables, controller card, motherboard, RAM and CPU, but not the case or PSU.  All without any change to the problem.

 

During this time there was very little indication of anything bad in the SMART reports.  I logged SMART reports for all my drives after each parity test and then used kdiff to compare them for differences.

 

I replaced two drives which had shown a few SMART errors without affecting this issue.

 

Eventually I decided that the problem had to be a bad drive, but it must be manifesting in a way that was eluding the SMART system.  So given the range of blocks that were involved I figured that I could reasonably do an MD5SUM of that block range on each disk, record the first value and then repeat the test a number of times until I found a drive which returned a different value.

 

This approach worked quite well.  Initially I ran this script (it exercised the flaky range on all my drives including the parity):

dd if=/dev/sdf skip=1471919504 count=204597073 | md5sum -b > sdf.log
dd if=/dev/sdg skip=1471919504 count=204597073 | md5sum -b > sdg.log
dd if=/dev/sdi skip=1471919504 count=204597073 | md5sum -b > sdi.log
dd if=/dev/sdd skip=1471919504 count=204597073 | md5sum -b > sdd.log
dd if=/dev/sdb skip=1471919504 count=204597073 | md5sum -b > sdb.log
dd if=/dev/sdh skip=1471919504 count=204597073 | md5sum -b > sdh.log
dd if=/dev/sde skip=1471919504 count=204597073 | md5sum -b > sde.log
dd if=/dev/sdf skip=1471919504 count=204597073 | md5sum -b >> sdf.log
dd if=/dev/sdg skip=1471919504 count=204597073 | md5sum -b >> sdg.log
dd if=/dev/sdi skip=1471919504 count=204597073 | md5sum -b >> sdi.log
dd if=/dev/sdd skip=1471919504 count=204597073 | md5sum -b >> sdd.log
dd if=/dev/sdb skip=1471919504 count=204597073 | md5sum -b >> sdb.log
dd if=/dev/sdh skip=1471919504 count=204597073 | md5sum -b >> sdh.log
dd if=/dev/sde skip=1471919504 count=204597073 | md5sum -b >> sde.log
dd if=/dev/sdf skip=1471919504 count=204597073 | md5sum -b >> sdf.log
dd if=/dev/sdg skip=1471919504 count=204597073 | md5sum -b >> sdg.log
dd if=/dev/sdi skip=1471919504 count=204597073 | md5sum -b >> sdi.log
dd if=/dev/sdd skip=1471919504 count=204597073 | md5sum -b >> sdd.log
dd if=/dev/sdb skip=1471919504 count=204597073 | md5sum -b >> sdb.log
dd if=/dev/sdh skip=1471919504 count=204597073 | md5sum -b >> sdh.log
dd if=/dev/sde skip=1471919504 count=204597073 | md5sum -b >> sde.log

 

Just be very careful when using "dd", as with the wrong arguments it can wipe out a drive; in the script above the disks are only ever read from (if=), never written to.

 

The skip tells dd to skip over the first 1471919504 blocks (as I only had a couple of errors with block numbers less than this) and then the count tells dd to copy the 204597073 blocks (about 100GB) that follow.
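The same skip/count mechanics can be demonstrated safely on an ordinary file instead of a raw device.  A sketch, with made-up paths and a deliberately tiny image:

```shell
# Build a small throw-away image file (2048 x 512-byte blocks of random data)
dd if=/dev/urandom of=/tmp/demo.img bs=512 count=2048 2>/dev/null

# Hash the same 512-block range twice, exactly as the drive script does.
# dd's default block size is 512 bytes, matching the sector arithmetic above.
dd if=/tmp/demo.img skip=1024 count=512 2>/dev/null | md5sum -b >  /tmp/demo.md5
dd if=/tmp/demo.img skip=1024 count=512 2>/dev/null | md5sum -b >> /tmp/demo.md5

# On healthy media every pass produces the identical hash,
# so the de-duplicated log collapses to a single line
sort -u /tmp/demo.md5 | wc -l
```

A drive that returns different hashes for repeated reads of the same unchanged range is, by this logic, returning different data each time.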

 

In the first pass I repeated the dd|md5sum test three times on each drive and then I looked at the results.  On one drive the results looked like:

 

fe1804307062eeb93261b18bb63036bf *-
740bc5c5ef4eab169f627d5c2ae45dfa *-
73c0b467088be422c9999f415e81adab *-

 

BINGO!  A different md5 hash each time the drive was tested (and I was pretty certain nothing was writing to it during the test).

 

Since the parity drive's md5 sums were constant through the test I realized that nothing on the array was being written to during the test so these changing values must indicate a drive problem.

 

I did some more tests, limiting my runs to just the suspect drive and the parity drive.

 

The sdc log file (for the parity drive) came out as:

e45e3454fd371b8e907f515f085e95e2 *-
e45e3454fd371b8e907f515f085e95e2 *-
e45e3454fd371b8e907f515f085e95e2 *-
e45e3454fd371b8e907f515f085e95e2 *-
e45e3454fd371b8e907f515f085e95e2 *-
e45e3454fd371b8e907f515f085e95e2 *-
e45e3454fd371b8e907f515f085e95e2 *-

so the parity did not change, and yet sdb2.log looked like:

73c0b467088be422c9999f415e81adab *-
73c0b467088be422c9999f415e81adab *-
600cb1e4adc284ca913c5ef4d83b9d20 *-
8978f191ba809134207e5f385997aac1 *-
c7a4ed6c9c4dd1b71794ce5d94644e72 *-
3cb91ce1e320d7de8e0e302a9d70d2f9 *-
93c82b4ff2b98172ffe9a48266a43361 *-

 

Now note that the magic number 73c0b467088be422c9999f415e81adab showed up a number of times in the results from the bad drive, so I suspected this was the value returned when the drive decided to work correctly.
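One quick way to pick out that "good" value is to count how often each hash occurs in a drive's log; the most frequent one is the likely correct read.  A sketch on invented sample data:

```shell
# Invented sample of a flaky drive's md5 log (mirrors the sdb2.log pattern)
cat > /tmp/md5.sample <<'EOF'
73c0b467088be422c9999f415e81adab *-
73c0b467088be422c9999f415e81adab *-
600cb1e4adc284ca913c5ef4d83b9d20 *-
73c0b467088be422c9999f415e81adab *-
EOF

# Count occurrences of each hash, most frequent first;
# the top line is the best guess at the "good" value
sort /tmp/md5.sample | uniq -c | sort -rn | head -1
```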

 

Next I replaced the suspect drive with a fresh (precleared!) drive and rebuilt based on parity, and then ran the test script and again got the "good" md5 value of 73c0b467088be422c9999f415e81adab.

 

Since this point in time I have done 18 parity checks without a single error, where before I was typically getting only 2 or 3 (and at most 7) clean parity checks between errors.

 

 

I also tried doing a long smart test on the bad drive (several in fact) and these did not change anything.

 

Once I did a preclear of the bad drive (after having removed it from the array) I found that it behaved correctly through the dd|md5sum tests, so it looks like the preclearing forced the drive to correct the issue, but I'm not putting it back into the array!

 

Here's the last (before preclearing) smart log from the bad drive:

 

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   182   181   021    Pre-fail  Always       -       5891
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       438
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   089   089   000    Old_age   Always       -       8566
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       34
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       8
193 Load_Cycle_Count        0x0032   186   186   000    Old_age   Always       -       43744
194 Temperature_Celsius     0x0022   122   115   000    Old_age   Always       -       28
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

 

It all looks just fine.  The only errors I ever noticed (and these were reported on various drive in the array - not just the bad drive) were the occasional Multi_Zone_Error_Rate, Raw_Read_Error_Rate and a couple of Current_Pending_Sector errors, all of which went away after a few parity check/smart report cycles.  Sometimes one of these errors would appear during a parity check with errors, other times they would appear during a parity check without any errors - they were not well correlated.

 

Regards,

 

Stephen

 

 

Thanks for the info.  I think I will probably just wait until the new version of unRAID becomes the stable release, and then advanced-format my drives.  I only have around 1TB of data so far, so I can just copy it to my other drives, format all of them and then put it back on.  Seems like a lot of work, but I guess it has to be done!  Where can I buy jumpers for these HDs?

Share this post


Link to post

At this point, run a parity check again using the nocorrect option.

 

mdcmd check NOCORRECT

 

Check where any errors occur and see if it's a common spot.
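If the block numbers from two NOCORRECT runs have been saved (one per line), the overlap is easy to compute with comm.  A sketch using made-up block lists (comm requires sorted input):

```shell
# Made-up block lists from two successive parity-check runs,
# one block number per line, already in sorted order
printf '495961960\n529594768\n' > /tmp/run1.blocks
printf '495961960\n800000000\n' > /tmp/run2.blocks

# Blocks reported by BOTH runs; a repeating spot points at bad media
comm -12 /tmp/run1.blocks /tmp/run2.blocks
```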

 

Your drives are already Advanced Format.  "Advanced Format" refers to how the data is stored internally on the drive; no amount of formatting, and no jumper, will change how the drive works internally.  If you have formatted EARS drives without a jumper, you can delete the partition and let unRAID rebuild the drive with the correct partition.

 

Peter

 

Share this post


Link to post
