Tijuana Posted January 19, 2011

After my computer shut down randomly, the parity check on the next boot found 2 errors. I ran another parity check a few days later, and it still shows 2 errors. This is my log. Using v4.6 AIO.

Jan 19 08:48:17 Tower emhttp: Spinning up all drives...
Jan 19 08:48:17 Tower kernel: mdcmd (18): spinup 0
Jan 19 08:48:17 Tower kernel: mdcmd (19): spinup 1
Jan 19 08:48:17 Tower kernel: mdcmd (20): spinup 2
Jan 19 08:51:14 Tower kernel: mdcmd (21): check CORRECT
Jan 19 08:51:14 Tower kernel: md: recovery thread woken up ...
Jan 19 08:51:14 Tower kernel: md: recovery thread checking parity...
Jan 19 08:51:14 Tower kernel: md: using 1152k window, over a total of 1953514552 blocks.
Jan 19 09:30:41 Tower kernel: md: parity incorrect: 495961960
Jan 19 09:33:28 Tower kernel: md: parity incorrect: 529594768
Jan 19 13:08:44 Tower kernel: mdcmd (22): spindown 2
Jan 19 13:42:28 Tower ntpd[1804]: time reset -0.169712 s
Jan 19 13:42:51 Tower ntpd[1804]: synchronized to 208.83.212.8, stratum 2
Jan 19 15:48:58 Tower kernel: md: sync done. time=25065sec rate=77937K/sec
Jan 19 15:48:58 Tower kernel: md: recovery thread sync completion status: 0
lionelhutz Posted January 19, 2011

Try a SMART test on each drive and see if one is showing bad sectors or sectors pending reallocation.

Peter
Tijuana Posted January 19, 2011 Author

And I do this how...
kizer Posted January 20, 2011

Found this in the Troubleshooting page linked from the "Read here first" sticky at the top:

http://lime-technology.com/wiki/index.php?title=Troubleshooting#Running_a_SMART_test

No biggie, and I hope it helps you.
BRiT Posted January 20, 2011

smartctl -t short /dev/[sh]d?

For instance:

smartctl -t short /dev/sda
smartctl -t short /dev/sdb
smartctl -t short /dev/sdc
smartctl -t short /dev/sdd

Then wait 2 to 3 minutes, then retrieve the results using:

smartctl -a /dev/sda
smartctl -a /dev/sdb
smartctl -a /dev/sdc
smartctl -a /dev/sdd
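If there are more than a few drives, the same commands can be put in a small loop. A minimal sketch, assuming the drives are sda through sdd; the device list and the /boot output filenames are just examples and should be changed to match your own array:

for d in sda sdb sdc sdd; do      # example device list - edit to match your drives
    smartctl -t short /dev/$d     # start the short self-test on each drive
done

sleep 180                         # short tests normally finish within 2-3 minutes

for d in sda sdb sdc sdd; do
    smartctl -a /dev/$d > /boot/smart_$d.txt   # save each report to the flash drive for later comparison
done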
Tijuana Posted January 20, 2011 Author

So it only takes a few minutes to complete? Do I have to stop the server while doing the tests?
Joe L. Posted January 20, 2011

Do I have to stop the server while doing the tests?

No.
SSD Posted January 20, 2011

I remember a long time ago, a person ran a parity check and got a bunch of parity errors, and then reran it and got EXACTLY the same number of parity errors. It turned out one of the drives in the first parity check returned bad data, and then the second parity check basically corrected the parity data.
Tijuana Posted January 20, 2011 Author

smartctl 5.39.1 2010-01-28 r3054 [i486-slackware-linux-gnu] (local build)
Copyright © 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Device Model:     WDC WD20EARS-00MVWB0
Serial Number:    WD-WMAZA1594200
Firmware Version: 51.0AB51
User Capacity:    2,000,398,934,016 bytes
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   8
ATA Standard is:  Exact ATA specification draft version not indicated
Local Time is:    Wed Jan 19 22:11:36 2011 EST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status: (0x84) Offline data collection activity was suspended by an interrupting command from host. Auto Offline Data Collection: Enabled.
Self-test execution status: (0) The previous self-test routine completed without error or no self-test has ever been run.
Total time to complete Offline data collection: (37980) seconds.
Offline data collection capabilities: (0x7b) SMART execute Offline immediate. Auto Offline data collection on/off support. Suspend Offline collection upon new command. Offline surface scan supported. Self-test supported. Conveyance Self-test supported. Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering power-saving mode. Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported. General Purpose Logging supported.
Short self-test routine recommended polling time: (2) minutes.
Extended self-test routine recommended polling time: (255) minutes.
Conveyance self-test routine recommended polling time: (5) minutes.
SCT capabilities: (0x3035) SCT Status supported. SCT Feature Control supported. SCT Data Table supported.
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG   VALUE WORST THRESH TYPE     UPDATED WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f 200   200   051    Pre-fail Always  -           0
  3 Spin_Up_Time            0x0027 209   190   021    Pre-fail Always  -           4525
  4 Start_Stop_Count        0x0032 100   100   000    Old_age  Always  -           46
  5 Reallocated_Sector_Ct   0x0033 200   200   140    Pre-fail Always  -           0
  7 Seek_Error_Rate         0x002e 200   200   000    Old_age  Always  -           0
  9 Power_On_Hours          0x0032 100   100   000    Old_age  Always  -           164
 10 Spin_Retry_Count        0x0032 100   253   000    Old_age  Always  -           0
 11 Calibration_Retry_Count 0x0032 100   253   000    Old_age  Always  -           0
 12 Power_Cycle_Count       0x0032 100   100   000    Old_age  Always  -           32
192 Power-Off_Retract_Count 0x0032 200   200   000    Old_age  Always  -           25
193 Load_Cycle_Count        0x0032 200   200   000    Old_age  Always  -           145
194 Temperature_Celsius     0x0022 113   109   000    Old_age  Always  -           37
196 Reallocated_Event_Count 0x0032 200   200   000    Old_age  Always  -           0
197 Current_Pending_Sector  0x0032 200   200   000    Old_age  Always  -           0
198 Offline_Uncorrectable   0x0030 200   200   000    Old_age  Offline -           0
199 UDMA_CRC_Error_Count    0x0032 200   200   000    Old_age  Always  -           0
200 Multi_Zone_Error_Rate   0x0008 200   200   000    Old_age  Offline -           0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description  Status                   Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline     Completed without error  00%        164              -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0): After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

That was the parity drive, now this is the 2nd main drive:

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status: (0x84) Offline data collection activity was suspended by an interrupting command from host. Auto Offline Data Collection: Enabled.
Self-test execution status: (0) The previous self-test routine completed without error or no self-test has ever been run.
Total time to complete Offline data collection: (37260) seconds.
Offline data collection capabilities: (0x7b) SMART execute Offline immediate. Auto Offline data collection on/off support. Suspend Offline collection upon new command. Offline surface scan supported. Self-test supported. Conveyance Self-test supported. Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering power-saving mode. Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported. General Purpose Logging supported.
Short self-test routine recommended polling time: (2) minutes.
Extended self-test routine recommended polling time: (255) minutes.
Conveyance self-test routine recommended polling time: (5) minutes.
SCT capabilities: (0x3035) SCT Status supported. SCT Feature Control supported. SCT Data Table supported.
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG   VALUE WORST THRESH TYPE     UPDATED WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f 200   200   051    Pre-fail Always  -           0
  3 Spin_Up_Time            0x0027 182   167   021    Pre-fail Always  -           5858
  4 Start_Stop_Count        0x0032 100   100   000    Old_age  Always  -           38
  5 Reallocated_Sector_Ct   0x0033 200   200   140    Pre-fail Always  -           0
  7 Seek_Error_Rate         0x002e 200   200   000    Old_age  Always  -           0
  9 Power_On_Hours          0x0032 100   100   000    Old_age  Always  -           164
 10 Spin_Retry_Count        0x0032 100   253   000    Old_age  Always  -           0
 11 Calibration_Retry_Count 0x0032 100   253   000    Old_age  Always  -           0
 12 Power_Cycle_Count       0x0032 100   100   000    Old_age  Always  -           25
192 Power-Off_Retract_Count 0x0032 200   200   000    Old_age  Always  -           17
193 Load_Cycle_Count        0x0032 200   200   000    Old_age  Always  -           182
194 Temperature_Celsius     0x0022 113   110   000    Old_age  Always  -           37
196 Reallocated_Event_Count 0x0032 200   200   000    Old_age  Always  -           0
197 Current_Pending_Sector  0x0032 200   200   000    Old_age  Always  -           0
198 Offline_Uncorrectable   0x0030 200   200   000    Old_age  Offline -           0
199 UDMA_CRC_Error_Count    0x0032 200   200   000    Old_age  Always  -           0
200 Multi_Zone_Error_Rate   0x0008 200   200   000    Old_age  Offline -           0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description  Status                   Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline     Completed without error  00%        164              -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0): After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

So what is the problem?
lionelhutz Posted January 20, 2011

Do all the drives. Personally, I'm curious if the Reallocated_Sector_Ct or Current_Pending_Sector lines read anything but 0.

Peter
Tijuana Posted January 20, 2011 Author

This is the drive I added in *AFTER* the parity errors. I don't know much about these tests, but I'm guessing this drive isn't the best to be using.

=== START OF INFORMATION SECTION ===
Model Family:     Seagate Barracuda 7200.11 family
Device Model:     ST31000340AS
Serial Number:    5QJ0X94K
Firmware Version: SD15
User Capacity:    1,000,204,886,016 bytes
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   8
ATA Standard is:  ATA-8-ACS revision 4
Local Time is:    Wed Jan 19 22:15:57 2011 EST

==> WARNING: There are known problems with these drives, AND THIS FIRMWARE VERSION IS AFFECTED, see the following Seagate web pages:
http://seagate.custkb.com/seagate/crm/selfservice/search.jsp?DocId=207931
http://seagate.custkb.com/seagate/crm/selfservice/search.jsp?DocId=207951

SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status: (0x82) Offline data collection activity was completed without error. Auto Offline Data Collection: Enabled.
Self-test execution status: (25) The self-test routine was aborted by the host.
Total time to complete Offline data collection: (642) seconds.
Offline data collection capabilities: (0x7b) SMART execute Offline immediate. Auto Offline data collection on/off support. Suspend Offline collection upon new command. Offline surface scan supported. Self-test supported. Conveyance Self-test supported. Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering power-saving mode. Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported. General Purpose Logging supported.
Short self-test routine recommended polling time: (1) minutes.
Extended self-test routine recommended polling time: (237) minutes.
Conveyance self-test routine recommended polling time: (2) minutes.
SCT capabilities: (0x103b) SCT Status supported. SCT Feature Control supported. SCT Data Table supported.
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG   VALUE WORST THRESH TYPE     UPDATED WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f 117   099   006    Pre-fail Always  -           123578872
  3 Spin_Up_Time            0x0003 094   091   000    Pre-fail Always  -           0
  4 Start_Stop_Count        0x0032 100   100   020    Old_age  Always  -           947
  5 Reallocated_Sector_Ct   0x0033 100   100   036    Pre-fail Always  -           0
  7 Seek_Error_Rate         0x000f 075   060   030    Pre-fail Always  -           4326860246
  9 Power_On_Hours          0x0032 092   092   000    Old_age  Always  -           7437
 10 Spin_Retry_Count        0x0013 100   100   097    Pre-fail Always  -           2
 12 Power_Cycle_Count       0x0032 100   100   020    Old_age  Always  -           966
184 End-to-End_Error        0x0032 100   100   099    Old_age  Always  -           0
187 Reported_Uncorrect      0x0032 100   100   000    Old_age  Always  -           0
188 Command_Timeout         0x0032 100   099   000    Old_age  Always  -           8590065666
189 High_Fly_Writes         0x003a 100   100   000    Old_age  Always  -           0
190 Airflow_Temperature_Cel 0x0022 074   050   045    Old_age  Always  -           26 (Lifetime Min/Max 24/39)
194 Temperature_Celsius     0x0022 026   050   000    Old_age  Always  -           26 (0 17 0 0)
195 Hardware_ECC_Recovered  0x001a 049   013   000    Old_age  Always  -           123578872
197 Current_Pending_Sector  0x0012 100   100   000    Old_age  Always  -           0
198 Offline_Uncorrectable   0x0010 100   100   000    Old_age  Offline -           0
199 UDMA_CRC_Error_Count    0x003e 200   193   000    Old_age  Always  -           19

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description  Status            Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline     Aborted by host   90%        7437             -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0): After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

Should I install the new firmware or just not even use it?
Joe L. Posted January 20, 2011

I see nothing wrong, what do you see? (Other than the warning that you need to update the firmware.)

Joe L.
Tijuana Posted January 20, 2011 Author

1 Raw_Read_Error_Rate 0x000f 117 099 006 Pre-fail Always 123578872

The others had 0 as the value. Regardless, why am I getting parity errors?
Joe L. Posted January 20, 2011

Absolutely nothing wrong. The VALUE of 117 is well above the failure THRESHOLD of 6. The worst it has ever been is 99, still well above the failure threshold. The raw value has meaning ONLY to the manufacturer. Some model drives show a raw value, some do not and show 0.
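That rule of thumb (an attribute only matters once its normalized VALUE falls to or below its THRESH) can be checked mechanically across a whole report. A rough sketch; /dev/sda is just a placeholder for whichever drive is being examined, and it relies on the column order smartctl prints (ID# NAME FLAG VALUE WORST THRESH ...):

# Print any attribute whose normalized VALUE has dropped to or below its non-zero THRESH.
smartctl -A /dev/sda | awk '$1 ~ /^[0-9]+$/ && $6+0 > 0 && $4+0 <= $6+0 {
    print "Attribute " $1 " (" $2 ") is at/below threshold: VALUE=" $4 " THRESH=" $6
}'

If it prints nothing, none of the attributes are anywhere near failing, which matches what the reports above show.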
Tijuana Posted January 20, 2011 Author

So should I do another parity check?
vca Posted January 20, 2011

So should I do another parity check?

I would say yes, and I would expect that this time you'll get no errors.

That said, you might be seeing the same sort of issue that plagued me for at least 6 months last year. I was getting an occasional parity check error, and then the next run of parity checking would show the same number of errors, and then for a few more parity checks all would be fine until eventually I would get another error or two. This went on for some time until unRAID 4.5.5 came out, which added logging of the block numbers of the first 20 parity errors detected. Once I saw this I started recording the block numbers and found that the errors in the second parity pass were being reported on the same blocks as the first (well, usually; sometimes an extra one would show up...). So I could now see that running the parity check and correct was causing the parity blocks to get flipped between good and bad.

At this point I switched to using unMENU's "parity check but do not correct" function for all my parity checks. With this I would see an error or two appear on one pass and then not appear again on subsequent passes. Over time (and many parity checks) I started to notice that all the block numbers were within a limited band of about 200,000,000 blocks, or about 100GB of disk surface. I also recorded a number of blocks reappearing several times: out of 45 blocks that reported problems, 13 were bad twice, 4 were bad 3 times and one was bad 5 times.

I replaced cables, controller card, motherboard, RAM and CPU, but not the case or PSU - all without any change to the problem. During this time there was very little indication of anything bad in the SMART reports, and I logged SMART reports for all my drives after each parity test and then used kdiff to compare them for differences. I replaced two drives which had shown a few SMART errors without affecting this issue.

Eventually I decided that the problem had to be a bad drive, but it must be manifesting in a way that was eluding the SMART system. So given the range of blocks that were involved, I figured that I could reasonably do an MD5SUM of that block range on each disk, record the first value and then repeat the test a number of times until I found a drive which returned a different value. This approach worked quite well.
Initially I ran this script (it exercised the flaky range on all my drives, including the parity):

dd if=/dev/sdf skip=1471919504 count=204597073 | md5sum -b > sdf.log
dd if=/dev/sdg skip=1471919504 count=204597073 | md5sum -b > sdg.log
dd if=/dev/sdi skip=1471919504 count=204597073 | md5sum -b > sdi.log
dd if=/dev/sdd skip=1471919504 count=204597073 | md5sum -b > sdd.log
dd if=/dev/sdb skip=1471919504 count=204597073 | md5sum -b > sdb.log
dd if=/dev/sdh skip=1471919504 count=204597073 | md5sum -b > sdh.log
dd if=/dev/sde skip=1471919504 count=204597073 | md5sum -b > sde.log
dd if=/dev/sdf skip=1471919504 count=204597073 | md5sum -b >> sdf.log
dd if=/dev/sdg skip=1471919504 count=204597073 | md5sum -b >> sdg.log
dd if=/dev/sdi skip=1471919504 count=204597073 | md5sum -b >> sdi.log
dd if=/dev/sdd skip=1471919504 count=204597073 | md5sum -b >> sdd.log
dd if=/dev/sdb skip=1471919504 count=204597073 | md5sum -b >> sdb.log
dd if=/dev/sdh skip=1471919504 count=204597073 | md5sum -b >> sdh.log
dd if=/dev/sde skip=1471919504 count=204597073 | md5sum -b >> sde.log
dd if=/dev/sdf skip=1471919504 count=204597073 | md5sum -b >> sdf.log
dd if=/dev/sdg skip=1471919504 count=204597073 | md5sum -b >> sdg.log
dd if=/dev/sdi skip=1471919504 count=204597073 | md5sum -b >> sdi.log
dd if=/dev/sdd skip=1471919504 count=204597073 | md5sum -b >> sdd.log
dd if=/dev/sdb skip=1471919504 count=204597073 | md5sum -b >> sdb.log
dd if=/dev/sdh skip=1471919504 count=204597073 | md5sum -b >> sdh.log
dd if=/dev/sde skip=1471919504 count=204597073 | md5sum -b >> sde.log

Just be very careful about using "dd", as it can be used to wipe out a drive... The skip tells dd to skip over the first 1471919504 blocks (as I only had a couple of errors with block numbers less than this), and then the count tells dd to read the 204597073 blocks (about 100GB) that follow.

In the first pass I repeated the dd|md5sum test three times on each drive and then looked at the results. On one drive the results looked like:

fe1804307062eeb93261b18bb63036bf *-
740bc5c5ef4eab169f627d5c2ae45dfa *-
73c0b467088be422c9999f415e81adab *-

BINGO! A different md5 hash each time the drive was tested (and I was pretty certain nothing was writing to it during the test). Since the parity drive's md5 sums were constant through the test, I realized that nothing on the array was being written to during the test, so these changing values must indicate a drive problem.

I did some more tests, limiting my runs to just the bad drive and the parity drive, and got this sdc log file (for the parity drive):

e45e3454fd371b8e907f515f085e95e2 *-
e45e3454fd371b8e907f515f085e95e2 *-
e45e3454fd371b8e907f515f085e95e2 *-
e45e3454fd371b8e907f515f085e95e2 *-
e45e3454fd371b8e907f515f085e95e2 *-
e45e3454fd371b8e907f515f085e95e2 *-
e45e3454fd371b8e907f515f085e95e2 *-

So the parity did not change, and yet sdb2.log looked like:

73c0b467088be422c9999f415e81adab *-
73c0b467088be422c9999f415e81adab *-
600cb1e4adc284ca913c5ef4d83b9d20 *-
8978f191ba809134207e5f385997aac1 *-
c7a4ed6c9c4dd1b71794ce5d94644e72 *-
3cb91ce1e320d7de8e0e302a9d70d2f9 *-
93c82b4ff2b98172ffe9a48266a43361 *-

Now note that the magic number 73c0b467088be422c9999f415e81adab has shown up a number of times in the results from the bad drive, so I suspected this was what it returned when the drive decided to work correctly. Next I replaced the suspect drive with a fresh (precleared!) drive and rebuilt based on parity, and then ran the test script and again got the "good" md5 value of 73c0b467088be422c9999f415e81adab.
Since this point in time I have done 18 parity checks without a single error, when before I was typically only getting 2 or 3 (and at most 7) parity checks between errors.

I also tried doing a long SMART test on the bad drive (several, in fact) and these did not change anything. Once I did a preclear of the bad drive (after having removed it from the array) I found the drive was behaving correctly through dd|md5 tests, so it looks like the preclearing forced the drive to correct the issues, but I'm not putting it back into the array!

Here's the last (before preclearing) SMART log from the bad drive:

ID# ATTRIBUTE_NAME          FLAG   VALUE WORST THRESH TYPE     UPDATED WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f 200   200   051    Pre-fail Always  -           0
  3 Spin_Up_Time            0x0027 182   181   021    Pre-fail Always  -           5891
  4 Start_Stop_Count        0x0032 100   100   000    Old_age  Always  -           438
  5 Reallocated_Sector_Ct   0x0033 200   200   140    Pre-fail Always  -           0
  7 Seek_Error_Rate         0x002e 200   200   000    Old_age  Always  -           0
  9 Power_On_Hours          0x0032 089   089   000    Old_age  Always  -           8566
 10 Spin_Retry_Count        0x0032 100   100   000    Old_age  Always  -           0
 11 Calibration_Retry_Count 0x0032 100   253   000    Old_age  Always  -           0
 12 Power_Cycle_Count       0x0032 100   100   000    Old_age  Always  -           34
192 Power-Off_Retract_Count 0x0032 200   200   000    Old_age  Always  -           8
193 Load_Cycle_Count        0x0032 186   186   000    Old_age  Always  -           43744
194 Temperature_Celsius     0x0022 122   115   000    Old_age  Always  -           28
196 Reallocated_Event_Count 0x0032 200   200   000    Old_age  Always  -           0
197 Current_Pending_Sector  0x0032 200   200   000    Old_age  Always  -           0
198 Offline_Uncorrectable   0x0030 200   200   000    Old_age  Offline -           0
199 UDMA_CRC_Error_Count    0x0032 200   200   000    Old_age  Always  -           0
200 Multi_Zone_Error_Rate   0x0008 200   200   000    Old_age  Offline -           0

It all looks just fine. The only errors I ever noticed (and these were reported on various drives in the array, not just the bad drive) were the occasional Multi_Zone_Error_Rate, Raw_Read_Error_Rate and a couple of Current_Pending_Sector errors, all of which went away after a few parity check/SMART report cycles. Sometimes one of these errors would appear during a parity check with errors, other times they would appear during a parity check without any errors - they were not well correlated.

Regards,
Stephen
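For anyone wanting to try the same repeat-and-compare approach, the passes are easy to script. A rough sketch only: it reuses the block range from the post above as an example, the device list and log paths are placeholders that must be edited to match your own array, and the commands only ever read from the drives (if=), never write to them:

#!/bin/bash
# Read the same suspect block range from each drive several times and compare md5 hashes.
# A drive that returns different hashes for a read-only workload is returning inconsistent data.
SKIP=1471919504     # first 512-byte block of the suspect range (example value from the post above)
COUNT=204597073     # number of blocks to read (roughly 100GB)
PASSES=3
for dev in sdb sdd sde; do                     # example device list - edit to match your drives
    log=/boot/md5_${dev}.log
    : > "$log"
    for i in $(seq 1 $PASSES); do
        dd if=/dev/$dev skip=$SKIP count=$COUNT 2>/dev/null | md5sum -b >> "$log"
    done
    # more than one distinct hash means the drive did not return the same data every time
    if [ "$(sort -u "$log" | wc -l)" -gt 1 ]; then
        echo "WARNING: /dev/$dev returned differing md5 sums - suspect drive"
    else
        echo "/dev/$dev: consistent across $PASSES passes"
    fi
done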
Tijuana Posted January 20, 2011 Author

Thanks for the info. I think I will probably just wait until the new version of unRAID becomes the stable release, and then advanced format my drives. I only have around 1TB of data so far, so I can just copy it to my other drives, format all of it and then put it back on. Seems like a lot of work, but I guess it has to be done! Where can I buy jumpers for these HDs?
lionelhutz Posted January 20, 2011

At this point, run a parity check again using the nocorrect option:

mdcmd check NOCORRECT

Check where any errors occur and see if it's a common spot.

Your drives are already advanced format. "Advanced format" refers to how the data is stored internally on the drive; no amount of formatting or a jumper will change how the drive works internally. If you have formatted EARS drives without a jumper, then you can delete the partition and let unRAID rebuild the drive with the correct partition.

Peter
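When the non-correcting check finishes, the flagged block numbers appear in the syslog the same way they did in the first post, so they can be pulled out and compared between runs. A small sketch, assuming the standard /var/log/syslog location; the output filename on the flash drive is just an example:

grep "parity incorrect" /var/log/syslog
# lines look like the one in the original log, e.g.
# Jan 19 09:30:41 Tower kernel: md: parity incorrect: 495961960

# optionally save each run's block numbers for later comparison:
grep "parity incorrect" /var/log/syslog > /boot/parity_errors_$(date +%Y%m%d).txt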