dl Posted May 10, 2010

Hi all,

I keep getting sync errors (more than 1000) every time I start a parity check. I have tried 4.4.x and 4.5.x with the same result. The machine has passed memtest. I am wondering whether this is most likely a hardware issue. If so, what is the best way to isolate the cause (hard disk, controller, cable, etc.)?

Regards,
dl
BRiT Posted May 10, 2010

Have you examined your syslog? Have you run HD SMART tests?
purko Posted May 10, 2010

dl wrote:
> I am just wondering if it is most likely a hardware issue. If so, what is the best way to isolate the issue (between hard disk, controller, cable, etc)?

http://lime-technology.com/wiki/index.php?title=Troubleshooting#How_to_get_help
dl (Author) Posted May 10, 2010

Here are the syslog and smartctl output. I replaced the real file names with /xxxx/xxx/xxx. Thanks in advance!

dl

Attachments: syslog.txt, smartctl.txt
Joe L. Posted May 10, 2010

dl wrote:
> here is the syslog and smartctl output. I replaced the true file name with /xxxx/xxx/xxx. Thanks in advance!

The Maxtor SMART report shows:

5 Reallocated_Sector_Ct 0x0033 248 248 063 Pre-fail Always - 57

Run another parity check and see whether the reallocated sector count increases. If it does, the odds are that this disk is causing your errors.
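The before/after comparison suggested here can be scripted instead of read by eye. The following is only a minimal sketch: it assumes the attribute table from two smartctl reports has been saved as plain text, and the function names `parse_smart_attributes` and `changed_attributes` are hypothetical helpers, not part of smartmontools or unRAID.

```python
def parse_smart_attributes(report: str) -> dict:
    """Map attribute ID -> (attribute name, raw value) for each attribute row."""
    attributes = {}
    for line in report.splitlines():
        fields = line.split()
        # An attribute row has at least 10 columns: ID, name, flag, value,
        # worst, threshold, type, updated, when_failed, raw value.
        if len(fields) >= 10 and fields[0].isdigit() and fields[2].startswith("0x"):
            attributes[int(fields[0])] = (fields[1], " ".join(fields[9:]))
    return attributes


def changed_attributes(before_report: str, after_report: str) -> dict:
    """Return {ID: (name, old raw value, new raw value)} for rows that differ."""
    before = parse_smart_attributes(before_report)
    after = parse_smart_attributes(after_report)
    return {
        attr_id: (name, before[attr_id][1], raw)
        for attr_id, (name, raw) in after.items()
        if attr_id in before and before[attr_id][1] != raw
    }


if __name__ == "__main__":
    # Rows taken from the reports posted in this thread.
    before = (
        "  5 Reallocated_Sector_Ct  0x0033 248 248 063 Pre-fail Always - 57\n"
        "195 Hardware_ECC_Recovered 0x000a 253 252 000 Old_age  Always - 39273\n"
    )
    after = (
        "  5 Reallocated_Sector_Ct  0x0033 248 248 063 Pre-fail Always - 57\n"
        "195 Hardware_ECC_Recovered 0x000a 253 252 000 Old_age  Always - 39736\n"
    )
    for attr_id, (name, old, new) in changed_attributes(before, after).items():
        print(f"{attr_id} {name}: {old} -> {new}")
```

Raw values are compared as strings on purpose, because some of them (for example Power_On_Minutes) are not plain integers.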
dl (Author) Posted May 10, 2010

It seems that the only error count that changed was the following:

old value: 195 Hardware_ECC_Recovered 0x000a 253 252 000 Old_age Always - 39273
new value: 195 Hardware_ECC_Recovered 0x000a 253 252 000 Old_age Always - 39736

regards,
dl

smartctl version 5.38 [i486-slackware-linux-gnu] Copyright © 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF INFORMATION SECTION ===
Model Family:     Maxtor DiamondMax 16 family
Device Model:     Maxtor 4R120L0
Serial Number:    R42GVE3E
Firmware Version: RAMB1UU0
User Capacity:    122,942,324,736 bytes
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   7
ATA Standard is:  ATA/ATAPI-7 T13 1532D revision 0
Local Time is:    Mon May 10 14:01:59 2010 GMT+8
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status: (0x82)
    Offline data collection activity was completed without error.
    Auto Offline Data Collection: Enabled.
Self-test execution status: ( 19)
    The self-test routine was aborted by the host.
Total time to complete Offline data collection: ( 182) seconds.
Offline data collection capabilities: (0x5b)
    SMART execute Offline immediate.
    Auto Offline data collection on/off support.
    Suspend Offline collection upon new command.
    Offline surface scan supported.
    Self-test supported.
    No Conveyance Self-test supported.
    Selective Self-test supported.
SMART capabilities: (0x0003)
    Saves SMART data before entering power-saving mode.
    Supports SMART auto save timer.
Error logging capability: (0x01)
    Error logging supported.
    No General Purpose Logging support.
Short self-test routine recommended polling time: ( 2) minutes.
Extended self-test routine recommended polling time: ( 74) minutes.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG   VALUE WORST THRESH TYPE     UPDATED WHEN_FAILED RAW_VALUE
  3 Spin_Up_Time            0x0027 220   220   063    Pre-fail Always  -           8671
  4 Start_Stop_Count        0x0032 253   253   000    Old_age  Always  -           1214
  5 Reallocated_Sector_Ct   0x0033 248   248   063    Pre-fail Always  -           57
  6 Read_Channel_Margin     0x0001 253   253   100    Pre-fail Offline -           0
  7 Seek_Error_Rate         0x000a 253   252   000    Old_age  Always  -           0
  8 Seek_Time_Performance   0x0027 252   238   187    Pre-fail Always  -           43654
  9 Power_On_Minutes        0x0032 240   240   000    Old_age  Always  -           448h+41m
 10 Spin_Retry_Count        0x002b 253   252   157    Pre-fail Always  -           0
 11 Calibration_Retry_Count 0x002b 253   252   223    Pre-fail Always  -           0
 12 Power_Cycle_Count       0x0032 252   252   000    Old_age  Always  -           457
192 Power-Off_Retract_Count 0x0032 253   253   000    Old_age  Always  -           0
193 Load_Cycle_Count        0x0032 253   253   000    Old_age  Always  -           0
194 Temperature_Celsius     0x0032 253   253   000    Old_age  Always  -           30
195 Hardware_ECC_Recovered  0x000a 253   252   000    Old_age  Always  -           39736
196 Reallocated_Event_Count 0x0008 252   252   000    Old_age  Offline -           1
197 Current_Pending_Sector  0x0008 253   253   000    Old_age  Offline -           0
198 Offline_Uncorrectable   0x0008 253   253   000    Old_age  Offline -           0
199 UDMA_CRC_Error_Count    0x0008 199   199   000    Old_age  Offline -           0
200 Multi_Zone_Error_Rate   0x000a 253   252   000    Old_age  Always  -           0
201 Soft_Read_Error_Rate    0x000a 253   252   000    Old_age  Always  -           0
202 TA_Increase_Count       0x000a 253   252   000    Old_age  Always  -           0
203 Run_Out_Cancel          0x000b 253   252   180    Pre-fail Always  -           0
204 Shock_Count_Write_Opern 0x000a 253   252   000    Old_age  Always  -           0
205 Shock_Rate_Write_Opern  0x000a 253   252   000    Old_age  Always  -           0
207 Spin_High_Current       0x002a 253   252   000    Old_age  Always  -           0
208 Spin_Buzz               0x002a 253   252   000    Old_age  Always  -           0
209 Offline_Seek_Performnce 0x0024 139   139   000    Old_age  Offline -           0
 99 Unknown_Attribute       0x0004 253   253   000    Old_age  Offline -           0
100 Unknown_Attribute       0x0004 253   253   000    Old_age  Offline -           0
101 Unknown_Attribute       0x0004 253   253   000    Old_age  Offline -           0

SMART Error Log Version: 1
Warning: ATA error count 7 inconsistent with error log pointer 5

ATA Error Count: 7 (device log contains only the most recent five errors)
    CR = Command Register [HEX]
    FR = Features Register [HEX]
    SC = Sector Count Register [HEX]
    SN = Sector Number Register [HEX]
    CL = Cylinder Low Register [HEX]
    CH = Cylinder High Register [HEX]
    DH = Device/Head Register [HEX]
    DC = Device Command Register [HEX]
    ER = Error register [HEX]
    ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes, SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 7 occurred at disk power-on lifetime: 0 hours (0 days + 0 hours)
  When the command that caused the error occurred, the device was in an unknown state.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 51 00 00 00 00 e0  Error: ICRC, ABRT at LBA = 0x00000000 = 0

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC  Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ---------------  --------------------
  c8 00 01 00 00 00 e0 08  00:04:32.480     READ DMA
  c6 00 10 00 00 00 e0 08  00:04:32.480     SET MULTIPLE MODE
  91 00 3f 00 00 00 af 08  00:04:32.480     INITIALIZE DEVICE PARAMETERS [OBS-6]
  10 00 00 00 00 00 a0 08  00:04:32.464     RECALIBRATE [OBS-4]
  c8 00 01 00 00 00 e0 04  00:04:32.464     READ DMA

Error 6 occurred at disk power-on lifetime: 0 hours (0 days + 0 hours)
  When the command that caused the error occurred, the device was in an unknown state.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 51 00 00 00 00 e0  Error: ICRC, ABRT at LBA = 0x00000000 = 0

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC  Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ---------------  --------------------
  c8 00 01 00 00 00 e0 08  00:04:32.384     READ DMA
  c6 00 10 00 00 00 e0 08  00:04:32.384     SET MULTIPLE MODE
  91 00 3f 00 00 00 af 08  00:04:32.384     INITIALIZE DEVICE PARAMETERS [OBS-6]
  10 00 00 00 00 00 a0 08  00:04:32.368     RECALIBRATE [OBS-4]
  c8 00 01 00 00 00 e0 04  00:04:32.368     READ DMA

Error 5 occurred at disk power-on lifetime: 0 hours (0 days + 0 hours)
  When the command that caused the error occurred, the device was in an unknown state.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 51 00 00 00 00 e0  Error: ICRC, ABRT at LBA = 0x00000000 = 0

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC  Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ---------------  --------------------
  c8 00 01 00 00 00 e0 08  00:04:32.304     READ DMA
  c6 00 10 00 00 00 e0 08  00:04:32.304     SET MULTIPLE MODE
  91 00 3f 00 00 00 af 08  00:04:32.304     INITIALIZE DEVICE PARAMETERS [OBS-6]
  10 00 00 00 00 00 a0 08  00:04:32.272     RECALIBRATE [OBS-4]
  c8 00 01 00 00 00 e0 04  00:04:32.272     READ DMA

Error 4 occurred at disk power-on lifetime: 0 hours (0 days + 0 hours)
  When the command that caused the error occurred, the device was in an unknown state.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 51 00 00 00 00 e0  Error: ICRC, ABRT at LBA = 0x00000000 = 0

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC  Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ---------------  --------------------
  c8 00 01 00 00 00 e0 08  00:04:32.208     READ DMA
  c6 00 10 00 00 00 e0 08  00:04:32.208     SET MULTIPLE MODE
  91 00 3f 00 00 00 af 08  00:04:32.208     INITIALIZE DEVICE PARAMETERS [OBS-6]
  10 00 00 00 00 00 a0 08  00:04:32.176     RECALIBRATE [OBS-4]
  c8 00 01 00 00 00 e0 04  00:04:32.176     READ DMA

Error 3 occurred at disk power-on lifetime: 0 hours (0 days + 0 hours)
  When the command that caused the error occurred, the device was in an unknown state.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 51 00 00 00 00 e0  Error: ICRC, ABRT at LBA = 0x00000000 = 0

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC  Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ---------------  --------------------
  c8 00 01 00 00 00 e0 08  00:04:32.112     READ DMA
  c6 00 10 00 00 00 e0 08  00:04:32.112     SET MULTIPLE MODE
  91 00 3f 00 00 00 af 08  00:04:32.112     INITIALIZE DEVICE PARAMETERS [OBS-6]
  10 00 00 00 00 00 a0 08  00:04:32.096     RECALIBRATE [OBS-4]
  e3 00 00 00 aa 00 a0 04  00:04:32.096     IDLE

SMART Self-test log structure revision number 1
Num Test_Description  Status                   Remaining LifeTime(hours) LBA_of_first_error
# 1 Extended offline  Aborted by host          30%       3160            -
# 2 Extended offline  Completed without error  00%       3146            -
# 3 Extended offline  Completed without error  00%       3129            -
# 4 Extended offline  Completed without error  00%       3120            -
# 5 Extended offline  Completed without error  00%       3090            -
# 6 Extended offline  Completed without error  00%       3071            -
# 7 Extended offline  Completed without error  00%       3036            -
# 8 Extended offline  Completed without error  00%       3023            -
# 9 Extended offline  Completed without error  00%       2986            -
#10 Extended offline  Completed without error  00%       2982            -
#11 Extended offline  Completed without error  00%       2968            -
#12 Extended offline  Completed without error  00%       2953            -
#13 Extended offline  Aborted by host          30%       2939            -
#14 Extended offline  Aborted by host          20%       2931            -
#15 Extended offline  Aborted by host          30%       2915            -
#16 Extended offline  Completed without error  00%       2893            -
#17 Extended offline  Aborted by host          40%       2890            -
#18 Extended offline  Aborted by host          20%       2883            -
#19 Extended offline  Completed without error  00%       2864            -
#20 Extended offline  Completed without error  00%       2853            -
#21 Extended offline  Completed without error  00%       2843            -

SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
   1       0       0 Not_testing
   2       0       0 Not_testing
   3       0       0 Not_testing
   4       0       0 Not_testing
   5       0       0 Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
Joe L. Posted May 11, 2010

Did you get more parity errors the last time you ran the parity check?
dl (Author) Posted May 11, 2010

I am still getting more than 1000 sync errors. The count seems to range from 1000 to over 2000.

Thanks!
dl
Joe L. Posted May 11, 2010

dl wrote:
> I was still getting over 1000+ sync error. The number seemed to between 1000 to over 2000. Thanks! dl

That then leaves:

- One of the hard disks
- The power supply
- The motherboard
- The disk controller card
- Memory

A cabling issue is not very likely unless there are errors in the syslog.

Since you already checked memory, try un-assigning one data disk, press "restore", and perform a parity calc followed by a parity check. If that succeeds, the disk you un-assigned is probably your bad disk. If it still fails, re-assign that disk, un-assign the next data disk, press "restore", and perform a parity calc followed by a parity check. Repeat until you run out of data disks. (It could still be the parity disk that is bad.)

Pressing "restore" will immediately invalidate PARITY and set a new disk configuration based on the currently assigned and working disks. (It does not restore anything.) It is actually a "Delete Disk Configuration and Parity" button.

Joe L.
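The elimination procedure described in this post is a simple loop over the data disks. The sketch below only illustrates the logic: the manual work (un-assign the disk, press "restore", run a parity calc and then a parity check, re-assign the disk) is represented by a caller-supplied function, and every name here, including `find_suspect_disk`, is hypothetical rather than anything unRAID provides.

```python
from typing import Callable, Iterable, Optional


def find_suspect_disk(
    data_disks: Iterable[str],
    parity_check_clean_without: Callable[[str], bool],
) -> Optional[str]:
    """Return the first disk whose removal makes the parity check pass.

    parity_check_clean_without(disk) stands in for the manual steps:
    un-assign `disk`, press "restore", run a parity calc, run a parity
    check, and report True if the check finished with no sync errors.
    """
    for disk in data_disks:
        if parity_check_clean_without(disk):
            return disk
    # No single data disk explains the errors: the parity disk or another
    # component (power supply, motherboard, controller) is still suspect.
    return None
```

A result of `None` corresponds to the "it could still be the parity disk" case, so the remaining items on the list above would be the next things to examine.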
dl (Author) Posted May 13, 2010

It was disk2 that was giving the errors. After removing disk2 from the array, there are no more sync errors.

Here is my next question. Since the parity disk is no longer valid after the testing, how can I replace (upgrade) disk2 from 1TB to 1.5TB without losing any data on it?

Thanks in advance.
dl
Joe L. Posted May 13, 2010

dl wrote:
> It was the disk 2 giving the errors. After remove the disk2 from the array, there is no more sync error. Here is my next question. Since the parity disk is not valid any more after the testing, how can I replace(upgrade) disk2 from 1TB to 1.5TB without loosing any data on it? Thanks in advance. dl

Unfortunately, parity was calculated from exactly what was last read from disk2. If disk2 gave inconsistent results, then whatever was read last is what will be reconstructed from parity in combination with the other disks.

I'd go ahead and perform the upgrade. Then I'd run a reiserfsck check of the disk to make sure the bits that were inconsistent did not trash the file-system. (The odds are in your favor.) Lastly, all you can do is verify checksums against the original sources (if you have them).

Glad you found the bad disk.

Joe L.
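For the checksum verification mentioned at the end of this post, one possible approach is to hash every file under the original source directory and compare it with the same relative path on the rebuilt disk. This is a minimal sketch using only the Python standard library; the choice of SHA-256, the function names, and the directory layout are assumptions for illustration, not something the thread specifies.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def mismatched_files(original_root: Path, restored_root: Path) -> list:
    """List relative paths that are missing or differ under restored_root."""
    mismatches = []
    for original in sorted(original_root.rglob("*")):
        if not original.is_file():
            continue
        relative = original.relative_to(original_root)
        restored = restored_root / relative
        if not restored.is_file() or file_digest(original) != file_digest(restored):
            mismatches.append(relative.as_posix())
    return mismatches
```

An empty list means every file found in the original tree has an identical copy on the rebuilt disk; files that exist only on the rebuilt disk are not examined.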
dl (Author) Posted May 13, 2010

Hi Joe,

Do you have detailed instructions for upgrading the disk? What commands should I use?

Thanks!
dl
Joe L. Posted May 13, 2010

dl wrote:
> Hi Joe, Do you have detailed instructions to upgrade the disk? What command should I use to do? Thanks! dl

It is pretty easy...

1. Stop the array.
2. Power down.
3. Remove disk2 and replace it with the new drive. I can see you already have a 1.5TB parity drive. The replacement for disk2 must be as large as the old disk2 or larger, but not larger than the parity disk. It is OK for it to be the same size as the parity disk.
4. Power up. The array will not start automatically, but it will say something about disk2 being upgraded. The actual upgrade will not occur until you press the "Start" button. (You'll probably need to check the "I'm sure" checkbox under the "Start" button to enable it.)
5. Press the "Start" button. The array will begin re-constructing the old contents of disk2 onto its replacement.

That's it, other than waiting for the reconstruction to finish.

Note: Whatever you do, DO NOT press the button labeled "restore." It is very poorly labeled. It should be labeled "Delete Disk Configuration and Parity". Its description should say that pressing it deletes the existing disk configuration, and that when you next press "Start" a new disk configuration will be stored and a completely new parity calculation will begin based on that configuration. Pressing "restore" immediately invalidates any prior parity calculations, as if you had never performed them. It is NOT what you want to do when replacing a drive.

So again, do not be fooled into using the button labeled "restore", as it has absolutely NOTHING to do with re-building data on a replacement drive. Press "Start" to begin the re-construction process.

Once the re-construction process begins, your array will be on-line and everything accessible, including the contents of the drive being re-constructed. You will not be parity-protected from a second failure until the replacement drive is completely re-constructed.

The re-construction will take a bit longer than a normal parity check, since writes to a drive are typically slower than reads from it.
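The size rule for the replacement drive stated in this post (at least as large as the old disk, no larger than the parity disk) reduces to a single comparison. The sketch below is illustrative only: the function name is hypothetical, and the byte counts are nominal marketing sizes assumed for the example, not values read from the drives in this thread.

```python
def replacement_size_ok(old_disk_bytes: int, new_disk_bytes: int,
                        parity_disk_bytes: int) -> bool:
    """True if the new disk is at least as large as the one it replaces
    and no larger than the parity disk."""
    return old_disk_bytes <= new_disk_bytes <= parity_disk_bytes


# Nominal sizes, assumed for illustration only.
ONE_TB = 1_000_000_000_000
ONE_AND_A_HALF_TB = 1_500_000_000_000

# Replacing a 1TB data disk with a 1.5TB disk under a 1.5TB parity disk.
upgrade_allowed = replacement_size_ok(ONE_TB, ONE_AND_A_HALF_TB, ONE_AND_A_HALF_TB)
```

In practice the exact capacities reported by the drives matter more than the nominal figures, since two drives sold at the same size can differ by a few sectors.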
dl (Author) Posted May 18, 2010

Hi all,

I replaced the broken disk (1TB) with a bigger one (1.5TB) and rebuilt the array. Everything seems fine in the web management console: there are no sync errors and all drives show green status. However, when I tried to copy some files to the server, I got errors. The log says "attempt to access beyond end of device". Please see the attached log for more info.

Thank you in advance!
dl

Attachment: log.txt
Joe L. Posted May 18, 2010

dl wrote:
> Hi all, I had replaced the broken ones (1TB) with a bigger one (1.5TB), and rebuilt the array. Everything seems to be fine on the web management console. No sync error. All drive shows the green status. However when I tried to copy some files to the server, I got errors. In the log, it says "attempt to access beyond end of device" in the log. Please see the attached log for more info. Thank you in advance! dl

It appears as if you might want to perform a file-system check on disk3. Instructions are in the wiki here:

http://lime-technology.com/wiki/index.php?title=Check_Disk_Filesystems

Joe L.
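Before and after running the file-system check, it can help to see how often the message quoted above appears in a saved log. This is a small sketch that searches log text for that exact phrase; the function name is hypothetical, and nothing is assumed about the rest of the log format because the attached log is not reproduced in the thread.

```python
MESSAGE = "attempt to access beyond end of device"


def lines_with_message(log_text: str, message: str = MESSAGE) -> list:
    """Return (line number, line) pairs for log lines containing the message."""
    return [
        (number, line.rstrip())
        for number, line in enumerate(log_text.splitlines(), start=1)
        if message in line
    ]
```

The text of a saved copy of the log (for example the contents of log.txt) can be passed in as `log_text`; an empty result after the file-system repair would suggest the out-of-range accesses have stopped.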