Always sync errors


dl


Hi all,

 

I keep getting sync errors (over 1,000) every time I start a parity check. I have tried 4.4.x and 4.5.x with the same result. The machine has passed memtest.

 

I am just wondering if it is most likely a hardware issue. If so, what is the best way to isolate the issue (between hard disk, controller, cable, etc)?

 

Regards,

 

dl


Here are the syslog and smartctl output. I replaced the real file names with /xxxx/xxx/xxx.

 

Thanks in advance!

 

dl

 

The Maxtor SMART report shows:

  5 Reallocated_Sector_Ct  0x0033  248  248  063    Pre-fail  Always      -      57

 

Run another parity check and see if the re-allocated sector count increases.  If it does, the odds are it is the disk causing your errors.
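
To make that before/after comparison less error-prone, the attribute tables from two saved smartctl reports can be diffed automatically. The following is only a minimal Python sketch: the function names are hypothetical, and the parser assumes the ten-column attribute rows in the layout shown in this thread.

```python
import re

# Matches attribute rows such as:
#   5 Reallocated_Sector_Ct  0x0033  248  248  063    Pre-fail  Always      -      57
# Columns: ID, name, flag, value, worst, thresh, type, updated, when_failed, raw.
ROW = re.compile(
    r"^\s*(\d+)\s+(\S+)\s+0x[0-9a-fA-F]+\s+\d+\s+\d+\s+\d+\s+\S+\s+\S+\s+\S+\s+(\S+)\s*$"
)

def parse_attributes(report):
    """Map attribute ID -> (name, raw value as text) for each attribute row."""
    attrs = {}
    for line in report.splitlines():
        match = ROW.match(line)
        if match:
            attrs[int(match.group(1))] = (match.group(2), match.group(3))
    return attrs

def changed_attributes(before, after):
    """Return (id, name, old_raw, new_raw) for attributes whose raw value differs."""
    old, new = parse_attributes(before), parse_attributes(after)
    return [
        (attr_id, name, old[attr_id][1], raw)
        for attr_id, (name, raw) in sorted(new.items())
        if attr_id in old and old[attr_id][1] != raw
    ]
```

Passing the text of the report taken before the parity check and the one taken after it to `changed_attributes` lists only the counters that moved, so an increase in attribute 5 stands out immediately.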

 

 


It seems that the only counter that changed was the following:

 

old value:

195 Hardware_ECC_Recovered  0x000a  253  252  000    Old_age  Always      -      39273

new value:

195 Hardware_ECC_Recovered  0x000a  253  252  000    Old_age  Always      -      39736

 

regards,

 

dl

 

smartctl version 5.38 [i486-slackware-linux-gnu] Copyright © 2002-8 Bruce Allen

Home page is http://smartmontools.sourceforge.net/

 

=== START OF INFORMATION SECTION ===

Model Family:    Maxtor DiamondMax 16 family

Device Model:    Maxtor 4R120L0

Serial Number:    R42GVE3E

Firmware Version: RAMB1UU0

User Capacity:    122,942,324,736 bytes

Device is:        In smartctl database [for details use: -P show]

ATA Version is:  7

ATA Standard is:  ATA/ATAPI-7 T13 1532D revision 0

Local Time is:    Mon May 10 14:01:59 2010 GMT+8

SMART support is: Available - device has SMART capability.

SMART support is: Enabled

 

=== START OF READ SMART DATA SECTION ===

SMART overall-health self-assessment test result: PASSED

 

General SMART Values:

Offline data collection status:  (0x82) Offline data collection activity was completed without error. Auto Offline Data Collection: Enabled.

Self-test execution status:      (  19) The self-test routine was aborted by the host.

Total time to complete Offline data collection: ( 182) seconds.

Offline data collection capabilities: (0x5b) SMART execute Offline immediate. Auto Offline data collection on/off support. Suspend Offline collection upon new command. Offline surface scan supported. Self-test supported. No Conveyance Self-test supported. Selective Self-test supported.

SMART capabilities:            (0x0003) Saves SMART data before entering power-saving mode. Supports SMART auto save timer.

Error logging capability:        (0x01) Error logging supported. No General Purpose Logging support.

Short self-test routine recommended polling time: (  2) minutes.

Extended self-test routine recommended polling time: (  74) minutes.

 

SMART Attributes Data Structure revision number: 16

Vendor Specific SMART Attributes with Thresholds:

ID# ATTRIBUTE_NAME          FLAG    VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE

  3 Spin_Up_Time            0x0027  220  220  063    Pre-fail  Always      -      8671

  4 Start_Stop_Count        0x0032  253  253  000    Old_age  Always      -      1214

  5 Reallocated_Sector_Ct  0x0033  248  248  063    Pre-fail  Always      -      57

  6 Read_Channel_Margin    0x0001  253  253  100    Pre-fail  Offline      -      0

  7 Seek_Error_Rate        0x000a  253  252  000    Old_age  Always      -      0

  8 Seek_Time_Performance  0x0027  252  238  187    Pre-fail  Always      -      43654

  9 Power_On_Minutes        0x0032  240  240  000    Old_age  Always      -      448h+41m

10 Spin_Retry_Count        0x002b  253  252  157    Pre-fail  Always      -      0

11 Calibration_Retry_Count 0x002b  253  252  223    Pre-fail  Always      -      0

12 Power_Cycle_Count      0x0032  252  252  000    Old_age  Always      -      457

192 Power-Off_Retract_Count 0x0032  253  253  000    Old_age  Always      -      0

193 Load_Cycle_Count        0x0032  253  253  000    Old_age  Always      -      0

194 Temperature_Celsius    0x0032  253  253  000    Old_age  Always      -      30

195 Hardware_ECC_Recovered  0x000a  253  252  000    Old_age  Always      -      39736

196 Reallocated_Event_Count 0x0008  252  252  000    Old_age  Offline      -      1

197 Current_Pending_Sector  0x0008  253  253  000    Old_age  Offline      -      0

198 Offline_Uncorrectable  0x0008  253  253  000    Old_age  Offline      -      0

199 UDMA_CRC_Error_Count    0x0008  199  199  000    Old_age  Offline      -      0

200 Multi_Zone_Error_Rate  0x000a  253  252  000    Old_age  Always      -      0

201 Soft_Read_Error_Rate    0x000a  253  252  000    Old_age  Always      -      0

202 TA_Increase_Count      0x000a  253  252  000    Old_age  Always      -      0

203 Run_Out_Cancel          0x000b  253  252  180    Pre-fail  Always      -      0

204 Shock_Count_Write_Opern 0x000a  253  252  000    Old_age  Always      -      0

205 Shock_Rate_Write_Opern  0x000a  253  252  000    Old_age  Always      -      0

207 Spin_High_Current      0x002a  253  252  000    Old_age  Always      -      0

208 Spin_Buzz              0x002a  253  252  000    Old_age  Always      -      0

209 Offline_Seek_Performnce 0x0024  139  139  000    Old_age  Offline      -      0

99 Unknown_Attribute      0x0004  253  253  000    Old_age  Offline      -      0

100 Unknown_Attribute      0x0004  253  253  000    Old_age  Offline      -      0

101 Unknown_Attribute      0x0004  253  253  000    Old_age  Offline      -      0

 

SMART Error Log Version: 1

Warning: ATA error count 7 inconsistent with error log pointer 5

 

ATA Error Count: 7 (device log contains only the most recent five errors)

CR = Command Register [HEX]

FR = Features Register [HEX]

SC = Sector Count Register [HEX]

SN = Sector Number Register [HEX]

CL = Cylinder Low Register [HEX]

CH = Cylinder High Register [HEX]

DH = Device/Head Register [HEX]

DC = Device Command Register [HEX]

ER = Error register [HEX]

ST = Status register [HEX]

Powered_Up_Time is measured from power on, and printed as

DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,

SS=sec, and sss=millisec. It "wraps" after 49.710 days.

 

Error 7 occurred at disk power-on lifetime: 0 hours (0 days + 0 hours)

  When the command that caused the error occurred, the device was in an unknown state.

 

  After command completion occurred, registers were:

  ER ST SC SN CL CH DH

  -- -- -- -- -- -- --

  84 51 00 00 00 00 e0  Error: ICRC, ABRT at LBA = 0x00000000 = 0

 

  Commands leading to the command that caused the error were:

  CR FR SC SN CL CH DH DC  Powered_Up_Time  Command/Feature_Name

  -- -- -- -- -- -- -- --  ----------------  --------------------

  c8 00 01 00 00 00 e0 08      00:04:32.480  READ DMA

  c6 00 10 00 00 00 e0 08      00:04:32.480  SET MULTIPLE MODE

  91 00 3f 00 00 00 af 08      00:04:32.480  INITIALIZE DEVICE PARAMETERS [OBS-6]

  10 00 00 00 00 00 a0 08      00:04:32.464  RECALIBRATE [OBS-4]

  c8 00 01 00 00 00 e0 04      00:04:32.464  READ DMA

 

Error 6 occurred at disk power-on lifetime: 0 hours (0 days + 0 hours)

  When the command that caused the error occurred, the device was in an unknown state.

 

  After command completion occurred, registers were:

  ER ST SC SN CL CH DH

  -- -- -- -- -- -- --

  84 51 00 00 00 00 e0  Error: ICRC, ABRT at LBA = 0x00000000 = 0

 

  Commands leading to the command that caused the error were:

  CR FR SC SN CL CH DH DC  Powered_Up_Time  Command/Feature_Name

  -- -- -- -- -- -- -- --  ----------------  --------------------

  c8 00 01 00 00 00 e0 08      00:04:32.384  READ DMA

  c6 00 10 00 00 00 e0 08      00:04:32.384  SET MULTIPLE MODE

  91 00 3f 00 00 00 af 08      00:04:32.384  INITIALIZE DEVICE PARAMETERS [OBS-6]

  10 00 00 00 00 00 a0 08      00:04:32.368  RECALIBRATE [OBS-4]

  c8 00 01 00 00 00 e0 04      00:04:32.368  READ DMA

 

Error 5 occurred at disk power-on lifetime: 0 hours (0 days + 0 hours)

  When the command that caused the error occurred, the device was in an unknown state.

 

  After command completion occurred, registers were:

  ER ST SC SN CL CH DH

  -- -- -- -- -- -- --

  84 51 00 00 00 00 e0  Error: ICRC, ABRT at LBA = 0x00000000 = 0

 

  Commands leading to the command that caused the error were:

  CR FR SC SN CL CH DH DC  Powered_Up_Time  Command/Feature_Name

  -- -- -- -- -- -- -- --  ----------------  --------------------

  c8 00 01 00 00 00 e0 08      00:04:32.304  READ DMA

  c6 00 10 00 00 00 e0 08      00:04:32.304  SET MULTIPLE MODE

  91 00 3f 00 00 00 af 08      00:04:32.304  INITIALIZE DEVICE PARAMETERS [OBS-6]

  10 00 00 00 00 00 a0 08      00:04:32.272  RECALIBRATE [OBS-4]

  c8 00 01 00 00 00 e0 04      00:04:32.272  READ DMA

 

Error 4 occurred at disk power-on lifetime: 0 hours (0 days + 0 hours)

  When the command that caused the error occurred, the device was in an unknown state.

 

  After command completion occurred, registers were:

  ER ST SC SN CL CH DH

  -- -- -- -- -- -- --

  84 51 00 00 00 00 e0  Error: ICRC, ABRT at LBA = 0x00000000 = 0

 

  Commands leading to the command that caused the error were:

  CR FR SC SN CL CH DH DC  Powered_Up_Time  Command/Feature_Name

  -- -- -- -- -- -- -- --  ----------------  --------------------

  c8 00 01 00 00 00 e0 08      00:04:32.208  READ DMA

  c6 00 10 00 00 00 e0 08      00:04:32.208  SET MULTIPLE MODE

  91 00 3f 00 00 00 af 08      00:04:32.208  INITIALIZE DEVICE PARAMETERS [OBS-6]

  10 00 00 00 00 00 a0 08      00:04:32.176  RECALIBRATE [OBS-4]

  c8 00 01 00 00 00 e0 04      00:04:32.176  READ DMA

 

Error 3 occurred at disk power-on lifetime: 0 hours (0 days + 0 hours)

  When the command that caused the error occurred, the device was in an unknown state.

 

  After command completion occurred, registers were:

  ER ST SC SN CL CH DH

  -- -- -- -- -- -- --

  84 51 00 00 00 00 e0  Error: ICRC, ABRT at LBA = 0x00000000 = 0

 

  Commands leading to the command that caused the error were:

  CR FR SC SN CL CH DH DC  Powered_Up_Time  Command/Feature_Name

  -- -- -- -- -- -- -- --  ----------------  --------------------

  c8 00 01 00 00 00 e0 08      00:04:32.112  READ DMA

  c6 00 10 00 00 00 e0 08      00:04:32.112  SET MULTIPLE MODE

  91 00 3f 00 00 00 af 08      00:04:32.112  INITIALIZE DEVICE PARAMETERS [OBS-6]

  10 00 00 00 00 00 a0 08      00:04:32.096  RECALIBRATE [OBS-4]

  e3 00 00 00 aa 00 a0 04      00:04:32.096  IDLE

 

SMART Self-test log structure revision number 1

Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error

# 1  Extended offline    Aborted by host              30%      3160        -

# 2  Extended offline    Completed without error      00%      3146        -

# 3  Extended offline    Completed without error      00%      3129        -

# 4  Extended offline    Completed without error      00%      3120        -

# 5  Extended offline    Completed without error      00%      3090        -

# 6  Extended offline    Completed without error      00%      3071        -

# 7  Extended offline    Completed without error      00%      3036        -

# 8  Extended offline    Completed without error      00%      3023        -

# 9  Extended offline    Completed without error      00%      2986        -

#10  Extended offline    Completed without error      00%      2982        -

#11  Extended offline    Completed without error      00%      2968        -

#12  Extended offline    Completed without error      00%      2953        -

#13  Extended offline    Aborted by host              30%      2939        -

#14  Extended offline    Aborted by host              20%      2931        -

#15  Extended offline    Aborted by host              30%      2915        -

#16  Extended offline    Completed without error      00%      2893        -

#17  Extended offline    Aborted by host              40%      2890        -

#18  Extended offline    Aborted by host              20%      2883        -

#19  Extended offline    Completed without error      00%      2864        -

#20  Extended offline    Completed without error      00%      2853        -

#21  Extended offline    Completed without error      00%      2843        -

 

SMART Selective self-test log data structure revision number 1

SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS

    1        0        0  Not_testing

    2        0        0  Not_testing

    3        0        0  Not_testing

    4        0        0  Not_testing

    5        0        0  Not_testing

Selective self-test flags (0x0):

  After scanning selected spans, do NOT read-scan remainder of disk.

If Selective self-test is pending on power-up, resume after 0 minute delay.

 


I am still getting over 1,000 sync errors. The number seems to vary between 1,000 and over 2,000. Thanks!

 

dl

That then leaves:

  • One of the hard disks
  • The power supply
  • The motherboard
  • The disk controller card
  • Memory

 

It is not very likely to be a cabling issue unless there are errors in the syslog.

 

Since you already checked memory, try un-assigning one data disk, press "restore", and perform a parity calc followed by a parity check.

 

If successful, the disk you un-assigned is probably your bad disk.  If it is still failing, re-assign that disk, un-assign the next data disk, press "restore", and perform a parity calc followed by a parity check.

Repeat until you run out of data disks.  (It could still be the parity disk that is bad.)

 

Pressing "restore" will immediately invalidate PARITY and set a new disk configuration based on the currently assigned and working disks.  (It does not restore anything)  It is actually a "Delete Disk Configuration and Parity" button.
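
The elimination procedure above amounts to a simple loop. Here is a Python sketch of the logic only: `parity_check_passes` is a hypothetical stand-in for the manual steps (un-assign a disk, press "restore", run a parity calc and then a parity check), not an unRAID API.

```python
from typing import Callable, Optional, Sequence

def find_bad_disk(
    data_disks: Sequence[str],
    parity_check_passes: Callable[[Sequence[str]], bool],
) -> Optional[str]:
    """Leave out one data disk at a time and report the one whose absence gives a clean check.

    Returns None when no single exclusion helps; the parity disk, controller,
    power supply, and motherboard then remain suspects.
    """
    for excluded in data_disks:
        # Every data disk except the one currently under suspicion.
        remaining = [disk for disk in data_disks if disk != excluded]
        if parity_check_passes(remaining):
            return excluded
    return None
```

Each call to the stand-in corresponds to one full parity calc plus check, so in practice the loop takes one long pass per data disk.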

 

Joe L.


It was disk2 giving the errors. After removing disk2 from the array, there are no more sync errors.

 

Here is my next question. Since the parity disk is no longer valid after the testing, how can I replace (upgrade) disk2 from 1TB to 1.5TB without losing any data on it?

 

Thanks in advance. dl



Unfortunately, parity was calculated from exactly what was last read from disk2.  If disk2 gave inconsistent results, then whatever was read last is what will be reconstructed from parity in combination with the other disks.
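
To see why, recall that single-disk parity of this kind is a bytewise XOR across the data disks, so a rebuilt disk is the XOR of parity with every other disk. A toy Python sketch, assuming that XOR scheme and using made-up byte values (this is not unRAID code):

```python
from functools import reduce

def xor_blocks(blocks):
    """Bytewise XOR of equal-length byte strings."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# Toy contents for three data disks (illustrative values only).
disk1 = bytes([0x0F, 0xA0])
disk2_as_read = bytes([0x33, 0x55])  # what disk2 returned when parity was computed
disk3 = bytes([0xC1, 0x02])

# Parity is computed from whatever each disk returned at the time.
parity = xor_blocks([disk1, disk2_as_read, disk3])

# Rebuilding disk2 combines parity with the other data disks.
rebuilt_disk2 = xor_blocks([parity, disk1, disk3])
```

Here `rebuilt_disk2` equals `disk2_as_read`, including any bits the failing disk happened to return incorrectly when parity was last calculated.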

 

I'd go ahead and perform the upgrade.  Then I'd run a reiserfsck check of the disk to make sure the bits that were inconsistent did not trash the file system (odds are in your favor).  Lastly, all you can do is verify checksums against the original sources (if you have them).
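
For the checksum step, one possible approach is to hash every file under the original source tree and compare it with the same relative path on the rebuilt disk. A minimal Python sketch; the function names and the choice of SHA-256 are assumptions for illustration, and both directory roots are whatever paths apply on your system.

```python
import hashlib
from pathlib import Path

def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def mismatched_files(source_root, rebuilt_root):
    """Return relative paths present under both roots whose contents differ."""
    source_root, rebuilt_root = Path(source_root), Path(rebuilt_root)
    mismatches = []
    for source_file in sorted(source_root.rglob("*")):
        if not source_file.is_file():
            continue
        relative = source_file.relative_to(source_root)
        candidate = rebuilt_root / relative
        # Files missing from the rebuilt disk are skipped, not reported.
        if candidate.is_file() and sha256_of(source_file) != sha256_of(candidate):
            mismatches.append(str(relative))
    return mismatches
```

An empty result means every file that exists in both places matched; files that exist only on one side are not covered by this sketch.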

 

Glad you found the bad disk.

 

Joe L.


Hi Joe,

 

Do you have detailed instructions for upgrading the disk? What commands should I use?

 

Thanks!

 

dl

It is pretty easy...

 

  • Stop the array
  • Power down
  • Remove disk2 and replace it with the new disk.  I can see you have a 1.5TB parity drive already.  The replacement for disk2 must be as large as disk2, or larger, but not larger than the parity disk.  It is OK for it to be the same size as the parity disk.
  • Power up.  The array will not automatically start, but it will say something about disk2 being upgraded.  The actual upgrade will not occur until you press the "Start" button.  (You'll probably need to click the "I'm sure" checkbox under the "Start" button to enable it.)
  • Press the "Start" button.  The array will begin the process of re-constructing the old contents of disk2 onto its replacement.

 

That's it, other than waiting for the reconstruction to finish.

 

Note: Whatever you do, DO NOT press the button labeled "restore."  It is very poorly labeled.

It should be labeled "Delete Disk Configuration and Parity."  Its description should say that pressing it deletes the existing disk configuration, and that when you next press "Start" a new disk configuration will be stored and a completely new parity calculation will begin based on the new disk configuration.  Pressing "restore" immediately invalidates any prior parity calculations, as if you had never performed them.  It is NOT what you want to do when replacing a drive.  So again, do not be fooled into using the button labeled "restore," as it has absolutely NOTHING to do with re-building data on a replacement drive.  Press "Start" to begin the re-construction process.

 

Once the re-construction process begins your array will be on-line and everything accessible, including the contents of the drive being re-constructed.  You will not be parity protected from a second failure until the replacement drive is completely re-constructed.  The re-construction will take a bit longer than a normal parity check, since writes to a drive are typically slower than reads from it.

 

 


Hi all,

 

I replaced the broken disk (1TB) with a bigger one (1.5TB) and rebuilt the array. Everything seems fine in the web management console: no sync errors, and all drives show green status. However, when I tried to copy some files to the server, I got errors. The log says "attempt to access beyond end of device". Please see the attached log for more info.

 

Thank you in advance!

 

dl

log.txt



It appears as if you might want to perform a file-system check on disk3.

 

Instructions in the wiki here: http://lime-technology.com/wiki/index.php?title=Check_Disk_Filesystems

 

Joe L.
