New pre-cleared disk needed clearing & lots of sync errors during parity check.

T1000 · October 22, 2011

Hi guys,

I'm running 4.7.

I've had 8 x 2TB data drives and one 2TB parity drive for a while now. I also had one other 2TB drive, pre-cleared 3 times and ready for use.

Our next child was due this week so I thought I would do the last bit of work on the server and fit the 5in3 bay caddy I had so it was all finished. I also added the precleared drive to the array because I was running out of space.

Strange thing was it didn't see the new disk (sde) as cleared. My only option was to clear it via the GUI so I did.

After it was added to the array I did a parity check and got 488371093 sync errors!!

I ran it again and got 16 errors. Ran it again and got 4. Did a powerdown, checked all cables, started up and ran it again and got 20 errors.

Syslog is attached.

I'm not sure if the new drive and the sync issues are related but I would imagine they are.

The baby was born yesterday and will take up most of my time but I could also do with getting to the bottom of these sync errors. If anyone could help me work out what's wrong or what to do next, that would be great.

Thanks

I'm not sure if running it a few times was the wrong thing to do or not.

syslog-20111021-081606.txt.zip

dgaschk · October 23, 2011

The log shows that disk9 was not successfully pre-cleared before it was added. Please post the pre-clear reports for the drive. They should be found in the /boot directory (or on the flash share).

First reestablish correct parity. Remove the new drive and enter "initconfig". Then let parity rebuild and do a parity check. Meanwhile run pre-clear on the new disk and post the results here.
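For a sense of scale, the first error count can be set against the number of blocks on a 2TB disk. This is only a rough sketch: the thread does not say what unit the parity check counts errors in, so the 4 KiB block size below is an assumption, and the capacity is the figure smartctl reports for the 2TB replacement drive later in the thread (the original disk is assumed to be the same size).

```python
# Capacity as reported by smartctl for the 2TB replacement drive later in this thread.
DISK_BYTES = 2_000_398_934_016
# ASSUMPTION: one counted sync error per 4 KiB block. The thread does not confirm this.
BLOCK_BYTES = 4096
# Sync errors reported by the first parity check.
FIRST_CHECK_ERRORS = 488_371_093

blocks_on_disk = DISK_BYTES // BLOCK_BYTES
fraction = FIRST_CHECK_ERRORS / blocks_on_disk

print(f"4 KiB blocks on the disk: {blocks_on_disk:,}")
print(f"share of blocks with a sync error: {fraction:.4%}")
```

Under that assumption almost every block on the new disk disagreed with parity, which points at the disk as a whole rather than at a handful of bad sectors.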

T1000 · October 23, 2011

Thanks dgaschk,

I have disabled the new drive and it's building parity now.

Which of these reports should I post?

dgaschk · October 23, 2011

When did you run the pre-clear on that disk? If you can't find it just post the report after the next pre-clear.

T1000 · October 23, 2011

I think it was the end of September. I will just post the new pre-clear report. Thanks

T1000 · October 23, 2011

So far the parity check is at 50% with no errors.

The pre-clear on the 9th disk didn't finish correctly so I'm gonna return it. It's from Amazon so it should be really easy to exchange.

I wonder why the server pre-cleared it via the GUI and let it be used even though it wasn't pre-cleared properly?

Joe L. · October 24, 2011

I wonder why the server pre-cleared it via the GUI and let it be used even though it wasn't pre-cleared properly?

Since there was not a valid pre-clear signature, unRAID cleared it itself.

However, unRAID's own clear does not read the entire surface or attempt to identify unreadable sectors, so perform at least a parity check after the initial parity calculation.
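A full-surface read of the kind described here can be sketched as a loop that reads every chunk and notes anything that is not zero. This is a minimal illustration, not the way unRAID or the preclear script works internally; the function name `scan_for_nonzero` is made up for the example, and it is run against an in-memory image rather than a real device.

```python
import io

def scan_for_nonzero(stream, chunk_size=1024 * 1024):
    """Read a binary stream from start to end.

    Returns (total_bytes_read, offsets_of_chunks_containing_nonzero_bytes).
    `stream` can be any binary file-like object; opening a real device
    such as /dev/sdX (a placeholder name) is left to the reader.
    """
    nonzero_offsets = []
    offset = 0
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        # A fully cleared chunk consists of zero bytes only.
        if chunk.count(0) != len(chunk):
            nonzero_offsets.append(offset)
        offset += len(chunk)
    return offset, nonzero_offsets

# Two 4 KiB chunks; the second one starts with a stray 0x0002 word.
image = io.BytesIO(bytes(4096) + b"\x00\x02" + bytes(4094))
total, bad = scan_for_nonzero(image, chunk_size=4096)
print(total, bad)
```

A read error on a real disk would surface as an exception from `read`, which is the other thing a full-surface pass is meant to catch.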

T1000 · October 25, 2011

The parity check after the parity calculation went fine.

Joe L. · October 25, 2011

The parity check after the parity calculation went fine.

Now, check for sectors pending re-allocation when next written.

T1000 · October 25, 2011

The parity check after the parity calculation went fine.

Now, check for sectors pending re-allocation when next written.

How do I do that?

dgaschk · October 25, 2011

Enter "smartctl -a /dev/sdX" where X is the correct drive letter. Post the result here.
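If the full report is too much to scan by eye, the handful of attributes that relate to pending or reallocated sectors can be pulled out of the text. The sketch below assumes the attribute table layout shown by smartctl 5.39.1; the sample lines are copied from the report T1000 posts later in this thread, and `parse_smart_attributes` is just an illustrative helper name.

```python
# Attributes that relate to sectors pending re-allocation or already re-allocated.
WATCHED = ("Reallocated_Sector_Ct", "Current_Pending_Sector", "Offline_Uncorrectable")

def parse_smart_attributes(report):
    """Return {attribute_name: raw_value} from the attribute table in `smartctl -a` output."""
    attributes = {}
    in_table = False
    for line in report.splitlines():
        fields = line.split()
        if fields[:2] == ["ID#", "ATTRIBUTE_NAME"]:
            in_table = True
            continue
        if not in_table:
            continue
        # The table ends at the first line that does not look like an attribute row.
        if len(fields) < 10 or not fields[0].isdigit():
            break
        raw = fields[9]
        attributes[fields[1]] = int(raw) if raw.isdigit() else raw
    return attributes

SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
"""

found = parse_smart_attributes(SAMPLE)
for name in WATCHED:
    print(name, found.get(name, "not reported"))
```

In practice the text would come from saving the output of the smartctl command above to a file and reading it in.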

lionelhutz · October 25, 2011

The parity check after the parity calculation went fine.

Just to clarify, I believe this is without this new disk that was giving problems so you should have no issues now.

It's a good idea to double check the SMART reports on all the disks anyways.

Peter

T1000 · October 28, 2011

Added a new replacement drive where the last one was (on the last port of the first SAS-SATA cable on my Supermicro AOC-SASLP-MV8 8-port SAS/SATA card). Tried a preclear and got an image similar to this one (this isn't the actual image):

Which I believe to be a preclear fail.

I powered down, connected the drive to another SATA card I had in the enclosure, and the preclear ran fine with no errors.

Then powered down again, connected the drive to the port it failed on last time and did the 2nd preclear, no problems, no errors, finished fine.

Tried the 3rd and last preclear and it looked like the image above.

Why would it fail then work, then fail etc?

Should I be asking in the preclear thread or is it more likely to be a hardware issue?

Here is the SMART report for the disk:

root@Tower:~# smartctl -a /dev/sde
smartctl 5.39.1 2010-01-28 r3054 [i486-slackware-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Device Model:     ST2000DL003-9VT166
Serial Number:    5YD6C2PB
Firmware Version: CC3C
User Capacity:    2,000,398,934,016 bytes
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   8
ATA Standard is:  ATA-8-ACS revision 4
Local Time is:    Fri Oct 28 09:45:10 2011 UTC
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                       was completed without error.
                                       Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                       without error or no self-test has ever 
                                       been run.
Total time to complete Offline 
data collection:                 ( 623) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                       Auto Offline data collection on/off support.
                                       Suspend Offline collection upon new
                                       command.
                                       Offline surface scan supported.
                                       Self-test supported.
                                       Conveyance Self-test supported.
                                       Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                       power-saving mode.
                                       Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                       General Purpose Logging supported.
Short self-test routine 
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        ( 255) minutes.
Conveyance self-test routine
recommended polling time:        (   2) minutes.
SCT capabilities:              (0x30b7) SCT Status supported.
                                       SCT Feature Control supported.
                                       SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
 1 Raw_Read_Error_Rate     0x000f   112   100   006    Pre-fail  Always       -       43795672
 3 Spin_Up_Time            0x0003   092   092   000    Pre-fail  Always       -       0
 4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       10
 5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
 7 Seek_Error_Rate         0x000f   060   060   030    Pre-fail  Always       -       1197262
 9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       98
10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       10
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   099   000    Old_age   Always       -       1
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   073   069   045    Old_age   Always       -       27 (Lifetime Min/Max 23/27)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       6
193 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       10
194 Temperature_Celsius     0x0022   027   040   000    Old_age   Always       -       27 (0 20 0 0)
195 Hardware_ECC_Recovered  0x001a   037   019   000    Old_age   Always       -       43795672
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       63801739182178
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       2313682740
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       4159336131

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]


SMART Selective self-test log data structure revision number 1
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
   1        0        0  Not_testing
   2        0        0  Not_testing
   3        0        0  Not_testing
   4        0        0  Not_testing
   5        0        0  Not_testing
Selective self-test flags (0x0):
 After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

root@Tower:~#

Joe L. · October 28, 2011

It is highly likely to be a hardware issue.

The "0002" values on the lines starting at addresses 120, 320, and 500 are not expected. Since they all have the same single bit set, this points to a flaky bit in either the disk electronics (possibly the internal RAM used for its onboard cache) or the disk controller electronics. It could also be a noisy or marginal power supply that the electronics can't deal with, in which case the errant bits are the symptom rather than the actual cause.
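The "same single bit" observation can be checked mechanically: collect every non-zero value read back from an area that should be all zeros and see whether they share one bit pattern. A minimal sketch follows; `stuck_bit_report` is a made-up helper, and the positions of the bad words in the sample are arbitrary, not the addresses from the screenshot.

```python
def stuck_bit_report(words):
    """Summarise the non-zero values found where only zeros were expected."""
    distinct = sorted({w for w in words if w != 0})
    return {
        "distinct_values": distinct,
        # A value with exactly one bit set satisfies w & (w - 1) == 0.
        "all_single_bit": bool(distinct) and all(w & (w - 1) == 0 for w in distinct),
        "same_value_everywhere": len(distinct) == 1,
        # Index of the highest set bit, counting from 0 at the least significant bit.
        "bit_positions": sorted({w.bit_length() - 1 for w in distinct}),
    }

# Sixteen 16-bit words that should all be zero, three of which came back as 0x0002.
readback = [0x0000] * 16
for index in (3, 7, 12):  # arbitrary positions chosen for the example
    readback[index] = 0x0002

report = stuck_bit_report(readback)
print(report)
```

One repeated single-bit value, as here, fits a bit being flipped somewhere in the data path; a spread of unrelated values would suggest something else.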

These types of disks can cause hair loss if not detected early, as they will often present different data each time you read them. (You'll pull your hair out trying to figure out why parity is showing errors when checked.)

Good luck in isolating the hardware involved.

Joe L.

T1000 · October 28, 2011

The image I posted was from a while ago on a different disk and I posted it because it looked very similar.

I will do more preclears until it fails again and take a picture.

Joe L. · October 28, 2011

The image I posted was from a while ago on a different disk and I posted it because it looked very similar.

I will do more preclears until it fails again and take a picture.

If multiple disks present the same symptoms, I would suspect the disk controller, or even system RAM. Perform a memory test on the server for several cycles, preferably overnight, to rule it out.

T1000 · October 30, 2011

I ran a ramtest the other night:

Looks OK to me. Not sure how long it's supposed to run for.

When you say disk controller does that mean the Supermicro 8 port card?

Could it be the SAS-SATA cable?

Either way it's gonna take days to figure out the cause particularly with it not failing every preclear.

The preclear is just about to finish with no errors by the looks of it so I will have to start the preclear again to try and get it to fail.

If it is the card, would the turnaround on testing for failure be quicker on a smaller drive? It would also save me from running 20-30 preclears on my new drive.

T1000 · October 31, 2011

The preclear failed on its 2nd pass (it's always on the second preclear) and here is the image, which is exactly the same as the one I posted the other day:

Should I try to get the Supermicro card exchanged? RAM looks fine (as far as I can tell). Same error on different disks. That preclear fail screen is the same now as it was 2 months ago. Is there another test I can do to rule it out?

T1000 · November 2, 2011

Swapped the 2TB Seagate drive that kept failing on its 2nd preclear to the other card, and it has precleared twice and is on its 3rd preclear now.

I have attached the spare 1.5TB Samsung drive I had to the Supermicro card, and it has also precleared twice and is on its 3rd preclear.

So the 2TB Seagate only fails on its 2nd preclear via the Supermicro card. Is it because it's 2TB and not 1.5TB? After the 3rd preclear I'm gonna move it back to the Supermicro card and do some preclears. I just wish it didn't take so long!

If anyone has any ideas on how I can find the cause of the problem, that would be great, thanks.
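One way to keep track of a slow isolation exercise like this is to tally each preclear by disk and controller. The list below is one reading of the results reported in this thread for the replacement Seagate and the spare Samsung (the pass on October 30 was only described as finishing with no errors "by the looks of it"), and "other card" stands for the unnamed second SATA card.

```python
from collections import defaultdict

# (disk, controller, outcome) as read from the posts above.
runs = [
    ("Seagate 2TB", "Supermicro", "fail"),
    ("Seagate 2TB", "other card", "pass"),
    ("Seagate 2TB", "Supermicro", "pass"),
    ("Seagate 2TB", "Supermicro", "fail"),
    ("Seagate 2TB", "Supermicro", "pass"),
    ("Seagate 2TB", "Supermicro", "fail"),
    ("Seagate 2TB", "other card", "pass"),
    ("Seagate 2TB", "other card", "pass"),
    ("Samsung 1.5TB", "Supermicro", "pass"),
    ("Samsung 1.5TB", "Supermicro", "pass"),
]

# Count passes and failures for every disk/controller combination.
tally = defaultdict(lambda: {"pass": 0, "fail": 0})
for disk, controller, outcome in runs:
    tally[(disk, controller)][outcome] += 1

for (disk, controller), counts in sorted(tally.items()):
    print(f"{disk:14} on {controller:11} pass={counts['pass']} fail={counts['fail']}")
```

With this few runs the table cannot separate "2TB versus 1.5TB" from "this particular disk on this particular card". The combinations not yet tried, such as a different 2TB disk on the Supermicro card, are the ones that would narrow it down.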
