T1000 Posted October 22, 2011 Hi guys, I'm running 4.7. I've had 8 x 2TB drives and one 2TB parity drive for a while now. I had one other 2TB drive, pre-cleared 3 times, ready for use. Our next child was due this week, so I thought I would do the last bit of work on the server and fit the 5in3 bay caddy I had so it was all finished. I also added the precleared drive to the array because I was running out of space. Strange thing was it didn't see the new disk (sde) as cleared. My only option was to clear it via the GUI, so I did. After it was added to the array I did a parity check and got 488371093 sync errors!! I ran it again and got 16 errors. Ran it again and got 4. Did a powerdown, checked all cables, started up, ran it again and got 20 errors. Syslog is attached. I'm not sure if the new drive and the sync issues are related, but I would imagine they are. The baby was born yesterday and will take up most of my time, but I could also do with getting to the bottom of these sync errors. If anyone could help me work out what's wrong or what to do next, that would be great. Thanks. I'm not sure if running the check a few times was the wrong thing to do or not. Attachment: syslog-20111021-081606.txt.zip
dgaschk Posted October 23, 2011 The log shows that disk9 was not successfully pre-cleared before it was added. Please post the pre-clear reports for the drive. They should be found in the /boot directory (or on the flash share). First, reestablish correct parity: remove the new drive and enter "initconfig". Then let parity rebuild and do a parity check. Meanwhile, run pre-clear on the new disk and post the results here.
T1000 (Author) Posted October 23, 2011 Thanks dgaschk, I have disabled the new drive and it's building parity now. Which of these reports should I post?
dgaschk Posted October 23, 2011 When did you run the pre-clear on that disk? If you can't find it, just post the report after the next pre-clear.
T1000 (Author) Posted October 23, 2011 I think it was the end of September. I will just post the new pre-clear report. Thanks
T1000 (Author) Posted October 23, 2011 So far the parity check is at 50% with no errors. The pre-clear on the 9th disk didn't finish correctly, so I'm gonna return it. It's from Amazon, so it should be really easy to exchange. I wonder why the server cleared it via the GUI and let it be used even though it wasn't pre-cleared properly?
Joe L. Posted October 24, 2011 T1000 wrote: "I wonder why the server pre-cleared it via the GUI and let it be used even though it wasn't pre-cleared properly?" Since there was no valid pre-clear signature, unRAID cleared the disk itself. That clearing, however, does not read the entire surface or attempt to identify unreadable sectors, so perform at least a parity check after the initial parity calculation.
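Editor's sketch: the difference Joe L. describes, between writing zeros and reading the surface back to confirm them, can be illustrated in a few lines of Python. This is not the preclear script's actual implementation; the function name `find_nonzero_blocks` and the 4096-byte block size are assumptions made for illustration, and the demo runs against an in-memory buffer rather than a real device.

```python
import io
from typing import BinaryIO, Iterator


def find_nonzero_blocks(stream: BinaryIO, block_size: int = 4096) -> Iterator[int]:
    """Yield the byte offset of every block that contains a non-zero byte."""
    offset = 0
    while True:
        block = stream.read(block_size)
        if not block:
            break
        # any() is True as soon as one byte in the block is non-zero.
        if any(block):
            yield offset
        offset += len(block)


# Demo on an in-memory "disk": four zeroed blocks with a single stray bit set.
disk = bytearray(4 * 4096)
disk[2 * 4096 + 10] = 0x02
bad = list(find_nonzero_blocks(io.BytesIO(bytes(disk))))
print(bad)  # [8192]
```

A clear that only writes zeros never runs a pass like this, which is why a read-back (or a parity check) is what exposes a drive or controller that returns different data than was written.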
T1000 (Author) Posted October 25, 2011 The parity check after the parity calculation went fine.
Joe L. Posted October 25, 2011 T1000 wrote: "The parity check after the parity calculation went fine." Now, check for sectors pending re-allocation when next written.
T1000 (Author) Posted October 25, 2011 Joe L. wrote: "Now, check for sectors pending re-allocation when next written." How do I do that?
dgaschk Posted October 25, 2011 Enter "smartctl -a /dev/sdX" where X is the correct drive letter. Post the result here.
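Editor's sketch: once the output of `smartctl -a /dev/sdX` has been saved as text, the attributes that relate to pending or re-allocated sectors can be pulled out of the attribute table. The helper names `parse_smart_attributes` and `sectors_of_concern`, and the choice of which attributes to watch, are assumptions for illustration; the sample rows are shortened from a report of the kind posted in this thread.

```python
from typing import Dict

# Attributes whose raw value should normally be 0 on a healthy disk (an assumption
# about which ones matter most for "sectors pending re-allocation").
WATCHED = ("Reallocated_Sector_Ct", "Current_Pending_Sector", "Offline_Uncorrectable")


def parse_smart_attributes(report: str) -> Dict[str, int]:
    """Map attribute name -> raw value for rows of the smartctl attribute table."""
    attrs: Dict[str, int] = {}
    for line in report.splitlines():
        fields = line.split()
        # Rows look like: ID# NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
        if (len(fields) >= 10 and fields[0].isdigit()
                and fields[2].startswith("0x") and fields[9].isdigit()):
            attrs[fields[1]] = int(fields[9])
    return attrs


def sectors_of_concern(attrs: Dict[str, int]) -> Dict[str, int]:
    """Return only the watched attributes whose raw value is above zero."""
    return {name: attrs[name] for name in WATCHED if attrs.get(name, 0) > 0}


SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
"""
attrs = parse_smart_attributes(SAMPLE)
print(sectors_of_concern(attrs))  # {}
```

An empty result means none of the watched counters is above zero; any non-zero `Current_Pending_Sector` value is the thing Joe L. is asking about.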
lionelhutz Posted October 25, 2011 T1000 wrote: "The parity check after the parity calculation went fine." Just to clarify, I believe this is without the new disk that was giving problems, so you should have no issues now. It's a good idea to double-check the SMART reports on all the disks anyway. Peter
T1000 (Author) Posted October 28, 2011 Added a new replacement drive where the last one was (the last port on the first SAS-SATA on my Supermicro AOC-SASLP-MV8 8-port SAS/SATA card). Tried a preclear and got a screen similar to this one (this isn't the actual image), which I believe to be a preclear fail. I powered down, connected the drive to another SATA card I had in the enclosure and did a preclear fine, no errors. Then powered down again, connected the drive to the port it failed on last time and did the 2nd preclear: no problems, no errors, finished fine. Tried the 3rd and last preclear and it looked like the image above. Why would it fail, then work, then fail, etc.? Should I be asking in the preclear thread, or is it more likely to be a hardware issue? Here is the SMART report for the disk:

root@Tower:~# smartctl -a /dev/sde
smartctl 5.39.1 2010-01-28 r3054 [i486-slackware-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Device Model:     ST2000DL003-9VT166
Serial Number:    5YD6C2PB
Firmware Version: CC3C
User Capacity:    2,000,398,934,016 bytes
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   8
ATA Standard is:  ATA-8-ACS revision 4
Local Time is:    Fri Oct 28 09:45:10 2011 UTC
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status: (0x82) Offline data collection activity was completed without error. Auto Offline Data Collection: Enabled.
Self-test execution status: ( 0) The previous self-test routine completed without error or no self-test has ever been run.
Total time to complete Offline data collection: ( 623) seconds.
Offline data collection capabilities: (0x7b) SMART execute Offline immediate. Auto Offline data collection on/off support. Suspend Offline collection upon new command. Offline surface scan supported. Self-test supported. Conveyance Self-test supported. Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering power-saving mode. Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported. General Purpose Logging supported.
Short self-test routine recommended polling time: ( 1) minutes.
Extended self-test routine recommended polling time: ( 255) minutes.
Conveyance self-test routine recommended polling time: ( 2) minutes.
SCT capabilities: (0x30b7) SCT Status supported. SCT Feature Control supported. SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   112   100   006    Pre-fail  Always       -       43795672
  3 Spin_Up_Time            0x0003   092   092   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       10
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   060   060   030    Pre-fail  Always       -       1197262
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       98
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       10
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   099   000    Old_age   Always       -       1
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   073   069   045    Old_age   Always       -       27 (Lifetime Min/Max 23/27)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       6
193 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       10
194 Temperature_Celsius     0x0022   027   040   000    Old_age   Always       -       27 (0 20 0 0)
195 Hardware_ECC_Recovered  0x001a   037   019   000    Old_age   Always       -       43795672
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       63801739182178
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       2313682740
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       4159336131

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
No self-tests have been logged. [To run self-tests, use: smartctl -t]

SMART Selective self-test log data structure revision number 1
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
   1        0        0  Not_testing
   2        0        0  Not_testing
   3        0        0  Not_testing
   4        0        0  Not_testing
   5        0        0  Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

root@Tower:~#
Joe L. Posted October 28, 2011 It is highly likely to be a hardware issue. The "0002" values on the lines starting at addresses 120, 320, and 500 are not expected. Since these all have the same single bit set, it points to a flaky bit in either the disk electronics (possibly the internal RAM used for its onboard cache) or the disk controller electronics. Or it could be a noisy/marginal power supply that the electronics can't deal with, in which case the errant bits are the symptom even if not the actual cause. These types of disks can cause hair loss if not detected early, as they will often present different data each time you read them. (You'll pull your hair out trying to figure out why parity is showing errors when checked.) Good luck in isolating the hardware involved. Joe L.
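Editor's sketch: Joe L.'s reasoning, that the same single bit set in every bad word points at one flaky bit, can be checked mechanically by XOR-ing each expected word with the observed one. The sketch below assumes the expected value at each address is 0x0000 (a cleared disk should read back zeros) and that the words are 16 bits wide; both are assumptions, as are the function names.

```python
from typing import Iterable, List, Optional, Tuple


def flipped_bits(expected: int, observed: int) -> List[int]:
    """Return the bit positions (0 = least significant) where the two words differ."""
    diff = expected ^ observed
    return [bit for bit in range(diff.bit_length()) if (diff >> bit) & 1]


def common_single_bit(mismatches: Iterable[Tuple[int, int]]) -> Optional[int]:
    """If every (expected, observed) pair differs in the same single bit, return it."""
    seen = set()
    for expected, observed in mismatches:
        bits = flipped_bits(expected, observed)
        if len(bits) != 1:
            return None  # more than one bit wrong: not a simple stuck/flaky bit
        seen.add(bits[0])
    return seen.pop() if len(seen) == 1 else None


# The three unexpected words reported at addresses 120, 320 and 500.
mismatches = [(0x0000, 0x0002), (0x0000, 0x0002), (0x0000, 0x0002)]
print(common_single_bit(mismatches))  # 1
```

A result of 1 means bit 1 (value 0x0002) is the only bit that ever differs, which matches the "same single bit" pattern described above. It does not say which component flipped it; that still has to be isolated by swapping hardware.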
T1000 (Author) Posted October 28, 2011 The image I posted was from a while ago on a different disk, and I posted it because it looked very similar. I will do more preclears until it fails again and take a picture.
Joe L. Posted October 28, 2011 T1000 wrote: "The image I posted was from a while ago on a different disk and I posted it because it looked very similar. I will do more preclears until it fails again and take a picture." If multiple disks present the same symptoms, I would suspect the disk controller, or even system RAM. Perform a memory test on the server for several cycles, preferably overnight, to rule it out.
T1000 (Author) Posted October 30, 2011 I ran a ramtest the other night: Looks OK to me. Not sure how long it's supposed to run for. When you say disk controller, does that mean the Supermicro 8-port card? Could it be the SAS-SATA cable? Either way it's gonna take days to figure out the cause, particularly with it not failing every preclear. The preclear is just about to finish with no errors by the looks of it, so I will have to start the preclear again to try and get it to fail. If it is the card, would the turnaround on testing for failure be quicker on a smaller drive? It would also save me from running 20-30 preclears on my new drive.
T1000 (Author) Posted October 31, 2011 The drive failed on its 2nd preclear (it's always on the second preclear), and here is the image, which is exactly the same as the one I posted the other day: Should I try and get the Supermicro card exchanged? RAM looks fine (as far as I can tell). Same error on different disks. That preclear fail screen is the same now as it was 2 months ago. Is there another test I can do to rule it out?
T1000 (Author) Posted November 2, 2011 Swapped the 2TB Seagate drive that kept failing on its 2nd preclear to the other card; it has precleared twice and is on its 3rd preclear now. I have attached the spare 1.5TB Samsung drive I had to the Supermicro card, and it has also precleared twice and is on its 3rd preclear. So the 2TB Seagate only fails on its 2nd preclear via the Supermicro card. Is it because it's 2TB and not 1.5TB? After the 3rd preclear I'm gonna move it back to the Supermicro card and do some preclears. I just wish it didn't take so long! If anyone has any ideas on how I can find the cause of the problem, that would be great, thanks.