fitbrit Posted January 31, 2011

I'm running 4.6 rc5. The 20-data-drive server comprises a Centurion 590 with 12 drives: parity at top, then cache drive, followed by data drives 1-10. The bottom 4 drives are run off a Supermicro 8-port PCIe x4 card. The remaining ten drives are in two Sans Digital 5-bay eSATA enclosures, run by a single Sil3132(?) eSATA PCIe x1 card. Currently drives 12 and 13 have been removed and were going to be replaced by bigger drives.

After some recent problems, I mistakenly believed all was well because I successfully rebuilt parity with a new Seagate 2TB LP drive. However, whenever I started a parity check, lots of errors would show up very quickly. Having read about some of the problems with the model of parity drive I use, I decided to run a non-correcting parity check to completion. The results were alarming: tens if not hundreds of thousands of errors reported in the area that shows how far the check has progressed, and tens of thousands in the main disk status area. Additionally, I'm still hearing a clicking from the main server case. At first I thought it was my cache drive, but am now pretty sure it's not that one.

We just had a power outage which lasted longer than my UPS was able to support while I was out. When I returned and restarted the server, it took some time to mount the drives. The parity check that started showed lots of errors very quickly again.

Now I'm not sure what to do. I don't trust the parity, but am not sure whether it's due to a bad parity drive or another drive that's failing. I'm attaching the log from the unmenu readout. Any help appreciated.

Jan302011logPt1.txt
fitbrit Posted January 31, 2011

Part 2 of syslog... And SMART report on parity drive:

smartctl 5.39.1 2010-01-28 r3054 [i486-slackware-linux-gnu] (local build)
Copyright © 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Device Model: ST32000542AS
Serial Number: 5XW1FDFJ
Firmware Version: CC34
User Capacity: 2,000,398,934,016 bytes
Device is: Not in smartctl database [for details use: -P showall]
ATA Version is: 8
ATA Standard is: ATA-8-ACS revision 4
Local Time is: Mon Jan 31 12:20:52 2011 EST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status: (0x00) Offline data collection activity was never started. Auto Offline Data Collection: Disabled.
Self-test execution status: ( 0) The previous self-test routine completed without error or no self-test has ever been run.
Total time to complete Offline data collection: ( 623) seconds.
Offline data collection capabilities: (0x73) SMART execute Offline immediate. Auto Offline data collection on/off support. Suspend Offline collection upon new command. No Offline surface scan supported. Self-test supported. Conveyance Self-test supported. Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering power-saving mode. Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported. General Purpose Logging supported.
Short self-test routine recommended polling time: ( 1) minutes.
Extended self-test routine recommended polling time: ( 255) minutes.
Conveyance self-test routine recommended polling time: ( 2) minutes.
SCT capabilities: (0x103f) SCT Status supported. SCT Feature Control supported. SCT Data Table supported.
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   097   082   006    Pre-fail  Always   -           43827459
  3 Spin_Up_Time            0x0003   100   100   000    Pre-fail  Always   -           0
  4 Start_Stop_Count        0x0032   094   094   020    Old_age   Always   -           6607
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always   -           0
  7 Seek_Error_Rate         0x000f   069   060   030    Pre-fail  Always   -           10150913
  9 Power_On_Hours          0x0032   099   099   000    Old_age   Always   -           981
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always   -           0
 12 Power_Cycle_Count       0x0032   094   094   020    Old_age   Always   -           6512
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always   -           0
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always   -           0
187 Reported_Uncorrect      0x0032   001   001   000    Old_age   Always   -           8391
188 Command_Timeout         0x0032   100   100   000    Old_age   Always   -           0
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always   -           0
190 Airflow_Temperature_Cel 0x0022   077   064   045    Old_age   Always   -           23 (Lifetime Min/Max 23/33)
194 Temperature_Celsius     0x0022   023   040   000    Old_age   Always   -           23 (0 19 0 0)
195 Hardware_ECC_Recovered  0x001a   049   040   000    Old_age   Always   -           43827459
197 Current_Pending_Sector  0x0012   100   095   000    Old_age   Always   -           0
198 Offline_Uncorrectable   0x0010   100   095   000    Old_age   Offline  -           0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always   -           0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline  -           210273008879140
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline  -           3018893516
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline  -           1466551571

SMART Error Log Version: 1
ATA Error Count: 8466 (device log contains only the most recent five errors)
CR = Command Register [HEX]
FR = Features Register [HEX]
SC = Sector Count Register [HEX]
SN = Sector Number Register [HEX]
CL = Cylinder Low Register [HEX]
CH = Cylinder High Register [HEX]
DH = Device/Head Register [HEX]
DC = Device Command Register [HEX]
ER = Error register [HEX]
ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes, SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 8466 occurred at disk power-on lifetime: 965 hours (40 days + 5 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 00 a9 6d c9 00 Error: UNC at LBA = 0x00c96da9 = 13200809

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time  Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
25 00 00 af 6b c9 e0 00 00:00:27.714     READ DMA EXT
27 00 00 00 00 00 e0 00 00:00:27.713     READ NATIVE MAX ADDRESS EXT
ec 00 00 00 00 00 a0 00 00:00:27.712     IDENTIFY DEVICE
ef 03 42 00 00 00 a0 00 00:00:27.712     SET FEATURES [set transfer mode]
27 00 00 00 00 00 e0 00 00:00:27.688     READ NATIVE MAX ADDRESS EXT

Error 8465 occurred at disk power-on lifetime: 965 hours (40 days + 5 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 00 a9 6d c9 00 Error: UNC at LBA = 0x00c96da9 = 13200809

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time  Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
25 00 00 af 6b c9 e0 00 00:00:23.929     READ DMA EXT
27 00 00 00 00 00 e0 00 00:00:23.929     READ NATIVE MAX ADDRESS EXT
ec 00 00 00 00 00 a0 00 00:00:23.928     IDENTIFY DEVICE
ef 03 42 00 00 00 a0 00 00:00:23.927     SET FEATURES [set transfer mode]
27 00 00 00 00 00 e0 00 00:00:23.903     READ NATIVE MAX ADDRESS EXT

Error 8464 occurred at disk power-on lifetime: 965 hours (40 days + 5 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 00 a9 6d c9 00 Error: UNC at LBA = 0x00c96da9 = 13200809

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time  Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
25 00 00 af 6b c9 e0 00 00:00:20.176     READ DMA EXT
27 00 00 00 00 00 e0 00 00:00:20.175     READ NATIVE MAX ADDRESS EXT
ec 00 00 00 00 00 a0 00 00:00:20.174     IDENTIFY DEVICE
ef 03 42 00 00 00 a0 00 00:00:20.174     SET FEATURES [set transfer mode]
27 00 00 00 00 00 e0 00 00:00:20.150     READ NATIVE MAX ADDRESS EXT

Error 8463 occurred at disk power-on lifetime: 965 hours (40 days + 5 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 00 a9 6d c9 00 Error: UNC at LBA = 0x00c96da9 = 13200809

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time  Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
25 00 00 af 6b c9 e0 00 00:00:16.412     READ DMA EXT
27 00 00 00 00 00 e0 00 00:00:16.411     READ NATIVE MAX ADDRESS EXT
ec 00 00 00 00 00 a0 00 00:00:16.410     IDENTIFY DEVICE
ef 03 42 00 00 00 a0 00 00:00:16.410     SET FEATURES [set transfer mode]
27 00 00 00 00 00 e0 00 00:00:16.386     READ NATIVE MAX ADDRESS EXT

Error 8462 occurred at disk power-on lifetime: 965 hours (40 days + 5 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 00 a9 6d c9 00 Error: UNC at LBA = 0x00c96da9 = 13200809

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time  Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
25 00 00 af 6b c9 e0 00 00:00:12.665     READ DMA EXT
27 00 00 00 00 00 e0 00 00:00:12.664     READ NATIVE MAX ADDRESS EXT
ec 00 00 00 00 00 a0 00 00:00:12.663     IDENTIFY DEVICE
ef 03 42 00 00 00 a0 00 00:00:12.663     SET FEATURES [set transfer mode]
27 00 00 00 00 00 e0 00 00:00:12.637     READ NATIVE MAX ADDRESS EXT

SMART Self-test log structure revision number 1
No self-tests have been logged. [To run self-tests, use: smartctl -t]

SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
   1       0       0 Not_testing
   2       0       0 Not_testing
   3       0       0 Not_testing
   4       0       0 Not_testing
   5       0       0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

Jan302011logPt2.txt
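Editor's note: all five logged errors report the same address, and the LBA that smartctl prints can be rebuilt from the register bytes shown in each entry. The sketch below is only an illustration of that arithmetic; the function names are made up for this note, and it assumes the SN, CL and CH bytes carry bits 0-7, 8-15 and 16-23 of the address, which is consistent with the values printed in the report.

```python
def lba_from_registers(sn: int, cl: int, ch: int) -> int:
    """Combine the SN, CL and CH register bytes into one address.

    Assumption for this sketch: SN holds bits 0-7, CL bits 8-15,
    CH bits 16-23 (helper name is hypothetical, not a smartctl API).
    """
    return (ch << 16) | (cl << 8) | sn


def hours_to_days(hours: int) -> tuple[int, int]:
    """Split a power-on hour count into whole days and leftover hours."""
    return divmod(hours, 24)


# Register bytes copied from the error entries: SN=a9, CL=6d, CH=c9
lba = lba_from_registers(0xA9, 0x6D, 0xC9)
print(hex(lba), lba)

# Power-on lifetime reported with each error
days, hours = hours_to_days(965)
print(days, "days +", hours, "hours")
```

Both results agree with what the report itself prints (LBA 0x00c96da9 = 13200809, and 40 days + 5 hours), so the five entries are repeated failed reads of one and the same sector rather than errors scattered across the disk.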
bcbgboy13 Posted January 31, 2011

Hi fitbrit,

Thanks to your post in another place I found out about Unraid, and I will try to help you here as much as I can. You are using a rather unusual configuration, and I believe this is why people are not proposing any ideas - after all, a wrong idea can lead to data loss and no one wants to venture there. I am a pure hardware guy and will wait for the Linux gurus to come up with suggestions regarding the software side, but while waiting I will throw out some ideas.

1. Please post your complete hardware configuration - motherboard model, with BIOS version if possible, and the PSU you are using.
2. Your syslog is incomplete - there are a lot of lines before the "part1" you posted, basically the area where Unraid boots. I would like to see the enumeration of the different devices, as you have these exotic "external" drives and I have never seen a configuration like yours.
3. You are using a Seagate 2TB LP hard drive as parity - and it is still on the original firmware, CC34, which is discussed as bad. I have not personally used Seagate in more than 10 years, and I also consider this particular model to be a "poor" choice, but I do try to keep in touch with the various discussions, and I have heard that a clicking noise is often a sign of poor power to the drive.

There is also consensus that the "RAW" data in the SMART report has meaning only to the HD manufacturers (especially for Seagate HDs), but on the other hand some of the attributes look like real data - temperature, power-on hours, etc. According to this SMART report you have used this HD for 981 hours (a good match with the SMART error log entries for the last 5 errors), so that looks like actual data. Attribute 12 is Power_Cycle_Count and it is 6512 - if one assumes that this is an actual value too, then it looks very high, around 7 per hour. But an intermittent power connection to this HD might explain it.
Then we have attribute 187 - Reported_Uncorrect - with "raw data" of 8391, and this may in fact be actual data too, as you have 8466 errors in the SMART error log (where only the last five are kept). Important here are the almost identical numbers for "value", "worst" and "threshold" (001, 001 and 000) - so it looks like this HD is bad one way or another.

That is all from me for now.
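Editor's note: the arithmetic in the post above can be checked against the attribute rows from the SMART report. This is a rough sketch only; `parse_attribute` is a hypothetical helper written for this note (not part of smartmontools), and it assumes each row has the ten whitespace-separated columns shown in the report's table header.

```python
def parse_attribute(row: str) -> dict:
    """Split one smartctl attribute row into its columns.

    Hypothetical helper: assumes the 10-column layout
    ID# NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE.
    The raw value is kept as text because it may contain spaces.
    """
    fields = row.split(None, 9)
    return {
        "id": int(fields[0]),
        "name": fields[1],
        "value": int(fields[3]),
        "worst": int(fields[4]),
        "thresh": int(fields[5]),
        "raw": fields[9].strip(),
    }


# Rows copied from the SMART report earlier in the thread
rows = [
    "  9 Power_On_Hours 0x0032 099 099 000 Old_age Always - 981",
    " 12 Power_Cycle_Count 0x0032 094 094 020 Old_age Always - 6512",
    "187 Reported_Uncorrect 0x0032 001 001 000 Old_age Always - 8391",
]
attrs = {a["id"]: a for a in map(parse_attribute, rows)}

ata_error_count = 8466  # "ATA Error Count" line of the same report

# Power cycles per power-on hour, taking both raw values at face value
cycles_per_hour = int(attrs[12]["raw"]) / int(attrs[9]["raw"])
print(round(cycles_per_hour, 1))

# Gap between the logged ATA errors and the Reported_Uncorrect raw value
print(ata_error_count - int(attrs[187]["raw"]))

# Normalized value versus threshold for attribute 187
print(attrs[187]["value"], attrs[187]["thresh"])
```

The rate works out to roughly 6.6 power cycles per hour, in line with the "around 7 per hour" estimate, and the two error counters differ by only 75. Whether the raw values can be taken at face value for this drive remains an assumption, as the post itself points out.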
fitbrit Posted February 1, 2011

Thanks very much, bcbgboy. Your name sounds familiar; was it RFD? My configuration isn't that unusual for some of the expert users here. The eSATA boxes allow one to expand the array beyond the limits of one's case. If I'd known I was going to get so much storage over time, I would have invested in a Norco 4224.

The Seagate data is strange in that there have not been that many power cycles at all. Could this be the head parking issue with the CC34 firmware? I said it was a new drive, so under 1000 hours seems right to me too. It's such a pain to update the firmware, but I guess I'll have to try it and see if it works out. The reason I chose the Seagate was that I was getting fed up with DOA WD drives, or with them giving up the ghost a few weeks after I installed them. I've had six or more WD drives go bad or arrive bad in the past year. In fact one of the RMA replacements was DOA too. I loved the Samsung drives I have, but now the F4 also seems to have firmware problems.

My plan is to just replace the parity drive for now and rebuild parity. However, I just wanted to check with some experts that this is not a bad thing to do at this stage.
SSD Posted February 1, 2011

The kinds of errors you are seeing in your smart report are normally caused by some type of problem with the connection between the computer and the drive - not with the drive itself. Could be a bad or loose cable, bad controller port, bad drive cage, or even a bad/broken connector on the drive. Most commonly the problem is that the cable is not securely plugged in on one end or the other. Locking cables will sometimes fix these types of intermittent problems. If you don't have one, I'd recommend replacing the cable with a new one (or at least unplugging and replugging both ends) and trying again.
fitbrit Posted February 1, 2011

"The kinds of errors you are seeing in your smart report are normally caused by some type of problem with the connection between the computer and the drive - not with the drive itself. Could be a bad or loose cable, bad controller port, bad drive cage, or even a bad/broken connector on the drive. Most commonly the problem is that the cable is not securely plugged in on one end or the other. Locking cables will sometimes fix these types of intermittent problems. If you don't have one, I'd recommend replacing the cable with a new one (or at least unplugging and replugging both ends) and trying again."

Thanks very much, bjp999. It kind of makes sense: I've had parity problems with several drives now with my newish motherboard - a Supermicro C2SEA, like Tom uses in his servers. I had all-locking cables, but changed the one on my parity drive when my last two parity drives had problems. When I moved the drives to data ones (after successful pre-clears and also using them for a while in Windows without issue). Taking your post into account, I think it might be the motherboard port that's screwy. Bummer, because this is an RMA'd board, with the first one being DOA.

I have more internal ports available than I have slots in my main server cage, so I could move all the drives down one port and use one of the unused connections on my Supermicro 4x SAS board to take up the displaced drive. Here I was thinking I wouldn't need to do drive rearrangement on this scale for a while!