magnumdoomguy Posted January 5, 2014

Been running 5.0.3 for about a month now. Decided to run a parity check last night, and got my fourth red ball. The second and third are here:

http://lime-technology.com/forum/index.php?topic=30972.msg279030#msg279030

Running smartctl -a -A /dev/sdq on the latest failed drive gives:

A mandatory SMART command failed: exiting. To continue, add one or more '-T permissive' options.

I tried running smartctl on a few of the other drives (which are otherwise functioning fine):

smartctl 6.2 2013-07-26 r3841 [i686-linux-3.9.11p-unRAID] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Seagate Barracuda 7200.14 (AF)
Device Model:     ST2000DM001-9YN164
Serial Number:    Z1E1DSN7
LU WWN Device Id: 5 000c50 04e618f6f
Firmware Version: CC4B
User Capacity:    2,000,398,934,016 bytes [2.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    7200 rpm
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS T13/1699-D revision 4
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Sun Jan 5 13:38:41 2014 EST

==> WARNING: A firmware update for this drive may be available,
see the following Seagate web pages:
http://knowledge.seagate.com/articles/en_US/FAQ/207931en
http://knowledge.seagate.com/articles/en_US/FAQ/223651en

SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                 ( 584) seconds.
Offline data collection
capabilities:                    (0x73) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        No Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        ( 228) minutes.
Conveyance self-test routine
recommended polling time:        (   2) minutes.
SCT capabilities:              (0x3085) SCT Status supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   119   099   006    Pre-fail  Always       -       226787296
  3 Spin_Up_Time            0x0003   094   094   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       438
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   078   060   030    Pre-fail  Always       -       65645134
  9 Power_On_Hours          0x0032   092   092   000    Old_age   Always       -       7620
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       385
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   099   000    Old_age   Always       -       0 0 1
189 High_Fly_Writes         0x003a   052   052   000    Old_age   Always       -       48
190 Airflow_Temperature_Cel 0x0022   071   052   045    Old_age   Always       -       29 (Min/Max 25/38)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       68
193 Load_Cycle_Count        0x0032   008   008   000    Old_age   Always       -       185919
194 Temperature_Celsius     0x0022   029   048   000    Old_age   Always       -       29 (0 15 0 0 0)
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       6410h+40m+47.547s
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       177072791022439
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       57907602430615

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
No self-tests have been logged. [To run self-tests, use: smartctl -t]

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

While not all of my drives, I do appear to be getting high raw read error rates on a few drives.

At this point I'm wondering if I should start suspecting the hardware. I have a Norco 4224 with the following:

NORCO C-SFF8087-D SFF-8087 to SFF-8087 Internal Multilane SAS Cable (x3)
http://www.newegg.ca/Product/Product.aspx?Item=N82E16816133034

SUPERMICRO AOC-SASLP-MV8 PCI-Express x4 Low Profile SAS RAID Controller
http://www.newegg.ca/Product/Product.aspx?Item=N82E16816101358

SUPERMICRO MBD-X9SCM-O LGA 1155 Intel C204 Micro ATX Intel Xeon E3 Server Motherboard
http://www.newegg.ca/Product/Product.aspx?Item=N82E16813182254

SUPERMICRO AOC-SAS2LP-MV8 PCI-Express 2.0 x8 SATA / SAS 8-Port Controller Card (x2)
http://www.newegg.ca/Product/Product.aspx?Item=N82E16816101792

Any recommendations or suggestions?

syslog.zip
magnumdoomguy Posted January 5, 2014

Here's the third drive that red balled:

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   166   157   021    Pre-fail  Always       -       6700
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       252
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   100   253   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   054   054   000    Old_age   Always       -       33850
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       172
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       24
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       248
194 Temperature_Celsius     0x0022   115   081   000    Old_age   Always       -       35
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       4
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

Have a UDMA CRC error of 4 on that one, but otherwise looks okay I think.

And the second red balled drive:

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   118   099   006    Pre-fail  Always       -       190616536
  3 Spin_Up_Time            0x0003   094   093   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       729
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   073   060   030    Pre-fail  Always       -       12955017484
  9 Power_On_Hours          0x0032   092   092   000    Old_age   Always       -       7279
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       153
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0 0 0
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   069   049   045    Old_age   Always       -       31 (2 8 45 30 0)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       138
193 Load_Cycle_Count        0x0032   023   023   000    Old_age   Always       -       155328
194 Temperature_Celsius     0x0022   031   051   000    Old_age   Always       -       31 (128 0 0 0 0)
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       6655h+38m+34.307s
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       28251950231
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       33782756058

None of the three culprits listed in the Unraid wiki (Reallocated sector count, current pending sector, or UDMA) but a very high Raw Read Error.

And here's a green ball drive with a high Raw Read Error Rate:

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   106   099   006    Pre-fail  Always       -       11741173
  3 Spin_Up_Time            0x0003   100   100   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   099   099   020    Old_age   Always       -       1592
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   072   060   030    Pre-fail  Always       -       18178653
  9 Power_On_Hours          0x0032   073   073   000    Old_age   Always       -       23865
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       561
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   099   000    Old_age   Always       -       12885098500
189 High_Fly_Writes         0x003a   092   092   000    Old_age   Always       -       8
190 Airflow_Temperature_Cel 0x0022   073   036   045    Old_age   Always   In_the_past 27 (0 111 40 26 0)
194 Temperature_Celsius     0x0022   027   064   000    Old_age   Always       -       27 (0 17 0 0 0)
195 Hardware_ECC_Recovered  0x001a   051   021   000    Old_age   Always       -       11741173
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       10823317607033
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       2091198126
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       1459779639
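The check described above (reallocated sectors, pending sectors, UDMA CRC errors) can be sketched as a small shell filter. This is only a minimal sketch, not something from the Unraid wiki or smartmontools: the function name flag_attrs is hypothetical, and it assumes the usual ten-column smartctl -A attribute table with the raw value in the tenth column.

```shell
#!/bin/sh
# Hypothetical helper: read `smartctl -A` output on stdin and print the
# ID, name, and raw value of attributes 5, 197, and 199, but only when
# the raw value is greater than zero.
flag_attrs() {
    awk '$1 == 5 || $1 == 197 || $1 == 199 {
        if ($10 + 0 > 0) print $1, $2, $10
    }'
}

# Example (the device name is a placeholder):
#   smartctl -A /dev/sdX | flag_attrs
```

Fed the table for the third red-balled drive above, this would print only the UDMA_CRC_Error_Count row with its raw value of 4.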
magnumdoomguy Posted January 5, 2014

My apologies, I left out the power supply:

CORSAIR HX Series HX850 850W ATX12V 2.3 / EPS12V 2.91 SLI Ready CrossFire Ready 80 PLUS GOLD Certified Modular Active PFC Power Supply New 4th Gen CPU Certified Haswell Ready
http://www.newegg.ca/Product/Product.aspx?Item=N82E16817139011

21 drives in the array including parity and cache (SSD). Do you think the power supply might be the culprit?
dgaschk Posted January 5, 2014

Check the power cabling, especially any splitters.
magnumdoomguy Posted January 5, 2014

Not using any power splitters. Will power down now and check the cables though.
magnumdoomguy Posted January 5, 2014

Just checked the cables, everything seems pretty snug. Now that it has rebooted smartctl works on the latest red balled drive.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   181   179   021    Pre-fail  Always       -       5908
  4 Start_Stop_Count        0x0032   099   099   000    Old_age   Always       -       1681
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   095   095   000    Old_age   Always       -       4377
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       50
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       18
193 Load_Cycle_Count        0x0032   198   198   000    Old_age   Always       -       7551
194 Temperature_Celsius     0x0022   119   076   000    Old_age   Always       -       31
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

Looks fine to me. Going to try a long test on it now.
magnumdoomguy Posted January 5, 2014

Okay, ran smartctl on every drive and have attached all the ones that came up with errors (11!). I compared them with the main screen of Unraid to look for patterns.

All the problem drives save one are Seagates (of various sizes and bought at different times). The exception is a Samsung 1TB. And not only are they all Seagates (save one), they are also every Seagate drive that I have.

All the problem drives save one are on the SUPERMICRO AOC-SAS2LP-MV8 PCI-Express 2.0 x8 SATA / SAS 8-Port Controller Cards. The other controller card only has 3 drives on it though, and two are Western Digitals. The one Seagate on it is problematic.

Can anyone help? I'm at a loss.

smartctl.zip
MyKroFt Posted January 6, 2014

Here is what I have read and am following on mine: only use 1 power connector per backplane. The 2nd is for a redundant PS, from what I have read.

Myk
bobbintb Posted January 6, 2014

lol, I recently had a similar situation. I was getting errors on drives randomly. I'd run a SMART test and it would fail, but then pass in a later instance. I figured it wasn't the drives themselves, as it was many drives failing and it would be very random and inconsistent. I tried all new cables, since I have dozens of spare SATA cables. I tried bypassing the backplane, and a new PSU. In the end the controller on the motherboard was bad.

I have actually seen that happen to a customer before, and I was glad I caught it. I was troubleshooting his PC and all evidence pointed to a failing hard drive. On a hunch, I checked his motherboard and there were blown capacitors right by the IDE port.

Since you have multiple controllers, that might not be the case for you. I can only suggest ruling things out one by one in order of likelihood: the drives, the cables, the backplane, the PSU, RAM, mobo, etc.
DoeBoye Posted January 6, 2014

I'm a fan of doubling up the PS connections on the backplanes. I know they are supposed to be for a redundant PS, but in my own personal experience, since I've done it, random red ball issues have become non-existent. Other variables have also changed over time, but I personally feel that the 4224 seems to be more stable with 2 PS connections per backplane (using separate PS wires if possible).

Note: I had to buy some SATA-to-Molex adapters because I didn't have enough Molex connections.

HTH
magnumdoomguy Posted January 6, 2014

I have a new Norco with the single power back planes. Someone else on this forum posted a pic so thankfully I don't have to go through the trouble:

http://lime-technology.com/forum/index.php?topic=29274.0

So a single power connection is the only option. No splitters, and each is on its own cable (I read somewhere that that was preferable if possible).

I googled Seagate and SMARTCTL and found that the numbers they use for Raw Read Error are really messed up -- it's a bit of math to convert the raw value into something meaningful to a human. So it turns out all the raw read errors on the Seagates are unimportant (after doing the math).

So I'm down to just the four (ahem, now 5) red balls.

I decided to wipe the array config last night and set it up the same again, told it to trust the parity, started up, then ran a parity check. Went well for a while, then drive 11 just flat out disappeared. Got a red ball and that drive wasn't even listed anymore. I should have captured the syslog at that point, but didn't unfortunately.

I noticed the temps were quite high (most of the drives in the sixties, and my two 1.5TB drives [which have always run hotter than the others] were in the seventies). I powered down, opened the window (which is pretty close to the NAS) and let the Canadian winter bring it down to a cooler temp for a while.

After cooling down, I did the same procedure again, and it's nearly done now (75% in, with only the 4TB drives left, so the speed has increased dramatically).

The red ball issues have happened during parity checks or a rebuild, so I think I'm just plain getting too hot. I have the optional 120mm fan plate for the Norco with 3 Noctua fans on it. I'm not using the low noise adapters (which limit the speed to 1300RPM max). I did choose a low noise setting in the BIOS though.

I'm figuring that since the drives are before the intake fans and the motherboard setting is using the system ambient temp to decide fan speeds, it's going lower than the hard drives need. Once the parity check completes, I'll try hooking up a screen, going back into the BIOS, and choosing a more aggressive fan setting (bye bye sleep -- this is in my bedroom).
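The "bit of math" mentioned above can be sketched in shell. This is a minimal sketch resting on an assumption that is widely repeated for Seagate drives but is not confirmed anywhere in this thread: that the 48-bit raw value of Raw_Read_Error_Rate and Seek_Error_Rate packs an error count into the upper 16 bits and an operation count into the lower 32 bits. The function name seagate_split is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: split a Seagate-style 48-bit raw value into
# "<errors> <operations>", ASSUMING the error count sits in the upper
# 16 bits and the operation count in the lower 32 bits.
seagate_split() {
    raw=$1
    errors=$(( (raw >> 32) & 0xFFFF ))
    operations=$(( raw & 0xFFFFFFFF ))
    echo "$errors $operations"
}

seagate_split 226787296     # Raw_Read_Error_Rate of the first drive posted
seagate_split 12955017484   # Seek_Error_Rate of the second red-balled drive
```

Under that assumption, the first value works out to 0 errors in 226787296 reads and the second to 3 seek errors in 70115596 seeks, which fits the conclusion that the huge raw numbers on the Seagates are not alarming by themselves.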
DoeBoye Posted January 6, 2014

magnumdoomguy said: "I have a new Norco with the single power back planes. Someone else on this forum posted a pic so thankfully I don't have to go through the trouble: http://lime-technology.com/forum/index.php?topic=29274.0"

Oh! Shows I haven't been keeping up with things. I had no idea Norco went to one Molex plug!

As far as cooling, those temps are definitely too hot. That certainly might explain your problems! Have you tried disabling the BIOS control and using one of the fan scripts floating around the forums that control fan speed based on drive temps rather than ambient?

With the same case, 14 data drives and an ambient temp of 22 degrees Celsius, my drives at idle generally live around 33 degrees. Even when a parity check is running, temps in the winter rarely break 40.

How are your rear fans working? I replaced mine (there's a thread discussing the model around here), so they are much quieter than the original ones, and I'm using a 120mm fan wall. Are the rear fans set to push air out of the case? With the front fan wall pulling air across the drives, and the rear fans ejecting air, the 4224 generally stays reasonably cool. Fan noise is definitely noticeable on full, but almost inaudible when the server is idle.
magnumdoomguy Posted January 6, 2014

That script sounds awesome... I'll hunt for it. Definitely preferable to cranking the fans 24/7.

I also replaced the rear fans (and yes, they're definitely exhausting)... Also with Noctuas (love Noctua -- so powerful, yet so quiet).

Thanks for the tip on the script. That would be a great feature to make standard.
DoeBoye Posted January 7, 2014

magnumdoomguy said: "That script sounds awesome... I'll hunt for it. Definitely preferable to cranking the fans 24/7. I also replaced the rear fans (and yes, they're definitely exhausting)... Also with Noctuas (love Noctua -- so powerful, yet so quiet). Thanks for the tip on the script. That would be a great feature to make standard."

No worries! I actually use two copies. One controls the header for my case fans, the other controls the header for my CPU fan. I've attached them. I like them to run slightly differently, and figured two scripts was just easier.

For the record, the script I am using is the one with the following versions:

# Version 1.0 Authored by Aiden
# Version 1.1 Modified by Dan Stroot to run in a loop. Does not require the user
#             to add this to cron - just start it in your go file.
# Version 1.2 Modified by Guzzi - removed -d ata to work on sas controllers
#

I messed around a bit with the last version, and don't remember if I returned it back to 'stock', so you may want to google around and find the original I used, in case I did make changes and they don't like your system.

I call the scripts from my go file like so:

### Fan Speed Control ###
/boot/scripts/fan_speed.sh
#
/boot/scripts/fan_speed_cpu.sh
#
### END - Fan Speed Control ###

Cheers,
DB.

fan_speed.zip
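For a feel of what a drive-temperature fan script does at its core, here is a minimal sketch of mapping the hottest drive temperature to a PWM value. It is not the attached script: the function names, the 35 and 45 degree thresholds, and the 80 to 255 PWM range are all assumptions picked for illustration.

```shell
#!/bin/sh
# Hypothetical tuning values (assumptions, not taken from the attached script).
TEMP_LOW=35    # at or below this temperature, run the fans at PWM_MIN
TEMP_HIGH=45   # at or above this temperature, run the fans at PWM_MAX
PWM_MIN=80
PWM_MAX=255

# Print the largest number read from stdin (one temperature per line).
max_temp() {
    awk 'NR == 1 || $1 + 0 > max { max = $1 + 0 } END { print max + 0 }'
}

# Map a temperature in Celsius to a PWM value by linear interpolation
# between the two thresholds, clamped to PWM_MIN and PWM_MAX.
pwm_for_temp() {
    t=$1
    if [ "$t" -le "$TEMP_LOW" ]; then
        echo "$PWM_MIN"
    elif [ "$t" -ge "$TEMP_HIGH" ]; then
        echo "$PWM_MAX"
    else
        echo $(( PWM_MIN + (t - TEMP_LOW) * (PWM_MAX - PWM_MIN) / (TEMP_HIGH - TEMP_LOW) ))
    fi
}
```

A loop built around this would collect each drive's temperature (for example from attribute 194 in smartctl -A output), pass the hottest one to pwm_for_temp, and write the result to the fan header's PWM control file, whose location depends on the motherboard's sensor driver. With the values above, drives in the sixties or seventies would pin the fans at full speed.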
SnickySnacks Posted January 7, 2014

I'll throw in my two cents here, too. Your drives are definitely running way too hot. As I recall mine don't get over 40 when running a parity check either. If yours are going north of 70 you have something off.

You say your rear fans are blowing out of the case; are the 120s in the middle blowing to the rear? Do you have your server in a closet or something?

I have basically the same setup and I would panic if I saw a drive top 50, much less 70.