Been running UNRAID for about one month now, fourth red ball

magnumdoomguy · January 5, 2014

Been running 5.0.3 for about a month now. Decided to run a parity check last night, and got my fourth red ball. The second and third are here: http://lime-technology.com/forum/index.php?topic=30972.msg279030#msg279030

Running smartctl -a -A /dev/sdq on the latest failed drive gives:

A mandatory SMART command failed: exiting. To continue, add one or more '-T permissive' options.

I tried running smartctl on a few of the other drives (which are otherwise functioning fine):

smartctl 6.2 2013-07-26 r3841 [i686-linux-3.9.11p-unRAID] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Seagate Barracuda 7200.14 (AF)
Device Model:     ST2000DM001-9YN164
Serial Number:    Z1E1DSN7
LU WWN Device Id: 5 000c50 04e618f6f
Firmware Version: CC4B
User Capacity:    2,000,398,934,016 bytes [2.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    7200 rpm
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS T13/1699-D revision 4
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Sun Jan  5 13:38:41 2014 EST

==> WARNING: A firmware update for this drive may be available,
see the following Seagate web pages:
http://knowledge.seagate.com/articles/en_US/FAQ/207931en
http://knowledge.seagate.com/articles/en_US/FAQ/223651en

SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (  584) seconds.
Offline data collection
capabilities:                    (0x73) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        No Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        ( 228) minutes.
Conveyance self-test routine
recommended polling time:        (   2) minutes.
SCT capabilities:              (0x3085) SCT Status supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   119   099   006    Pre-fail  Always       -       226787296
  3 Spin_Up_Time            0x0003   094   094   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       438
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   078   060   030    Pre-fail  Always       -       65645134
  9 Power_On_Hours          0x0032   092   092   000    Old_age   Always       -       7620
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       385
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   099   000    Old_age   Always       -       0 0 1
189 High_Fly_Writes         0x003a   052   052   000    Old_age   Always       -       48
190 Airflow_Temperature_Cel 0x0022   071   052   045    Old_age   Always       -       29 (Min/Max 25/38)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       68
193 Load_Cycle_Count        0x0032   008   008   000    Old_age   Always       -       185919
194 Temperature_Celsius     0x0022   029   048   000    Old_age   Always       -       29 (0 15 0 0 0)
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       6410h+40m+47.547s
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       177072791022439
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       57907602430615

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]


SMART Selective self-test log data structure revision number 1
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

It isn't all of my drives, but a few of them do appear to be showing high raw read error rates.

At this point I'm wondering if I should start suspecting the hardware.

I have a Norco 4224 with the following:

NORCO C-SFF8087-D SFF-8087 to SFF-8087 Internal Multilane SAS Cable (x3) http://www.newegg.ca/Product/Product.aspx?Item=N82E16816133034

SUPERMICRO AOC-SASLP-MV8 PCI-Express x4 Low Profile SAS RAID Controller http://www.newegg.ca/Product/Product.aspx?Item=N82E16816101358

SUPERMICRO MBD-X9SCM-O LGA 1155 Intel C204 Micro ATX Intel Xeon E3 Server Motherboard http://www.newegg.ca/Product/Product.aspx?Item=N82E16813182254

SUPERMICRO AOC-SAS2LP-MV8 PCI-Express 2.0 x8 SATA / SAS 8-Port Controller Card (x2) http://www.newegg.ca/Product/Product.aspx?Item=N82E16816101792

Any recommendations or suggestions?

syslog.zip

magnumdoomguy · January 5, 2014

Here's the third drive that red balled:

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   166   157   021    Pre-fail  Always       -       6700
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       252
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   100   253   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   054   054   000    Old_age   Always       -       33850
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       172
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       24
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       248
194 Temperature_Celsius     0x0022   115   081   000    Old_age   Always       -       35
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       4
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

It has a UDMA CRC error count of 4, but otherwise it looks okay, I think.

And the second red balled drive:

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   118   099   006    Pre-fail  Always       -       190616536
  3 Spin_Up_Time            0x0003   094   093   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       729
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   073   060   030    Pre-fail  Always       -       12955017484
  9 Power_On_Hours          0x0032   092   092   000    Old_age   Always       -       7279
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       153
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0 0 0
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   069   049   045    Old_age   Always       -       31 (2 8 45 30 0)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       138
193 Load_Cycle_Count        0x0032   023   023   000    Old_age   Always       -       155328
194 Temperature_Celsius     0x0022   031   051   000    Old_age   Always       -       31 (128 0 0 0 0)
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       6655h+38m+34.307s
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       28251950231
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       33782756058

It shows none of the three culprits listed in the unRAID wiki (reallocated sector count, current pending sector, or UDMA CRC errors), but it does have a very high Raw Read Error Rate.
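For anyone wanting to run the same check across many drives, here is a rough sketch of how those three attributes could be pulled out of saved smartctl output. The function name `flag_attributes` and the regular expression are my own assumptions for illustration, not part of smartctl or unRAID, and the sample rows are copied from the third red-balled drive above.

```python
import re

# Attribute IDs named in the post as the usual culprits:
# 5 = Reallocated_Sector_Ct, 197 = Current_Pending_Sector, 199 = UDMA_CRC_Error_Count
WATCHED_IDS = {5, 197, 199}

# One row of the "Vendor Specific SMART Attributes" table (assumed layout:
# ID, name, flag, value, worst, thresh, type, updated, when_failed, raw).
ROW = re.compile(
    r"^\s*(?P<id>\d+)\s+(?P<name>\S+)\s+0x[0-9a-fA-F]+\s+\d+\s+\d+\s+\d+\s+"
    r"\S+\s+\S+\s+\S+\s+(?P<raw>\d+)"
)


def flag_attributes(smartctl_output):
    """Return {attribute_name: raw_value} for watched attributes with a nonzero raw value."""
    flagged = {}
    for line in smartctl_output.splitlines():
        match = ROW.match(line)
        if not match:
            continue  # header lines, blank lines, and other sections are skipped
        if int(match["id"]) in WATCHED_IDS and int(match["raw"]) > 0:
            flagged[match["name"]] = int(match["raw"])
    return flagged


# Rows copied from the third red-balled drive earlier in the thread.
sample = """\
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       4
"""
print(flag_attributes(sample))  # {'UDMA_CRC_Error_Count': 4}
```

A nonzero raw value is only a prompt to look closer; it does not by itself mean the drive is failing.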

And here's a green ball drive with a high Raw Read Error Rate:

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   106   099   006    Pre-fail  Always       -       11741173
  3 Spin_Up_Time            0x0003   100   100   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   099   099   020    Old_age   Always       -       1592
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   072   060   030    Pre-fail  Always       -       18178653
  9 Power_On_Hours          0x0032   073   073   000    Old_age   Always       -       23865
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       561
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   099   000    Old_age   Always       -       12885098500
189 High_Fly_Writes         0x003a   092   092   000    Old_age   Always       -       8
190 Airflow_Temperature_Cel 0x0022   073   036   045    Old_age   Always   In_the_past 27 (0 111 40 26 0)
194 Temperature_Celsius     0x0022   027   064   000    Old_age   Always       -       27 (0 17 0 0 0)
195 Hardware_ECC_Recovered  0x001a   051   021   000    Old_age   Always       -       11741173
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       10823317607033
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       2091198126
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       1459779639

dgaschk · January 5, 2014

Which model PSU?

magnumdoomguy · January 5, 2014

My apologies:

CORSAIR HX Series HX850 850W ATX12V 2.3 / EPS12V 2.91 SLI Ready CrossFire Ready 80 PLUS GOLD Certified Modular Active PFC Power Supply New 4th Gen CPU Certified Haswell Ready http://www.newegg.ca/Product/Product.aspx?Item=N82E16817139011

21 drives in the array including parity and cache (ssd). Do you think the power supply might be the culprit?

dgaschk · January 5, 2014

Check the power cabling, especially any splitters.

magnumdoomguy · January 5, 2014

Not using any power splitters.

Will power down now and check the cables though.

magnumdoomguy · January 5, 2014

Just checked the cables, everything seems pretty snug.

Now that it has rebooted smartctl works on the latest red balled drive.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   181   179   021    Pre-fail  Always       -       5908
  4 Start_Stop_Count        0x0032   099   099   000    Old_age   Always       -       1681
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   095   095   000    Old_age   Always       -       4377
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       50
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       18
193 Load_Cycle_Count        0x0032   198   198   000    Old_age   Always       -       7551
194 Temperature_Celsius     0x0022   119   076   000    Old_age   Always       -       31
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

Looks fine to me. Going to try a long test on it now.

magnumdoomguy · January 5, 2014

Okay, ran smartctl on every drive and have attached all the ones that came up with errors (11!).

I compared them with the main screen of Unraid to look for patterns.

All the problem drives save one are Seagates (of various sizes and bought at different times). The exception is a Samsung 1TB. And not only are they all Seagates (save one) they are also the entirety of Seagate drives that I have.

All the problem drives save one are on the SUPERMICRO AOC-SAS2LP-MV8 PCI-Express 2.0 x8 SATA / SAS 8-Port Controller Cards. The other controller card only has 3 drives on it though and two are Western Digitals. The one Seagate on it is problematic.

Can anyone help? I'm at a loss.

smartctl.zip

MyKroFt · January 6, 2014

I have read this and follow it on mine: only use 1 power connector per backplane. The 2nd is for a redundant PS, from what I have read.

Myk

bobbintb · January 6, 2014

lol, i recently had a similar situation. i was getting errors on drives randomly. i'd run a smart test and it would fail, but then pass in a later instance. i figured it wasn't the drives themselves, as it was many drives failing and it was very random and inconsistent. i tried all new cables since i have dozens of spare sata cables. i tried bypassing the backplane and a new psu. in the end the controller on the motherboard was bad.

i have actually seen that happen to a customer before and i was glad i caught it. i was troubleshooting his pc and all evidence pointed to a failing hard drive. on a hunch, i checked his motherboard and there were blown capacitors on the motherboard right by the ide port.

since you have multiple controllers that might not be the case for you. i can only suggest ruling things out one by one by order of likelihood.

the drives, the cables, the backplane, the psu, ram, mobo, etc.

DoeBoye · January 6, 2014

I'm a fan of doubling up the PS connections on the backplanes.

I know they are supposed to be for a redundant PS, but in my own personal experience, since I've done it, random red ball issues have become non-existent.

Other variables have also changed over time, but I personally feel that the 4224 seems to be more stable with 2 PS connections per backplane (Using separate PS wires if possible).

Note: I had to buy some sata-to-molex adapters because I didn't have enough molex connections

HTH

magnumdoomguy · January 6, 2014

I have a new Norco with the single power back planes. Someone else on this forum posted a pic so thankfully I don't have to go through the trouble: http://lime-technology.com/forum/index.php?topic=29274.0

So a single power connection is the only option. No splitters, and each is on its own cable (I read somewhere that that was preferable if possible).

I googled Seagate and SMARTCTL and found that the numbers they use for Raw Read Error are really messed up -- it's a bit of math to convert the raw value into something meaningful to a human. So it turns out all the raw read errors on the Seagates are unimportant (after doing the math). So I'm down to just the four (ahem, now 5) red balls.
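For reference, the conversion can be sketched roughly as follows. This assumes the commonly reported (not vendor-documented, as far as I know) layout for Seagate's Raw_Read_Error_Rate and Seek_Error_Rate, where the 48-bit raw value holds an error count in the upper 16 bits and an operation count in the lower 32 bits. The function name `split_seagate_raw` is made up for illustration.

```python
def split_seagate_raw(raw_value):
    """Split a 48-bit raw value into (error_count, operation_count).

    Assumption (community-reported, not vendor-documented): the upper 16 bits
    count errors and the lower 32 bits count operations (reads or seeks).
    """
    errors = (raw_value >> 32) & 0xFFFF
    operations = raw_value & 0xFFFFFFFF
    return errors, operations


# Raw values copied from the Seagate tables earlier in the thread.
for label, raw in [
    ("Raw_Read_Error_Rate (working drive)", 226787296),
    ("Raw_Read_Error_Rate (second red ball)", 190616536),
    ("Seek_Error_Rate (second red ball)", 12955017484),
]:
    errors, operations = split_seagate_raw(raw)
    print(f"{label}: {errors} errors in {operations} operations")
```

Under that assumption, both read error values above decode to zero errors, since they fit entirely in the lower 32 bits, and the seek error value on the second red-balled drive decodes to 3 errors.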

I decided to wipe the array config last night and set it up the same again, told it to trust the parity, started up, then ran a parity check. Went well for a while then drive 11 just flat out disappeared. Got a redball and that drive wasn't even listed anymore. I should have captured the syslog at that point, but didn't unfortunately. I noticed the temps were quite high (most of the drives in the sixties and my two 1.5TB drives [which have always run hotter than the others] were in the seventies). I powered down, opened the window (which is pretty close to the NAS) and let the Canadian winter bring it down to a cooler temp for a while.

After cooling down, I did the same procedure again, and it's nearly done now (75% in, with only the 4TB drives left, so the speed has increased dramatically). The red ball issues have happened during parity checks or a rebuild, so I think I'm just plain getting too hot. I have the optional 120mm fan plate for the Norco with 3 Noctua fans on it. I'm not using the low noise adapters (which limit the speed to 1300RPM max). I did put a low noise setting on in the BIOS though. I'm figuring that since the drives sit before the intake fans and the motherboard setting uses the system ambient temp to decide fan speeds, the fans are running slower than the hard drives need. Once the parity check completes, I'll try hooking up a screen, going back into the BIOS and choosing a more aggressive fan setting (bye bye sleep -- this is in my bedroom).

DoeBoye · January 6, 2014

oh! Shows I haven't been keeping up with things. I had no idea Norco went to one molex plug!

As far as cooling goes, those temps are definitely too hot. That certainly might explain your problems! Have you tried disabling the BIOS control and using one of the fan scripts floating around the forums that control fan speed based on drive temps rather than ambient? With the same case, 14 data drives and an ambient temp of 22 degrees Celsius, my drives at idle generally live around 33 degrees. Even when a parity check is running, temps in the winter rarely break 40.
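To illustrate the idea only (this is not the forum script itself), here is a minimal sketch of how a fan speed could be derived from the hottest drive rather than from ambient temperature. The function name `pwm_for_temperature` and every threshold and PWM value in it are assumptions picked for the example; actually reading drive temperatures and writing the result to a fan header is left out.

```python
def pwm_for_temperature(hottest_drive_c, low_c=35, high_c=45, pwm_min=55, pwm_max=255):
    """Map the hottest drive temperature (Celsius) to a PWM duty value.

    At or below low_c the fans run at pwm_min, at or above high_c they run at
    pwm_max, and in between the value is interpolated linearly.
    All defaults are illustrative assumptions, not recommended settings.
    """
    if hottest_drive_c <= low_c:
        return pwm_min
    if hottest_drive_c >= high_c:
        return pwm_max
    fraction = (hottest_drive_c - low_c) / (high_c - low_c)
    return round(pwm_min + fraction * (pwm_max - pwm_min))


# Drive temperatures as reported in the smartctl tables earlier in the thread.
drive_temps = [29, 35, 31, 27, 31]
print(pwm_for_temperature(max(drive_temps)))
```

Keying off the hottest drive rather than an average means a single badly cooled bay is enough to speed the fans up.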

How are your rear fans working? I replaced mine (there's a thread discussing the model around here), so they are much quieter than the original ones, and I'm using a 120mm fan wall. Are the rear fans set to push air out of the case? With the front fan wall pulling air across the drives, and the rear fans ejecting air, the 4224 generally stays reasonably cool...

Fan noise is definitely noticeable on full, but almost inaudible when the server is idle....

magnumdoomguy · January 6, 2014

That script sounds awesome... I'll hunt for it. Definitely preferable to cranking the fans 24/7.

I also replaced the rear fans (and yes, they're definitely exhausting)... Also with Noctuas (love Noctua -- so powerful, yet so quiet).

Thanks for the tip on the script. That would be a great feature to make standard.

DoeBoye · January 7, 2014

No worries! I actually use two copies. One controls the header for my case fans, the other controls the header for my cpu fan. I've attached them. I like them to run slightly differently, and figured two scripts was just easier.

For the record, the script I am using is the one with the following versions:

# Version 1.0   Authored by Aiden
# Version 1.1	Modified by Dan Stroot to run in a loop. Does not require the user 
#               to add this to cron - just start it in your go file.
# Version 1.2	Modified by Guzzi - removed -d ata to work on sas controllers
#

I messed around a bit with the last version, and don't remember if I returned it to 'stock', so you may want to google around and find the original I used, in case I did make changes and they don't suit your system.

I call the scripts from my go file like so:

### Fan Speed Control ###
/boot/scripts/fan_speed.sh  # 
/boot/scripts/fan_speed_cpu.sh  # 
### END - Fan Speed Control ###

Cheers,

DB.

fan_speed.zip

SnickySnacks · January 7, 2014

I'll throw in my two cents here, too. Your drives are definitely running way too hot. As I recall mine don't get over 40 when running a parity check either. If yours are going north of 70 you have something off.

You say your rear fans are blowing out of the case, are the 120s in the middle blowing to the rear? Do you have your server in a closet or something?

I have basically the same setup and I would panic if I saw a drive top 50, much less 70.
