3TB Seagate Barracuda XT ST33000651AS redballs frequently.



So I've tried a few different betas. Beta 9 seemed to be the best overall for this issue, but even there the redball happens from time to time. I'm thinking unRAID beta + 3TB Seagate Barracuda XT ST33000651AS = BAD

 

Current setup:

unRAID Server Pro v5.0-beta14 with a 12-drive mix of 1TB and 2TB drives

GIGABYTE GA-EP45-UD3P, 4GB RAM, Core2 Duo

SeaSonic X750 Gold 750W

Patriot Xporter XT Boost 4GB Flash Drive

 

Here is a piece of the log file. The log gets clobbered because the error happens when least expected.

Dec 12 04:40:01 Tower syslogd 1.4.1: restart.

Dec 12 04:40:09 Tower emhttp: mdcmd: write: Input/output error

Dec 12 04:40:09 Tower kernel: mdcmd (21234): spindown 0

Dec 12 04:40:09 Tower kernel: md: disk0: ATA_OP e0 ioctl error: -5

Dec 12 04:40:19 Tower emhttp: mdcmd: write: Input/output error

Dec 12 04:40:19 Tower kernel: mdcmd (21235): spindown 0

Dec 12 04:40:19 Tower kernel: md: disk0: ATA_OP e0 ioctl error: -5

Dec 12 04:40:29 Tower emhttp: mdcmd: write: Input/output error

Dec 12 04:40:29 Tower kernel: mdcmd (21236): spindown 0
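A note on reading these lines: e0 is the ATA STANDBY IMMEDIATE opcode and -5 is EIO, the same "Input/output error" that emhttp reports, so each group of lines is a failed spindown request for disk0. To tally such failures from a saved syslog, a Python sketch along these lines could be used. It assumes the kernel line format shown in the excerpt; the function name and the sample list are made up for illustration.

```python
import re
from collections import Counter

# Matches kernel lines such as:
#   "Dec 12 04:40:09 Tower kernel: md: disk0: ATA_OP e0 ioctl error: -5"
IOCTL_ERROR = re.compile(
    r"kernel: md: (disk\d+): ATA_OP (\w+) ioctl error: (-?\d+)"
)

def count_ioctl_errors(lines):
    """Count failed ATA ioctls, keyed by (disk, opcode, errno)."""
    counts = Counter()
    for line in lines:
        match = IOCTL_ERROR.search(line)
        if match:
            disk, opcode, errno_text = match.groups()
            counts[(disk, opcode, int(errno_text))] += 1
    return counts

# Sample lines copied from the log excerpt in this post.
sample = [
    "Dec 12 04:40:09 Tower emhttp: mdcmd: write: Input/output error",
    "Dec 12 04:40:09 Tower kernel: mdcmd (21234): spindown 0",
    "Dec 12 04:40:09 Tower kernel: md: disk0: ATA_OP e0 ioctl error: -5",
    "Dec 12 04:40:19 Tower emhttp: mdcmd: write: Input/output error",
    "Dec 12 04:40:19 Tower kernel: mdcmd (21235): spindown 0",
    "Dec 12 04:40:19 Tower kernel: md: disk0: ATA_OP e0 ioctl error: -5",
]

for key, count in count_ioctl_errors(sample).items():
    print(key, count)
```

Run against a full syslog (for example, by passing an open file object instead of the sample list), this would show whether the failures are confined to one disk.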

 

I've replaced the power supply, replaced the SATA cable, unplugged and replugged the drive cables, run a memory check for 24 hours, and run the Seagate software to test the drive. Everything passes with no errors. I can rebuild the parity and it will function fine for a while, but eventually the issue happens again.

 

 

Is anyone running this drive and not having issues? Any recommendations on additional testing?

 

TIA

Link to comment

I'm going to need a little help. I tried from the command line and from unMENU, with the same error.

 

smartctl -t short -d ata /dev/sdb 2>&1

 

smartctl 5.40 2010-10-16 r3189 [i486-slackware-linux-gnu] (local build)

Copyright © 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

 

Smartctl: Device Read Identity Failed (not an ATA/ATAPI device)

 

A mandatory SMART command failed: exiting. To continue, add one or more '-T permissive' options.

 

If I remove the -d option and add -T permissive, smartctl says the test starts but exits right away with no output.

 

 

I tried smartctl -a -T permissive /dev/sdb

 

smartctl 5.40 2010-10-16 r3189 [i486-slackware-linux-gnu] (local build)

Copyright © 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

 

Short INQUIRY response, skip product id

SMART Health Status: OK

Read defect list: asked for grown list but didn't get it

 

Error Counter logging not supported

scsiModePageOffset: response length too short, resp_len=47 offset=50 bd_len=46

Device does not support Self Test logging

 

I would run a long test after I get the command correct.

 

Thanks

 

Link to comment

I would put a new drive in as parity until you get this straightened out. It certainly appears there is a problem with the drive, and you shouldn't be troubleshooting with your data at risk unless you have to.

 

The weird thing is that when I run the Seagate tools the drive passes without issue. I'm starting to think the drive just doesn't work well with unRAID.

 

For a test I'm gonna move the parity drive to the Supermicro HBA and run a fresh parity build. With 3TB drives being so new I want to rule out a motherboard BIOS issue.

 

Any suggestions are welcome. If it's a defective drive I will need to prove it to Seagate.

 

Thanks

Link to comment

Daniel,

 

I have similar problems with the ST33000651AS. I had it as my parity drive and it kept redballing. I did notice that it happened only when it tried to come out of sleep. This was running beta 14, and I believe beta 11 as well.

Running a full parity check using that drive, there is no issue. It's only when it goes to sleep and has to wake up that something goes wrong.

 

I removed it from the array and did a complete one-cycle preclear, and there were no problems. I have since moved to beta 11 and am not sure whether to reintroduce this drive into the array.

 

RS

Link to comment

Do you have any need for v. 5.0 beyond support for larger drives? You could reduce the size of the drive in SeaTools and run 4.7 if it is indeed an issue with the beta. Another option, if it is indeed a timing issue with coming out of sleep, is to set the drive to never sleep and see if your issues stop.

Link to comment

My system doesn't sleep. I think in my case it's being introduced in spin up.

 

It could be a timing issue. If I remember correctly, Tom had to patch the SATA driver to increase the timeout value.

 

I'll look for the patch.

 

Nice, I didn't think of that. I was about to buy a 2TB. At today's prices that would hurt.

 

 

Thanks everyone

Link to comment

So here's where I'm at

 

I saved a drive order, renamed the 3 password files in the config dir on the flash drive, reloaded 4.7 and rebooted. After I reorganized the drives and determined the proper device ID for the 3TB drive, I used hdparm to short stroke it down to 2.19TB using the commands below. The (X) gets replaced with the letter of the drive you want to reconfigure. In my case it is k (/dev/sdk).

 

First I reset the MBR and partition table and then I change the usable space.

dd if=/dev/zero count=200 of=/dev/sd(X)

hdparm -N p4280000000 --yes-i-know-what-i-am-doing /dev/sd(X)
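The sector count passed to hdparm -N can be sanity-checked with a little arithmetic. The sketch below assumes 512-byte logical sectors (the post does not state the sector size) and compares the result with the 2^32-sector ceiling of an MBR partition table. The constant and function names are illustrative only.

```python
SECTOR_BYTES = 512          # assumed logical sector size
MBR_MAX_SECTORS = 2 ** 32   # MBR stores sector counts in 32 bits

def capacity_bytes(sectors, sector_bytes=SECTOR_BYTES):
    """Visible capacity in bytes for a given max-sector setting."""
    return sectors * sector_bytes

requested = 4_280_000_000   # the value given to hdparm -N above

print(capacity_bytes(requested))        # 2191360000000 bytes, about 2.19 TB (decimal)
print(capacity_bytes(MBR_MAX_SECTORS))  # 2199023255552 bytes
print(requested <= MBR_MAX_SECTORS)     # True
```

Under that assumption the chosen value sits just below the 32-bit limit, which is consistent with the 2.19TB figure quoted in the post.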

 

I did try using SeaTools but the settings wouldn't stick. After a proper shutdown (I read that you must power cycle), I started my system and did some light checking using unMENU and the hdparm info and SMART status commands. The drive reported as 2TB and the system responded as expected. I assigned the drive as my parity and started a sync. It's been running the better part of the night.

 


For now things seem to be heading in the right direction, but I was also able to complete a sync using the full 3TB on the 5.0 beta.

 

I still have some testing to do before I can say this is resolved. Hopefully the redballing is just a beta issue that has yet to be addressed.

 

Thanks for the suggestions.

Link to comment

It's been a few days and things are looking good. I've added data, moved files, performed deletes and started a non-correcting parity check.

 

It's looking more and more like my issue might be Beta related. None of the hardware nor the physical connections have changed. I'm going to give it some more time and another parity check.  ;)

 

Link to comment
  • 2 weeks later...

Here is a recent long SMART report. I reviewed the parameters mentioned in the wiki and things appear fine. It would be great if someone with more experience could confirm my findings. Thanks

 

=== START OF INFORMATION SECTION ===

Device Model:    ST33000651AS

Firmware Version: CC45

User Capacity:    2,199,021,142,016 bytes

Device is:        Not in smartctl database [for details use: -P showall]

ATA Version is:  8

ATA Standard is:  ATA-8-ACS revision 4

Local Time is:    Sat Dec 31 08:29:29 2011 EST

SMART support is: Available - device has SMART capability.

SMART support is: Enabled

 

=== START OF READ SMART DATA SECTION ===

SMART overall-health self-assessment test result: PASSED

 

General SMART Values:

Offline data collection status:  (0x82) Offline data collection activity

was completed without error.

Auto Offline Data Collection: Enabled.

Self-test execution status:      (  0) The previous self-test routine completed

without error or no self-test has ever

been run.

Total time to complete Offline

data collection: ( 609) seconds.

Offline data collection

capabilities: (0x7b) SMART execute Offline immediate.

Auto Offline data collection on/off support.

Suspend Offline collection upon new

command.

Offline surface scan supported.

Self-test supported.

Conveyance Self-test supported.

Selective Self-test supported.

SMART capabilities:            (0x0003) Saves SMART data before entering

power-saving mode.

Supports SMART auto save timer.

Error logging capability:        (0x01) Error logging supported.

General Purpose Logging supported.

Short self-test routine

recommended polling time: (  1) minutes.

Extended self-test routine

recommended polling time: ( 255) minutes.

Conveyance self-test routine

recommended polling time: (  2) minutes.

SCT capabilities:       (0x103f) SCT Status supported.

SCT Feature Control supported.

SCT Data Table supported.

 

SMART Attributes Data Structure revision number: 10

Vendor Specific SMART Attributes with Thresholds:

ID# ATTRIBUTE_NAME          FLAG    VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE

  1 Raw_Read_Error_Rate    0x000f  110  099  006    Pre-fail  Always      -      27876970

  3 Spin_Up_Time            0x0003  090  090  000    Pre-fail  Always      -      0

  4 Start_Stop_Count        0x0032  100  100  020    Old_age  Always      -      140

  5 Reallocated_Sector_Ct  0x0033  100  100  036    Pre-fail  Always      -      0

  7 Seek_Error_Rate        0x000f  064  060  030    Pre-fail  Always      -      3127611

  9 Power_On_Hours          0x0032  099  099  000    Old_age  Always      -      1108

10 Spin_Retry_Count        0x0013  100  100  097    Pre-fail  Always      -      0

12 Power_Cycle_Count      0x0032  100  100  020    Old_age  Always      -      21

183 Runtime_Bad_Block      0x0032  099  099  000    Old_age  Always      -      1

184 End-to-End_Error        0x0032  100  100  099    Old_age  Always      -      0

187 Reported_Uncorrect      0x0032  100  100  000    Old_age  Always      -      0

188 Command_Timeout        0x0032  100  100  000    Old_age  Always      -      0

189 High_Fly_Writes        0x003a  093  093  000    Old_age  Always      -      7

190 Airflow_Temperature_Cel 0x0022  071  058  045    Old_age  Always      -      29 (Lifetime Min/Max 21/33)

191 G-Sense_Error_Rate      0x0032  100  100  000    Old_age  Always      -      0

192 Power-Off_Retract_Count 0x0032  100  100  000    Old_age  Always      -      18

193 Load_Cycle_Count        0x0032  100  100  000    Old_age  Always      -      142

194 Temperature_Celsius    0x0022  029  042  000    Old_age  Always      -      29 (0 19 0 0)

195 Hardware_ECC_Recovered  0x001a  014  008  000    Old_age  Always      -      27876970

197 Current_Pending_Sector  0x0012  100  100  000    Old_age  Always      -      0

198 Offline_Uncorrectable  0x0010  100  100  000    Old_age  Offline      -      0

199 UDMA_CRC_Error_Count    0x003e  200  200  000    Old_age  Always      -      0

240 Head_Flying_Hours      0x0000  100  253  000    Old_age  Offline      -      270200687559155

241 Total_LBAs_Written      0x0000  100  253  000    Old_age  Offline      -      3268256658

242 Total_LBAs_Read        0x0000  100  253  000    Old_age  Offline      -      3461506823

 

SMART Error Log Version: 1

No Errors Logged
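For anyone who would rather not check the table by eye, here is a rough Python sketch that parses attribute rows like the ones above and lists watched attributes whose raw value is not zero. It assumes the ten-column layout shown in this report. The function names and the choice of attribute IDs to watch are illustrative, not an official list.

```python
def parse_smart_attributes(text):
    """Parse smartctl attribute rows into {id: {name, value, worst, thresh, raw}}."""
    attrs = {}
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        attrs[int(fields[0])] = {
            "name": fields[1],
            "value": int(fields[3]),
            "worst": int(fields[4]),
            "thresh": int(fields[5]),
            "raw": fields[9],  # first token of RAW_VALUE only
        }
    return attrs

def nonzero_raw(attrs, watch_ids):
    """Return {id: raw} for watched attributes whose raw value is not "0"."""
    return {
        attr_id: attrs[attr_id]["raw"]
        for attr_id in watch_ids
        if attr_id in attrs and attrs[attr_id]["raw"] != "0"
    }

# A few rows copied from the report above.
report = """\
ID# ATTRIBUTE_NAME          FLAG    VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct  0x0033  100  100  036    Pre-fail  Always      -      0
183 Runtime_Bad_Block      0x0032  099  099  000    Old_age  Always      -      1
197 Current_Pending_Sector  0x0012  100  100  000    Old_age  Always      -      0
198 Offline_Uncorrectable  0x0010  100  100  000    Old_age  Offline      -      0
199 UDMA_CRC_Error_Count    0x003e  200  200  000    Old_age  Always      -      0
"""

attrs = parse_smart_attributes(report)
print(nonzero_raw(attrs, [5, 197, 198, 199]))  # {}
print(nonzero_raw(attrs, [183]))               # {183: '1'}
```

Saving the output of each test run and comparing the raw values over time would show whether any of these counters is growing.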

Link to comment

I would question this:

192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       18

 

It frequently indicates the disk heads were retracted in an emergency due to a power loss (18 times).

 

That doesn't surprise me. When I first picked up the drive and the Supermicro card I had a 550 watt supply, which resulted in all kinds of boot issues. I've upgraded to a 750 watt supply, so power should not be an issue for the immediate future.

 

I'll keep an eye on that metric just to be sure. With that in mind, what would be a good testing plan? Should I wait a week or a month, retest with the short or the long test, and how many more times?

 

Thanks

Link to comment

Every one of my Hitachi 3TB drives shows a 192 Power-Off_Retract_Count that matches its Load_Cycle_Count.

It's definitely not an underpowered supply in my case.

 

Today's redball:

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
 1 Raw_Read_Error_Rate     0x000b   100   100   016    Pre-fail  Always       -       0
 2 Throughput_Performance  0x0005   134   134   054    Pre-fail  Offline      -       110
 3 Spin_Up_Time            0x0007   127   127   024    Pre-fail  Always       -       545 (Average 552)
 4 Start_Stop_Count        0x0012   100   100   000    Old_age   Always       -       342
 5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0
 7 Seek_Error_Rate         0x000b   100   100   067    Pre-fail  Always       -       0
 8 Seek_Time_Performance   0x0005   132   132   020    Pre-fail  Offline      -       32
 9 Power_On_Hours          0x0012   100   100   000    Old_age   Always       -       3750
10 Spin_Retry_Count        0x0013   100   100   060    Pre-fail  Always       -       0
12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       69
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       355
193 Load_Cycle_Count        0x0012   100   100   000    Old_age   Always       -       355
194 Temperature_Celsius     0x0002   187   187   000    Old_age   Always       -       32 (Min/Max 16/40)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       0

 

My Hitachis are suffering from this too, on beta 12. I have half on an M1015 and half on a SASLP-MV8. They all exhibit the same behavior.

 

The drives tend to redball when my mover kicks off. It is always the drive that is next in line for data, or should I say, the drive the mover wants to write to. I have 12 3TB drives in my system and most of them, if not all of them, have redballed at some point by now. I have not tested beta 13/14 due to the LSI bug. If I set my drives to never sleep, I have no issues.

 

I plan to roll back to beta 11 today and see if this stops. I am also going to physically relocate my server today in case it is something else, like vibration in the rack it is in.

 

If someone can tell me how to modify my mover script to "spin up all drives first", that might  prevent the daily red ball issue.

Or a cron job that kicks off a few minutes before the mover should have the same effect.
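One way such a pre-mover job could look is sketched below in Python. It is only an illustration, assuming Python is available on the server and that reading a single block from each device is enough to wake it; it is not the mover script itself, and the function names and device paths are hypothetical.

```python
import os
import random

BLOCK = 4096  # bytes to read per device; an arbitrary small size

def touch_device(path, block_size=BLOCK):
    """Read one block at a random offset and return the number of bytes read.

    A random offset makes it less likely that the read is answered from the
    page cache, so the drive itself has to respond (and spin up if needed).
    """
    with open(path, "rb", buffering=0) as dev:
        size = dev.seek(0, os.SEEK_END)  # total size in bytes
        blocks = size // block_size
        offset = random.randrange(blocks) * block_size if blocks else 0
        dev.seek(offset)
        return len(dev.read(block_size))

def spin_up_all(paths):
    """Touch every device; map each path to bytes read or to the error text."""
    results = {}
    for path in paths:
        try:
            results[path] = touch_device(path)
        except OSError as exc:
            results[path] = str(exc)
    return results
```

Calling spin_up_all(["/dev/sdb", "/dev/sdc"]) (placeholder device names) from a cron entry scheduled a few minutes ahead of the mover would give every drive time to come ready before the first write. Whether that avoids the redball is untested.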

 

I am trying to avoid keeping my drives spinning 24x7

 

 

Link to comment

Joe wrote a spin drives script that can give you the basic routine. Should be easy to pull what you need and add it to cron just before mover starts.  Look here http://lime-technology.com/forum/index.php?topic=1035.15

 

Spinning up first may help reduce the issue but I'm not so sure it will eliminate it. I don't have a cache drive, so the mover is not involved, and the drive still redballs.

 

I too am running a Supermicro PCIe HBA. The M1015 is sitting on the side waiting for a working beta. I sure hope the issue gets resolved soon. I can't see buying a drive under 3TB at this point.

 

Can anyone advise on SMART testing? I just don't want to run it so often that I stress the 3TB drive so early in the game.

Link to comment
  • 7 months later...

I hate to bring up an old thread, but was there any resolution for this problem?

Yes, use the newest beta, "5.0rc6-test2" (the newest as of this post), or anything newer once a newer release exists.

 

The timeout for the drives was addressed in one of the intermediate betas.

 

Joe L.

Link to comment

Thanks. I am using the latest LSL version, test2, and still have the problem. When I try to write directly to the drive after it has been spun down for 18-48 hours, it redballs. I have no problem if the drive has been spun down for less than 18 hours. These drives are on LSI controllers, but I took them off and connected them to the SATA ports on my X9SCM motherboard, with the same result.

 

I am currently waiting for 24-48 hours to pass to see whether I can read data from this drive OK.

 

 

Link to comment

For anyone still interested in this topic, here are my results.

 

I tried reading data from the disk. While it did not redball, I could not get any data from the drive until I did a complete power down. So it looks like the only way to use these disks in unRAID is to disable spindown, or to spin up the drives before trying to access them, which is not an option if trying to play a movie from this drive.

 

I have 6 ST3000DM001 in the same array that work fine.

 

Anybody know what a used ST33000651AS is worth?

Link to comment

ct1478, what bios do you have flashed on your LSI controllers? There were some users that had issues even in the earlier 5.0 beta series (beta 12 or so) that had really ancient firmware flashed on their LSI cards while the rest of us running newer/current firmware have not had issues.

 

Link to comment

I have version 14 software on the LSI controller, and I get the same results using the motherboard ports. I think the drive goes to sleep rather than standby after several hours of not being used, and won't wake up until it gets the right command, which I don't think it is getting in unRAID.

 

If there was a utility I could use to disable the sleep mode on the drive I think it would work.

I tried contacting Seagate and got no response at all

Link to comment
