daniel.boone Posted December 12, 2011

So I've tried a few different betas. Beta 9 seemed to be the best overall for this issue, but even there the redball still happens from time to time. I'm thinking unRAID beta + 3TB Seagate Barracuda XT ST33000651AS = BAD.

Current setup:

unRAID Server Pro v5.0-beta14 with a 12-drive mix of 1TB and 2TB drives
GIGABYTE GA-EP45-UD3P
4GB RAM
Core2 Duo
SeaSonic X750 Gold 750W
Patriot Xporter XT Boost 4GB Flash Drive

Here is a piece of the log file. The log gets clobbered since the error happens when least expected.

    Dec 12 04:40:01 Tower syslogd 1.4.1: restart.
    Dec 12 04:40:09 Tower emhttp: mdcmd: write: Input/output error
    Dec 12 04:40:09 Tower kernel: mdcmd (21234): spindown 0
    Dec 12 04:40:09 Tower kernel: md: disk0: ATA_OP e0 ioctl error: -5
    Dec 12 04:40:19 Tower emhttp: mdcmd: write: Input/output error
    Dec 12 04:40:19 Tower kernel: mdcmd (21235): spindown 0
    Dec 12 04:40:19 Tower kernel: md: disk0: ATA_OP e0 ioctl error: -5
    Dec 12 04:40:29 Tower emhttp: mdcmd: write: Input/output error
    Dec 12 04:40:29 Tower kernel: mdcmd (21236): spindown 0

I've replaced the power supply, replaced the SATA cable, unplugged and replugged the drive cables, completed a 24-hour memcheck, and run the Seagate software to test the drive. The system passes with no errors. I can rebuild the parity and it will function fine for a while. Eventually the issue happens again.

Is anyone running this drive and not having issues? Any recommendations on additional testing?

TIA
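Since the log gets clobbered, it can help to pull out just the failing lines whenever a copy of the syslog is available. The following is a minimal sketch, not an unRAID tool: the function name `count_ioctl_errors` and the regular expression are assumptions built only from the log excerpt in the first post. It counts the "ATA_OP ... ioctl error" lines per disk, ATA opcode, and error code. For reference, errno 5 is EIO (Input/output error) and ATA opcode e0 is STANDBY IMMEDIATE, which fits the "spindown 0" lines logged alongside it.

```python
import re
from collections import Counter

# Matches kernel lines such as:
#   Dec 12 04:40:09 Tower kernel: md: disk0: ATA_OP e0 ioctl error: -5
# The pattern is an assumption derived from the excerpt in this thread only.
IOCTL_ERROR = re.compile(
    r"kernel: md: (?P<disk>disk\d+): ATA_OP (?P<op>[0-9a-f]+) "
    r"ioctl error: (?P<errno>-?\d+)"
)


def count_ioctl_errors(log_text):
    """Count ioctl error lines, keyed by (disk, ata_opcode, errno)."""
    counts = Counter()
    for line in log_text.splitlines():
        match = IOCTL_ERROR.search(line)
        if match:
            key = (match["disk"], match["op"], int(match["errno"]))
            counts[key] += 1
    return counts


# Sample taken verbatim from the log excerpt in the first post.
SAMPLE = """\
Dec 12 04:40:01 Tower syslogd 1.4.1: restart.
Dec 12 04:40:09 Tower emhttp: mdcmd: write: Input/output error
Dec 12 04:40:09 Tower kernel: mdcmd (21234): spindown 0
Dec 12 04:40:09 Tower kernel: md: disk0: ATA_OP e0 ioctl error: -5
Dec 12 04:40:19 Tower emhttp: mdcmd: write: Input/output error
Dec 12 04:40:19 Tower kernel: mdcmd (21235): spindown 0
Dec 12 04:40:19 Tower kernel: md: disk0: ATA_OP e0 ioctl error: -5
Dec 12 04:40:29 Tower emhttp: mdcmd: write: Input/output error
Dec 12 04:40:29 Tower kernel: mdcmd (21236): spindown 0
"""

if __name__ == "__main__":
    for (disk, op, errno), n in count_ioctl_errors(SAMPLE).items():
        print(f"{disk}: ATA_OP {op} failed with errno {errno} {n} time(s)")
```

Run against a saved copy of the full syslog, the same function would show whether the failures are confined to one disk and one opcode or spread more widely.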
mbryanr Posted December 12, 2011

Post a SMART report for the drive.
daniel.boone (Author) Posted December 12, 2011

I'm going to need a little help. I tried from the command line and from unmenu and got the same error.

    smartctl -t short -d ata /dev/sdb 2>&1
    smartctl 5.40 2010-10-16 r3189 [i486-slackware-linux-gnu] (local build)
    Copyright © 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

    Smartctl: Device Read Identity Failed (not an ATA/ATAPI device)

    A mandatory SMART command failed: exiting. To continue, add one or more '-T permissive' options.

If I remove the -d option and add -T permissive, SMART says it starts but exits right away with no output.

I tried:

    smartctl -a -T permissive /dev/sdb
    smartctl 5.40 2010-10-16 r3189 [i486-slackware-linux-gnu] (local build)
    Copyright © 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

    Short INQUIRY response, skip product id
    SMART Health Status: OK
    Read defect list: asked for grown list but didn't get it
    Error Counter logging not supported
    scsiModePageOffset: response length too short, resp_len=47 offset=50 bd_len=46
    Device does not support Self Test logging

I would run a long test once I get the command correct.

Thanks
ohlwiler Posted December 12, 2011

I would put a new drive in as parity until you get this straightened out. It certainly appears there is a problem with the drive, and you shouldn't be troubleshooting with your data at risk unless you have to.
daniel.boone (Author) Posted December 12, 2011

ohlwiler wrote:
> I would put a new drive in as parity until you get this straightened out. It certainly appears there is a problem with the drive, and you shouldn't be troubleshooting with your data at risk unless you have to.

The weird thing is that when I run the Seagate tools the drive passes without issue. I'm starting to think the drive just doesn't work well with unRAID. As a test I'm going to move the parity drive to the Supermicro HBA and run a fresh parity build. With the 3TB drives being so new, I want to rule out a motherboard BIOS issue. Any suggestions are welcome. If it's a defective drive I will need to prove it to Seagate.

Thanks
rs1932 Posted December 13, 2011

Daniel, I have similar problems with the ST33000651AS. I had it as my parity drive and it kept redballing. I did notice that it happened only when it tried to come out of sleep. This was running Beta 14, and I believe beta 11 as well. Running a full parity check using that drive, there is no issue. It's only when it goes to sleep and has to wake up that something goes wrong. I removed it from the array and did a complete one-cycle preclear, and there were no problems. I have since moved to beta 11 and am not sure whether to reintroduce this drive into the array.

RS
WeeboTech Posted December 13, 2011

It could be a timing issue. If I remember correctly, Tom had to patch the SATA driver to increase the timeout value.
jimwhite Posted December 13, 2011

I am having similar issues with Samsung 2TB drives in Beta 9. My server is made up of 16 of them on LSI controllers.
ohlwiler Posted December 13, 2011

Do you have any need for v5.0 beyond support for larger drives? You could reduce the size of the drive in SeaTools and run 4.7 if it is indeed an issue with the beta.

Another option, if it is indeed a timing issue with coming out of sleep, is to set the drive to never sleep and see if your issues stop.
daniel.boone (Author) Posted December 13, 2011

rs1932 wrote:
> Daniel, I have similar problems with the ST33000651AS. I had it as my parity drive and it kept redballing. I did notice that it happened only when it tried to come out of sleep. This was running Beta 14, and I believe beta 11 as well. Running a full parity check using that drive, there is no issue. It's only when it goes to sleep and has to wake up that something goes wrong. I removed it from the array and did a complete one-cycle preclear, and there were no problems. I have since moved to beta 11 and am not sure whether to reintroduce this drive into the array. RS

My system doesn't sleep. I think in my case the problem is being introduced at spin-up.

WeeboTech wrote:
> It could be a timing issue. If I remember correctly, Tom had to patch the SATA driver to increase the timeout value.

I'll look for the patch.

ohlwiler wrote:
> Do you have any need for v5.0 beyond support for larger drives? You could reduce the size of the drive in SeaTools and run 4.7 if it is indeed an issue with the beta. Another option, if it is indeed a timing issue with coming out of sleep, is to set the drive to never sleep and see if your issues stop.

Nice, I didn't think of that. I was about to buy a 2TB. At today's prices that would hurt.

Thanks everyone
daniel.boone (Author) Posted December 14, 2011

So here's where I'm at.

I saved the drive order, renamed the 3 password files in the config dir on the flash drive, reloaded 4.7, and rebooted. After I reorganized the drives and determined the proper device ID for the 3TB drive, I used hdparm to short-stroke the 3TB drive down to 2.19TB using the commands below. The (X) gets replaced with the ID of the drive you want to reconfigure. In my case it is K. First I reset the MBR and partition table, and then I change the usable space.

    dd if=/dev/zero count=200 of=/dev/sd(X)
    hdparm -N p4280000000 --yes-i-know-what-i-am-doing /dev/sd(X)

I did try using SeaTools but the settings wouldn't stick.

After a proper shutdown (I read you must power cycle), I started my system and performed some light checking using unmenu and the hdparm info and SMART status commands. The drive reported as 2TB and the system responded as expected. I assigned the drive as my parity and started a sync. It's been running the better part of the night.

For the present things seem to be heading in the right direction, but then I was also able to perform a sync using the full 3TB under the 5.0 beta. I still have some testing to do before I can say this is resolved. Hopefully the redballing is just a beta issue that has yet to be addressed.

Thanks for the suggestions.
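As a quick check on the numbers in the post above, here is a small sketch of the arithmetic behind the hdparm -N value. It assumes the drive reports 512-byte logical sectors (not stated in the thread), and it compares the clipped size against the 2^32-sector ceiling of 32-bit sector addressing, which is presumably why a drive has to stay under roughly 2.2TB to be usable with 4.7. The constant and function names are illustrative only.

```python
SECTOR_BYTES = 512  # assumption: 512-byte logical sectors


def capacity_bytes(sectors, sector_bytes=SECTOR_BYTES):
    """Convert a sector count into a capacity in bytes."""
    return sectors * sector_bytes


HDPARM_SECTORS = 4_280_000_000  # the value passed to hdparm -N in the post
LBA32_SECTORS = 2**32           # sectors addressable with a 32-bit sector number

clipped = capacity_bytes(HDPARM_SECTORS)
ceiling = capacity_bytes(LBA32_SECTORS)

print(f"clipped capacity: {clipped:,} bytes ({clipped / 1e12:.2f} TB)")
print(f"32-bit ceiling:   {ceiling:,} bytes ({ceiling / 1e12:.2f} TB)")
print(f"fits under the ceiling: {HDPARM_SECTORS < LBA32_SECTORS}")
```

Under the 512-byte assumption, 4,280,000,000 sectors works out to 2,191,360,000,000 bytes, which matches the 2.19TB figure quoted in the post and sits just below the 2,199,023,255,552-byte ceiling.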
daniel.boone (Author) Posted December 20, 2011

It's been a few days and things are looking good. I've added data, moved files, performed deletes, and started a non-correcting parity check. It's looking more and more like my issue might be beta related. Neither the hardware nor the physical connections have changed. I'm going to give it some more time and another parity check.
daniel.boone (Author) Posted December 31, 2011

Here is a recent long SMART report. I reviewed the parameters mentioned in the wiki and things appear fine. It would be great if someone with more experience could confirm my findings.

Thanks

    === START OF INFORMATION SECTION ===
    Device Model:     ST33000651AS
    Firmware Version: CC45
    User Capacity:    2,199,021,142,016 bytes
    Device is:        Not in smartctl database [for details use: -P showall]
    ATA Version is:   8
    ATA Standard is:  ATA-8-ACS revision 4
    Local Time is:    Sat Dec 31 08:29:29 2011 EST
    SMART support is: Available - device has SMART capability.
    SMART support is: Enabled

    === START OF READ SMART DATA SECTION ===
    SMART overall-health self-assessment test result: PASSED

    General SMART Values:
    Offline data collection status:  (0x82) Offline data collection activity
                                            was completed without error.
                                            Auto Offline Data Collection: Enabled.
    Self-test execution status:      (   0) The previous self-test routine completed
                                            without error or no self-test has ever
                                            been run.
    Total time to complete Offline
    data collection:                 ( 609) seconds.
    Offline data collection
    capabilities:                    (0x7b) SMART execute Offline immediate.
                                            Auto Offline data collection on/off support.
                                            Suspend Offline collection upon new command.
                                            Offline surface scan supported.
                                            Self-test supported.
                                            Conveyance Self-test supported.
                                            Selective Self-test supported.
    SMART capabilities:            (0x0003) Saves SMART data before entering
                                            power-saving mode.
                                            Supports SMART auto save timer.
    Error logging capability:        (0x01) Error logging supported.
                                            General Purpose Logging supported.
    Short self-test routine
    recommended polling time:        (   1) minutes.
    Extended self-test routine
    recommended polling time:        ( 255) minutes.
    Conveyance self-test routine
    recommended polling time:        (   2) minutes.
    SCT capabilities:              (0x103f) SCT Status supported.
                                            SCT Feature Control supported.
                                            SCT Data Table supported.

    SMART Attributes Data Structure revision number: 10
    Vendor Specific SMART Attributes with Thresholds:
    ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
      1 Raw_Read_Error_Rate     0x000f   110   099   006    Pre-fail  Always       -       27876970
      3 Spin_Up_Time            0x0003   090   090   000    Pre-fail  Always       -       0
      4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       140
      5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
      7 Seek_Error_Rate         0x000f   064   060   030    Pre-fail  Always       -       3127611
      9 Power_On_Hours          0x0032   099   099   000    Old_age   Always       -       1108
     10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
     12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       21
    183 Runtime_Bad_Block       0x0032   099   099   000    Old_age   Always       -       1
    184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
    187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
    188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0
    189 High_Fly_Writes         0x003a   093   093   000    Old_age   Always       -       7
    190 Airflow_Temperature_Cel 0x0022   071   058   045    Old_age   Always       -       29 (Lifetime Min/Max 21/33)
    191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
    192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       18
    193 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       142
    194 Temperature_Celsius     0x0022   029   042   000    Old_age   Always       -       29 (0 19 0 0)
    195 Hardware_ECC_Recovered  0x001a   014   008   000    Old_age   Always       -       27876970
    197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
    198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
    199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
    240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       270200687559155
    241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       3268256658
    242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       3461506823

    SMART Error Log Version: 1
    No Errors Logged
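For anyone who wants to check a report like the one above without reading every row by eye, here is a minimal sketch of a parser for the attribute table. It is not part of smartmontools; the names `parse_attributes` and `find_warnings` are made up for illustration. It applies two rules: the normalized VALUE at or below a nonzero THRESH, which is how smartctl marks an attribute as failed, and a nonzero raw count on attributes 5, 197, and 198. The choice of those three IDs is an assumption about which raw values are worth watching, not an official list.

```python
import re

# One attribute row: ID, name, flag, value, worst, thresh, type, updated,
# when_failed, and the leading integer of the raw value.
ROW = re.compile(
    r"^\s*(?P<id>\d+)\s+(?P<name>\S+)\s+(?P<flag>0x[0-9a-f]{4})\s+"
    r"(?P<value>\d+)\s+(?P<worst>\d+)\s+(?P<thresh>\d+)\s+"
    r"(?P<type>\S+)\s+(?P<updated>\S+)\s+(?P<failed>\S+)\s+(?P<raw>\d+)"
)

# Assumption: reallocated, pending, and offline-uncorrectable sector counts
# are the raw values to flag when they are nonzero.
WATCH_RAW = {5, 197, 198}


def parse_attributes(report):
    """Return one dict per attribute row found in smartctl -A style text."""
    rows = []
    for line in report.splitlines():
        m = ROW.match(line)
        if m:
            rows.append({
                "id": int(m["id"]),
                "name": m["name"],
                "value": int(m["value"]),
                "worst": int(m["worst"]),
                "thresh": int(m["thresh"]),
                "raw": int(m["raw"]),
            })
    return rows


def find_warnings(rows):
    """Return human-readable warnings for rows that trip either rule."""
    warnings = []
    for r in rows:
        if r["thresh"] > 0 and r["value"] <= r["thresh"]:
            warnings.append(
                f"{r['name']}: value {r['value']} at or below threshold {r['thresh']}"
            )
        if r["id"] in WATCH_RAW and r["raw"] > 0:
            warnings.append(f"{r['name']}: raw count is {r['raw']}")
    return warnings


# A few rows copied from the report above.
SAMPLE_REPORT = """\
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate 0x000f 110 099 006 Pre-fail Always - 27876970
  5 Reallocated_Sector_Ct 0x0033 100 100 036 Pre-fail Always - 0
192 Power-Off_Retract_Count 0x0032 100 100 000 Old_age Always - 18
197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 0
"""

if __name__ == "__main__":
    problems = find_warnings(parse_attributes(SAMPLE_REPORT))
    print(problems if problems else "no warnings under these two rules")
```

Neither rule says anything about attribute 192, so a judgment about Power-Off_Retract_Count still has to be made by a person.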
Joe L. Posted December 31, 2011

I would question this:

    192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       18

It frequently indicates the disk heads were retracted in an emergency due to a power loss (18 times).
daniel.boone (Author) Posted December 31, 2011

Joe L. wrote:
> I would question this:
> 192 Power-Off_Retract_Count 0x0032 100 100 000 Old_age Always - 18
> It frequently indicates the disk heads were retracted in an emergency due to a power loss (18 times).

That doesn't surprise me. When I first picked up the drive and the Supermicro card I had a 550 watt supply, which resulted in all kinds of boot issues. I've upgraded to a 750, so power should not be an issue for the immediate future. I'll keep an eye on that metric just to be sure.

With that in mind, what would be a good testing plan? Should I wait a week or a month, should I retest with the short or the long test, and how many more times?

Thanks
Johnm Posted December 31, 2011

Joe L. wrote:
> I would question this:
> 192 Power-Off_Retract_Count 0x0032 100 100 000 Old_age Always - 18
> It frequently indicates the disk heads were retracted in an emergency due to a power loss (18 times).

Every one of my Hitachi 3TB drives gives me a 192 Power-Off_Retract_Count that matches my Load_Cycle_Count. It is definitely not an underpower issue in my case.

Today's redball:

    ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
      1 Raw_Read_Error_Rate     0x000b   100   100   016    Pre-fail  Always       -       0
      2 Throughput_Performance  0x0005   134   134   054    Pre-fail  Offline      -       110
      3 Spin_Up_Time            0x0007   127   127   024    Pre-fail  Always       -       545 (Average 552)
      4 Start_Stop_Count        0x0012   100   100   000    Old_age   Always       -       342
      5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0
      7 Seek_Error_Rate         0x000b   100   100   067    Pre-fail  Always       -       0
      8 Seek_Time_Performance   0x0005   132   132   020    Pre-fail  Offline      -       32
      9 Power_On_Hours          0x0012   100   100   000    Old_age   Always       -       3750
     10 Spin_Retry_Count        0x0013   100   100   060    Pre-fail  Always       -       0
     12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       69
    192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       355
    193 Load_Cycle_Count        0x0012   100   100   000    Old_age   Always       -       355
    194 Temperature_Celsius     0x0002   187   187   000    Old_age   Always       -       32 (Min/Max 16/40)
    196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
    197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
    198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
    199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       0

My Hitachis are suffering from this also, in Beta 12. I have half on an M1015 and half on a SASLP-MV8. They all exhibit the same behavior.

The drives tend to redball when my mover kicks off. It is always the drive that is next in line for data, or should I say, the drive the mover wants to write to.

I have 12 3TB drives in my system and most of them, if not all of them, have redballed at some point by now. I have not tested in Beta 13/14 due to the LSI bug.

If I set my drives to never sleep, I have no issues.

I plan to roll back to 11 today and see if this stops. I am also going to physically relocate my server today in case it is something else, like vibration in the rack it is in.

If someone can tell me how to modify my mover script to "spin up all drives first", that might prevent the daily redball issue. A cron job that kicks off a few minutes before the mover should have the same effect. I am trying to avoid keeping my drives spinning 24x7.
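The "spin up all drives first" idea from the post above could be sketched roughly as follows. This is not Joe L.'s script and not anything shipped with unRAID: the function names and the device list are hypothetical. It assumes that reading a block which is not already in the page cache forces the drive to spin up, which is why it reads at a random offset rather than always at the start of the disk. It would need to run as root, for example from a cron entry scheduled a few minutes before the mover.

```python
import os
import random


def touch_device(path, block=4096):
    """Read one block at a random aligned offset; return the bytes read."""
    fd = os.open(path, os.O_RDONLY)
    try:
        size = os.lseek(fd, 0, os.SEEK_END)   # device or file size in bytes
        blocks = max(size // block, 1)
        offset = random.randrange(blocks) * block
        os.lseek(fd, offset, os.SEEK_SET)
        return len(os.read(fd, block))
    finally:
        os.close(fd)


def spin_up(paths):
    """Touch every path; map each one to bytes read or to the OSError raised."""
    results = {}
    for path in paths:
        try:
            results[path] = touch_device(path)
        except OSError as exc:
            results[path] = exc
    return results


if __name__ == "__main__":
    # Hypothetical device list; substitute the array's actual devices.
    devices = ["/dev/sdb", "/dev/sdc"]
    for path, outcome in spin_up(devices).items():
        print(path, outcome)
```

Errors are collected instead of raised so that one unresponsive drive does not stop the others from being woken. As noted later in the thread, spinning up first may reduce the redballs without eliminating them.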
daniel.boone (Author) Posted January 1, 2012

Joe wrote a spin-drives script that can give you the basic routine. It should be easy to pull what you need and add it to cron just before the mover starts. Look here: http://lime-technology.com/forum/index.php?topic=1035.15

Spinning up may help reduce the issue, but I'm not so sure it will eliminate it. I don't have a cache drive, so the mover is not involved, and the drive still redballs. I too am running a Supermicro PCIe HBA. The M1015 is sitting on the side waiting for a working beta.

I sure hope the issue gets resolved soon. I can't see buying a drive under 3TB at this point.

Can anyone advise on SMART testing? I just don't want to run it so often that I stress the 3TB drive so early in the game.
ct1478 Posted August 11, 2012

I hate to bring up an old thread, but was there any resolution for this problem?
Joe L. Posted August 11, 2012

ct1478 wrote:
> I hate to bring up an old thread, but was there any resolution for this problem?

Yes, use the newest "5.0rc6-test2" beta (it is the newest as of this post), or a newer one once a newer release exists.

The timeout for the drives was addressed in one of the intermediate betas.

Joe L.
ct1478 Posted August 11, 2012

Thanks. I am using the latest LSL version test2 and still have the problem. When I try to write directly to the drive after it has been spun down for 18-48 hours, it redballs. I have no problem if the drive has been spun down for less than 18 hours. These drives are on LSI controllers, but I took them off and connected them to the SATA ports on my X9SCM motherboard, with the same result. I am currently waiting for the 24-48 hours to pass to see if I can read data from this drive OK.
ct1478 Posted August 12, 2012

For anyone still interested in this topic, here are my results. I tried reading data from the disk. While it did not redball, I could not get any data from the drive until I did a complete power down. So it looks like the only way to use these disks in unRAID is to disable spindown or to spin up the drives before trying to access them, which is not an option if you are trying to play a movie from this drive. I have 6 ST3000DM001 drives in the same array that work fine. Anybody know what a used ST33000651AS is worth?
BRiT Posted August 12, 2012

ct1478, what BIOS do you have flashed on your LSI controllers? There were some users who had issues even in the earlier 5.0 beta series (beta 12 or so) and who had really ancient firmware flashed on their LSI cards, while the rest of us running newer/current firmware have not had issues.

Sent from my Nexus 7 using Tapatalk 2 while sitting on the couch watching the 2012 Olympics and doing laundry but thinking about what to have for lunch.
ct1478 Posted August 12, 2012

I have version 14 software on the LSI controller, and I get the same results using the motherboard ports. I think the drive goes to sleep rather than standby after several hours of not being used, and won't wake up until it gets the right command, which I don't think it is getting in unRAID. If there were a utility I could use to disable the sleep mode on the drive, I think it would work. I tried contacting Seagate and got no response at all.