April 4, 2011
Hi,
So today I had a red dot next to one of my drives. The disk "failed" in the middle of streaming video, so I knew straight away that it had gone. I did a quick search and found a post where Joe recommended a quick test before replacing the disk: unassign the disk, reboot, reassign the disk, and let the array rebuild the data onto it.
All seems fine and the rebuild has 400 minutes or so left to run, but here's my question. I inadvertently hit play on the video that was streaming earlier when the disk failed (yes, I checked, and this video was on the failed disk) and to my surprise, it played, even though the disk was in the middle of being rebuilt!
Am I just realising one of the features of unRAID?
April 4, 2011
Quote: Am I just realising one of the features of unRAID?
Yep. You can still read data on the "missing" drive, including during rebuilds...
April 5, 2011
Quote: Yep. You can still read data on the "missing" drive, including during rebuilds...
And you can still write to the "missing" drive too.
In the early days of my unRAID server's life I simulated a failed drive and managed to play 4 different movies from the "missing" drive, while it was being rebuilt, to 4 different media players on my home LAN. The reconstruction slowed a bit for each additional movie played, but it worked just fine.
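A minimal sketch of why reads and writes to a "missing" disk can still work, assuming a single parity disk that holds the bytewise XOR of all data disks. The function names and the toy 8-byte "disks" are made up for illustration; this is not unRAID's actual code.

```python
from functools import reduce


def xor_blocks(blocks):
    """Bytewise XOR of equally sized byte blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def read_missing(surviving_disks, parity):
    """Reconstruct the absent disk's block from the survivors plus parity."""
    return xor_blocks(surviving_disks + [parity])


def write_missing(surviving_disks, new_data):
    """Return the parity that makes the absent disk 'contain' new_data."""
    return xor_blocks(surviving_disks + [new_data])


# Three toy data disks (hypothetical contents) and their parity block.
disks = [b"movie-A!", b"movie-B!", b"movie-C!"]
parity = xor_blocks(disks)

# Disk 1 "fails": its content is still readable from the other two plus parity.
rebuilt = read_missing([disks[0], disks[2]], parity)
print(rebuilt)  # b'movie-B!'

# A write to the failed disk only has to update parity.
new_parity = write_missing([disks[0], disks[2]], b"new-data")
print(read_missing([disks[0], disks[2]], new_parity))  # b'new-data'
```

Every read of the absent disk touches all the other disks, which is at least consistent with the rebuild slowing down a little for each extra movie being played.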
April 5, 2011
Quote: And, you can still write to the "missing" drive too.
Didn't know that, and that's nifty too... Makes sense, as parity is being recalculated. Can you write if the drive isn't being reconstructed?
Quote: The reconstruction slowed a bit for each additional movie played, but it worked just fine.
Is there any prioritization of different "types" of I/O? Or is it just evenly split between any ongoing I/O?
April 5, 2011
Quote: Can you write if the drive isn't being reconstructed?
Yes. You might not even know the drive has failed unless you check or unRAID sends an e-mail.
Quote: Is there any prioritization of different "types" of I/O? Or is it just evenly split between any ongoing I/O?
I don't know.
April 5, 2011
Quote: Can you write if the drive isn't being reconstructed?
Yes. As mentioned, unless you check the unRAID status screen or install an add-on that alerts you proactively, you might not notice that a disk has failed.
April 5, 2011 (Author)
Hi again,
For the second time in two days the server has dropped off the network and will not respond to the powerdown command, which forces me to do a hard reboot and then a parity check. Very annoying, and it has only just started to happen.
I attach the log; I would be grateful if someone could look at it and let me know if there is anything obvious.
Thanks
Attachment: syslog.txt
April 5, 2011
1. Have you run memtest on the system?
2. Can we please get a complete hardware breakdown?
3. Did you capture the above syslog before a reboot?
4. How did you get the syslog if the server went off the network?
April 5, 2011
Perform a short and a long SMART test on your Seagate 1.5TB drive (ST31500341AS, serial number 9VS01LLN) and post the results here. It is ata4 = md8 = sde.
April 5, 2011 (Author)
OK, so the SMART tests are underway and I will post the results here when they are done.
Using the following equipment (sorry for the formatting), listed as quantity x product description (QuickFind code): cost ex VAT
- 1 x EXTRA VALUE 4GB (2x2GB) DDR3 1333MHz Memory Kit 1.5V CL9 (192049): £26.04
- 1 x Corsair 650W TX Series PSU - 120mm Fan, 80+% Efficiency, Single +12V Rail (135514): £53.15
- 6 x Plexus Serial ATA 2.0 to Right Angle SATA 7-pin Cable (Red) 61cm / 24" (151962): £1.60 each, £9.60 total
- 1 x StarTech.com Panel Mount USB Cable USB A to Motherboard Header Cable (173395): £3.21
- 7 x Western Digital WD20EARS 2TB Hard Drive SATAII 64MB Cache - OEM Caviar Green (183971): £55.12 each, £385.84 total
- 1 x LG W1946S-BF TFT LCD 18.5" VGA Monitor (194601): £57.90
- 1 x Asus P7H55-M LX H55 Socket 1156 DVI VGA Out 8 Channel Audio MATX Motherboard (235728): £51.55
- 1 x Intel Core i3 540 3.06GHz Socket 1156 4MB L3 Cache Retail Box Processor (185939): £74.95
- 1 x Belkin 5 Port Gigabit Switch (126574): £20.74
- 1 x Kingston DataTraveler G3 2GB USB Flash Drive (231481): £3.72
- 1 x Antec 1200 Twelve Hundred Full Tower Case (143852): £108.02
On top of that I am using 2 Supermicro SAS-MV8 pci-ex boards (8 ports each) and the additional HDs noted in the log.
I am only using one of the sticks of memory (2GB) at the moment due to compatibility issues with the motherboard in dual-channel mode. The memory is going to be replaced with some Corsair Dominator memory in a couple of weeks.
It is worth bearing in mind, though, that this only started yesterday, after 6 weeks of uptime.
April 5, 2011 (Author)
Ok so both Short and long tests done. Results below:
Statistics for /dev/sde ST31500341AS_9VS01LLN
smartctl -a -d ata /dev/sde
smartctl 5.39.1 2010-01-28 r3054 [i486-slackware-linux-gnu] (local build)
Copyright © 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net
=== START OF INFORMATION SECTION ===
Model Family:     Seagate Barracuda 7200.11 family
Device Model:     ST31500341AS
Serial Number:    9VS01LLN
Firmware Version: SD37
User Capacity:    1,500,301,910,016 bytes
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   8
ATA Standard is:  ATA-8-ACS revision 4
Local Time is:    Tue Apr 5 23:11:23 2011 BST
==> WARNING: There are known problems with these drives, see the following Seagate web pages:
http://seagate.custkb.com/seagate/crm/selfservice/search.jsp?DocId=207931
http://seagate.custkb.com/seagate/crm/selfservice/search.jsp?DocId=207951
http://seagate.custkb.com/seagate/crm/selfservice/search.jsp?DocId=207957
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
See vendor-specific Attribute list for marginal Attributes.
General SMART Values:
Offline data collection status: (0x82)
    Offline data collection activity was completed without error.
    Auto Offline Data Collection: Enabled.
Self-test execution status: ( 0)
    The previous self-test routine completed without error or no self-test has ever been run.
Total time to complete Offline data collection: ( 609) seconds.
Offline data collection capabilities: (0x7b)
    SMART execute Offline immediate.
    Auto Offline data collection on/off support.
    Suspend Offline collection upon new command.
    Offline surface scan supported.
    Self-test supported.
    Conveyance Self-test supported.
    Selective Self-test supported.
SMART capabilities: (0x0003)
    Saves SMART data before entering power-saving mode.
    Supports SMART auto save timer.
Error logging capability: (0x01)
    Error logging supported.
    General Purpose Logging supported.
Short self-test routine recommended polling time: ( 1) minutes.
Extended self-test routine recommended polling time: ( 255) minutes.
Conveyance self-test routine recommended polling time: ( 2) minutes.
SCT capabilities: (0x103b)
    SCT Status supported.
    SCT Feature Control supported.
    SCT Data Table supported.
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG   VALUE WORST THRESH TYPE     UPDATED WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f 118   099   006    Pre-fail Always  -           170205846
  3 Spin_Up_Time            0x0003 089   089   000    Pre-fail Always  -           0
  4 Start_Stop_Count        0x0032 098   098   020    Old_age  Always  -           2110
  5 Reallocated_Sector_Ct   0x0033 100   100   036    Pre-fail Always  -           0
  7 Seek_Error_Rate         0x000f 077   060   030    Pre-fail Always  -           51666357
  9 Power_On_Hours          0x0032 085   085   000    Old_age  Always  -           13966
 10 Spin_Retry_Count        0x0013 100   100   097    Pre-fail Always  -           82
 12 Power_Cycle_Count       0x0032 100   037   020    Old_age  Always  -           141
184 End-to-End_Error        0x0032 100   100   099    Old_age  Always  -           0
187 Reported_Uncorrect      0x0032 100   100   000    Old_age  Always  -           0
188 Command_Timeout         0x0032 100   095   000    Old_age  Always  -           137441312824
189 High_Fly_Writes         0x003a 056   056   000    Old_age  Always  -           44
190 Airflow_Temperature_Cel 0x0022 059   019   045    Old_age  Always  In_the_past 41 (255 255 42 35)
194 Temperature_Celsius     0x0022 041   081   000    Old_age  Always  -           41 (0 17 0 0)
195 Hardware_ECC_Recovered  0x001a 034   028   000    Old_age  Always  -           170205846
197 Current_Pending_Sector  0x0012 100   100   000    Old_age  Always  -           0
198 Offline_Uncorrectable   0x0010 100   100   000    Old_age  Offline -           0
199 UDMA_CRC_Error_Count    0x003e 200   200   000    Old_age  Always  -           0
SMART Error Log Version: 1
No Errors Logged
SMART Self-test log structure revision number 1
Num Test_Description  Status                   Remaining LifeTime(hours) LBA_of_first_error
# 1 Extended offline  Completed without error  00%       13966           -
# 2 Short offline     Completed without error  00%       13961           -
# 3 Short offline     Completed without error  00%       12327           -
# 4 Short offline     Completed without error  00%       5891            -
# 5 Short offline     Completed without error  00%       5455            -
# 6 Short offline     Completed without error  00%       2162            -
# 7 Short offline     Completed without error  00%       2162            -
SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
   1       0       0 Not_testing
   2       0       0 Not_testing
   3       0       0 Not_testing
   4       0       0 Not_testing
   5       0       0 Not_testing
Selective self-test flags (0x0):
    After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
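One way to scan an attribute table like the one above is to flag every attribute whose WORST normalized value has ever been at or below its non-zero THRESH. The sketch below assumes the ten-column row layout shown in this smartctl output; the helper names are made up, and the three rows are copied from the post.

```python
def parse_attribute(row):
    """Split one smartctl attribute row into named fields."""
    parts = row.split(None, 9)  # the trailing RAW_VALUE may contain spaces
    return {
        "id": int(parts[0]),
        "name": parts[1],
        "value": int(parts[3]),
        "worst": int(parts[4]),
        "thresh": int(parts[5]),
        "when_failed": parts[8],
        "raw": parts[9],
    }


def ever_below_threshold(attr):
    """True if the worst recorded value reached the (non-zero) threshold."""
    return attr["thresh"] > 0 and attr["worst"] <= attr["thresh"]


rows = [
    "5 Reallocated_Sector_Ct 0x0033 100 100 036 Pre-fail Always - 0",
    "12 Power_Cycle_Count 0x0032 100 037 020 Old_age Always - 141",
    "190 Airflow_Temperature_Cel 0x0022 059 019 045 Old_age Always In_the_past 41 (255 255 42 35)",
]

attributes = [parse_attribute(row) for row in rows]
flagged = [a for a in attributes if ever_below_threshold(a)]

for attr in flagged:
    print(attr["id"], attr["name"], attr["worst"], attr["thresh"], attr["when_failed"])
```

Of the three sample rows, only attribute 190 is flagged, which matches the In_the_past marker smartctl itself printed.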
April 6, 2011
I do not use Seagate HDs, but I believe you have (or had) a problem with this particular HD in the past. How do you cool your HDs? Do you use 5-in-3s? Did you have any problems with the fans on these in the past?
It appears that this HD has been "cooked" in the past:
190 Airflow_Temperature_Cel 0x0022 059 019 045 Old_age Always In_the_past 41 (255 255 42 35)
The "worst" value of 19 is significantly lower than the "threshold" value of 45, and the flag has been set "in the past". I believe that Seagate uses (100 minus actual temperature in °C) for this attribute, which means that your hard drive has been subjected to a temperature of 81 °C in the past (Seagate itself sets the limit at 55 °C, i.e. 100 minus the threshold of 45).
However, it did pass the tests, so I really do not know what to think. It could be malfunctioning temperature-sensor circuitry, but I suggest monitoring this HD daily. Perhaps other users with Seagate HDs can comment on this one.
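The arithmetic in that reading of attribute 190 can be written out as a short sketch. It assumes, as the post itself only "believes", that the normalized value is 100 minus the temperature in °C; the function name is made up.

```python
def normalized_to_celsius(normalized):
    """Assumed Seagate convention for attribute 190: value = 100 - temperature."""
    return 100 - normalized


current, worst, thresh = 59, 19, 45  # VALUE, WORST, THRESH from the row above

print(normalized_to_celsius(current))  # 41, matching the raw value of 41
print(normalized_to_celsius(worst))    # 81, the hottest recorded temperature
print(normalized_to_celsius(thresh))   # 55, the temperature limit
```

The current value decoding to the same 41 that appears in the raw column supports the assumed convention, at least for this drive.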
April 6, 2011 (Author)
Yes, all drives except the WD20EARS were pulled from USB drive casings. Those two 1.5TB drives had no fans in their enclosures, so no wonder they cooked.
But this doesn't explain why the whole server dropped off the network. Does anybody see anything in the log?
April 6, 2011 (Author)
OK, so 3 days running and now 3 times the server has dropped off the network. Shares do not work and video freezes if I am in the middle of watching something, and yet today the server was accessible by telnet and fully usable at the console. I had no choice but to reboot again.
This is getting VERY annoying. Can't anybody from Lime help here? I don't understand the logs, but you must be able to?
April 6, 2011
Quote: OK, so 3 days running and now 3 times the server has dropped off the network. Shares do not work and video freezes if I am in the middle of watching something, and yet today the server was accessible by telnet and fully usable at the console. I had no choice but to reboot again. This is getting VERY annoying. Can't anybody from Lime help here? I don't understand the logs, but you must be able to?
You should have grabbed the log when the shares went down but the server was still available via telnet. That would have told us a lot more.
The original syslog seems to reveal some issues with the Supermicro card you probably have installed. Your previous post about your hardware is hard to follow, though, as you refer to the SAS-MV8 card but then say it is pci-ex (which does not exist). I am assuming you mean the SASLP-MV8.
April 6, 2011 (Author)
Quote: The original syslog seems to reveal some issues with the Supermicro card you probably have installed. Though your previous post about your hardware is hard to follow as you refer to the SAS-MV8 card but then say it is pci-ex (which does not exist). I am assuming you mean the SASLP-MV8.
Yes, that's right. They are (2, actually) PCI-Express x4 cards; that's what I was trying to say. It is the SASLP card. What are the issues? I wouldn't have the first clue about Linux logs.
Also, I notice that someone else has reported the same issue here today: http://lime-technology.com/forum/index.php?topic=12192.0
April 6, 2011
My suggestions from earlier still stand. Running a memtest would help make sure that the RAM is good. After that I would disable any add-ons and see what happens with regard to the issue you are seeing. Are you running any add-ons?
April 6, 2011 (Author)
I have done a memtest overnight. No issues found.
The only add-ons I use are unmenu, smtp and email notification. All of these have been running for weeks.
April 7, 2011 (Author)
I have done another memtest overnight; no issues highlighted.
I should point out that the log I posted earlier was captured after the shares failed for the 2nd time.
So to recap:
- Shares stop working (disk and user)
- Web access to unmenu and unRAID stops
- Telnet still functions
- Scripts stop working properly (powerdown and reboot get halfway through and then just stop)
With the issues pointed out earlier with the Supermicro card, is this looking like a controller issue?
April 7, 2011
Why do you have 2 of the SASLP cards in the system? It looks like you have 12(ish) drives in the array, and if that is the case you probably don't need the extra card in there.
From what I could see at the end of the log, something related to the SAS driver (for the SASLP card) is timing out, and that is probably what is causing your problems. I'm not quite sure, though.
You are more than likely having some sort of hardware issue, though I don't know exactly what it could be. Do some Google searching for the errors at the end of the log and see what you find.
April 7, 2011
I suggest that when the problem happens you telnet into the unRAID server and run the following command:
tail -n 50 /var/log/syslog
This will give you the last 50 lines of the syslog. Then I would check how much free memory you have left by running:
free
I suspect your server is crashing due to log explosion: near the end of your log you are getting a lot of drive I/O errors, ~12 entries every 30 s! I had a similar problem and isolated it to a bad drive, but in my case the drive was exhibiting this problem when I was trying to preclear it. In your case it might be more difficult to troubleshoot if you have data on the drives and can't easily pull out each drive. If your free memory is really low, that could be causing the server crashes; check for another post from me about log explosion and what I did to try to avoid server crashes caused by it.
I would also suggest you try reseating all your drive cables and also the SAS cards to make sure it's not a bad connection.
Then try the obvious, such as:
hdparm -tT /dev/sdX
where X is each of your drive letters. This performs a speed test; really, I want to see whether you can still access the HDs from the server itself.
Then run the following:
ethtool eth0
and
ifconfig eth0
These will print the state of your network connection. I had a problem where the cable had a bad connection: I could telnet in, but transfers to/from the box were slow at 170kb/s...
Lastly, if you remember which video file you were playing, you could try to find which drive it was streaming from; it could be that drive that's failing. I'm not that familiar with unRAID yet, so I'm not sure whether you could pull that drive and see if the server stops crashing without it freaking out that a drive is missing...
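The "~12 entries every 30 s" estimate can be checked by bucketing syslog lines into fixed time windows. This is only a sketch: it assumes the classic "Mon D HH:MM:SS host ..." syslog prefix, the year has to be supplied because syslog omits it, the function name is made up, and the sample lines are a handful of kernel messages in the format that appears in this thread.

```python
from collections import Counter
from datetime import datetime, timedelta

REFERENCE = datetime(2011, 1, 1)  # arbitrary origin used to align the windows


def entries_per_window(lines, window_seconds=30, year=2011):
    """Count syslog lines per fixed-size time window."""
    counts = Counter()
    for line in lines:
        stamp = " ".join(line.split()[:3])  # e.g. "Apr 7 19:18:02"
        ts = datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")
        index = int((ts - REFERENCE).total_seconds()) // window_seconds
        counts[REFERENCE + timedelta(seconds=index * window_seconds)] += 1
    return counts


sample = [
    "Apr 7 19:18:02 Tower kernel: sas: Enter sas_scsi_recover_host",
    "Apr 7 19:18:02 Tower kernel: sas: trying to find task 0xf75df680",
    "Apr 7 19:18:02 Tower kernel: sas: --- Exit sas_scsi_recover_host",
    "Apr 7 19:18:33 Tower kernel: sas: Enter sas_scsi_recover_host",
    "Apr 7 19:18:33 Tower kernel: sas: --- Exit sas_scsi_recover_host",
]

counts = entries_per_window(sample)
for start, count in sorted(counts.items()):
    print(start.time(), count)
```

On the server itself the same function could be fed the lines read from /var/log/syslog; a steadily high count per window would support the log-explosion theory.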
April 7, 2011
1. I suggest checking for and updating your motherboard BIOS to the latest revision available. Then load the default values, change the SATA drives to AHCI mode and disable any unused board features (serial and parallel ports, audio, FireWire, floppy drive controller, and the IDE controller if you do not plan to use any older PATA drives). Give it one short run of Memtest (just a single pass) to see whether the memory settings have been affected by the new BIOS.
Then check whether the problem repeats. If not, you are good to go, but monitor your HD temperatures (you can run a parity check) and observe the temperature of the whole system and of the individual drives. Running an HD above 50 °C is playing with fire.
If the system crashes again: you have two SM SASLP cards and you do not have that many drives, so you can temporarily remove one of them to see if this fixes it. However, this is not a long-term solution.
These SM cards originally shipped with firmware 15N. The new ones (and you have them) come with firmware 21N. I am not sure exactly what the difference between them is, but the newer firmware is much bigger, and it appears that the manufacturer added RAID functionality that we do not need here. There is a known problem with 21N controllers and some of the newer AMD chipsets (880G), and the people who used them reported that the problems went away after they downgraded their firmware back to 15N. However, you do have an Intel-based board. If you want, you can give it a try.
April 7, 2011 (Author)
Thanks for all the suggestions.
I have started by flashing the BIOS with the newest version (0402). I have changed the SATA mode to AHCI and disabled everything!!! Well, nearly everything.
One thing I have noticed is that the machine boots up extremely fast now... is this down to me disabling things?
There is still the error on the SASLP driver showing halfway through the bootup log, though.
I will now wait and see if this helps before moving on to step two, which is tackling the firmware on the SM cards.
I would just like to clarify one thing: the HD that "cooked" was pulled from an external USB drive. At no point have I ever knowingly run a hard disk at extreme temperatures.
Thanks again, and I will keep you posted.
PS: I attach the bootup log for your records.
Attachment: syslog-2011-04-07.txt
April 7, 2011 (Author)
Quote: I suggest when the problem happens to telnet into the unRaid server and running the following command: tail -n 50 /var/log/syslog
OK, so after only 20 mins of usage... it's crashed. And it happened when I sent a spin-down request to the server.
Here's the last 50 lines of the log obtained via telnet:
Apr 7 19:18:02 Tower kernel: sas: command 0xc37c2300, task 0xf75df680, timed out: BLK_EH_NOT_HANDLED
Apr 7 19:18:02 Tower kernel: sas: Enter sas_scsi_recover_host
Apr 7 19:18:02 Tower kernel: sas: trying to find task 0xf75df680
Apr 7 19:18:02 Tower kernel: sas: sas_scsi_find_task: aborting task 0xf75df680
Apr 7 19:18:02 Tower kernel: /usr/src/sas/trunk/mvsas_tgt/mv_sas.c 1701:mvs_abort_task:rc= 5
Apr 7 19:18:02 Tower kernel: sas: sas_scsi_find_task: querying task 0xf75df680
Apr 7 19:18:02 Tower kernel: /usr/src/sas/trunk/mvsas_tgt/mv_sas.c 1645:mvs_query_task:rc= 5
Apr 7 19:18:02 Tower kernel: sas: sas_scsi_find_task: task 0xf75df680 failed to abort
Apr 7 19:18:02 Tower kernel: sas: task 0xf75df680 is not at LU: I_T recover
Apr 7 19:18:02 Tower kernel: sas: I_T nexus reset for dev 0300000000000000
Apr 7 19:18:02 Tower kernel: sas: I_T 0300000000000000 recovered
Apr 7 19:18:02 Tower kernel: sas: --- Exit sas_scsi_recover_host
Apr 7 19:18:27 Tower in.telnetd[11957]: connect from 192.168.1.115 (192.168.1.115)
Apr 7 19:18:29 Tower login[11958]: ROOT LOGIN on `pts/0' from `Darren-LT.lan'
Apr 7 19:18:33 Tower kernel: sas: command 0xc37c2300, task 0xf75df680, timed out: BLK_EH_NOT_HANDLED
Apr 7 19:18:33 Tower kernel: sas: Enter sas_scsi_recover_host
Apr 7 19:18:33 Tower kernel: sas: trying to find task 0xf75df680
Apr 7 19:18:33 Tower kernel: sas: sas_scsi_find_task: aborting task 0xf75df680
Apr 7 19:18:33 Tower kernel: /usr/src/sas/trunk/mvsas_tgt/mv_sas.c 1701:mvs_abort_task:rc= 5
Apr 7 19:18:33 Tower kernel: sas: sas_scsi_find_task: querying task 0xf75df680
Apr 7 19:18:33 Tower kernel: /usr/src/sas/trunk/mvsas_tgt/mv_sas.c 1645:mvs_query_task:rc= 5
Apr 7 19:18:33 Tower kernel: sas: sas_scsi_find_task: task 0xf75df680 failed to abort
Apr 7 19:18:33 Tower kernel: sas: task 0xf75df680 is not at LU: I_T recover
Apr 7 19:18:33 Tower kernel: sas: I_T nexus reset for dev 0300000000000000
Apr 7 19:18:33 Tower kernel: sas: I_T 0300000000000000 recovered
Apr 7 19:18:33 Tower kernel: sas: --- Exit sas_scsi_recover_host
Apr 7 19:19:04 Tower kernel: sas: command 0xc37c2300, task 0xf75df680, timed out: BLK_EH_NOT_HANDLED
Apr 7 19:19:04 Tower kernel: sas: Enter sas_scsi_recover_host
Apr 7 19:19:04 Tower kernel: sas: trying to find task 0xf75df680
Apr 7 19:19:04 Tower kernel: sas: sas_scsi_find_task: aborting task 0xf75df680
Apr 7 19:19:04 Tower kernel: /usr/src/sas/trunk/mvsas_tgt/mv_sas.c 1701:mvs_abort_task:rc= 5
Apr 7 19:19:04 Tower kernel: sas: sas_scsi_find_task: querying task 0xf75df680
Apr 7 19:19:04 Tower kernel: /usr/src/sas/trunk/mvsas_tgt/mv_sas.c 1645:mvs_query_task:rc= 5
Apr 7 19:19:04 Tower kernel: sas: sas_scsi_find_task: task 0xf75df680 failed to abort
Apr 7 19:19:04 Tower kernel: sas: task 0xf75df680 is not at LU: I_T recover
Apr 7 19:19:04 Tower kernel: sas: I_T nexus reset for dev 0300000000000000
Apr 7 19:19:04 Tower kernel: sas: I_T 0300000000000000 recovered
Apr 7 19:19:04 Tower kernel: sas: --- Exit sas_scsi_recover_host
Apr 7 19:19:35 Tower kernel: sas: command 0xc37c2300, task 0xf75df680, timed out: BLK_EH_NOT_HANDLED
Apr 7 19:19:35 Tower kernel: sas: Enter sas_scsi_recover_host
Apr 7 19:19:35 Tower kernel: sas: trying to find task 0xf75df680
Apr 7 19:19:35 Tower kernel: sas: sas_scsi_find_task: aborting task 0xf75df680
Apr 7 19:19:35 Tower kernel: /usr/src/sas/trunk/mvsas_tgt/mv_sas.c 1701:mvs_abort_task:rc= 5
Apr 7 19:19:35 Tower kernel: sas: sas_scsi_find_task: querying task 0xf75df680
Apr 7 19:19:35 Tower kernel: /usr/src/sas/trunk/mvsas_tgt/mv_sas.c 1645:mvs_query_task:rc= 5
Apr 7 19:19:35 Tower kernel: sas: sas_scsi_find_task: task 0xf75df680 failed to abort
Apr 7 19:19:35 Tower kernel: sas: task 0xf75df680 is not at LU: I_T recover
Apr 7 19:19:35 Tower kernel: sas: I_T nexus reset for dev 0300000000000000
Apr 7 19:19:35 Tower kernel: sas: I_T 0300000000000000 recovered
Apr 7 19:19:35 Tower kernel: sas: --- Exit sas_scsi_recover_host
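A quick way to see the pattern in this excerpt is to group the "timed out" lines by task address and measure the gap between repeats. The sketch below is illustrative only: the regular expression and function name are made up, it expects each log entry on a single unwrapped line (so "timed out" is in one piece), and the year is supplied by hand because syslog omits it.

```python
import re
from datetime import datetime

TIMEOUT = re.compile(
    r"^(\w{3}\s+\d+ [\d:]{8}) \S+ kernel: sas: "
    r"command (0x[0-9a-f]+), task (0x[0-9a-f]+), timed out"
)


def timeout_intervals(lines, year=2011):
    """Map each task address to the seconds between its successive timeouts."""
    seen = {}
    for line in lines:
        match = TIMEOUT.match(line)
        if not match:
            continue
        ts = datetime.strptime(f"{year} {match.group(1)}", "%Y %b %d %H:%M:%S")
        seen.setdefault(match.group(3), []).append(ts)
    return {
        task: [int((b - a).total_seconds()) for a, b in zip(times, times[1:])]
        for task, times in seen.items()
    }


# The four timeout lines from the excerpt above, one per line.
log = [
    "Apr 7 19:18:02 Tower kernel: sas: command 0xc37c2300, task 0xf75df680, timed out: BLK_EH_NOT_HANDLED",
    "Apr 7 19:18:33 Tower kernel: sas: command 0xc37c2300, task 0xf75df680, timed out: BLK_EH_NOT_HANDLED",
    "Apr 7 19:19:04 Tower kernel: sas: command 0xc37c2300, task 0xf75df680, timed out: BLK_EH_NOT_HANDLED",
    "Apr 7 19:19:35 Tower kernel: sas: command 0xc37c2300, task 0xf75df680, timed out: BLK_EH_NOT_HANDLED",
]

intervals = timeout_intervals(log)
print(intervals)  # {'0xf75df680': [31, 31, 31]}
```

Every timeout in the excerpt carries the same command and task address and recurs 31 seconds apart, so the log looks more like one stuck command being retried against device 0300000000000000 than many unrelated I/O errors.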
April 7, 2011 (Author)
I've done some research, and it seems this has happened to other people. Specifically, see page 2 of this thread: http://lime-technology.com/forum/index.php?topic=6423.0
Would it be bad form to resurrect that thread?