August 27, 2009

Last night my tower went down. I couldn't reach it by telnet, browser, or the local console. The console is on an HDMI switch, though, and for some reason it won't display unless the switch is already set to the tower when it boots; if I switch over while it's in use, nothing displays. Either way, something was wrong: network drives couldn't connect, and I couldn't access it by the other two methods.

I manually rebooted with my A/V receiver switched to the tower. First I got some POST beep codes, whose meaning I couldn't look up at the time, so I shut down completely and powered on again, this time without any beep codes. I did get an error message during POST, though: "Overclock Failed". It's actually barely overclocked; I just wanted to match the DDR500 memory, so it's at 2.5 rather than 2.2. Anyway, I rebooted and dialed it down a touch to 2.4, rebooted with no problem after that, and started a parity check. It was getting pretty late, so I left it alone and went to bed.

This morning I checked it and it was still running! It was crawling at ~60KB/s and showed a five-digit number of minutes remaining. I rebooted again just now, not really sure what else to do. It started out fine, running at ~61,000KB/s with ~280 minutes remaining, which looked normal. I just checked it as I'm writing this, though, and it's crawling again. If it's like earlier this morning, I expect it will stop functioning again soon and become unresponsive to telnet or browser access.

Incidentally, I had two errors during the earlier parity check; none have happened this time. Also, I realized how long it had been since I rebooted. I did (not on purpose) once during a power outage, twice actually, but besides that it's been several months. Not sure if that matters; I don't think Windows would like to run that long without a reboot.

From here, all I can think to do is lower the CPU speed more to see if that's affecting it. Probably either my motherboard or my CPU is fudged, probably the CPU; it's the weakest link.

Any suggestions?

thanks,
rlr

Edit: I just tried "restore". The parity disk is restoring, I guess, but the other disks are up and running fine, if that helps....
August 27, 2009

If you can capture a syslog for us, that would be great. That way some of us can look through it and see what might be going on. How long had it been since you last ran a parity check? I know my server has been up for months and nothing has caused it to reboot yet. It might have been a disk problem that caused the syslog to fill all the available RAM, after which the system started killing things (e.g., the NIC, smb, etc.).

There is an add-on that can help in these situations. It is called powerdown and can be found in the user-contributed section of the wiki, under add-ons I believe. Take a look at installing that, and if/when this happens again you will be able to ctrl-alt-del your way to a safe shutdown AND capture a syslog in the process.
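For anyone who wants to grab the syslog by hand before rebooting, here is a minimal sketch. It assumes the live log is at /var/log/syslog and that the flash drive is mounted at /boot (the usual unRAID layout, but check your own system); the function name save_syslog is made up for this example.

```shell
#!/bin/sh
# Copy the live syslog to a timestamped file so it survives a reboot.
# Assumptions: /var/log/syslog is the log location and /boot is the
# flash drive mount point. Pass other paths as arguments if yours differ.
save_syslog() {
    src="${1:-/var/log/syslog}"
    dest_dir="${2:-/boot}"
    dest="$dest_dir/syslog-$(date +%Y%m%d-%H%M%S).txt"
    # Print the destination path only if the copy succeeded.
    cp "$src" "$dest" && echo "$dest"
}

# Usage:
#   save_syslog                              # default paths
#   save_syslog /var/log/syslog /boot/logs   # explicit paths
```

Because the copy lands on the flash drive rather than in RAM, it should still be there after a power cycle and can be attached to a forum post from another machine.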
August 27, 2009

Make sure you have the memory voltage, timings, and clock speed set properly. Some BIOSes get it right automatically, many do not. It must be set for your specific RAM strips...

Oh yes, pressing the button labeled "restore" immediately invalidates parity, and immediately sets a new configuration based on the current assigned-and-working disks. When you then subsequently start the array, it starts calculating parity as if it was the very first time you started your array. You lose parity protection until it completes.

The button does not "restore" parity, but throws away any prior parity calcs you might have done. You will then need to let it completely calculate it again before you are protected from a disk failure again.

If you press "restore" when you have a disk failure, you throw away the contents of that failed disk (since it is not working, it will not be in the new configuration) and since you invalidated parity, you cannot rebuild the failed disk from parity either. You are lucky that all your data disks were working...

I strongly suggest you wait till the parity build is complete, and then not use "restore" in the future unless specifically instructed. (There are rare occasions where it is necessary, as in the "Trust-my-Parity" procedure.)

Always use "Start" to start the array, even if replacing a failed drive, or upgrading the size of a drive.

Joe L.
August 27, 2009 Author

Thanks very much, both of you.

So if I have a failed disk, and it looks like I might, normally I would have been able to replace the parity with another disk and it would be rebuilt like any other?

The results of my restore are that, yes, thankfully all my data disks are intact. However, a new parity could not be established; it said that the parity drive had 288 errors. Does that mean that my disk is dead? What about a reformat?

rlr
August 27, 2009

I'm not sure the 288 errors were on the parity drive. Those could have come from any of the drives, I believe. If you could capture SMART reports for all of the drives and post them, along with a copy of your syslog, we might be able to tell you which disk is/was the one that was acting up.

If I had to speculate, I would guess that one of your disks is not playing nice and developed some bad sectors, and this caused the syslog to fill and bring the server down.

After you have captured the syslog and SMART reports, run another parity check to make sure everything is good now. If it still comes back finding errors, then something could be wrong and parity cannot be trusted. This is why bi-weekly/monthly parity checks are always a good idea. They help catch things like this before they bring the server down.
August 27, 2009

> so if i have a failed disk, and it looks like i might, normally i would have been able to replace the parity with another disk and it would be rebuilt like any other?

Yes, if the parity disk failed, you can replace it with another, and a new set of parity data will be written to it.

> the results of my restore are that, yes thankfully all my data disks are intact, however a new parity could not be established it said that the parity drive had 288 errors. does that mean that my disk is dead? what about reformat? rlr

As already mentioned, either post a syslog, so that others like myself on the forum can assist, or interpret it yourself. We have no way to know if your disk is dead. To know, you will need to run a "smartctl" command or two on it to learn of its health.

You are lucky if you did not lose data... but unless you played every movie, and tried every file on your disks, you have no way to really know. If it is only your parity drive that is failing, then you'll be fine... But we don't know...

Normally, pressing the "restore" button is the absolute worst action to take when your array is experiencing errors. It does NOT restore data. It invalidates parity.

The only way to know what is happening is to look in the syslog. You must capture a copy after the errors occur, and before you reboot. Once you reboot it is replaced with a new file that may not have the clues needed. Look in the wiki for instructions: http://lime-technology.com/wiki/index.php/Troubleshooting

It is not enough to just "think" it is the drive that is failing... many of the problems we've encountered with problem reports like yours were not the drive itself, but the cabling to it. The syslog provides the clues needed to advise you, unless... like I said, you wish to interpret it yourself. Just use Google to look up lines in it that you think might be indications of errors... and interpret the search results...

If you've failed to complete a new parity calculation, then NOW is the time to post your syslog... or at least capture a copy for your own analysis... before you reboot or power down.

One last thing... the parity drive is never formatted... it is not partitioned, it does not have a file-system.... it is calculated bits only. So the concept of "re-formatting" is never appropriate. You must have an MS-Windows background... In fact, in all the situations of corruption of disks I've seen on the unRAID forum, only once was a re-format tried as a last resort to recover... (from somebody who zeroed their own data by pressing, you guessed it, the "restore" button).

Joe L.
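As a starting point for interpreting a syslog yourself, the rough sketch below lists the lines worth looking up. The keyword list is only an illustrative guess, not an official set of unRAID or kernel error strings, and the function name find_disk_errors is made up for this example.

```shell
#!/bin/sh
# Print syslog lines that often accompany disk trouble, prefixed with
# their line numbers. The keywords are an assumption for illustration;
# extend the pattern with whatever shows up in your own log.
find_disk_errors() {
    grep -inE 'error|timeout|exception|failed' "${1:-/var/log/syslog}"
}

# Usage:
#   find_disk_errors                     # live log (assumed path)
#   find_disk_errors /boot/syslog.txt    # a saved copy
```

Each matching line can then be pasted into a search engine or checked against the wiki, as suggested above. No output only means none of these particular keywords appeared, not that the log is clean.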
August 27, 2009 Author

Doesn't it do an auto one every month or something?

syslog: http://www.mediafire.com/?sharekey=6b391abe7448ab5e90a82c7bb0fad7ade04e75f6e8ebb871

thanks,
rlr
August 27, 2009

> doesn't it do an auto one every month or something?

No... it does not. Not unless you add your own script to do it as a customization.
August 27, 2009

> doesn't it do an auto one every month or something?

No, the parity checks are not done automatically. Some of the forum members have set up cron tasks to do this. The script I use is below:

    #!/bin/sh
    # Dump the current crontab to a scratch file.
    crontab -l >/tmp/crontab
    # Look for an existing parity-check entry (grep exits 1 when none is found).
    grep -q "/root/mdcmd check" /tmp/crontab 1>/dev/null 2>&1
    if [ "$?" = "1" ]
    then
        # Not present yet: append the entry and install the updated crontab.
        echo "# check parity on the 1st of every month at 1:30 am:" >>/tmp/crontab
        echo "30 1 1 * * /root/mdcmd check 1>/dev/null 2>&1" >>/tmp/crontab
        crontab /tmp/crontab
    fi

This will run the check at 1:30 am on the 1st of every month. There are multiple ways to get this to "install" itself when the server is rebooted, but the easiest is probably to add a line to your go script. Do some searching in the forums for Third Party Boot Architecture and go visit the wiki article about the subject. I know a lot of regular forum members use it, as do I, and it makes our lives easier when you come to us with a problem about your server.

P.S. This is just me being a geek, but I finally got SSH up and running on my server and was able to get the above script via cat from my office... so freaking cool.
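One way to add a line to the go script without duplicating it on repeated runs is sketched below. It assumes the go script lives at /boot/config/go; the function name add_to_go and the path /boot/custom/parity_cron.sh are hypothetical and stand in for wherever the cron script has been saved.

```shell
#!/bin/sh
# Append a line to a startup script only if that exact line is not
# already present. /boot/config/go is assumed to be the go script.
add_to_go() {
    line="$1"
    go_file="${2:-/boot/config/go}"
    # -x matches whole lines, -F treats the text literally (no regex).
    grep -qxF -e "$line" "$go_file" 2>/dev/null ||
        printf '%s\n' "$line" >> "$go_file"
}

# Usage (the script path is a made-up example):
#   add_to_go "/boot/custom/parity_cron.sh"
```

Checking for the whole line first means the function is safe to run more than once, which is handy while experimenting.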
August 27, 2009 Author

The syslog is above; I'm not sure how to get SMART reports though. I got the syslog through unMENU. It shows that all the errors are on disk 0, my parity drive, so doesn't that mean that's the offending drive?

cheers,
rlr
August 27, 2009

That is good news... but we don't know yet if it is the disk itself, or the cabling, or the drive tray (if you use one).

On the unMENU disk management page, there is a drop-down list. Choose your parity disk (probably the first on the list) and press the "Smart Status Report" button. It runs the smartctl report on the selected drive... (scroll to the bottom of the page to see the output).

Post the output.

Joe L.
August 27, 2009

The Lib-ATA error message wiki has some clues for your "timeout" error... it even suggests it could be a missed interrupt from other hardware on your server... The SMART report should give some clues.

http://ata.wiki.kernel.org/index.php/Libata_error_messages

Joe L.
August 28, 2009 Author

I tried to get a SMART report for parity, but it just spat out a dead link. Does the array have to be stopped for the SMART status to run? In unMENU it says "Parity not valid: DSK_DSBL". Do I have to reboot or something to get the report? Incidentally, the results of the last parity check were the same: 288 errors.

I should also mention, since you brought up the timeout error and other hardware, that I just switched over to the Win7 7600 RTM build. I had another network issue with it too: it wouldn't reconnect my shared drives after a reboot, with the password stored. I fixed that easily enough. My usernames for the server and the PC were the same, which I guess caused a conflict. I changed the username in the unRAID Users list, deleting the old one, and it works fine now. Probably not related though....

thanks,
rlr
August 28, 2009 Author

I got the SMART report running after a reboot; will post when done.

thx again,
rlr
August 28, 2009

You are probably running a version of unRAID where the support library for the SMART "smartctl" command was not included. It can be installed from the unMENU package manager... It is the cxxlibs-6.0.8-i486.tgz library.

You can try clicking on the button to download and install the cxxlibs-6.0.8-i486.tgz library, and then, once it is installed, try the SMART report on the Disk Management page once more.

The array does not have to be off-line to run the smartctl command on a drive. It can be run at any time, on any drive.

The DSK_DSBL status indicates that a write to the drive failed, and as a result unRAID has taken it out of service. It probably will not go back into service now unless un-assigned, rebooted without it, and then re-assigned. Don't do that yet; do it only if the smartctl command reports a working drive. (Right around now would be a good time to purchase/order a new parity drive.)

Since the drive seems to be alive enough for it to be detected by the OS, I'm probably going to rule out a cabling problem...

A SMART status report takes seconds... Did you request a "Long" or "Short" test? Those take time to run. Regardless, unless you installed the cxxlibs support library, nothing is going to be run.

Joe L.
August 28, 2009 Author

Yeah, it's been like hours now. I had to reboot my computer a couple of times, but I didn't think that would matter. I will look for that file, or try to import it; I assume you mean from the unMENU web GUI? The report I tried to run was Smart Statistics, as you had advised earlier. Is there another one I should do?

thx,
rlr
August 28, 2009 Author

Smart Statistics for parity drive: http://www.mediafire.com/?sharekey=6b391abe7448ab5e90a82c7bb0fad7ade04e75f6e8ebb871

rlr
August 28, 2009

The highlighted line in your SMART report below (197 Current_Pending_Sector) indicates a drive that is failing... It is not a cabling problem... Time to order a new drive to replace it.

Joe L.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   198   198   051    Pre-fail  Always       -       25594
  3 Spin_Up_Time            0x0027   253   253   021    Pre-fail  Always       -       4333
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       690
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   095   095   000    Old_age   Always       -       3795
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       81
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       67
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       690
194 Temperature_Celsius     0x0022   123   108   000    Old_age   Always       -       27
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   193   192   000    Old_age   Always       -       1295
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0
August 28, 2009 Author

Hey, thanks a lot for your help Joe. I'm not sure I understand which drive it is though? If it's not the parity drive, can that drive be taken offline and the rest of the array used, while I get some cash together to fix it?

It's weird, the report says old age repeatedly, but none of the drives are very old. Most are less than 2 years old, 2 are about a year, and the oldest is, I don't think, more than 4.

thanks again,
damon
August 28, 2009

The key line is the one he highlighted. "Current Pending Sectors" is the number of bad/damaged sectors that need to be relocated somewhere on the drive. That high a number, and the fact that it seems to have just happened and in such a dramatic leap, is not a good thing. I have a couple of drives that have reallocated sector counts in the 10-20 range, but none that have any "currently pending", and I know for sure that I have never had Current Pending jump to the 1300 range in one go.

The Old_age label on some of those is a little misleading, and most can be ignored. The two key ones to pay attention to are the Reallocated Sector Count and the Current Pending Sector count.
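To pull just those two attributes out of a saved report, something like the sketch below may help. It assumes the attribute table has been saved to a text file with one attribute per line, in the column layout smartctl prints; the function name sector_health and the file name report.txt are made up for this example.

```shell
#!/bin/sh
# Print NAME=RAW_VALUE for the two sector-health attributes found in a
# saved SMART attribute table. Column 2 is the attribute name and the
# last column is the raw value, as in the table layout smartctl prints.
sector_health() {
    awk '$2 == "Reallocated_Sector_Ct" || $2 == "Current_Pending_Sector" {
        print $2 "=" $NF
    }' "$1"
}

# Usage (sdX is a placeholder for the real device name):
#   smartctl -A /dev/sdX > report.txt
#   sector_health report.txt
```

Run against the report posted in this thread, this would show 0 reallocated sectors and 1295 pending ones, which are the raw values from the table. Saving the output after each parity check makes a sudden jump easy to spot.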
August 28, 2009

The drive that is failing is your parity drive. Yes, you can use your array without it. I'd remove it from the array so you can send it in for RMA.

It is:
Device Model: WDC WD1001FALS-00J7B1
Serial Number: WD-WMATV0610437

According to this site, it is under warranty: http://websupport.wdc.com/warranty/serialinput.asp

All you need to do is follow their RMA process to get the replacement.

> All products are shipped UPS Ground. Turn-around times for Advance Replacement and Standard RMA's within the United States and European Union do include shipping time. For Advance Replacements, you have 30 days to return the defective product. If no defective product is returned within that time, you will be billed approximately 30 days from the date the replacement was shipped.

From that quote, they will send you a drive even before getting your old one in return.

As far as the "old-age"... You need to learn how to interpret the SMART report. The "Old_age" label only indicates the category of that parameter. If that parameter's "worst" value reaches that parameter's threshold, it will fail. The failure will indicate the drive is nearing its end of life (has gotten to an old age) and should be replaced, as it has had its share of mechanical wear and tear.

Actually, none of the SMART tests failed the manufacturer's threshold... but you and I both know the drive is having write errors, and it is unable to re-allocate the sectors pending re-allocation...

So... start the RMA process. Their normal turn-around is 5-7 days... You might get a replacement drive in the mail early next week if you request the advance replacement.

Joe L.
August 28, 2009 Author

Hey, so now I've sent my drive off to WD. They have a pretty easy RMA process, so that's good. But now I'm not able to boot my machine; I'm getting a memory error POST beep code. I checked the memory individually in both slots. I'm pretty confident about the memory, but unfortunately I don't have any other DDR memory to test with. So it could be the processor, idk, but it seems too coincidental with the removal of the parity drive, so I thought I'd run it by you guys. Unfortunately it's not really worth buying a processor for that mobo, socket 939 having been phased out, so I'd probably be looking at a total rebuild.

thanks,
rlr
August 28, 2009

If you are definitely getting a memory POST beep code, then usually the CPU has nothing to do with it. Have you tried putting only one memory stick in at a time? I had my own computer business from 1983 until just a couple of years ago, when I retired. It could also be the CPU, but I would bet 9 times out of 10 it is the memory or the memory slots on the MB. You might try using some compressed air to blow out the memory slots. Good luck to you. Troubleshooting can be a pain...

P.S. You might want to unplug everything except the video card. If it boots up, then add one thing at a time until it fails. Sometimes cards just need to be reseated.
August 29, 2009 Author

So it looks like it will probably be a while until I get this box back together, most likely having to buy new everything. Am I able to get these disks to read from my PC? They're still NTFS, right?

thx,
rlr