February 7, 2014

Greetings,

I have been running a 10 TB+ unRAID system for about 2 years and I just got my first red ball! (yay :-) So, of course, I spent several hours reading about troubleshooting, etc., and I still have some questions (isn't that always the way?).

I added a new SSD, intending to use it as a cache drive. I rebooted and got a red ball on one of the old 2 TB drives. My first stop was here: http://lime-technology.com/wiki/index.php/Troubleshooting

Under "Hard drive failures" it says: "It is actually easy to miss a failure unless you notice degraded performance."

Question #1: Is there some way to be informed (maybe via email?) at the time of the failure? Since I don't know when the failure occurred, I have no way of knowing what data was being written when the "write failed".

Then I read the troubleshooting guide and it says "read your syslog". This leads to question #2: since the syslog is overwritten on every boot, why should I assume that the current one has any information on the error?

Next, I found this: http://lime-technology.com/wiki/index.php/FAQ#What_does_the_Red_Ball_mean.3F

"Do not be misled by the fact that you can still read and write to the drive with a red ball indicator. You are, in fact, writing to the parity drive as if the failed drive was working."

Question #3: I understand that any read from the disabled drive will be reconstructed from the data and parity on the other disks. I also understand that any write to the disabled disk will cause a parity update. However, this means that the only place this data is stored is on the parity disk and therefore it is not "redundant", i.e. not protected. This would imply that, if I avoid all writes to the disabled disk, then I have no risk of losing data (all data is "protected"). Is this true?

Next, trying to decide on a course of action, I tried to determine if the disk was "bad". I generated a SMART report (attached). The only thing that stands out is the UDMA_CRC_Error_Count of 188. The article suggested that this might be a bad cable or bad signaling. Since it has worked for 2 years and I have not altered it, I assume that a cable does "go bad". Is this a good assumption?

So now I need to decide:

Do I leave the drive in and reconstruct the data? This post makes a good argument for not doing this: http://lime-technology.com/forum/index.php?topic=2865.msg23755#msg23755

Or do I simply assume that the data (except of course the write that got the error that caused the disk to be taken out of service) is good and do the "trust my array" procedure? This will only work if the disk is really "good" and the write was a fluke or some kind of one-off error. The problem is: there is no way to know this.

Or do I buy a new disk, reconstruct on that one, and then run a bunch of tests on the old one (so old that I can't find any proof of when I bought it)?

I am proceeding with the final option. I ordered a new drive and I will not write to the bad disk until I get the new one and run a reconstruct on it.

Thanks for any help.

syslog-2014.zip smart_out_sdd.txt
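For anyone wanting to triage a SMART report like the one attached, here is a rough Python sketch. It assumes the report uses the attribute-table layout that `smartctl -A` prints; the sample text is invented for illustration (only the raw value 188 comes from the post), and the `classify` heuristic is a rule of thumb rather than anything unRAID does itself.

```python
# Hypothetical excerpt in the `smartctl -A` attribute-table layout.
# Only the raw value 188 is from the post; every other number is made up.
SAMPLE_REPORT = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       188
"""


def parse_raw_values(report):
    """Return {attribute_name: raw_value} for each attribute row."""
    values = {}
    for line in report.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns.
        if len(fields) >= 10 and fields[0].isdigit() and fields[9].isdigit():
            values[fields[1]] = int(fields[9])
    return values


def classify(values):
    """Rule-of-thumb triage (an assumption, not an official procedure)."""
    media_errors = (values.get("Reallocated_Sector_Ct", 0)
                    + values.get("Current_Pending_Sector", 0))
    link_errors = values.get("UDMA_CRC_Error_Count", 0)
    if media_errors > 0:
        return "possible media problem"
    if link_errors > 0:
        return "link errors: check cable and connectors"
    return "no obvious problem"


if __name__ == "__main__":
    print(classify(parse_raw_values(SAMPLE_REPORT)))
```

CRC errors are counted on the link between the drive and the controller, which is why a non-zero count with clean sector counters points toward cabling rather than the platters.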
February 8, 2014

A few notes ...

First, NONE of your disks are now protected -- the entire array is "at risk" until you replace the failed disk.

It makes no difference if you write to the failed disk or any other disk. The difference is that if you write to a different disk the actual disk write occurs, and parity is updated. But you could NOT reconstruct that data if needed, since a reconstruction requires reading from ALL disks except the one being reconstructed -- and you have a failed disk already.

Writes to the failed disk would actually result in parity being updated, but the actual write to the disk not occurring. But you could still read that data just fine, as it would be reconstructed from all the other disks. And when you replace it with a new disk, the reconstructed disk will include all of the newly written data.

Nevertheless, it's not a good idea to use the array any more than necessary while you're running at risk. Personally, if I had a disk failure, I'd simply shut off the server until I could replace the disk.

After you've installed a new disk and reconstructed the data, and then run a parity check to confirm all was well, THEN you can attach the old disk to another system and run diagnostics on it and see if it seems to be okay. If it fails the manufacturer's diags, I'd toss it (or RMA it if it's still under warranty).
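The behavior described above can be illustrated with a toy model in Python. This is only a sketch: it assumes a single parity disk computed as the bytewise XOR of the data disks, and the two-byte "disks" and the `xor_blocks` helper are made up for the example.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, position by position."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Three tiny data "disks" and the parity computed across them.
data = [b"\x11\x22", b"\x33\x44", b"\x55\x66"]
parity = xor_blocks(data)

# Disk 1 is disabled: its contents are rebuilt from ALL other disks plus parity.
failed = 1
survivors = [block for i, block in enumerate(data) if i != failed]
rebuilt = xor_blocks(survivors + [parity])

# A write to the disabled disk never reaches it; only parity is updated so
# that the array stays consistent with the new contents.
new_block = b"\xaa\xbb"
parity = xor_blocks(survivors + [new_block])

# Reading the disabled disk afterwards reconstructs the newly written data.
emulated = xor_blocks(survivors + [parity])
```

Note that `rebuilt` and `emulated` both need every surviving disk. If a second disk failed at this point, neither block could be recovered, which is why the whole array is at risk and not just the disabled disk.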
February 10, 2014

Because your red ball happened immediately following your maintenance on the system (adding the SSD), I would shut down your system and reseat the power and data cables for the one failed drive. It's possible (because of the type of error you are getting) that something just got jarred a bit during your maintenance. I would even go ahead and replace the data cable for that matter.

It could be that doing this will restore the drive back to normal, and you can rest assured your problem was just the data connection and the drive is working just fine. If the drive comes back good, I'm not sure how unRAID handles that, since there have been updates to parity (I have never dealt with a drive failure myself yet in unRAID).
February 10, 2014

The drive won't "... restore ... back to normal" => but it may be possible to rebuild it onto the same physical drive. Won't hurt to try, if a loose cable or poorly seated drive is in fact the issue.
February 10, 2014

If there is a red ball condition at boot time, will unRAID start automatically? I thought it would wait for user input before starting up.
February 10, 2014 (Author)

Hi there: Still no reply to my other question: is there any way to get unRAID to proactively inform me of a drive failure, such as sending an email or some other way? Checking the GUI every now and then for a red ball seems like a bad design.
February 10, 2014 (Author)

"Because your red ball happened immediately following your maintenance on the system (adding the SSD), I would shutdown your system and reseat the power and data cables for the one failed drive. It's possible (because of the type of error you are getting) that something just got jarred a bit during your maintenance. I would even go ahead and replace the data cable for that matter. Could be doing this will restore the drive back to normal and you can rest assured your problem was just the data connection and the drive is working just fine. If the drive comes back good, I'm not sure how unRAID handles that since there have been updates to parity (never dealt with a drive failure myself yet in unRAID)."

As I said in my first post, this post: http://lime-technology.com/forum/index.php?topic=2865.msg23755#msg23755 makes a good argument for NOT doing that. If the rebuild fails, you have lost all the data on the drive.
February 11, 2014

"Hi there: Still no reply to my other question: is there any way to get unRAID to proactively inform me of a drive failure? Such as sending an email or some other way? checking the GUI every now and then for a red ball seems like a bad design."

unmenu has a mail package and a status email package.
February 11, 2014

"Hi there: Still no reply to my other question: is there any way to get unRAID to proactively inform me of a drive failure? Such as sending an email or some other way? checking the GUI every now and then for a red ball seems like a bad design."

"unmenu has a mail package and a status email package."

+1 This has done a great job of notifying me when I have a drive failure. If you are not using unmenu, I highly recommend it.
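For readers curious what a status email check involves, here is a minimal Python sketch of the general idea. It is not the unmenu package and does not reflect how unmenu works internally. The `diskN=STATE` status text, the state names, and the email addresses are all hypothetical placeholders; a real script would have to obtain the array status from the server by some other means.

```python
from email.message import EmailMessage


def find_problem_disks(status_text, ok_state="OK"):
    """Return {disk_name: state} for every disk not in the OK state.

    Assumes one "name=state" line per disk (a made-up format).
    """
    problems = {}
    for line in status_text.splitlines():
        name, sep, state = line.strip().partition("=")
        if sep and state != ok_state:
            problems[name] = state
    return problems


def build_alert(problems, sender, recipient):
    """Build (but do not send) an alert email listing the problem disks."""
    msg = EmailMessage()
    msg["Subject"] = f"Array alert: {len(problems)} disk(s) need attention"
    msg["From"] = sender
    msg["To"] = recipient
    msg.set_content("\n".join(f"{name}: {state}"
                              for name, state in sorted(problems.items())))
    return msg


if __name__ == "__main__":
    # Hypothetical status snapshot with one disabled disk.
    snapshot = "disk1=OK\ndisk2=DISABLED\ndisk3=OK\n"
    problems = find_problem_disks(snapshot)
    if problems:
        print(build_alert(problems, "server@example.com", "me@example.com"))
```

To actually deliver the message, the result of `build_alert` could be passed to `smtplib.SMTP(host).send_message(msg)` from the standard library, with the script run periodically from cron.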