March 31, 2016

Hi guys,

Unfortunately I suffered 2 HDD issues in 3 days, so my server is now out of commission (I only had one hot spare). I'm trying to figure out what happened.

4 days ago, one of the disks (a 10 year old 1TB HDD) started showing errors in the SMART attributes (pending sectors and all), which didn't surprise me much, as it is probably the oldest drive in my array. I already had a 3TB hot spare in the server (precleared 3 times, sitting outside of the array). I quickly stopped the array, assigned the new disk and let it rebuild. About 20 hours later, everything was all good.

2 days later, I went down into the basement to check on the server. I noticed one of the drive cage fans rattling. Usually a couple of gentle taps take care of it. This time they didn't. I let it go. About an hour later, I noticed that the GUI was not coming up. I tried to get it back by killing certain GUI-related processes (every once in a while a process gets stuck and the GUI is unresponsive until it's killed), but that didn't work. Then I noticed that there were a ton of ATA errors logged in the syslog. I initiated a powerdown and rebooted. When it booted up, disk1 (a Seagate Barracuda 2TB) showed up as failed. No idea what happened (although the timing kinda corresponds to my gentle taps). I checked all the cables and rebooted a bunch of times to no avail. Disk1 failed with no other info (that I can see).

Here's the syslog with all the errors: http://pastebin.com/Dpb787VT

I got a new drive and I'm preclearing it now (it will take several days for 3 cycles), and the array will stay down in the meantime as I don't want another disk to fail.

Now the interesting part. I pulled the failed disk, hooked it up to my laptop through a SATA-to-USB3 adapter and ran Seagate's diagnostics (SeaTools). I ran the short SMART test, the generic short test and the generic extended test (which I believe reads all sectors), and they all passed. I am confused as to why the drive failed.

Any ideas on what else I can try to get to the bottom of this?

Thanks

EDIT: You'll notice a couple of USB disconnects shortly before the disk errors start. Could this be a mobo problem perhaps? If so, what's the best course of action? If I replace the mobo, will unRAID automatically detect the drive as being fine, or would I have to rebuild it regardless?
March 31, 2016, Community Expert

Quote: "I checked all the cables and rebooted a bunch of times to no avail. Disk1 failed with no other info (that I can see)."

After a disk is disabled it can't be re-enabled by rebooting; you have to rebuild, either to a new disk or to the same one. You should post a SMART report, but if the disk looks fine you can rebuild to the same disk. In these cases I usually replace both cables or swap enclosures, depending on your setup, so that if the same disk fails again it's probably bad.
March 31, 2016, Author

Thanks for the answer. It seems once a write operation fails, there is no way you can reuse the drive without a data rebuild. So in my case, if it was truly a mobo issue, I'll replace the mobo, preclear the drive to make sure it wasn't a drive issue and then rebuild.
March 31, 2016, Community Expert

It might be instructive to consider why a rebuild is required. The failed write and any subsequent writes are not on the drive, but they are part of parity, so a rebuild will recover the failed writes. Depending on what was being written when it failed, it is possible that not only did a file not get written completely, but updates to the filesystem itself might have been missed, so not only would you be missing data but you might even have filesystem corruption. So, the disk is considered invalid, but the emulated disk is not and all the data can be recovered by a rebuild.
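To make the parity argument above concrete, here is a minimal sketch in Python. It assumes plain single XOR parity over equal-sized blocks; the disk contents and the function names (`xor_blocks`, `update_parity`, `emulate_disk`) are made up for illustration and are not unRAID's actual code. It shows a write that never reaches the physical drive but is still reflected in parity, so the emulated disk holds the new data while the physical disk is stale.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def update_parity(parity, old_data, new_data):
    """Swap a disk's old contribution to parity for its new one."""
    return xor_blocks([parity, old_data, new_data])


def emulate_disk(parity, other_disks):
    """Reconstruct the missing disk's block from parity and the remaining disks."""
    return xor_blocks([parity] + list(other_disks))


# Three hypothetical data disks, one 4-byte block each (illustrative values).
disks = [b"\x01\x02\x03\x04", b"\x10\x20\x30\x40", b"\xaa\xbb\xcc\xdd"]
parity = xor_blocks(disks)

# A write to the first disk fails on the physical drive (assumption for this
# sketch: parity is still updated, as described in the post above).
new_data = b"\xff\xee\xdd\xcc"
parity = update_parity(parity, disks[0], new_data)
stale_physical = disks[0]  # the drive itself never received the write

# The emulated disk has the new data; the physical disk does not.
emulated = emulate_disk(parity, disks[1:])
print(emulated == new_data)        # emulated disk matches the intended write
print(stale_physical == new_data)  # physical disk is out of date
```

In this toy model, trusting the physical disk again without a rebuild would silently bring back the stale block, which is why the rebuild from the emulated contents is needed.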
April 1, 2016, Author

Thanks trurl for the clarification. It makes sense that once the writes failed or were missed, the data on the drive became unreliable.

The preclear on the drive is almost complete and it looks like the drive is still good. I'll put it back in and let it rebuild.

I'm convinced that it was a loose cable or something loose in or on the mobo.

Yeah, I freaked out a little over two HDD failures in three days when I hadn't had any in the last 3 years. Anyway, I jumped on the E5-2670 bandwagon that is getting a ton of attention over in the good deals section: https://lime-technology.com/forum/index.php?topic=46077.0

I figured it's time to retire the budget hardware and upgrade to more reliable server grade parts :-)