November 9, 2013 Hi all, I'm running version 5.0-beta14. I'm currently having a problem with drives erroring. A drive red-balled on me about a week ago, so I swapped it out for a brand new drive and rebuilt the array with no problems. Since then, however, other drives have reported errors, and I now have a drive showing a red ball with this message: md: disk3: ATA_OP e0 ioctl error: -5 Does anybody know the reason for this? All the drives that failed are in the same backplane. I've tried using direct SATA connectors instead of the breakout cable, but it made no difference. What could be the issue here? Could a faulty backplane be causing this, or is it much more likely that the drives are actually all dying at the same time?
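For anyone chasing the same message: the trailing -5 is a negated Linux errno, and errno 5 is EIO (a generic I/O error), so the line says the command to the drive failed rather than why it failed. Below is a minimal Python sketch, not an unRAID tool, that pulls such lines out of a syslog excerpt and counts them per disk. The regular expression is an assumption based only on the one line quoted in the post, and the second sample line is hypothetical.

```python
import errno
import re
from collections import Counter

# Assumed format, modelled on the single line quoted in the post:
#   md: disk3: ATA_OP e0 ioctl error: -5
MD_ERROR = re.compile(r"md: disk(\d+): ATA_OP ([0-9a-f]{2}) ioctl error: (-?\d+)")


def parse_md_errors(lines):
    """Return (disk_number, ata_opcode, errno_name) for each matching line."""
    found = []
    for line in lines:
        match = MD_ERROR.search(line)
        if match:
            disk = int(match.group(1))
            opcode = match.group(2)
            code = abs(int(match.group(3)))  # kernel reports errno negated
            found.append((disk, opcode, errno.errorcode.get(code, "UNKNOWN")))
    return found


sample = [
    "md: disk3: ATA_OP e0 ioctl error: -5",  # the line from the post
    "md: disk5: ATA_OP e0 ioctl error: -5",  # hypothetical second drive
    "some unrelated syslog line",
]
errors = parse_md_errors(sample)
per_disk = Counter(disk for disk, _, _ in errors)
```

If the counts spread across several disks that share a cable, adapter, or backplane, that points away from the individual drives.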
November 9, 2013 In my experience, when several drives appear to fail simultaneously, it most frequently seems to be power related: either the cabling or the power supply itself. In such scenarios it is not unusual for one drive to take down all drives on the same adapter. I have also seen such symptoms with a badly seated adapter board. I assume that in such a case vibration might cause a momentary interruption in signals to the adapter board.
November 9, 2013 Hi itimpi, thanks for the reply. Is there any recommended way for me to test this? Or are you saying that whatever the problem is may well have 'killed' the other drives too, so I should just check all the cabling and replace all the drives? I don't want to put another drive in only to have it killed as well, without first fixing the underlying problem. I just can't see how I can single out the cause. Edit: Just caught the edit! I will go and check all the connectors and boards to make sure they're all properly seated. Thank you.
November 9, 2013 It is likely that the drives themselves have not failed, just that they have dropped offline for an instant. The moment unRAID gets a write failure it will red-ball the drive. Even if you subsequently fix the issue that caused the red ball, the drive will remain red-balled until you take some action to tell unRAID that the drive contents are OK.
November 9, 2013 How do you tell unRAID the drive is OK without doing a new config setup? It's something I've never managed to find.
November 9, 2013 The only way I know of is to stop the array; unassign the drive; start the array without the drive assigned; stop the array again; assign the drive; and then restart the array to rebuild the drive from the other drives and parity. If you are reasonably certain the drive is OK, then the new config approach is normally the fastest. I often take a hybrid approach: I use the first approach to get the disk invalidated, but then use a spare disk for the rebuild instead of the one unRAID has just red-balled. That means I have the original disk (which is probably OK) as a fallback for recovering data in the (unlikely) case where the rebuild fails. If the rebuild succeeds, I then put the removed disk through a preclear_disk.sh cycle to check that it really is OK. While this is the safest approach, it does require you to have a spare drive available.
November 9, 2013 Remember, though, that the drive was taken offline WHEN A WRITE TO IT FAILED. Therefore it is guaranteed that its contents are not correct (remember... a WRITE TO IT FAILED!!!!). Your best bet is to reconstruct it. If you force it back online, you should perform a file-system check at the very least. Remember: a WRITE TO IT FAILED, and the file system might be corrupt.
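To summarize the options discussed in the last few posts, here is a rough Python sketch that returns the ordered steps for each approach. It is only a checklist helper: the function name and its two parameters are made up for illustration, the step wording is paraphrased from the posts above, and nothing in it talks to unRAID.

```python
def choose_recovery(has_spare_disk, trust_disk_contents):
    """Return the ordered steps for one of the approaches described above."""
    # Steps that make unRAID treat the red-balled slot as needing a rebuild.
    invalidate = [
        "stop the array",
        "unassign the red-balled drive",
        "start the array without the drive",
        "stop the array again",
    ]
    if has_spare_disk:
        # Hybrid approach: rebuild onto a spare, keep the original as fallback.
        return invalidate + [
            "assign the spare disk to the slot",
            "start the array to rebuild onto the spare",
            "keep the original disk aside as a fallback",
            "after a good rebuild, run preclear_disk.sh on the original",
        ]
    if trust_disk_contents:
        # Fastest route, but a write did fail, so check the file system.
        return [
            "use the new config approach",
            "run a file-system check on the drive",
        ]
    # Default: rebuild the same drive from parity and the other drives.
    return invalidate + [
        "assign the same drive again",
        "start the array to rebuild it from parity and the other drives",
    ]


for step in choose_recovery(has_spare_disk=False, trust_disk_contents=False):
    print(step)
```

The spare-disk branch is checked first because it was described as the safest option whenever a spare is available.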
November 9, 2013 I don't think Joe emphasized enough that "A WRITE TO IT FAILED"!! 8) In other words, it's virtually guaranteed that there's at least one bad sector of data on the drive. You indicated that you've had other errors in the past week. Those were (I assume) just read errors, since you didn't have additional red balls, which means unRAID wrote that data back okay. But it also indicates that you're definitely having some issue with your power, your cables, or (since the drives are all in the same backplane) that specific backplane. There's no "magic bullet" method of isolating which of those is the issue. You've already tried using SATA cables instead of a breakout connector, but was that to the backplane or directly to the drives (removed from the backplane)? From what you've outlined, I'd try removing the drives from that backplane. Also, post the details of your configuration. And while you're at it, upgrade to v5.0; you're running a very early beta release.
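There is no automatic way to isolate the fault, but writing down which shared parts each drive depends on can at least narrow the list. The Python sketch below is purely illustrative: the wiring map, disk names, and component labels are all hypothetical, and you would fill them in from your own build. It returns the components that every failing drive has in common.

```python
# Hypothetical wiring map: the shared parts each disk depends on.
WIRING = {
    "disk1": {"backplane": "A", "psu_rail": "1", "controller": "onboard"},
    "disk2": {"backplane": "A", "psu_rail": "1", "controller": "hba"},
    "disk3": {"backplane": "A", "psu_rail": "2", "controller": "hba"},
    "disk4": {"backplane": "B", "psu_rail": "2", "controller": "hba"},
}


def shared_suspects(failing, wiring):
    """Return sorted (kind, value) pairs common to every failing disk."""
    common = None
    for disk in failing:
        parts = set(wiring[disk].items())
        # Intersect so only components shared by all failing disks remain.
        common = parts if common is None else common & parts
    return sorted(common or [])


suspects = shared_suspects(["disk1", "disk2", "disk3"], WIRING)
print(suspects)
```

With this made-up map the only part common to all three failing disks is backplane A. A shared part is only a suspect, not proof, so moving one drive off it is still the real test.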
November 11, 2013 Reconstructing the data from parity is the only way I know of to get a red-balled drive back. Is there another way? (The first way is still the safest, but is there a quick way to get it back up, such as forcing it back online?)