August 31, 2008
If a user wants to upgrade a disk, unRAID's data redundancy works quite well and allows it to be rebuilt perfectly. But if a drive fails, how likely is it that parity will be able to weather the failure storm and rebuild the drive to its "prefailed" state? If a failing drive responds to a read request with bad data (binary zeros or corrupted data), parity would be instantly updated and the ability to recover that block lost. If it failed during a parity check, this effect could be multiplied and the ability to recover anything of value impacted.
So perhaps the fact that unRAID does an excellent job of maintaining parity "during the good times" is not enough. In order to be effective, parity has to be left in a recoverable state AFTER a drive failure occurs in the system. That's the point, right? At the heart of this issue is the decision to trust the data drives over the parity drive. In most situations this is a good assumption, and it helps keep unRAID working smoothly with minimal user intervention. But after considering failure scenarios, I am not convinced that this is the correct behavior model.
I know so little about real drive failures. I know that normally they just won't power up. But for drives that fail in more creative ways, I'm not sure what they do. Perhaps someone who knows can say with confidence that drives will not send back errant data even as they fail. But after reading these forums for a while and seeing the screwy things that happen in the real world, I am thinking that this issue may represent a big risk to drive recovery.
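To make the concern in the post above concrete, here is a minimal sketch, assuming single-drive parity computed as a plain XOR across the data drives. The block contents, the `xor_blocks` and `rebuild` helpers, and the "drive silently returns zeros" failure are all hypothetical and chosen only for illustration; this is not unRAID's actual code.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, one byte column at a time."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

def rebuild(surviving_blocks, parity):
    """Reconstruct the single missing block from the survivors plus parity."""
    return xor_blocks(list(surviving_blocks) + [parity])

# Hypothetical 4-byte "blocks" at the same offset on three data drives.
disk1 = b"\x11\x22\x33\x44"
disk2 = b"\xaa\xbb\xcc\xdd"
disk3 = b"\x01\x02\x03\x04"
parity = xor_blocks([disk1, disk2, disk3])

# The good times: losing disk2 entirely is fully recoverable.
assert rebuild([disk1, disk3], parity) == disk2

# The worry: disk3 silently returns zeros during a correcting parity check,
# and parity is rewritten to agree with what the data drives reported.
bad_read = bytes(4)
poisoned_parity = xor_blocks([disk1, disk2, bad_read])

# A later rebuild of disk3 now reproduces the zeros, not the original block.
recovered = rebuild([disk1, disk2], poisoned_parity)
```

The sketch only shows that parity is exactly as trustworthy as the last data that was folded into it. Whether a real drive can hand back wrong data without reporting an error is a separate question.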
August 31, 2008
I lost data on a good drive because an empty drive had gone bad and was causing parity check failures. The drive that went bad didn't get failed; instead, a different drive that was slow to spin up showed as failed in the interface. I very happily replaced that drive and let it be rebuilt. It wasn't until AFTER that rebuild had completed successfully, according to the interface, that I started running parity checks, and they wouldn't complete. I pulled the drive that showed failed, and it passed a full SMART test on another machine, so I put it back into service. However, instead of pressing the restore button, I let it rebuild. Now I have 2 drives with almost identical contents, neither of which is the same as the original, but so close as to be maddening. That is 400GB of worse than lost data, because I can see the files and read some of them. I have since removed the ACTUAL bad disk, and parity checks come up clean every time. Too late for my data though. Now I am left to sift through the contents of 2 drives and attempt to figure out the most correct versions among the duplicates.
Now for the punchline... This data was the remnants of a prior recovery from a failed RAID5 array, so I already couldn't trust its integrity, but now I have twice as much to sift through searching for any good data. At this point I am very tempted to write it off.
August 31, 2008
It has been a long time since drives could return bad data. That is what sector CRCs are for: to ensure that every sector is returned perfectly. We used to talk in terms of hard and soft disk errors, where soft errors were those with a bad CRC or other sector format problem. But nowadays, those are all hidden and handled transparently. If the drive cannot reconstruct the sector contents perfectly, then some type of read error is returned, NOT the data. From unRAID's perspective, I don't see anything different between upgrading a drive and replacing a failed drive. Both should be safe operations. There are enough things to worry about, let's not make up more, OK?
I can't say I have a handle on jonathanm's situation yet; maybe I need to read it through a few more times. It does sound like you need to copy off whatever seems good and intact, then wipe the drive and start over. I think I would first select and move off everything that looked intact and worth saving, then make a second pass to select everything else I really want to keep but whose integrity I am unsure of, and save that into a separate archive for later review. Then reformat the drive.
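The "read error, NOT the data" behaviour described above can be illustrated with a toy sector format. This is only a sketch under stated assumptions: real drives use on-disk ECC in firmware rather than a bare CRC-32, and `write_sector`, `read_sector` and `SectorReadError` are hypothetical names invented for this example.

```python
import zlib

class SectorReadError(IOError):
    """Raised in place of returning a payload that fails its checksum."""

def write_sector(payload: bytes) -> bytes:
    """Store the payload followed by its 4-byte CRC-32."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")

def read_sector(raw: bytes) -> bytes:
    """Return the payload only if the stored CRC still matches it."""
    payload, stored_crc = raw[:-4], int.from_bytes(raw[-4:], "big")
    if zlib.crc32(payload) != stored_crc:
        raise SectorReadError("sector failed CRC check; no data returned")
    return payload

# A healthy sector reads back exactly what was written.
sector = write_sector(b"movie data block")
assert read_sector(sector) == b"movie data block"

# Flip one bit on the "platter": the read fails loudly instead of
# handing corrupted bytes to the OS.
damaged = bytearray(sector)
damaged[3] ^= 0x01
try:
    read_sector(bytes(damaged))
    got_error = False
except SectorReadError:
    got_error = True
```

In this model a bad sector surfaces as an error the OS can act on, which is the case the post argues is the normal one.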
September 1, 2008
Quote: "If a failing drive responds to a read request with bad data (binary zeros or corrupted data), parity would be instantly updated"
Forgive my ignorance, perhaps I've just read it wrong, but that statement seems to imply that parity is updated when data is read from a data drive within the array. I thought parity was only updated when data is written to a drive (excluding forced parity checks, of course). My parity drive is usually spun down while the data drives are still spun up, which would make the particular scenario you describe impossible, wouldn't it?
September 1, 2008 (Author)
Quote: "I thought parity was only updated when data is written to a drive (excluding forced parity checks, of course)."
You are 100% correct. The only time parity would be updated is on a write. My hypothetical was incorrect. But the situation I outlined could definitely happen in a parity check and, depending on which drive was failing, during a write operation.
Quote: "It has been a long time since drives could return bad data. That is what sector CRCs are for: to ensure that every sector is returned perfectly. [...] If the drive cannot reconstruct the sector contents perfectly, then some type of read error is returned, NOT the data."
Thanks for that feedback. It is reassuring to know that, by design, drives are prevented from returning inaccurate data, and I am not trying to raise an alarm unnecessarily. Certainly a drive can fail in any of an infinite number of ways, and at least some of those failures could result in sending back bad data. But if I am hearing your message correctly, when drives fail in the normal ways that drives fail, the drive electronics will not let them spew garbage back to the OS and corrupt parity. I still think I would like a way to run a non-correcting parity check to keep Murphy at bay.
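As a rough illustration of what a non-correcting parity check could look like, here is a minimal sketch, again assuming single XOR parity. The `parity_check` function, its `correct` flag, and the tiny in-memory "disks" are hypothetical and are not unRAID's interface.

```python
def xor_bytes(blocks):
    """XOR a list of equal-length byte blocks into one block."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, value in enumerate(block):
            out[i] ^= value
    return bytes(out)

def parity_check(data_disks, parity_disk, correct=False):
    """Compare stored parity with parity recomputed from the data disks.

    Returns the block numbers that disagree. Parity is rewritten only
    when correct=True; otherwise the check is read-only.
    """
    mismatches = []
    for n in range(len(parity_disk)):
        expected = xor_bytes([disk[n] for disk in data_disks])
        if expected != parity_disk[n]:
            mismatches.append(n)
            if correct:
                parity_disk[n] = expected
    return mismatches

# Two hypothetical data disks with two blocks each, plus matching parity.
disks = [[b"\x01\x02", b"\x03\x04"], [b"\x10\x20", b"\x30\x40"]]
parity = [xor_bytes([disk[n] for disk in disks]) for n in range(2)]

# Simulate one disk returning zeros for block 1.
disks[1][1] = b"\x00\x00"

# Non-correcting pass: the mismatch is reported and parity is left alone,
# so the original block could still be rebuilt from it.
report = parity_check(disks, parity)
```

Defaulting `correct` to `False` reflects the request in the post: the user sees which blocks disagree and decides whether the data drives or the parity drive should be trusted before anything is overwritten.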
September 8, 2008
I lost data a while ago by fumbling with the interface and choosing badly. I think things are now much clearer, and had I been smart I'd have made an image of the good disk I was putting back in after an issue getting a replacement drive working. That issue was bizarre, to say the least, and not expected. I also lost data due to ReiserFS corruption. I do not recall the circumstances, but I'd lost power to my system a few times or had been forced to shut it down with the power switch. I began seeing some weirdness accessing some files, and sure enough an fsck found problems and a small amount of data was lost.
That said, I have had at least two HDDs die on me. They powered up but threw errors that unRAID detected, and the drives were locked out. Pretty normal drive failures, all things considered. I swapped in new drives, one after a week's wait, and the data restored fine. I have two unRAID servers, one IDE and one SATA, with a total of over 20 disks spinning, and I consider my data pretty safe. I do have an offline backup of my music collection, but my DVDs are too big to back up, so unRAID is my protection and I sleep fine at night.