July 4, 2008
I didn't read far enough into the thread on the "Invalid Configuration" bit, where it had the workaround to force unRAID out of that state. Instead, I de-allocated the parity device, then re-allocated it and kicked off a parity build. The parity build logged 465 errors.
So, given the attached SMART log and syslog, do I have some corrupted files hiding somewhere, or am I in good shape? If I do have some corrupt files, it is not a big deal, but it would be nice to know which ones are corrupt.
Thanks, SirWired
July 4, 2008
You have a bad or failing disk: Disk 1 - sdc, the WD with serial number ending in 071. The SMART report indicates a Pending Sectors value of 60 (very bad, should be zero), and UNC errors consistent with that and with the disk errors logged in your syslog. These disk errors are causing the parity sync errors. ("UNC" errors correspond to "UNCorrectable Error in Data", as the smartctl man page states; quoted from a post.)
Unfortunately, I don't know a way to identify which files are involved in those bad sectors. I personally would not trust any of the files on this disk. If you can recover the files from another source, that would be better. Then RMA this drive; your SMART report should help with the process.
It is possible to continue using the drive, perhaps in a Windows machine, AFTER running either a SpinRite pass on it or a SMART long test. But I don't recommend it. You can't really trust it, and you probably don't know the cause of the bad sectors. Is it possible it was in use during a serious power spike, or a sudden loss of power during writes to it?
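As a rough illustration of the check described above, here is a minimal Python sketch that scans a SMART attribute table for the sector counters that should be zero. The sample text is a hypothetical excerpt laid out the way `smartctl -A` prints attributes (only the pending-sector count of 60 comes from this thread), and the function names are made up for the example.

```python
import re

# Hypothetical excerpt in the layout `smartctl -A` prints. Only the raw
# value 60 for Current_Pending_Sector is taken from the thread; the rest
# is illustrative.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0012   200   200   000    Old_age   Always       -       60
198 Offline_Uncorrectable   0x0010   200   200   000    Old_age   Offline      -       0
"""

# Counters that should normally read zero on a healthy drive.
WATCHED = ("Reallocated_Sector_Ct", "Current_Pending_Sector", "Offline_Uncorrectable")


def raw_values(report):
    """Return {attribute name: raw value} for the watched attributes."""
    values = {}
    for line in report.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns.
        if len(fields) >= 10 and fields[0].isdigit() and fields[1] in WATCHED:
            match = re.match(r"\d+", fields[9])
            if match:
                values[fields[1]] = int(match.group())
    return values


def nonzero_attributes(report):
    """Return only the watched attributes whose raw value is not zero."""
    return {name: value for name, value in raw_values(report).items() if value != 0}


print(nonzero_attributes(SAMPLE))
```

On a real system the report text would come from running smartctl against the drive; that step is left out here so the sketch stays self-contained.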
July 5, 2008
sirwired - Sorry you are having problems. It does indeed look like your drive is misbehaving. I would certainly try replacing the SATA cable, and double-check your backplanes to make sure you are not having connectivity problems. If you find something that looks suspicious, try rebuilding parity again. Maybe you'll have better luck.
But all indications are that your drive has indeed encountered several nasty read errors and is waiting to remap a number of sectors. The bad sectors seem to be close to each other, implying some catastrophic event caused the problem on a very localized part of the disk. This also likely means that only a very small number of files (maybe just one) are corrupted. The number of bad sectors compared with 750G of space is very small. I'd say the vast majority of your data is clean, but going back to a known "perfect" data state is certainly preferred if you have it.
I would recommend copying all the data off this drive ASAP. Monitor it closely, as you will likely get errors reading certain files. These are the files that are corrupted. If a file copies 100% clean, I believe you can trust the data in that file. (Someone correct me if you don't agree.) After you have copied all the data to another drive, rebuild your parity WITHOUT this drive and RMA the drive.
If RMA isn't an option, you COULD try adding the drive back to the array, which would result in unRAID writing binary zeros to the entire drive. This will force sector remaps for all the pending bad sectors and likely get the drive behaving again. But with so many bad sectors I'd have trouble trusting the drive and would try to get it replaced, unless I was interested in experimenting.
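The copy-and-watch-for-errors advice above could be sketched roughly like this in Python. This is only one way to do it, not something from the thread: the function name is hypothetical, and it simply records every file whose copy raised an I/O error so you end up with a list of suspect files.

```python
import os
import shutil


def copy_tree_logging_errors(src_root, dst_root):
    """Copy every file under src_root into dst_root.

    Returns a list of (source path, error message) pairs for the files
    that could not be read or copied, instead of stopping at the first one.
    """
    failed = []
    for dirpath, _dirnames, filenames in os.walk(src_root):
        # Recreate the same directory layout under the destination.
        rel = os.path.relpath(dirpath, src_root)
        target_dir = os.path.join(dst_root, rel)
        os.makedirs(target_dir, exist_ok=True)
        for name in filenames:
            src = os.path.join(dirpath, name)
            try:
                shutil.copy2(src, os.path.join(target_dir, name))
            except OSError as exc:
                # A read error on a bad sector surfaces here as an OSError.
                failed.append((src, str(exc)))
    return failed
```

Usage would be something like `copy_tree_logging_errors("/mnt/disk1", "/mnt/disk2")`, where both paths are assumptions about where the failing disk and the destination disk are mounted. Any path in the returned list is a file to re-create from another source.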
July 5, 2008
Quote: "I would recommend copying all the data off this drive ASAP. Monitor it closely, as you will likely get errors reading certain files. These are the files that are corrupted. If a file copies 100% clean, I believe you can trust the data in that file. (Someone correct me if you don't agree.) After you have copied all the data to another drive, rebuild your parity WITHOUT this drive and RMA the drive."
If you remove this drive to RMA it, and do not intend to replace it with another, the only way to rebuild parity without it is to use the (poorly named) "restore" button to delete the existing disk configuration file and replace it with one based on the remaining working and assigned drives. This will rebuild parity WITHOUT the misbehaving drive.
Quote: "If RMA isn't an option, you COULD try adding the drive back to the array, which would result in unRAID writing binary zeros to the entire drive. This will force sector remaps for all the pending bad sectors and likely get the drive behaving again. But with so many bad sectors I'd have trouble trusting the drive and would try to get it replaced, unless I was interested in experimenting."
If the drive already has a reiserfs file system (and it does), I don't think it will get zeroed out. It will just be added back in, files and all.
Joe L.
July 5, 2008 (Author)
Well, I ran the thing through the WD "Data Lifeguard Tools", and... "She's Dead, Jim". First, it popped up with a "Read Element Failure" and exhorted me to run a full media scan. This took three and a half hours, and it said things were all fixed now. I ran the Quick test again, and it immediately failed again.
Booting unRAID back up with a parity check showed even more errors, this time "fixed" by my possibly bogus parity.
Can I tell from my earlier syslog whether it was able to recover the data during the parity build? If so, I can probably trust my data. If not, there is no point in doing a "Recover", since I don't have valid parity to build from.
I have RMA'd the drive, and WD is pretty darn prompt with sending out replacements.
SirWired
July 5, 2008
I think you are a tiny bit confused about the recovery of data. The process of pressing the "recover" button renamed the system.dat file on your flash drive and therefore forced your server to rebuild a new system.dat file and then re-compute parity based on the assigned drives.
The errors that occurred during your parity build were errors in reading one of your data drives. Normally, if unRAID had a valid parity drive, the parity drive in combination with the other drives would be used to return the block of data that could not be read. Since you did not have a valid parity drive, and in fact were trying to populate it, there is NO WAY for any data on the bad data drive to have been fixed by reading parity and the other data drives.
Your SMART data indicated over 60 reads that had failed, with those sectors marked for re-allocation if they were subsequently written. That has not occurred, since you have not rewritten those same blocks.
You will need to give us a better chronological description of events, so we can learn when you pressed "Restore", when you started the array, and when (or if) you typed the "mdcmd set invalidslot" command, before we can figure out anything more.
Can you see any files on the defective disk1 (/dev/sdc)? That would give some more clues. Can you tell us more about the sequence of events? Is the drive physically removed from your array? What does your main unRAID status page look like? Are any drives showing anything other than green? Joe L.
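To make the parity argument above concrete, here is a minimal Python sketch of single-parity (XOR) reconstruction. It is a toy model under stated assumptions, not unRAID's actual code: the disk contents are made up, and a failed read during the parity build is modelled as returning a zero byte.

```python
from functools import reduce


def xor_parity(blocks):
    """Byte-wise XOR of equally sized blocks (one block per disk)."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def reconstruct(parity, surviving_blocks):
    """Rebuild the missing disk's block from parity plus the surviving disks."""
    return xor_parity([parity, *surviving_blocks])


# Made-up contents for three data disks.
disk1_true = b"\x10\x20\x30\x40"
disk2 = b"\x01\x02\x03\x04"
disk3 = b"\xaa\xbb\xcc\xdd"

# Case 1: parity was valid before disk1 went bad, so disk1 can be rebuilt.
good_parity = xor_parity([disk1_true, disk2, disk3])
assert reconstruct(good_parity, [disk2, disk3]) == disk1_true

# Case 2: parity is built WHILE disk1 is returning bad data for one byte
# (assumption: the failed read is modelled as a zero byte).
disk1_as_read = b"\x10\x00\x30\x40"
bad_parity = xor_parity([disk1_as_read, disk2, disk3])

# Rebuilding from that parity reproduces the bad read, not the real data.
rebuilt = reconstruct(bad_parity, [disk2, disk3])
assert rebuilt == disk1_as_read
assert rebuilt != disk1_true
```

The second case is the situation in this thread: parity computed from a failed read can only hand back whatever was read at the time, so it cannot repair those sectors.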
July 5, 2008
From his original description, I think what sirwired did was something like this: started the array with no parity, stopped the array, assigned the parity disk, and started the array again. Parity build ensued. I've never done that myself, but it should be roughly the equivalent of pressing restore and starting the array. I don't think he typed in the set invalidslot command.
I am sorry that this happened. I had been recommending that users run a parity check IN THE OLD UNRAID VERSION before upgrading. (I do this as a matter of course before doing any disk activity or OS upgrade.) If sirwired had done this, it is very likely that the errors he encountered would have happened while parity protection was in place, and 100% data recovery would have been possible with unRAID.
Now I think there are three options -
1 - Establish the array without the failing disk, and rebuild parity. When the new disk comes back, just add it to the array as an empty disk. (This assumes that the data from the failed disk can be, or has been, recovered in some other way.)
2 - Remove the failed disk from the array, but DON'T rebuild parity. Since unRAID completed the parity build (albeit with errors) with the failing disk in place, unRAID will still simulate the removed disk. The simulated disk won't be a perfect reflection of the original, due to the errors while calculating parity. But 60 bad sectors is not a disaster, so although some data loss is likely, the vast majority would be good. When the RMA'ed disk is returned, sirwired can put it back in and unRAID would rebuild the disk.
3 - Remove the parity disk from the array, and mount the failed disk in one slot and a disk that has sufficient capacity in another. LEAVE THE PARITY SLOT EMPTY. (Be careful: it is easy to forget that parity is at the top and accidentally assign a data drive to the parity slot! You might just want to use the BOTTOM two disk slots instead of the top two to avoid this easy-to-make and deadly mistake.) When you start the "array", you can copy the data from disk to disk. Once complete, go to step 1.
sirwired, if you need any help with doing this, please post back and I or someone else will be able to help. But I sense from the lack of panic that you have this data backed up, or will be able to reconstruct it relatively easily, so you will go with step 1.
Good luck!
July 8, 2008 (Author)
Bjp has it exactly right. I unassigned parity, started, stopped, and then assigned parity again. I didn't do any CLI work. (I needed to read about two pages farther in the thread, where Tom posted the solution to the problem.)
Yeah, in hindsight, I completely screwed up... I should have back-leveled and run a parity check prior to destroying my parity.
Since it looks like there is no way of telling which bad blocks belong to which files, I will probably choose option 1. The data on the disks consists of backups (already freshly re-generated onto the good data disk) and a lot of DVD .iso files. (This unit's primary use is as a DVD server for my Popcorn Hour.) Since I own all the DVDs, regenerating those will be tedious but not the end of the world.
I'm kicking myself in the ass for screwing this up... I do enterprise storage support for a living (as in, support for storage environments measured in significant fractions of a petabyte), and one thing we try to get our customers to understand is that this is one of the primary weaknesses of RAID 3/5 (and a reason they shouldn't be using it for their truly mission-critical data). The most common cause of data loss with RAID 3/5 is a "stripe kill", which is where you lose one disk and then discover during the rebuild that one of your remaining disks has some bad tracks, so you can't trust your data anymore. (In enterprise environments, data loss is almost always preferred to data corruption, because data loss is easily found and fixed by restoring from backup. Silent data corruption gives storage admins nightmares...) Most enterprise arrays actually stop the RAID rebuild in its tracks if it encounters an unreadable track.
It's all kinds of fun trying to convince customers that data loss is significantly likely if they lose a drive in some god-awful 30-drive RAID 5 SATA array; we tell them to max out at about 15 drives, and that is with server Fibre Channel drives...
With consumer-level drives (which is mostly what unRAID users use), I wouldn't go any more than eight unless it was data I could afford to lose. unRAID does make data loss a lot less painful, since you only lose the files on a single drive, while the remaining drives stay 100% readable and usable.
SirWired
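The drive-count argument above can be put into rough numbers with a back-of-the-envelope model. The Python sketch below is not from the thread. It assumes an unrecoverable read error rate of 1 per 1e14 bits read (a figure often quoted on consumer drive spec sheets, used here purely as an assumption), treats errors as independent, and uses 750 GB drives like the one discussed. Rebuilding one failed drive means reading every remaining drive in full, so the chance of hitting at least one unreadable sector is 1 - (1 - rate)^(bits read).

```python
import math


def p_unreadable_during_rebuild(total_drives, drive_bytes, ure_per_bit=1e-14):
    """Chance of at least one unreadable sector while rebuilding one drive.

    Assumes every remaining drive is read end to end and that read errors
    are independent with probability ure_per_bit per bit (an assumption).
    """
    bits_to_read = (total_drives - 1) * drive_bytes * 8
    # 1 - (1 - r)**bits, computed with log1p/expm1 to avoid rounding loss.
    return -math.expm1(bits_to_read * math.log1p(-ure_per_bit))


DRIVE_BYTES = 750 * 10**9  # 750 GB drive, as in this thread

for drives in (8, 15, 30):
    chance = p_unreadable_during_rebuild(drives, DRIVE_BYTES)
    print(f"{drives} drives: {chance:.2f}")
```

Under these assumptions the risk climbs quickly with drive count, which is the shape of the argument even if real-world error rates differ from the spec-sheet figure.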