Arioch5 Posted August 5, 2012

A few months ago (early June) I had a large number of parity errors. Eventually the process ended with running reiserfsck and losing some files. Fortunately, I had some off-site backups I was able to restore from.

Today I found a file on my system inaccessible. The Unraid page shows that the last parity check (8/1/12) had a couple thousand parity errors, which I assume were corrected. I'm re-running the parity check today, but I'm very concerned that the drives remain online yet I seem to be regularly failing parity checks, and it looks like I'm about to have more data loss from file system corruption.

I've attached the syslog and SMART reports from the parity disk and the (currently) affected disk. Both failures have now been reported on this disk, though it contains the bulk of the data, so parity problems would likely show up as errors on this disk.

Any help would be greatly appreciated. I've had a disk failure before, but this is the first time in some 6 or 7 years of Unraid that I've had these file system corruptions.

I'll report on the parity check status, as well as a report on disk4's reiserfs, later. This parity check will take 7-8 hours to finish.

Here is a pastebin of the end of the syslog (full log attached): http://pastebin.com/3UGEcd2i

Attachments: syslog.zip, smart-parity.txt, smart-disk4.txt
dgaschk Posted August 5, 2012

Parity may have a bad or loose cable. Disk4 has a lot of duplicate files. Is this intentional? The parity SMART report shows errors and the disk4 report is clean.
Arioch5 Posted August 5, 2012

I'm not sure what the duplicate files are about, but looking at what's duplicated, those are the files I restored after the last file system error.

I thought the parity looked suspicious. What's the proper plan of attack from here? My wife isn't going to be too happy if I let this repeat in the next 60 days. I'm guessing:

1. Let the parity check finish
2. Replace the parity disk cable
3. Run a long SMART test on the parity disk?

Am I on the right track for the best shot at keeping my files? Should I worry about the duplicates now, or wait until the parity has no sync errors first?
Arioch5 Posted August 5, 2012

Sitting at ~60% complete on the parity check and I've already got 2150 sync errors reported. Attached is the log since I rebooted and started the sync today.

I noticed this problem this morning because a movie my daughter was watching stopped playing halfway through. When I pulled it up on my desktop I couldn't even open the movie. I just checked and I can now play (at least the beginning of) the movie. That could be due to the reboot or the parity check; I'm not sure.

I'm not clear as to whether this parity check is doing a 'no-repair' or a 'repair' parity check. I'm running version 4.7.

I'd like to go ahead and cancel the parity check, replace the cable on the parity drive, start a smartctl test and get those results. Is that something I can do right now? I'd like to handle as much of this today as I can, since the server will have to be offline Monday while I'm at work and my family has become used to having access to their movies. We don't even have a DVD player in front of most of our TVs anymore.

Attachments: syslog-restart.txt
Arioch5 Posted August 5, 2012

I've replaced the cable on the parity drive. It was not obviously loose, but this at least eliminates the cable as a suspect. The brand new cable was sitting in the closet in its original bag, although it doesn't have the clip to hold it in like the old cable. I also pushed on all the cables at both ends to ensure a tight fit while I was there.

Attached are SMART reports from before and after the cable change for the parity disk and disk4. None of the SMART tests come up with errors, and I see no change before and after the cable change. I'm now running a long test on both disks in question.

Here's my guess from here. Is this accurate?

If one SMART test fails (yea!):
1. Replace the failed drive
2. Rebuild parity

If neither SMART test fails:
1. Run reiserfsck on disk4 (not parity!)
2. Wait for further instructions?

I'm unclear about how reiserfsck works and wonder if I messed up doing this a few months ago. Since the parity drive is still in use while reiserfsck runs, if there was a bad cable or a failing parity drive, wouldn't Unraid 'correct' the disk4 data with what's on the parity, causing reiserfsck to call for a repair when really the parity was at fault? I'm unclear what to do about a file system corruption when there are sync errors with the parity drive.

It seems to me that at some point here I'm going to have to call one of the drives 'good' and the other 'bad'. Since last time I just repaired disk4 (and files vanished), and the SMART report on parity has more questionable items, I'd lean toward calling parity wrong. How does that fit into the reiserfsck run and my next steps?

Thanks for the help. I thought I had this licked, so I just want to be sure I've got some advice this time around to get it done right. I really don't want to be here on the 1st of September because my monthly parity check has failed.

Attachments: smart-disk4-pre.TXT, smart-disk4-post.TXT, smart-parity-pre.TXT, smart-parity-post.TXT
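The "which drive do I call good?" question above can be illustrated with a toy model. The sketch below assumes plain single XOR parity over two made-up data disks; the names `disk1`, `disk4`, the byte values, and both helper functions are hypothetical and are not taken from Unraid's implementation. It only shows that a sync error pinpoints where data and parity disagree, not which disk returned the wrong byte.

```python
from functools import reduce
from operator import xor


def compute_parity(data_blocks):
    """XOR the bytes at each offset across all data disks."""
    return bytes(reduce(xor, column) for column in zip(*data_blocks))


def sync_errors(data_blocks, parity_block):
    """Return the byte offsets where stored parity disagrees with the data."""
    expected = compute_parity(data_blocks)
    return [i for i, (p, e) in enumerate(zip(parity_block, expected)) if p != e]


# Hypothetical 4-byte "disks"; real disks are just much longer byte strings.
disk1 = bytes([0x10, 0x20, 0x30, 0x40])
disk4 = bytes([0x0A, 0x0B, 0x0C, 0x0D])
parity = compute_parity([disk1, disk4])

# Case A: the parity disk returns a wrong byte at offset 2.
bad_parity = bytearray(parity)
bad_parity[2] ^= 0xFF
errors_a = sync_errors([disk1, disk4], bytes(bad_parity))

# Case B: disk4 returns a wrong byte at the same offset instead.
bad_disk4 = bytearray(disk4)
bad_disk4[2] ^= 0xFF
errors_b = sync_errors([disk1, bytes(bad_disk4)], parity)

# Both cases report the same offset, so the check alone cannot name the culprit.
print(errors_a, errors_b)
```

In this model the two failure cases are indistinguishable from the parity check's point of view, which is why other evidence, such as the SMART reports discussed in this thread, is needed to decide which disk to distrust.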
Arioch5 Posted August 6, 2012

Both long SMART tests are attached; both passed. Not sure what to do now.

Attachments: long-smart-parity.TXT, long-smart-disk4.TXT
mbryanr Posted August 6, 2012

SMART reports will state "pass" as long as each attribute's VALUE exceeds its "THRESH" value. Passing may mean nothing.

Parity:

    ID# ATTRIBUTE_NAME      FLAG    VALUE WORST THRESH TYPE     UPDATED  WHEN_FAILED RAW_VALUE
    187 Reported_Uncorrect  0x0032  001   001   000    Old_age  Always   -           1856

    ATA Error Count: 1837 (device log contains only the most recent five errors)

    40 51 00 af 8e 46 00  Error: UNC at LBA = 0x00468eaf = 4624047

    Commands leading to the command that caused the error were:
    CR FR SC SN CL CH DH DC  Powered_Up_Time  Command/Feature_Name
    -- -- -- -- -- -- -- --  ---------------  --------------------
    25 00 08 af 8e 46 e0 00  02:52:53.419     READ DMA EXT
    27 00 00 00 00 00 e0 00  02:52:53.419     READ NATIVE MAX ADDRESS EXT
    ec 00 00 00 00 00 a0 00  02:52:53.418     IDENTIFY DEVICE
    ef 03 46 00 00 00 a0 00  02:52:53.418     SET FEATURES [set transfer mode]
    27 00 00 00 00 00 e0 00  02:52:53.418     READ NATIVE MAX ADDRESS EXT
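To make the point about "pass" concrete, here is a minimal sketch that parses the attribute row quoted above and checks it two ways. It assumes the ten-column layout shown in the post; the `SmartAttr` type, the helper names, and the `WATCHED_RAW` set are hypothetical illustrations, not part of smartctl.

```python
from typing import NamedTuple


class SmartAttr(NamedTuple):
    attr_id: int
    name: str
    value: int
    worst: int
    thresh: int
    raw: int


def parse_attr(line: str) -> SmartAttr:
    """Parse one attribute row laid out as in the post above.

    Columns: ID NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
    """
    f = line.split()
    return SmartAttr(int(f[0]), f[1], int(f[3]), int(f[4]), int(f[5]), int(f[9]))


def passes_threshold(attr: SmartAttr) -> bool:
    """The check described in the post: normalized VALUE exceeds THRESH."""
    return attr.value > attr.thresh


# Hypothetical watch list: attributes whose raw count we want to see at zero.
WATCHED_RAW = {"Reported_Uncorrect"}


def needs_attention(attr: SmartAttr) -> bool:
    """Flag an attribute that passes its threshold but has a nonzero raw count."""
    return passes_threshold(attr) and attr.name in WATCHED_RAW and attr.raw > 0


row = "187 Reported_Uncorrect 0x0032 001 001 000 Old_age Always - 1856"
attr = parse_attr(row)
print(passes_threshold(attr), needs_attention(attr), attr.raw)

# The LBA in the error log is the same number written in hex and decimal.
print(int("0x00468eaf", 16))
```

With a THRESH of 000, the normalized value can never fall to the threshold, so this attribute keeps "passing" however high the raw count climbs.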
Arioch5 Posted August 6, 2012

I see, I didn't realize that. It looks like you've highlighted a single read error. Is that enough (with a threshold of 0) to call the drive a problem? It sounds like I could do one more cable swap and long test, but the parity drive needs to be replaced. Is that an accurate understanding?
mbryanr Posted August 6, 2012

I don't know if the "Reported_Uncorrect" raw value is an actual count, but I would bet that it is: 1856 sectors reported uncorrectable. I just picked out one of the errors that it reported.
Arioch5 Posted August 6, 2012

Ah, thanks. I read that on my phone, so I didn't notice it. New hard drive it is.
Arioch5 Posted August 12, 2012

The new drive is in, parity is rebuilt, and the sync showed no errors.

The video that was being watched when I first noticed this now appears to be corrupt. Is that a coincidence, or did the parity copy a bad sector to the data disk? If it was a bad 'correction' from parity, how do I detect and avoid that in the future? I don't think the monthly parity check email includes sync errors, so I kind of feel like it's a silent failure that could cause data corruption.
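On the "how do I detect that in the future" question, one generic approach (not something suggested in this thread, and not an Unraid feature) is to keep a checksum manifest of the files and re-verify it periodically. The sketch below assumes SHA-256 digests and a simple in-memory dict; the function names and the `movie.bin` demo file are hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path


def file_sha256(path, chunk_size=1 << 20):
    """Hash a file in chunks so large videos do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file's path (relative to root) to its SHA-256 digest."""
    root = Path(root)
    return {
        str(p.relative_to(root)): file_sha256(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def changed_files(root, manifest):
    """List files whose content no longer matches the manifest (or that are gone)."""
    current = build_manifest(root)
    return sorted(name for name, digest in manifest.items() if current.get(name) != digest)


# Demo on a throwaway directory standing in for a share.
with tempfile.TemporaryDirectory() as share:
    movie = Path(share) / "movie.bin"
    movie.write_bytes(b"\x00" * 4096)
    manifest = build_manifest(share)

    # Simulate silent corruption: one byte changes, the file size does not.
    movie.write_bytes(b"\x00" * 4095 + b"\x01")
    damaged = changed_files(share, manifest)

print(damaged)
```

A check like this only reports that a file changed since the manifest was built; it cannot say whether parity or the data disk was responsible, and restoring the file still depends on having a backup.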
This topic is now archived and is closed to further replies.