[Solved] More parity errors...


A few months ago (early June) I had a large number of parity errors.  Eventually the process ended with me running reiserfsck and losing some files.  Fortunately, I had some off-site backups I was able to restore from.

 

Today I found a file on my system inaccessible.  The Unraid page shows the last parity check (8/1/12) had a couple thousand parity errors, which I assume were corrected.  I'm re-running the parity check today, but I'm very concerned: the drives remain online, yet I seem to be regularly failing parity checks, and it looks like I'm about to lose more data to file system corruption.

 

I've attached the syslog and SMART reports from the parity disk and the (currently) affected disk, disk4.  Both file system failures have now been reported on this disk, though it contains the bulk of the data, so parity problems would be most likely to show up as errors here.

 

Any help would be greatly appreciated.  I've had a disk failure before, but this is the first time in some 6 or 7 years of running Unraid that I've had file system corruption like this.

 

I'll report back later on the parity check status and on the reiserfs state of disk4.  This parity check will take 7-8 hours to finish.

 

Here is a pastebin of the end of the syslog (full log attached): http://pastebin.com/3UGEcd2i

syslog.zip

smart-parity.txt

smart-disk4.txt

Link to comment

I'm not sure what the duplicate files are about, but looking at what's duplicated, those are the files I restored after the last file system error.

 

I thought the parity looked suspicious.

 

What's the proper plan of attack from here?  My wife isn't gonna be too happy if I let this happen again in the next 60 days.

 

I'm guessing:

[*]Let the parity check finish

[*]Replace the parity disk cable

[*]Run a long SMART test on the parity disk?

 

Am I on the right track for best shot at keeping my files?

Should I worry about the duplicates or wait until the parity has no sync errors first?

Link to comment

Sitting at ~60% complete on the parity check and I've already got 2150 sync errors reported.  Attached is the log since I rebooted and started the sync today.

 

I noticed this morning b/c a movie my daughter was watching stopped playing halfway through.  When I pulled it up on my desktop I couldn't even open the movie.  I just checked and I can now play the movie (at least the beginning).  That could be due to the reboot or the parity check; I'm not sure.

 

I'm not clear as to whether this parity check is doing a 'no-repair' or a 'repair' parity check.  I'm running version 4.7.

 

I'd like to go ahead and cancel the parity check, replace the cable on the parity drive, start a smartctl test, and get those results.  Is that something I can do right now?  I'd like to handle as much of this today as I can, since the server will have to be offline Monday while I'm at work and my family has become used to having access to their movies.  We don't even have a DVD player in front of most of our TVs anymore.

syslog-restart.txt

Link to comment

I've replaced the cable on the parity drive.  The old one was not obviously loose, but this at least eliminates the cable as a suspect.  The brand new cable had been sitting in the closet in its original bag, although it doesn't have the clip to hold it in like the old cable did.  I also pushed on all the cables at both ends to ensure a tight fit while I was there.

 

Attached are SMART reports from before and after the cable change for the parity disk and disk4.  None of the SMART tests come up with errors, and I see no difference between the pre- and post-change reports.

 

I'm now running a long test on both disks in question. 

Here's my guess at the plan from here.  Is this accurate?

 

One SMART test fails (yea!)

[*]Replace failed drive

[*]Rebuild parity

 

Neither SMART test fails

[*]Run reiserfsck on disk4 (not parity!)

[*]Wait for further instructions?

 

I'm unclear about how reiserfsck works and wonder if I messed up doing this a few months ago.  Since the parity drive is still in use while reiserfsck runs, if there was a bad cable or a failing parity drive, wouldn't Unraid 'correct' the disk4 data with what's on parity, causing reiserfsck to call for a repair when really the parity was at fault?  I'm unclear what to do about file system corruption when there are sync errors with the parity drive.

 

It seems to me that at some point here I'm going to have to call one of the drives 'good' and the other 'bad'.  Since last time I just repaired disk4 (and files vanished), and the SMART report on parity has more questionable items, I'd lean toward calling parity wrong.  How does that fit into reiserfsck and my next steps?
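
To illustrate why one drive has to be declared 'good' by hand: below is a rough Python sketch, assuming plain single-disk XOR parity as a simplification (the three tiny "disks" and the `xor_parity` helper are made up for illustration, not taken from Unraid). It shows that a parity check sees the same symptom whether the bad byte came from the parity disk or from a data disk.

```python
from functools import reduce

def xor_parity(blocks):
    """Single parity: byte-wise XOR across the same offset on every data disk."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# Three hypothetical data disks, two bytes each.
data = [b"\x10\x20", b"\x0f\x0f", b"\xaa\x55"]
parity = xor_parity(data)

# Case 1: the parity disk returns one flipped bit.
bad_parity = bytes([parity[0] ^ 0x01]) + parity[1:]

# Case 2: a data disk returns one flipped bit instead.
bad_data = [data[0], bytes([data[1][0] ^ 0x01]) + data[1][1:], data[2]]

mismatch_parity_bad = xor_parity(data) != bad_parity
mismatch_data_bad = xor_parity(bad_data) != parity

# Both cases report a sync error at the same offset.
print(mismatch_parity_bad, mismatch_data_bad)
```

Both cases print `True`, so the mismatch alone says nothing about which disk is wrong. That information has to come from somewhere else, such as SMART data or the syslog.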

 

Thanks for the help.  I thought I had this licked, so I just want to be sure I've got some advice this time around to get it done right.  I really don't want to be here on the 1st of Sep b/c my monthly parity check has failed  :o

smart-disk4-pre.TXT

smart-disk4-post.TXT

smart-parity-pre.TXT

smart-parity-post.TXT

Link to comment

SMART reports will state "pass" as long as each attribute's normalized VALUE stays above its "THRESH" value.  Passing may mean nothing, especially for an attribute whose THRESH is 000, like the one below.

 

Parity:

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
187 Reported_Uncorrect      0x0032   001   001   000    Old_age   Always       -       1856
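
To make the point about "pass" concrete, here is a minimal Python sketch of the comparison. It assumes the usual smartctl convention that an attribute only counts as failing once its normalized VALUE drops to or below THRESH. The helper names are made up for illustration; the attribute row is the one from the parity report above.

```python
def parse_attribute(line):
    """Split one row of the smartctl attribute table into named fields."""
    fields = line.split()
    return {
        "id": int(fields[0]),
        "name": fields[1],
        "value": int(fields[3]),   # normalized value
        "worst": int(fields[4]),
        "thresh": int(fields[5]),
        "raw": int(fields[9]),     # raw count reported by the drive
    }

def is_failing(attr):
    # Assumed rule: failing only when normalized VALUE <= THRESH.
    return attr["value"] <= attr["thresh"]

line = "187 Reported_Uncorrect      0x0032  001  001  000    Old_age  Always      -      1856"
attr = parse_attribute(line)
print(attr["name"], "failing:", is_failing(attr), "raw:", attr["raw"])
```

Under that rule the attribute is not "failing" even though the raw count is 1856, which is why the raw values are worth reading and not just the overall verdict.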

 

ATA Error Count: 1837 (device log contains only the most recent five errors)

  40 51 00 af 8e 46 00  Error: UNC at LBA = 0x00468eaf = 4624047

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 08 af 8e 46 e0 00      02:52:53.419  READ DMA EXT
  27 00 00 00 00 00 e0 00      02:52:53.419  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 a0 00      02:52:53.418  IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 00      02:52:53.418  SET FEATURES [set transfer mode]
  27 00 00 00 00 00 e0 00      02:52:53.418  READ NATIVE MAX ADDRESS EXT
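
For anyone wondering how the LBA in that error line is read off the register bytes, here is a small Python sketch. It assumes the seven bytes are in smartctl's usual order for this log (ER ST SC SN CL CH DH), which is not shown in the excerpt above, and that only the 28-bit LBA is recorded. The function name is made up for illustration.

```python
def decode_error_registers(reg_line):
    """Decode a 28-bit LBA from the seven error-register bytes.

    Assumed byte order (smartctl's usual header): ER ST SC SN CL CH DH.
    """
    er, st, sc, sn, cl, ch, dh = (int(b, 16) for b in reg_line.split())
    # SN/CL/CH hold LBA bits 0-23; the low nibble of DH holds bits 24-27.
    lba = ((dh & 0x0F) << 24) | (ch << 16) | (cl << 8) | sn
    # Bit 6 (0x40) of the error register is the UNC (uncorrectable) flag.
    return {"uncorrectable": bool(er & 0x40), "lba": lba}

result = decode_error_registers("40 51 00 af 8e 46 00")
print(hex(result["lba"]), result["lba"], result["uncorrectable"])
```

This gives 0x468eaf = 4624047 with the UNC flag set, matching the "Error: UNC at LBA" text smartctl printed.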

 

Link to comment

I see, I didn't realize that. It looks like you've highlighted a single read error. Is that enough (with a threshold of 0) to call the drive a problem?

 

Sounds like I could do one more cable swap and long test, but the parity drive needs to be replaced.

 

Is that an accurate understanding?

Link to comment

New drive is in, parity rebuilt and sync showed no errors.

 

I have a video, the one that was being watched when I noticed all this, that now appears to be corrupt.  Is that a coincidence, or did parity copy a bad sector to the data disk?

 

If it was a bad 'correction' from parity, how do I detect and avoid that in the future?  I don't think the monthly parity check email includes sync errors, so it feels like a silent failure that could cause data corruption.

 

Link to comment
