Parity Check Failures - Same Sectors



Hello all!  I'm running unRAID 6.0 Beta 6, and it's been running smoothly for quite some time.

 

When I logged into the management console yesterday, I noticed that there had been quite a few read errors (300+) against one of my drives over the last couple of months.  During my investigation, I managed to do something that locked up the system (it had to do with the "MyMain" application; I probably requested too much of the syslog).  As a result, I hard reset the server.

 

When it came back up, I ran a parity check with correction, and it came back with 31 issues.

 

I did a bit of research and determined that those might have been due to the reboot, so I ran a second parity check (with correction again), and it came back with 31 issues again.  I checked the syslog, and the issues occur in the exact same sectors on both runs.

 

What should my next steps be?  Currently, I'm running a memtest (as I saw that might be a culprit too), but I couldn't find a specific post about parity errors on the same sectors...

 

I've attached the syslog...

 

Thank you!

 

John

syslog-2014-09-30.txt

Link to comment

Attached are the SMART logs.

 

Here's my configuration as well...

 

sdc = parity (3 TB)

sdd = Disk 1 (3 TB)

sdf = Disk 2 (640 GB)

sdb = Disk 3 (500 GB)

sde = Cache (320 GB)

 

I believe that the previous read errors I saw were all on sdd (Disk 1) - the 300 or so from the original issue...  I can't tell where the parity problems are (of course)...

 

<Second post to contain the SMART report for sdf>

dev-sdb_-_ST3500630AS_6QG1BS25_-_2014-10-01.txt

dev-sdc_-_WDC_WD30EFRX-68EUZNO_WD-WMC4N0439159_-_2014-10-01.txt

dev-sdd_-_WEC_WD30EZRS-00MMMB0_WD-WCAWZ1221651_-_2014-10-01.txt

dev-sde_-_ST3320620AS_5QF15WDQ_-_2014-10-01.txt

Link to comment

I didn't see anything that stood out as significant; however, I would run a SMART long test on these Seagate drives for extra confidence.
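If it helps, a long self-test can be kicked off from the command line with smartctl. This is just a sketch, with /dev/sdX standing in for whichever drive you're testing:

smartctl -t long /dev/sdX
smartctl -l selftest /dev/sdX

The second command prints the drive's self-test log, so you can watch the progress and see the result once the test finishes.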

 

 

You'll need to have the array offline in safe mode or disable the spin down timers.

I don't know that the Hardware_ECC_Recovered number means anything.

Maybe someone else can chime in on this one.

If you have hash sums, I would check them regularly at this point.

 

 

Check the cable on this one:

 

Model Family:    Seagate Barracuda 7200.10

Device Model:    ST3320620AS

Serial Number:    5QF15WKQ

 

199 UDMA_CRC_Error_Count    0x003e  200  200  000    Old_age  Always      -      7

195 Hardware_ECC_Recovered  0x001a  066  053  000    Old_age  Always      -      171156973

 

 

 

 

Model Family:    Seagate Barracuda 7200.10

Device Model:    ST3500630AS

Serial Number:    6QG1BS25

195 Hardware_ECC_Recovered  0x001a  057  048  000    Old_age  Always      -      126296894

Link to comment

Did a memtest last night. Stopped it to get the SMART reports. Got 5 passes with no errors. Is that enough, or do you want me to do more?

 

Now, about the hash sums... I imagine you're talking about something like an MD5 checksum, right? I don't have anything external to unRAID; is there a chance that unRAID has it built in?

 

I do have full backups of the array on Crashplan. Can I abuse that for a thorough MD5 check? Any tools, etc., that I should be using from here on out to calculate checksums and compare them?

Link to comment

5 memtest passes should be enough.

 

 

Long tests on the Seagates may reveal something, although I'm not sure that's the whole problem.

I would double-check cables and paths; there were some UDMA CRC errors reported.
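One way to tell whether those CRC errors are historical or still accumulating is to note the raw value of attribute 199 now and look at it again after reseating the cables. A quick sketch, substituting whichever device you're checking:

smartctl -A /dev/sde | grep -i udma_crc

If the raw count hasn't moved after the reseat, the errors are old.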

 

As far as md5sums, you can use a package called md5deep or hashdeep.

 

Then it's something like:

md5deep -r /mnt/disk1 > /somestorageplace/disk1.md5sums

 

You can do that for each of your disks.
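If you want to cover all the data disks in one pass, a small loop works. Just a sketch, assuming the data disks are mounted at /mnt/disk1 through /mnt/disk3 and that /somestorageplace is writable and not on the array:

for d in 1 2 3; do
  md5deep -r /mnt/disk$d > /somestorageplace/disk$d.md5sums
done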

Then later verify the files with md5deep's negative matching mode, which prints anything that no longer matches the saved list:

md5deep -r -x /somestorageplace/disk1.md5sums /mnt/disk1

 

jbartlett has this great shell script called bitrot.sh:

bitrot - a utility for generating sha256 keys for integrity checks (version 1.0)

http://lime-technology.com/forum/index.php?topic=35226.msg327803#msg327803

 

While I am working on my own set of tools, they are not ready for prime time.

Link to comment

Started the long checks. Did all 5 drives just in case. Took the array offline via the web GUI, set the default spin-down time to "never", and then started the long tests. (Hope that is what you were looking for...)

 

Now, I believe the cable errors are old (those drives have been in many machines...), but I will check the cables anyway the next time I crack the case.

 

Do I need to get a drive on order? Is the current exercise trying to figure out which drive to swap?

 

I had hoped that some of the checksum stuff was built into the file system on unRAID, but I imagine that I'm getting confused with some of the other products I was researching...

Link to comment

So far these drives look OK to me.

I did not see any ATA timeouts or related kernel error messages either.

 

As for the checksums in the filesystem, btrfs has that, but it's still considered experimental.

md5deep -r can make text hash sums of your drives for double checking.

 

Frankly, there was an issue with unRAID 6 Beta 7 & 8 and potential reiserfs corruption.

 

I wonder if there's some other software issue here.  We'll know it's not the drives once all the tests come back.

 

You may want to consider using unRAID 6 Beta 10a.

 

If you can possibly use unRAID 5 still, you may want to go through the parity check with that.

Then capture the md5deep hash sums of everything.

Link to comment

So, without previous hash sums, I basically don't know whether I'm OK or not...  That should be fine given my backups; I just won't know which files are bad until they fail :(

 

I never went to Beta 7 or 8, so I should be good there.

 

I'm at the mercy of your opinion on this - do you want me to go to unRAID 6b10a, or unRAID 5?  My configuration is extremely simple (nothing but unMENU, it would seem), so going to either one would be okay...  (I'd just have to revert the Cache drive back to reiserfs, but I'm not doing anything strange with the Cache drive, so if it came out of the config altogether, that would be fine too...)

 

Just have to wait another 7 hours or so for the extended checks to finish.

Link to comment

Well, you're already there with unRAID 6 Beta 6.  No one reported corruption with that version.

I was just wondering whether it had anything to do with the newer versions or not.

 

Since you have backups, if they are intact and the drives have the same structure, you can generate MD5s on those and then use that text file to verify the files on the current system. If the structures vary widely, that would not be an easy task.

You would also have to keep in mind that any changes made since the backup would be reported as failures.

 

FWIW, there's a Windows tool called Corz Checksum which is pretty easy to use too.

 

Jeez, if it were my machine, I would probably temporarily drop back and skip the cache drive for now.

Do the parity check; if that was clean, make the hash sums of the files.

If the parity check still does not come up clean, I would look at my hardware even closer.

 

If it was clean,

I might consider going back to unRAID 6 Beta 6, as you have now, verifying parity and hash sums, and then possibly updating to Beta 10 and doing another parity check and hash sum check.

 

It's a lot of work, but you have to find a comfort zone of known reliability.

Link to comment

Any gotchas to going back to unRAID 5?

 

Other than the cache disk, there shouldn't be.

 

My nickel's worth ...

 

I'd go back to v5 (without a cache).  Do a parity check to confirm there's no obvious corruption.

 

Then I'd use a comparison tool to compare your data to your backup disks (I like FolderMatch, but it's not free -- although there's a free demo that would let you do what you need here ... if you like it there's a nice discount if you buy within 7 days).    This will take a good bit of time (likely several days) ... but then you'll know for sure whether the data's okay.

 

Then I'd create checksums for all your files (at least all of the static ones) ... on both the live array and on the backup disks.    Then you can verify the files anytime without the need to dig out the backups; and you'll also be able to confirm your backups are good before using them, should that ever be necessary.

 

Link to comment

Any gotchas to going back to unRAID 5?

 

 

I don't have the answers on that one. I know the newer filesystems would not be supported.

Nor Docker; I don't know about other features.

 

 

If you're brave, you may as well move to the latest beta to see if some quirky hardware bugs have been squashed.

I wouldn't do any writes if I didn't have to; I might just do a correcting parity check and see what occurs.

 

 

This is one of those years where I haven't ventured into the beta zone, since I no longer have a separate room for computers and music and have to keep hardware to a minimum.

Link to comment

Unfortunately, I don't have disk backups, just Crashplan cloud backups, so...

 

Backups are backups -- you can download a spare disk-at-a-time worth of files from the cloud;  compare those against what's on the array;  then erase the spare disk; download another disk's worth;  and repeat the process until it's done.    Clearly this is time consuming ... but the vast majority of the time is "computer time" ... not "your time."    Just how long this might take depends on how much data you have, but even a couple weeks is probably worthwhile to confirm all of your files are good.
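For the compare step itself, once a batch has been restored to the spare disk, rsync can do a checksum-only dry run against the array. This is only a sketch, assuming the restored batch sits at /mnt/restore and mirrors the layout of /mnt/disk1:

rsync -rcn --out-format='%n' /mnt/restore/ /mnt/disk1/

Anything it lists either differs between the backup copy and the array or exists only on the restore side, so it's worth a closer look.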

 

Once you KNOW that, then I'd create checksums for all your files (Corz is an excellent tool for this); and I assume your automated Crashplan job will then back up those checksums to the cloud as well (and since you will have already verified that the files are the same, the checksums on the cloud can be downloaded along with any file you're downloading, so you can confirm it's good).

 

 

Link to comment

Here are the rest of the SMART reports...  They are pretty plain as well.

 

I've decided to try Beta 10a first, then go back to unRAID 5 if things are still goofy...

 

My box has (from time to time) trouble re-recognizing the flash drive as bootable because I've got so many physical devices.

 

Will let you know how things go with 10a's parity check.

 

Thanks again for everybody taking a look here...

dev-sdc_-_WDC_WD30EFRX-68EUZNO_WD-WMC4N0439159_-_Long_Test_Results_-_2014-10-01.txt

dev-sdd_-_WEC_WD30EZRS-00MMMB0_WD-WCAWZ1221651_-_Long_Test_Results_-_2014-10-01.txt

Link to comment

Doesn't look good. Ran a parity check in 10a, and it found the 31 errors within the first 6.6%. Had it set to correct errors. Ran a second check, and by the time I left for work, it had already identified some errors.

 

So, back to unRAID 5?  Anybody have detailed steps, or is it "format the drive and install from fresh"?  I think I would reinstall and only keep the key file, then manually reconfigure (including redoing unMENU from scratch...)

Link to comment

If you have the space, it should be as simple as replacing the bzroot/bzimage (but don't take my word for it).

 

You can probably edit the syslinux.cfg file to point to the specific unRAID 5 bzroot/bzimage if you've saved them with a naming convention.
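For what it's worth, the stanza would look something along these lines. Purely a sketch, assuming you copied the v5 files to the flash drive as bzimage5 and bzroot5; the existing entries in syslinux.cfg show the exact syntax to mirror:

label unRAID5
  kernel /bzimage5
  append initrd=/bzroot5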

 

I would also unassign the cache first to take it out of the equation.

I might rename the /boot/extra folder and/or rename/move any plugins.

In addition, come up in Safe mode to disable any subsequent installation/processing.
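Something along these lines would park the add-ons. Just a sketch, and it assumes the usual flash layout, with extra packages in /boot/extra and plugin files in /boot/config/plugins:

mv /boot/extra /boot/extra.off
mv /boot/config/plugins /boot/config/plugins.off

Renaming them back restores everything once you're done testing.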

 

 

 

Link to comment

The plot thickens...  Upon arriving home, the array was unresponsive (something crashed; the screen on the server had lots of error-looking lines, with what look like memory addresses in brackets on the left...).  On boot, a new parity check started, and now the sectors that need parity corrections are different from the ones in the syslog posted originally...

 

Here's the original syslog section:

Sep 29 20:35:25 WhiteNAS kernel: md: correcting parity, sector=3232464
Sep 29 20:35:46 WhiteNAS kernel: md: correcting parity, sector=6440768
Sep 29 20:37:10 WhiteNAS kernel: ntfs: driver 2.1.30 [Flags: R/W MODULE].
Sep 29 20:37:20 WhiteNAS kernel: mdcmd (20): spinup 1
Sep 29 20:37:20 WhiteNAS kernel: 
Sep 29 20:37:45 WhiteNAS kernel: mdcmd (21): spinup 2
Sep 29 20:37:45 WhiteNAS kernel: 
Sep 29 20:37:52 WhiteNAS kernel: md: correcting parity, sector=25898656
Sep 29 20:37:57 WhiteNAS kernel: mdcmd (22): spinup 2
Sep 29 20:37:57 WhiteNAS kernel: 
Sep 29 20:41:16 WhiteNAS kernel: md: correcting parity, sector=51318064
Sep 29 20:41:25 WhiteNAS kernel: md: correcting parity, sector=52473848
Sep 29 20:41:50 WhiteNAS kernel: md: correcting parity, sector=55526368
Sep 29 20:43:21 WhiteNAS kernel: md: correcting parity, sector=66214008

 

Here's the current syslog (or at least, just the start of the scan):

Oct  2 19:57:18 WhiteNAS kernel: md: correcting parity, sector=4497784 (unRAID engine)
Oct  2 19:57:59 WhiteNAS kernel: md: correcting parity, sector=10992280 (unRAID engine)
Oct  2 19:58:47 WhiteNAS kernel: md: correcting parity, sector=18428832 (unRAID engine)
Oct  2 19:59:02 WhiteNAS kernel: md: correcting parity, sector=20920464 (unRAID engine)
Oct  2 19:59:23 WhiteNAS kernel: md: correcting parity, sector=24241184 (unRAID engine)
Oct  2 20:00:09 WhiteNAS kernel: md: correcting parity, sector=31472072 (unRAID engine)

 

Does this change the analysis?

Link to comment

Seems like some kind of hardware/driver incompatibility. Take a pic of the screen next time.

 

Here's an idea: you could build new parity on purpose (not just correct it, but start from scratch).

Then immediately do a parity check.

 

I'm going to assume you've not seen this behavior in unRAID 5.x.

The only way to prove or disprove it is to fall back and test the scenario there.

Link to comment
