Parity Check Failures - Same Sectors



Hello all!  I'm running unRAID 6.0 Beta 6, and it's been running smoothly for quite some time.

 

When I logged into the management console yesterday, I noticed that there had been quite a few read errors (300+) against one of my drives over the last couple of months.  During my investigation, I managed to do something that locked up the system (it had to do with the "MyMain" application; I probably requested too much of the syslog).  As a result, I hard reset the server.

 

When it came back up, I ran a parity check with correction, and it came back with 31 issues.

 

I did a bit of research and determined that those might have been due to the reboot, so I ran a second parity check (with correction again), and it came back with 31 issues again.  I checked the syslog, and the issues occur in the exact same sectors on both runs.

 

What should my next steps be?  Currently, I'm running a memtest (as I saw that might be a culprit too), but I couldn't find a specific post about parity errors on the same sectors...

 

I've attached the syslog...

 

Thank you!

 

John

syslog-2014-09-30.txt

Link to comment

Attached are the SMART logs.

 

Here's my configuration as well...

 

sdc = parity (3 TB)

sdd = Disk 1 (3 TB)

sdf = Disk 2 (640 GB)

sdb = Disk 3 (500 GB)

sde = Cache (320 GB)

 

I believe that the previous read errors I saw were all on sdd (Disk 1) - the 300 or so from the original issue...  I can't tell where the parity problems are (of course)...

 

<Second post to contain the SMART report for sdf>

dev-sdb_-_ST3500630AS_6QG1BS25_-_2014-10-01.txt

dev-sdc_-_WDC_WD30EFRX-68EUZNO_WD-WMC4N0439159_-_2014-10-01.txt

dev-sdd_-_WEC_WD30EZRS-00MMMB0_WD-WCAWZ1221651_-_2014-10-01.txt

dev-sde_-_ST3320620AS_5QF15WDQ_-_2014-10-01.txt

Link to comment

I didn't see anything that stood out as significant; however, I would run a SMART long test on these Seagate drives for extra confidence.
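If it helps, a long self-test can be kicked off from the command line with smartctl. This is just a sketch, with /dev/sdX standing in for whichever drive you're testing:

smartctl -t long /dev/sdX
smartctl -l selftest /dev/sdX

The second command prints the drive's self-test log, so you can watch the progress and see the result once the test finishes.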

 

 

You'll need to have the array offline in safe mode or disable the spin down timers.

I don't know that the Hardware_ECC_Recovered number means anything.

Maybe someone else can chime in on this one.

If you have hash sums, I would check them regularly at this point.

 

 

Check the cable on this one:

 

Model Family:    Seagate Barracuda 7200.10

Device Model:    ST3320620AS

Serial Number:    5QF15WKQ

 

199 UDMA_CRC_Error_Count    0x003e  200  200  000    Old_age  Always      -      7

195 Hardware_ECC_Recovered  0x001a  066  053  000    Old_age  Always      -      171156973

 

 

 

 

Model Family:    Seagate Barracuda 7200.10

Device Model:    ST3500630AS

Serial Number:    6QG1BS25

195 Hardware_ECC_Recovered  0x001a  057  048  000    Old_age  Always      -      126296894

Link to comment

Did a memtest last night. Stopped it to get the SMART reports. Got 5 passes with no errors. Is that enough, or do you want me to do more?

 

Now, about the hash sums... I imagine you're talking about something like an MD5 checksum, right? I don't have anything external to unRAID; is there a chance that unRAID has it built in?

 

I do have full backups of the array on Crashplan. Can I abuse that for a thorough MD5 check? Any tools, etc., that I should be using from here on out to calculate checksums and compare them?

Link to comment

5 memtest passes should be enough.

 

 

Long tests on the Seagates may reveal something, although I'm not sure that's the whole problem.

I would double-check cables and paths; there were some UDMA CRC errors reported.
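One way to tell whether those CRC errors are historical or still accumulating is to note the raw value of attribute 199 now and look at it again after reseating the cables. A quick sketch, substituting whichever device you're checking:

smartctl -A /dev/sde | grep -i udma_crc

If the raw count hasn't moved after the reseat, the errors are old.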

 

As far as md5sums, you can use a package called md5deep or hashdeep.

 

Then it's something like:

md5deep -r /mnt/disk1 > /somestorageplace/disk1.md5sums

 

You can do that for each of your disks.
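If you want to cover all the data disks in one pass, a small loop works. Just a sketch, assuming the data disks are mounted at /mnt/disk1 through /mnt/disk3 and that /somestorageplace is writable and not on the array:

for d in 1 2 3; do
  md5deep -r /mnt/disk$d > /somestorageplace/disk$d.md5sums
done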

Then later verify the files with md5deep's negative matching mode, which prints anything that no longer matches the saved list:

md5deep -r -x /somestorageplace/disk1.md5sums /mnt/disk1

 

jbartlett has this great shell script called bitrot.sh:

bitrot - a utility for generating sha256 keys for integrity checks (version 1.0)

http://lime-technology.com/forum/index.php?topic=35226.msg327803#msg327803

 

While I am working on my own set of tools, they are not ready for prime time.

Link to comment

Started the long checks. Did all 5 drives just in case. Took the array offline via the web GUI, set the default spin-down time to "never", and then started the long tests. (Hope that is what you were looking for...)

 

Now, I believe the cable errors are old (those drives have been in many machines...), but I will check the cables anyway the next time I crack the case.

 

Do I need to get a drive on order? Is the current exercise trying to figure out which drive to swap?

 

I had hoped that some of the checksum stuff was built into the file system on unRAID, but I imagine that I'm getting confused with some of the other products I was researching...

Link to comment

So far these drives look OK to me.

I did not see any ATA timeouts or related kernel error messages either.

 

As for the checksums in the filesystem, btrfs has that, but it's still considered experimental.

md5deep -r can make text hash sums of your drives for double checking.

 

Frankly, there was an issue with unRAID 6 Beta 7 & 8 and potential reiserfs corruption.

 

I wonder if there's some other software issue here.  We'll know it's not the drives once all the tests come back.

 

You may want to consider using unRAID 6 Beta 10a.

 

If you can possibly use unRAID 5 still, you may want to go through the parity check with that.

Then capture the md5deep hash sums of everything.

Link to comment

So, without previous hash sums, I basically don't know whether I'm OK or not...  That should be fine given my backups; I just won't know which files are bad until they fail :(

 

I never went to Beta 7 or 8, so I should be good there.

 

I'm at the mercy of your opinion on this - do you want me to go to unRAID 6b10a, or unRAID 5?  My configuration is extremely simple (nothing but unMENU, it would seem), so going to either one would be okay...  (I'd just have to revert the Cache drive back to reiserfs, but I'm not doing anything strange with the Cache drive, so if it came out of the config altogether, that would be fine too...)

 

Just have to wait another 7 hours or so for the extended checks to finish.

Link to comment

Well, you're already there with unRAID 6 Beta 6.  No one reported corruption with that version.

I was just wondering whether it had anything to do with the newer versions or not.

 

Since you have backups, if they are intact and the drives have the same structure, you can generate MD5s on those and then use that text file to verify the files on the current system. If the structures vary widely, that would not be an easy task.

You would also have to keep in mind that any changes made since the backup would be reported as failures.

 

FWIW, there's a Windows tool called Corz Checksum which is pretty easy to use too.

 

Jeez, if it were my machine, I would probably temporarily drop back and skip the cache drive for now.

Do the parity check; if that was clean, make the hash sums of the files.

If the parity check still does not come up clean, I would look at my hardware even closer.

 

If it was clean,

I might consider going back to unRAID 6 Beta 6, as you have now, verifying parity and hash sums, and then possibly updating to Beta 10 and doing another parity check and hash sum check.

 

It's a lot of work, but you have to find a comfort zone of known reliability.

Link to comment

Any gotchas to going back to unRAID 5?

 

Other than the cache disk, there shouldn't be.

 

My nickel's worth ...

 

I'd go back to v5 (without a cache).  Do a parity check to confirm there's no obvious corruption.

 

Then I'd use a comparison tool to compare your data to your backup disks (I like FolderMatch, but it's not free -- although there's a free demo that would let you do what you need here ... if you like it there's a nice discount if you buy within 7 days).    This will take a good bit of time (likely several days) ... but then you'll know for sure whether the data's okay.

 

Then I'd create checksums for all your files (at least all of the static ones) ... on both the live array and on the backup disks.    Then you can verify the files anytime without the need to dig out the backups; and you'll also be able to confirm your backups are good before using them, should that ever be necessary.

 

Link to comment

Any gotchas to going back to unRAID 5?

 

 

I don't have the answers on that one. I know the newer filesystems would not be supported.

Nor Docker; I don't know about other features.

 

 

If you're brave, you may as well move to the latest beta to see if some quirky hardware bugs have been squashed.

I wouldn't do any writes if I didn't have to; I might just do a correcting parity check and see what occurs.

 

 

This is one of those years where I haven't ventured into the beta zone, since I no longer have a separate room for computers and music and have to keep hardware to a minimum.

Link to comment

Unfortunately, I don't have disk backups, just Crashplan cloud backups, so...

 

Backups are backups -- you can download a spare disk-at-a-time worth of files from the cloud;  compare those against what's on the array;  then erase the spare disk; download another disk's worth;  and repeat the process until it's done.    Clearly this is time consuming ... but the vast majority of the time is "computer time" ... not "your time."    Just how long this might take depends on how much data you have, but even a couple weeks is probably worthwhile to confirm all of your files are good.
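For the compare step itself, once a batch has been restored to the spare disk, rsync can do a checksum-only dry run against the array. This is only a sketch, assuming the restored batch sits at /mnt/restore and mirrors the layout of /mnt/disk1:

rsync -rcn --out-format='%n' /mnt/restore/ /mnt/disk1/

Anything it lists either differs between the backup copy and the array or exists only on the restore side, so it's worth a closer look.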

 

Once you KNOW that, then I'd create checksums for all your files (Corz is an excellent tool for this); and I assume your automated Crashplan job will then back up those checksums to the cloud as well (and since you will have already verified that the files are the same, the checksums on the cloud can be downloaded along with any file you're downloading, so you can confirm it's good).

 

 

Link to comment

Here are the rest of the SMART reports...  They are pretty plain as well.

 

I've decided to try Beta 10a first, then go back to unRAID 5 if things are still goofy...

 

My box has (from time to time) trouble re-recognizing the flash drive as bootable because I've got so many physical devices.

 

Will let you know how things go with 10a's parity check.

 

Thanks again for everybody taking a look here...

dev-sdc_-_WDC_WD30EFRX-68EUZNO_WD-WMC4N0439159_-_Long_Test_Results_-_2014-10-01.txt

dev-sdd_-_WEC_WD30EZRS-00MMMB0_WD-WCAWZ1221651_-_Long_Test_Results_-_2014-10-01.txt

Link to comment

Doesn't look good. Ran a parity check in 10a, and it found the 31 errors within the first 6.6%. Had it set to correct errors. Ran a second check, and by the time I left for work, it had already identified some errors.

 

So, back to unRAID 5?  Anybody have detailed steps, or is it "format the drive and install from fresh"?  I think I would reinstall and only keep the key file, then manually reconfigure (including redoing unMENU from scratch...)

Link to comment

If you have the space, it should be as simple as replacing the bzroot/bzimage (but don't take my word for it).

 

You can probably edit the syslinux.cfg file to point to the specific unRAID 5 bzroot/bzimage if you've saved them with a naming convention.
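For what it's worth, the stanza would look something along these lines. Purely a sketch, assuming you copied the v5 files to the flash drive as bzimage5 and bzroot5; the existing entries in syslinux.cfg show the exact syntax to mirror:

label unRAID5
  kernel /bzimage5
  append initrd=/bzroot5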

 

I would also unassign the cache first to take it out of the equation.

I might rename the /boot/extra folder and/or rename/move any plugins.

In addition, come up in Safe mode to disable any subsequent installation/processing.
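Something along these lines would park the add-ons. Just a sketch, and it assumes the usual flash layout, with extra packages in /boot/extra and plugin files in /boot/config/plugins:

mv /boot/extra /boot/extra.off
mv /boot/config/plugins /boot/config/plugins.off

Renaming them back restores everything once you're done testing.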

 

 

 

Link to comment

The plot thickens...  Upon arriving home, the array was unresponsive (something crashed; the screen on the server had lots of error-looking lines, with what look like memory addresses in brackets on the left...).  On boot, a new parity check started, and now the sectors that need parity corrections are different from the ones in the syslog posted originally...

 

Here's the original syslog section:

Sep 29 20:35:25 WhiteNAS kernel: md: correcting parity, sector=3232464
Sep 29 20:35:46 WhiteNAS kernel: md: correcting parity, sector=6440768
Sep 29 20:37:10 WhiteNAS kernel: ntfs: driver 2.1.30 [Flags: R/W MODULE].
Sep 29 20:37:20 WhiteNAS kernel: mdcmd (20): spinup 1
Sep 29 20:37:20 WhiteNAS kernel: 
Sep 29 20:37:45 WhiteNAS kernel: mdcmd (21): spinup 2
Sep 29 20:37:45 WhiteNAS kernel: 
Sep 29 20:37:52 WhiteNAS kernel: md: correcting parity, sector=25898656
Sep 29 20:37:57 WhiteNAS kernel: mdcmd (22): spinup 2
Sep 29 20:37:57 WhiteNAS kernel: 
Sep 29 20:41:16 WhiteNAS kernel: md: correcting parity, sector=51318064
Sep 29 20:41:25 WhiteNAS kernel: md: correcting parity, sector=52473848
Sep 29 20:41:50 WhiteNAS kernel: md: correcting parity, sector=55526368
Sep 29 20:43:21 WhiteNAS kernel: md: correcting parity, sector=66214008

 

Here's the current syslog (or at least, just the start of the scan):

Oct  2 19:57:18 WhiteNAS kernel: md: correcting parity, sector=4497784 (unRAID engine)
Oct  2 19:57:59 WhiteNAS kernel: md: correcting parity, sector=10992280 (unRAID engine)
Oct  2 19:58:47 WhiteNAS kernel: md: correcting parity, sector=18428832 (unRAID engine)
Oct  2 19:59:02 WhiteNAS kernel: md: correcting parity, sector=20920464 (unRAID engine)
Oct  2 19:59:23 WhiteNAS kernel: md: correcting parity, sector=24241184 (unRAID engine)
Oct  2 20:00:09 WhiteNAS kernel: md: correcting parity, sector=31472072 (unRAID engine)

 

Does this change the analysis?

Link to comment

Seems like some kind of hardware/driver incompatibility. Take a pic of the screen next time.

 

Here's an idea: you could build new parity on purpose (not just correct it, but start from scratch).

Then immediately do a parity check.

 

I'm going to assume you've not seen this behavior in unRAID 5.x.

The only way to prove or disprove it is to fall back and test the scenario there.

Link to comment
