Parity Check Failures - Same Sectors



Have reverted to unRAID 5, with a really fresh start.  Glad unRAID figured out what shares I already had and picked them up automagically.  Was afraid that I might overwrite something.

 

During parity check, the following items have been corrected thus far:

 

Oct  2 20:23:46 WhiteNAS kernel: md: correcting parity, sector=24241184 (unRAID engine)
Oct  2 20:24:32 WhiteNAS kernel: md: correcting parity, sector=31472072 (unRAID engine)
Oct  2 20:33:04 WhiteNAS kernel: md: correcting parity, sector=112192360 (unRAID engine)
Oct  2 20:33:59 WhiteNAS kernel: md: correcting parity, sector=120882984 (unRAID engine)

 

For those playing at home, the first two are part of the parity corrections done under beta 10a.

 

Stopping the parity check now, and rerunning to see if those same errors are picked up in the first 2%.
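
(For what it's worth, the stop and re-run show up in the syslog as mdcmd calls. I believe the console equivalents are roughly the lines below, though the command may live at /root/mdcmd on this version, so treat this as a sketch:)

mdcmd nocheck          # cancel the running check
mdcmd check CORRECT    # start a fresh correcting check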

 

 

 

 

Link to comment

Woo hoo!

 

During the second run (after the partial first run), the first Parity error is here:

 

Oct  2 20:51:54 WhiteNAS kernel: md: correcting parity, sector=145897184 (unRAID engine)

 

So, it's looking like for whatever reason in unRAID 6, the parity corrections weren't being committed back to disk (or something)...  Is there any other info I can provide about my system that would help in "debugging" this issue (if it even is a SW issue)?

 

Will still be looking at getting a new disk to replace the Seagate with the "Reallocated Sectors", but I think the array itself should be in okay shape.

 

Will let you know full details of the complete parity correction, and the following parity correction (to ensure it's still all good)...

 

Thanks again!

Link to comment

Right at this point, a correcting parity check and then another check would prove/disprove the health of the system.

 

While one drive had a few reallocated sectors, that's not so uncommon as long as they do not grow and they are not pending sectors.

 

What did concern me was the high Hardware_ECC_Recovered number.

 

Is this drive part of the array?

 

Model Family:    Seagate Barracuda 7200.10

Device Model:    ST3500630AS

Serial Number:    6QG1BS25

 

5  Reallocated_Sector_Ct  0x0033  100  100  036    Pre-fail  Always      -      10

195 Hardware_ECC_Recovered  0x001a  061  048  000    Old_age  Always      -      196007466

197 Current_Pending_Sector  0x0012  100  100  000    Old_age  Always      -      0

198 Offline_Uncorrectable  0x0010  100  100  000    Old_age  Offline      -      0

199 UDMA_CRC_Error_Count    0x003e  200  200  000    Old_age  Always      -      0

 

According to my calculations it's about 4.5 years running.

Hmm, only 10 reallocated sectors.

I suppose you could try moving the data to the other drives and preclearing it to see what comes up.

However I was going to suggest retiring it also.

 

I'm not sure that unRAID 6 is not writing correct parity.

I have a suspicion that something is being read incorrectly. Perhaps the high ECC errors and/or raw read error rate have something to do with it.

Usually the long test will reveal something and you'll get some 'new' pending sectors.

 

You can always try the Seagate diagnostics too.

 

I wonder if all the incorrect md sectors are within the first 500 GB or continue past that.

 

Perhaps you need to describe your system in more detail.

Maybe something will be recognized by someone else.

 

Link to comment

WeeboTech, thanks again for all of your help!

 

That drive is a part of the array.  Currently, I've got about 120 GB used on that drive (of the 500 GB capacity).  Will look into moving data off, but want to get the parity checks done first.  Then, will put together some detailed documentation of the system.

 

It looks like all the issues might be in that first half.  The first parity check is at 1.91 TB, and it's been steady at 71 sync errors for quite a while.  Here are the locations of the 71 errors (if someone can figure out whether they are all in the first 500 GB...  I don't know how to tell; see the sector-to-offset conversion after the log below).

 

Is there some utility/method out there to figure out which files in the array might be affected by those sectors? Would love to then open them up and make sure that they are okay, or at least understand my risk a bit better...

 

Oct  2 20:35:59 WhiteNAS kernel: mdcmd (21): nocheck  (unRAID engine)
Oct  2 20:35:59 WhiteNAS kernel: md: md_do_sync: got signal, exit... (unRAID engine)
Oct  2 20:35:59 WhiteNAS kernel: md: recovery thread sync completion status: -4 (unRAID engine)
Oct  2 20:36:02 WhiteNAS kernel: mdcmd (22): check CORRECT (unRAID engine)
Oct  2 20:36:02 WhiteNAS kernel: md: recovery thread woken up ... (unRAID engine)
Oct  2 20:36:02 WhiteNAS kernel: md: recovery thread checking parity... (unRAID engine)
Oct  2 20:36:02 WhiteNAS kernel: md: using 1536k window, over a total of 2930266532 blocks. (unRAID engine)
Oct  2 20:51:54 WhiteNAS kernel: md: correcting parity, sector=145897184 (unRAID engine)
Oct  2 20:55:23 WhiteNAS kernel: md: correcting parity, sector=170787112 (unRAID engine)
Oct  2 20:55:43 WhiteNAS kernel: md: correcting parity, sector=173685656 (unRAID engine)
Oct  2 20:56:08 WhiteNAS kernel: md: correcting parity, sector=177414104 (unRAID engine)
Oct  2 20:56:33 WhiteNAS kernel: md: correcting parity, sector=180601272 (unRAID engine)
Oct  2 20:57:40 WhiteNAS kernel: md: correcting parity, sector=188342808 (unRAID engine)
Oct  2 20:57:59 WhiteNAS kernel: md: correcting parity, sector=190461816 (unRAID engine)
Oct  2 20:58:09 WhiteNAS kernel: md: correcting parity, sector=191912048 (unRAID engine)
Oct  2 20:58:44 WhiteNAS kernel: md: correcting parity, sector=197019424 (unRAID engine)
Oct  2 20:58:50 WhiteNAS kernel: md: correcting parity, sector=197822128 (unRAID engine)
Oct  2 20:59:13 WhiteNAS kernel: md: correcting parity, sector=200962168 (unRAID engine)
Oct  2 20:59:14 WhiteNAS kernel: md: correcting parity, sector=201022768 (unRAID engine)
Oct  2 20:59:37 WhiteNAS kernel: md: correcting parity, sector=203333160 (unRAID engine)
Oct  2 20:59:39 WhiteNAS kernel: md: correcting parity, sector=203570456 (unRAID engine)
Oct  2 20:59:42 WhiteNAS kernel: md: correcting parity, sector=203891936 (unRAID engine)
Oct  2 20:59:43 WhiteNAS kernel: md: correcting parity, sector=204139648 (unRAID engine)
Oct  2 21:00:07 WhiteNAS kernel: md: correcting parity, sector=206794264 (unRAID engine)
Oct  2 21:00:08 WhiteNAS kernel: md: correcting parity, sector=206965656 (unRAID engine)
Oct  2 21:00:08 WhiteNAS kernel: md: correcting parity, sector=206966552 (unRAID engine)
Oct  2 21:00:50 WhiteNAS kernel: md: correcting parity, sector=211913168 (unRAID engine)
Oct  2 21:00:54 WhiteNAS kernel: md: correcting parity, sector=212318088 (unRAID engine)
Oct  2 21:01:42 WhiteNAS kernel: md: correcting parity, sector=217817384 (unRAID engine)
Oct  2 21:04:50 WhiteNAS kernel: md: correcting parity, sector=242600960 (unRAID engine)
Oct  2 21:05:10 WhiteNAS kernel: md: correcting parity, sector=245311584 (unRAID engine)
Oct  2 21:05:16 WhiteNAS kernel: md: correcting parity, sector=246161512 (unRAID engine)
Oct  2 21:05:30 WhiteNAS kernel: md: correcting parity, sector=248121288 (unRAID engine)
Oct  2 21:06:22 WhiteNAS kernel: md: correcting parity, sector=254165632 (unRAID engine)
Oct  2 21:06:27 WhiteNAS kernel: md: correcting parity, sector=254646608 (unRAID engine)
Oct  2 21:06:32 WhiteNAS kernel: md: correcting parity, sector=255173616 (unRAID engine)
Oct  2 21:06:44 WhiteNAS kernel: md: correcting parity, sector=256425888 (unRAID engine)
Oct  2 21:08:23 WhiteNAS kernel: md: correcting parity, sector=268134328 (unRAID engine)
Oct  2 21:09:23 WhiteNAS kernel: md: correcting parity, sector=275415936 (unRAID engine)
Oct  2 21:10:20 WhiteNAS kernel: md: correcting parity, sector=280538464 (unRAID engine)
Oct  2 21:10:26 WhiteNAS kernel: md: correcting parity, sector=281148680 (unRAID engine)
Oct  2 21:10:45 WhiteNAS kernel: md: correcting parity, sector=282687200 (unRAID engine)
Oct  2 21:11:21 WhiteNAS kernel: md: correcting parity, sector=285932360 (unRAID engine)
Oct  2 21:11:52 WhiteNAS kernel: md: correcting parity, sector=288899952 (unRAID engine)
Oct  2 21:12:27 WhiteNAS kernel: md: correcting parity, sector=292175224 (unRAID engine)
Oct  2 21:12:39 WhiteNAS kernel: md: correcting parity, sector=293091872 (unRAID engine)
Oct  2 21:13:00 WhiteNAS kernel: md: correcting parity, sector=294823776 (unRAID engine)
Oct  2 21:13:12 WhiteNAS kernel: md: correcting parity, sector=295771040 (unRAID engine)
Oct  2 21:13:43 WhiteNAS kernel: md: correcting parity, sector=298326624 (unRAID engine)
Oct  2 21:13:47 WhiteNAS kernel: md: correcting parity, sector=298713776 (unRAID engine)
Oct  2 21:14:02 WhiteNAS kernel: md: correcting parity, sector=300270104 (unRAID engine)
Oct  2 21:14:30 WhiteNAS kernel: md: correcting parity, sector=303481608 (unRAID engine)
Oct  2 21:14:59 WhiteNAS kernel: md: correcting parity, sector=306666128 (unRAID engine)
Oct  2 21:15:24 WhiteNAS kernel: md: correcting parity, sector=309007272 (unRAID engine)
Oct  2 21:15:39 WhiteNAS kernel: md: correcting parity, sector=310472968 (unRAID engine)
Oct  2 21:16:13 WhiteNAS kernel: md: correcting parity, sector=314020832 (unRAID engine)
Oct  2 21:16:19 WhiteNAS kernel: md: correcting parity, sector=314599400 (unRAID engine)
Oct  2 21:16:23 WhiteNAS kernel: md: correcting parity, sector=315134944 (unRAID engine)
Oct  2 21:16:45 WhiteNAS kernel: md: correcting parity, sector=317552704 (unRAID engine)
Oct  2 21:17:21 WhiteNAS kernel: md: correcting parity, sector=321027240 (unRAID engine)
Oct  2 21:17:35 WhiteNAS kernel: md: correcting parity, sector=322641144 (unRAID engine)
Oct  2 21:17:36 WhiteNAS kernel: md: correcting parity, sector=322849720 (unRAID engine)
Oct  2 21:18:08 WhiteNAS kernel: md: correcting parity, sector=326252848 (unRAID engine)
Oct  2 21:18:34 WhiteNAS kernel: md: correcting parity, sector=328774280 (unRAID engine)
Oct  2 21:18:42 WhiteNAS kernel: md: correcting parity, sector=329359776 (unRAID engine)
Oct  2 21:19:00 WhiteNAS kernel: md: correcting parity, sector=330658160 (unRAID engine)
Oct  2 21:19:09 WhiteNAS kernel: md: correcting parity, sector=331294744 (unRAID engine)
Oct  2 21:23:55 WhiteNAS kernel: md: correcting parity, sector=352486200 (unRAID engine)
Oct  2 21:24:54 WhiteNAS kernel: md: correcting parity, sector=356888208 (unRAID engine)
Oct  2 21:24:56 WhiteNAS kernel: md: correcting parity, sector=357000976 (unRAID engine)
Oct  2 21:25:11 WhiteNAS kernel: md: correcting parity, sector=357944816 (unRAID engine)
Oct  2 21:26:16 WhiteNAS kernel: md: correcting parity, sector=364926424 (unRAID engine)
Oct  2 21:26:19 WhiteNAS kernel: md: correcting parity, sector=365265016 (unRAID engine)
Oct  2 21:28:15 WhiteNAS kernel: md: correcting parity, sector=376629904 (unRAID engine)
Oct  2 21:28:52 WhiteNAS kernel: md: correcting parity, sector=380515424 (unRAID engine)
Oct  2 21:29:55 WhiteNAS kernel: md: correcting parity, sector=387537696 (unRAID engine)
Oct  3 00:42:03 WhiteNAS kernel: md: correcting parity, sector=1822278416 (unRAID engine)
Oct  3 02:49:38 WhiteNAS kernel: md: correcting parity, sector=2715374120 (unRAID engine)
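
In case it helps with the "first 500 GB" question above: the md sector numbers are 512-byte units, so sector x 512 gives the byte offset. A quick sketch to convert every logged correction (assuming the messages are still in /var/log/syslog):

awk -F'sector=' '/md: correcting parity/ { split($2, a, " "); printf "sector %s  ->  %.1f GB\n", a[1], a[1] * 512 / 1e9 }' /var/log/syslog

By that math, everything in the list above sits below about 200 GB except the last two entries, which land at roughly 933 GB and 1390 GB -- so most (but not all) are inside the first 500 GB of the address space.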

 

For those looking for other "gremlins", here's some additional information:

root@WhiteNAS:/# inxi -Fzxxx
System:    Host: WhiteNAS Kernel: 3.9.11p-unRAID i686 (32 bit gcc: 4.4.4)
           Console: tty 0 Distro: Slackware 13.1.0
Machine:   Mobo: ASUSTeK model: M2N-SLI DELUXE
           Bios: Phoenix v: ASUS M2N-SLI DELUXE 1701 date: 10/02/2008 rom size: 512 kB
CPU:       Dual core AMD Athlon 64 X2 4200+ (-MCP-) cache: 1024 KB
           flags: (lm nx pae sse sse2 sse3 svm) bmips: 4018
           clock speeds: min/max: 1000/2200 MHz 1: 1000 MHz 2: 1000 MHz
Graphics:  Card: nVidia G70 [GeForce 7600 GT]
           bus-ID: 03:00.0 chip-ID: 10de:0391
           Display Server: N/A driver: N/A
           tty size: 106x49 Advanced Data: N/A for root out of X
Network:   Card-1: nVidia MCP55 Ethernet
           driver: forcedeth port: b000 bus-ID: 00:08.0 chip-ID: 10de:0373
           IF: eth0 state: up speed: 1000 Mbps duplex: full mac: <filter>
           Card-2: nVidia MCP55 Ethernet
           driver: forcedeth port: ac00 bus-ID: 00:09.0 chip-ID: 10de:0373
           IF: eth1 state: down mac: <filter>
Drives:    HDD Total Size: 7469.5GB (42.3% used)
           ID-1: USB /dev/sda model: Cruzer_Fit size: 8.0GB serial: 4C530006021205103294-0:0
           ID-2: /dev/sdb model: ST3500630AS size: 500.1GB serial: 6QG1BS25
           ID-3: /dev/sdd model: WDC_WD30EZRX size: 3000.6GB serial: WD-WCAWZ1221651
           ID-4: /dev/sdc model: WDC_WD30EFRX size: 3000.6GB serial: WD-WMC4N0439159
           ID-5: /dev/sdf model: WDC_WD6400AARS size: 640.1GB serial: WD-WCAV56654253
           ID-6: /dev/sde model: ST3320620AS size: 320.1GB serial: 5QF15WKQ
Partition: ID-1: /boot size: 7.5G used: 222M (3%) fs: vfat dev: /dev/sda1
Sensors:   System Temperatures: cpu: 28.0C mobo: 34.0C
           Fan Speeds (in rpm): cpu: 3571 psu: 0 sys-1: 2156 sys-2: 0 sys-3: 0 sys-4: 2055
Info:      Processes: 87 Uptime: 9:15 Memory: 174.3/4047.6MB
           Init: SysVinit v: 2.86 runlevel: 3 default: 3 Gcc sys: 4.2.4
           Client: Shell (bash 4.1.72 running in tty 0) inxi: 2.2.14

 

Let me know if you need more or different information about my setup...

Link to comment

Second parity check is complete.  No errors.

 

md5deep done.

 

Here's my plan (looking for comments to see if this is nuts...): 

1) Disable the mover.

2) Copy the 120 GB onto the cache.  The command for that should be...

rsync -av /mnt/disk3/ /mnt/cache

(Do I need to worry about duplicates?)

3) Use md5deep to check.  The command for that should be...

md5deep -rnm disk3.2014-10-03.mdsums /mnt/cache

(This should list any hashes from the mdsums file that it can't find under /mnt/cache... right?  See also the combined sketch after this list.)

4) Assuming that I don't get anything back, I'm safe to proceed.  Start in Maintenance Mode.

5) Now, delete the data on disk3 to prep for removal.  Maintain parity.

dd bs=1M if=/dev/zero of=/dev/sdb

sdb is disk3.  Command found here: http://lime-technology.com/forum/index.php?topic=30270.0

6) Parity is good, disk is deleted.  Time to remove it per instructions in link above...

7) Stop Array, go to Utils, click New Config.

8) Assign all drives except Disk 3.

9) Check "Parity is already valid" and start the array.
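
Roughly how I picture steps 2 and 3 at the console (just a sketch; the dry-run pass is my own addition, and it assumes disk3.2014-10-03.mdsums was built earlier with md5deep -r against /mnt/disk3):

rsync -avn /mnt/disk3/ /mnt/cache/                    # dry run first: shows what would be copied or overwritten
rsync -av /mnt/disk3/ /mnt/cache/                     # the real ~120 GB copy
md5deep -r -n -m disk3.2014-10-03.mdsums /mnt/cache   # list any known hash that has no match under /mnt/cache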

 

Does this sound about right?

 

Edit:  Realized that without the disks mounted (maintenance mode), I really can't do any of the copying.  Do I have to worry about duplicate files?

Link to comment

I wouldn't do step 5 and 6, I would just go right to step 7.

Start with a new config.

Create it. Calculate a brand new parity anyway.

 

 

This leaves disk3 intact, should something go wrong, until new parity is created.

We get so paranoid about maintaining parity; it's not worth it in this case, because as you write zeros to clear disk3, parity is already being altered anyway.

 

 

So might as well do a new config and re-create parity from scratch.

Then immediately do a parity check for sanity's sake.

The first 4 steps seem to be what I would do.

I just wouldn't do step 5 until I'd recreated parity and then verified it.

 

 

Link to comment

Definitely agree with WeeboTech => don't be paranoid about maintaining parity; be more concerned with retaining the data on disk3.

 

You've already done a parity check and have no errors, so there's little risk in doing a New Config and recomputing parity.  If you're really paranoid about doing that, just (a) save a copy of your flash drive; and (b) use a DIFFERENT parity disk with the new config (saving your current one).  Then if something goes awry you can revert to exactly where you are now with good parity.
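
Saving the flash is just a matter of copying /boot somewhere safe first; something like this would do (the destination folder name is just an example):

mkdir -p /mnt/cache/flash-backup-2014-10-03
rsync -av /boot/ /mnt/cache/flash-backup-2014-10-03/   # the USB stick is mounted at /boot (vfat, per the inxi output above)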

 

Link to comment

Awesome! I guess my only hesitation is that we never identified the error source... If it really is one of the other disks, then I could still lose the array. As you note, I'm probably in trouble either way if that is the case...

 

On another note, this has highlighted to me the fragility of CrashPlan backups via network drive. I run CrashPlan on a Windows client, and have a mapped network drive that CrashPlan backs up.

 

So what happens is that when the array is offline, CrashPlan treats the entire drive as if it has been deleted. The data is still there; it is just REALLY hard to tell which data is the set you want back, especially if you've renamed files or moved things around...

Link to comment

It does seem like your backups aren't really reliable, due to the fragility of the CrashPlan backup structure.  I'm glad I use a "real" extra server for my backups  :)

 

At this point, it seems your only real option is to do the reconfig you've outlined; then do what checking you can against your backups - but clearly there are a lot of files you can't really verify, so you'll just have to hope they're okay.  At least you now have checksums, so you'll know if anything is modified going forward.

 

 

Link to comment

I do have another question... Is the file structure of the cache drive intuitive? I put the files in the same location, and imagine that they will get moved right...

 

Originally, the media share was like this...

/mnt/disk3/media/<folders>

I put them in the cache like this...

/mnt/cache/media/<folders>

 

That gonna work like I expect?

Link to comment

Here's what I plan to try...

 

http://serverfault.com/questions/315700/how-to-determine-which-file-inode-occupies-a-given-sector

 

Then I can restore individual files from backup, compare checksums, and check to ensure that the file is sound...
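
For reference, the recipe in that link uses debugfs, which is for ext filesystems; a rough sketch of the idea, assuming an ext2/3/4 filesystem and that the md sector has already been translated into a filesystem block (block = sector x 512 / blocksize, possibly after subtracting the partition's start sector, which fdisk -lu /dev/sdb shows):

debugfs -R "icheck <fs_block>" /dev/sdb1    # <fs_block> is a placeholder: report which inode owns that block
debugfs -R "ncheck <inode>" /dev/sdb1       # <inode> from the previous step: report which path uses that inode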

 

You will, of course, have to do this for EVERY disk, since there's no way to know, for a given sector number, WHICH disk had the bad data that caused parity to be updated.    It could have truly been bad parity (with no impact on any files); or if it was some other disk, there's no way to know which one.    That's the beauty of having checksums -- you could simply run checksum verifications and confirm everything was good (or know for sure which files were not).

 

Link to comment

I do have another question... Is the file structure of the cache drive intuitive? I put the files in the same location, and imagine that they will get moved right...

 

Originally, the media share was like this...

/mnt/disk3/media/<folders>

I put them in the cache like this...

/mnt/cache/media/<folders>

 

That gonna work like I expect?

 

First, the answer is I don't know.  But more fundamentally, why are you copying files to the cache?    All you need to do is set the share to be cached -- then you just copy the files to the share and UnRAID will automatically put them on the cache; and then move them later when the mover runs.

 

Link to comment

It's primarily because I didn't have enough space elsewhere. Should I just move them directly to a disk that's part of the share, like /mnt/disk1, or do you have something else in mind? I don't have any other disks in the machine, other than those that are a part of the array and cache... I suppose I could make it "not cache" and mount that drive up separately...

Link to comment

I do have another question... Is the file structure of the cache drive intuitive? I put the files in the same location, and imagine that they will get moved right...

 

Originally, the media share was like this...

/mnt/disk3/media/<folders>

I put them in the cache like this...

/mnt/cache/media/<folders>

 

That gonna work like I expect?

 

First, the answer is I don't know.  But more fundamentally, why are you copying files to the cache?    All you need to do is set the share to be cached -- then you just copy the files to the share and UnRAID will automatically put them on the cache; and then move them later when the mover runs.

 

 

This is fraught with peril if not done correctly, due to the bug when copying files from a disk to a user share.

Copying to the cache and then letting the mover move them after the drive is removed is safe.

Copying to another computer then copying back is safe too.

 

There's a bug where copying from a disk to a user share will cause the source files to get truncated unless you totally remove the bad drive from the array and then remount it outside the array. OR you have to rename the directory on the source disk to something new, then copy it to the destination directory. That way they can never be overwritten.
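
In other words, something along these lines (a sketch; "media_moving" is just a made-up name, and the paths are the ones mentioned earlier in the thread):

mv /mnt/disk3/media /mnt/disk3/media_moving            # rename on the source disk so the user share can never resolve back to these files
rsync -av /mnt/disk3/media_moving/ /mnt/user/media/    # now copying to the user share can't truncate its own source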

 

 

Cache drive is safe for now. Since you have backups, you can restore if something dies midstream.

Since you've tested your drives multiple times you're probably OK.

Link to comment

Here's what I plan to try...

 

http://serverfault.com/questions/315700/how-to-determine-which-file-inode-occupies-a-given-sector

 

Then I can restore individual files from backup, compare checksums, and check to ensure that the file is sound...

 

Turns out that I don't think this will work with reiserfs.  I can't find the right set of commands to get the filename from the block number once I've got it.  Oh well.  Have the data I want in multiple places.  Time to remove the disk from the set, and watch the sparks fly.  Will let you know how it goes.

 

Link to comment

Parity check came back clear. No errors. I did not have checksums before this adventure; building them now for future use. I have spot-checked files I moved off of the other drive, and they all seem good so far. Again, uncertain if that drive was the source. I know some of the sectors that needed correcting were outside the sector count of the drive.
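
(For reference, the drive's sector count can be read straight off the device, so the comparison is easy -- assuming sdb is still the 500 GB Seagate:)

blockdev --getsz /dev/sdb    # size in 512-byte sectors; a 500 GB drive reports roughly 976,773,168, well below sectors like 1822278416 in the log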

 

Have seen nothing else flash bad at this point. Next steps are to beef up capacity and remove ancient drives.

 

Recommend escalation of potential v6b10a problem.

Link to comment