[Solved] Disk Rebuild Errors – Do I have Corrupted Data?



As a result of Black Friday purchases, I’ve been upgrading disks, and retiring the oldest ones.  I’ve had a few disk problems in the last 9 months, which turned out to be sata cable related.  My plan was to disturb things as little as possible, do all of the disk upgrades, and when things were stable replace all of the sata cables. Bad decision.

 

I replaced disk6, a ST31500341AS 1.5TB, with a 2TB WDC_WD20EARX-00PASB0, and initiated a rebuild.  The result was many disk read errors on disk2.  I canceled the rebuild.  From the errors in the syslog I concluded that all of the errors were related to the sata cables.  I replaced all of the older sata cables.  While I was in the case, I also noticed that the power connector to disk2 was not fully seated, and fixed it.

 

Before continuing, I successfully ran smartctl short disk tests on all disks.
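For anyone wanting to repeat the SMART step, here is a minimal sketch of how the short tests could be started and checked from the console. The function names, the `DRY_RUN` switch, and the device names are made up for illustration; only the `smartctl` options (`-t short`, `-l selftest`, `-A`) are standard smartmontools ones. With `DRY_RUN` left at its default the script only prints the commands it would run.

```shell
# Run a command, or just print it when DRY_RUN is 1 (the default here).
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Start a SMART short self-test on every device given as an argument.
smart_short_tests() {
    local dev
    for dev in "$@"; do
        run smartctl -t short "$dev"
    done
}

# Show the self-test log and the attribute table for every device given.
smart_show_results() {
    local dev
    for dev in "$@"; do
        run smartctl -l selftest "$dev"   # results of completed self-tests
        run smartctl -A "$dev"            # attributes, e.g. Current_Pending_Sector
    done
}
```

Something like `DRY_RUN=0 smart_short_tests /dev/sda /dev/sdb` would start the tests; after giving them a few minutes to finish, `DRY_RUN=0 smart_show_results /dev/sda /dev/sdb` would show the outcome. Substitute the real device names.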

 

I re-initiated the rebuild on disk6.  This time the result was many read errors on disk5, and unraid marked disk5 as missing.  The syslog again indicated to me that the errors were cable/power related.  Disk5 still had one of the older sata cables, and in hindsight, it was on the loose side.  So I replaced the remaining sata cables in the system.

 

At this point, I needed to establish confidence in the hardware.  I re-installed the original disk6, and replaced super.dat with the one from before the first disk6 replacement.  The array was set to not auto-start, and I powered up the hardware.

 

I successfully read over 1GB from each disk with

  dd if=/dev/sdx of=/dev/null bs=65536 count=20000

then initiated a nocorrect parity check.
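The dd read test above can be wrapped in a small loop so every disk gets the same check. This is only a sketch: `read_check` and `read_check_all` are hypothetical helper names, and the defaults simply repeat the block size and count from the command above. Note that this only exercises the start of each disk, so it is a cable/power sanity check rather than a full surface read.

```shell
# Read the first bs*count bytes of a device (or file) and discard them.
# Prints OK/FAIL and returns non-zero if dd reports a read error.
read_check() {
    local target=$1 bs=${2:-65536} count=${3:-20000}
    # Defaults: 65536 * 20000 = 1,310,720,000 bytes, a little over 1GB.
    if dd if="$target" of=/dev/null bs="$bs" count="$count" 2>/dev/null; then
        echo "OK   $target"
    else
        echo "FAIL $target"
        return 1
    fi
}

# Run read_check on every argument; return non-zero if any of them failed.
read_check_all() {
    local rc=0 t
    for t in "$@"; do
        read_check "$t" || rc=1
    done
    return $rc
}
```

Usage would be along the lines of `read_check_all /dev/sda /dev/sdb /dev/sdc`, with the real device names substituted, while watching the syslog for new errors.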

 

The hardware seems stable. The results of the parity check were:

  49 sync errors within 1 second (housekeeping area?)

  1 sync error sometime later

  3000+ sync errors after sector 2930245632

 

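If my calculations are correct, the 3000+ errors all occurred within about 16MB of the end of a 1.5TB drive (disk6): sector 2930245632 is roughly 31,500 512-byte sectors short of the end of a standard 1.5TB disk.  An fdisk of disk6 is attached.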
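The distance from that sector to the end of the old disk6 can be checked with a few lines of shell arithmetic. This is a sketch under stated assumptions: it takes the drive to have the IDEMA standard sector count for an advertised 1500GB capacity and 512-byte sectors, and it assumes the sector number reported by the parity check can be compared directly with the whole-disk count (a partition offset would shift it only slightly). The attached fdisk output is the authoritative source for the real sector count; the function names are made up for illustration.

```shell
# IDEMA standard LBA count for a drive of the given advertised size in GB:
#   97,696,368 + 1,953,504 * (GB - 50)
idema_sectors() {
    local gb=$1
    echo $(( 97696368 + 1953504 * (gb - 50) ))
}

# Bytes between a given sector and the end of a drive with 'total' sectors.
bytes_from_end() {
    local sector=$1 total=$2 sector_size=${3:-512}
    echo $(( (total - sector) * sector_size ))
}

total=$(idema_sectors 1500)                        # assumed size of old disk6
remaining=$(bytes_from_end 2930245632 "$total")    # sector from the parity check
echo "total sectors : $total"
echo "bytes to end  : $remaining"
```

Under those assumptions this prints 2930277168 total sectors and 16146432 bytes to the end, i.e. the errors start about 16MB before the end of a 1.5TB disk.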

 

My Question - Since the parity disk reflects the rebuild of a 1.5TB disk6 onto a 2.0TB disk6, might the 3000+ errors all reflect the reiserfs housekeeping of increasing the size of the disk? Or do I have corrupted data?

 

In other words, can I run a correcting parity check and be reasonably confident that I have no data corruption?  I have no backups and would like, as much as possible, to avoid further corruption.

 

Any suggestions on how to proceed would be greatly appreciated. 

 

I’m thinking that once things are stable, I’ll run a reiserfsck on all of the data drives.

 

An observation – anyone running a server without removable drive bays who does a fair amount of moving/replacing drives should strongly consider replacing their sata cables regularly.  The ones I just installed are Monoprice sata3 cables, and they seem more secure than any other cables I’ve used.

 

 

5.0.2RC1, C2SEE, Celeron 1400, 4GB, Corsair VX450, (1) SIL3132 PCIx SATA controller, Intel PCI NIC, 7 drives in total.

Syslogs_etc.zip


In other words, can I run a correcting parity check and be reasonably confident that I have no data corruption?

 

While it's true that virtually all sync errors identified during parity checks are indeed errors on the parity disk (from a variety of possible causes), there is certainly no guarantee of this. As noted already, unless you have checksums of the files or backups to compare them against, there's no way to KNOW that they're okay. Nevertheless, at this point you have little choice but to simply run a correcting check.

 

 

I have no backups ...

 

ALL disks fail ... it's just a matter of time.    If your data is important enough that you don't want to lose it, then you should correct this deficiency.  A fault-tolerant system is NOT a backup.

 

If you don't consider your data worth backing up, then you should at least generate checksums on the files, so you have a way of knowing which ones are no longer good when there are problems.
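As a rough illustration of the checksum idea, the sketch below builds an md5 manifest for every file under a directory and later verifies the files against it. It assumes the GNU versions of `find`, `sort`, `xargs` and `md5sum`; `make_manifest` and `verify_manifest` are hypothetical names, and `sha256sum` could be swapped in for `md5sum`. Keep the manifest outside the tree being checksummed (ideally on another machine or a flash drive) so it is neither included in itself nor lost along with the data.

```shell
# Write "checksum  ./relative/path" lines for every file under $1 into $2.
make_manifest() {
    local dir=$1 manifest=$2
    # NUL-separated names cope with spaces; sorting keeps the order stable.
    ( cd "$dir" && find . -type f -print0 | sort -z | xargs -0 -r md5sum ) > "$manifest"
}

# Re-check every file listed in manifest $2 against the files under $1.
# Prints only the files that fail; returns non-zero if any file differs or is missing.
verify_manifest() {
    local dir=$1 manifest=$2
    ( cd "$dir" && md5sum --check --quiet ) < "$manifest"
}
```

For example, `make_manifest /mnt/disk1 /boot/disk1.md5` once the array is known to be good, then `verify_manifest /mnt/disk1 /boot/disk1.md5` after a suspect event; the paths here are only examples.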

 


dgaschk

  One disk has 4 pending sectors, but they're the same 4 that have been there for years.  And the parity check (with all new cables) shows no hardware errors.

 

garycase

  I will definitely look into check-summing the files.  A backup server is also worth thinking about, as is segregating the really important files so a copy can be taken off-site.

 

I've also been slowly coming to the conclusion that the only way forward is to do a correcting parity check.  And I've realized that 3000 blocks with errors is only about 3MB of data, so damage may be minimal.

 

Thanks guys for the input.


After you've done your correcting parity check, repeat it and confirm there are no residual sync errors.

 

I would then upgrade to the latest version (just noticed you're not running the latest release) ... and then continue with your drive upgrades.    I agree with dgaschk that the drive with pending sectors is a good candidate for your next replacement.

 

 
