Randomly corrupted files



Hi

I run a 6-disk unRAID server for a friend who stores his event photos on it.

There are a lot of photos and a lot of folders.

Recently he found some corrupted files in some of the folders, roughly 1 in 10 files. It seems to affect only files created about 5 months ago, although he says those files were fine at the time.

Since then, many files have been written with no problems at all.

unRAID reports no errors, and I run a parity check every month.

I think most of the corrupted files are on disk3, so I ran a file system check (reiserfsck):

Comparing bitmaps..finished
Checking Semantic tree:
finished
No corruptions found
There are on the filesystem:
Leaves 179451
Internal nodes 1136
Directories 635
Other files 70293
Data block pointers 175011400 (1159060 of them are zero)
Safe links 0
###########
reiserfsck finished at Fri Mar 21 17:29:24 2014
###########

/dev/md3 mounted on /mnt/disk3

Then I checked the SMART status:

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   247   243   021    Pre-fail  Always       -       9641
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       952
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   079   079   000    Old_age   Always       -       15615
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       61
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       33
193 Load_Cycle_Count        0x0032   105   105   000    Old_age   Always       -       287315
194 Temperature_Celsius     0x0022   123   104   000    Old_age   Always       -       29
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   190   000    Old_age   Always       -       50607
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

As you can see, there are a lot of UDMA_CRC_Error_Count errors (raw value 50607). From what I've read, the cause could be the cables; I'll have to check them later.
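To go through a report like this without eyeballing every row, the attribute table printed by `smartctl -A /dev/sdX` (with `sdX` replaced by the actual device) can be parsed and filtered. This is a minimal Python sketch that assumes the column layout shown above; the set of "watched" attributes is an illustrative choice, not an official list. Keep in mind that the raw CRC count is cumulative over the drive's lifetime, so what matters after replacing the cable is whether it keeps growing.

```python
import re

# Attributes whose non-zero raw count is usually worth a closer look.
# This selection is an illustrative assumption, not an official list.
WATCHED = {5, 196, 197, 198, 199}

# ID, name, flag, normalized VALUE, WORST (ignored), THRESH, ..., numeric raw value at the end.
ROW = re.compile(
    r"^\s*(\d+)\s+(\S+)\s+0x[0-9a-fA-F]+\s+(\d+)\s+\d+\s+(\d+)\s+.*?(\d+)\s*$"
)


def parse_attributes(text):
    """Return {id: (name, value, thresh, raw)} for each attribute row in smartctl -A output."""
    attrs = {}
    for line in text.splitlines():
        match = ROW.match(line)
        if match:
            attr_id, name, value, thresh, raw = match.groups()
            attrs[int(attr_id)] = (name, int(value), int(thresh), int(raw))
    return attrs


def flag_problems(attrs):
    """Report watched attributes with a non-zero raw count, and any at or below threshold."""
    problems = []
    for attr_id, (name, value, thresh, raw) in sorted(attrs.items()):
        if attr_id in WATCHED and raw > 0:
            problems.append(f"{attr_id} {name}: raw={raw}")
        elif thresh > 0 and value <= thresh:
            problems.append(f"{attr_id} {name}: value {value} <= threshold {thresh}")
    return problems


# Two rows copied from the report above.
SAMPLE = """\
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
199 UDMA_CRC_Error_Count    0x0032   200   190   000    Old_age   Always       -       50607
"""

print(flag_problems(parse_attributes(SAMPLE)))
```

Run against the full table above, this flags only attribute 199, which points at the same suspect as the discussion here: the cable, not the disk surface.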

For now I want to recover the corrupted files. Is there a way to do it, for example by re-syncing the data from the parity drive?


> For now I would want to recover the corrupt files. Is there a way to do it? By re-syncing the data from the parity drive or something like it?

I'm afraid parity does not work at the file level. The only way to recover the files is to copy them back from your backups; if you don't have backups, there is no way to recover them.
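Why parity can't bring back a corrupted file can be seen in a simplified model of single parity, where the parity block is the byte-wise XOR of the blocks at the same offset on every data disk. The sketch below uses hypothetical 4-byte blocks and is not unRAID's actual code. It shows that parity can rebuild a whole missing disk, but a silently altered block only shows up as a stripe mismatch with no indication of which disk is wrong, and a correcting check simply recomputes parity from the bad data.

```python
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of equal-length blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Hypothetical stripe: the block at the same offset on three data disks.
data = [b"\x10\x20\x30\x40", b"\x01\x02\x03\x04", b"\xaa\xbb\xcc\xdd"]
parity = xor_blocks(data)

# Case 1: disk 1 dies. Its block is rebuilt from the surviving disks plus parity.
rebuilt = xor_blocks([data[0], data[2], parity])
print("rebuilt matches:", rebuilt == data[1])

# Case 2: disk 1 keeps working, but one byte of its block silently changes.
on_disk = [data[0], b"\x01\x02\xff\x04", data[2]]
# A parity check only sees that data XOR parity is no longer all zeros;
# nothing says which of the four blocks is the wrong one.
mismatch = xor_blocks(on_disk + [parity]) != bytes(len(parity))
print("parity mismatch:", mismatch)

# A correcting check "fixes" the stripe by recomputing parity from the data
# as it is now, so the corrupted byte becomes the new consistent state.
parity = xor_blocks(on_disk)
print("stripe consistent again:", xor_blocks(on_disk + [parity]) == bytes(len(parity)))
```

If the data was already bad when it was written (for example, damaged in RAM), parity was computed from the bad data in the first place, and a parity check reports nothing at all.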


> So as you can see there are a lot of UDMA_CRC_Error_Count... I read about it and problem could be the cables, I'll have to check them later...

Replace the cable to this drive as soon as possible! Make sure it's a good-quality one. SATA cables are cheap but important, so never skimp on them.

There is a small chance that these CRC errors are caused by poor power, so make sure your power supply is a good one and sufficient for your system's needs.


The system I have is:

PENTIUM G850 (2.9GHZ) SKT 1155

ASROCK B75 PRO3-M

KINGSTON KIT 4GB DDR3 1333MHZ

6 WD 2TB Green drives connected to the onboard SATA ports

PCIe gigabit network card

Syslog: http://pastebin.com/czgAuHyp

I don't have a cache drive. Would one prevent this kind of occurrence?

It seems that no new files are being corrupted. It looks like some files got corrupted only during that period, but we only found out today.


The CRC errors are probably unrelated to the corruption, too. They hurt performance, because the data has to be resent (and resent again if necessary until it's good), but they should not cause a file to be corrupted. Your syslog does not show any issues, apart from one anomaly. You did a parity check on Feb 28 that ran for 10 hours, then two more on Mar 7 and Mar 21 that each ran for 7 hours, all without issue. It's rather strange that one parity check would take 3 hours longer with no issues to report, but perhaps you were using the system at the time?
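The point that CRC errors cost speed but not data can be illustrated with a toy link that attaches a CRC to each frame and resends any frame whose checksum doesn't match on arrival. In this sketch `zlib.crc32` stands in for the link-level CRC, and the single-bit-flip error model is an assumption; it is not the real SATA frame format.

```python
import random
import zlib


def send_frame(payload, error_rate, rng):
    """Toy noisy link: returns the (possibly damaged) payload plus the sender's CRC."""
    crc = zlib.crc32(payload)
    if rng.random() < error_rate:
        damaged = bytearray(payload)
        damaged[rng.randrange(len(damaged))] ^= 0x01  # flip a single bit in transit
        return bytes(damaged), crc
    return payload, crc


def transfer(payload, error_rate, rng, max_tries=100):
    """Resend until the receiver's CRC matches; return (data, number_of_crc_errors)."""
    crc_errors = 0
    for _ in range(max_tries):
        received, sent_crc = send_frame(payload, error_rate, rng)
        if zlib.crc32(received) == sent_crc:
            return received, crc_errors
        crc_errors += 1  # roughly analogous to what UDMA_CRC_Error_Count tallies
    raise OSError("link too unreliable, giving up")


rng = random.Random(42)
payload = b"JPEG data " * 50
data, errors = transfer(payload, error_rate=0.3, rng=rng)
print(data == payload, errors)
```

Every damaged frame is counted and retried, so the bytes that finally arrive match what was sent. A CRC-32 catches every single-bit flip, as modelled here, though no CRC is guaranteed to detect every possible multi-bit error.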

 

File corruption issues can be hard to track down, as they are usually very infrequent and there are generally no visible errors anywhere. When you checked your memory, did you let each test run for at least 8 hours?
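Because this kind of corruption produces no visible errors, one way to catch it earlier next time is to record a checksum for every file once it has been copied to the array, and compare against that record later. This is a minimal Python sketch; the function names and the example path are hypothetical, and SHA-256 is an arbitrary choice of hash.

```python
import hashlib
import json
import os


def hash_file(path, chunk_size=1 << 20):
    """SHA-256 of a file, read in 1 MiB chunks so large files don't fill RAM."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file path (relative to root) to its checksum."""
    manifest = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full_path = os.path.join(dirpath, name)
            manifest[os.path.relpath(full_path, root)] = hash_file(full_path)
    return manifest


def save_manifest(manifest, path):
    with open(path, "w") as f:
        json.dump(manifest, f, indent=1, sort_keys=True)


def load_manifest(path):
    with open(path) as f:
        return json.load(f)


def changed_files(old, new):
    """Files present in both manifests whose contents no longer match."""
    return sorted(p for p in old if p in new and old[p] != new[p])
```

For example, call `save_manifest(build_manifest("/mnt/disk3/photos"), "disk3.json")` after an import, and later `changed_files(load_manifest("disk3.json"), build_manifest("/mnt/disk3/photos"))`. Photos normally don't change after they are copied, so any file listed is suspect, although a deliberately edited file would show up too.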
