Lost data due to file system corruption - can someone help me understand what happened?


oliver


Hi, I'm new to Unraid. I just set up an 8x12 array with 2 parity drives and have about 50 TB of data on it. I'm running Unraid 6.6.6 and all drives are xfs-encrypted.

 

Today I went to reshuffle files around to clean things up. I started getting errors in Windows, and when I looked at the Unraid logs, one of my disks was reporting CRC errors and had dismounted. It would come back after a reboot, but accessing the same files would cause it to dismount again. Logs for this event.

 

A Google search told me I needed to run xfs_repair, so I did that through the UI and got these logs. Sure enough, the contents of misc stuff are no longer in the data array; they are in the newly created lost and found share (but corrupted and unusable). I also noticed I had 3 TB more free space than before, which tells me I just lost a lot of data. I'm currently looking through backups, but since I was shuffling things around, it's not yet clear what I lost.

 

1.  What could have caused this and how can I prevent this in the future? I had no power surges, no unclean shutdowns. I had just transferred 50 TB of data and built parity and there were no signs of errors in the logs. The problems didn't begin until I began moving files. Can moving thousands of files cause this kind of error?

Could the disk be bad? No SMART errors were reported at any time and all the disks otherwise report healthy. If there is even a possibility of a bad disk, I'd rather find out now and swap it.

 

2.  Could the file originally have been corrupted during transfer from the source? With 50 TB of data spread over so many files, it's possible, I suppose. It would also explain why the disk would come back on a reboot and then fail only when accessing the same bad folder. But what I don't understand is how one bad file can corrupt the entire filesystem and cause a dismount instead of just making that individual file unreadable.

 

3.  When the drive was dismounted, it showed as 'green' in the dashboard and no alert was sent to me. If I hadn't been doing work on it at the time, I wouldn't have even known there was a problem. Why is no alert sent for this?

 

Would appreciate any answers. This is a very disconcerting problem to see in a brand new array and something that parity won't help with. Want to make sure I can reduce the risk of it happening again.

Can moving thousands of files cause this kind of error? 

No, if there were no unclean shutdowns the most likely culprit of the initial corruption would be a hardware problem, like bad RAM for example, unless you're using ECC RAM.

 

You should also post the diagnostics, ideally from that period. If you don't have them, post current ones; at worst we can check the hardware you're using and whether there are known issues with it.
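If you want a way to catch silent corruption early (from bad RAM or anything else), one option is to keep a checksum manifest of your files and compare it against a later scan. The following is only a minimal sketch of that idea, not an Unraid feature; the function names `sha256_of`, `build_manifest` and `changed_files` are made up for illustration, and the path in the usage comment is hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of one file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file's path (relative to root) to its SHA-256 digest."""
    root = Path(root)
    return {
        str(path.relative_to(root)): sha256_of(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def changed_files(old, new):
    """List files from the old manifest that are missing or differ in the new one."""
    return sorted(name for name in old if new.get(name) != old[name])


# Hypothetical usage: scan a share now, scan it again later, then compare.
#   before = build_manifest("/mnt/user/some_share")
#   after = build_manifest("/mnt/user/some_share")
#   print(changed_files(before, after))
```

If files you have not touched show up as changed between two scans, that points at the storage or memory path rather than at anything you did, and it is worth running a memory test before trusting the server with more data.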

On 1/14/2019 at 3:49 AM, johnnie.black said:

No, if there were no unclean shutdowns the most likely culprit of the initial corruption would be a hardware problem, like bad RAM for example, unless you're using ECC RAM.

 

You should also post the diagnostics, ideally from that period, but if you don't have them post current ones, at worst we can check the hardware you're using and if there are known issues with it.

 

Sorry for the delay in responding, I was gone for the week.  Since coming back I have recovered the missing files and have been using the array for a few days, and it seems to be OK.

 

I'm wondering about parity though.  How do filesystem errors impact that? 

 

I've been adding data since the event so I assume the parity reflects the current state of things.  Should I do a parity check immediately or just wait until the next scheduled check (Feb 1)?

4 hours ago, oliver said:

I'm wondering about parity though.  How do filesystem errors impact that? 

Parity can't help with filesystem corruption. Since parity is updated in real time, any fs corruption written to a disk is also written into parity, so it will be present on the emulated disk as well.
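To illustrate why, here is a toy sketch using plain XOR parity over a few bytes. It is a simplification and not Unraid's actual code (the second parity drive uses a different calculation, and real disks work on sectors rather than 4-byte blocks); the helper name `xor_blocks` is made up for this example.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, position by position."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Three toy "data disks", each holding one 4-byte block.
disks = [b"\x01\x02\x03\x04", b"\x10\x20\x30\x40", b"\xaa\xbb\xcc\xdd"]
parity = xor_blocks(disks)

# A bad write corrupts disk 1. Parity is updated in real time, so it now
# describes the corrupted contents, not the original ones.
disks[1] = b"\x00\x00\x00\x00"
parity = xor_blocks(disks)

# Disk 1 "fails" and is emulated from parity plus the surviving disks.
emulated = xor_blocks([disks[0], disks[2], parity])

# The emulated disk carries exactly the corrupted bytes.
print(emulated == disks[1])
```

Parity only knows which bits are on the disks; it has no idea whether those bits form a valid filesystem. That is why backups, not parity, are what protect you from this kind of problem.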

 

It's good practice to run parity checks more frequently in the beginning, since they can help detect some problems. Sync errors should always be 0 unless there was an unclean shutdown.
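Conceptually, a parity check recomputes parity from the data disks and counts the places where it disagrees with what is stored on the parity disk. A rough sketch of that idea, assuming simple XOR parity and toy byte-sized positions (the function name `count_sync_errors` is invented for illustration and is not how Unraid implements it):

```python
def count_sync_errors(data_disks, parity_disk):
    """Count positions where stored parity differs from the XOR of the data."""
    errors = 0
    for position, stored in enumerate(parity_disk):
        expected = 0
        for disk in data_disks:
            expected ^= disk[position]
        if expected != stored:
            errors += 1
    return errors


data = [bytes([1, 2, 3, 4]), bytes([5, 6, 7, 8])]
good_parity = bytes(a ^ b for a, b in zip(*data))

# Simulate one position where data and parity no longer agree.
stale_parity = bytes([good_parity[0] ^ 0xFF]) + good_parity[1:]

print(count_sync_errors(data, good_parity))   # parity in sync with the data
print(count_sync_errors(data, stale_parity))  # one mismatching position
```

A nonzero count on a system that has not had an unclean shutdown means something wrote to a disk without parity being updated to match, or returned bad data, which is usually a hardware hint worth following up.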

 

 

