oliver

Lost data due to file system corruption - can someone help me understand what happened?

Hi, so I'm new to Unraid. I just set up an 8x12 array with 2 parity drives and have about 50 TB of data. I'm running Unraid 6.6.6 and all drives are xfs-encrypted.

Today I went to reshuffle files to clean things up and started getting errors in Windows. Looking at the Unraid logs, one of my disks was reporting CRC errors and had dismounted. It would come back after a reboot, but accessing the same files would cause it to dismount again. Logs for this event.

A Google search told me I needed to run xfs_repair, so I did that through the UI and got these logs. Sure enough, the contents of various folders are no longer in the data array; they are in the newly created lost+found share, but corrupted and unusable. I also noticed 3 TB more free space than I had before, which tells me I just lost a lot of data. I'm currently looking through backups, but since I was shuffling things around, it's not yet clear what I lost.

1.  What could have caused this and how can I prevent this in the future? I had no power surges, no unclean shutdowns. I had just transferred 50 TB of data and built parity and there were no signs of errors in the logs. The problems didn't begin until I began moving files. Can moving thousands of files cause this kind of error?

Could the disk be bad? No SMART errors were reported at any time, and all the disks otherwise report healthy. If there is even a possibility of a bad disk, I'd rather find out now and swap it.

2.  Could the file originally have been corrupted during transfer from the source? With 50 TB of data spread over so many files, it's possible, I suppose. It would also explain why the disk would come back after a reboot and then fail only when accessing the same bad folder. But what I don't understand is how it can corrupt the entire filesystem and cause a dismount instead of just making an individual file unreadable.
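One way to rule out corruption during the transfer, as long as the source copy still exists, is to compare checksums of the two copies. The following is only a minimal sketch: `sha256_of` and `find_mismatches` are hypothetical helper names, and the paths in the usage comment are placeholders, not paths taken from this thread.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so large files never need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_mismatches(source_root, dest_root):
    """Return (relative_path, reason) for files missing or different in dest."""
    source_root, dest_root = Path(source_root), Path(dest_root)
    problems = []
    for src in sorted(source_root.rglob("*")):
        if not src.is_file():
            continue
        rel = src.relative_to(source_root)
        dst = dest_root / rel
        if not dst.is_file():
            problems.append((str(rel), "missing"))
        elif sha256_of(src) != sha256_of(dst):
            problems.append((str(rel), "checksum differs"))
    return problems


# Hypothetical usage (placeholder paths):
# for rel, reason in find_mismatches("/path/to/source", "/path/to/copy"):
#     print(rel, reason)
```

A mismatch only shows that the two copies differ. It does not say which side is wrong or when the damage happened.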

 

3.  When the drive was dismounted, it showed as 'green' in the dashboard and no alert was sent to me. If I hadn't been doing work on it at the time, I wouldn't have even known there was a problem. Why is no alert sent for this?
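Until there is a built-in alert, a small periodic check of the mount table is one possible workaround. This is a rough sketch under several assumptions: `missing_mounts` is a hypothetical helper, the `/mnt/diskN` mount points and the sample text are made up for illustration, and on a real system the text would be read from `/proc/mounts`.

```python
def missing_mounts(mounts_text, expected):
    """Return expected mount points absent from /proc/mounts-style text."""
    mounted = set()
    for line in mounts_text.splitlines():
        fields = line.split()
        # /proc/mounts columns: device, mount point, fs type, options, dump, pass
        if len(fields) >= 2:
            mounted.add(fields[1])
    return [point for point in expected if point not in mounted]


# Made-up sample in /proc/mounts format; disk2 is deliberately absent.
sample = """\
/dev/mapper/md1 /mnt/disk1 xfs rw,noatime 0 0
/dev/mapper/md3 /mnt/disk3 xfs rw,noatime 0 0
"""
expected = ["/mnt/disk1", "/mnt/disk2", "/mnt/disk3"]
print(missing_mounts(sample, expected))  # ['/mnt/disk2']
```

This only catches a disk that has actually dropped out of the mount table. A filesystem that is still listed but returns I/O errors would need a different test, such as trying to read a known file.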

 

Would appreciate any answers. This is a very disconcerting problem to see in a brand new array and something that parity won't help with. Want to make sure I can reduce the risk of it happening again.

Can moving thousands of files cause this kind of error? 

No, if there were no unclean shutdowns the most likely culprit of the initial corruption would be a hardware problem, like bad RAM for example, unless you're using ECC RAM.

 

You should also post the diagnostics, ideally from that period. If you don't have them, post current ones; at worst we can check the hardware you're using and whether there are known issues with it.

On 1/14/2019 at 3:49 AM, johnnie.black said:

No, if there were no unclean shutdowns the most likely culprit of the initial corruption would be a hardware problem, like bad RAM for example, unless you're using ECC RAM.

 

You should also post the diagnostics, ideally from that period. If you don't have them, post current ones; at worst we can check the hardware you're using and whether there are known issues with it.

 

Sorry for the delay in responding, was gone for the week.  So I came back and recovered the missing files and have been using the array for a few days and it seems to be ok.  

 

I'm wondering about parity though.  How do filesystem errors impact that? 

 

I've been adding data since the event so I assume the parity reflects the current state of things.  Should I do a parity check immediately or just wait until the next scheduled check (Feb 1)?

4 hours ago, oliver said:

I'm wondering about parity though.  How do filesystem errors impact that? 

Parity can't help with filesystem corruption. Since it's updated in real time, any filesystem corruption will also be present on the emulated disk.
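A toy model may make this easier to see. The sketch below assumes a simplified single XOR parity over tiny 4-byte "disks". It is not Unraid's actual implementation (and a second parity drive uses a different calculation), and `xor_parity` and `rebuild` are hypothetical names.

```python
from functools import reduce


def xor_parity(blocks):
    """Return the byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def rebuild(surviving_blocks, parity):
    """Reconstruct one missing block from the survivors plus parity."""
    return xor_parity(list(surviving_blocks) + [parity])


# Three toy "disks", each holding one 4-byte block.
disks = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_parity(disks)

# A bad write lands on disk 1; parity is updated in real time to match it.
disks[1] = b"B?BB"
parity = xor_parity(disks)

# If disk 1 now fails, the emulated contents include the bad write.
emulated = rebuild([disks[0], disks[2]], parity)
print(emulated)  # b'B?BB'
```

Parity works on raw blocks and has no notion of what a valid filesystem looks like, so it faithfully reproduces whatever was last written, good or bad.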

 

It's good practice to run more frequent parity checks in the beginning, since they can help detect some problems. Sync errors should always be 0 unless there was an unclean shutdown.
