unRAID6-beta7/8 POSSIBLE DATA CORRUPTION ISSUE: PLEASE READ


limetech


Was the size reported via ls the same as the size reported by the application that read it?

vi or something else?

 

 

I'm trying to determine if the metadata or stat information is intact enough to use for a comparison.
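One rough way to check this is to compare the size that stat reports with the number of bytes that can actually be read back. This is only a sketch: the function names `compare_sizes` and `size_mismatches` are made up for illustration, and a matching size only shows the length is consistent, not that the contents are good.

```python
import os


def compare_sizes(path, chunk_size=1024 * 1024):
    """Return (stat_size, bytes_read) for one file."""
    stat_size = os.stat(path).st_size
    bytes_read = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            bytes_read += len(chunk)
    return stat_size, bytes_read


def size_mismatches(root):
    """Yield (path, stat_size, bytes_read) for files under root whose
    stat size and readable size disagree. Files that cannot be read at
    all are yielded with None for both sizes."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                stat_size, bytes_read = compare_sizes(path)
            except OSError:
                yield path, None, None
                continue
            if stat_size != bytes_read:
                yield path, stat_size, bytes_read
```

Running `list(size_mismatches("/mnt/disk1"))` (the path is just an example) would list any files worth a closer look.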

 

I'm sorry, I wasn't thinking in terms of an OS bug at that time - I was just experimenting with the dockerised LMS which reported an error, I investigated, found the source file was incomplete and re-copied it.  I didn't investigate any further.

Link to comment
I strongly agree with this.  Let's stay at least a release or 2 behind.  Let others bleed.

 

Unfortunately this won't work.  Limetech is committed to releasing v6.0 incorporating two 'bleeding edge' technologies - docker and btrfs.  It is the support of these which dictates the need to incorporate a very recent kernel.

Link to comment

We are finding the same.  The biggest areas of impact seem to be metadata related to plugins / containers that are of small size.  Larger files (media content) seem to be unaffected, but I cannot claim that for certain at this point, just by the nature of how things seem to be working fine for me on video (not a single video-related problem yet, but I have a lot of content to test  :-\ )

 

My biggest concerns are for my email and my photos.  Each mail is held in a separate file, so each file is quite small.  Of course, most email gets read when it arrives, so corruption of new files would almost certainly get noticed.  Corruption of older files may never be discovered.

 

Regarding my photos, the only writes to that drive have been a small number (half a dozen?) of video (large) files, so I'm probably okay there.

 

My server is still running - I have stopped all applications and am continually monitoring the write counts in the emhttp interface.  I have redirected writes from my two docker applications to an xfs drive, and re-enabled them.  So deluge and LMS both continue to run.

 

This seems to be a huge advantage for docker - each application can be controlled individually.  I haven't dared to start up a VM because there would be no way of catching the applications before they write to the array.

Link to comment

It's made very clear in the OP that this is beta software and some features won't work, and there may be bugs.  Everyone that installed a beta is taking on the risk for themselves.  If you can't stomach these kinds of problems, you should not be installing a beta, especially on a production machine where loss of data is a problem.

 

It looks to me that some have become a little complacent because LT does such a fabulous job at wringing out issues before a beta is released, and confidence levels are pretty high.  While this is good for those beta testing, beta software should not be used in a situation where loss of data would be a problem.

 

LT has to deal with the new versions of Linux that are not fully wrung out.  They want to include the latest features of Linux and deliver the best product possible, but there is the risk like in this case where a bug creeps in that they don't have any control over.

 

I get very concerned when I read in the beta posts when people ask "Is this beta safe" and some answer "No problem".  No beta is ever "safe" and I don't think this recommendation is appropriate, especially when I see newer members of the forum ask.  They should not be installing beta software unless they just want to play and check the features.

 

I think LT should re-think the beta program and close it down to the general public and have a select few beta testers.  I know I am opening myself to some criticism, but as a plugin maintainer, I find it extremely difficult to work on plugins for a beta and then get all sorts of support questions that are not appropriate for beta software.

 

Although I agree in general terms - this issue is particularly concerning.

 

In my mind it was unRAID that was beta - NOT Linux. I would never install a beta OS. We knew that there were specific enhancement areas and other areas that were not altered in beta7 or beta8. We might expect Docker instability and issues with XFS and BTRFS file systems. I could even see issues with parity being maintained properly with these new file systems. I would never have expected that something as basic as the ReiserFS would be in question.

 

Live and learn. I will say that Tom's beta features have been solid. It is Linux that has the problem.

Link to comment

I'm not a RFS hater because it's really been proven a great file system and recovery has been outstanding to say the least.

 

However this isn't the first time this type of thing has happened with RFS.

Remember this?  This was also a code change meant to make things better... to streamline... and it was causing corruption.  Tom had to do an intermediate fix until the RFS developers fixed it in the kernel.

 

http://lime-technology.com/forum/index.php?topic=28231.0

 

I do think that jonp's "hunch" (see below) may be coming true sooner than we may think.  I don't think that RFS gets the same attention as other file systems in the Linux kernel -- based on the last two serious instances (that I know of).

 

http://lime-technology.com/forum/index.php?topic=34783.0

"reiserfs vs. xfs for array devices

I need to do more and more of this testing, but generally speaking, I think it's fairly clear that xfs vs. reiserfs is more about making a chess move now that will play out in the longer term to our advantage.  Quite simply:  we believe that if we fast-forward the clock, sooner or later there will be a point in time where xfs is going to have advantages for users over reiserfs.  Call it a hunch, an educated guess, a prediction...whatever.  We just really think reiserfs is on its way out the door.  The bottom line is that for array devices, I would suggest migrating away from reiserfs as users find the opportunity to do so.  It's not a rush...  It's not going to break...  It's been very stable...  That said, think chess moves...  In addition ALL cache pool devices should be BTRFS for now in my opinion unless you're never planning to expand past a single unprotected cache disk.  If you don't have a secondary cache device yet, you can straggle along, but if you want to use a cache pool, you will need two btrfs disk devices assigned."

 

Here is some polling data that bjp999 put together.  I think this is saying something.

XFS vs RFS vs BTRFS polling.

http://lime-technology.com/forum/index.php?topic=34776.0

 

I personally have switched over to XFS all the way for these reasons, although I'm not suggesting we simply abandon RFS immediately.

Link to comment

So...  Funny story...  This weekend I'm listening to some Taj Mahal MP3's, a 3-CD set.  And, I noticed that the latter half of 2-3 tracks per album changed to Elton John.  I ripped these CD's in 2008.  I know I've listened to the MP3's since then and I don't ever remember this strange behavior.  I tried a different player...  Same result.  I checked the file dates...  Still 2008.  I'm very confused. 

 

And, then, today I see this.  I'm very sad right now.  I have so many files on my server I think the question about what exactly is corrupt will out-live me.

 

This is a fail.

Link to comment

As a matter of interest, mention was made of BTRFS having data integrity capabilities (checksum, scrubbing, etc.). Would those have caught the corruption in real time, or would the kernel nature of it have bypassed these?

 

Or in other words, when I finally think 6 is stable enough for me to move across the production system (and boy am I happy I've held off), would BTRFS have advantages to XFS after all?

Link to comment

As a matter of interest, mention was made of BTRFS having data integrity capabilities (checksum, scrubbing, etc.). Would those have caught the corruption in real time, or would the kernel nature of it have bypassed these?

 

Or in other words, when I finally think 6 is stable enough for me to move across the production system (and boy am I happy I've held off), would BTRFS have advantages to XFS after all?

 

 

The checksum feature of BTRFS is realtime, you should know there is a problem.

I think you can only fix it from a snapshot or a manual restore from your backup. Still, knowing a problem exists is a big plus.

 

 

You will need to check the dmesg queue manually and then search out the inodes.

Do a search on it and you will get an idea of what is available.
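As a rough illustration of "searching out the inodes": if the kernel log gives you an inode number, you can walk the mounted tree and look for paths with that number. This is a sketch only. The helper name `find_paths_by_inode` is hypothetical, the log format is not shown here, and inode numbers are only unique within one filesystem (on btrfs, within one subvolume), so point it at the right mount.

```python
import os


def find_paths_by_inode(root, inode):
    """Return every path under root whose inode number equals inode.

    Hard links share an inode, so more than one path may come back.
    """
    matches = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            try:
                # lstat so that symlinks are checked themselves, not their targets
                if os.lstat(path).st_ino == inode:
                    matches.append(path)
            except OSError:
                # Entry vanished or is unreadable; skip it
                continue
    return matches
```

For example, `find_paths_by_inode("/mnt/cache", 12345)` would return the matching paths, with both the mount point and the inode number here being placeholders.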

Link to comment

I'm not a RFS hater because it's really been proven a great file system and recovery has been outstanding to say the least.

...

I personally have switched over to XFS all the way for these reasons, although I'm not suggesting we simply abandon RFS immediately.

 

Just curious, but why isn't anyone talking about ext4?

 

although technical and a little old, this was a good watch.

Link to comment

The checksum feature of BTRFS is realtime, you should know there is a problem.

I think you can only fix it from a snapshot or a manual restore from your backup. Still, knowing a problem exists is a big plus.

Hmm, well if it were on write, presumably you'd be getting errors as you copied a file from elsewhere, giving you the option to check/retry. And as far as corruption of existing files is concerned, the first issue on read would probably have you scrubbing the lot. So problems would seem to be contained in this scenario.

 

If only BTRFS were a bit more mature in the multiple disk scenarios.

Link to comment

As a matter of interest, mention was made of BTRFS having data integrity capabilities (checksum, scrubbing, etc.). Would those have caught the corruption in real time, or would the kernel nature of it have bypassed these?

 

Or in other words, when I finally think 6 is stable enough for me to move across the production system (and boy am I happy I've held off), would BTRFS have advantages to XFS after all?

 

 

The checksum feature of BTRFS is realtime, you should know there is a problem.

I think you can only fix it from a snapshot or a manual restore from your backup. Still, knowing a problem exists is a big plus.

 

I would assume you would potentially know there is an issue with BTRFS if you are copying from one drive to another, but if you are copying to UnRAID for the first time (either from an external computer or download from the Internet) and it's written incorrectly on first write then I wouldn't think anything can catch the corruption as there is no "good" copy to compare against, right?
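One way around the "no good copy to compare against" problem, as a sketch: hash the file where it still lives (the source machine or another disk), copy it, then hash the copy. The names `sha256_of` and `copy_and_verify` are invented for this example. One caveat: a read straight after a write may be served from the page cache rather than the disk, so a match here is not proof of what actually landed on the platters.

```python
import hashlib
import shutil


def sha256_of(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copy_and_verify(src, dst):
    """Copy src to dst and return True if both files hash the same."""
    expected = sha256_of(src)   # hash of the known-good source
    shutil.copyfile(src, dst)
    actual = sha256_of(dst)     # hash of what can be read back from dst
    return expected == actual
```

Keeping the source-side digests in a file would also allow a re-check later, for instance after a reboot, when the cache can no longer hide a bad write.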

 

 

Link to comment

The checksum feature of BTRFS is realtime, you should know there is a problem.

I think you can only fix it from a snapshot or a manual restore from your backup. Still, knowing a problem exists is a big plus.

Hmm, well if it were on write, presumably you'd be getting errors as you copied a file from elsewhere, giving you the option to check/retry. And as far as corruption of existing files is concerned, the first issue on read would probably have you scrubbing the lot. So problems would seem to be contained in this scenario.

 

If only BTRFS were a bit more mature in the multiple disk scenarios.

 

I would only think you'd know if there was an error if the source file was on btrfs. If it's coming from NTFS or the Internet you don't have a version to compare to.

 

Also, even though everyone is in panic mode as this is the first real risk to data in the 3+ years I've been here I would still think RFS is potentially more reliable than XFS or BTRFS - in the sense that it's not a currently developed system (except whatever screw up some brainiac did this time). With XFS and BTRFS being actively developed file systems you have a greater chance of something major screwing up there (figure more changes = more risk). RFS had been static for quite a while from what I understand, so the likelihood of introducing a new issue like this is pretty rare.

 

Granted XFS and BTRFS have better checksum mechanisms in place, but the more you have people mucking with something the higher the potential risk of someone screwing it up.

Link to comment

RFS had been static for quite a while from what I understand, so the likelihood of introducing a new issue like this is pretty rare.

 

I remember RC's of unRAID 5 uncovered a problem with ReiserFS and the superblock not being updated correctly.

In that scenario there was a way of losing data if you had a sudden power loss.

From what I remember, it didn't matter how long the system was idle for. It was the superblock that wasn't getting updated until a hard sync was called.

 

I'm not in the camp that issues with ReiserFS are rare anymore.

I think we all need to put our brains together to come up with a really solid test and validation suite as insurance for the future.

That would be a candidate for another thread entirely.
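Just to make the idea concrete, a test like that could start from something as small as the sketch below: write a batch of small files with known contents, remember their hashes, and check them again later. Everything here (the function names, the file naming, the default sizes) is an assumption for illustration, not an existing tool.

```python
import hashlib
import os
import random


def write_test_files(directory, count=100, max_size=64 * 1024, seed=0):
    """Write count small files of random data and return {path: sha256}."""
    rng = random.Random(seed)  # seeded so file sizes are repeatable
    os.makedirs(directory, exist_ok=True)
    manifest = {}
    for i in range(count):
        data = os.urandom(rng.randint(1, max_size))
        path = os.path.join(directory, f"test_{i:05d}.bin")
        with open(path, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())  # ask the OS to push the data to disk
        manifest[path] = hashlib.sha256(data).hexdigest()
    return manifest


def verify_test_files(manifest):
    """Return the paths whose contents are unreadable or no longer match."""
    bad = []
    for path, expected in manifest.items():
        try:
            with open(path, "rb") as f:
                actual = hashlib.sha256(f.read()).hexdigest()
        except OSError:
            bad.append(path)
            continue
        if actual != expected:
            bad.append(path)
    return bad
```

Small files are used on purpose, since the reports so far point at small files being hit hardest. A real suite would want to save the manifest to disk and verify again after unmounting or rebooting, so the check is not answered from cache.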

Link to comment

any update on b9  ::)

 

LT said it would only take a few hours yesterday. Now we are into the next day.

 

LT never ceases to amaze me. They are consistent though, I will give them that.

 

They never said it was a few hours... they said they were in the process of testing beta9 and would release as soon as they can.

 

At this point the last thing we need is a beta rushed out the door 1/2 tested which could cause more issues.

 

We are all anxiously waiting for beta9, but need to give LT the time to test it properly so we can have confidence it's safe and we can resume normal operations on our UnRAID servers.

Link to comment

Interesting thoughts and a good read. Out of curiosity, why do you have to create a new id to voice your opinion?

 

.....

BTRFS is "fine" for Docker and system drives because both are easily restored and you do not store valuable data on them. However, your mission critical data that you do not want to lose (even the data you value from your Docker applications) should not sit on BTRFS now or anytime in the near future.

 

Bottom line, LT should NEVER EVER use any Linux kernel (even with beta versions of unRAID) that does not have Long Term Support. We should be on Linux kernel 3.14.18 at worst and probably should really be on either the 3.10 or 3.12 series. Simply put, there is not a company that values its data, or a NAS / SAN manufacturer on the planet, that would run its business on, or think of doing, what unRAID does with Linux kernels. It's careless, reckless and flat out stupid.

Link to comment

Interesting thoughts and a good read. Out of curiosity, why do you have to create a new id to voice your opinion?

 

It sounds very much like a former contributor (username GrumpyButFun) who was asked to leave the forum. IMHO, he often has good technical insight, but his tone is frequently belligerent and counterproductive.

Link to comment

I understand why you would think this but your theory is flawed. Let me explain why...It's careless, reckless and flat out stupid.

 

Grumpy, is that you?

 

I could be mistaken, but I'm pretty sure you advocated to get on the latest kernel, because of the older kernels having known bugs.

 

That was my first thought too... Hmm.... that voice/tone sounds awfully familiar! :)

 

Link to comment
