Send notification for file system corruption



Yesterday I had a problem where the XFS filesystem on one of my disks went bad.

Some metadata was corrupted, which I only noticed because files were missing and directory listings didn't work correctly.

 

When I looked in the syslog, there were some red lines telling me I had to use xfs_repair to solve the problem.

That worked, apart from the extra work of fishing my files out of the lost+found folder.
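For reference, a typical repair sequence looks roughly like this. This is only a sketch: `/dev/md1p1` is a placeholder for the affected array disk, and on Unraid the array needs to be started in maintenance mode first.

```shell
# Placeholder device: array disks in maintenance mode appear as /dev/mdXp1
# on Unraid 6.12+ (plain /dev/mdX on older releases). Adjust for your disk.
DEV=/dev/md1p1

# Dry run first: -n only reports problems, it changes nothing on disk
xfs_repair -n "$DEV"

# Actual repair; orphaned files end up in that disk's lost+found folder
xfs_repair "$DEV"

# Only if xfs_repair refuses to run because of a dirty log and mounting to
# replay it is not possible: -L zeroes the log and can lose recent changes
# xfs_repair -L "$DEV"
```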

 

What I found strange is that I didn't get a notification about the bad state of the filesystem, so I could have carried on for days without taking the necessary measures.

Is there any way to implement a notification when something like this occurs?


This should be fixed! I just don't want to face a situation of corrupted backups because of rotten bits on a hard drive that Unraid is aware of but sends no warnings about.

I do backups every day since I store data that is very important to me. I don't want to go through hell here...

 


Yes, this is a serious problem that should be fixed!

Last year one of the two SSDs in my BTRFS cache pool died and I did not get any notifications. Only by luck did I see the warnings and errors in the syslog and replace the disk.

I found a script written by mutt that looks for errors in the syslog and then generates a notification. It works well, but I don't know if it's compatible with Unraid 6.12 (I'm still on 6.11).
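The core idea of such a syslog watcher is small enough to sketch. This is not mutt's actual script: the pattern list is illustrative, and the `notify` path below exists on recent Unraid releases but may vary by version.

```shell
#!/bin/sh
# Patterns that typically indicate filesystem trouble (illustrative list)
PATTERN='Unmount and run xfs_repair|BTRFS error|I/O error'

check_syslog() {
  # Print lines matching known filesystem-error patterns from the given log
  grep -E "$PATTERN" "$1" 2>/dev/null
}

if errors=$(check_syslog /var/log/syslog); then
  # Raise an Unraid alert so it reaches the configured notification agents
  # (path is the stock webGui notify script; adjust if your release differs)
  /usr/local/emhttp/webGui/scripts/notify \
    -e "Syslog monitor" -s "Filesystem errors detected" \
    -d "$errors" -i "alert"
fi
```

Run it from cron or the User Scripts plugin every few minutes; a fancier version would remember the last line it checked so it doesn't re-alert on old entries.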

Something like this should (must?) be part of Unraid...

 


Thanks for the hint about the Syslog Notify script; I didn't know it existed until now. On the other hand, I ask myself why such a useful function is not included in Unraid.

 

15 hours ago, Squid said:

FWIW, Fix Common Problems will notify you of corruption, since in the case of corruption the filesystem will tend to get mounted as read-only, and one of FCP's tests is to check for that.

At least in my case it didn't work that way. I stopped the array with the metadata-corrupted filesystem, restarted the array without repairing it, and Unraid did not complain.

 

I wonder why there has been no word from any official Unraid developer until now. That's a real problem.


I'm baffled by this and never thought something like this could happen with Unraid. I'm not talking about the corruption itself (Unraid can't do anything about that), but the fact that this can happen "silently", without the user getting a notification or warning, leaves a really bitter taste.

 

I hope Limetech addresses this ASAP; data integrity is really their selling point.

  • 2 weeks later...
On 8/10/2023 at 4:16 AM, darkside40 said:

@limetech do any of the official Unraid developers ever look at the feature request section?

 

They review them fairly regularly, though I suspect they may not respond until they have something of a concrete answer.

 

The way I handle consistency on the array is multi-pronged:

  • Short SMART tests are run every other day, long tests bi-weekly
    • I run a container called Scrutiny, which provides a dashboard for disk health
  • Array parity checks monthly
    • Notifications are sent to my phone should there be an error here
  • The Integrity plugin to validate that the data on disk matches what's expected
    • It calculates the hash of each file and stores it, then on later runs of the hash calculation generates a notification if the hashes no longer match for whatever reason

My expectation is that, if you'd run a parity check (either manually or via schedule), you'd have been notified of the issue then. I agree that this is less than ideal in that you'd have the added delay of however long it is until your next parity check, but at the very least there is *some* form of check there...

 

I do wish a lot of this was part of the base OS: some kind of sane defaults, then let us change what we want. The fact that there's actually no default schedule for running SMART tests against drives (nor any method to schedule them in the UI) is something of a black eye here. I guess I never really thought about it too much; I just kind of 'set everything up'. But looking back on it now, a lot of this really should be part of the base functionality for *any* NAS OS, imo.
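Until something like that lands in the base OS, a cron entry plus smartctl covers the scheduling gap. This is only a sketch: the device glob is a placeholder, and on Unraid it would typically run from the User Scripts plugin.

```shell
#!/bin/sh
# Kick off short SMART self-tests on each disk (placeholder device glob;
# adjust for your system). Schedule via cron, e.g. "0 3 */2 * *" for
# every other day, with a separate bi-weekly entry using "-t long".
for dev in /dev/sd[a-z]; do
  [ -b "$dev" ] || continue       # skip if the device doesn't exist
  smartctl -t short "$dev"
done

# Reviewing results later: the self-test log, plus the attributes most
# worth watching on aging drives
smartctl -l selftest /dev/sda
smartctl -A /dev/sda | grep -E 'Reallocated_Sector|Current_Pending'
```

Self-tests run in the drive's firmware, so kicking them off is cheap; the real work is checking the results, which is where a dashboard like Scrutiny helps.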

5 hours ago, darkside40 said:

Three weeks, a major problem and no official reaction.

Limetech really cares about the data of its users 👍

 

Again, I think this would've been caught by a parity check; I can't think of any reason it wouldn't be the case... While I agree that additional checks would be helpful, it seems there *is* actually a catch for this in the OS, right? Or am I missing something?

 

I definitely do agree that there's more that could be done to safeguard the data, but I also want to at least acknowledge the stuff that's already there, of course...


It will depend on the filesystem. ZFS, for example, has no fsck tool, but when there's filesystem corruption the user will usually notice, for example by not being able to write to that disk/pool, or by it being unmountable. Also, there's no single way to monitor all supported filesystems. Since XFS is the most common, the syslog could be monitored for "unmount and run xfs_repair", as that's usually logged when filesystem corruption is detected. But at that point the filesystem is already corrupt, and any data loss that happens because of repairing it will already be unavoidable.


Yeah, but you could avoid carrying the corruption around for an undefined amount of time if you knew about it, simply by monitoring the syslog.

 

Like I said, the HDD mounted fine in my case, so that's not an indicator. And I think most people do a parity check approximately once a month, because it's an operation that stresses all the disks in the array.

On 8/24/2023 at 9:37 AM, JorgeB said:

... ZFS for example has no fsck tools...

 

ZFS does have this, it's just referred to as a 'scrub' instead of fsck:

zpool scrub poolName

 

You can also use tools like 'zdb' for more granularity/control of the check/repair (scrub), such as specific block failure reporting, running 'offline' checks on exported pools, etc.

 

(just noting for anyone else who might come across this in the future 👍)


I guess I was sort of getting into semantics 😅

 

With ZFS, I always think of the pool layout as the FS 'equivalent'. Fsck can only really do its job on a journaling filesystem (a feature XFS has, which is why it has more of a direct equivalent), while ZFS is transactional (as is BTRFS). There's no direct equivalent, mainly because of the way they commit: no journal, nothing to replay, atomically consistent.

 

^^ Again, just for clarity should anyone else come across this later. If your pool becomes corrupted, there *are* methods to potentially repair it, but they require manual intervention, and the chances of success vary widely depending on the specific circumstances (pool layout, media type, etc).
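To put some concrete commands to that, for future searchers. This is a sketch with `poolName` as a placeholder; the rewind import in particular is a last resort, not routine maintenance.

```shell
# Routine health checks (safe, read-mostly):
zpool status -v poolName   # shows error counts, including files with permanent damage
zpool scrub poolName       # verify checksums; repairs from redundancy where possible
zpool clear poolName       # reset error counters once the cause is resolved

# Last-resort manual recovery for a pool that won't import cleanly:
# -F rewinds to an earlier transaction group and can discard the most
# recent writes -- only use it when a normal import fails.
# zpool import -F poolName
```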

1 hour ago, JorgeB said:

It's not semantics. For btrfs you have both scrub and fsck; scrub, like with ZFS, only checks the data and metadata, it cannot fix filesystem corruption. ZFS only has the scrub, no fsck.

    

Technically correct (the best kind of correct!), but I think if one were speaking to a filesystem layman or lifelong Windows user, it's the closest equivalent available, and if you were to simply leave it at "ZFS has no fsck" without further explanation as above, it might leave them with a potentially undeserved negative sentiment. Either way, once more, this wasn't posted for you specifically, but as an additional breadcrumb trail for others who might come across this through search, to give them some extra search terms 🤷
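For completeness, the btrfs pair mentioned above looks like this. A sketch only: the mount point and device are placeholders for your cache pool.

```shell
# Online: verifies checksums on a mounted filesystem, repairing from
# redundancy where it can (-B runs in the foreground)
btrfs scrub start -B /mnt/cache
btrfs scrub status /mnt/cache     # progress and error summary

# Offline fsck analogue: inspects filesystem structure; run only on an
# unmounted device
btrfs check /dev/sdb1
# btrfs check --repair /dev/sdb1  # dangerous per the btrfs docs; last resort
```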

 

1 hour ago, darkside40 said:

4 weeks no reaction by @limetech etc.

          

As this is a feature request post (and the core issue of corruption has been resolved, at least in this instance), I wouldn't really expect any sort of timeline for a response directly from Limetech, honestly... Not only is this something that's being requested for a future build, but given the size of the team, they've got to focus their forum efforts primarily on support and bug fixes, at least during a period where there are still release candidates needing to be sorted out (the macvlan issue, for example, is one that's been plaguing folks for quite some time, and may finally be getting ironed out - woot!)

      

Several other things I'd say are worth keeping in mind:

  • There's a plugin already available which helps test for this, 'Fix Common Problems', as @Squid mentioned above. Unraid can't realistically be a zero-maintenance/monitoring system, any more than any other OS can, but there are at least tools out there that help lighten the load.
  • The older a drive is, the more actively it should be monitored for such issues. Not sure if your signature is still accurate, but all of the drive models mentioned are at least 10+ years old, and even the generation following is now something like 7-8. When disks get anywhere near this kind of run time, the 'standard' recommendations for data protection can't really be applied (e.g. monthly parity checks and the like); with anything over 5 years, I'd be running them at least bi-weekly, as disks often fail extremely quickly at that age. As opposed to slowly incrementing errors, I regularly see them pile up massively over a course of hours, maybe days, rarely lasting weeks.
  • Related to this, desktop drives (most especially older desktop drives) were, and are, kinda crap at self-reporting issues. This seemed especially true 5+ years ago, though it has gotten at least a bit better over the years. Where an enterprise or NAS drive might report itself as failed before running out of LBAs to reassign bad blocks to, desktop drives would often happily churn along as though nothing happened, corrupting data along the way. I'd check that disk's SMART data, the reallocated sector count at the very least.
  • Unraid is somewhat unique on the NAS OS front in that it builds its array out of multiple disks containing independent filesystems (of varying types), as opposed to combining the disks' physical blocks into a single virtual disk. Given there's no single reporting mechanism at the FS level that covers all supported filesystem types in the array, there's almost certainly some complexity to reporting individual disk corruption out from underneath the array.

Like I said though, I'm not disagreeing on any specific point, and in fact I agree that Unraid could do more here. It should at the very least regularly run SMART tests by default, propagating errors up to the notification system, and I do hope the devs find time to integrate this into the OS so I can remove my scripts for it. It would almost certainly save folks a lot of pain recovering from backups down the line!

11 hours ago, BVD said:

it should at the very least regularly run SMART tests by default

This sounds like something that might be quite easy to do (at least for SATA drives) in a plugin. Does anyone know if there is already a plugin (or Docker container) that attempts this? If not, I might look into putting together a plugin myself. Having said that, I am sure it will end up being harder than it sounds - these things nearly always are.

