darkside40 Posted August 1

Yesterday I had a problem where the XFS filesystem on one of my disks went bad. Some metadata was corrupted, which I only noticed because files were missing and directory listings didn't work correctly. When I looked in the syslog there were some red lines telling me to run xfs_repair to fix the problem. That worked, apart from the extra work of fishing my files out of the lost+found folder. What I find strange is that I didn't get a notification about the bad state of the filesystem, so I could have kept working for days without taking the necessary measures. Is there any way to implement a notification when something like this occurs?
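For anyone landing here with the same symptom, the repair sequence on Unraid looks roughly like this (a sketch only - the array has to be started in Maintenance mode first, and md1 is just an example; use the md device of the affected disk):

# Stop the array and start it in Maintenance mode from the Main page first.
# Dry run: report problems without modifying anything (-n = no modify).
xfs_repair -n /dev/md1

# Actual repair (verbose). Orphaned files end up in lost+found on that disk.
xfs_repair -v /dev/md1

The same thing can also be done from the GUI via the disk's filesystem check option while in Maintenance mode.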
-Daedalus Posted August 2

If this is the case by default, that seriously needs changing. Limetech has always said the one thing they care most about with unRAID is the data it holds.
darkside40 Posted August 3

And that's a serious problem. I am still reconstructing which files got lost, etc. I know a RAID is no backup, that's correct, but if a filesystem error occurs I want to be informed about it.
justray2k Posted August 4

I would also like to see this implemented! Luckily it hasn't happened to me yet, but at the very least a notification is necessary.
hi2hello Posted August 4

Wow! That seems to be really bad news. I was assuming that any kind of data corruption would lead to an error notification by default. I would like to see this implemented, as I think it is a must-have feature.
unrno.spam Posted August 5

This should be fixed! I just don't want to face a situation of corrupted backups because of rotten bits on a hard drive that Unraid is aware of but never warns about. I do backups every day since I store data that is very important to me - I don't want to go through hell here...
enigmatic-developer2789 Posted August 5

Such things make parity meaningless and provide only a pretense of security. This must be fixed by all means.
vakilando Posted August 6

Yes, this is a serious problem that should be fixed! Last year one of the two SSDs in my BTRFS cache pool died and I did not get any notifications. By luck I saw the warnings and errors in the syslog and replaced the disk. I found a script written by mutt that looks for errors in the syslog and then generates a notification. It works well, but I don't know if it's compatible with Unraid 6.12 (I'm still on 6.11). Something like this should (must?) be part of Unraid...
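The basic idea behind such a script is simple enough to sketch (to be clear, this is not mutt's script, just an illustration; it assumes Unraid's bundled notify helper lives at the path below, and the error patterns are examples to adjust):

#!/bin/bash
# Run periodically via cron or the User Scripts plugin. Scans syslog lines added
# since the last run for filesystem error patterns and raises an Unraid notification.
NOTIFY=/usr/local/emhttp/webGui/scripts/notify
PATTERNS='Metadata corruption detected|unmount and run xfs_repair|BTRFS error|I/O error'
STATE=/tmp/fs-syslog-watch.lastline

LAST=$(cat "$STATE" 2>/dev/null || echo 0)
TOTAL=$(wc -l < /var/log/syslog)
[ "$TOTAL" -lt "$LAST" ] && LAST=0          # syslog was rotated, start from the top

HITS=$(tail -n +$((LAST + 1)) /var/log/syslog | grep -E "$PATTERNS" | head -n 5)
echo "$TOTAL" > "$STATE"

if [ -n "$HITS" ]; then
    "$NOTIFY" -i alert -s "Filesystem errors found in syslog" -d "$HITS"
fi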
JorgeB Posted August 6

9 minutes ago, vakilando said:
"Last year one of the two SSDs in my BTRFS cache pool died and I did not get any notifications."

That's not filesystem corruption.

9 minutes ago, vakilando said:
"It works well, but I don't know if it's compatible with Unraid 6.12 (I'm still on 6.11)."

You can use this script for better btrfs or zfs pool monitoring.
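For anyone reading along later, the gist of that kind of pool monitoring looks something like this (this is not JorgeB's actual script, just the idea; the notify helper path and the pool paths/names are assumptions to adjust):

#!/bin/bash
NOTIFY=/usr/local/emhttp/webGui/scripts/notify

# btrfs: all device error counters should be zero on a healthy pool.
if btrfs device stats /mnt/cache | grep -vq ' 0$'; then
    "$NOTIFY" -i alert -s "btrfs errors on cache pool" -d "$(btrfs device stats /mnt/cache)"
fi

# zfs: 'zpool status -x' reports either healthy pools or no pools when nothing is wrong.
if ! zpool status -x | grep -qE 'all pools are healthy|no pools available'; then
    "$NOTIFY" -i alert -s "ZFS pool problem" -d "$(zpool status -x)"
fi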
Squid Posted August 6

FWIW, Fix Common Problems will notify you of corruption, since in the case of corruption the filesystem will tend to get mounted read-only, and one of FCP's tests is to check for that.
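For reference, the core of that kind of check is roughly the one-liner below (a sketch of the idea, not FCP's actual code; the /mnt/disk and /mnt/cache prefixes are examples):

# Flags any array disk or cache pool that is mounted read-only, which is what
# typically happens after the kernel detects XFS/btrfs corruption.
awk '$2 ~ /^\/mnt\/(disk|cache)/ && $4 ~ /(^|,)ro(,|$)/ {print $2, "is mounted read-only"}' /proc/mounts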
darkside40 Posted August 7

Thanks for the hint about the syslog notify script, I didn't know it existed until now. On the other hand, I ask myself why such a useful function is not included in Unraid.

15 hours ago, Squid said:
"FWIW, Fix Common Problems will notify you of corruption, since in the case of corruption the filesystem will tend to get mounted read-only, and one of FCP's tests is to check for that."

At least in my case it didn't work that way. I stopped the array with the metadata-corrupted filesystem, restarted the array without repairing it, and Unraid did not complain.

I wonder why there hasn't been a word from any official Unraid developer so far. This is a real problem.
Dr_Cox1911 Posted August 8

Baffled by this; I never thought something like this could happen with Unraid. I'm not talking about the corruption itself - Unraid can't do anything about that - but the fact that it can happen "silently", without the user getting a notification or warning, leaves a really bitter taste. I hope Limetech addresses this ASAP; data protection is really their selling point.
darkside40 Posted August 10

@limetech do any of the official Unraid developers ever take a look at the feature request section?
BVD Posted August 19

On 8/10/2023 at 4:16 AM, darkside40 said:
"@limetech do any of the official Unraid developers ever take a look at the feature request section?"

They review them fairly regularly, though I suspect they may not respond until they have something of a concrete answer.

The way I handle consistency on the array is multi-pronged:
- Short SMART tests are run every other day, long tests bi-weekly
- A container called Scrutiny, which provides a dashboard for disk health
- Array parity checks monthly, with notifications sent to my phone should there be an error
- The Integrity plugin to validate that the data on disk matches what's expected - it calculates the hash of each file and stores it, then on later runs of the hash calculation it generates a notification if the hashes no longer match for whatever reason (a rough sketch of the idea follows below)

My expectation is that, if you'd run a parity check (either manually or via schedule), you'd have been notified of the issue then. I agree that this is less than ideal in that you'd have the added delay of however long it is until your next parity check, but at the very least there is *some* form of check there...

I do wish a lot of this was part of the base OS - some kind of sane defaults, then let us change what we want. The fact that there's actually no default schedule for running SMART tests against drives (nor any method to schedule them in the UI, actually) is something of a black eye here. I guess I never really thought about it too much, I just kind of 'set everything up', but looking back on it now... a lot of this really should be part of the base functionality for *any* NAS OS imo.
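To make the hashing idea concrete, here is a rough sketch of what the Integrity plugin does conceptually (this is not the plugin itself - a real tool tracks per-file state rather than rehashing everything each run; the share and database paths, and the notify helper path, are examples):

#!/bin/bash
DIR=/mnt/user/important          # share to protect (example)
DB=/boot/config/file-hashes.md5  # stored hash list (example location)
NOTIFY=/usr/local/emhttp/webGui/scripts/notify

if [ -f "$DB" ]; then
    # Verify previously recorded hashes; --quiet prints only mismatches and errors.
    BAD=$(cd "$DIR" && md5sum -c --quiet "$DB" 2>&1)
    [ -n "$BAD" ] && "$NOTIFY" -i alert -s "File hash mismatch" -d "$BAD"
fi

# Rebuild the hash list so newly added files are covered on the next run.
(cd "$DIR" && find . -type f -exec md5sum {} + > "$DB")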
darkside40 Posted August 24

Three weeks, a major problem, and no official reaction. Limetech really cares about the data of its users 👍
BVD Posted August 24

5 hours ago, darkside40 said:
"Three weeks, a major problem, and no official reaction. Limetech really cares about the data of its users 👍"

Again, I think this would've been caught by a parity check - I can't think of any reason that wouldn't be the case... While I agree that additional checks would be helpful, it seems there *is* actually a catch for this in the OS, right? Or am I missing something maybe? I definitely agree that there's more that could be done to safeguard the data, but I also want to at least acknowledge the stuff that's already there, of course...
JorgeB Posted August 24

It will depend on the filesystem. ZFS, for example, has no fsck tool, but when there's filesystem corruption the user will usually notice, for example by not being able to write to that disk/pool, or by it being unmountable. Also, there isn't a single way to monitor all supported filesystems. I guess that since XFS is the most common, the syslog could be monitored for "unmount and run xfs_repair", since that's usually logged when fs corruption is detected - but at that point the filesystem is already corrupt, and any data loss that happens because of repairing it will already be unavoidable.
darkside40 Posted August 24

Yeah, but you could avoid carrying the corruption around for an undefined amount of time if you knew about it, simply by monitoring the syslog. Like I said, the HDD mounted fine in my case, so that's not an indicator, and I think most people do a parity check approximately once a month, because it's an operation that stresses all the disks in the array.
BVD Posted August 31

On 8/24/2023 at 9:37 AM, JorgeB said:
"... ZFS, for example, has no fsck tool..."

ZFS does have this, it's just referred to as a 'scrub' instead of fsck:

zpool scrub poolName

You can also use tools like 'zdb' for more granularity/control of the check/repair (scrub) - things like specific block failure reporting, running 'offline' on exported pools, etc.

(Just noting this for anyone else that might come across it in the future 👍)
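For future readers, a quick usage example to go with that - the results of a scrub are read back with zpool status (the pool name is a placeholder):

zpool scrub tank        # start the scrub (runs in the background)
zpool status -v tank    # shows scrub progress, error counters, and any affected files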
JorgeB Posted August 31

7 hours ago, BVD said:
"ZFS does have this, it's just referred to as a 'scrub' instead of fsck:"

A scrub doesn't fix the filesystem. Same as with btrfs, it checks that all data and metadata have the correct checksums and corrects them if possible; if the pool panics on import, a scrub won't fix it.
BVD Posted August 31

I guess I was sort of getting into semantics 😅

With ZFS, I always think of the pool layout as the FS 'equivalent'. Fsck can only really do 'fsck' because it works against a journaling filesystem (a feature shared with XFS, which is why it has a more direct equivalent), while ZFS is transactional (as is BTRFS) - there's no direct equivalent, mainly because of the way they commit: no journal, nothing to replay, atomically consistent.

^^ Again, just for clarity should anyone else come across this later. If your pool becomes corrupted, there *are* methods to potentially repair it, but they require manual intervention, and the chances of success vary widely depending on the specific circumstances (pool layout, media type, etc.).
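A couple of examples of the manual-intervention options alluded to above, for anyone searching later (last-resort commands - the rewind import discards the most recent transactions, so read the zpool man page first; the pool name is a placeholder):

zpool import -o readonly=on tank   # try a read-only import so data can be copied off
zpool import -Fn tank              # dry run: report whether a rewind recovery would work
zpool import -F tank               # attempt recovery by rewinding to an earlier transaction group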
JorgeB Posted August 31

35 minutes ago, BVD said:
"I guess I was sort of getting into semantics"

It's not semantics. For btrfs you have both scrub and fsck; scrub, like with zfs, only checks the data and metadata, it cannot fix filesystem corruption. zfs only has the scrub, no fsck.
darkside40 Posted August 31

Four weeks and no reaction from @limetech, etc.
BVD Posted August 31

1 hour ago, JorgeB said:
"It's not semantics. For btrfs you have both scrub and fsck; scrub, like with zfs, only checks the data and metadata, it cannot fix filesystem corruption. zfs only has the scrub, no fsck."

Technically correct (the best kind of correct!), but I think if one were speaking to a filesystem layman or lifetime Windows user, it's the closest equivalent available, and if you were to simply leave it at "zfs has no fsck" without further explanation as above, it might leave them with a potentially undeserved negative sentiment. Either way, once more, this wasn't posted for you specifically, but as an additional breadcrumb trail for others who might come across this through search (etc.), to give them some additional search terms to help 🤷

1 hour ago, darkside40 said:
"Four weeks and no reaction from @limetech, etc."

As this is a feature request post (and the core issue of corruption has been resolved, at least in this instance), I wouldn't really expect any sort of timeline for a response directly from Limetech, honestly... Not only is this something that's being requested for a future build, but given the size of the team, they've got to focus their efforts in the forums primarily on support and bug fixes, at least during a period where there are still release candidates to be sorted out (the macvlan issue, for example, is one that's been plaguing folks for quite some time and may finally be getting ironed out - woot!)

Several other things I'd say are worth keeping in mind:
- There's a plugin already available which would help test for this, 'Fix Common Problems', as @Squid mentioned above. Unraid can't realistically be a zero-maintenance/monitoring system, just as any other can't, but there are at least tools out there that can help lighten the load of doing so.
- The older a drive is, the more active monitoring it should have to keep an eye out for such issues - not sure if your signature is still accurate, but all of the drive models mentioned are at least 10+ years old, and even the generation following is now something like 7-8. When disks get anywhere near this kind of run time, the 'standard' recommendations for data protection can't really be applied (e.g. monthly parity checks and the like) - with anything over 5 years, I'd be running them at least bi-weekly, as disks often fail extremely quickly at that age. As opposed to slowly incrementing errors, I regularly see them pile up massively over a course of hours, maybe days, rarely lasting weeks.
- Subsequent to this, desktop drives (but especially older desktop drives) were/are kinda crap at self-reporting issues - this seemed especially true 5+ years ago, though it has gotten at least a bit better over the years. Where an enterprise or NAS drive might report itself as failed before running out of LBAs to reassign bad blocks to, desktop drives would often happily churn along as though nothing happened, corrupting data along the way. I'd check that disk's SMART data / reallocated sector count at the very least (a quick way to do that from the command line is sketched at the end of this post).
- Unraid is somewhat unique on the NAS OS front in that it builds its array out of multiple disks containing independent filesystems (of varying types), as opposed to building an array out of the physical blocks themselves by combining the disks into a virtual disk - given there's no single reporting mechanism at the FS level that would encompass all supported FS types in the array, there's almost certainly some complexity to reporting individual disk corruption out from underneath the array.

Like I said though, I'm not disagreeing on any specific point, and in fact agree that Unraid could do more here - it should at the very least regularly run SMART tests by default, propagating errors up to the notification system, and I do hope the devs find time to integrate this into the OS so I can remove my scripts for them. It would almost certainly save folks a lot of pain recovering from backups down the line!
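As referenced in the third point above, a quick way to eyeball those SMART attributes from the command line (the device name is an example; Unraid's GUI shows the same attributes on each disk's page):

smartctl -A /dev/sdb | grep -Ei 'reallocated|pending|uncorrect'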
itimpi Posted September 1

11 hours ago, BVD said:
"it should at the very least regularly run SMART tests by default"

This sounds like something that might be quite easy to do (at least for SATA drives) in a plugin. Does anyone know if there is already a plugin (or docker) that attempts this? If not, I might look into trying to put together a plugin myself. Having said that, I am sure it will end up being harder than it sounds - these things nearly always are.
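Until something like that exists, the core of such a plugin would presumably boil down to little more than this (a sketch for SATA drives only; it assumes smartctl, which ships with Unraid, plus cron or the User Scripts plugin to run it on a schedule):

#!/bin/bash
# Kick off a short SMART self-test on every SATA disk. The flash drive will simply
# error out harmlessly; results can be read back later with 'smartctl -l selftest /dev/sdX'
# and surfaced through the notification system.
for dev in /dev/sd?; do
    smartctl -t short "$dev" >/dev/null 2>&1
done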