July 18, 2008

Thread continuation from here. Two beers and a movie later ...

<snip>

Looking at safety, we really need to closely consider how users have actually lost data, and address these high likelihood, high impact problems.

An interesting and valid approach to this problem.

1. Pressing the restore button at the wrong times. This has bitten more users in the butt than any other issue.

No argument from me here. Currently, I like "Delete Disk Configuration" with instructions something like "Pressing this button will remove the current system.dat configuration file and create a new one based on the currently assigned and working disks." I like the idea of it first going to a new web page with additional instructions, warnings, and another "I'm sure" checkbox. I like the idea of only enabling the button if a disk has been marked for removal from the array. Perhaps a new value is needed on the devices page: "Delete Disk from Configuration". It would be equivalent to "unassigned", but unless it is chosen for a currently assigned disk, the "Delete Disk Configuration" button would not be enabled. Just about any other need for this button can be handled by asking the user to delete or rename the system.dat file on the flash drive.

2. Problem rebuilding a failed (or upgraded) disk. One bad sector on any other disk will lead to lost data. Doing a parity check routinely is an easy precaution that few users know to do. An ability to schedule these would be an important safety feature. A post just today indicated a user that had not done a parity check since April, and he had over 1000 parity errors. How many other users think they are protected but aren't? Big safety exposure here!

An option to schedule this on a weekly/bi-weekly/monthly basis is a great idea. It takes very little time to implement, and will help many discover issues before they are forced to recover their data.

3. At least one user accidentally assigned a data disk to the parity slot. Deadly.
GUI changes to make this hard to do would be a good safety feature.

Something like invoking vol_id on the potential "partition" would help here. If any file system is detected (reiserfs, ext3, ntfs, fat), then a warning and an additional "I'm sure" checkbox are warranted. Here is the output on my parity drive (sdb1), and one of my data drives (hda1):

    root@Tower:/boot# vol_id /dev/sdb1
    /dev/sdb1: unknown volume type
    root@Tower:/boot# vol_id /dev/hda1
    ID_FS_USAGE=filesystem
    ID_FS_TYPE=reiserfs
    ID_FS_VERSION=3.6
    ID_FS_UUID=f0241f95-552b-4488-95b8-28488e09873e
    ID_FS_UUID_SAFE=f0241f95-552b-4488-95b8-28488e09873e
    ID_FS_LABEL=
    ID_FS_LABEL_SAFE=

4. People don't know how hot is too hot and run their disks at or near the edge of the temperature envelope. Perhaps having unRAID color code the temps on the main tab to blue (cool), green (good), yellow (warm), or red (hot) would let users know they have a problem.

Great idea. Easy for Tom to implement, and it makes it much easier for inexperienced users to determine if a problem exists in their array.

Fan failures are very rare, but setting up hot-running unRAID servers is not.

I respectfully disagree here. Fan failures are common, especially after they have been spinning for several years. Normally fans start making noise before they completely fail, but since we often have our servers in out-of-the-way places, the noise might not be noticed.

I'm glad this is high on Tom's radar and look forward to these features being added to unRAID in the not too distant future. Anyone about to order multiple systems, please tell Tom to move this feature to the top of the list!

I agree with your approach. Learn from where most of the support threads require assistance.

I had an idea overnight. I think that unRAID should do some type of abbreviated parity check before the array is allowed to start. This parity check would NOT fix anything - it would just go looking and see if there are parity error conditions waiting to occur.
If it finds parity problems, it should not start the array. Instead, the Web GUI should give the user the results (xx parity errors were detected) and give the user options on how to proceed (e.g., start the array but perform a parity check, rebuild parity, or don't start the array - allowing the user to make configuration changes to fix the problem and then try again).

The algorithm might check the system areas (the beginning of the disk), and check some random sectors within the data areas. For example, check 256K of data at some random location within the first 1G, within the second 1G, etc. On a 1T array this would result in parity checking 250M worth of data (plus the system area) - a pretty good representative sample of the disk. This should be pretty fast, but even if Tom decided to do more, a couple of minutes at array startup is not a big deal.

This feature would detect issues where drives had not been cleared properly (a bug that was fixed, but some users may not know whether they added a disk while the bug was in place), as well as help users who don't know their parity is hosed for some other reason. If anyone used the "mdcmd set invalidslot 99" command, this check would make the system at least give a cursory view of array integrity before allowing the user to blindly say "my array is valid" and, if wrong, start a flurry of updates to the supposed parity disk.

Because it does not attempt any form of correction, this is a safe test even if the wrong drive is in the parity slot, or something else is terribly wrong with the array.
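The sampling idea above can be sketched roughly as follows. This is only an illustration and not anything taken from the unRAID driver: the function names (`sample_offsets`, `xor_blocks`, `count_parity_errors`) are made up for this sketch, the disks are plain in-memory byte strings of equal size, and parity is assumed to be a simple byte-wise XOR of the data disks.

```python
import random
from functools import reduce

KIB = 1024
GIB = 1024 ** 3


def sample_offsets(disk_size, window=GIB, sample=256 * KIB, rng=None):
    """Pick one sample-aligned random offset inside each `window`-sized region."""
    rng = rng or random.Random()
    offsets = []
    for start in range(0, disk_size, window):
        end = min(start + window, disk_size)
        slots = (end - start) // sample  # whole samples that fit in this region
        if slots == 0:
            continue  # trailing region too small to hold a full sample
        offsets.append(start + rng.randrange(slots) * sample)
    return offsets


def xor_blocks(blocks):
    """Byte-wise XOR of equally sized byte strings (assumed parity scheme)."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def count_parity_errors(data_disks, parity_disk, offsets, sample):
    """Read-only check: count sampled regions where XOR(data) != parity.

    Nothing is written back, so a wrong disk in the parity slot is
    reported rather than "corrected".
    """
    errors = 0
    for off in offsets:
        expected = xor_blocks([d[off:off + sample] for d in data_disks])
        if expected != parity_disk[off:off + sample]:
            errors += 1
    return errors
```

With the defaults, a disk of 1000 GiB yields 1000 samples of 256 KiB each, which is the roughly 250M of checked data mentioned in the post. A real implementation would read from block devices and also cover the system area at the start of the disk, which this sketch leaves out.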
July 18, 2008

I'm not so keen on a parity check before array start. None of the other Linux RAID drivers do this. They check a superblock; if an issue is detected, they do a resync (which is a check or fix).

I don't want anything delaying array startup.

What I think is worthwhile is the recommended "a read-verify pass, 1 or 2 drives at a time". In this case, doing a read/parity verify of one drive at a time can be done on a scheduled basis (yet so can a periodic parity check).

Besides, I do not think a weekly parity check is that bad of a situation. In fact I'm pretty surprised that it's needed.

A swift parity verify per drive is what I would elect. Yet I think this requires spinning up all drives anyway.
July 18, 2008

To expand on my prior comment: rather than the weekly parity check, which requires spinning up all drives for an extended time, a read-verify pass, 1 or 2 drives at a time, would be less taxing, have much less impact on regular use of the server, and be a good PM test to discover bad sectors.

Also, I don't recall whether there is an ability to do a read-only parity test, rather than the full parity rebuild. The former would likely be faster, and sufficiently protective for periodic use between full parity rebuilds.

I see two flavors of suggestions here: things that can ONLY be implemented within unRAID, and things that CAN be implemented externally. The former are up to Tom. For the latter, I suggest that if it CAN be done externally (e.g. temperature monitoring, notifications, etc.), there is enough talent and motivation in this forum to do it.

Scripts are OK for proof of concept, and to provide the function of the application, but a user interface is needed for users who don't know telnet from a teletubby. HTTP/PHP seems the logical solution, but the absence of an HTTP server in unRAID leaves folks looking back to the unRAID emhttp interface. I am biased, and always use Apache. I have a simple (to me) installation procedure for Apache on unRAID, but I'll accept that it is not that simple for some folks. Perhaps there are other light-weight HTTP servers out there; I haven't looked since I have Apache.

It seems to me that external feature modules for unRAID, like temp monitoring, are held back by the user interface issue. Perhaps we can find some light-weight alternatives to Apache, and either make recommendations to Tom or find something easier than Apache to add to the stock unRAID build.

A corollary to this is the install procedure for unRAID add-ins.
To serve the large number of non-Linux experts, I suggest that the install procedure should be done in Windows (like the unRAID install itself), by adding the necessary files to the flash, and editing appropriate scripts on the flash. I'm not saying that all data must be kept on the flash... just the installation procedure. For example, on first execution, an app could create its data files on a hard disk if desired.
July 18, 2008

> HTTP/PHP seems the logical solution.... but the absence of an HTTP server in unRAID leaves folks looking back to the unRAID emhttp interface.

> It seems to me that external feature modules for unRAID like temp monitoring, are held back by the user interface issue. Perhaps we can find some light-weight alternatives to Apache, and either make recommendations to Tom or find something easier than Apache to add to the stock unRAID build.

This is a huge hindrance. I think Apache or Lighttpd is the way to go. Over the next week or so I will be attempting scripted installs of one or the other, using a proxy module to present emhttp in one of the menus. This way it will seem seamless and possibly allow us to add our own "drop in features".

> I suggest that the install procedure should be done in Windows (like the unRAID install itself), by adding the necessary files to the flash, and editing appropriate scripts on the flash. I'm not saying that all data must be kept on the flash... just the installation procedure. For example, on first execution, an app could create its data files on a hard disk if desired.

Agreed, but part of this is also held back by lack of blessing in this thread. http://lime-technology.com/forum/index.php?topic=1953.0

Really, once we have a way to "drop in" packages and "drop in" startup scripts, the rest will be easier. I'm actually very frustrated by this. I have a whole architecture where I can literally drop in a package and reboot and it's installed. What I want to do is add a web page whereby you can upload the package, get a list of current packages, enable/disable them, and also schedule startup scripts which handle secondary install initialization, i.e. syncing of config files and such.

This is one of the reasons a request for a roadmap was made. Just so we know where things are headed and those of us who can lend a hand are able to without wasting effort. We saw how that request seemed to ruffle a few feathers.
Now I'm not trying to call anyone out, cause trouble, or troll. I think what is proposed is a very sound and solid suggestion. In some part, I don't know how we can do this without stepping on someone's toes.

The go script and drop-in directory architecture needs to be refined. It all starts there.

As far as the web server goes, we need input/direction. If Tom plans to add CGI capability to emhttp, then we wait. If this is something that is not on the horizon, then we can use Apache or lighttpd and a proxy module to proxy emhttp.

Thereafter we come to scripting facilities. If you can provide your drop-in mechanism for Apache, I can possibly build off it.
July 18, 2008 (Author)

WeeboTech - we're destined to be on opposite sides of these types of discussions. Fundamentally, you are looking at unRAID as a high availability platform with good reliability; I am looking at unRAID as a high reliability platform with good availability. Realizing this difference in perspective, Tom will make the ultimate call on what, if anything, he wants as unRAID features.

> I'm not so keen on a parity check before array start. None of the other linux raid drivers do this. They check a superblock, if an issue is detected, they do a resync (which is a check or fix).

I don't think comparing RAID and unRAID is appropriate. RAID arrays are statically defined and operate as one. unRAID can be dynamically grown and shrunk, and operates as many. If a RAID array is out of sync, you are totally hosed. If unRAID gets out of sync, you can operate for a VERY long time and not know it. Very different.

> I don't want anything delaying array startup.

I don't want anything delaying array startup either. But safety features have tradeoffs in terms of power and convenience. I think if done properly, the delay would be minimal and the value (in saved users' data) more than worthwhile.

> What I think is worth while is the recommended "a read-verify pass, 1 or 2 drives at a time" In this case, doing a read/parity verify of one drive at a time can be done on a scheduled basis. (yet so can a periodic parity check).

By its very nature, you can't verify parity on a single drive (or a couple of drives). All you could do is a read test, and unRAID could correct read errors with its normal cycle. (A correction would require spinning up the other drives.) This would help make sure that bad sectors get detected and remapped, but won't tell users if their parity is out of whack. I'd rather just do the whole thing at once, but am okay with however Tom decides.
Since the parity check is already there, and all that you'd need to do is add a way of scheduling it, I think doing it that way may be a better use of Tom's time.

> Besides, I do not think a weekly parity check is that bad of a situation. In fact I'm pretty surprised that it's needed

These two statements are confusing together. I think you might almost be agreeing that a periodic parity check might be a good thing, but having a hard time getting the words "I agree" out :D

I think weekly is a little too frequent - I'd do it monthly. Whatever floats your boat.

> A swift parity verify per drive is what i would elect. Yet I think this requires spinning up all drives anyway.

You can't do a parity verify on one drive. You do it on the array. I prefer instantaneous to swift, but unfortunately have to live with the laws of physics.
July 18, 2008

You won't get "drop-in" Apache without a LOT of work, and the Apache you end up with will be bastardized. For example, Apache wants a user and group "apache", and while you can bastardize it otherwise, you create security problems and an install that is so non-standard that folks experienced in Apache will be left scratching their heads troubleshooting it. That is why I suggest looking at a lightweight HTTP server that will do CGI and shell out for scripts and other executables.

> Agreed, but part of this is also held back by lack of blessing in this thread. http://lime-technology.com/forum/index.php?topic=1953.0

This is a classical fallacy I've seen in many projects. You don't need blessing for this. You need the blessing/assistance of Tom for SOME of it to work, such as firing scripts at different unRAID status levels. Much of what we are discussing can be implemented without those hooks. Don't let 10% of the desired features hold you back from doing the project without them.
July 18, 2008

I'm not opposed to some form of check when the system starts, but it should 1) start AFTER the array starts, so the server is available as fast as possible, and 2) be a light-weight test that does not tax the server for normal use (i.e. not a complete parity rebuild).

As for full parity rebuilds, if you want to run one on some schedule for PM or at boot, you need a throttle ... such as milliseconds to sleep between blocks ... so as to not severely impact regular operations.
July 18, 2008

Here's a problem as I see it. If parity is not in line with what is on the drive, then somewhere along the line, something was not done correctly. If you cannot rely on the parity to be accurate, something is wrong.

> I don't think comparing RAID and unRAID is appropriate.

It's totally appropriate. The name itself indicates the same thing: Redundant Array of Inexpensive Disks. Each has some mechanism of protecting your data. How it's done is not the issue.

> If unRAID gets out of sync, you can operatie for a VERY long time and not know it. Very different.

I think this is a real potential problem that should not happen, regardless of the expansion nature of unRAID. Now if someone is mounting filesystems outside of the array and working in them, I can see the parity protection getting out of sync. Yet while inside the array confines, the parity should be accurate.

> Since the parity check is already there, and all that you'd need to do is add a way of scheduling it, I think doing it that way may be a better use of Tom's time.
>
> > Besides, I do not think a weekly parity check is that bad of a situation. In fact I'm pretty surprised that it's needed
>
> These two statements are confusing together. I think you might almost be agreeing that a periodic parity check might be a good thing, but having a hard time getting the words "I agree" out :D

If the potential for parity to be that out of sync exists, then a validate needs to run on a regular schedule. I agree. LOL. But I'm also surprised that it's needed.

> You can't do a parity verify on one drive. You do it on the array.

I'm aware of this. I was quoting someone else, but after some thought realized that all drives have to be online anyway.
July 18, 2008

> I'm not opposed to some form of check when the system starts, but it should 1) start AFTER the array starts, so the server is available as fast as possible, and 2) should be a light-weight test that does not tax the server for normal use (i.e. not a complete parity rebuild). As for full parity rebuilds, if you want to run it on some schedule for PM or at boot, you need a throttle ... such as milliseconds to sleep between blocks ... so as to not severely impact regular operations.

I'm opposed to the check at start. This should be an option:

1. At each start.
2. At an interval.

Are your filesystems fsck'ed on every reboot? No. If an error or dirty shutdown occurred, they are checked, or a journal is rolled back. After a certain interval, an fsck is forced to verify the whole environment. I think unRAID parity verify should work like that.

From what I can see, there is a "check" option in the md code. Does this verify or do a full rebuild?

> you need a throttle ... such as milliseconds to sleep between blocks ... so as to not severely impact regular operations.

Now THIS is what is sorely needed! With the standard md driver, there is the following to help control thresholds so that a resync does not consume all available bandwidth.

    rcotrone@gatekeeper: /proc/sys/dev/raid > ls -l
    total 0
    -rw-r--r-- 1 root root 0 Jul 18 10:13 speed_limit_max
    -rw-r--r-- 1 root root 0 Jul 18 10:13 speed_limit_min
    rcotrone@gatekeeper: /proc/sys/dev/raid > more *
    ::::::::::::::
    speed_limit_max
    ::::::::::::::
    10000
    ::::::::::::::
    speed_limit_min
    ::::::::::::::
    100
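As a rough illustration of what a throughput cap in the spirit of speed_limit_max could look like, here is a small sketch. It is not how the md driver implements its limits, and nothing here comes from the unRAID source; `throttled_read`, `read_block` and the parameters are hypothetical names chosen for the example. The idea is simply to sleep after each block whenever the work is ahead of the allowed rate.

```python
import time


def throttled_read(read_block, total_blocks, block_kb, max_kb_per_sec,
                   clock=time.monotonic, sleep=time.sleep):
    """Call read_block(i) for each block while capping average throughput.

    After every block, compute the earliest time at which that much data
    is allowed to have been processed and sleep until then if we are early.
    """
    start = clock()
    for i in range(total_blocks):
        read_block(i)  # hypothetical callback that reads/verifies one block
        done_kb = (i + 1) * block_kb
        earliest = start + done_kb / max_kb_per_sec
        delay = earliest - clock()
        if delay > 0:
            sleep(delay)  # give the time slice back to normal array use
```

The `clock` and `sleep` parameters are only there so the pacing can be exercised without real waiting. Capping by data rate rather than by a fixed sleep means the user picks one number (KB per second) instead of tuning a delay per machine.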
July 18, 2008

CROSS POST FROM OTHER THREAD.

It has been requested that the parity "check" not perform corrective actions, but instead only report what is found. A second "Parity Calculate" feature would then do the correction. This would potentially be useful in some situations.

OK, you answered my questions. We need:

* Parity Validate
* Parity Calculate
* Speed Threshold Control
July 18, 2008

I'm not sure how the standard MD speed limits work, but in some other apps they suck, as they are based on CPU, not disk I/O. WinRAR, for example, can be given the lowest priority possible, but it still slows the system to a crawl because of disk I/O, even though CPU never exceeds 5%.
July 18, 2008

The standard MD_SPEED_LIMIT works very well. It's based on how much data is read and worked with. I've been using it for years; if you have to do something else, you can lower the limit and the resync impact is negligible.
July 18, 2008

From previous thread:

> The parallel read of all disks during a parity check is much faster.

... but slows the regular operation of the system to a crawl for a long time.

If "safety" has too much negative impact on the user, people won't do it. Look at the anti-virus "boot scans" and "real time protection" that will take a Windows box to a crawl; many people disable them because of that.

If you can throttle a parity check out over 2 or 3 days, then I'd consider it acceptable periodic maintenance. As it is now, I consider it periodic DOWNTIME.

"Honey, why is this movie not playing right?"
"Sweetie, the server is doing maintenance."
"I hate your #%^&# movie thingie... it never works right." (translation: it doesn't function as expected 0.01% of the time)

Perception is, after all, reality.
July 18, 2008

The answer is below.

Duplicate the standard MD behavior of SPEED_LIMIT_MIN and SPEED_LIMIT_MAX. Then have scheduled parity validation:

1. On boot, if a superblock error is detected (this occurs as a parity calculate now, which is good).
2. On a scheduled interval (over x days of use without a parity validate, like ext3 fs).
3. At a scheduled time selected by the user, if #2 is not what they want.

Quite frankly, I would like to see some of this occur at the filesystem level too, i.e. a scheduled filesystem fsck on boot, just like you can do with Windows or Linux via the autofsck flag.
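The three triggers listed above could be combined into a single decision along these lines. This is a sketch only: the function name, its parameters, and the 30-day default are assumptions made for illustration, not anything unRAID provides.

```python
from datetime import datetime, timedelta


def parity_check_reason(now, clean_shutdown, last_check,
                        max_interval_days=30, scheduled_at=None):
    """Return why a parity validation should run now, or None if it should not.

    now / last_check / scheduled_at are datetime objects; last_check may be
    None if no validation has ever been recorded.
    """
    # 1. Superblock says the array was not stopped cleanly.
    if not clean_shutdown:
        return "unclean-shutdown"
    # 2. Too many days of use since the last validation (ext3-style interval).
    if last_check is None or now - last_check >= timedelta(days=max_interval_days):
        return "interval-exceeded"
    # 3. A user-selected time has arrived and has not been serviced yet.
    if scheduled_at is not None and now >= scheduled_at > last_check:
        return "scheduled"
    return None
```

For example, with a clean shutdown and a check recorded five days ago, the function returns None unless a user-scheduled time has passed since that check.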
July 18, 2008

As was pointed out in the prior thread, there is no parity validation function in the driver. Looking around in the source, it looks like the throttling ability from MD is not there.

A timer throttle (sleep milliseconds between blocks, for example) would be simpler to implement, although users would have to determine their own throttle setting for optimum behavior.
July 18, 2008 (Author)

I think you're heading into the technological trees, and are losing sight of the forest.

My suggested startup check is a quick (30 seconds to 1 minute) sanity check, not a full-blown parity check that requires throttling of CPU. The goal was single-minded: to make sure that people's parity protection is really in place and not an illusion. We are seeing evidence that this is a problem. It is a safety feature.

I can't imagine that anyone reading this thread would be more than minorly impacted by a 1 minute boot delay. Most people leave their computers on 24x7 anyway, so we're talking about running this maybe a couple dozen times a year, mostly when new versions of unRAID need to be installed.

If you actually start the array, users can start writing to the array. The nondestructive intent of the test is lost. Not good.

Now if we're talking about an alternate parity-check-type function, to find and correct bad sectors, that runs at a low priority for long periods of time - I would never use that. I don't want ANYTHING lowering the speed of the array! I can easily schedule hours when I am not using it at all and it can run to its heart's content. And keeping the drives spinning for very long periods of time takes away much of the green appeal. Doing it a drive at a time means you are losing the value of verifying parity - a worthwhile check.

What's wrong with scheduling a parity check to run once a week, month, quarter, year ... whatever - overnight - when it is not being used?
July 18, 2008

We are talking about two different issues here.

Issue 1. A validation that the parity calculations are correct.

Issue 2. Exercise of the "SMART" features of a disk to re-allocate sectors (on the next "write") that have been identified by SMART as unreadable.

For the first, we need a "read-only" parity validation process. Personally, I've only seen parity wrong once, when power was lost while writing to the array. I had one sector in error. The parity check upon reboot fixed it. I've never seen parity wrong otherwise. (But then, I don't write to the underlying file partitions on the raw disks either.)

For the second, we need to read and subsequently write to every sector marked for re-allocation on every disk. It has NOTHING to do with parity calculation, other than that a parity calc reads all the blocks on all the disks. To do this one disk at a time will take a very long time... not hours, but days, if you have any size array.

Joe L.
July 18, 2008

> My suggested startup check is a quick (30 second - 1 minute) sanity check, not a full blown parity check that requires throttling of CPU.

I don't see this as worthwhile unless you are going to inspect the S.M.A.R.T. statistics and do something smart based on them.

> The goal was singlemindedly - to make sure that people's parity protection is really in place and not an illusion. We are seeing evidence that this is a problem.

Perhaps the "check" and "nocheck" commands/options are something that can be used to start a check and stop it. If there were a validate, perhaps a command could be used to set a starting point and a duration.

> I can't imagine that anyone reading this thread would be more than minorly impacted by a 1 minute boot delay.

It would disturb me to no end if I'm developing something and have to reboot multiple times. I already have to wait for the boot-up process. Furthermore, if what I'm doing in a package install depends on the disks being active, now I have to add more code in there to be smart about halting until the disk is mounted (how long, and when do you give up?). Until emhttp controls runlevel package installation, I want the array to come up, just like in the normal Linux RAID1 drivers.

> What's wrong with scheduling a parity check to run once a week, month, quarter, year ... whatever - overnight - when it is not being used.

Nothing. It just needs a cron job and an interface for managing it.
July 18, 2008

> To do this one disk at a time will take a very long time... not hours, but days, if you have any size array.

Agreed. It was a thought, but after deeper thought, it's not worthwhile to validate single disks at a time.
July 18, 2008

I think we are missing each other's thrust.

Philosophy 1: Test before user access.
A test done on each boot before user access needs to be fast; that is a given. But there is no fast test that verifies parity. The question is, is there any test that can be done FAST, that can determine with some reliability, that the array/parity is healthy, or not?

Philosophy 2: Low-impact, but more thorough testing.
This can be done on each boot, but gives user access while it runs. But this allows write access before the check is finished.

Of course, there is no reason you can't do both.

For true maximum protection, you would want an off-line parity check done each boot, and completed before user access. Of course, that's overkill.
July 18, 2008

I'm reminded that trying to make something totally idiot-proof results in something idiotic...

I agree that a few things might be added to unRAID to make it safer (think restore button), such as scheduled parity checks IF DESIRED, but adding more complexity to account for a situation that has a 0.01% probability of occurring simply makes the system easier to break...
July 18, 2008

If a validate/verify (or checkreadonly) option were available, with options for a starting sector and max sectors, then I think the test could be achieved. Frankly, I looked into the md code and then the unRAID code, and it's going to be a fair amount of work to do that.

So I would think scheduled weekly/monthly/whateverly checks are appropriate. This can be put in the user interface with less fuss because there is already a similar interface for the mover script.

My only concern with this is resync_speed. I did see where percentages are calculated; perhaps throughput could be used to set a max and give up the time slice for other work so that it does not overwhelm the system.
July 18, 2008 (Author)

> I think we are missing each other's thrust.

I think you're right!

> Philosophy 1: test before user access A test done on each boot before user access needs to be fast.... that is a given. But, there is no fast test that verifies parity. The question is, is there any test that can be done FAST, that can determine with some reliability, that the array/parity is healthy, or not?

This is what I was shooting for. It has to be fast and it has to do a reasonable job of finding AT LEAST ONE parity error if the disk is full of them. I think it is possible to define a fast test that does this. The user that just reported 1000s of parity errors - I'd bet that a check of the first 50M (probably even the first 5M) of the disk would have shown a problem. The problem of the disk not cleared with zeros would have been detected by the random sector checks in each gigabyte.

The purpose of this test is to find gross problems, like disks not cleared and wrong parity disk assigned in the slot. It is not to replace the parity check that occurs after a dirty shutdown, or the scheduled parity check that you should run periodically for preventative maintenance.

Perhaps another script could be specified to be called after the array is fully ready to trigger loading of extra packages.

This was just an idea. If users were doing routine parity checks this might be unnecessary. Even if that were in place, however, it still has some appeal for detecting a bad parity drive after a "mdcmd set invalidslot 99" command.

> Philosophy 2: Low-impact, but more thorough testing. This can be done on each boot, but gives user access while it runs. But this allows write access before check is finished. Of course,there is no reason you can't do both. For true maximum protection, you would want an off-line parity check done each boot, and complete before user access. Of course, that's overkill.
This could also be done, but I think that this is better handled through the scheduling model than through the boot model. I wouldn't want to have to wait through a 6h+ parity check with each boot.
July 18, 2008

> like disks not cleared and wrong parity disk assigned in the slot

Here's what I don't understand: what does a disk not being cleared have to do with invalid parity? If a disk is not cleared and parity is calculated based on what is on the disk, the parity should be accurate to what is already there. I had gotten the hint that clearing the disk was for the benefit of speeding up the calculation, but also for preventing a reiserfsck from picking up old data and mistaking it for valid data.

The other thing is, how do you know when a disk is really clear? I would think you need to know about the underlying filesystem to know what data is valid and what is not.

> I wouldn't want to have to wait through a 6h+ parity check with each boot.

Oh man, that would suck! I remember days of yesteryear when a RAID1 resync was done before the system could come online! Those were painful times!

> Perhaps another script could be specified to be called after the array is fully ready to trigger loading of extra packages.

This has been suggested and discussed in another thread. We're waiting on our great leader's blessing or guidance. ;-)
July 18, 2008

> The question is, is there any test that can be done FAST, that can determine with some reliability, that the array/parity is healthy, or not?

The basic crude/fast sanity check for the array is the status bit in the super.dat superblock file combined with the checksum of the superblock. If it does not indicate the array was shut down cleanly, then parity is suspect.

If the bit shows the array was stopped cleanly, and the super.dat checksum is correct, then the assumption is that parity is correct. Parity does not just get "out of sync".

That test for super.dat sanity is already in place.
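The logic of that sanity check (clean-shutdown bit plus superblock checksum) can be sketched as below. The real super.dat layout is not described in this thread, so the format here is entirely invented for illustration: a 4-byte little-endian flags word, an opaque payload, and a trailing CRC32. `CLEAN_FLAG`, `pack_superblock` and `parity_presumed_valid` are hypothetical names.

```python
import struct
import zlib

CLEAN_FLAG = 0x01  # hypothetical bit meaning "array was stopped cleanly"


def pack_superblock(flags, payload):
    """Build a made-up superblock: flags word + payload + CRC32 of both."""
    body = struct.pack("<I", flags) + payload
    return body + struct.pack("<I", zlib.crc32(body))


def parity_presumed_valid(blob):
    """True only if the checksum matches AND the clean-shutdown bit is set."""
    if len(blob) < 8:
        return False  # too short to hold flags and checksum
    body = blob[:-4]
    (stored_crc,) = struct.unpack("<I", blob[-4:])
    if zlib.crc32(body) != stored_crc:
        return False  # superblock itself is damaged, so trust nothing
    (flags,) = struct.unpack("<I", body[:4])
    return bool(flags & CLEAN_FLAG)
```

Note what this kind of test can and cannot say: it reports that the array was shut down cleanly and that the superblock is intact, which is the basis for assuming parity is good. It does not read any parity data.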
This topic is now archived and is closed to further replies.