August 22, 2013

Hi,

I have been running Unraid for over 2 years without a single issue. Yesterday the server got turned off suddenly by some contractors doing work in the house. When I turned it back on I got two separate errors. See screen shot below.

The first is that drive 10 has "issues." Essentially it has a red ball beside it and shows 576 errors. The last scheduled parity check on the first of the month was perfect.

The second issue is that the cache drive now shows as unformatted. The cache drive previously worked great.

I'm not sure how to proceed. Should I deal with the bad drive first and simply replace it, using these instructions from the FAQ?

How do I replace a hard disk?

The procedure to replace a drive is essentially the same, whether you are upgrading the drive to a newer or bigger disk, or replacing a failed drive. Here is the procedure, but first view this post for screen shots and helpful descriptions and comments.

1. On the Devices page of unRAID Web Management, record the current drive assignments by screen capture, screen print, or old-fashioned notes by hand.
2. If there are no failed or missing disks, it is a best practice to perform a parity check before doing anything else. (This will not be possible if there are any failed disks, as all disks must be present and working to check parity.) If there are errors, resolve these before continuing.
3. Make a copy of your "config" directory (it will simplify recovery if something were to go wrong).
4. Stop the array and power down.
5. Remove the bad drive and install the new drive (or remove a smaller drive and install a larger disk in its place).
6. Boot the server.
7. Check the Devices page again, and assign the new drive where the bad drive was; make sure all other assignments are still correct.
8. Return to the Main page, and click the little check box under the Start button that says "I'm sure I want to do this", then click the Start button to start the array and start the rebuild of the replaced drive.
9. The drive will now be rebuilt, which takes a while; the array can be used at the same time, but we recommend waiting until the rebuild is complete, as using the array will greatly slow down the rebuild process.

Or should I deal with the cache drive first? I have never had a problem before so I'm quite nervous. As you can see I have quite a bit of data and I don't want to lose any of it, including what is on the bad drive.

Any advice appreciated.

Also, I received an email about this error from my machine:

Subject: unRaid Failure Notification - One or more disks are disabled or invalid.

Status update for unRAID Tower
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Status: ERROR: The unRaid array needs attention. One or more disks are disabled or invalid.
Disk 10=DISK_DSBL
Server Name: Tower
Server IP: 192.168.1.10
Date: Thu Aug 22 04:47:19 CDT 2013

There is a whole bunch of technical information in the rest of the email which I can attach if needed.
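The "make a copy of your config directory" step in the FAQ procedure could be scripted. The following is only a minimal sketch: the function name `backup_config` and both paths in the usage note are assumptions for illustration, not something taken from the FAQ.

```python
import shutil
import time
from pathlib import Path


def backup_config(config_dir, backup_root):
    """Copy config_dir into a new timestamped folder under backup_root.

    Both arguments are placeholders; point them at wherever your
    "config" directory and your backup location actually live.
    """
    src = Path(config_dir)
    if not src.is_dir():
        raise FileNotFoundError(f"no such directory: {src}")
    stamp = time.strftime("%Y%m%d-%H%M%S")
    dest = Path(backup_root) / f"config-{stamp}"
    # copytree refuses to write into an existing folder,
    # so an earlier backup is never overwritten.
    shutil.copytree(src, dest)
    return dest
```

A call might look like `backup_config("/boot/config", "/mnt/disk1/backups")`, assuming the flash drive is mounted at `/boot` and that the backup folder is somewhere other than the disk being replaced.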
August 22, 2013

I'm not sure why the cache shows as unformatted, but I'd deal with the disabled disk first so you're not running "at risk" (any more failures = lost data). I gather from your concern that you don't have backups of your data ... but that's another issue.

First, I'd shut down, and re-seat disk #10 and all of its cables [this could be a simple case of a loose connection]. Then restart and see if the system begins a rebuild on that disk (if so, let it finish). If not (i.e. it's still redballed), then I'd replace the disk with another 2TB drive SOON ... and let it do the rebuild.
August 22, 2013

I agree with garycase: deal with the data disk first.

What you can do is shut down the server and re-seat disk 10. While everything is down, pull the cache drive out and try it on another PC. (For Windows you will need a program that can mount ReiserFS, like "http://yareg.akucom.de/".) If you can read it, copy everything from it just in case.

You can also try reading disk #10 that way and maybe back up the data from it.

Put everything back, and restart.

Now, if you cannot read disk 10 on another PC, then I would suggest running a preclear on it and seeing the result, but let the more knowledgeable members here chime in on this.
August 23, 2013 Author

Update: I stopped the array and powered down. Then I reseated all drives and restarted. The "red balled" disk stayed red, but the cache drive lost the "unformatted" designation. I have replaced the bad drive and a data rebuild is in progress.

Thanks to all who helped. I will try the suggestion of reading from the bad disk on a Windows machine to try and see what's up.
August 24, 2013

Sounds like the cache was just a loose connection ... or possibly you just didn't wait long enough (a refresh may have shown it was okay originally).

In any event, UnRAID is working just like it should => you were able to remove a failed drive, replace it, and lose nothing in the process. That's why we all have fault-tolerant servers.
August 24, 2013 Author

Update: the rebuild is finished. I now get a "parity is valid" message, which makes me happy. There were 7 sync errors, which have me somewhat concerned because I'm not sure exactly what this means.

From the FAQ:

"Check parity

When the array is Started and parity is already valid, there is a button in the Command area labeled Check which will initiate a background Parity-Check function. Parity-Check will march through all data disks in parallel, computing parity and checking it against stored parity on the parity disk. If a mismatch occurs, the parity disk will be updated (written) with the computed data and the Sync Errors counter will be incremented. The most common cause of Sync Errors is power-loss which prevents buffered write data being written to disk. Anytime the array is Started, if the system detects that a previous unsafe shutdown occurred, then it automatically initiates a Parity-Check."

Here is the screen shot:

I think everything is fine now and I don't need to worry about the sync errors, since the parity has been rebuilt and is now valid. Is this correct?
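The Parity-Check behaviour quoted from the FAQ can be illustrated with a small model. This is a sketch of generic single-parity (XOR) checking, not unRAID's actual code; it assumes all disks are the same length, and the names `xor_blocks`, `check_parity` and `block_size` are made up for the example.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def check_parity(data_disks, parity_disk, block_size=4, correct=True):
    """Walk all disks block by block and compare computed vs stored parity.

    Returns (sync_errors, parity). With correct=True a mismatching parity
    block is overwritten with the computed one, as the FAQ describes;
    with correct=False the stored parity is returned unchanged.
    """
    parity = bytearray(parity_disk)
    sync_errors = 0
    for start in range(0, len(parity), block_size):
        end = start + block_size
        computed = xor_blocks([disk[start:end] for disk in data_disks])
        if computed != bytes(parity[start:end]):
            sync_errors += 1
            if correct:
                parity[start:end] = computed
    return sync_errors, bytes(parity)
```

In this model a second check run straight after a correcting check reports zero sync errors, because the first run has already rewritten the mismatching parity blocks.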
August 25, 2013 Author

Re-ran parity check without errors. All seems to be working. Thanks again...
August 25, 2013

Well ... Yes and No.

IF parity was valid (zero sync errors on your last check before the rebuild) it should have still been zero after a rebuild. A rebuild does NOT check parity ... it can't, as it's using the other N-1 disks to reconstruct the remaining (bad) disk, so has no way of checking what it SHOULD be => it simply assumes all is well and writes the appropriate bit on the new disk for every bit in the array. So at the end of a rebuild a parity check should ALWAYS show zero sync errors UNLESS there have been errors in the rebuild.

IF, however, you had 7 sync errors from a previous non-correcting check, then the rebuild will have 7 errors in the data at some point. [if you're going to do non-correcting checks, you should ALWAYS do a correcting check immediately after a check that shows errors -- first, of course, checking that all connections are secure in case that was the cause of the errors => but you ALWAYS want a check with zero sync errors as the end result of your periodic parity checks]

With the result you had, I'd start running a file comparison utility against all of your backups. If you don't have backups, there's nothing you can really do except recognize that buried in your array are likely 7 errors.
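The point about a rebuild trusting the other disks can be shown with a toy XOR model. This is a sketch of generic single-parity reconstruction under the assumption of equal-sized disks, not unRAID's implementation; `xor_bytes` and `rebuild_disk` are illustrative names.

```python
def xor_bytes(a, b):
    """XOR two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def rebuild_disk(surviving_disks, parity_disk):
    """Reconstruct the missing disk as parity XOR every surviving data disk.

    Nothing here can tell whether the parity was correct beforehand:
    whatever falls out of the XOR is written to the new disk.
    """
    rebuilt = bytes(parity_disk)
    for disk in surviving_disks:
        rebuilt = xor_bytes(rebuilt, disk)
    return rebuilt


# Three data disks with correct parity: the middle disk comes back intact.
d1, d2, d3 = b"\x01\x02", b"\x10\x20", b"\xaa\xbb"
parity = xor_bytes(xor_bytes(d1, d2), d3)
recovered = rebuild_disk([d1, d3], parity)

# If one parity byte was already wrong, the rebuilt disk silently differs,
# yet the array is still self-consistent, so a later check finds no mismatch.
stale = bytes([parity[0] ^ 1, parity[1]])
silently_wrong = rebuild_disk([d1, d3], stale)
```

In this model a pre-existing parity error ends up as an error in the rebuilt data rather than as a sync error, which is why the quoted advice is to finish periodic checks with zero sync errors.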
August 25, 2013

By the way, if you don't have backups (you should), and don't plan to rectify that, then you should at least keep MD5s for your files, so if something like this happens again you can at least confirm that your files are good.

This little utility works well for that: http://corz.org/windows/software/checksum/

You install it on a Windows machine; then open Windows Explorer, navigate to \\tower; then highlight disk1; right-click; and select "create checksums". Repeat that for each of the disks in the array. Note: This will take 10-15 hours/disk to do ... so you may want to do one disk every night for a while.

After you've done it the first time, you can do it again to update any modified files, and it will be updated very quickly (click on "Synchronize" when prompted).

You can then check the contents at any time by right-clicking on the disk, or any folder in it, and selecting "Verify checksums".

Not as good as a backup (i.e. you can't restore a bad file) ... but at least it will let you know if you HAVE any bad files.
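For anyone who would rather script the same idea than use the Windows utility, here is a rough sketch using only Python's standard library. It is not how the checksum tool works internally; the function names and the in-memory dictionary are assumptions made for the example.

```python
import hashlib
from pathlib import Path


def file_digest(path, algorithm="md5", chunk_size=1 << 20):
    """Hash a file in chunks so large media files need not fit in memory."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def create_checksums(root, algorithm="md5"):
    """Map every file under root (by relative path) to its digest."""
    root = Path(root)
    return {
        p.relative_to(root).as_posix(): file_digest(p, algorithm)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def verify_checksums(root, checksums, algorithm="md5"):
    """Return the relative paths that are gone or no longer match."""
    root = Path(root)
    bad = []
    for rel, expected in checksums.items():
        p = root / rel
        if not p.is_file() or file_digest(p, algorithm) != expected:
            bad.append(rel)
    return bad
```

As with the utility, this only detects a bad file; it cannot restore one.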
August 26, 2013 Author

I'm confused....I assumed my parity drive was valid before I replaced the bad disc...my last parity check had no errors. After rebuilding the failed drive I had 7 sync errors...and then I ran a parity check with no errors.

What should I have done differently?
August 26, 2013

Thanks for pointing out this great piece of software. I've been looking for a way to hash all my files and easily update the hash list when new files are added.

I didn't like the way it put a .hash file in each of my folders, so I edited the defaults in checksum.ini in C:\Users\YourName\AppData\Roaming\corz\checksum so that it creates one file in the root of the drive. I also changed the default from MD5 to SHA1.

I take it that if I delete something out of the directory and it flags the files as missing, I can't just update; I have to create a new .hash file? It asks me if I want to synchronise when it finds an existing .hash there, which is cool when I add to a directory / drive. I'm currently playing with this on a few test folders.

I was able to create a SHA1 hash list of what was on a drive with hashdeep and was able to check/audit against it, but couldn't figure out how to get it to show what was missing when I removed a few test files; it showed files that weren't in the list. I also couldn't figure out how to add to a hash list, only create one. I'm not code / script savvy at all really, so I couldn't create a script to do it for me.

Shame unraid doesn't have a system or plugin in place like this to catch bit rot. Oh well, can dream, one day maybe lol.
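The kind of audit being asked for here (one SHA1 list at the root of a drive, with missing, new and changed files reported separately) might look roughly like the sketch below. It is not how checksum or hashdeep work; the manifest file name and every function name are hypothetical.

```python
import hashlib
from pathlib import Path

MANIFEST_NAME = "checksums.sha1"  # hypothetical name for the single hash list


def sha1_of(path, chunk_size=1 << 20):
    """SHA1 of a file, read in chunks."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def scan(root):
    """Hash every file under root except the manifest itself."""
    root = Path(root)
    return {
        p.relative_to(root).as_posix(): sha1_of(p)
        for p in root.rglob("*")
        if p.is_file() and p.name != MANIFEST_NAME
    }


def write_manifest(root):
    """Create (or overwrite) the single hash list in the root folder."""
    entries = scan(root)
    lines = [f"{digest}  {rel}" for rel, digest in sorted(entries.items())]
    (Path(root) / MANIFEST_NAME).write_text("\n".join(lines) + "\n", encoding="utf-8")
    return entries


def read_manifest(root):
    """Load the hash list back into a {relative path: digest} mapping."""
    entries = {}
    text = (Path(root) / MANIFEST_NAME).read_text(encoding="utf-8")
    for line in text.splitlines():
        if line:
            digest, rel = line.split("  ", 1)
            entries[rel] = digest
    return entries


def audit(root):
    """Compare the tree against the manifest and sort the differences."""
    recorded = read_manifest(root)
    current = scan(root)
    both = set(recorded) & set(current)
    return {
        "missing": sorted(set(recorded) - set(current)),
        "new": sorted(set(current) - set(recorded)),
        "changed": sorted(rel for rel in both if recorded[rel] != current[rel]),
    }
```

Calling `write_manifest` again after reviewing the audit is the "update" step: it rewrites the list so that deleted files drop out and new files are added.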
August 26, 2013

"IF parity was valid (zero sync errors on your last check before the rebuild) it should have still been zero after a rebuild. A rebuild does NOT check parity ... it can't, as it's using the other N-1 disks to reconstruct the remaining (bad) disk, so has no way of checking what it SHOULD be => it simply assumes all is well and writes the appropriate bit on the new disk for every bit in the array. So at the end of a rebuild a parity check should ALWAYS show zero sync errors UNLESS there have been errors in the rebuild."

Gary, this is a known issue with 4.7, but it has never been shown to cause data loss. Apparently the rebuild process on 4.7 doesn't keep parity in sync while writing to some non-data area of the drive, and almost always causes a few sync errors at the beginning of the run.

"I'm confused....I assumed my parity drive was valid before I replaced the bad disc...my last parity check had no errors. After rebuilding the failed drive I had 7 sync errors...and then I ran a parity check with no errors. What should I have done differently?"

Nothing. This is a known 4.7 bug. It doesn't affect data, so don't worry about it. It's one of the reasons people are so touchy about the new release, as the latest final (4.7) still has unfixed bugs, and there hasn't been a fix released for 4.7. This particular bug was taken care of pretty early on in the 5.0 beta cycle, but many people don't like running server software with beta or rc in the name.
August 26, 2013

"this is a known issue with 4.7, but it has never been shown to cause data loss."

Very interesting. Makes sense, since the system can NOT be computing/checking parity during a rebuild -- it's apparently just a display anomaly, since a parity check immediately afterwards is showing zero parity errors (which means everything's actually okay, despite the displayed error count after the rebuild).
August 27, 2013

The system is supposed to maintain parity if writes occur during a parity build. The issue in 4.7 is that this is not handled correctly. Writes occur when the file system is mounted as the parity build starts. The errors result in parity mistakes and a correcting parity check fixes them. Data is not affected. Version 5 does not have this issue.
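What "maintaining parity on a write" means can be sketched with the usual single-parity update rule (new parity = old parity XOR old data XOR new data). This is a generic illustration, not the 4.7 or 5.0 driver code; the `update_parity` flag exists only to mimic a write whose parity update is missed.

```python
def xor_bytes(a, b):
    """XOR two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def write_block(disks, parity, disk_index, new_data, update_parity=True):
    """Write new_data to one disk and return the new (disks, parity).

    With update_parity=False the data still lands on the disk, but the
    parity is left stale, which is the situation described above.
    """
    old_data = disks[disk_index]
    new_disks = list(disks)
    new_disks[disk_index] = new_data
    if update_parity:
        parity = xor_bytes(xor_bytes(parity, old_data), new_data)
    return new_disks, parity


def parity_matches(disks, parity):
    """True if the stored parity equals the XOR of all data disks."""
    computed = bytes(len(parity))
    for disk in disks:
        computed = xor_bytes(computed, disk)
    return computed == parity
```

In the stale case the written data itself is intact and only the parity is wrong, so a correcting parity check repairs it by recomputing parity from the data disks.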
August 27, 2013

Ahh ... so the issue isn't an issue with the rebuild ... it's an issue whereby parity isn't correctly maintained for new data written to the array DURING the rebuild. That explains how a sync error can occur ... and since it's simply a lack of maintained parity, there's no data integrity issue -- simply a parity issue.

I've never encountered that -- but on the other hand I NEVER use my server during a rebuild (or, for that matter, during a parity check).