Problems with array. Need advice.

October 16, 2008

I'm pretty sure I have a flaky motherboard. It's been giving me some minor problems with my lone IDE drive. A while back, after a reboot, unRaid would sometimes report "Replacement disk too small" and the array would be stopped. I found that after a reboot or two, it would report it as the correct size and the array would be started and fine.

The last couple of days I've been playing around a lot with the UPS script and subsequently rebooting a lot. I noticed one time after a reboot that unRaid reported this same drive as "not installed". In light of the problems I'd been having I didn't think much of it, and since I was only playing with some scripts and not actually using the array, I ignored it. Tonight I wanted to put the server back in action, and when I checked the web GUI I saw the same "not installed" message, with the array started. I tried the usual fix of rebooting and it now reports "Disabled disk replaced", the array is stopped, and the drive has a blue dot next to it. "Start" says it will perform a data rebuild.

I'm confused because the drive, as far as I know, still has all its original data, and, as far as I know, a parity sync was never performed while the drive showed as not installed. I assumed the reason the array was started even though the drive was not installed was that it was getting the missing data from parity. So why isn't the array showing as fine now that the drive is showing again? Am I correct in thinking that unRaid thinks the parity no longer matches the data on the drive? I would assume that if the data on the drive is correct the thing to do would be to hit restore and then let it do a parity sync. But why would the parity show as different from the data on the drive, and is it the parity that changed or the data on the drive?

I plan to order a new motherboard tonight.

October 16, 2008

So why isn't the array showing as fine now that the drive is showing again? Am I correct in thinking that unRaid thinks the parity no longer matches the data on the drive?

Since the machine and array came up saying the drive was disabled, the array is in a failure recovery mode.

The superblock has been updated as such.

After the drive came back online, it is now "out of date" with the superblock.

Even though you did not use it, a mount of the virtual drive would put the data slightly out of sync with the physical drive.

Plus the superblock has already marked it as disabled at one point so it invalidates it.

Chances are the data might be ok and valid, but internal tables on the virtual drive are slightly out of date with the physical drive.

If I'm understanding this correctly, one of two things has to happen: either parity is brought up to date with all of the drives in a functioning state, or the previously disabled drive itself is rebuilt from the calculated parity.

Food for thought: if the drive has to be rebuilt and is smaller than the rest of the array, rebuilding it may be the quicker option.

October 16, 2008

Author

So you're saying either restoring (parity sync) or starting (data rebuild) should ultimately have the same end result?

So why did the superblock update now, but not when I was having prior problems of the drive showing as too small? In that case, when the drive would ultimately appear correctly, the superblock must not have updated. The only difference I can think of is that the array never started when the drive was too small, but it did when unRaid thought it wasn't installed. Is that what made the superblock update and require the data rebuild?

October 16, 2008

Save yourself a lot of time and use Tom's procedure, found in this HowTo: Make unRAID Trust the Parity Drive, Avoid Rebuilding Parity Unnecessarily.

I used to have the same problem you had, a number of times because of my very flaky nForce4 board, and yes, either rebuilding the drive or rebuilding parity are equivalent in result. If the data drive is smaller than the parity drive (as mine was), then rebuilding the data drive would be quicker. But both take a lot of time, and are just writing to the rebuilt drive what is *already* there.

That is what makes Tom's procedure so much better. Just perform it, and abort the parity check (unless you *want* to let it finish).

October 16, 2008

Author

Thanks for that tip. I'll let it go ahead and do the parity check, but that's nice to know I don't necessarily have to. I still wonder though... was it the array starting that caused the superblock to change where before it didn't?

October 16, 2008

was it the array starting that caused the superblock to change where before it didn't?

I don't know enough detail of the other situation to speculate.

In the case here, the array HAS to account for every drive, to know that parity is valid. If one is dropped for any reason, then it can't restore it to the array without either rebuilding parity or accepting parity as good and rebuilding the drive back to the exact state it was. It cannot make any assumptions of what might be different about that drive. Parity is only valid when unRAID *knows* the state of every bit of every drive included in the array. What Tom's procedure does is allow us to tell the system that parity is in fact valid, and so it does not need to rebuild anything. It still wants to verify that for itself, with a parity check.
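To make the "rebuild parity or rebuild the drive" choice concrete, here is a minimal shell sketch. It assumes plain XOR parity across one byte from each data drive; the byte values and the `xor_all` helper are made up for illustration and are not unRAID's actual code.

```sh
# Toy model: one byte from each of three data drives (values are made up).
# xor_all prints the XOR of all its arguments.
xor_all() (
  acc=0
  for v in "$@"; do
    acc=$(( $acc ^ $v ))
  done
  echo "$acc"
)

d1=165
d2=60
d3=15

# Parity as it stood while the array was healthy.
parity=$(xor_all "$d1" "$d2" "$d3")

# Option 1, data rebuild: trust parity, recompute the dropped drive (d2).
rebuilt_d2=$(xor_all "$parity" "$d1" "$d3")

# Option 2, parity sync: trust every data drive, recompute parity.
resynced_parity=$(xor_all "$d1" "$d2" "$d3")

echo "parity=$parity rebuilt_d2=$rebuilt_d2 resynced_parity=$resynced_parity"
```

If the dropped drive really is unchanged, both routes land in the same state: the rebuilt byte equals the original and the resynced parity equals the old parity. The two only diverge when the drive or the parity changed while the drive was out of the array, which is exactly the case the system cannot rule out on its own.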

October 17, 2008

Author

OK, I let the parity check run overnight and it found 132 (I think) errors. When I got home this evening the array showed fine. I just checked it again and all of a sudden the same drive now has a flashing red dot and the array is still started. Here is a screenshot:

As you can see I have enough free space on the other drives to just remove the problem drive, but I'm not sure if I should trust the parity to let it rebuild the data.

Here is the syslog:

http://pastebin.com/m145c813e

Once again, any advice is appreciated.

October 17, 2008

The drive does not appear to have spun back up, and is now reporting Drive not ready. I can't say what exactly the problem is, but definitely this motherboard and this drive are not getting along, and each time the drive fails or is not recognized at boot, the array is hurt. Even if you regained communications with the drive, DMA was disabled for it, so it would be VERY slow. It all happened at 'Oct 16 19:32:19', if that helps in any way.

You might try changing the cable, and/or changing to a different IDE connector. Get a SMART report for it, and check for very recent errors listed. You may have to give up on this drive, on this motherboard.

October 17, 2008

Author

I'd like to just remove it from the array and rebuild its data onto the available space on the other drives, but how can I know to trust the parity?

October 17, 2008

I'd like to just remove it from the array and rebuild its data onto the available space on the other drives, but how can I know to trust the parity?

The steps to do that would be:

1. Copy the data from the failed drive to the others. (It will be reconstructed from parity and the other data drives, so this process will spin up all the drives, and it will be slower than copying from one single drive to another.)

2. Verify the files you copied by playing them, looking at them, etc. in their new locations. Sorry to say, you cannot do too much more to verify that parity rebuilt them correctly. (You can look in the syslog for errors during the copy. If there are lots of errors, beware: do not go further in this series of steps, and post a new copy of the syslog.)

3. Once you have all the data copied from the failed drive with NO errors, stop the array, go to the devices page, and un-assign the drive you will be removing.

4. Go back to the main page, check the checkbox under the button labeled "restore" and press it. It DOES NOT restore data, but instead renames the super.dat file so unRAID does not find it. This will immediately invalidate parity and cause unRAID to completely rebuild the super.dat file based on the currently assigned and working drives, and then to completely calculate parity based on the new configuration. It will do that full parity calc when you press the "Start" button to start the array.

Joe L.

October 17, 2008

1. Verify the files you copied by playing them, looking at them, etc in their new locations. Sorry to say, you can not do too much more to verify parity rebuilt them... (you can look in the syslog for errors during the copy, if lots of errors, beware... do not go further in this series of steps, post a new copy of the syslog.)

If you happen to get the old drive online somehow after copying the data from the "virtualized version", you can use the cmp command to do a binary comparison of the files.

I did this one time and found that all the files compared equal.
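A minimal sketch of that comparison, assuming the old drive has been mounted somewhere and the copies live under another directory. The `compare_trees` function and any paths passed to it are hypothetical; only `cmp -s` (silent, exit-status-only comparison) and `find` are standard tools.

```sh
# compare_trees ORIGINAL_DIR COPY_DIR
# Walks every regular file under ORIGINAL_DIR and compares it byte for byte
# with the file at the same relative path under COPY_DIR.
# Prints one line per missing or differing file; prints nothing if all match.
# COPY_DIR must be an absolute path, because the function changes directory.
compare_trees() (
  orig=$1
  copy=$2
  cd "$orig" || exit 2
  find . -type f | while IFS= read -r rel; do
    if [ ! -f "$copy/$rel" ]; then
      echo "MISSING: $rel"
    elif ! cmp -s "$rel" "$copy/$rel"; then
      echo "DIFFERS: $rel"
    fi
  done
)
```

It would be called with the mount point of the old drive as the first argument and the location of the copied files as the second. Empty output means every file on the old drive has an identical copy.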

October 18, 2008

Author

I'm not sure if it was the source of the problem or not, but I checked cable connections and everything and happened to notice that the jumper on the drive was set to slave instead of master. It is the only IDE drive in the system. I changed the jumper and rebooted and the drive appeared correctly, but with a blue dot, which I took as normal under the circumstances. I rebooted four more times and it showed the same each time, which is a pattern of consistency that has not been the norm lately. I checked the contents of the drive and it appears to be correct (although I can't remember every video that was on the drive), but I didn't take the time to try to play every video to check for corruption. Being that it's only TV shows and not really important data, and the data on the drive seems to be correct, I went ahead and hit restore and it's now rebuilding parity.

Could the drive being set to slave when there was no master present have been the cause of these problems? If it was, it seems weird to me that the drive would sometimes boot up fine.

October 18, 2008

I changed the jumper and rebooted and the drive appeared correctly, but with a blue dot which I took as normal under the circumstance. I rebooted four more times and it showed the same each time which is a pattern of consistency that has not been the norm lately. I checked the contents of the drive and it appears to be correct (although I can't remember every video that was on the drive), but I didn't take the time to try to play every video to check corruption.

When the array is in a degraded state the contents of the missing/defective drive are supplied via the parity drive and all the other drives.

When you checked the contents, you were not checking the physical drive (unless you yourself mounted it as part of the unprotected disks).

For that reason, the replacement drive could have shown blue AND BE ENTIRELY EMPTY, and you might think the files on it were OK, but they might be gone.... Pressing restore in that situation would result in the complete loss of everything you had on the drive that failed.

I've said this many many times... Unless you are removing a drive from the array, and all the other drives remaining are working and have no known errors, the "restore" button is almost always the WRONG button to press.

Being that it's only TV shows and not really important data, and the data on the drive seems to be correct, I went ahead and hit restore and it's now rebuilding parity.

If the "blue" drive was reformatted, odds are all your files on it will be gone. If it was not, odds are good they will be there for you.

You apparently like to gamble. You will find out how you did soon enough.

Could the drive being set to slave when there was no master present have been the cause of these problems? If it was, it seems weird to me that the drive would sometimes boot up fine.

You might still have issues, but only time will tell.

October 18, 2008

Author

I know about parity making it appear as though the data is still on the drive when it may not be. I checked the contents of the drive by Telneting to /mnt/disk8/ thinking this would show me the actual contents of the actual disk and not the "virtual data". Was I wrong? I don't like to gamble, I just thought I was doing the right thing.

And no, the drive was never formatted.

October 19, 2008

If a drive is being simulated, its mount point (/mnt/disk?) is being simulated. UnRaid has disassociated the physical drive.

If the physical drive is in the machine and functional, you would have to manually mount it (or use unmenu to mount it) to be able to access its files.

October 19, 2008

I know about parity making it appear as though the data is still on the drive when it may not be. I checked the contents of the drive by Telneting to /mnt/disk8/ thinking this would show me the actual contents of the actual disk and not the "virtual data". Was I wrong?

Yes, you were wrong.

I don't like to gamble, I just thought I was doing the right thing.

We understand. Most of us have unRAID servers because we value the data on them highly enough not to trust an unprotected disk.

And no, the drive was never formatted.

Then you will probably be fine. I would perform a reiserfsck on the drive anyway to verify it has no file-system corruption.

Oh yes, unless explicitly instructed by an experienced user of this forum, or Tom at lime-technology, please take the phrase "Press the Restore Button" out of your vocabulary. Lucky for you the file-system on the misbehaving drive was recognized, so it was not cleared and reformatted. Otherwise, pressing "restore" and then pressing "Start" to bring the shared drives online would have basically asked your server to throw away any contents the failed drive ever had, and throw away any parity data that might have been used to reconstruct it.

Joe L.

October 21, 2008

Author

All my data seems to be present now, but it's hard to tell as I never had a list of all the files on it. Anyway, I ran a file system check from within unmenu and here is what it returned:

Will read-only check consistency of the filesystem on /dev/md8
Will put log info to 'stdout'
###########
reiserfsck --check started at Mon Oct 20 22:12:20 2008
###########
Replaying journal..
Reiserfs journal '/dev/md8' in blocks [18..8211]: 0 transactions replayed
Checking internal tree..finished
Comparing bitmaps..Bad nodes were found, Semantic pass skipped
1 found corruptions can be fixed only when running with --rebuild-tree
###########
reiserfsck finished at Mon Oct 20 22:16:34 2008
###########
block 122096625: The level of the node (0) is not correct, (1) expected
the problem in the internal node occured (122096625), whole subtree is skipped
vpf-10640: The on-disk and the correct bitmaps differs.

/dev/md8 mounted on /mnt/disk8

Samba Started

What does this mean? I looked for 'stdout' where it said it put the log info, but I was unable to find it.

October 21, 2008

maybe a little Google searching before you post ??

http://www.google.com/search?hl=en&q=stdout

October 21, 2008

"stdout" is the standard output of the command, not a specific file. You have the output, it was sent to the browser... and unmenu also sent a copy of it to the syslog.

It says there is corruption, and it says to re-run reiserfsck with the --rebuild-tree option.

Fixing this requires performing a series of commands, where you:

stop samba

un-mount the drive

run reiserfsck with the rebuild-tree option on the drive

re-mount the drive

re-start samba

This is all described in the wiki here: http://lime-technology.com/wiki/index.php?title=Check_Disk_Filesystems

your commands would then be

samba stop

umount /dev/md8

reiserfsck --rebuild-tree -q -y /dev/md8

When it finishes, you would then type

mount /dev/md8 /mnt/disk8

/usr/sbin/smbd -D

/usr/sbin/nmbd -D

(The wiki is slightly out of date: if you follow it and use "samba start", Samba will start but you will not see any shares on the LAN. Use the commands I show here instead.)

Only run --rebuild-tree once on a drive...

Edit: fixed typo...

October 21, 2008

run reiserfsck with the fix-fixable option on the driver

Minor typo here. Should say rebuild-tree, not fix-fixable.

October 21, 2008

Author

jimwhite - Why would I have looked up stdout on Google when I thought I knew what it was? Granted, I was wrong and it's not a file, but people usually only look up things they don't think they know. Also, knowing what stdout really was wouldn't have prevented my post. I still would've wanted to know what the results of the check meant. Nonetheless, thanks for the link as it was educational.

Joe L. - What I was wanting to know by "What does it mean" was, can you tell from the log what happened? Does it indicate a physical problem with the hard drive, or could a flaky controller have just written some bad data? When it says that one corruption was found, does that mean that one of my files is corrupted or is it one block or sector of the hard drive? I'm just trying to figure out what some of this stuff means so maybe it won't always be Greek to me.

I read what the screen said about running --rebuild-tree and I looked in the wiki on the reiserfsck command, but I wanted to wait until I found out what the results meant before I ran it. Plus, I have a new mobo coming in a couple days and I may just leave it down until then. I'd hate to end up with more corruption/problems before then.

Thanks for the heads-up on restarting Samba. That would have caused some confusion. Also, do you mean you should only run --rebuild-tree on a drive once ever, or only once in close succession?

October 21, 2008

There is no way I know to tell you more of exactly what is corrupted. From what it says, one portion of your directory tree is not accessible because of the corruption. Now, it might be a portion you deleted, and the corruption occurred during the delete, or it might be something else. No matter what, it needs to be fixed.

As far as re-running "rebuild-tree" goes, don't do a second rebuild right after the first. Run a basic reiserfsck check first and confirm it reports that no more repair is needed. If that check passes and, months or years from now, a later check says to run rebuild-tree again, that is OK. (As you said, don't do two in a row on the same drive. The second will not fix anything additional and may cause more corruption if the first was not successful.)
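A small sketch of that rule of thumb: look at the text of a fresh check before deciding whether a rebuild is called for. The `wants_rebuild_tree` function name is made up, and the match string is taken from the check output posted earlier in this thread; other reiserfsck versions may word the message differently.

```sh
# Reads saved reiserfsck --check output on stdin.
# Exit status 0 means the output itself asks for --rebuild-tree.
wants_rebuild_tree() {
  grep -q -e 'running with --rebuild-tree'
}

# Two lines copied from the check output posted above.
sample='Comparing bitmaps..Bad nodes were found, Semantic pass skipped
1 found corruptions can be fixed only when running with --rebuild-tree'

if printf '%s\n' "$sample" | wants_rebuild_tree; then
  echo "check output asks for --rebuild-tree"
else
  echo "no rebuild requested by this check output"
fi
```

The point of the design is that the decision is driven only by what the most recent check reported, never by habit: after one rebuild, a new check is run, and a further rebuild is considered only if that new output asks for it again.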

I learned about the out-of-date wiki advice for re-starting samba when I wrote the code for unmenu.awk. At first I did it exactly as it said, but I saw no shares on the LAN when samba was re-started. (Had me pulling out hairs for a bit, and I have very few to spare.)

Joe L.

October 21, 2008

Author

Thanks for your help.

October 22, 2008

Does it indicate a physical problem with the hard drive, or could a flaky controller have just written some bad data? When it says that one corruption was found, does that mean that one of my files is corrupted or is it one block or sector of the hard drive? I'm just trying to figure out what some of this stuff means so maybe it won't always be Greek to me.

In this case, you are testing with a higher level tool, reiserfsck, and it can only look for and fix problems with the Reiser file system. There is no direct indication here of a physical drive problem.

The most common cause of file system corruption is either a power outage while writing to the disk, or a bad power spike. There are other causes, but they are more rare (a bad controller is one that's very rare). One other possibility would be a previous crash of the ReiserFS module.

Your questions are understandable, it's an important way to learn, but unfortunately, just as in so many areas in life, there are many questions that don't have easy and exact answers. The cost of providing an absolutely accurate answer could involve paying thousands to a hard disk lab to open up a drive and analyze all of its components and platter surfaces, as well as many thousands more to fully characterize all of the interactions of the various controllers, cables, motherboard, BIOS, kernel, and drivers. All we can do as unpaid helpers with a little experience, is extrapolate/interpolate? the probabilities of the many possible issues, from our own experiences and research. We are however working somewhat blindly, from a distance, with very limited data, so any advice given has a significant chance of error. It can often be classified as no better than conjecture and speculation. But we do try to be objectively helpful, in spite of relatively little information. Most error messages or indicators (and user descriptions) are extremely incomplete as to what the exact situation is.

For example, what is a parity error? The reporting of a parity error tells us so little of the complete situation. Supposedly, it is a parity calculation that resulted in the wrong answer, indicating that one or more bits are wrong. But which bit? Or which bits? And why? Or is it possible that all the bits ARE correct, but we were not provided with the correct bits to compare? How do we know that each drive was successfully read, and its bits were accurately returned and used in the calculation? If a drive (which drive or drives?) is faulty, the read may not be performed, and/or the bits returned may not be the correct ones, so the calculation will be wrong, for the WRONG reasons. In some cases, unRAID has not seemed to have been made aware of the lower level disabling or other faulty condition of a drive.
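A toy illustration of why a parity error alone cannot say which bits are wrong. It assumes plain XOR parity over one byte per drive, with made-up values and a made-up `xor_all` helper: flipping the same bit on any one of the drives produces exactly the same mismatch.

```sh
# xor_all prints the XOR of all its arguments.
xor_all() (
  acc=0
  for v in "$@"; do
    acc=$(( $acc ^ $v ))
  done
  echo "$acc"
)

d1=165
d2=60
d3=15
parity=$(xor_all "$d1" "$d2" "$d3")

# Flip the lowest bit on each drive in turn and recompute what a check sees.
seen1=$(xor_all $(( d1 ^ 1 )) "$d2" "$d3")
seen2=$(xor_all "$d1" $(( d2 ^ 1 )) "$d3")
seen3=$(xor_all "$d1" "$d2" $(( d3 ^ 1 )))

# The difference between stored parity and computed parity in each case.
echo "mismatch if d1 is bad: $(( parity ^ seen1 ))"
echo "mismatch if d2 is bad: $(( parity ^ seen2 ))"
echo "mismatch if d3 is bad: $(( parity ^ seen3 ))"
```

All three cases report the same mismatch, and a flipped bit in the stored parity byte itself would look identical too. The check can say that something disagrees at this position, but not which device supplied the wrong bit.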

I often say nothing, because I don't feel confident enough in any ideas I may have, either in their objectivity or because of a sense of 'insufficient data'. And looking back later, on answers I have given, I wish I had held back even more, as I can see a lot of speculation in answers/help/ideas I have posted. In general, the best we can do is provide experience-based advice as to what actions to take, based on what has worked for us or others. But when you want answers that involve fully understanding what's wrong, then there is often very little that we can say that is not speculative, and most of us are understandably reluctant to say much more then. We know we are much more likely to be wrong. [apologies for length, and wandering off-topic]

October 22, 2008

Wow. A disclaimer of epic proportion!

While it is true that a perfect understanding of root cause is unlikely in some cases, your forensic reviews of the syslogs more often than not provide valuable insights at least in the path to good health. It is exceedingly rare that advice given here makes a problem worse. In fact I think we tend to be very hesitant to give advice that even has the possibility of doing harm unless it is heavily caveated.

Please keep doing what you do!
