Upgrade to a larger drive does not work as described


I just got a replacement for one of my 250GB drives that had the DMA error problem.  When I pulled the bad drive out I stuck in (just for shits and giggles) an old 30GB drive in its place and put very little data on it, data that I did not care about.  I did this mainly to test the upgrade process one more time, a process which I have found to be very inconsistent.  Well, again it's doing weird stuff.  Following the instructions on the Lime website I shut down the array, powered down the box, put in the new drive and powered up.  As expected the system noticed the new drive and labeled it as "wrong", and on the administration page it asked me if I wanted to recover the data from the old (now removed) drive and expand the array. I checked yes and sat back and watched.  It did a mount and then a re-mount of each drive, and then the system came up as started, but next to the word "started" there is a yellow dot, not the normal green one.  If I look at the new drive it's marked as "Disabled" and has a red dot, and the system has not done any data rebuild.  None of this is as described on the Lime support webpage. What do I do from here....? Do I go to tools and re-set the array...? Will that trash my data....? See, this is why I find this system to be very difficult to administer for the end user.  Any help would be great....!

Link to comment

The behavior you see would occur if there was an i/o error during the rebuild onto the new drive. Did you notice if the Errors counter was incremented for the now-Disabled drive?

 

To try the rebuild again, power-off, remove the drive, power up & start - drive should still be Disabled, but now missing.

Now power down again, check drive cabling/mounting/etc, power up & start - rebuild should start on drive.

 

I agree, we need to make major improvements to error case isolation and usability.

Link to comment

Oops, right those errors....!  OK just did as you suggested and the system came back up exactly the same way, no re-build, the new disk disabled etc.  And ZERO errors...!  I really had no data to speak of on the old disk so I'm OK with a reset but the root cause of this weird action is what I'm most interested in.

Link to comment

I've seen this every single time I've tried to enlarge a disk.  Tom, I had a mail thread with you a while ago where you suggested that it might be related to using a reiserfs partition that was created under a different OS.  I can quote the email here to refresh your memory if you like.

 

Link to comment

Well I manually tried all 3 cases of data disk "upgrade":

 

1. No disks disabled - power down; replace small disk with large one; power up; Start.

2. Disk disabled, but still present - power down; replace old disk with new disk; power up; Start

3. Disk disabled, disabled disk removed - power down; install new disk; power up; Start

 

All worked as expected  ???

 

Rich, can you capture syslog and email to me?

 

 

Link to comment

OK so I got a little impatient waiting to hear what the log may have indicated so I reset the array and that's when something interesting came up.   I was OK with the reset because the removed drive had only test data on it, nothing I really needed.  So I stopped the array, reset it and as expected ALL drives came up looking as if they were "new" BUT the real new drive (the one that was larger than the one it replaced) also shows as "unformatted".  I clicked the format button below, that worked, launched a new parity check and that too worked with zero errors. Array is now back to normal with the new larger drive in place and working.

 

So to me it looks like the routine to upgrade to a larger drive does not check whether the new drive is formatted and, if it is not, offer a format before expanding the array and rebuilding the old data onto the new drive.  If I had a spare slot I could actually prove this by "adding" the new drive into an empty slot first, which would get it formatted, then removing it, resetting the array because of the missing drive, building a new parity, and then trying to upgrade the smaller disk with the new larger one that is now Linux formatted.  I bet it would have worked, based on what happened when I did my reset.  So if this is true, anyone with a full system can't upgrade a disk or replace a failed disk with a new bare drive unless they can find a way to get it Linux formatted first.  If all this is true, add this rather serious issue to the issues list.......!

 

Tom, thoughts......?

 

Link to comment

Your syslog was completely filled with a bunch of spurious Samba messages, so I wasn't able to glean anything from it.  (Samba is the Linux application which implements Windows network shares.)

 

When you replace a disabled disk, or change out an enabled disk for a larger one, what happens is that the parity-rebuild operation will write the contents of the original disk to the new one block-by-block. This results in the new disk taking on the "format" of the original disk.  It doesn't matter how the new disk is previously formatted - it could be brand new (all zeros), could have NTFS, could have all random data - whatever was there gets overwritten with whatever was on the original disk.  Meanwhile, any normal writes for that disk result in the block being written on the disk as well as updating the parity disk; and any normal reads from that disk result NOT in reading the disk being reconstructed, but instead in reading all the other disks and returning the rebuilt data on-the-fly.
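
To make the block-by-block rebuild concrete, here is a minimal Python sketch. It assumes a single parity disk computed as the XOR of all data disks, which is my assumption and not something stated above, and the function names (xor_blocks, rebuild_disk) are made up for illustration. It is not the actual unRAID driver code.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR a list of equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

def rebuild_disk(surviving_disks, parity_disk):
    """Reconstruct every block of the missing disk from the other disks plus parity."""
    rebuilt = []
    for index, parity_block in enumerate(parity_disk):
        others = [disk[index] for disk in surviving_disks]
        rebuilt.append(xor_blocks(others + [parity_block]))
    return rebuilt

# Toy array: three data disks, each holding two 4-byte blocks.
disk1 = [b"\x01\x02\x03\x04", b"\x10\x20\x30\x40"]
disk2 = [b"\xaa\xbb\xcc\xdd", b"\x00\x00\x00\x00"]
disk3 = [b"\x0f\x0e\x0d\x0c", b"\xff\x00\xff\x00"]

# Parity (assumed XOR) is computed across the data disks, block by block.
parity = [xor_blocks([a, b, c]) for a, b, c in zip(disk1, disk2, disk3)]

# disk2 is pulled; the replacement's previous contents never enter the
# calculation, so it does not matter whether it was blank, NTFS, or random.
new_disk2 = rebuild_disk([disk1, disk3], parity)
```

In this toy model new_disk2 ends up identical to the original disk2, file system structures included, which is why no pre-format is needed.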

 

I know this is a bit confusing & I hope to get around to writing a tutorial on the subject, but realize that any new disk does not need to be "preformatted".

 

The fact that after an array reset, your new disk came up "unformatted" tells me that parity-sync never got started before, and there must have been some kind of error almost immediately.

 

Link to comment

Tom,

 

Great explanation (I think) but I think the bigger issue here is when a "real" user with "real" data has this problem we then have data at risk and have no clue from the webpage what to do.  Thankfully I was just testing so no big deal BUT.....!

 

Here is what I plan to do later today & tonight.....!

 

1. Shut down the array and remove that new (now Linux formatted) larger drive

2. Bring up the array and re-set because that drive is missing, no parity check however

3. Shut it down again and re-install the small drive with the test data

4. Power it up and run a new parity check over-night

5. Now I'm back to where I was

6. Power down, remove the small drive and insert the larger drive

7. Power Up

 

Any bets what is going to happen....? Remember that new larger drive is now Linux formatted, empty yes but pre-formatted.

Link to comment

OK so I followed the steps I listed above and the new larger disk still came up as invalid.  I sent Tom the syslog to see if there is a clue in there.  So my theory that the problem was a raw, unformatted disk turned out to be wrong.  Keep in mind that the drive has NO errors or issues: when I reset the array after the second failed upgrade attempt, the drive was formatted and made part of the array with zero issues.

Link to comment

rharvey,

 

I experienced the same problem that you are having when I built my first unRaid array.

 

The new drive was marked as invalid and only resetting the array fixed the problem.

 

I notified Tom early on about this event and he said that he would look into it.

 

I did have one instance where a new 300GB data drive was slightly bigger than the 300GB parity drive and the system instructed me to swap the drives and then proceeded to rebuild the parity on the new parity drive. That seemed to work like described in the documentation.

 

Regards,

TCIII

Link to comment

Tom,

 

I'm going to take a wild-guess at something here... based on my own programming experience and how the unRaid array has occasionally acted for some folks.

 

The issue with replacing a small drive with a larger one, and where it works for you Tom, but not for Richard, almost sounds like it could be related to a class of coding error where memory is allocated, but not properly initialized before use.  In some systems, depending on the specific ram, memory used by programs might start out with zeros.  In others, the contents could be random and result in the error Richard is seeing. It might explain why the error does not show up in your test machine.    Back when I was coding in "C" on AT&T based UNIX there was a command "lint" that would check and report on this class of error.  In other words, it would warn developers of variables referenced before their contents are initialized and/or set.  I remember having to do "memset" on allocated memory prior to using it to ensure its contents were in a known state.  In other words, perhaps no error actually occurred, and nothing explicitly set the disk "invalid" flag, but it was implicitly set because it originally had a non-zero value when allocated and nothing initialized it to zero in your code.

 

I think I may have seen evidence of this myself the last time I re-booted my unRaid server when I added a UPS, but in a slightly different fashion. After adding the UPS I re-started my unRaid array and it started my small "check_unraid.sh" script to monitor its status. (The shell script was posted to the AVS thread months ago.)  The script then reported that my unRaid array had experienced a failure.  Looking at the management page showed nothing odd.  Typing "cat /proc/mdcmd" showed the normal status values, but also showed segments of the same check_unraid.sh shell script interspersed with the status values... including the "grep command" line from my shell script within it scanning for abnormal disk status.  Because it was looking for specific status values and found them (it found itself), it reported an error, even though one had not occurred.  It almost seemed that memory used by the virtual file "mdcmd" under the /proc directory had not been initialized when it had originally been allocated.
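
Here is a rough Python sketch of how a monitoring script could avoid that kind of false alarm. The line format ("diskState.1=DISK_OK") and the state names are invented for illustration, since I am not quoting the real /proc/mdcmd output here. The idea is only to contrast a naive substring search with a parser that accepts nothing but well-formed status lines.

```python
import re

# Hypothetical status line format: key=value, e.g. "diskState.1=DISK_OK".
STATUS_LINE = re.compile(r"^(?P<key>[A-Za-z]+\.\d+)=(?P<value>[A-Z_]+)$")

# Hypothetical names for abnormal states; the real ones may differ.
BAD_STATES = {"DISK_INVALID", "DISK_DSBL"}

def naive_has_failure(text):
    """Substring search: fooled by any stray text that mentions a bad state."""
    return any(state in text for state in BAD_STATES)

def strict_failed_disks(text):
    """Only trust lines that match the expected key=value shape exactly."""
    failed = []
    for line in text.splitlines():
        match = STATUS_LINE.match(line.strip())
        if match and match.group("value") in BAD_STATES:
            failed.append(match.group("key"))
    return failed

# Status output polluted with a fragment of the monitoring script itself.
polluted = "\n".join([
    "diskState.1=DISK_OK",
    'grep "DISK_INVALID" /proc/mdcmd',
    "diskState.2=DISK_OK",
])
```

With the polluted sample, naive_has_failure reports a failure while strict_failed_disks returns an empty list. A stricter parser only hides the symptom in the script; stray data in the status output would still need fixing at the source.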

 

Now, I can't really know if this is what is happening, but it sure would explain why some people have this issue and others do not.  If your management utility opens /proc/mdcmd to read the array status it might be fooled into thinking the array is invalid if it does not find the values it expected.  It might also explain why it works when the array is reset. 

 

Again Tom, if I'm off base and you already checked for this class of bug, ignore this post and keep looking.  I know I have accidentally made this class of coding error (at least once or twice in my programming career ;)) and have occasionally found errors like it in my code reviews of developers' work on my project teams over the years.  I figure if nothing else, it might give you an idea or two if nothing shows up in the syslog Richard sends to you.

 

Joe L.

 

Link to comment

UPDATE - Ran yet another test for Tom and it turns out that the system was working as designed all along  ::)

 

The issue for me (and maybe the others) is that the management screen is not at all intuitive and when a disk is marked as "Invalid" I thought it was actually .....well invalid.  In reality what was taking place is the parity check that was underway when completed would change the new disk status from "Invalid" to "OK", which I have now found it does.  As Tom puts it this is one of those issues where the mind of a software developer and the mind of an end user don't see things exactly the same way.

 

Tom what I would suggest is that during that parity check when a disk upgrade has been done mark the new disk with the word "Upgrading" while the parity check is going.  But most importantly you really need to update your website instructions to explain this far more clearly as I'm sure many others would have done just as I did and stop the parity check thinking that for some reason the new disk was actually "Invalid" or seen as bad by the system.

 

So the end of the 3rd test is that the parity check was allowed to finish and indeed, as Tom tells me, the new disk does get marked as "OK" and all is normal again.

Link to comment

This is good news... the fix is far easier than some obscure programming bug.

 

I might offer that using the words "Rebuilding Data on Disk" or "Restoring Data on Disk" might be more correct than "upgrading" since under many conditions you will be replacing a failed drive with one of the same size.
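
As a sketch of what that could look like in the status display, assuming the management page knows both whether the disk is currently valid and whether a rebuild is running (the function and parameter names here are made up, not taken from unRAID):

```python
def display_status(disk_valid, rebuild_in_progress):
    """Pick the label shown next to a disk on the management page."""
    if disk_valid:
        return "OK"
    if rebuild_in_progress:
        # The disk is not trustworthy yet, but nothing is wrong with it.
        return "Rebuilding Data on Disk"
    return "Invalid"

# A freshly swapped-in disk while the rebuild is still running:
label = display_status(disk_valid=False, rebuild_in_progress=True)
```

That way "Invalid" would only be shown when the disk is not valid and nothing is being done about it.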

 

Joe L.

Link to comment
