
Consistent upgrade failures on the newest OS



I have done a bunch of testing over the last week and have come to the conclusion that the newest version of UnRaid (324) is not capable of a normal recovery of a failed disk or replacement disk, period.  Now it may have something to do with my overall configuration, because there was one other time (fix18) when a 12-drive configuration would cause a failure.

 

In my latest test just yesterday I had a 250GB drive with 2 full movie rips on it that I wanted to upgrade to a new 300GB drive (raw, not formatted).  Following the correct process to upgrade a drive, all is fine until I check the box to allow UnRaid to recover the data.  The system then just sits there and does nothing, telling me that it's in data recovery mode. The drive lights are not flashing and the new drive is listed as disabled.  I emailed Tom about this several weeks ago but never got a response other than that he was going to try to duplicate the issue.  Well, I can duplicate it with ease, each and every time.

 

One interesting thing is that even though the old drive and its contents are no longer in the system, if I look from a Windows machine the replaced drive is still accessible AND I can copy the files at will. However, it copies MUCH MUCH slower because it's rebuilding the contents using parity data, not the real data. You can see this because as you're doing the copy ALL 12 drives are being accessed.
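
For anyone wondering why every drive is accessed during that copy, here is a minimal sketch, assuming the array keeps a single XOR parity block across the data disks (the thread does not spell out the parity scheme). The 4-byte blocks and the names disk1, disk2 and disk3 are made up for illustration.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Hypothetical 4-byte blocks standing in for the same offset on three data disks.
disk1 = b"\x10\x20\x30\x40"
disk2 = b"\x0f\x0e\x0d\x0c"
disk3 = b"\xaa\xbb\xcc\xdd"

# Parity is the XOR of every data disk at that offset.
parity = xor_blocks([disk1, disk2, disk3])

# disk2 is gone: its block is recomputed from parity plus EVERY surviving disk,
# which is why a single read touches all the other drives.
rebuilt_disk2 = xor_blocks([parity, disk1, disk3])

print(rebuilt_disk2 == disk2)
```

Each read of the missing disk needs one read from every other drive plus the XOR, which would be consistent with the much slower copy described above.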

 

I'm sharing this with you all because I feel that my data is not totally safe in this situation and clearly the UnRaid is not working as it should.  I have not tested with an older version of the OS but I may give that a shot in the next day or two.

Link to comment

Any updates?

 

I'd really, really like to hear some outcome on this. I'm starting to worry that I've moved my data into a 12-drive ticking time-bomb  :-\  >:(

 

If UnRaid can't expand, and can't recover, this is very disturbing. If I had the gumption to take the risk, and a spare drive, I'd test it myself. As it stands I'm afraid to touch the array, given this report.

Link to comment

What version of the OS are you running?  I'm running 324, which is the latest beta that Tom provided to a few of us.  The reason I ask is that I'm not sure if this is a 324 issue, a 12-disk issue, or a general UnRaid issue, and as you know Tom has again gone unresponsive.

 

What you need to remember is that in all but one of my failed upgrades I was able to recover the data.  The one where I lost data was my fault: I did not understand that although the data rebuild failed, the parity was still available.  This situation is FAR from acceptable and not at all what should be going on, but you actually can copy the data from the failed drive (via the parity) over to another good disk.

Link to comment

I recently ordered an UnRaid dongle.  I got really busy and haven't been able to set it up yet.  I have my server built and was ready to get the software set up this week.  But in light of the problems with the current build and Tom's nonexistence, I'm tempted to just ask for my money back and come up with another solution.  Of course, getting my money back from someone who doesn't want to respond to customers' requests is a bit problematic.  This is very frustrating and disappointing.  I sincerely hope Tom returns and makes this right, not just the software but the loss of faith with his customers.  I was really talking up this software to some friends who are looking for similar solutions, but now I think I'll have to take back my recommendations, for now at least.

Link to comment

I have confirmed that the 324 release is not able to replace/upgrade an existing data disk. 

 

I did some experiments using my partially filled array, so it has nothing to do with rharvey's array being full.  It also has nothing to do with the disk size. My experiments were with two "small" 8 Gig drives.  It does have to do with the new disk not being partitioned (and possibly formatted) before being used for the upgrade.

 

When the 324 release starts its process to rebuild a disk where replacement hardware is detected in a given slot, it issues a "check" command to the unraid driver.  What I observed in my logs is a series of failed write attempts at increasing block addresses that eventually cause the disk to be marked as DISABLED (after 256 failed attempts, it seems).  This, I suspect, was supposed to be the zeroing/clearing of the drive.  Then, when the actual sync of data is to occur, the unRaid driver looks for disks marked as "INVALID" to rebuild and finds none. (The disk is marked DISABLED, not INVALID, so it ignores it.)
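
The sequence described above can be sketched as a tiny state model. This is only an illustration of the reported behaviour, not the unRaid driver's code: the names DiskState, clear_disk and disks_to_rebuild are hypothetical, the INVALID starting state for a freshly swapped disk is an assumption, and the 256-error threshold is the figure inferred from the logs.

```python
from enum import Enum


class DiskState(Enum):
    OK = "OK"
    INVALID = "INVALID"    # contents need to be rebuilt from parity
    DISABLED = "DISABLED"  # taken out of service after write errors


# Threshold inferred from the logs ("it seems"); treat it as an assumption.
MAX_WRITE_ERRORS = 256


def clear_disk(state, failed_writes):
    """Return the disk state after the clearing pass."""
    if failed_writes >= MAX_WRITE_ERRORS:
        return DiskState.DISABLED
    return state


def disks_to_rebuild(states):
    """The rebuild step only picks up disks marked INVALID."""
    return [slot for slot, state in states.items() if state is DiskState.INVALID]


# Assumed starting point: the replaced slot is INVALID and waiting for a rebuild.
states = {"disk1": DiskState.OK, "disk2": DiskState.INVALID}

# The clearing pass hits 256 failed writes, so the disk becomes DISABLED...
states["disk2"] = clear_disk(states["disk2"], failed_writes=256)

# ...and the rebuild step then finds nothing to do.
print(disks_to_rebuild(states))
```

In this model the rebuild never starts because the only candidate disk left the INVALID state before the sync step looked for it.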

 

I do have some good news.  The original release of unRaid *will* rebuild/upgrade a new disk, provided it has not already been marked as "DISABLED" by a failed upgrade attempt using the 324 release.  It also ignored DISABLED disks.  Unfortunately, once a disk has been marked as DISABLED there is no way to re-enable it without going through a "Reset Array Configuration" and if you had a failed drive you would lose the data on that drive unless you had copied it somewhere else.

 

But.. the 324 release will ALSO rebuild/upgrade a disk, but I suspect that it must be partitioned first before you attempt to add it to the array. (It may also need a reiserfs filesystem)

 

I know this because the 324 release successfully upgraded disks for me if the replacement disk had already been partitioned and a file system created on it BEFORE the upgrade was attempted.  I was experimenting with swapping a pair of identical 8 Gig drives in the same slot.  Adding the first drive to the slot worked fine; replacing the first with the second failed. (They had Windows file systems on them originally, so when the 324 release went to clear the disk it failed and marked the disk as DISABLED.)  After resetting the array configuration and re-adding the second as "new", it too was added successfully. Then, swapping the second with the first, the upgrade was successful (the first had already been partitioned and had a file system created on it).

 

I think this is why Tom could not duplicate this bug in his lab: he was always using a disk that had already been partitioned/formatted, while we were using disks that were blank from the factory.  There must be a difference in the management utility, in that the old one must partition/format the new drive before attempting the rebuild and the new 324 release apparently skips that step. I can probably look closer at the logs and output messages and see if I can detect the difference.

 

I need to do a lot more tests... unfortunately, it takes about 6 hours to  reset my array configuration each time I remove the new drive to start another test.

 

The good news is that it is fairly easy to partition a disk and create a file system from a telnet prompt, once I figure out just what needs to be done, that is... Or, for those of us with the old release, use it to do the upgrades/replacement of failed drives.

 

Tom, now that I've narrowed down how to duplicate this bug, care to get involved... This is a MAJOR bug... one that makes the 324 release very difficult to work with in the event of a failed disk. Unless you have someplace to copy the data on the failed drive you will lose data.

 

Is there any utility to mark a DISABLED drive as INVALID so it can try again to rebuild it? I mean, it could be a connector not seated correctly.  If it really was defective it would just get marked as DISABLED again.

 

Until this is fixed in 324, if a disk fails and you have the "previous" release on your GRUB menu, boot up on it when you physically install the replacement drive.  It will perform the upgrade. (It does not give good progress messages, but you will see the disk activity lights on the front of the server as it reads all the other disks to rebuild the one you are replacing.)

 

For those of you who only have the 324 release... you are not so lucky...

 

Do not attempt to upgrade a disk using the 324 release without first partitioning the replacement disk and creating a filesystem on it using fdisk and mkreiserfs.

 

Joe L.

Link to comment

Hey all,

 

I just ran into this problem (or something similar) using the 1.050930 software. I have a "full" server (12 disks) and was replacing a 120GB drive with a 400GB drive. The upgrade apparently failed, and my status page now shows the new 400GB drive as disabled. I tried putting the original  120GB drive back in, but I got a "drive too small" error.

 

So, at this stage, I guess my only option is to put the 120GB drive back in and do the "reset array configuration"? This should just regenerate parity based on the current contents of the data drives, correct?

 

TIA.

 

-mike

Link to comment

Bummer. 

 

Yes, it sure sounds like the same issue that I thought was only in the 324 release.  Richard Harvey also reported he had difficulty using the original release too.  (This would be a great time for Tom to pop in and say "I found it".)

 

If you had 120G worth of storage elsewhere on your LAN or on a different drive in your array, you could copy the contents of the currently disabled drive to it first and then "Reset Array Configuration" with the 400G drive in the slot.  It should detect it as new and ask you to format it, etc. Then copy the data back to the 400G drive.  I know of no way to change a drive's status from Disabled to Invalid and then have the unRaid array restore the original contents.  "Reset Array Configuration" is about it.  Just don't lose your original contents in the disk shuffle.

 

Since the 120G drive is a working drive, you might be able to use the reiserfs driver on one of your Windows machines and install the 120G drive there.  Then do a "Reset Array Config" with the 400G drive to add it as a new drive, and copy the contents from the old drive over the LAN to the 400G drive.  Slow, but the data will get there.

 

I'm now going through a visual code walkthrough of the unRaid driver to see if I can find any obvious logic errors that would cause intermittent write errors and a failure of a drive.  I don't have a clue if I'll be able to see anything, but hey, it is worth a try.  I originally suspected a variable that is not properly initialized, and thus works most of the time but fails if a person's RAM happens to initialize differently.  At least that is what I'm visually looking for. We know the basic logic works fine, as long as the disk is not marked as DISABLED.

 

Joe L.

Link to comment

Thanks for the info Joe. I'm not desperate for the extra space right now, so I might just put the 120GB drive back in until this issue is sorted out (fingers crossed). If I put the 120GB drive back in and "reset array configuration", I should be back where I started, right? It shouldn't erase any of my data drives, including the 120GB? I really don't want to have to re-rip 2.5TB of DVDs....

 

-mike

Link to comment

Could someone clarify the conditions under which a HDD upgrade/swapout will be successful under 324, please?  For Tom not to have spotted this bug, there have to be conditions under which UnRaid can work successfully.

 

If I:-

 

1/ Buy a larger drive

2/ Partition it under Linux using fdisk in a different Linux (Suse) machine

3/ Reiserfs format the partition under Linux

4/ Swap it with a smaller but healthy HDD in my current 12 HDD array

 

Will data be restored to the new drive and parity rebuilt?

 

Thanks,

 

Mark.

Link to comment

Correct.  If you put the 120G drive back in and "Reset Array Configuration" you should be back to where you were. It will not erase any of your data drives.

 

Joe L.

Link to comment

Well I'm very disappointed that since Tom's return, he hasn't responded to this issue.  This needs to be fixed ASAP. 

Yes, but since Tom logged in the other day and responded to a few posts he has not yet logged in again. (His series of posts was on the morning of the 4th; he has not logged in since that morning, so he probably has not read the past two days' posts yet.)

 

I agree, it is a serious bug, but from what we are seeing, elusive and difficult to reproduce in the lab.  We really don't know what trips it, or what is involved.  There are reports of the same issue in the original 930 release now.

 

Joe L.

Link to comment

Suse User,

I wish I could say for certain... truth is, we still don't know for sure since I have not tried that series of steps and as far as I know nobody else has either.  All I know is when I replaced a drive with one already formatted the upgrade process worked on 324.

 

Tom has said it works in his lab.  I was guessing it was because he used a drive that was already partitioned, but we honestly don't know.

 

We do know that if you partition and format the new drive in a different machine and then copy the data from your old drive to it over the LAN and then install it and "Reset Array Configuration" it will recalculate parity on the new drive and its contents.

 

Let us know how it goes.  Please try the initial partition/format steps, swap the drive into the unRaid array and let us know what happens.  At worst case, you can put the smaller drive back in and "Reset Array Configuration" and be back to where you were.

 

Joe L.

Link to comment
I don't have any larger drives, I need to order some.  I still think I'll wait a while longer before doing so to make sure everything will be OK.
 

 

Remember, I'm the person with the "shrinking" drive size. I'm worried enough about ending up with my parity HDD not being the largest again; experimenting on my live server to see whether I can replace a volume at the same time is not something I'm brave/daft (delete as applicable) enough to do right now.

 

Perhaps Tom's working to ensure the bug is fixed on the 2.6 kernel build he's currently working on, rather than spending the time fixing the older version?  I hope that's not the case, as I'd imagine there's plenty of scope for new problems with the kernel change and I'm just craving a stable, functioning UnRaid.

 

Mark.

Link to comment

Mark,

You are not the right person to ask to experiment... Your cautious approach is correct in your situation.  We don't want anyone to lose their data. (That defeats why we all purchased our unRaid arrays.)  If the new drive was bigger, it likely would have you go through a parity drive swap exercise, and nobody has reported that that does not work, but the parity drive swap is not something you want to do without a good reason.

 

Stay tuned on the sidelines... as you said, perhaps Tom will figure this out soon.

 

Joe L.

Link to comment

Update: Found and fixed this bug having to do with upgrading a disk.  The code is running through our test suite overnight and I'll post the new release tomorrow (that is, later today - Friday 7/7).

 

For Joe: problem was indeed partition table was not being written to new disk.  Our testing did not reveal this, even though we do a "dd if=/dev/zero of=<target device>", because the partition table is cached.  Solution in the test suite is to insert a "hdparm -z <target device>" command in there.  Solution in the code is a 2-line change.

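The test-suite change described above can be sketched as a dry run that only builds the command lines in order and never executes them. This is an illustration, not the actual test suite: the function name clear_and_reread_commands and the placeholder device /dev/hdX are hypothetical, and the two commands are copied as quoted in the post.

```python
def clear_and_reread_commands(device):
    """Build, without running, the two commands in the order described."""
    return [
        # Zero the target device, which also wipes its on-disk partition table.
        ["dd", "if=/dev/zero", f"of={device}"],
        # Ask the kernel to re-read the partition table so the cached copy is dropped.
        ["hdparm", "-z", device],
    ]


# /dev/hdX is a placeholder, not a real device name.
for argv in clear_and_reread_commands("/dev/hdX"):
    print(" ".join(argv))
```

Without the second command, a test could still see the old cached partition table even though the disk itself had been zeroed, which matches the explanation given for why the lab runs passed.
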
Link to comment

