Rebuild Issue!!!



Hi all,

 

I'm a bit of a panicking UnRaid newbie over here and need some help from the experts!

I have been building up my UnRaid system to replace my old hackintosh build for video and photo editing work. It has 5 high-capacity hard drives (1x12TB, 2x10TB, 2x8TB) and a 1TB NVMe cache drive on a Ryzen 3900X B450 ITX system with 64GB RAM.  Today, after a full parity check over the last couple of days, I had to migrate from a two-port NVMe SATA port card to a proper five-port unit that passes through the actual drive IDs. The original card didn't, as it could only operate in hardware RAID or as amassed JBOD, and it had one of my 8TB HDDs plugged into it.  I swapped out the unit and re-ran all my cables. On boot-up, as expected, UnRAID didn't recognize the now-correct Seagate drive ID of the rewired 8TB drive, so I reselected it into the same slot in the array for a rebuild and started the process.

 

All looked well, with the parity sync going at 150MB/s, so I went to start up my MacOS VM, which lives predominantly on my cache drive (primary virtual disk location). After it booted I noticed that for some odd reason my sync was now paused, so I unpaused it.  Suddenly the sync said it was running at 22GB/s and took off, chugging through to 10% rebuilt, which made no sense after only 5 minutes of running, so I immediately paused the sync again and hard-stopped the virtual machine.  I then tried cancelling the rebuild, taking the array offline and restarting the system, hoping for the option to restart the parity sync. On boot this time it did show the orange emulated-drive triangle next to the drive, but when I restarted the array it didn't offer to rebuild; it now showed the drive as unmountable, and my array usage had gone from my actual 83% up to 91%.

 

Reading through the forums for an answer, I saw it noted that to re-add the same drive to the missing slot as if for a fresh rebuild, I should unassign the disk, start the array without it so the system forgets the name, and then take the array offline again and re-add the disk. I held my breath and followed those steps.  Now, however, instead of offering to re-add it as in a parity sync operation, it shows the blue square of a new device and "Too many wrong and/or missing disks!" in the array start box with it selected, and a red 'X' on my 12TB parity drive, listing it as disabled.  Rebooting the system doesn't change this.

 

How do I get back to the start of this last stressful hour and tell the system to rebuild from the parity data on the 8TB drive?  PLEASE HELP!!!

Edited by TheTechnoPilot
Added note on parity drive
10 minutes ago, johnnie.black said:

Diags are after rebooting so not much to see, also SMART is disabled for the parity disk. Re-enable parity and start the array, then post new diags after that so we can see the SMART report, or if you want, wait until/if there are issues and post them at that time.

I just did as you outlined, and when I went to reselect the parity disk after stopping the array, it was now listed as a new disk (blue) as well. The array also won't start now, as it first did with only drive 2 (my original problem) missing, even with drive 2 not selected and still listed as unassigned!  Please note that my issue was NEVER originally a problem with my parity drive.  Please tell me I am not in an even worse boat now, as it feels like.  I have attached the new diagnostics.

coultonstudios-diagnostics-20200103-1627.zip

22 minutes ago, johnnie.black said:

Diags are after rebooting so not much to see, also SMART is disabled for the parity disk. Re-enable parity and start the array, then post new diags after that so we can see the SMART report, or if you want, wait until/if there are issues and post them at that time.

To be clear, I unassigned the parity disk from the array and restarted it, hoping to then re-add it and confirm that parity was valid.  I did not use the New Configuration method, which, now that I look at it, specifically says it will make recovery of a failed disk impossible, and that is essentially what I am trying to do (because the disk was being rewritten).

Edited by TheTechnoPilot
5 minutes ago, johnnie.black said:

My bad, I missed that disk2 was disabled. At least now there's SMART for parity, and it looks fine. I'll post the instructions to re-enable parity and attempt to rebuild disk2 in a minute. Was this the original disk2?

 

Device Model:     ST8000DM004-2CX188
Serial Number:    WCT07A05

It is the original disk, yes, but unfortunately it has already had a rebuild started on it, which I cancelled when it took off claiming it was rebuilding at 22GB/s. That was truly impossible across even a PCI interface, let alone SATA. As I understand it, I need to bring the array back online with the other disks and do just a rebuild from parity onto ST8000DM004-2CX188_WCT07A05.
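The "impossible" speed can be checked with quick arithmetic. A minimal sketch, assuming the commonly quoted SATA III ceiling of about 600 MB/s per link (6 Gbit/s line rate with 8b/10b encoding); the function name is hypothetical and speeds are given in whole MB/s:

```shell
# Approximate usable SATA III throughput per link, in MB/s (assumption: 6 Gbit/s, 8b/10b encoding).
SATA3_MAX_MBS=600

# Classify a reported speed (whole MB/s) against what one SATA III link can carry.
classify_speed() {
  if [ "$1" -gt "$SATA3_MAX_MBS" ]; then
    echo implausible
  else
    echo plausible
  fi
}

classify_speed 22000   # the 22GB/s shown after unpausing; prints: implausible
classify_speed 150     # the rate at the start of the rebuild; prints: plausible
```

A reported rate far above the link limit suggests the progress counter was advancing without data actually moving to the disk, so it is worth stopping and saving diagnostics at that point.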

 

Oh, I will just note that it originally needed a rebuild, as mentioned, only because the device ID changed when I switched the underlying SATA controller for that disk to one that properly reports its ID.  Perhaps I should have come on here first to see if one of you wizards could have told unraid to ignore the ID and trust that the disk was the same.

Edited by TheTechnoPilot
for clarity on situation

That disk is showing a recent UNC @ LBA error (read error), so it might not be all good, though these errors can sometimes be intermittent, but if you think original data is no longer valid you can try rebuilding on top again, if the rebuild fails again you'll need a new disk.

 

Before rebuilding on top it might be a good idea to try and mount old disk2 with UD to see if it's still mountable or not, and if it is and there's some data it would be best to use a new disk for the procedure below.

 

 

-Tools -> New Config -> Retain current configuration: All -> Apply
-Assign any missing disk(s) including old parity if needed and old or new disk2
-Important - After checking the assignments leave the browser on that page, the "Main" page.

-Open an SSH session/use the console and type (don't copy/paste directly from the forum, as sometimes it can insert extra characters):

mdcmd set invalidslot 2 29

-Back on the GUI and without refreshing the page, just start the array, do not check the "parity is already valid" box (GUI will still show that data on parity disk(s) will be overwritten, this is normal as it doesn't account for the invalid slot command, but they won't be as long as the procedure was correctly done), disk2 will start rebuilding, disk should mount immediately but if it's unmountable don't format, wait for the rebuild to finish and then run a filesystem check.

 

 


Thanks johnnie.black!  I admittedly did have one minor SMART report on that disk in the past, which is probably what you see in the log, but I felt it was minor enough to ignore for the time being.

So before I go through with this procedure: first off, yes, disk2 still shows as mountable (it does succeed in mounting in UD) and has roughly the correct used space.  However, currently I don't have a spare disk to throw in the slot that is large enough that isn't already an almost complete backup for the data on disk2.  I would think that, in the circumstances, I couldn't trust the validity of the data on the drive once the rebuild had started writing to it. Or, because it was all valid data being replaced identically with itself, would it remain valid? (Though the sudden super-fast rebuild makes me question that.)

 

I also want to check your thoughts on why as I described in my initial post:

21 hours ago, TheTechnoPilot said:

All looked well, with the parity sync going at 150MB/s, so I went to start up my MacOS VM, which lives predominantly on my cache drive (primary virtual disk location). After it booted I noticed that for some odd reason my sync was now paused, so I unpaused it (it was at most 1-2% at that point).  Suddenly the sync said it was running at 22GB/s and took off, chugging through to 10% rebuilt, which made no sense after only 5 minutes of running, so I immediately paused the sync again and hard-stopped the virtual machine.  I then tried cancelling the rebuild, taking the array offline and restarting the system, hoping for the option to restart the parity sync

Was booting a VM that mainly lives on the cache drive a mistake that UnRAID tried to protect me against by pausing the parity/sync?  I kinda want to understand what happened before bringing things back online if that makes sense.

26 minutes ago, TheTechnoPilot said:

However currently I don't have a spare disk to throw in the slot that is large enough that isn't already an almost complete backup for the data on disk2.

So if I'm understanding correctly, you already have a backup of disk2, or at least of the most important data. If that's correct then yeah, try rebuilding on top; if needed you can also try to back up any missing data using the old disk with UD.

 

28 minutes ago, TheTechnoPilot said:

All looked well, with the parity sync going at 150MB/s, so I went to start up my MacOS VM, which lives predominantly on my cache drive (primary virtual disk location). After it booted I noticed that for some odd reason my sync was now paused, so I unpaused it (it was at most 1-2% at that point).

I don't see how the VM would interfere with the rebuild unless some hardware is being wrongly passed through, for example a disk controller, which would stop being available to Unraid when the VM started and cause issues with the disks. Other than that I can't think of another reason, but that's why it's important to always save the diags before rebooting/shutting down.

21 minutes ago, johnnie.black said:

So if I'm understanding correctly, you already have a backup of disk2, or at least of the most important data. If that's correct then yeah, try rebuilding on top; if needed you can also try to back up any missing data using the old disk with UD.

Would there be any chance that what I am copying off has been corrupted during the attempted rebuild?  I would assume so...

5 minutes ago, TheTechnoPilot said:

Would there be any chance that what I am copying off has been corrupted during the attempted rebuild?

Difficult to say for sure. The rebuild being cancelled might not be a big deal: if the disk was being rebuilt with the same data that was there before, i.e., if nothing was written to the emulated disk, canceling the rebuild by itself wouldn't be a problem, since everything before and after the sector where it stopped would be as it was; at most the file at that point could be corrupted. But if something really went wrong during your rebuild, like errors on other disks, then yeah, that data would be corrupt.


Thank you for your help, johnnie.black!  I decided to run out today and buy a new 10TB to replace that 8TB for this, especially since the 8TB is still mountable.  I am so happy to say your procedure has my array started again and rebuilding disk2 onto this new drive at a fairly sustained 185MB/s, and I look forward to hopefully reporting back with success no later than Sunday morning!  My only regret is that I obviously don't get to run my normal check of the new drive (I actually use another test method from a program within MacOS, which doesn't just write out the drive but does two full writes and read-outs plus a sustained random access test).  Oh well, guess this will partially test it.
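For a rough idea of when a rebuild like this should finish, the arithmetic is simply size divided by speed. A minimal sketch, assuming the whole 10TB has to be written and the speed stays constant at the reported 185MB/s (spinning disks usually slow down toward the end, so treat the result as a lower bound); the function name is hypothetical:

```shell
# Estimated hours to write a whole disk at a constant speed.
# $1: disk size in TB (decimal, 10^12 bytes)
# $2: sustained speed in MB/s (10^6 bytes per second)
rebuild_hours() {
  awk -v tb="$1" -v mbs="$2" 'BEGIN { printf "%.1f\n", (tb * 1e12) / (mbs * 1e6) / 3600 }'
}

rebuild_hours 10 185   # prints 15.0
```

By the same arithmetic, an 8TB disk at a genuine 22GB/s would be finished in about six minutes, which is another way to see that the earlier figure could not have been a real transfer rate.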

 

Oh BTW, yes, the system showed disk2 as unmountable in this case.  For protection's sake I have all dockers shut down, am not starting any VMs, and am only letting the system focus on the rebuild this time!  NO RISKS!

Edited by TheTechnoPilot
On 1/3/2020 at 1:43 PM, johnnie.black said:

Difficult to say for sure. The rebuild being cancelled might not be a big deal: if the disk was being rebuilt with the same data that was there before, i.e., if nothing was written to the emulated disk, canceling the rebuild by itself wouldn't be a problem, since everything before and after the sector where it stopped would be as it was; at most the file at that point could be corrupted. But if something really went wrong during your rebuild, like errors on other disks, then yeah, that data would be corrupt.

So, frustratingly, while the parity/sync has completed, drive2 is still showing up as unmountable/no file system...  Could you take a peek, johnnie.black, and let me know what the situation is?  What are my options here? Is there anything that I can do?

coultonstudios-diagnostics-20200104-1942.zip

13 hours ago, johnnie.black said:

That's expected: if it was unmountable at the beginning of the rebuild it would be the same at the end. You need to check the filesystem on disk2:

 

https://wiki.unraid.net/Check_Disk_Filesystems#Checking_and_fixing_drives_in_the_webGui

 

Remove -n or nothing will be done, and if it asks for -L, use it.
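For reference, the same check can also be run from the console instead of the webGui. A minimal sketch, assuming an XFS disk, the array started in Maintenance mode, and array disk N exposed as /dev/mdN (device naming may differ between Unraid versions). The helper name is hypothetical, and it only prints the command so it can be reviewed before being run by hand:

```shell
# Print the xfs_repair command for an array disk without running it.
# $1: array disk number (ASSUMPTION: the disk is exposed as /dev/md<number>)
# $2: check | repair | zero-log
repair_cmd() {
  case "$2" in
    check)    echo "xfs_repair -n /dev/md$1" ;;  # -n: report problems only, change nothing
    repair)   echo "xfs_repair /dev/md$1" ;;     # no -n: actually repair
    zero-log) echo "xfs_repair -L /dev/md$1" ;;  # -L: zero the log, only if xfs_repair asks for it
    *)        echo "unknown mode: $2" >&2; return 1 ;;
  esac
}

repair_cmd 2 check    # prints: xfs_repair -n /dev/md2
repair_cmd 2 repair   # prints: xfs_repair /dev/md2
```

Running the check form first shows what would be changed; `-L` discards the filesystem log, so it is kept as a separate, explicit step.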

Okay great!  That got it mountable, though sadly all the files are strewn about in the lost+found folder...(none seem to have really survived outside that).  The first log is the diagnostic file from that.

 

So my solution was to copy over all my backups, and then the files I need from the original disk2; however, I ran into a very weird issue.  With the array started, the original disk2 wouldn't mount, so I decided to restart the system and try mounting it first.  When I did that the disk mounted fine, but when I then started the array, the new disk2 showed up as unmountable.  To see if the two are related, I stopped the array and unmounted the original disk2 (note that this one is always mounted from UD), then restarted the array and all drives showed up fine!  I have included a second diagnostic from this reboot and its weird symptoms as well; can you please take a look and tell me what is going on there?  Do I have any reason not to trust the array?  Or is it just some type of collision with the old disk2, and is there any way to mount it in UD without causing this, so I can transfer files from the old disk onto the new disk2?

 

EDIT - Note that my own troubleshooting seems to suggest it is because both drives share the same UUID.  Is there any way to change this on the old disk2 to allow it to be mounted after starting the array, or to block the new disk2 from mounting when the old one is pre-mounted?

coultonstudios-diagnostics-20200105-2225.zip coultonstudios-diagnostics-20200105-2232.zip

Edited by TheTechnoPilot
10 hours ago, TheTechnoPilot said:

Note that my own troubleshooting seems to suggest it is because both drives share the same UUID.  Is there any way to change this on the old disk2 to allow it to be mounted after starting the array, or to block the new disk2 from mounting when the old one is pre-mounted?

Yes, UUID will be duplicated, you can change the one from the UD device with:

xfs_admin -U generate /dev/sdX1

 

On 1/6/2020 at 4:06 AM, johnnie.black said:

Yes, UUID will be duplicated, you can change the one from the UD device with:


xfs_admin -U generate /dev/sdX1

 

AMAZING!!!  Thank you for that final tip!  That quickly fixed the problem, and I am happy to say that so far all files seem to have survived on the old disk. I am beginning to copy them all over, project by project, checking file integrity.

 

Also, I have discovered what happened and why my VM trashed my rebuild. It seems that my USB3.2 PCI-E device, which I was passing through to MacOS on slot 1, became slot 2 when I installed the new SATA controller, and that SATA controller, which was controlling the parity drive, drive 1, and drive 2, became slot 1 and got yanked from UnRAID when I started the VM. I have no clue what happened when I unpaused the rebuild while the controller was still passed through.  Lessons definitely learned; guess we can also flag this as solved thanks to you!!!
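A slot shuffle like this can be caught before starting a VM by checking whether any PCI address handed to the VM now belongs to a storage controller. A minimal sketch, assuming the passthrough addresses are already known (for example from the `hostdev` entries in the VM's XML) and that `lspci -D` output is piped in; the function name and the sample devices are hypothetical:

```shell
# Print every passed-through PCI address that lspci describes as a storage controller.
# stdin: output of `lspci -D` (one device per line, full address first)
# args:  PCI addresses assigned to the VM
storage_passthrough_conflicts() {
  awk -v addrs="$*" '
    BEGIN {
      n = split(addrs, list, " ")
      for (i = 1; i <= n; i++) wanted[list[i]] = 1
    }
    ($1 in wanted) && /SATA controller|RAID bus controller|Serial Attached SCSI controller|Non-Volatile memory controller/
  '
}

# Example with made-up lspci lines; on a real system use:
#   lspci -D | storage_passthrough_conflicts <addresses given to the VM>
printf '%s\n' \
  '0000:01:00.0 SATA controller: Example Vendor SATA Controller' \
  '0000:02:00.0 USB controller: Example Vendor USB Host Controller' |
  storage_passthrough_conflicts 0000:01:00.0
```

With the sample input this prints the SATA controller line, meaning the VM would take a disk controller away from the host. Empty output means none of the listed addresses is a storage controller. Re-running a check like this after adding or moving any PCIe card is cheap, since addresses can shift as they did here.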

 

Now my build challenge is just getting my VM back up and running, as I decided I should start the XML from scratch after that. :P

Edited by TheTechnoPilot
On 1/6/2020 at 4:06 AM, johnnie.black said:

Yes, UUID will be duplicated, you can change the one from the UD device with:


xfs_admin -U generate /dev/sdX1

 

OMG, more god-damn issues!!!  I rebooted my server and used the ControlR app to start the array, not realizing that my disk4 hadn't been detected, I think, and it showed up as unmountable.  Now, after stopping the array, it is of course recognizing it as new and wanting to do a parity rebuild on it... but the disk itself is definitely still mountable from UD.  johnnie.black, can you please walk me through the steps to re-add it to the array without rebuilding it? It seems to be mounting fine outside of the array, and I definitely don't want to rebuild it and risk going through what I just did on disk2 (where I didn't end up using any of the rebuilt data, only my backup and the original disk2).  Please see the attached log in case there is something I am missing!

coultonstudios-diagnostics-20200108-0458.zip


Note that the disk is showing a recent UNC @ LBA error, so it might have issues. It would be safer to rebuild to a new disk and keep that one intact for now, but if you want to re-enable it, do a new config and trust parity. Note that since the disk was mounted read/write outside the array, you'll need to run a correcting parity check because a few sync errors are expected.

7 hours ago, johnnie.black said:

Note that the disk is showing a recent UNC @ LBA error, so it might have issues. It would be safer to rebuild to a new disk and keep that one intact for now, but if you want to re-enable it, do a new config and trust parity. Note that since the disk was mounted read/write outside the array, you'll need to run a correcting parity check because a few sync errors are expected.

So...ugh, it seems I am missing a bunch of data from the drive. Is there harm if I pull it and put in a new drive for a rebuild from the existing parity?  Am I just asking for issues doing that?  I really don't understand how I could be missing whole sections of data from the drive; literally it is like it is missing its portion of an entire share, approx. 1.5TB worth.  I am now suspecting the new SATA cables I put in might be flaky, as this drive wasn't on the new SATA controller but is on a new cable.  Either way, I am not going to do anything till I replace the cables to be sure.

5 minutes ago, TheTechnoPilot said:

So...ugh, it seems I am missing a bunch of data from the drive. Is there harm if I pull it and put in a new drive for a rebuild from the existing parity?  Am I just asking for issues doing that?  I really don't understand how I could be missing whole sections of data from the drive; literally it is like it is missing its portion of an entire share, approx. 1.5TB worth.  I am now suspecting the new SATA cables I put in might be flaky, as this drive wasn't on the new SATA controller but is on a new cable.  Either way, I am not going to do anything till I replace the cables to be sure.

There were 8 hours between the post you quoted and your post. The post you quoted suggested some possible approaches. You don't mention if you actually did anything during those eight hours. Did you just examine the drive and discover you might be missing something, or did you do more than that?
