[SOLVED] WTF just happened?

brandon3055 · May 11, 2020

So my nas consists of 2 120GB cache ssd's connected directly to the onboard sata controller and 6 3TB HDD's connected via a reflashed LSI SAS9220 card.

Its been running flawlessly for several years now. A couple weeks ago i upgraded it from an FX-8320 to an i7-4770S and it continued to run flawlessly... until just now.

So here is what happened.

As far as i could tell everything was running fine before this happened.
I went to start up a VM that i have not used since the upgrade and unraid complained that the installation iso was not available (that should have been my first hint that something was wrong)
But i didn't think much of it at the time so i just went into the vm settings and removed the iso as its no longer needed.
Then as i pressed update i noticed unraid had automatically assigned the LSI card to the vm as a passthrough device... The same LSI card that the drives are attached to. It took me a second to realize because normally that isn't even a passthrough option.
I tried to unassign the device but at this point i was no longer able to make changes to the vm. The update process would just hang indefinitely.

Around this time i got a notification that "6 drives in the array have errors"

So i gave up on the vm and tried to stop the array. But the stop did nothing.

So next i disabled auto start in disk settings and told unrad to reboot which did work.

So thats what happened. I still don't understand it but now i need to deal with the aftermath...

After the reboot everything "seemed" fine so i started the array. Unraid decided this would be a good time for a parity check so i immediately canceled that because i was still unsure of the state of the array.

I then went back to the vm to unassign the LSI card and found that the LSI card was no longer even a passthrough option which confirmed that it never should have been an option in the first place.

I then tried to start the vm again only to find that the vm image was not available. It was at this point i discovered 2 of my drives had been kicked from the array.

one of my parity drives and my first data drive which just so happens to store my vm images among other things.

I tried to view the contents of drive 1 via the unraid ui to see if the contents were emulated but all i got was an empty folder.

So next i rebooted unraid again after which the disk appeared to be successfully emulated. I decided to check the contents of the shares to make sure everything was there. I started by opening the movies folder in my media share (which is actually on a completely different disk) but what i got was the contents of a completely different folder. I'm not sure if it was a folder in the same share or from a completely different disk.

I checked the same folder (movies) via the unraid web ui and the contents looked normal.

So i figured maybe it had something todo with the fact that i had some of the shares manually mounted in linux before i rebooted unraid (and they were still mounted)
So i unmounted those locations and rebooted unraid again. As far as i can tell that fixed the problem. I have not checked every share my my movies folder now contains movies again.

And so that's where i am at. I can now theoretically re assign those 2 drives and rebuild them but given everything that just happened i have decided to just stop the array and seek some professional help before i start rebuilding disks. At the very least i would like to somehow mount disk 1 and backup its contents first if that's possible. I have attached diagnostics from after the initial reboot as well as the diagnostics as of right now. I'm not sure if you need both but as i have them i figured it couldn't hurt.

I initially thought this was caused by unraid assigning the LSI card to the vm but considering the vm initially failed to start because of a missing file i'm guessing disk 1 was gone before i tried to start the vm. Unless the vm tried to pull the lsi card from unraid when i initially tried to start it. I just have no idea why unraid tried to give the LSI card to that vm...

nas-diagnostics-20200511-1306.zip nas-diagnostics-20200511-1428.zip

brandon3055 · May 11, 2020

So i think i know exactly what happened. Before the system upgrade i had a pci network card passed through to the vm. I removed that card when i did the upgrade. And i'm guessing the LSI card's new pci id after the upgrade just happened to line up with the id of the old network card... So when i tried to start the vm it did try to pull the LSI card away from unraid which obviously caused all hell to break loose. And it makes sense that disk 1 was affected because disc one is where i store appdata, system, etc so it would have been active at the time and by extension parity would have also been active.

This leads me to question the integrity of the remaining "good" parity disc which is being used to emulate disk 1.

Now that i am pretty sure i know what happened i think my best bet is to backup the contents of disk one (if someone can tell me how to do that) then rebuild the disk and compare it to the backup.

But before i do anything i would like a second opinion.

JorgeB · May 11, 2020

In this case I would do a new config and re-sync parity, still make sure actual disk1 is mounting correctly before doing it, you can do that with UD, disk1 must unassigned and array stopped.

brandon3055 · May 11, 2020

When you say do a 'new config' what exactly does that mean? I did some digging and i have pieced together a few things but i never found a definitive set of instructions explaining the process. From what i have gathered it sounds like i would just backup my license key then essentially do a fresh install of unraid. Then as long as i re assign all my disks to the same slots my data would be saved and parity would be rebuilt.
But if i'm not mistaken that would also nuke pretty much everything else on the system docker, vm's, shares etc. So what exactly are you suggesting i do?

I currently have disk 1 mounted read only and am backing up its contents which looks like it will take a while so i have some time to figure this out.

JorgeB · May 11, 2020

2 minutes ago, brandon3055 said:

From what i have gathered it sounds like i would just backup my license key then essentially do a fresh install of unraid.

No.

First:

1 hour ago, johnnie.black said:

make sure actual disk1 is mounting correctly before doing it, you can do that with UD, disk1 must unassigned and array stopped.

Then:

-Tools -> New Config -> Retain current configuration: All -> Apply

-Check all assignments and assign any missing disk(s) if needed.

You can even check "parity is already valid", then just run a correcting check since a few sync errors are expected.

brandon3055 · May 11, 2020

Yea i figured out i was barking up the wrong tree about 2 minutes after my last post when google turned up a mention of the "New config utility"

I guess there is probably no point finishing this backup before i attempt this. Especially since rsync is currently estimating 3700 hours for a single 90G file.
Here's hoping that not due to an issue with the drive...

brandon3055 · May 11, 2020

Ok so before i attempt to mount this. Does this look normal to you? I expected file system to be xfs not zfs_member
bac26

I did get an odd error when i tried to mount the drive from the command like in order to back it up.

root@NAS:/# mount /dev/sdf1 /mnt/recovery
mount: /mnt/recovery: more filesystems detected on /dev/sdf1; use -t <type> or wipefs(8).

As a result i had to specify the file system manually in order to mount the drive.

root@NAS:/# mount -r -t xfs /dev/sdf1 /mnt/recovery

Is this normal or am i looking at possible file system corruption?
Edit: This drive was part of a ZFS pool before i switched to unraid and reformatted everything.

Edited May 11, 2020 by brandon3055

JorgeB · May 11, 2020

4 minutes ago, brandon3055 said:

Does this look normal to you?

Not normal but likely the disk was used with zfs before, xfs partition is still mounting correctly, check contents look correct, don't forget to unmount before the new config.

brandon3055 · May 11, 2020

As far as i can tell everything is there but its really impossible to know for sure. I'm wondering if it would be a good idea to start the array one last time and backup the "emulated" contents of the disk. because i'm guessing that wont be an option once i run new config.

JorgeB · May 11, 2020

3 minutes ago, brandon3055 said:

I'm wondering if it would be a good idea to start the array one last time and backup the "emulated" contents of the disk.

There's no reason to think that any data is missing/damaged but It's up to you, though IMHO instead of doing that a better option would be to rebuild to a spare disk.

brandon3055 · May 11, 2020

1 minute ago, johnnie.black said:

There's no reason to think that any data is missing/damaged but It's up to you, though IMHO instead of doing that a better option would be to rebuild to a spare disk.

As a temporary "in case this does not work" option or as a permanent solution to my problem? Because I do have a spare 3TB drive but i have other plans for that drive.

JorgeB · May 11, 2020

As a way to be able to compare the data on the rebuilt disk vs old disk, then if all OK you can rebuild again on the old disk (or do the new config).

brandon3055 · May 11, 2020

Ok rebuild is in progress. I guess i will get back to you in around 4 to 7 hours to let you know how everything went.

Thank you for all the help!

JorgeB · May 11, 2020

Note that you'll need to change the xfs UUID to be able to mount both disks at the same time, you can do that for the disk outside the array with:

xfs_admin -U generate /dev/sdX1

brandon3055 · May 12, 2020

Alright so the rebuild completed successfully and it looks like i managed to avoid any data corruption.

I was not able to use UD for any of this due to the fact it sees my drives a zfs devices and there for can not mount them but i was able to do everything via the command line. In case anyone finds it useful here is a summary of what i ended up doing.

First rebuild disk1 onto a spare drive. Then i shut down the array and unassigned the 'new' disk1
Next i regenerated the UUID for the new disk1 using the command provided by johnnie.black and mounted both the old and the new disk1 in read only mode like so.

root@NAS:/# xfs_admin -U generate /dev/sdf1
root@NAS:/# mkdir /mnt/new-disk1
root@NAS:/# mkdir /mnt/old-disk1
root@NAS:/# mount -r -t xfs /dev/sdf1 /mnt/new-disk1
root@NAS:/# mount -r -t xfs /dev/sdg1 /mnt/old-disk1

Now to compare the contents of the drives i used rsync in dry run mode. I did this in 2 passes. The first just used the default compare function which is based on file time stamps. This gave me a list of files that were most likely modified by normal processes while the disk was being rebuilt. But it was unlikely to identify corrupt files. Its important to note the rsync direction is important old -> new because if there are any files missing completely tthey will be missing from the new drive and as a result would be ignored in a new -> old rsync.

root@NAS:~# rsync -hanP /mnt/old-disk1/ /mnt/new-disk1/ 
sending incremental file list
./
appdata/binhex-delugevpn/deluge-web.log
appdata/binhex-delugevpn/deluged.log
appdata/binhex-delugevpn/supervisord.log
appdata/radarr/nzbdrone.db
appdata/sonarr/logs.db-shm
appdata/sonarr/logs.db-wal
appdata/sonarr/nzbdrone.db
system/docker/docker.img
system/libvirt/libvirt.img

Everything there looks normal. These are all files i would expect to be modified while the array is running.
Now for the second pass i used rsync in checksum mode compares the entire contents of each file. This WILL detect if any files have been altered due to corruption. So what i was hoping to see with this command is the exact same output as the first command. Because this indicates that all of the detected changes are 'normal' changes and unlikely to be caused by corruption. Fortunately this is exactly what i got.

root@NAS:~# rsync -hancP /mnt/old-disk1/ /mnt/new-disk1/ 
sending incremental file list
./
appdata/binhex-delugevpn/deluge-web.log
appdata/binhex-delugevpn/deluged.log
appdata/binhex-delugevpn/supervisord.log
appdata/radarr/nzbdrone.db
appdata/sonarr/logs.db-shm
appdata/sonarr/logs.db-wal
appdata/sonarr/nzbdrone.db
system/docker/docker.img
system/libvirt/libvirt.img

It's worth noting this second pass took several hours to complete as it had to read every file on both disks.

This indicates that both the old drive and the new drive are most likely completely intact with no file corruption. If ether or both drives had random corruption i would expect to see other files in this list. And if that was the case i would need to inspect both versions of each file determine which of the disks has the most valid data and figure out where to go from there.

So at this point i knew all of my data was most likely intact and i had 2 option. Ether use new config to 'import' the old disk 1 then update parity my good parity disk followed by a rebuild of my second parity disc (at this point i dont think it would be worth importing)
Or simply repeat the rebuild process (which i now know works) and rebuild onto the old disk 1. Followed by a rebuild of my second parity disk.

I chose option 2 as its the option that i now know should work and if it does not i now have a backup of disk1.

It's worth noting that if i did not need the new drive for another system i could have just left it in place and kept the old disk 1 as a spar.

JorgeB · May 12, 2020

Thanks for reporting back, if you don't mind I'm going to tag this solved.

[SOLVED] WTF just happened?

Recommended Posts

brandon3055

Link to comment

brandon3055

Link to comment

JorgeB

Link to comment

brandon3055

Link to comment

JorgeB

Link to comment

brandon3055

Link to comment

brandon3055

Link to comment

JorgeB

Link to comment

brandon3055

Link to comment

JorgeB

Link to comment

brandon3055

Link to comment

JorgeB

Link to comment

brandon3055

Link to comment

JorgeB

Link to comment

brandon3055

Link to comment

JorgeB

Link to comment

Join the conversation