SOLVED Hard Crash...disk error... looking for advice before proceeding



Unraid 6.6.7
Here is what I am experiencing:


VMs kept going to resume


Attempted to shut down the VMs by hitting resume, then quickly hitting stop.  One of the two VMs shut down.  When I did the same for the second VM, Unraid crashed: no webgui, no ssh/telnet access.  The server is headless, so I don't have a keyboard or monitor (will in future), and I was unable to get diagnostics for this original crash.  That said, I know my cache drive was overutilized, as I had this happen once before (VMs going to resume) when a docker log filled the cache drive.  I also know a couple of the array disks had little free space, and I know that can sometimes be an issue.
Due to the hard crash, I power cycled Unraid.  Unraid came up and a parity check started.  My two VMs that were running during the crash were no longer listed in my VM tab (my two other VMs that were not running at the time were the only two listed).  A handful of dockers were running; the rest would not start.
Decided to reboot Unraid via the webgui to see if they would return.  It came back exactly the same way: only 2 VMs and a handful of dockers.  Also note that it now says the parity check was canceled, so I probably should not have rebooted while the parity check was in progress.
Deleted docker.img and recreated with 10 or so of my key dockers from templates.  Worked with no issues.   Went to bed.
Got up this morning, read some forums, and decided to run a btrfs balance on the cache drive.  While this was running, I looked into the VM issue.  Realized I had an older libvirt.img in another location and switched to this image.  Went to the VM tab and saw a bunch of older VMs I had from, let's say, a year or so ago.  All VMs were stopped.
I then went to my VM XML file backups and copied the XML for my main Windows 10 VM, went to XML mode, pasted the backup XML, unchecked start on creation, and then created the new VM.
I immediately got a disk 3 read error, followed by notification that disk 3 was disabled.
Went to the btrfs balance and saw it was still running, so I canceled it, stopped the array, went to settings to turn off array auto start, rebooted the server, and started the array in maintenance mode.
After rebooting in maintenance mode, disk 3 shows up and has a SMART report; looking at the SMART attributes, I don't see any issues.  I am currently running an extended SMART test on this disk and am waiting for results.
I know this is a lot...any advice on how to proceed?    

tower-diagnostics-20190506-0648.zip

Edited by goinsnoopin
Link to comment
Just now, johnnie.black said:

Disk3 and another unassigned 1TB disk dropped offline, likely a controller issue since both are on the same controller. Reboot and post new diags.

 

Docker image errors were because cache was full.

 

Thanks johnnie I was awaiting your reply, least it helps with me scanning these log files. 

Link to comment

Johnnie....I am not passing any sata controllers to VMs.  The second disk that crashed at the same time is an unassigned disk holding an image file that is mounted to one of the running VMs as storage for my son's video games, i.e. a D: drive.  The Win10 VM's C: drive is on the SSD cache drive.  Does this second thought impact your suggestion about rebuilding on the same disk?

 

 

 

😄

Link to comment
8 minutes ago, goinsnoopin said:

..I am not passing any sata controllers to VMs.

I was trying to avoid going through the VM diags, but yes you are:
 

-device vfio-pci,host=07:00.0,bus=root.1,addr=00.2,multifunction=on

This is the previously mentioned 2 port Asmedia controller.
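
For anyone checking their own diagnostics, the host= addresses can be pulled out of a QEMU command line mechanically. This is a minimal Python sketch, assuming the command line is available as a string (for example copied from a VM log in the diagnostics zip). The function name is made up for illustration, and the second device in the sample is hypothetical.

```python
import re

# Matches "-device vfio-pci,host=07:00.0,..." and captures the host PCI address.
# The optional "0000:" prefix covers the domain-qualified form of the address.
VFIO_HOST = re.compile(
    r"-device\s+vfio-pci,host="
    r"((?:[0-9a-fA-F]{4}:)?[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-7])"
)

def passed_through_hosts(qemu_cmdline):
    """Return every host PCI address handed to the guest via vfio-pci."""
    return VFIO_HOST.findall(qemu_cmdline)

# First device is the line quoted in the post above; the second is hypothetical.
sample = (
    "-device vfio-pci,host=07:00.0,bus=root.1,addr=00.2,multifunction=on "
    "-device vfio-pci,host=01:00.0,bus=root.1,addr=00.0"
)
print(passed_through_hosts(sample))
```

Each address in the result can then be compared against the lspci listing to see what hardware actually sits at that address today.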

Link to comment

Johnnie....I am confused.   I have 3 pcie cards that I pass through.   Two Nvidia GPUs (one to each VM) and one USB3 pcie card to one of the VMs.

 

I have no sata pcie cards in my case....however my motherboard ASRock Z97 Extreme6 does have an ASMedia ASM1061 sata 4 ports.

 

I haven't made changes to this in forever...back to my original post...I did restore an old libvirt....and diagnostics were run with this old libvirt....could that be the issue?

Link to comment
17 minutes ago, goinsnoopin said:

however my motherboard ASRock Z97 Extreme6 does have an ASMedia ASM1061 sata 4 ports.

It has two 2-port Asmedia controllers, and one of them is being used by a VM. This can happen if hardware was added or removed in a way that changed the PCI assignments, e.g. 07:00.0 could have been the USB controller before.

 

You can also see the kernel driver in use for the 1st Asmedia controller, vfio-pci, confirming it's being used by QEMU.

 

07:00.0 SATA controller [0106]: ASMedia Technology Inc. ASM1062 Serial ATA Controller [1b21:0612] (rev 02)
    Subsystem: ASRock Incorporation Motherboard [1849:0612]
    Kernel driver in use: vfio-pci
    Kernel modules: ahci
08:00.0 USB controller [0c03]: Fresco Logic FL1100 USB 3.0 Host Controller [1b73:1100] (rev 10)
    Subsystem: Fresco Logic FL1100 USB 3.0 Host Controller [1b73:1100]
    Kernel driver in use: vfio-pci
09:00.0 SATA controller [0106]: ASMedia Technology Inc. ASM1062 Serial ATA Controller [1b21:0612] (rev 02)
    Subsystem: ASRock Incorporation Motherboard [1849:0612]
    Kernel driver in use: ahci
    Kernel modules: ahci
0a:00.0 USB controller [0c03]: ASMedia Technology Inc. ASM1042A USB 3.0 Host Controller [1b21:1142]
    Subsystem: ASRock Incorporation ASM1042A USB 3.0 Host Controller [1849:1142]
    Kernel driver in use: xhci_hcd
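
The same check can be scripted. Below is a rough Python sketch, assuming the text has the same layout as the listing above (a device header line followed by indented detail lines); it flags any SATA controller whose driver in use is vfio-pci. The function names are invented for this example.

```python
def drivers_in_use(lspci_output):
    """Map each PCI address to a (description, kernel driver in use) tuple."""
    devices = {}
    current = None
    for line in lspci_output.splitlines():
        if not line.strip():
            continue
        if not line[0].isspace():
            # Device header line: "<address> <class>: <description>"
            address, _, description = line.partition(" ")
            current = address
            devices[current] = [description, None]
        elif current and line.strip().startswith("Kernel driver in use:"):
            devices[current][1] = line.split(":", 1)[1].strip()
    return {addr: tuple(info) for addr, info in devices.items()}

def sata_on_vfio(lspci_output):
    """Return addresses of SATA controllers currently bound to vfio-pci."""
    return [
        addr
        for addr, (desc, driver) in drivers_in_use(lspci_output).items()
        if desc.startswith("SATA controller") and driver == "vfio-pci"
    ]

# The two SATA controllers from the listing above.
SAMPLE = """\
07:00.0 SATA controller [0106]: ASMedia Technology Inc. ASM1062 Serial ATA Controller [1b21:0612] (rev 02)
    Subsystem: ASRock Incorporation Motherboard [1849:0612]
    Kernel driver in use: vfio-pci
    Kernel modules: ahci
09:00.0 SATA controller [0106]: ASMedia Technology Inc. ASM1062 Serial ATA Controller [1b21:0612] (rev 02)
    Subsystem: ASRock Incorporation Motherboard [1849:0612]
    Kernel driver in use: ahci
    Kernel modules: ahci
"""
print(sata_on_vfio(SAMPLE))
```

Any address this returns is a disk controller the host has given up to a VM, so disks attached to it will drop from the array once that VM starts.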

 

Link to comment

I think my plan is to verify emulated disk3 by starting the array.   Then delete the libvirt image and create VMs from backup XML files....but I will double check PCI assignments, as hardware changes may have given the devices different addresses since the backup.  Then, if all looks good, I will rebuild disk3.
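
As a rough sketch of that double check: the PCI devices a backup XML would pass through can be listed before the VM is ever created. This assumes the backup is a standard libvirt domain XML with `<hostdev type='pci'>` entries; the function name and the sample XML are made up for illustration.

```python
import xml.etree.ElementTree as ET

def hostdev_pci_addresses(domain_xml):
    """List host PCI addresses (bus:slot.function) passed through by a domain XML."""
    root = ET.fromstring(domain_xml)
    found = []
    for hostdev in root.iter("hostdev"):
        if hostdev.get("type") != "pci":
            continue  # skip USB and other non-PCI host devices
        addr = hostdev.find("source/address")
        if addr is None:
            continue
        # libvirt writes these attributes as hex strings such as "0x07"
        bus = int(addr.get("bus", "0"), 16)
        slot = int(addr.get("slot", "0"), 16)
        func = int(addr.get("function", "0"), 16)
        found.append(f"{bus:02x}:{slot:02x}.{func:x}")
    return found

# Hypothetical, trimmed-down domain XML for illustration only.
SAMPLE_XML = """\
<domain type='kvm'>
  <name>example</name>
  <devices>
    <hostdev mode='subsystem' type='pci' managed='yes'>
      <source>
        <address domain='0x0000' bus='0x07' slot='0x00' function='0x0'/>
      </source>
    </hostdev>
  </devices>
</domain>
"""
print(hostdev_pci_addresses(SAMPLE_XML))
```

If an address printed here now belongs to a SATA controller rather than the GPU or USB card it was meant for, the XML needs fixing before the VM is started.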

 

Link to comment

Disk 3 emulated contents look fine.

 

Deleted libvirt.img, tried to start my original win10 VM and get an execution error.

 

Operation failed: unable to find any master var store for loader: /usr/share/qemu/ovmf-x64/OVMF_CODE-pure-efi.fd

 

Unassigned disk sdg no longer shows up.  I believe that was the drive that failed.  Libvirt was not on the cache drive; it was in a share called system, and looking at the emulated contents, this was on disk3.

 

Any suggestions...should I rebuild disk3 to get to a normal state then deal with VMs?

tower-diagnostics-20190506-1326.zip

Link to comment

I am not using Bobby_steam VM.   I removed all VM definitions in VM manager then created a new one called Kitchen PC.

However, you are right both drives dropped???

 

That forced disk3 to be unassigned on reboot.   Assume I should rebuild disk 3 and not touch any VM stuff until rebuild complete?

Edited by goinsnoopin
Link to comment

Don't understand that???  Don't think I attempted to start any other than Kitchen PC

 

At this point I want all VMs gone...will create new.  Is removing them from the webgui enough?  Are there any other fragments elsewhere?

 

Disk3 became unassigned in last reboot, so assume I should assign and rebuild data and not touch any VMs until rebuilt?

 

Link to comment

Quick step back:
Before the crash, libvirt.img was in a share that was allowed to be on any disk.  It looked like this data was on disk3.

Reading forums, most people had libvirt.img on the cache drive.  I then went to my cache drive and saw a libvirt folder, which contained a libvirt.img.   I went to VM settings and changed my path to point to the libvirt.img on the cache drive.   Once I did this, a bunch of old VMs like the Bobby steam VM showed up.   These VMs haven't been used by me in years.  They may have been set to autostart, but I can't tell, as today I deleted the libvirt.img on the cache drive and then removed all VMs from the webgui.  So the only VM in my webgui is called KitchenPC.  This is the one that won't start.   Do I have some corruption?  Maybe I should make the VM via the form instead of copying the backup XML?  I did review the template and it looked good; it passed the correct USB controller.

Should I rebuild disk 3 first?
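
On the autostart question, a minimal Python sketch: stock libvirt records autostart as entries under qemu/autostart in its config directory. The /etc/libvirt default below assumes that is where the active libvirt.img is mounted; that path and the function name are assumptions, not something confirmed in this thread.

```python
from pathlib import Path

def autostart_domains(libvirt_etc="/etc/libvirt"):
    """Return the names of VM definitions flagged for autostart.

    Assumes the standard libvirt layout, where each autostart VM has a
    "<name>.xml" entry (normally a symlink) in qemu/autostart.
    """
    autostart_dir = Path(libvirt_etc) / "qemu" / "autostart"
    if not autostart_dir.is_dir():
        return []
    return sorted(p.stem for p in autostart_dir.glob("*.xml"))

if __name__ == "__main__":
    for name in autostart_domains():
        print(name)
```

Run with the array started and the VM service stopped, this would show which definitions are set to start on their own, without opening the VM tab.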

 

 

Link to comment

Johnnie,

 

How is the libvirt.img created?  Is it getting VM information from the flash drive?  If I delete libvirt.img, then turn VMs back on it keeps auto creating my 6 old VMs and the Bobby steam is one of them with the wrong SATA card passthrough...and it's set to auto start.    

Any ideas on how to break this circle?  I have rebuilt disk3 two times now.   I even verified libvirt.img was gone via telnet session and checked each disk.   At one point tried to edit bobby_steam VM to change pci address to the USB card....hit update and it changed to updating and never completed.
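
The disk-by-disk check for a leftover libvirt.img can be sketched in Python as below. It assumes the usual Unraid layout of array disks at /mnt/disk1, /mnt/disk2, ... and the pool at /mnt/cache; the function name is invented for this example.

```python
from pathlib import Path

def find_libvirt_images(mnt_root="/mnt", name="libvirt.img"):
    """Search every array disk and cache mount under mnt_root for `name`."""
    root = Path(mnt_root)
    hits = []
    # "disk[0-9]*" matches disk1, disk2, ... but not /mnt/disks (unassigned devices).
    mounts = sorted(root.glob("disk[0-9]*")) + sorted(root.glob("cache*"))
    for mount in mounts:
        if mount.is_dir():
            hits.extend(sorted(p for p in mount.rglob(name) if p.is_file()))
    return hits
```

Calling `find_libvirt_images()` with no arguments walks the real mounts, which can take a while on large disks. An empty result means no copy of the image is left on the array or cache, so the old definitions would have to be coming from somewhere else.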

Link to comment

Johnnie,

 

I truly have tried both of those.  If I delete VMs(definitions only as I want to keep the disks)....they disappear from Unraid VM webgui, however on reboot they reappear and offending one auto starts....passing the incorrect SATA card then disabling the disk.

 

So I have used the PCi.ids to list the devices for passthrough.   Is there an opposite to that command to prevent this SATA controller from being passed to the VM while I figure this out?

 

At this point wondering if there is some corruption on flash drive from original hard crash... contemplating pulling USB flash and doing clean install.

Link to comment
