
VM crash, cache disk unmountable


Stu


Hi all,

 

Having recently found the time to dip my toes into the world of VMs and GPU passthrough, I've come across a rather serious issue despite some success.

 

Twice now I have had the server crash, requiring a physical reboot. Once Unraid has booted, it reports the cache disk as unmountable, something to do with a bad partition layout. The only way I've been able to recover is by formatting the cache disk and restoring from what backups I had.
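For reference, if it happens again I'll try some read-only checks before reaching for the format button. A rough sketch, assuming the cache shows up as /dev/sdb (the device name and filesystem are guesses on my part):

fdisk -l /dev/sdb                         # does the partition table still look sane?
blkid /dev/sdb1                           # is a filesystem signature still detected?
# For an XFS cache:   xfs_repair -n /dev/sdb1          (dry run, makes no changes)
# For a btrfs cache:  btrfs check --readonly /dev/sdb1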

 

The first crash occurred with my Win 10 VM, while running Geekbench 5.

The second crash occurred with a Linux VM, ElementaryOS, that I was trying to force-shutdown to validate that the AMD reset bug scripts I had set up worked.

 

Diagnostics are attached, pulled after rebooting following the second crash. Unfortunately, with the crashes I did lose the VM XML templates.

 

I'm wary of proceeding further as each crash does take me a while to recover from. Appreciate any insight!

tower-diagnostics-20200224-2220.zip

3 hours ago, johnnie.black said:

do you have another SSD you could test with?

Thanks for replying!

 

Unfortunately I don't, although a larger cache drive has been on my wishlist for a little while and could be on the cards in the next month or two. 

 

That said, I'm not convinced of a fault with the SSD, as I have not experienced any problems with it under any other circumstances.

 

My hope is that I'm misconfiguring something somewhere that might be identifiable from the diagnostics attached to my original post.

 

My only other thought is that maybe my PSU is not keeping up; it is only a cheap 550W unit, but I'm not sure how to check for that.
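One thing I might try is watching the voltage rails while a VM is under load; a rough sketch, assuming lm-sensors is installed and the board actually exposes its rails (many consumer boards don't, so this is best-effort at most):

# Sample sensor readings every 10 seconds while the VM is busy,
# looking for sagging 12V/5V/3.3V lines.
for i in $(seq 1 30); do
    date
    sensors | grep -iE 'in[0-9]|vcore|12v|5v|3\.3v'
    sleep 10
done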

 

If nothing else I may have to try to set up logging so that it survives the crash, and see if those results shed any light.
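As a crude sketch of what I mean, something like this would copy the in-RAM syslog to the flash drive on a loop so a recent copy survives a hard crash (paths assume the stock Unraid layout, with /var/log in RAM and the USB stick mounted at /boot; I gather newer Unraid versions also have a built-in syslog server / mirror-to-flash option, which would be the cleaner route):

#!/bin/bash
# Periodically copy the in-memory syslog to persistent storage on the flash drive.
mkdir -p /boot/logs
while true; do
    cp /var/log/syslog /boot/logs/syslog-latest.txt
    sleep 60
done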

 

 


I managed to trigger the VM instability/crash. This time the array was impacted rather than the cache disk.

I've attached the syslog, which I had to truncate as it was >300MB due to repeated "Disk read errors".

 

It occurs immediately after I run a user script to work around the AMD reset bug:

 

The syslog snippet:

Feb 27 15:50:07 Tower kernel: iommu ivhd0: Event logged [IOTLB_INV_TIMEOUT device=29:00.0 address=0x000000040e06b4f0]
Feb 27 15:50:07 Tower kernel: iommu: Removing device 0000:29:00.0 from group 15
Feb 27 15:50:07 Tower kernel: iommu: Removing device 0000:29:00.1 from group 15
Feb 27 15:50:07 Tower kernel: PM: suspend entry (deep)
Feb 27 15:50:17 Tower kernel: PM: Syncing filesystems ...
Feb 27 15:50:17 Tower kernel: ata2.00: exception Emask 0x0 SAct 0x7800000f SErr 0x0 action 0x6 frozen
Feb 27 15:50:17 Tower kernel: ata2.00: failed command: READ FPDMA QUEUED
Feb 27 15:50:17 Tower kernel: ata2.00: cmd 60/70:00:28:76:b2/01:00:99:00:00/40 tag 0 ncq dma 188416 in
Feb 27 15:50:17 Tower kernel:         res 40/00:ff:ff:00:00/00:00:00:00:00/40 Emask 0x4 (timeout)
Feb 27 15:50:17 Tower kernel: ata2.00: status: { DRDY }

And then the errors cascade from there (see the attached syslog).

The script:

#!/bin/bash
#
# Remove the GPU and its HDMI audio function from the PCI bus, suspend the
# host to RAM (the reset happens on wake), then rescan the bus to bring the
# card back in a clean state.
echo "disconnecting amd graphics"
echo "1" | tee -a /sys/bus/pci/devices/0000\:29\:00.0/remove
echo "disconnecting amd sound counterpart"
echo "1" | tee -a /sys/bus/pci/devices/0000\:29\:00.1/remove
echo "entered suspended state, press power button to continue"
echo -n mem > /sys/power/state
echo "reconnecting amd gpu and sound counterpart"
echo "1" | tee -a /sys/bus/pci/rescan
echo "AMD graphics card successfully reset"

 

After rebooting, disk 3 was marked as having issues.

I removed it from the array and put it back in, then kicked off a parity check, which rebuilt it and completed without error.
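If it happens again I'll also pull the SMART data before trusting the disk, something along these lines (assuming disk 3 shows up as /dev/sdd, which is a guess):

smartctl -a /dev/sdd | grep -iE 'reallocated|pending|uncorrect|crc'   # key health attributes
smartctl -t short /dev/sdd                                            # optional short self-test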

 

Current diagnostics attached too.

 

For now I'm going to avoid using that script and accept that a full reboot is necessary when that bug occurs.

 

syslog.zip tower-diagnostics-20200228-1303.zip


The important part is just above what you posted:

Feb 27 15:49:48 Tower kernel: iommu ivhd0: Event logged [IOTLB_INV_TIMEOUT device=29:00.0 address=0x000000040e06b3f0]
Feb 27 15:49:48 Tower kernel: ahci 0000:03:00.1: Event logged [IO_PAGE_FAULT domain=0x0000 address=0x00000000c9000000 flags=0x0050]
Feb 27 15:49:48 Tower kernel: ahci 0000:03:00.1: Event logged [IO_PAGE_FAULT domain=0x0000 address=0x00000000c9000680 flags=0x0050]

 

03:00.1 is the SATA controller. It stopped responding, so all devices connected to it will have issues. I've seen this before on other AMD-based servers; a BIOS update might help.
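If you want to confirm which disks sit behind that controller, something like this should show it (the address is taken from your log; the rest assumes a standard sysfs layout):

lspci -s 03:00.1                          # identify the controller (AHCI SATA)
ls -l /sys/block | grep '0000:03:00.1'    # block devices whose path runs through it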


Thanks for pointing that out. I have since found a couple of other threads with similar issues that you've assisted with.

 

The BIOS is already the latest (2019-12-03), and the issue seems to have existed for a while, so I'll just keep going as I am. Hopefully it doesn't occur too frequently under normal usage; otherwise I might have to do away with my virtualisation plans.
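For anyone else wanting to double-check, the running BIOS version and date can be confirmed from the shell, assuming dmidecode is available (which I believe it is on Unraid):

dmidecode -s bios-version
dmidecode -s bios-release-date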

 

Thanks again.

 

 

