
Windows 10 VM randomly hangs under heavy load



This usually happens while compiling code.

The behavior is as follows:
- First, the compile hangs at a certain spot. The system is still responsive, but the CL processes just spin forever.
- In Task Manager, disk usage for the C drive is now stuck at 100%, though actual I/O is generally fairly low at this point.
- Around this time, Windows starts logging this in the event log, and it recurs frequently: "Reset to device, \Device\RaidPort2, was issued."
- Eventually, Visual Studio itself hangs, and the system becomes less and less responsive until it requires a manual restart.

You can't kill the stuck CL processes, so something's likely hung deep in the driver.

The VM has three disks (two on the emulated SCSI controller, plus an NVMe drive passed through as a PCI device):

    <disk type='file' device='disk'>
      <driver name='qemu' type='raw' cache='writeback' discard='unmap'/>
      <source file='/mnt/user/vms/Windows 10/vdisk1.img' index='2'/>
      <backingStore/>
      <target dev='hdc' bus='scsi'/>
      <boot order='1'/>
      <alias name='scsi0-0-0-2'/>
      <address type='drive' controller='0' bus='0' target='0' unit='2'/>
    </disk>
    <disk type='block' device='disk'>
      <driver name='qemu' type='raw' cache='writeback' discard='unmap'/>
      <source dev='/dev/disk/by-id/ata-Samsung_SSD_860_EVO_1TB_S3Z8NB0M305963H'/>
      <target dev='hdd' bus='scsi'/>
      <address type='drive' controller='0' bus='0' target='0' unit='3'/>
    </disk>
    <hostdev mode='subsystem' type='pci' managed='yes'>
      <driver name='vfio'/>
      <source>
        <address domain='0x0000' bus='0x41' slot='0x00' function='0x0'/>
      </source>
      <address type='pci' domain='0x0000' bus='0x07' slot='0x00' function='0x0'/>
    </hostdev>


The compile is happening on the NVMe drive that's passed through at the bottom, but the error points to one of the two drives above it. I suspect the first entry (the OS is installed on this one), given the error and the symptoms, as the middle drive is entirely idle.
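
To make the comparison between the three entries easier to see, here is a minimal Python sketch that summarizes them from the XML. It is only an illustration: the `<devices>` wrapper, the `DEVICES_XML` name, and the `summarize_devices` helper are my own additions, and the XML is a trimmed copy of the fragment above. It shows that both `<disk>` entries sit on the same emulated SCSI controller (controller 0, units 2 and 3) with the same cache settings, while the NVMe bypasses that controller entirely. It does not tell you which of the two `\Device\RaidPort2` refers to.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the device entries from the post, wrapped in a
# hypothetical <devices> root element so the fragment parses on its own.
DEVICES_XML = """
<devices>
  <disk type='file' device='disk'>
    <driver name='qemu' type='raw' cache='writeback' discard='unmap'/>
    <source file='/mnt/user/vms/Windows 10/vdisk1.img' index='2'/>
    <target dev='hdc' bus='scsi'/>
    <address type='drive' controller='0' bus='0' target='0' unit='2'/>
  </disk>
  <disk type='block' device='disk'>
    <driver name='qemu' type='raw' cache='writeback' discard='unmap'/>
    <source dev='/dev/disk/by-id/ata-Samsung_SSD_860_EVO_1TB_S3Z8NB0M305963H'/>
    <target dev='hdd' bus='scsi'/>
    <address type='drive' controller='0' bus='0' target='0' unit='3'/>
  </disk>
  <hostdev mode='subsystem' type='pci' managed='yes'>
    <driver name='vfio'/>
    <source>
      <address domain='0x0000' bus='0x41' slot='0x00' function='0x0'/>
    </source>
    <address type='pci' domain='0x0000' bus='0x07' slot='0x00' function='0x0'/>
  </hostdev>
</devices>
"""


def summarize_devices(xml_text):
    """Return one dict per storage device found in the XML fragment."""
    root = ET.fromstring(xml_text)
    rows = []
    # Emulated disks: record backing store, guest bus, cache mode, SCSI address.
    for disk in root.findall("disk"):
        source = disk.find("source")
        address = disk.find("address")
        rows.append({
            "kind": disk.get("type"),
            "backing": source.get("file") or source.get("dev"),
            "bus": disk.find("target").get("bus"),
            "cache": disk.find("driver").get("cache"),
            "controller": address.get("controller"),
            "unit": address.get("unit"),
        })
    # PCI passthrough devices have no QEMU disk driver or cache mode.
    for hostdev in root.findall("hostdev"):
        src = hostdev.find("source/address")
        rows.append({
            "kind": "pci-passthrough",
            "backing": "host PCI %s:%s.%s" % (
                src.get("bus"), src.get("slot"), src.get("function")),
            "bus": "pci",
            "cache": None,
            "controller": None,
            "unit": None,
        })
    return rows


if __name__ == "__main__":
    for row in summarize_devices(DEVICES_XML):
        print(row)
```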

Ideas? For now I've copied the image to a raw NVMe device which appears to work around the problem, but this is obviously less than ideal from a scaling perspective.

 

As a starting point, I ran memtest overnight and it came back clear.

 

Hardware:
AMD Threadripper 1950X
Asus ROG Zenith Extreme
LSI Logic SAS 9207-8i

Nothing in the Unraid logs (VM or system) corresponds to the event.

Edited by Spitko
more details