Windows VM excessively slow and random server crashes after upgrade from 6.8.3 to 6.9.1

Crimson Unraider · April 3, 2021

I have 2 Windows VMs, one on the cache drive and one on NVMe (followed Space Invaders's guide). I normally only use the NVMe one, but I left the cache one on for testing. Both worked fine in 6.8.3 and both are extremely slow after the upgrade. I noticed in Task Manager that "System Interrupts" was randomly using over 60% of my CPU on both VMs. It is set up as a gaming VM using an Nvidia 1660 Ti with 16 GB of RAM. I thought that the GPU passthrough may be the problem, so I tried VNC and it was still slow.

I noticed this error repeating and filling my log file while running the GPU Statistics plugin, so I uninstalled the plugin and the error stopped.

Quote

Apr 1 11:17:10 Crimson kernel: resource sanity check: requesting [mem 0x000c0000-0x000fffff], which spans more than PCI Bus 0000:00 [mem 0x000c0000-0x000dffff window]
Apr 1 11:17:10 Crimson kernel: caller _nv000712rm+0x1af/0x200 [nvidia] mapping multiple BARs
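
When one message floods the log like this, a quick tally shows which lines dominate. The following is a minimal sketch, assuming the syslog prefix layout seen in the quoted lines (month, day, time, hostname); the function names `tally_messages` and `top_repeats` are made up for illustration and are not part of Unraid or any plugin.

```python
import re
from collections import Counter

# Matches a syslog prefix such as "Apr 1 11:17:10 Crimson ".
# Assumption: every line starts with month, day, time and hostname.
PREFIX = re.compile(r"^\w{3}\s+\d+\s+\d\d:\d\d:\d\d\s+\S+\s+")


def tally_messages(lines):
    """Count identical messages after stripping timestamp and hostname."""
    counts = Counter()
    for line in lines:
        message = PREFIX.sub("", line.strip())
        if message:
            counts[message] += 1
    return counts


def top_repeats(lines, n=5):
    """Return the n most frequent messages as (message, count) pairs."""
    return tally_messages(lines).most_common(n)
```

Passing the lines of a saved syslog file to `top_repeats` would list the most repeated messages first, which makes a flooding source such as the `[nvidia] mapping multiple BARs` line easy to spot.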

I wasn't able to get the diagnostics after the random crashes but I did get the attached after running a VM.

My system log shows these errors after I start the VM (the errors repeat until I shut down the VM).

Quote

Apr 3 15:29:10 Crimson smbd[21858]: [2021/04/03 15:29:10.766144, 0] ../../lib/param/loadparm.c:801(lpcfg_map_parameter)
Apr 3 15:29:10 Crimson smbd[21858]: Unknown parameter encountered: "hide file"
Apr 3 15:29:10 Crimson smbd[21858]: [2021/04/03 15:29:10.766416, 0] ../../lib/param/loadparm.c:1841(lpcfg_do_global_parameter)
Apr 3 15:29:10 Crimson smbd[21858]: Ignoring unknown parameter "hide file"

I checked my BIOS and it is the newest version. HVM and IOMMU are enabled.

M/B: Gigabyte Technology Co., Ltd. X399 AORUS PRO-CF Version Default string

BIOS: American Megatrends Inc. Version F2. Dated: 12/11/2019

CPU: AMD Ryzen Threadripper 2950X 16-Core @ 3500 MHz

Memory: 128 GiB DDR4 (max. installable capacity 512 GiB)

crimson-diagnostics-20210403-1538.zip

John_M · April 3, 2021

23 minutes ago, Crimson Unraider said:

I thought that the GPU passthrough may be the problem

You can't pass through a GPU that's bound to a driver. Uninstall the Nvidia driver.

23 minutes ago, Crimson Unraider said:

I noticed this error while running GPU statistics plugin repeating and filling my log file

See this post and subsequent followups:

Edited April 3, 2021 by John_M

Crimson Unraider · April 3, 2021

I have 2 GPUs. I use my GTX 1070 with Plex and Emby. Can I stub the 1660 in vfio for use with my VM and keep the Nvidia driver for the 1070?

John_M · April 3, 2021

Yes.
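
To confirm which driver each card ends up bound to, Linux exposes a `driver` symlink for every PCI device under `/sys/bus/pci/devices`. The following is a minimal sketch of reading it; the function name `bound_driver` and the PCI address in the comments are hypothetical, so substitute the addresses your own system reports.

```python
import os


def bound_driver(pci_address, sysfs_root="/sys/bus/pci/devices"):
    """Return the name of the kernel driver bound to a PCI device, or None."""
    link = os.path.join(sysfs_root, pci_address, "driver")
    if not os.path.islink(link):
        return None  # no driver is currently bound to this device
    # The symlink points at the driver's directory; its last component is the name.
    return os.path.basename(os.readlink(link))


# Example (hypothetical address): a stubbed card should report "vfio-pci",
# while a card left to the host driver should report "nvidia".
# print(bound_driver("0000:0a:00.0"))
```

Under these assumptions, the stubbed 1660 should show `vfio-pci` and the 1070 should still show `nvidia`.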

Crimson Unraider · April 3, 2021

So I stubbed the 1660 and the Sanity check errors stopped.

Quote

Apr 1 11:17:10 Crimson kernel: resource sanity check: requesting [mem 0x000c0000-0x000fffff], which spans more than PCI Bus 0000:00 [mem 0x000c0000-0x000dffff window]
Apr 1 11:17:10 Crimson kernel: caller _nv000712rm+0x1af/0x200 [nvidia] mapping multiple BARs

But it is still really slow and locking up. I think it might be network related: Task Manager keeps showing excessive "System Interrupts" CPU usage when the freezing happens. I googled it, and "System Interrupts" usually means a hardware issue; most say it is likely caused by a NIC or an external device. The only things plugged in are a USB keyboard/mouse and an Xbox controller wireless adapter. When I disable the Windows network adapter I see fewer system interrupts, but now I can't play most of my games.

I think I need to walk away for the night.

John_M · April 4, 2021

1 hour ago, Crimson Unraider said:

I think I need to walk away for the night.

Taking a break from a problem often helps. Meanwhile the diagnostics you posted earlier reveal that SSD KINGSTON SHSS37A480G (part of your cache pool) has cable problems. That is obviously affecting cache operation. I missed it earlier in the general noise. Shut down and check/replace the SATA cable and also check the power cable while you're there. Then power up and start the array, then post new diagnostics, which should be a bit tidier and easier to read.
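
The post does not say which part of the diagnostics pointed to a cable fault. One common indicator, assumed here, is SMART attribute 199 (`UDMA_CRC_Error_Count`), whose raw value tends to climb when a SATA link is unreliable. The following is a minimal sketch that pulls that value out of `smartctl -A` style text; the function name `udma_crc_errors` is hypothetical, and some drives may name or number the attribute differently.

```python
def udma_crc_errors(smart_attributes_text):
    """Return the raw value of SMART attribute 199, or None if it is absent.

    Assumes the usual `smartctl -A` table layout, where each attribute row
    starts with the numeric ID and the raw value is the 10th column.
    """
    for line in smart_attributes_text.splitlines():
        fields = line.split()
        if len(fields) >= 10 and fields[0] == "199":
            raw = fields[9]
            return int(raw) if raw.isdigit() else None
    return None
```

With several drives of the same model, running this against each drive's SMART report and matching on the serial number is one way to tell them apart.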

Crimson Unraider · April 5, 2021

On 4/3/2021 at 9:13 PM, John_M said:

SHSS37A480G

John, I'm ready to get started on this again. I'm trying to find which drive has the cable problem; I have 4 of those drives in the pool. I'm pretty new to Unraid and I'm just curious where you found that.

Crimson Unraider · April 5, 2021

OK, I changed the cables on all 4 of the Kingston SSDs; see the attached diagnostics after boot. My SSDs are all in an Icy Dock. I checked the power connectors, but there are only two feeding the 6 SSDs. Also, while I had it open I added another NVMe drive for the second cache.

crimson-diagnostics-20210405-0713.zip

Crimson Unraider · April 5, 2021

So, another update: the NVMe drive I put in was reporting 57 degrees C. When I clicked on it to see the info page, the server turned off. I removed the NVMe drive and restarted. The system started a parity check due to the unclean shutdown, but I stopped it until after troubleshooting; I don't want it to crash in the middle of a parity check. When I pulled the NVMe drive it was warm to the touch but not hot. When I removed it, I put the Samsung SSD in the main cache pool. I had moved it earlier to try to separate libvirt from the other traffic when the slowdown first started; that didn't help, so I went back to one pool. I also noticed that it is taking more than 5 min to boot up now.

crimson-diagnostics-20210405-0846.zip

Crimson Unraider · April 5, 2021

It just shut down again

Crimson Unraider · April 5, 2021

I opened the log as soon as I could at boot and captured this before it shut off again.

Quote

Apr 5 09:54:47 Crimson kernel: eth0: renamed from vethc78c9d9
Apr 5 09:54:47 Crimson kernel: IPv6: ADDRCONF(NETDEV_CHANGE): vethda3a3eb: link becomes ready
Apr 5 09:54:47 Crimson kernel: docker0: port 3(vethda3a3eb) entered blocking state
Apr 5 09:54:47 Crimson kernel: docker0: port 3(vethda3a3eb) entered forwarding state
Apr 5 09:54:48 Crimson rc.docker: mariadb: started succesfully!
Apr 5 09:54:48 Crimson kernel: br-ee7cefde1519: port 7(vethde7b5a6) entered blocking state
Apr 5 09:54:48 Crimson kernel: br-ee7cefde1519: port 7(vethde7b5a6) entered disabled state
Apr 5 09:54:48 Crimson kernel: device vethde7b5a6 entered promiscuous mode
Apr 5 09:54:49 Crimson avahi-daemon[7523]: Joining mDNS multicast group on interface vethda3a3eb.IPv6 with address fe80::80d9:41ff:fec1:6cd6.
Apr 5 09:54:49 Crimson avahi-daemon[7523]: New relevant interface vethda3a3eb.IPv6 for mDNS.
Apr 5 09:54:49 Crimson avahi-daemon[7523]: Registering new address record for fe80::80d9:41ff:fec1:6cd6 on vethda3a3eb.*.
Apr 5 09:54:50 Crimson kernel: eth0: renamed from vethb3f1586
Apr 5 09:54:50 Crimson kernel: IPv6: ADDRCONF(NETDEV_CHANGE): vethde7b5a6: link becomes ready
Apr 5 09:54:50 Crimson kernel: br-ee7cefde1519: port 7(vethde7b5a6) entered blocking state
Apr 5 09:54:50 Crimson kernel: br-ee7cefde1519: port 7(vethde7b5a6) entered forwarding state
Apr 5 09:54:51 Crimson rc.docker: Collabora: started succesfully!
Apr 5 09:54:52 Crimson avahi-daemon[7523]: Joining mDNS multicast group on interface vethde7b5a6.IPv6 with address fe80::50af:fcff:fe86:dbc3.
Apr 5 09:54:52 Crimson avahi-daemon[7523]: New relevant interface vethde7b5a6.IPv6 for mDNS.
Apr 5 09:54:52 Crimson avahi-daemon[7523]: Registering new address record for fe80::50af:fcff:fe86:dbc3 on vethde7b5a6.*.
Apr 5 09:54:52 Crimson kernel: br-ee7cefde1519: port 8(veth9c6caec) entered blocking state
Apr 5 09:54:52 Crimson kernel: br-ee7cefde1519: port 8(veth9c6caec) entered disabled state
Apr 5 09:54:52 Crimson kernel: device veth9c6caec entered promiscuous mode
Apr 5 09:54:53 Crimson kernel: eth0: renamed from vethcb0c948
Apr 5 09:54:54 Crimson kernel: IPv6: ADDRCONF(NETDEV_CHANGE): veth9c6caec: link becomes ready
Apr 5 09:54:54 Crimson kernel: br-ee7cefde1519: port 8(veth9c6caec) entered blocking state
Apr 5 09:54:54 Crimson kernel: br-ee7cefde1519: port 8(veth9c6caec) entered forwarding state
Apr 5 09:54:54 Crimson rc.docker: nextcloud: started succesfully!
Apr 5 09:54:55 Crimson avahi-daemon[7523]: Joining mDNS multicast group on interface veth9c6caec.IPv6 with address fe80::c0b7:eff:fe8a:7434.
Apr 5 09:54:55 Crimson avahi-daemon[7523]: New relevant interface veth9c6caec.IPv6 for mDNS.
Apr 5 09:54:55 Crimson avahi-daemon[7523]: Registering new address record for fe80::c0b7:eff:fe8a:7434 on veth9c6caec.*.
Apr 5 09:55:00 Crimson kernel: ffdetect[35846]: segfault at 38 ip 00000000004038da sp 00007ffe2d64a9c0 error 4 in ffdetect[400000+14000]
Apr 5 09:55:00 Crimson kernel: Code: cc 34 21 00 41 0f b6 6d 00 40 84 ed 75 b7 48 8b 34 24 48 8d 3d 3c a2 00 00 31 c0 ff 15 6f 33 21 00 48 89 df ff 15 8e 34 21 00 <41> 0f b6 2c 24 40 84 ed 0f 84 93 00 00 00 4c 8d 35 a1 a5 00 00 eb
Apr 5 09:55:00 Crimson kernel: ffdetect[35943]: segfault at 38 ip 00000000004038da sp 00007ffc7ba23c90 error 4 in ffdetect[400000+14000]
Apr 5 09:55:00 Crimson kernel: Code: cc 34 21 00 41 0f b6 6d 00 40 84 ed 75 b7 48 8b 34 24 48 8d 3d 3c a2 00 00 31 c0 ff 15 6f 33 21 00 48 89 df ff 15 8e 34 21 00 <41> 0f b6 2c 24 40 84 ed 0f 84 93 00 00 00 4c 8d 35 a1 a5 00 00 eb
Apr 5 09:55:00 Crimson kernel: resource sanity check: requesting [mem 0x000c0000-0x000fffff], which spans more than PCI Bus 0000:00 [mem 0x000c0000-0x000dffff window]
Apr 5 09:55:00 Crimson kernel: caller _nv000712rm+0x1af/0x200 [nvidia] mapping multiple BARs
Apr 5 09:55:08 Crimson kernel: BTRFS info (device sdj1): found 8170 extents, stage: update data pointers
Apr 5 09:55:20 Crimson kernel: BTRFS info (device sdj1): relocating block group 12291401449472 flags data|raid1
Apr 5 09:55:37 Crimson kernel: BTRFS info (device sdj1): found 9604 extents, stage: move data extents
Apr 5 09:55:59 Crimson kernel: BTRFS info (device sdj1): found 9603 extents, stage: update data pointers
Apr 5 09:56:00 Crimson root: Fix Common Problems Version 2021.04.02
Apr 5 09:56:12 Crimson kernel: BTRFS info (device sdj1): relocating block group 12290327707648 flags data|raid1
Apr 5 09:56:31 Crimson kernel: BTRFS info (device sdj1): found 9393 extents, stage: move data extents
Apr 5 09:56:49 Crimson kernel: BTRFS info (device sdj1): found 9392 extents, stage: update data pointers
Apr 5 09:56:58 Crimson kernel: BTRFS info (device sdj1): relocating block group 12289253965824 flags data|raid1
Apr 5 09:57:14 Crimson kernel: BTRFS info (device sdj1): found 8649 extents, stage: move data extents
Apr 5 09:57:29 Crimson kernel: BTRFS info (device sdj1): found 8649 extents, stage: update data pointers
Apr 5 09:57:40 Crimson kernel: BTRFS info (device sdj1): relocating block group 12288180224000 flags data|raid1
Apr 5 09:57:56 Crimson kernel: BTRFS info (device sdj1): found 8615 extents, stage: move data extents

Crimson Unraider · April 5, 2021

I found the reason for the shutdowns. My CPU cooler has failed and my CPU is overheating. I brought the PC up in the BIOS, and after about 10 min it shut down. I noticed the CPU temp was 93 °C.
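
For anyone who wants to watch temperatures from the host rather than the BIOS, the Linux hwmon interface exposes sensor readings as `temp*_input` files in millidegrees Celsius. The following is a minimal sketch, assuming the sensors are exposed under `/sys/class/hwmon`; the function names and the 90 °C limit are hypothetical choices for illustration, not values from this thread's diagnostics.

```python
import glob
import os


def read_temperatures(hwmon_root="/sys/class/hwmon"):
    """Return {"hwmonN/tempM": degrees_celsius} for every readable sensor."""
    readings = {}
    pattern = os.path.join(hwmon_root, "hwmon*", "temp*_input")
    for path in sorted(glob.glob(pattern)):
        try:
            with open(path) as f:
                millidegrees = int(f.read().strip())
        except (OSError, ValueError):
            continue  # sensor missing or unreadable, so skip it
        chip = os.path.basename(os.path.dirname(path))
        sensor = os.path.basename(path)[: -len("_input")]
        readings[f"{chip}/{sensor}"] = millidegrees / 1000.0
    return readings


def too_hot(readings, limit_c=90.0):
    """Return only the sensors at or above limit_c (a hypothetical threshold)."""
    return {name: temp for name, temp in readings.items() if temp >= limit_c}
```

Calling `too_hot(read_temperatures())` periodically would flag a reading like the 93 °C seen here before the board forces a shutdown.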

John_M · April 5, 2021

4 minutes ago, Crimson Unraider said:

My CPU cooler has failed and my CPU is overheating.

OK, that would explain it! Replacing the cooler is now the priority. Your diagnostics from 0713 look much cleaner.

Crimson Unraider · April 13, 2021

John_M, thanks for the help. All my problems were linked to the failing CPU cooler. I had to wait on parts, but I went all in on a water cooling kit and now I'm averaging 37 °C and all is running fine. 🤪
