Windows VM excessively slow and random server crashes after upgrade from 6.8.3 to 6.9.1



I have 2 Windows VMs, one on the cache drive and one on an NVMe (followed Space Invader's guide). I normally only use the NVMe one, but I left the cache one on for testing. Both worked fine in 6.8.3 and both are extremely slow after the upgrade. I noticed in Task Manager that "System Interrupts" was randomly using over 60% of my CPU on both VMs. It is set up as a gaming VM using an Nvidia 1660 Ti with 16 GB of RAM. I thought that the GPU passthrough may be the problem so I tried VNC and it was still slow.

 

I noticed this error while running GPU statistics plugin repeating and filling my log file so I uninstalled the plugin and the error stopped.

 

Quote

Apr 1 11:17:10 Crimson kernel: resource sanity check: requesting [mem 0x000c0000-0x000fffff], which spans more than PCI Bus 0000:00 [mem 0x000c0000-0x000dffff window]
Apr 1 11:17:10 Crimson kernel: caller _nv000712rm+0x1af/0x200 [nvidia] mapping multiple BARs
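
For anyone who wants to see which messages are flooding the log before uninstalling anything, here is a rough Python sketch that counts repeated syslog messages. It assumes classic syslog lines in the same layout as the quote above; the function name `count_repeated_messages` and the `minimum` parameter are made up for illustration.

```python
import re
from collections import Counter

# Matches classic syslog lines such as
# "Apr 1 11:17:10 Crimson kernel: resource sanity check: ..."
SYSLOG_LINE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+(?P<host>\S+)\s+"
    r"(?P<source>[^:\s]+):\s(?P<message>.*)$"
)


def count_repeated_messages(lines, minimum=2):
    """Return (source, message, count) for messages seen at least `minimum` times.

    Hypothetical helper: timestamps are ignored, so the same message logged
    at different times is counted as a repeat.
    """
    counts = Counter()
    for line in lines:
        match = SYSLOG_LINE.match(line.strip())
        if match:
            counts[(match["source"], match["message"])] += 1
    return [
        (source, message, n)
        for (source, message), n in counts.most_common()
        if n >= minimum
    ]
```

A possible way to use it, assuming the live log is at `/var/log/syslog`: open that file, pass the file object to `count_repeated_messages`, and print the first few tuples. Sources that include a PID (for example `smbd[21858]`) are counted per PID, which may or may not be what you want.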

 

I wasn't able to get the diagnostics after the random crashes but I did get the attached after running a VM.

 

My system log shows these errors after I start the VM (the errors repeat until I shut down the VM).

 

Quote

Apr 3 15:29:10 Crimson smbd[21858]: [2021/04/03 15:29:10.766144, 0] ../../lib/param/loadparm.c:801(lpcfg_map_parameter)
Apr 3 15:29:10 Crimson smbd[21858]: Unknown parameter encountered: "hide file"
Apr 3 15:29:10 Crimson smbd[21858]: [2021/04/03 15:29:10.766416, 0] ../../lib/param/loadparm.c:1841(lpcfg_do_global_parameter)
Apr 3 15:29:10 Crimson smbd[21858]: Ignoring unknown parameter "hide file"


 

I checked my BIOS and it is the newest version; HVM and IOMMU are enabled.

 

M/B: Gigabyte Technology Co., Ltd. X399 AORUS PRO-CF Version Default string

BIOS: American Megatrends Inc. Version F2. Dated: 12/11/2019

CPU: AMD Ryzen Threadripper 2950X 16-Core @ 3500 MHz

Memory: 128 GiB DDR4 (max. installable capacity 512 GiB)

crimson-diagnostics-20210403-1538.zip

Link to comment
23 minutes ago, Crimson Unraider said:

I thought that the GPU passthrough may be the problem

 

You can't pass through a GPU that's bound to a driver. Uninstall the Nvidia driver.
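
If you want to confirm what the card is currently bound to, a minimal sketch along these lines should work on a Linux host. It reads the `driver` symlink that sysfs exposes for each PCI device. The function names and the example address `0000:0a:00.0` are placeholders, not values taken from the diagnostics, and treating "unbound or vfio-pci" as ready for passthrough is my assumption about a typical setup.

```python
import os
from pathlib import Path


def bound_driver(pci_address, sysfs_root="/sys/bus/pci/devices"):
    """Return the kernel driver bound to a PCI device, or None if unbound.

    `pci_address` is the full address, e.g. "0000:0a:00.0" (placeholder).
    """
    driver_link = Path(sysfs_root) / pci_address / "driver"
    if not driver_link.is_symlink():
        return None  # no driver symlink means nothing has claimed the device
    # The symlink target ends in the driver name, e.g. ".../drivers/nvidia".
    return os.path.basename(os.readlink(driver_link))


def ready_for_passthrough(pci_address, sysfs_root="/sys/bus/pci/devices"):
    """Assumption: a device is ready when it is unbound or held by vfio-pci."""
    return bound_driver(pci_address, sysfs_root) in (None, "vfio-pci")
```

Calling `bound_driver("0000:0a:00.0")` with the GPU's real address would be expected to return `"nvidia"` while the host driver holds the card, and `"vfio-pci"` once the device has been stubbed.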

 

23 minutes ago, Crimson Unraider said:

I noticed this error while running GPU statistics plugin repeating and filling my log file

 

See this post and subsequent followups:

 

 

 

Edited by John_M
Link to comment

So I stubbed the 1660 and the sanity check errors stopped.

 

Quote

Apr 1 11:17:10 Crimson kernel: resource sanity check: requesting [mem 0x000c0000-0x000fffff], which spans more than PCI Bus 0000:00 [mem 0x000c0000-0x000dffff window]
Apr 1 11:17:10 Crimson kernel: caller _nv000712rm+0x1af/0x200 [nvidia] mapping multiple BARs

 

But it is still really slow and locking up. I think it might be network related: Task Manager keeps showing excessive CPU usage from "System Interrupts" when the freezing happens. I googled it, and "System Interrupts" usually means a hardware issue; most say it is likely caused by a NIC or an external device. The only things plugged in are a USB keyboard/mouse and an Xbox controller wireless adapter. When I disable the Windows network adapter I see fewer system interrupts, but then I can't play most of my games.

 

I think I need to walk away for the night. 

Link to comment
1 hour ago, Crimson Unraider said:

I think I need to walk away for the night.

 

Taking a break from a problem often helps. Meanwhile the diagnostics you posted earlier reveal that SSD KINGSTON SHSS37A480G (part of your cache pool) has cable problems. That is obviously affecting cache operation. I missed it earlier in the general noise. Shut down and check/replace the SATA cable and also check the power cable while you're there. Then power up and start the array, then post new diagnostics, which should be a bit tidier and easier to read.
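
To check whether a new cable has helped, one option is to watch the drive's CRC error counter. The sketch below assumes the cable problem shows up as SMART attribute 199 (commonly named UDMA_CRC_Error_Count); that is the usual indicator, but it is an assumption here, and some vendors use the ID differently. The helper names are made up, and `smartctl -A` needs root.

```python
import subprocess

# Assumption: attribute 199 is the interface CRC error counter on this drive.
CRC_ATTRIBUTE_ID = 199


def crc_error_count(smartctl_output):
    """Extract the raw value of SMART attribute 199 from `smartctl -A` text.

    Returns None if the attribute is missing or its raw value is not a
    plain integer.
    """
    for line in smartctl_output.splitlines():
        fields = line.split()
        # Attribute rows have at least 10 columns and start with the numeric ID.
        if len(fields) >= 10 and fields[0] == str(CRC_ATTRIBUTE_ID):
            try:
                return int(fields[9])
            except ValueError:
                return None
    return None


def read_crc_errors(device):
    """Run `smartctl -A <device>` and return the CRC error count, or None."""
    result = subprocess.run(
        ["smartctl", "-A", device], capture_output=True, text=True, check=False
    )
    return crc_error_count(result.stdout)
```

The raw count never resets, so what matters is whether it keeps rising. Note the value from `read_crc_errors("/dev/sdX")` (device name is a placeholder) before and after reseating the cable; if it stays flat afterwards, the cable was the likely culprit.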

 

Link to comment

So, another update: the NVMe I put in was reporting 57 degrees C. When I clicked on it to see the info page, the server turned off. I removed the NVMe and restarted. The system started a parity check due to the unclean shutdown, but I stopped it until after troubleshooting; I don't want it to crash in the middle of a parity check. When I pulled the NVMe it was warm to the touch but not hot. When I removed the NVMe I put the Samsung SSD in the main cache pool. I had moved it earlier to try to separate libvirt from the other traffic when the slowdown first started; that didn't help, so I went back to one pool. I also noticed that it is taking more than 5 minutes to boot up now.

crimson-diagnostics-20210405-0846.zip

Link to comment

I opened the log as soon as I could at boot and captured this before it shut off again.

 

Quote

Apr 5 09:54:47 Crimson kernel: eth0: renamed from vethc78c9d9
Apr 5 09:54:47 Crimson kernel: IPv6: ADDRCONF(NETDEV_CHANGE): vethda3a3eb: link becomes ready
Apr 5 09:54:47 Crimson kernel: docker0: port 3(vethda3a3eb) entered blocking state
Apr 5 09:54:47 Crimson kernel: docker0: port 3(vethda3a3eb) entered forwarding state
Apr 5 09:54:48 Crimson rc.docker: mariadb: started succesfully!
Apr 5 09:54:48 Crimson kernel: br-ee7cefde1519: port 7(vethde7b5a6) entered blocking state
Apr 5 09:54:48 Crimson kernel: br-ee7cefde1519: port 7(vethde7b5a6) entered disabled state
Apr 5 09:54:48 Crimson kernel: device vethde7b5a6 entered promiscuous mode
Apr 5 09:54:49 Crimson avahi-daemon[7523]: Joining mDNS multicast group on interface vethda3a3eb.IPv6 with address fe80::80d9:41ff:fec1:6cd6.
Apr 5 09:54:49 Crimson avahi-daemon[7523]: New relevant interface vethda3a3eb.IPv6 for mDNS.
Apr 5 09:54:49 Crimson avahi-daemon[7523]: Registering new address record for fe80::80d9:41ff:fec1:6cd6 on vethda3a3eb.*.
Apr 5 09:54:50 Crimson kernel: eth0: renamed from vethb3f1586
Apr 5 09:54:50 Crimson kernel: IPv6: ADDRCONF(NETDEV_CHANGE): vethde7b5a6: link becomes ready
Apr 5 09:54:50 Crimson kernel: br-ee7cefde1519: port 7(vethde7b5a6) entered blocking state
Apr 5 09:54:50 Crimson kernel: br-ee7cefde1519: port 7(vethde7b5a6) entered forwarding state
Apr 5 09:54:51 Crimson rc.docker: Collabora: started succesfully!
Apr 5 09:54:52 Crimson avahi-daemon[7523]: Joining mDNS multicast group on interface vethde7b5a6.IPv6 with address fe80::50af:fcff:fe86:dbc3.
Apr 5 09:54:52 Crimson avahi-daemon[7523]: New relevant interface vethde7b5a6.IPv6 for mDNS.
Apr 5 09:54:52 Crimson avahi-daemon[7523]: Registering new address record for fe80::50af:fcff:fe86:dbc3 on vethde7b5a6.*.
Apr 5 09:54:52 Crimson kernel: br-ee7cefde1519: port 8(veth9c6caec) entered blocking state
Apr 5 09:54:52 Crimson kernel: br-ee7cefde1519: port 8(veth9c6caec) entered disabled state
Apr 5 09:54:52 Crimson kernel: device veth9c6caec entered promiscuous mode
Apr 5 09:54:53 Crimson kernel: eth0: renamed from vethcb0c948
Apr 5 09:54:54 Crimson kernel: IPv6: ADDRCONF(NETDEV_CHANGE): veth9c6caec: link becomes ready
Apr 5 09:54:54 Crimson kernel: br-ee7cefde1519: port 8(veth9c6caec) entered blocking state
Apr 5 09:54:54 Crimson kernel: br-ee7cefde1519: port 8(veth9c6caec) entered forwarding state
Apr 5 09:54:54 Crimson rc.docker: nextcloud: started succesfully!
Apr 5 09:54:55 Crimson avahi-daemon[7523]: Joining mDNS multicast group on interface veth9c6caec.IPv6 with address fe80::c0b7:eff:fe8a:7434.
Apr 5 09:54:55 Crimson avahi-daemon[7523]: New relevant interface veth9c6caec.IPv6 for mDNS.
Apr 5 09:54:55 Crimson avahi-daemon[7523]: Registering new address record for fe80::c0b7:eff:fe8a:7434 on veth9c6caec.*.
Apr 5 09:55:00 Crimson kernel: ffdetect[35846]: segfault at 38 ip 00000000004038da sp 00007ffe2d64a9c0 error 4 in ffdetect[400000+14000]
Apr 5 09:55:00 Crimson kernel: Code: cc 34 21 00 41 0f b6 6d 00 40 84 ed 75 b7 48 8b 34 24 48 8d 3d 3c a2 00 00 31 c0 ff 15 6f 33 21 00 48 89 df ff 15 8e 34 21 00 <41> 0f b6 2c 24 40 84 ed 0f 84 93 00 00 00 4c 8d 35 a1 a5 00 00 eb
Apr 5 09:55:00 Crimson kernel: ffdetect[35943]: segfault at 38 ip 00000000004038da sp 00007ffc7ba23c90 error 4 in ffdetect[400000+14000]
Apr 5 09:55:00 Crimson kernel: Code: cc 34 21 00 41 0f b6 6d 00 40 84 ed 75 b7 48 8b 34 24 48 8d 3d 3c a2 00 00 31 c0 ff 15 6f 33 21 00 48 89 df ff 15 8e 34 21 00 <41> 0f b6 2c 24 40 84 ed 0f 84 93 00 00 00 4c 8d 35 a1 a5 00 00 eb
Apr 5 09:55:00 Crimson kernel: resource sanity check: requesting [mem 0x000c0000-0x000fffff], which spans more than PCI Bus 0000:00 [mem 0x000c0000-0x000dffff window]
Apr 5 09:55:00 Crimson kernel: caller _nv000712rm+0x1af/0x200 [nvidia] mapping multiple BARs
Apr 5 09:55:08 Crimson kernel: BTRFS info (device sdj1): found 8170 extents, stage: update data pointers
Apr 5 09:55:20 Crimson kernel: BTRFS info (device sdj1): relocating block group 12291401449472 flags data|raid1
Apr 5 09:55:37 Crimson kernel: BTRFS info (device sdj1): found 9604 extents, stage: move data extents
Apr 5 09:55:59 Crimson kernel: BTRFS info (device sdj1): found 9603 extents, stage: update data pointers
Apr 5 09:56:00 Crimson root: Fix Common Problems Version 2021.04.02
Apr 5 09:56:12 Crimson kernel: BTRFS info (device sdj1): relocating block group 12290327707648 flags data|raid1
Apr 5 09:56:31 Crimson kernel: BTRFS info (device sdj1): found 9393 extents, stage: move data extents
Apr 5 09:56:49 Crimson kernel: BTRFS info (device sdj1): found 9392 extents, stage: update data pointers
Apr 5 09:56:58 Crimson kernel: BTRFS info (device sdj1): relocating block group 12289253965824 flags data|raid1
Apr 5 09:57:14 Crimson kernel: BTRFS info (device sdj1): found 8649 extents, stage: move data extents
Apr 5 09:57:29 Crimson kernel: BTRFS info (device sdj1): found 8649 extents, stage: update data pointers
Apr 5 09:57:40 Crimson kernel: BTRFS info (device sdj1): relocating block group 12288180224000 flags data|raid1
Apr 5 09:57:56 Crimson kernel: BTRFS info (device sdj1): found 8615 extents, stage: move data extents

Link to comment
4 minutes ago, Crimson Unraider said:

My CPU cooler has failed and my CPU is overheating.

 

OK, that would explain it! Replacing the cooler is now the priority. Your diagnostics from 0713 look much cleaner.
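
Once the new cooler is in, it may be worth keeping an eye on temperatures from the console for a while. This is a minimal sketch that reads the kernel's hwmon interface, where each `temp*_input` file holds a value in millidegrees Celsius. The function names and the `limit_c` threshold are placeholders; which sensors appear (CPU, NVMe, and so on) depends on the drivers loaded on your system.

```python
from pathlib import Path


def read_hwmon_temps(hwmon_root="/sys/class/hwmon"):
    """Return {"hwmonN:chip/sensor": degrees_celsius} for readable sensors."""
    readings = {}
    for chip_dir in sorted(Path(hwmon_root).glob("hwmon*")):
        name_file = chip_dir / "name"
        chip = name_file.read_text().strip() if name_file.exists() else "unknown"
        for sensor in sorted(chip_dir.glob("temp*_input")):
            try:
                millidegrees = int(sensor.read_text().strip())
            except (OSError, ValueError):
                continue  # some sensors fail to read or report nothing useful
            # Include the hwmon directory name so chips with the same name
            # (e.g. several CPU dies) do not overwrite each other.
            key = f"{chip_dir.name}:{chip}/{sensor.name}"
            readings[key] = millidegrees / 1000.0
    return readings


def over_limit(readings, limit_c):
    """Return only the sensors at or above `limit_c` (a threshold you choose)."""
    return {name: temp for name, temp in readings.items() if temp >= limit_c}
```

Running `over_limit(read_hwmon_temps(), limit_c)` in a loop with a limit that suits your hardware would show whether anything is still running hot under load. I have deliberately not suggested a specific limit, since the right value depends on the CPU and drive specifications.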
