Crimson Unraider Posted April 3, 2021

I have 2 Windows VMs, one on the cache drive and one on NVMe (I followed SpaceInvaderOne's guide). I normally only use the NVMe one, but I left the cache one running for testing. Both worked fine in 6.8.3 and both are extremely slow after the upgrade. I noticed in Task Manager that "System Interrupts" was randomly using over 60% of my CPU on both VMs. It is set up as a gaming VM using an Nvidia 1660 Ti with 16 GB of RAM. I thought the GPU passthrough might be the problem, so I tried VNC and it was still slow.

This error kept repeating and filling my log file while the GPU Statistics plugin was running, so I uninstalled the plugin and the error stopped.

Quote

Apr 1 11:17:10 Crimson kernel: resource sanity check: requesting [mem 0x000c0000-0x000fffff], which spans more than PCI Bus 0000:00 [mem 0x000c0000-0x000dffff window]
Apr 1 11:17:10 Crimson kernel: caller _nv000712rm+0x1af/0x200 [nvidia] mapping multiple BARs

I wasn't able to get diagnostics after the random crashes, but I did capture the attached after running a VM. My system log shows these errors after I start the VM (they repeat until I shut the VM down).

Quote

Apr 3 15:29:10 Crimson smbd[21858]: [2021/04/03 15:29:10.766144, 0] ../../lib/param/loadparm.c:801(lpcfg_map_parameter)
Apr 3 15:29:10 Crimson smbd[21858]: Unknown parameter encountered: "hide file"
Apr 3 15:29:10 Crimson smbd[21858]: [2021/04/03 15:29:10.766416, 0] ../../lib/param/loadparm.c:1841(lpcfg_do_global_parameter)
Apr 3 15:29:10 Crimson smbd[21858]: Ignoring unknown parameter "hide file"

I checked my BIOS and it is the newest version; HVM and IOMMU are enabled.

M/B: Gigabyte Technology Co., Ltd. X399 AORUS PRO-CF, Version Default string
BIOS: American Megatrends Inc., Version F2, dated 12/11/2019
CPU: AMD Ryzen Threadripper 2950X 16-Core @ 3500 MHz
Memory: 128 GiB DDR4 (max. installable capacity 512 GiB)

crimson-diagnostics-20210403-1538.zip
John_M Posted April 3, 2021 (edited)

23 minutes ago, Crimson Unraider said:
I thought that the GPU passthrough may be the problem

You can't pass through a GPU that's bound to a driver. Uninstall the Nvidia driver.

23 minutes ago, Crimson Unraider said:
I noticed this error while running GPU statistics plugin repeating and filling my log file

See this post and the subsequent follow-ups:

Edited April 3, 2021 by John_M
Crimson Unraider Posted April 3, 2021 (Author)

I have two GPUs; I use my GTX 1070 with Plex and Emby. Can I stub the 1660 in VFIO for use with my VM and keep the Nvidia driver for the 1070?
Crimson Unraider Posted April 3, 2021 (Author)

So I stubbed the 1660 and the sanity check errors stopped.

Quote

Apr 1 11:17:10 Crimson kernel: resource sanity check: requesting [mem 0x000c0000-0x000fffff], which spans more than PCI Bus 0000:00 [mem 0x000c0000-0x000dffff window]
Apr 1 11:17:10 Crimson kernel: caller _nv000712rm+0x1af/0x200 [nvidia] mapping multiple BARs

But it is still really slow and locking up. I think it might be network related; Task Manager keeps showing excessive CPU usage from "System Interrupts" when the freezing happens. From what I've read, "System Interrupts" usage usually points to a hardware issue, most often a NIC or an external device. The only things plugged in are a USB keyboard/mouse and an Xbox controller wireless adapter. When I disable the Windows network adapter I see fewer system interrupts, but then I can't play most of my games. I think I need to walk away for the night.
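[Editor's note: for anyone landing here later, stubbing on Unraid 6.9 is normally done from Tools > System Devices, which writes BIND entries to /boot/config/vfio-pci.cfg; a reboot is then needed so the Nvidia driver never claims the card. The vendor:device IDs come from lspci. A minimal sketch, with sample lspci output embedded because the real PCI addresses on this board are unknown (hypothetical):]

```shell
# Sample "lspci -nn" output (hypothetical addresses/models); on a live box:
#   lspci -nn | grep -i nvidia
sample='0b:00.0 VGA compatible controller [0300]: NVIDIA Corporation GP104 [GeForce GTX 1070] [10de:1b81]
42:00.0 VGA compatible controller [0300]: NVIDIA Corporation TU116 [GeForce GTX 1660 Ti] [10de:2182]'
# Extract the [vendor:device] IDs. The 1660 Ti's ID is the one to bind to
# vfio-pci, leaving the 1070 on the Nvidia driver for Plex/Emby:
echo "$sample" | grep -o '\[10de:[0-9a-f]*\]'
```

Binding by PCI address (not just ID) matters here, since both cards share the Nvidia vendor ID, which is why the System Devices checkbox approach is the safer route on a dual-GPU box.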
John_M Posted April 4, 2021

1 hour ago, Crimson Unraider said:
I think I need to walk away for the night.

Taking a break from a problem often helps. Meanwhile, the diagnostics you posted earlier reveal that SSD KINGSTON SHSS37A480G (part of your cache pool) has cable problems, which is obviously affecting cache operation. I missed it earlier in the general noise. Shut down and check/replace the SATA cable, and check the power cable while you're there. Then power up, start the array, and post new diagnostics, which should be a bit tidier and easier to read.
Crimson Unraider Posted April 5, 2021 (Author)

On 4/3/2021 at 9:13 PM, John_M said:
SHSS37A480G

John, I'm ready to get started on this again. I'm trying to find which drive has the cable problem; I have four of those drives in the pool. I'm pretty new to Unraid and I'm curious where you found that.
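[Editor's note: a failing SATA cable typically shows up as a climbing UDMA_CRC_Error_Count (SMART attribute 199), and the diagnostics zip contains one smartctl report per drive. A minimal sketch of pulling that attribute out; the smartctl line below is a made-up sample with hypothetical values, since the real drives can't be queried here:]

```shell
# On a live system you would run, for each Kingston SSD:
#   smartctl -A /dev/sdX | grep UDMA_CRC_Error_Count
# The raw value (last field) is what matters; it stays at 0 on a good link.
# Sample smartctl attribute line (hypothetical values), parsed below:
sample='199 UDMA_CRC_Error_Count   0x003e   200   200   000    Old_age   Always       -       13'
echo "$sample" | awk '/UDMA_CRC_Error_Count/ {print "CRC errors:", $NF}'
```

One caveat: the CRC count is cumulative and never resets, so after swapping cables the thing to watch is whether it keeps *increasing*, not whether it is nonzero.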
Crimson Unraider Posted April 5, 2021 Author Share Posted April 5, 2021 Ok I changed the cables on all 4 of the Kingston SSDs, see attached Diagnostics after boot. Also, my ssds are all in an icy dock, I checked the power connectors but there are only two feeding the 6 ssds. Also, while I had it open I added another nvme for the second cache. crimson-diagnostics-20210405-0713.zip Quote Link to comment
Crimson Unraider Posted April 5, 2021 Author Share Posted April 5, 2021 So, another update, the nvme I put in was reporting 57 degrees C. When I clicked on it to see the info page the server turned off. I removed the nvme and restarted, the system started a parity check due to unclean shutdown but I stopped it until after troubleshooting. I don't want it to crash in the middle of parity. When I pulled the nvme it was warm to touch but not hot. When I removed the nvme I put the samsung ssd in the main cache pool. I moved it earlier to try to separate libvrt from the other traffic when the slow down first started, that didn't help so I went back to one pool. I also noticed that it is taking more than 5 min to boot up now. crimson-diagnostics-20210405-0846.zip Quote Link to comment
Crimson Unraider Posted April 5, 2021 (Author)

It just shut down again.
Crimson Unraider Posted April 5, 2021 (Author)

I opened the log as soon as I could at boot and captured this before it shut off again.

Quote

Apr 5 09:54:47 Crimson kernel: eth0: renamed from vethc78c9d9
Apr 5 09:54:47 Crimson kernel: IPv6: ADDRCONF(NETDEV_CHANGE): vethda3a3eb: link becomes ready
Apr 5 09:54:47 Crimson kernel: docker0: port 3(vethda3a3eb) entered blocking state
Apr 5 09:54:47 Crimson kernel: docker0: port 3(vethda3a3eb) entered forwarding state
Apr 5 09:54:48 Crimson rc.docker: mariadb: started succesfully!
Apr 5 09:54:48 Crimson kernel: br-ee7cefde1519: port 7(vethde7b5a6) entered blocking state
Apr 5 09:54:48 Crimson kernel: br-ee7cefde1519: port 7(vethde7b5a6) entered disabled state
Apr 5 09:54:48 Crimson kernel: device vethde7b5a6 entered promiscuous mode
Apr 5 09:54:49 Crimson avahi-daemon[7523]: Joining mDNS multicast group on interface vethda3a3eb.IPv6 with address fe80::80d9:41ff:fec1:6cd6.
Apr 5 09:54:49 Crimson avahi-daemon[7523]: New relevant interface vethda3a3eb.IPv6 for mDNS.
Apr 5 09:54:49 Crimson avahi-daemon[7523]: Registering new address record for fe80::80d9:41ff:fec1:6cd6 on vethda3a3eb.*.
Apr 5 09:54:50 Crimson kernel: eth0: renamed from vethb3f1586
Apr 5 09:54:50 Crimson kernel: IPv6: ADDRCONF(NETDEV_CHANGE): vethde7b5a6: link becomes ready
Apr 5 09:54:50 Crimson kernel: br-ee7cefde1519: port 7(vethde7b5a6) entered blocking state
Apr 5 09:54:50 Crimson kernel: br-ee7cefde1519: port 7(vethde7b5a6) entered forwarding state
Apr 5 09:54:51 Crimson rc.docker: Collabora: started succesfully!
Apr 5 09:54:52 Crimson avahi-daemon[7523]: Joining mDNS multicast group on interface vethde7b5a6.IPv6 with address fe80::50af:fcff:fe86:dbc3.
Apr 5 09:54:52 Crimson avahi-daemon[7523]: New relevant interface vethde7b5a6.IPv6 for mDNS.
Apr 5 09:54:52 Crimson avahi-daemon[7523]: Registering new address record for fe80::50af:fcff:fe86:dbc3 on vethde7b5a6.*.
Apr 5 09:54:52 Crimson kernel: br-ee7cefde1519: port 8(veth9c6caec) entered blocking state
Apr 5 09:54:52 Crimson kernel: br-ee7cefde1519: port 8(veth9c6caec) entered disabled state
Apr 5 09:54:52 Crimson kernel: device veth9c6caec entered promiscuous mode
Apr 5 09:54:53 Crimson kernel: eth0: renamed from vethcb0c948
Apr 5 09:54:54 Crimson kernel: IPv6: ADDRCONF(NETDEV_CHANGE): veth9c6caec: link becomes ready
Apr 5 09:54:54 Crimson kernel: br-ee7cefde1519: port 8(veth9c6caec) entered blocking state
Apr 5 09:54:54 Crimson kernel: br-ee7cefde1519: port 8(veth9c6caec) entered forwarding state
Apr 5 09:54:54 Crimson rc.docker: nextcloud: started succesfully!
Apr 5 09:54:55 Crimson avahi-daemon[7523]: Joining mDNS multicast group on interface veth9c6caec.IPv6 with address fe80::c0b7:eff:fe8a:7434.
Apr 5 09:54:55 Crimson avahi-daemon[7523]: New relevant interface veth9c6caec.IPv6 for mDNS.
Apr 5 09:54:55 Crimson avahi-daemon[7523]: Registering new address record for fe80::c0b7:eff:fe8a:7434 on veth9c6caec.*.
Apr 5 09:55:00 Crimson kernel: ffdetect[35846]: segfault at 38 ip 00000000004038da sp 00007ffe2d64a9c0 error 4 in ffdetect[400000+14000]
Apr 5 09:55:00 Crimson kernel: Code: cc 34 21 00 41 0f b6 6d 00 40 84 ed 75 b7 48 8b 34 24 48 8d 3d 3c a2 00 00 31 c0 ff 15 6f 33 21 00 48 89 df ff 15 8e 34 21 00 <41> 0f b6 2c 24 40 84 ed 0f 84 93 00 00 00 4c 8d 35 a1 a5 00 00 eb
Apr 5 09:55:00 Crimson kernel: ffdetect[35943]: segfault at 38 ip 00000000004038da sp 00007ffc7ba23c90 error 4 in ffdetect[400000+14000]
Apr 5 09:55:00 Crimson kernel: Code: cc 34 21 00 41 0f b6 6d 00 40 84 ed 75 b7 48 8b 34 24 48 8d 3d 3c a2 00 00 31 c0 ff 15 6f 33 21 00 48 89 df ff 15 8e 34 21 00 <41> 0f b6 2c 24 40 84 ed 0f 84 93 00 00 00 4c 8d 35 a1 a5 00 00 eb
Apr 5 09:55:00 Crimson kernel: resource sanity check: requesting [mem 0x000c0000-0x000fffff], which spans more than PCI Bus 0000:00 [mem 0x000c0000-0x000dffff window]
Apr 5 09:55:00 Crimson kernel: caller _nv000712rm+0x1af/0x200 [nvidia] mapping multiple BARs
Apr 5 09:55:08 Crimson kernel: BTRFS info (device sdj1): found 8170 extents, stage: update data pointers
Apr 5 09:55:20 Crimson kernel: BTRFS info (device sdj1): relocating block group 12291401449472 flags data|raid1
Apr 5 09:55:37 Crimson kernel: BTRFS info (device sdj1): found 9604 extents, stage: move data extents
Apr 5 09:55:59 Crimson kernel: BTRFS info (device sdj1): found 9603 extents, stage: update data pointers
Apr 5 09:56:00 Crimson root: Fix Common Problems Version 2021.04.02
Apr 5 09:56:12 Crimson kernel: BTRFS info (device sdj1): relocating block group 12290327707648 flags data|raid1
Apr 5 09:56:31 Crimson kernel: BTRFS info (device sdj1): found 9393 extents, stage: move data extents
Apr 5 09:56:49 Crimson kernel: BTRFS info (device sdj1): found 9392 extents, stage: update data pointers
Apr 5 09:56:58 Crimson kernel: BTRFS info (device sdj1): relocating block group 12289253965824 flags data|raid1
Apr 5 09:57:14 Crimson kernel: BTRFS info (device sdj1): found 8649 extents, stage: move data extents
Apr 5 09:57:29 Crimson kernel: BTRFS info (device sdj1): found 8649 extents, stage: update data pointers
Apr 5 09:57:40 Crimson kernel: BTRFS info (device sdj1): relocating block group 12288180224000 flags data|raid1
Apr 5 09:57:56 Crimson kernel: BTRFS info (device sdj1): found 8615 extents, stage: move data extents
Crimson Unraider Posted April 5, 2021 (Author)

I found the reason for the shutdowns: my CPU cooler has failed and my CPU is overheating. I brought the PC up into the BIOS; after about 10 minutes it shut down, and I noticed the CPU temperature was 93 °C.
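[Editor's note: CPU temperature can also be watched from the Unraid shell with lm-sensors (the same backend the Dynamix system temperature plugin uses), without sitting in the BIOS. A minimal sketch; the `Tdie` line below is a made-up sample, since exact sensor labels and formatting vary by board:]

```shell
# On a live system: sensors      (or: watch -n 2 sensors)
# Threadripper exposes its die temperature via the k10temp driver.
# Sample output line (hypothetical values), reduced to a bare number:
sample='Tdie:         +93.0°C  (high = +70.0°C)'
temp=$(echo "$sample" | awk '{print $2}' | sed 's/[^0-9.]//g')
echo "$temp"
# Flag anything at or above 90 C as critical:
echo "$temp" | awk '{ print ($1 >= 90) ? "CRITICAL" : "ok" }'
```

A loop over this in a screen session would have caught the climb toward thermal shutdown well before the 93 °C reading in the BIOS.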
John_M Posted April 5, 2021

4 minutes ago, Crimson Unraider said:
My cpu cooler has failed and my cpu os overheating.

OK, that would explain it! Replacing the cooler is now the priority. Your diagnostics from 0713 look much cleaner.
Crimson Unraider Posted April 13, 2021 (Author)

John_M, thanks for the help; all my problems were linked to the failing CPU cooler. I had to wait on parts, but I went all in on a water-cooling kit and now I'm averaging 37 °C and everything is running fine. 🤪