jordanmw Posted May 22, 2019

Ok, this is a weird one. I have 4 gaming VMs set up, and they worked great with 4x 960s - no issues in any games, no matter how long we play or what is thrown at them. I upgraded 2 of the GPUs to 2070s and everything appeared to be great - I passed through all devices from those cards to my machines and gaming was great - but only for so long. After gaming for a couple of hours, those 2 machines go to a black screen, with the monitor flipping on and off. If I unplug the HDMI from one card at that point, the other VM comes back and has no issues - it can play for hours more. The other machine has to be rebooted to come back up, and usually requires a couple of resets to get the GPU back, but it eventually works and can play for several more hours without issue. I can log in remotely to the machine that needs the reboot before rebooting it, and I can see that the game is still playing and functional. It just won't re-enable the monitor output, and every time I plug it back in (before the reboot) it takes out the screen for VM #2. Once it reboots, I can plug both monitors back in and continue as normal.

Looking at the logs, here are the errors it shows:

May 20 20:43:02 Tower kernel: pcieport 0000:40:01.3: device [1022:1453] error status/mask=00000040/00006000
May 20 20:43:02 Tower kernel: pcieport 0000:40:01.3: [ 6] Bad TLP
May 20 20:43:03 Tower kernel: pcieport 0000:40:01.3: AER: Corrected error received: 0000:00:00.0
May 20 20:43:03 Tower kernel: pcieport 0000:40:01.3: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Receiver ID)
May 20 20:43:03 Tower kernel: pcieport 0000:40:01.3: device [1022:1453] error status/mask=00000040/00006000
May 20 20:43:03 Tower kernel: pcieport 0000:40:01.3: [ 6] Bad TLP
May 20 20:43:03 Tower kernel: pcieport 0000:40:01.3: AER: Corrected error received: 0000:00:00.0
May 20 20:43:03 Tower kernel: pcieport 0000:40:01.3: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Receiver ID)

It's complaining about this device [1022:1453]:

40:01.3 PCI bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) PCIe GPP Bridge

Not sure where to go from here - everything looks like it is passing through correctly. Diag attached.

tower-diagnostics-20190522-0844.zip
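(For reference, a rough way to keep an eye on these from the Unraid console - just a sketch, where 40:01.3 is the bridge from the log above and other addresses will differ per board:)

lspci -vv -s 40:01.3                  # details for the bridge, including link speed/width (LnkSta)
dmesg -w | grep -iE 'AER|Bad TLP'     # follow the kernel log live and watch for new corrected errors while the VMs are gaming

Corrected "Bad TLP" errors at the Data Link Layer are generally a link/signal-integrity symptom (slot, riser, or the link retraining) rather than a fatal fault, which would fit a "works for a while, then drops out" pattern.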
jordanmw Posted May 22, 2019

Saw another thread saying this is the fix: pcie_aspm=off. Has anyone else had to do this to get things going? Why would a change be needed after upgrading from 960s to 2070s?
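For reference, on Unraid that flag goes on the kernel append line in syslinux.cfg (Main > Flash > Syslinux Configuration, or edit the file on the flash drive directly) - a minimal sketch, assuming the stock boot entry:

# /boot/syslinux/syslinux.cfg (default boot entry)
label Unraid OS
  menu default
  kernel /bzimage
  append pcie_aspm=off initrd=/bzroot

# after a reboot, confirm the flag actually took effect:
cat /proc/cmdline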
jordanmw Posted May 23, 2019

Well, that didn't help. I added pcie_aspm=off with no result. Does anyone have any ideas?
JorgeB Posted May 23, 2019

A BIOS update might help if one is available. You can also try the latest Unraid, since it uses a newer kernel.
jordanmw Posted May 23, 2019

21 minutes ago, johnnie.black said:
A BIOS update might help if one is available. You can also try the latest Unraid, since it uses a newer kernel.

But I'm scared.... 😱 Everything was so dialed in before I swapped GPUs.... Guess I'll give it a shot, wish me luck....
jordanmw Posted May 23, 2019

Well, I updated to the newest Unraid - nothing has blown up yet, but I won't know if it helped until we can beat it up for a few hours. I'll report back with results.
jordanmw Posted May 24, 2019

The latest Unraid makes no difference; I'll attach a new diag in a few.

tower-diagnostics-20190524-1330.zip
jordanmw Posted May 24, 2019

It is very strange - as soon as VM #4 starts having issues (the screen going black with a few lines of pixels), VM #3 starts having issues too, with the same black screen, except it flashes a few times before it stops displaying. Then if I unplug the monitor from VM #4, VM #3 is completely fine and continues on with no issue. If I wait a while and plug the monitor back in, VM #3 goes back to a black screen. I can then force stop VM #4, and still, if I plug a monitor into VM #4, VM #3 goes back to black. After I reset that machine again and log in remotely, it shows VM #4 at 800x600 and unresponsive. I can reset it again, then plug the monitor back in, and everything on VM #3 is fine and the screen doesn't go black. VM #4 then boots just fine and has no issues from then on - the issue never occurs again. I am thinking this has to be an interrupt issue, since it hits 2 machines unless one is reset multiple times. Can anyone make sense of this craziness?
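(If the interrupt theory is worth ruling in or out, a couple of quick checks from the console might help - a sketch only, where 0a:00.0 is a made-up address to be replaced with wherever the 2070s actually sit:)

lspci -vv -s 0a:00.0 | grep -iE 'MSI|IRQ'   # is the passed-through GPU using MSI/MSI-X or legacy INTx?
grep -i vfio /proc/interrupts               # per-device interrupt counts for assigned devices; re-run during a blackout

Inside the Windows guests there is also the old trick of forcing MSI mode on the GPU and its HDMI audio function (via the MSI Utility or a registry edit), which has cleared up passthrough display/audio glitches for some people - worth a look if the devices show up with line-based interrupts.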
jordanmw Posted May 28, 2019

After some further testing, it appears that once the issue occurs, if you even touch the HDMI cable to the port on VM #4, VM #3 blanks the screen and resumes immediately after it is removed. It's like a grounding issue. It goes away after a few resets of VM #4 and continues to work fine for another 90 or so minutes before the issue hits again.
jordanmw Posted May 30, 2019

I have updated literally everything at this point - it's really strange that it only affects the VMs where I upgraded the GPU. I'm going to grab a DisplayPort cable and see if that is any different. I can literally log in to that machine after the screen blanks and see that the game is still playing with no issue.
jordanmw Posted June 12, 2019

Still seeing this issue - can anyone give feedback, or does anyone have experience with ANYTHING like this?!? This is literally the ONLY issue I am having with my 4-headed gaming rig - if I can get it sorted out, it will be PERFECT!
jbartlett Posted June 13, 2019

Do you have an auxiliary molex power port for the PCIe bus, and if so, are you using it? If not, a riser extender that takes power may help.
jordanmw Posted June 13, 2019

24 minutes ago, jbartlett said:
Do you have an auxiliary molex power port for the PCIe bus, and if so, are you using it? If not, a riser extender that takes power may help.

Yep. Plus, the 2070 that blacks out first has an 8+6 pin connector, so it is getting even more reliable power. Overclocking the cards doesn't change anything either; it still goes for the same 90 minutes before having the issue, then after a couple of resets of that VM it works again for another 90 minutes.
jbartlett Posted June 13, 2019

A couple of things I'd try at this point:
1. Update the NVIDIA drivers with the "clean install" option.
2. Build a new VM with just the software you need to duplicate the blackout, to rule out whether it's an OS or software issue.
3. It's weird how the screens black out so consistently after a given period of time. A second power supply powering the 2070s would rule out a power supply issue.

This reminds me of a 386 computer I was trying to diagnose back in the 90s. It had suffered a lightning-related surge, and it would only boot to DOS if the machine was powered up for no less than ten minutes and then warm rebooted. That begs the question - any surges that you are aware of? Shit gets real weird after one. We're talking gateways-to-alternate-realities-opening-up-in-your-PCIe-lanes kind of weird.
jordanmw Posted June 13, 2019

Thanks for the suggestions, John - I'll give some of that a shot. Better than the crickets I was getting with this issue from everyone else. I have done #1 but haven't tried #2 or #3, since I rarely get much chance to troubleshoot because someone is always using it when I am home. No surges that I am aware of, and protection at every outlet - no other weirdness.
jbartlett Posted June 13, 2019

I'd try the power supply route first, because I think you are dealing with a power issue of some kind. Are the motherboard or video cards overclocked? If so, go back to stock settings. Make sure Afterburner or the like isn't running or set to auto-load. Another thing to try is undervolting.
jordanmw Posted June 13, 2019

I don't have anything overclocked - I only tried that to see if it would change the amount of time it runs without issue, but there was no change. I thought maybe it would lead to less time, but it still goes for the same time before it happens again. I haven't tried undervolting yet - that might be something I can test also. Don't you need to tweak the GPU BIOS to get an undervolt?
jbartlett Posted June 14, 2019

9 hours ago, jordanmw said:
Don't you need to tweak the GPU BIOS to get an undervolt?

Software such as Afterburner has sliders that let you give the GPU less power, but it's been over a year since I mucked with OC.
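If you want to try a lower power target without touching the GPU BIOS, it can also be done from inside the Windows guest with nvidia-smi, which ships with the NVIDIA driver - a rough sketch, run from an elevated command prompt (the 175 W figure is only an example, not a recommendation):

nvidia-smi -q -d POWER    # show current/default/min/max power limits for the card
nvidia-smi -pl 175        # cap the board power limit; resets to default on reboot

The -pl value has to stay within the card's enforced min/max limits, and on some boards the limit is locked.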
bastl Posted June 14, 2019

Is your Windows activated correctly? Black screen after 90 minutes, just sayin' 😂
jordanmw Posted June 14, 2019

7 hours ago, bastl said:
Is your Windows activated correctly? Black screen after 90 minutes, just sayin' 😂

Hell no! I don't activate Windows on any of my VMs, and none of the others have issues - just sayin'.
jordanmw Posted June 28, 2019

Well, I did try a new PSU - with the same results. Here is the crazy part: at this point I think it's the HDMI cables that are causing the issues. I ran a few experiments, and if I use a shorter HDMI cable for VM #1 it never goes black, but VM #2 does if it is still using the longer cables. VM #3 even does it for a few minutes. The screen goes black, flashes on and off for maybe 5-10 minutes, then everything works properly again on all VMs, and we continue playing for another 90 minutes or so before it happens the exact same way again?!? My only guess at this point is that the longer cables are causing errors that stack up for 90 minutes before crashing some kind of buffer, which then restarts. Does anyone know anything about HDMI length causing these kinds of issues? What kind of error in an HDMI signal can cause this kind of behavior?
Squid Posted June 28, 2019

The max length for an HDMI cable without it being an active cable is 50'. But you can also try setting different refresh rates within Windows, in case this is a situation where the video card is on one edge of the rate and the monitor is on the other.
jordanmw Posted June 28, 2019

2 minutes ago, Squid said:
The max length for an HDMI cable without it being an active cable is 50'. But you can also try setting different refresh rates within Windows, in case this is a situation where the video card is on one edge of the rate and the monitor is on the other.

Thanks Squid - yeah, I'm at max length. Maybe you are right and I just need to tweak refresh settings. I'll have some time to test things more thoroughly this weekend. Don't know why I haven't tried just swapping cables.... duh!
GHunter Posted June 28, 2019

Glad you've made progress on this. I have 4 of the 50 ft cables myself. One of them went bad, and Monoprice replaced it for me under warranty.
jbartlett Posted June 28, 2019

If you're at range, HDMI over Cat5e+ might be a viable option. Amazon has options under $50 that say you can go up to 50 meters; more bucks for longer runs.