Strange crashes after upgrading GPUs



OK, this is a weird one. I have 4 gaming VMs set up, and they worked great with 4x 960s: no issues in any game, no matter how long we played or what was thrown at them. I upgraded 2 of the GPUs to 2070s and everything appeared to be great. I passed through all devices from those cards to my machines and gaming was great, but only for so long. After gaming for a couple of hours, those 2 machines go to a black screen, with the monitor flipping on and off. If I unplug the HDMI cable from one card at that point, the other VM comes back and has no issues; it can play for hours more. The other machine has to be rebooted to come back up, and usually requires a couple of resets to get the GPU back, but it eventually works and can play for several more hours without issue. I can log in remotely to the machine that needs the reboot before rebooting it, and can see that the game is still playing and functional. It just won't re-enable the monitor output, and every time I plug the cable back in (before the reboot) it takes out the screen for VM #2. Once it reboots, I can plug both monitors back in and continue as normal.

 

Looking at the logs, here are the errors it shows:

(Receiver ID)
May 20 20:43:02 Tower kernel: pcieport 0000:40:01.3: device [1022:1453] error status/mask=00000040/00006000
May 20 20:43:02 Tower kernel: pcieport 0000:40:01.3: [ 6] Bad TLP 
May 20 20:43:03 Tower kernel: pcieport 0000:40:01.3: AER: Corrected error received: 0000:00:00.0
May 20 20:43:03 Tower kernel: pcieport 0000:40:01.3: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Receiver ID)
May 20 20:43:03 Tower kernel: pcieport 0000:40:01.3: device [1022:1453] error status/mask=00000040/00006000
May 20 20:43:03 Tower kernel: pcieport 0000:40:01.3: [ 6] Bad TLP 
May 20 20:43:03 Tower kernel: pcieport 0000:40:01.3: AER: Corrected error received: 0000:00:00.0
May 20 20:43:03 Tower kernel: pcieport 0000:40:01.3: PCIe Bus Error: severity=Corrected, type=Data Link Layer, 

 

It's complaining about this device:

[1022:1453] 40:01.3 PCI bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) PCIe GPP Bridge
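For anyone trying to read those lines: here is a minimal Python sketch, assuming the syslog format shown above, that decodes the `error status/mask` field and counts corrected errors per port. The bit names follow the standard PCIe AER Correctable Error Status layout (bit 6 is Bad TLP, which matches the `[ 6] Bad TLP` line in the log). The function names and the regex are illustrative only and are not part of Unraid or the kernel.

```python
import re
from collections import Counter

# Bit positions in the PCIe AER Correctable Error Status register.
CORRECTABLE_BITS = {
    0: "Receiver Error",
    6: "Bad TLP",
    7: "Bad DLLP",
    8: "REPLAY_NUM Rollover",
    12: "Replay Timer Timeout",
    13: "Advisory Non-Fatal Error",
    14: "Corrected Internal Error",
    15: "Header Log Overflow",
}

# Matches lines such as:
# May 20 20:43:02 Tower kernel: pcieport 0000:40:01.3: device [1022:1453]
#   error status/mask=00000040/00006000
STATUS_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+[\d:]{8}) \S+ kernel: "
    r"pcieport (?P<port>[0-9a-f:.]+): "
    r".*error status/mask=(?P<status>[0-9a-f]{8})/(?P<mask>[0-9a-f]{8})"
)


def decode_bits(value):
    """Return the names of all bits set in a status or mask value."""
    return [
        CORRECTABLE_BITS.get(bit, f"bit {bit}")
        for bit in range(32)
        if (value >> bit) & 1
    ]


def summarize(lines):
    """Count reported correctable errors per (port, error name)."""
    counts = Counter()
    for line in lines:
        match = STATUS_RE.search(line)
        if match is None:
            continue  # not a status/mask line
        status = int(match["status"], 16)
        for name in decode_bits(status):
            counts[(match["port"], name)] += 1
    return counts


if __name__ == "__main__":
    log = [
        "May 20 20:43:02 Tower kernel: pcieport 0000:40:01.3: device "
        "[1022:1453] error status/mask=00000040/00006000",
        "May 20 20:43:02 Tower kernel: pcieport 0000:40:01.3: [ 6] Bad TLP ",
    ]
    for (port, name), n in summarize(log).items():
        print(f"{port}: {name} x{n}")
    print("masked:", decode_bits(0x00006000))
```

Read this way, the status `00000040` is a Bad TLP (a corrupted packet that the link layer retried), and the mask `00006000` only means that Advisory Non-Fatal and Corrected Internal errors are not being reported. Counting these per port over a gaming session could show whether the errors pile up before each blackout.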

 

Not sure where to go from here. It looks like everything is passing through correctly. Diagnostics attached.

 

tower-diagnostics-20190522-0844.zip

Edited by jordanmw
Link to comment
21 minutes ago, johnnie.black said:

A BIOS update might help if one is available; you can also try the latest Unraid, since it uses a newer kernel.

But I'm scared.... 😱

 

Everything was so dialed in before I swapped GPUs.... guess I'll give it a shot, wish me luck....

Link to comment

It is very strange. As soon as VM#4 starts having issues (blacking the screen with a few lines of pixels), VM#3 starts having issues too, with the same black screen, except it flashes a few times before it stops displaying. Then if I unplug the monitor for VM#4, VM#3 is completely fine and continues on with no issue. If I wait a while and plug the monitor back in, VM#3 goes back to a black screen. I can then force stop VM#4, and still, if I plug a monitor into VM#4, VM#3 goes back to black. After I reset that machine again and log in remotely, it shows VM#4 at 800x600 resolution and unresponsive. I can reset it again, then plug the monitor back in, and VM#3 has no issues and its screen doesn't go black. VM#4 then boots just fine and has no issues from then on, and the issue never occurs again.

 

I am thinking this has to be an interrupt issue, since it hits 2 machines unless one is reset multiple times. Can anyone make sense of this craziness?

Link to comment

After some further testing, it appears that once the issue occurs, if you even touch the HDMI cable to the port on VM#4, VM#3 blanks the screen and resumes immediately after the cable is removed. It's like a grounding issue. It goes away after a few resets of VM#4, and everything continues to work fine for another 90 or so minutes before the issue comes back.

Link to comment

Updated literally everything at this point. It's really strange that it only affects the VMs whose GPUs I upgraded. Going to grab a DisplayPort cable and see if that is any different. I can literally log in to that machine after the screen blanks and see that the game is still playing with no issue.

Link to comment
  • 2 weeks later...

Still seeing this issue, can anyone give feedback or have any experience with ANYTHING like this?!?

 

This is literally the ONLY issue I am having with my 4 headed gaming rig- if I can get it sorted out- it will be PERFECT!

Edited by jordanmw
Link to comment
24 minutes ago, jbartlett said:

Do you have an auxiliary Molex power port for the PCIe bus, and if so, are you using it?

 

If not, a riser extender that takes power may help.

Yep. Plus- the 2070 that blacks out first has an 8+6pin so it is getting even more reliable power.  Overclocking the cards also doesn't change anything, it still goes for the same 90 minutes before having the issue, then after a couple resets of that VM- works again for another 90 min. 

Link to comment

Couple of things I'd try at this point:

1. Update the nvidia drivers with the "clean install" option

2. Build a new VM with just the software you need to duplicate the blackout, to rule out an OS or software issue

3. It's weird how the screens black out consistently after a given period of time. A second power supply powering the 2070s can rule out a power supply issue.

 

This reminds me of a 386 computer I was trying to diagnose back in the '90s. It had suffered a lightning-related surge and it would only boot to DOS if the machine was powered up for no less than ten minutes and then warm rebooted. That begs the question: any surges that you are aware of? Shit gets real weird after one. We're talking gateways to alternate realities opening up in your PCIe lanes kinda weird.

Link to comment

Thanks for the suggestions John, I'll give some of that a shot- better than the crickets I was getting with this issue from everyone else.  I have done #1 but haven't tried #2 or #3 since I rarely get much chance to troubleshoot because someone is always using it when I am home.

 

No surges that I am aware of, and protection at every outlet- no other weirdness.

Edited by jordanmw
Link to comment

I'd try the power supply route first because I think you are dealing with a power issue of some kind.

 

Motherboard or video cards OC'ed? If so, go back to stock settings. Make sure Afterburner or the like isn't running or set to auto-load. Another thing to try is undervolting.

Link to comment

I don't have anything overclocked- just did that to see if it would change the amount of time that it could perform without issue- but no change.  I thought maybe it would lead to less time, but it still goes for the same time before having that happen again.  

 

Haven't tried undervolting yet. That might be something I can test also. Don't you need to tweak the GPU BIOS to get an undervolt?

Link to comment
  • 2 weeks later...

Well, I did try a new PSU, with the same results. Here is the crazy part: at this point I think it's the HDMI cables that are causing the issues. I ran a few experiments, and if I use a shorter HDMI cable for VM#1 it never goes black, but VM#2 does if it is still using the longer cables. VM#3 even does it for a few minutes. The screen goes black, flashes on and off for maybe 5-10 minutes, then everything works properly again on all VMs and we continue playing for another 90 min or so before it happens the exact same way again?!? My only guess at this point is that the longer cables are causing errors that stack up for 90 minutes before crashing some kind of buffer and restarting that buffer. Anyone know anything about HDMI length causing these kinds of issues? What kind of error in HDMI signals can cause this kind of behavior?

Link to comment
2 minutes ago, Squid said:

Max length for a passive (non-active) HDMI cable is about 50'. But you can also try setting different refresh rates within Windows, in case the video card is on one side of the rate and the monitor is on the other.

Thanks Squid- yeah, I'm at max length.  Maybe you are right and I just need to tweak refresh settings.  I'll have some time to test things more thoroughly this weekend.  Don't know why I haven't tried just swapping cables.... duh! 
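To put rough numbers on why refresh rate and cable length interact, here is a small Python sketch that estimates the raw TMDS bit rate an HDMI link has to carry for a few common modes. It assumes standard CTA-861 total timings (active plus blanking), 8-bit color, and TMDS signalling with 3 data channels at 10 bits per 8-bit symbol. The modes listed are examples only; the thread does not say which resolutions the monitors actually run.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VideoMode:
    name: str
    h_total: int       # pixels per line, including horizontal blanking
    v_total: int       # lines per frame, including vertical blanking
    refresh_hz: float

    @property
    def pixel_clock_hz(self) -> float:
        """Pixel clock = total pixels per frame x frames per second."""
        return self.h_total * self.v_total * self.refresh_hz

    @property
    def tmds_gbps(self) -> float:
        """Raw TMDS rate: 3 channels, 10 bits on the wire per pixel each."""
        return self.pixel_clock_hz * 10 * 3 / 1e9


# CTA-861 total timings for a few common modes (assumed examples).
MODES = [
    VideoMode("1080p30", 2200, 1125, 30),
    VideoMode("1080p60", 2200, 1125, 60),
    VideoMode("2160p30", 4400, 2250, 30),
    VideoMode("2160p60", 4400, 2250, 60),
]

if __name__ == "__main__":
    for mode in MODES:
        print(
            f"{mode.name}: pixel clock {mode.pixel_clock_hz / 1e6:.2f} MHz, "
            f"TMDS {mode.tmds_gbps:.3f} Gbit/s"
        )
```

Halving the refresh rate halves the bit rate on the wire, so a long passive cable that is marginal at 60 Hz may behave at 30 Hz. That would make dropping the refresh rate a cheap test before buying shorter or active cables.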

Link to comment
