
Windows VM or Server issue Since Bios update [SOLVED]



I have attached the diagnostic log and am looking for any help identifying what is causing the server to crash. I can see the crash in the log, but I can't figure out what it is. As best I can tell, everything was working fine until the BIOS update.

 

Many Thanks,

 

rackserver-diagnostics-20230708-1922.zip

Edited by Big Wig

Sorry, I had attached the wrong diagnostics. The error is about halfway down the syslog.

I will also add that I left the server running overnight with the Windows VM off and it did not crash. With the Windows 11 VM on, I usually get anywhere from 10 minutes to 2 hours before a crash. I just started the VM this morning and it's been up for about an hour now.

Quote

Jul  8 18:48:53 RackServer kernel: [Hardware Error]: event severity: fatal
Jul  8 18:48:53 RackServer kernel: [Hardware Error]:  Error 0, type: fatal
Jul  8 18:48:53 RackServer kernel: [Hardware Error]:  fru_text: PcieError
Jul  8 18:48:53 RackServer kernel: [Hardware Error]:   section_type: PCIe error
Jul  8 18:48:53 RackServer kernel: [Hardware Error]:   port_type: 4, root port
Jul  8 18:48:53 RackServer kernel: [Hardware Error]:   version: 0.2
Jul  8 18:48:53 RackServer kernel: [Hardware Error]:   command: 0x0003, status: 0x0010
Jul  8 18:48:53 RackServer kernel: [Hardware Error]:   device_id: 0000:20:03.1
Jul  8 18:48:53 RackServer kernel: [Hardware Error]:   slot: 0
Jul  8 18:48:53 RackServer kernel: [Hardware Error]:   secondary_bus: 0x2d
Jul  8 18:48:53 RackServer kernel: [Hardware Error]:   vendor_id: 0x1022, device_id: 0x1483
Jul  8 18:48:53 RackServer kernel: [Hardware Error]:   class_code: 060400
Jul  8 18:48:53 RackServer kernel: [Hardware Error]:   bridge: secondary_status: 0x2000, control: 0x0010
Jul  8 18:48:53 RackServer kernel: [Hardware Error]:   aer_uncor_status: 0x00004000, aer_uncor_mask: 0x04000000
Jul  8 18:48:53 RackServer kernel: [Hardware Error]:   aer_uncor_severity: 0x00476030
Jul  8 18:48:53 RackServer kernel: [Hardware Error]:   TLP Header: 00000000 00000000 00000000 00000000
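
For anyone reading the quoted block: the `aer_uncor_status`, `aer_uncor_mask`, and `aer_uncor_severity` values are bitfields, and they can be decoded by hand. Below is a minimal sketch that does so, assuming the standard PCIe AER Uncorrectable Error Status bit layout. The helper name `decode_aer_uncor` is made up for illustration and is not part of any tool.

```python
# Bit positions assumed from the standard PCIe AER
# Uncorrectable Error Status register layout.
AER_UNCOR_BITS = {
    4: "Data Link Protocol Error",
    5: "Surprise Down Error",
    12: "Poisoned TLP",
    13: "Flow Control Protocol Error",
    14: "Completion Timeout",
    15: "Completer Abort",
    16: "Unexpected Completion",
    17: "Receiver Overflow",
    18: "Malformed TLP",
    19: "ECRC Error",
    20: "Unsupported Request",
    21: "ACS Violation",
    22: "Uncorrectable Internal Error",
    23: "MC Blocked TLP",
    24: "AtomicOp Egress Blocked",
    25: "TLP Prefix Blocked Error",
    26: "Poisoned TLP Egress Blocked",
}


def decode_aer_uncor(status, mask, severity):
    """Hypothetical helper: list each error bit set in `status`,
    noting whether it is masked and whether it is classed as fatal."""
    results = []
    for bit, name in sorted(AER_UNCOR_BITS.items()):
        flag = 1 << bit
        if status & flag:
            results.append({
                "bit": bit,
                "name": name,
                "masked": bool(mask & flag),
                "fatal": bool(severity & flag),
            })
    return results


# Values copied from the quoted log.
decoded = decode_aer_uncor(0x00004000, 0x04000000, 0x00476030)
for entry in decoded:
    print(entry)
```

With the values from the log, the only status bit set is bit 14 (Completion Timeout), it is not masked, and the severity register marks it fatal. That reading suggests the root port at `0000:20:03.1` did not get a reply in time from something on secondary bus `0x2d`, but on its own it does not say which downstream device was at fault.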

Thanks,

 

rackserver-diagnostics-20230708-1922(1).zip

Edited by Big Wig
Add additional comment

Whatever is going on is causing the whole server to crash and reboot. This morning it lasted about 1 1/2 hours with the Windows VM running, and then boom, it went down. I went into the BIOS, changed the video card PCIe slot from Auto to Gen 3, and rebooted. The Windows 11 VM lasted about 5 minutes and then the entire system crashed. The log did not show anything related to that crash, but I will attach it anyway.

I went back into the BIOS and changed the PCIe setting to Gen 4 and will see what happens. I'm not even sure I'm on the right track, but I believe it has something to do with the GPU.

rackserver-diagnostics-20230709-0825.zip


Update:

After the reboot I started the Windows 11 VM and decided to run the AIDA64 stability test on both CPU and GPU. I ran two 5-minute tests (see attached), and 1 minute after they finished the system crashed. I also put the PCIe setting back to Auto in the BIOS. After the next upcoming crash :) I will try installing an NVIDIA GPU and see what happens.

 

Thanks,

 

stabilitytest.png



Well, I did not change the GPU, but I did a fresh install of both Windows 11 and Ubuntu. The system crashes with either VM running, but with no regularity. I have attached both logs from today's crash at approximately 12:15; I happened to walk in while it was rebooting. I truly appreciate the help! I'm at a loss as to what is going on.

 

Many Thanks,

rackserver-diagnostics-20230710-1244.zip syslog-192.168.1.220.log

  • Big Wig changed the title to Windows VM or Server issue Since Bios update [SOLVED]

Well, I rolled back the BIOS, which cut down on the crashes but did not stop them. I then left the server running without any active VMs and got the same PCIe error, but this time the system corrected the problem, which led me down another road of questions and answers.
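
In case it helps anyone comparing fatal and corrected events like this, here is a rough sketch of pulling each `[Hardware Error]` event's severity and reporting device out of a syslog. It assumes the line format shown in the log quoted earlier in this thread. The function name `summarize_hardware_errors` and both regexes are my own illustration, not an existing tool.

```python
import re

# Matches the first line of an event, e.g.
# "Jul  8 18:48:53 RackServer kernel: [Hardware Error]: event severity: fatal"
EVENT_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s[\d:]+)\s+\S+\s+kernel:\s+"
    r"\[Hardware Error\]:\s+event severity:\s+(?P<sev>\w+)"
)
# Matches the PCI address line, e.g. "device_id: 0000:20:03.1".
# The "vendor_id: ..., device_id: 0x1483" line is deliberately not matched.
DEVICE_RE = re.compile(
    r"\[Hardware Error\]:\s+device_id:\s+"
    r"(?P<bdf>[0-9a-fA-F]{4}:[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-7])"
)


def summarize_hardware_errors(lines):
    """Hypothetical helper: return one dict per hardware error event."""
    events = []
    for line in lines:
        event = EVENT_RE.search(line)
        if event:
            events.append({
                "time": event.group("ts"),
                "severity": event.group("sev"),
                "device": None,
            })
            continue
        device = DEVICE_RE.search(line)
        if device and events and events[-1]["device"] is None:
            events[-1]["device"] = device.group("bdf")
    return events


# Sample lines copied from the log quoted earlier in the thread.
sample = [
    "Jul  8 18:48:53 RackServer kernel: [Hardware Error]: event severity: fatal",
    "Jul  8 18:48:53 RackServer kernel: [Hardware Error]:  fru_text: PcieError",
    "Jul  8 18:48:53 RackServer kernel: [Hardware Error]:   device_id: 0000:20:03.1",
    "Jul  8 18:48:53 RackServer kernel: [Hardware Error]:   vendor_id: 0x1022, device_id: 0x1483",
]
events = summarize_hardware_errors(sample)
for event in events:
    print(event["time"], event["severity"], event["device"])
```

Run over a whole syslog (for example `summarize_hardware_errors(open(path))`, with `path` pointing at your own file), this lets you see whether the same device address shows up in both the fatal and the corrected events.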

 

The only two changes I made to the server after the rollback are as follows:

 

1) In the BIOS under Onboard Device Configuration, I changed both the U.2_1 Mode and U.2_2 Mode from PCIE to SATA, since I was not using them.

 

2) My SAS card had a 4-drive bay connected to it that held 2 SSDs and 2 HDDs. I went ahead and removed the 2 SSDs. I will add that the SSDs had been hooked up for the past year without any issues.

 

Everything is running great now.

Many Thanks,

 

