Big Wig Posted July 9, 2023 (edited)
I have attached the diagnostic log and am looking for help identifying what is causing the server to crash. I can see the crash in the log, but I can't figure out what it means. As best I can tell, everything was working fine until the BIOS update. Many thanks,
rackserver-diagnostics-20230708-1922.zip
Edited July 17, 2023 by Big Wig
JorgeB Posted July 9, 2023
9 hours ago, Big Wig said: "I can see the crash identified in the log"
Where?
Big Wig Posted July 9, 2023 (edited)
Sorry, I had attached the wrong diagnostics. The crash is about halfway down the syslog. I will also add that I left the server running overnight with the Windows VM off and it did not crash. With the Windows 11 VM on, I usually get anywhere from 10 minutes to 2 hours before a crash. I just started the VM this morning and it's been up for about an hour now.
Jul 8 18:48:53 RackServer kernel: [Hardware Error]: event severity: fatal
Jul 8 18:48:53 RackServer kernel: [Hardware Error]: Error 0, type: fatal
Jul 8 18:48:53 RackServer kernel: [Hardware Error]: fru_text: PcieError
Jul 8 18:48:53 RackServer kernel: [Hardware Error]: section_type: PCIe error
Jul 8 18:48:53 RackServer kernel: [Hardware Error]: port_type: 4, root port
Jul 8 18:48:53 RackServer kernel: [Hardware Error]: version: 0.2
Jul 8 18:48:53 RackServer kernel: [Hardware Error]: command: 0x0003, status: 0x0010
Jul 8 18:48:53 RackServer kernel: [Hardware Error]: device_id: 0000:20:03.1
Jul 8 18:48:53 RackServer kernel: [Hardware Error]: slot: 0
Jul 8 18:48:53 RackServer kernel: [Hardware Error]: secondary_bus: 0x2d
Jul 8 18:48:53 RackServer kernel: [Hardware Error]: vendor_id: 0x1022, device_id: 0x1483
Jul 8 18:48:53 RackServer kernel: [Hardware Error]: class_code: 060400
Jul 8 18:48:53 RackServer kernel: [Hardware Error]: bridge: secondary_status: 0x2000, control: 0x0010
Jul 8 18:48:53 RackServer kernel: [Hardware Error]: aer_uncor_status: 0x00004000, aer_uncor_mask: 0x04000000
Jul 8 18:48:53 RackServer kernel: [Hardware Error]: aer_uncor_severity: 0x00476030
Jul 8 18:48:53 RackServer kernel: [Hardware Error]: TLP Header: 00000000 00000000 00000000 00000000
Thanks,
rackserver-diagnostics-20230708-1922(1).zip
Edited July 9, 2023 by Big Wig: Add additional comment
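For reference, the aer_uncor_status and aer_uncor_severity values in the log above can be decoded bit by bit. The following is a minimal sketch, assuming the standard bit layout of the PCIe AER Uncorrectable Error Status register (check it against the PCIe specification revision for your platform). The table and the function name decode_aer_uncor are illustrative and do not come from any tool mentioned in this thread.

```python
# Bit positions of the PCIe AER Uncorrectable Error Status register.
# Assumption: standard layout from the PCIe base specification.
AER_UNCOR_BITS = {
    4: "Data Link Protocol Error",
    5: "Surprise Down Error",
    12: "Poisoned TLP",
    13: "Flow Control Protocol Error",
    14: "Completion Timeout",
    15: "Completer Abort",
    16: "Unexpected Completion",
    17: "Receiver Overflow",
    18: "Malformed TLP",
    19: "ECRC Error",
    20: "Unsupported Request",
    21: "ACS Violation",
    22: "Uncorrectable Internal Error",
}


def decode_aer_uncor(status, severity=0):
    """Return (error name, 'fatal' or 'non-fatal') for each bit set in status.

    A bit that is also set in the severity register is reported as fatal.
    """
    results = []
    for bit, name in sorted(AER_UNCOR_BITS.items()):
        if status & (1 << bit):
            level = "fatal" if severity & (1 << bit) else "non-fatal"
            results.append((name, level))
    return results


# Values copied from the log excerpt above.
for name, level in decode_aer_uncor(0x00004000, 0x00476030):
    print(f"{name} ({level})")
```

Under that assumed layout, status 0x00004000 sets only bit 14, so the root port at 0000:20:03.1 reported a completion timeout: a device below it did not answer a request in time. The decode names the error type but does not by itself identify the device responsible.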
JorgeB Posted July 9, 2023
Do you know what time it crashed? Did only the VM crash, or did the whole server crash?
Big Wig Posted July 9, 2023
Whatever is going on is causing the whole server to crash and reboot. This morning it lasted about 1 1/2 hours with the Windows VM running, and then boom, it went down. I went into the BIOS, changed the video card's PCIe slot from Auto to Gen 3, and rebooted. The Windows 11 VM lasted about 5 minutes and then the entire system crashed. The log did not show anything related to that crash, but I will attach it anyway. I went back into the BIOS and changed the PCIe setting to Gen 4 and will see what happens. I'm not even sure I'm on the right track, but I believe it has something to do with the GPU.
rackserver-diagnostics-20230709-0825.zip
Big Wig Posted July 9, 2023
Update: After the reboot I started the Windows 11 VM again and ran the AIDA64 stability test on both the CPU and the GPU. I ran two 5-minute tests (see attached), and 1 minute after they finished the system crashed. I also put the PCIe setting back to Auto in the BIOS. After the next crash I will try installing an NVIDIA GPU and see what happens. Thanks,
Big Wig Posted July 9, 2023
Well, that didn't take long to crash. See attached for the latest server crash. I'm going to install another GPU and let you know.
rackserver-diagnostics-20230709-0928.zip
JorgeB Posted July 10, 2023
The syslog starts over after every boot, so if the server is crashing there won't be much to see. You can enable the syslog server and post that log to see if it catches something.
Big Wig Posted July 10, 2023
Well, I did not change the GPU, but I did a fresh install of both Windows 11 and Ubuntu. The system crashes with either VM, but with no regularity. I have attached both logs from today's crash at approximately 12:15; I happened to walk in while it was rebooting. I truly appreciate the help! I'm at a loss as to what is going on. Many thanks,
rackserver-diagnostics-20230710-1244.zip
syslog-192.168.1.220.log
JorgeB Posted July 10, 2023
16 minutes ago, Big Wig said: "at approx 12:15"
Are these after the crash?
Jul 10 12:36:53 RackServer kernel: WARNING: CPU: 31 PID: 15638 at drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c:656 amdgpu_irq_put+0x4e/0x90 [amdgpu]
amdgpu issues are the only ones I see logged.
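When comparing a saved syslog against a known crash time, it can help to pull out only the kernel hardware errors and warnings along with their timestamps. Here is a minimal sketch, assuming the log uses the same "Mon D HH:MM:SS host ..." layout as the excerpts in this thread (a remote syslog server may write timestamps differently). The function name flag_kernel_events and the third sample line are hypothetical.

```python
import re

# Markers taken from the log excerpts quoted in this thread; extend as needed.
MARKERS = re.compile(r"\[Hardware Error\]|WARNING:")


def flag_kernel_events(lines):
    """Yield (timestamp, message) for each syslog line containing a marker."""
    for line in lines:
        if not MARKERS.search(line):
            continue
        parts = line.split(None, 3)  # month, day, time, remainder
        if len(parts) < 4:
            continue  # not in the assumed layout, skip it
        yield " ".join(parts[:3]), parts[3].rstrip()


sample = [
    "Jul 8 18:48:53 RackServer kernel: [Hardware Error]: event severity: fatal",
    "Jul 10 12:36:53 RackServer kernel: WARNING: CPU: 31 PID: 15638 at "
    "drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c:656 amdgpu_irq_put+0x4e/0x90 [amdgpu]",
    # Hypothetical benign line, included only to show that it is filtered out.
    "Jul 10 12:40:00 RackServer kernel: example informational message",
]

for timestamp, message in flag_kernel_events(sample):
    print(timestamp, "|", message)
```

To run it on a real file, pass an open file object instead of the sample list, for example the syslog-192.168.1.220.log attachment. An empty result for the minutes before the crash is consistent with a hard hang or reset that the kernel never got to log.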
Big Wig Posted July 10, 2023
Yes, the crash would have been before 12:36. It was very close to 12:15, as I looked at my watch. Since you mentioned the time, I also checked the time setting on the server to make sure it was correct, and it is. Thanks,
JorgeB Posted July 11, 2023
In that case, and since there's nothing relevant logged, it looks more like a hardware issue. Can the BIOS be downgraded to the previous version?
Big Wig Posted July 11, 2023
Just rolled it back. I will see what happens today and report back. Thanks,
Big Wig Posted July 17, 2023
Well, I rolled back the BIOS and it cut down on the crashes but did not stop them. I then left the server running without any active VMs and got the same PCIe error, but this time the system corrected the problem, which led me down another road of questions and answers. The only two changes I made to the server after the rollback are as follows:
1) In the BIOS under Onboard Device Configuration, I changed both U.2_1 Mode and U.2_2 Mode from PCIE to SATA since I was not using them.
2) My SAS card had a 4-drive bay connected to it with 2 SSDs and 2 HDDs. I went ahead and removed the 2 SSDs. I will add that the SSDs had been hooked up for the past year without any issues.
Everything is running great now. Many thanks,