Random reboots and MCE Hardware error



Hello,

In the past month I have started getting random reboots and MCE hardware errors in my log. I am running an Intel i9-10850K on an ASRock Z590 Pro4 with 64 GB of RAM. I have an HBA card and an Nvidia P2000 installed and nothing else. The last time I posted about this I swapped out my HBA card, thinking it was the cause since drives were disappearing. I also ran a memtest at that time and it passed. Since then the errors have continued.

 

I have attached my diagnostics from after the latest reboot, which happened last night (Apr 8th, around 1 AM it seems). The server was nearing completion of a parity sync when the reboot occurred. I have seen this behavior at least twice before. I have also had the syslog server running for the past couple of weeks to see if I can catch anything.

 

I am starting to wonder if the motherboard has gone bad somehow. The HBA card issue happened on PCIe x16 slot 1, and since replacing the HBA card with a brand new one there have been no drive issues. However, since then, one of the hardware errors/reboots was preceded by a transcoder error with my Nvidia GPU on PCIe x16 slot 2. The latest reboot does not show either of the errors I have seen before.

 

Some help would be greatly appreciated.

unraid-diagnostics-20230408-0907.zip syslog-10.10.20.11.log

Edited by flallnatural
5 minutes ago, flallnatural said:

i9-10850k

First thing to do is to ensure that your TDP is set correctly within the BIOS, and NOT left on Auto. In your case it should be set to 125W. "Auto" will usually run your CPU in an inherent overclock situation where it completely ignores the TDP limits of your processor.
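If you want to confirm from the running system that the BIOS change took effect, here is a rough Python sketch. It assumes the kernel's `intel_rapl` powercap driver is loaded and exposes the package long-term limit at the sysfs path below, which may not be the case on every build. The helper names (`read_limit_watts`, `limit_matches`) are made up for illustration.

```python
from pathlib import Path

# Assumed sysfs location of the package long-term power limit, in microwatts.
# It only exists when the intel_rapl powercap driver is loaded.
RAPL_LIMIT = Path("/sys/class/powercap/intel-rapl:0/constraint_0_power_limit_uw")


def read_limit_watts(path=RAPL_LIMIT):
    """Return the configured power limit in watts, or None if it cannot be read."""
    try:
        return int(Path(path).read_text().strip()) / 1_000_000
    except (OSError, ValueError):
        return None


def limit_matches(watts, expected=125.0, tolerance=1.0):
    """True when the reported limit is within `tolerance` watts of `expected`."""
    return watts is not None and abs(watts - expected) <= tolerance
```

Calling `limit_matches(read_limit_watts())` should come back `True` once the limit is pinned at 125W. A much larger number (or `None`) suggests the board is still on its own defaults or that the driver is not available.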

4 minutes ago, Squid said:

First thing to do is to ensure that your TDP is set correctly within the BIOS, and NOT left on Auto. In your case it should be set to 125W. "Auto" will usually run your CPU in an inherent overclock situation where it completely ignores the TDP limits of your processor.

 

Hm, OK. I've had it on Auto for well over a year now with no issues, and the CPU ratio limit is on Auto too. I will change that first thing. Any other recommendations?

 

I'm not sure how to test for these hardware errors after making changes, other than letting the server run and waiting for a reboot. It always seems stable for a while, with everything working great, and then boom, reboot.
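One thing I can at least do in the meantime is check the remote syslog for machine check entries instead of waiting for a reboot. Below is a minimal Python sketch. It assumes the kernel tags these lines with `mce: [Hardware Error]` or "Machine check events logged", which is how they have appeared in my log, so the pattern may need adjusting. The function names are just illustrative.

```python
import re

# Assumed markers for machine check entries in the kernel log; adjust as needed.
MCE_PATTERN = re.compile(
    r"mce: \[Hardware Error\]|Machine check events logged", re.IGNORECASE
)


def find_mce_lines(lines):
    """Return (line_number, text) pairs for every line that looks like an MCE entry."""
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(lines, start=1)
        if MCE_PATTERN.search(line)
    ]


def scan_file(path):
    """Scan a syslog file on disk, tolerating any undecodable bytes."""
    with open(path, errors="replace") as handle:
        return find_mce_lines(handle)
```

Running `scan_file("syslog-10.10.20.11.log")` against the file from the syslog server lists each matching line with its line number, which makes it easier to see what was logged just before each one.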

38 minutes ago, Squid said:

Stop overclocking your memory via the XMP profile. Run it instead at the speed you actually bought: 2133, not 3600 (i.e. run at SPD speed, not XMP). Also run memtest for a minimum of a pass or two.

 

 

Alright, I updated the BIOS to the latest version and, starting from stock settings, set the power limit and disabled XMP. No other changes in the BIOS. Running memtest now, so let's see what happens.
