Server Crashing every 30 min-hour, Root Cause is a Mystery


Solved by NeonMinne


Hey folks! I've got a server that's now crashing every few minutes (not sure why) and immediately rebooting.

 

I recently upgraded the CPU, RAM, motherboard, and PSU, and it was humming along fine until today. I turned on "Mirror Syslog to Flash" and checked the file after the most recent crash/reboot cycle. The last message before the reboot is never an error; the server just seems to reboot at random without logging anything.
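For anyone wanting to repeat that check across several crash cycles, here is a minimal Python sketch. It assumes the mirrored syslog contains the kernel's "Linux version" banner at the start of each boot, and it prints nothing by itself; the function name and the `context` parameter are made up for illustration.

```python
from typing import List

# Assumption: each boot starts with the kernel banner, which contains this text.
BOOT_MARKER = "Linux version "


def lines_before_each_boot(log_lines: List[str], context: int = 5) -> List[List[str]]:
    """Return the last `context` lines logged before each boot banner.

    A banner on the very first line is skipped, since nothing precedes it.
    """
    tails = []
    for i, line in enumerate(log_lines):
        if BOOT_MARKER in line and i > 0:
            # Slice ends just before the banner, so only pre-reboot lines are kept.
            tails.append(log_lines[max(0, i - context):i])
    return tails


# Example use (the path is a placeholder, point it at your own mirrored syslog):
#   with open("syslog", errors="replace") as fh:
#       for tail in lines_before_each_boot(fh.read().splitlines()):
#           print("\n".join(tail), "\n---")
```

If every tail ends in ordinary, non-error messages, that matches what is described above: the box is resetting without the kernel getting a chance to log anything.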

 

I think this is more likely PSU-related than anything else (it seemed to have issues before, but never this specific one), but can anyone confirm or point me in another direction? It's really starting to worry me that something is wrong.

fenrir-diagnostics-20240105-1239.zip

1 hour ago, snowboardjoe said:

When it crashes, is it just automatically rebooting or stuck waiting for user intervention? If stuck, anything on the console?

 

It automatically reboots. I'm not even sure it's fully crashing, as it seems to reboot super fast, but it takes the whole system down and resets the uptime timer.


Little update:

 

  1. Swapped the PSU for an old, known-working one
  2. Some settings (mostly around Docker) are sometimes reverting or changing on reboot:
    1. It "forgot" that I set it to IPVLAN
    2. It "forgot" that I set it to use a folder instead of a vdisk

Still not sure what's going on. Memcheck also returned good values, temps are normal, and the CPU cooler is seated properly.


Sorta using this to document troubleshooting in case anyone else hits this issue.

 

After more reboots with no errors at the end of the syslog, I double-checked that all the AMD C-state settings were off, as documented in this thread:

 

Unfortunately, that didn't help. It seemed stable at first, but then it crashed and rebooted again.

I started checking toward the beginning of the logs (in case anything relevant showed up there) and found

 

Quote

mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 5:

listed as an error. Googling this basically returns "everything under the sun could be the issue, but check RAM".
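Since that line was buried near the start of the log rather than at the end, a quick scan helps. This is a rough Python sketch that only matches lines shaped like the one quoted above; the regex and the function name are my own, and other kernels may word machine-check messages differently.

```python
import re
from typing import Iterable, List, Tuple

# Matches lines shaped like the one quoted above:
#   mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 5:
MCE_RE = re.compile(
    r"mce: \[Hardware Error\]: CPU (\d+): Machine Check: \d+ Bank (\d+)"
)


def find_machine_checks(lines: Iterable[str]) -> List[Tuple[int, int, int]]:
    """Return (line_number, cpu, bank) for each machine-check header line."""
    hits = []
    for lineno, line in enumerate(lines, 1):  # 1-based, like a text editor
        match = MCE_RE.search(line)
        if match:
            hits.append((lineno, int(match.group(1)), int(match.group(2))))
    return hits


# Example use (placeholder path):
#   with open("syslog", errors="replace") as fh:
#       for lineno, cpu, bank in find_machine_checks(fh):
#           print(f"line {lineno}: CPU {cpu}, bank {bank}")
```

It only tells you where the machine checks were logged and for which CPU and bank; it does not decode what the bank means on a given processor.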

At this point I'm running an extended MemTest to see if I can't find the little bug. If not, I honestly don't know next steps aside from trying to RMA the board.

IMG_1462.PNG

Edited by NeonMinne
Added screenshot
  • 2 weeks later...
  • Solution

Oh my god I figured it out.

 

Long story short, I didn't need any new hardware (anyone want some lightly used RAM/Mobo/PSU 😅?).

 

The issue came down to an apparent bug between AMD Ryzen CPUs and LSI HBA cards (Gen2 version). Apparently the CPU and HBA would lose contact with each other, causing a hard reset. I had to turn off autonegotiation and manually set the lanes in my mobo's PCI-E settings, and voila, it worked. I've been hammering it with reads/writes to test, with no crashes, so this seems to be the fix.
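If you want to see what link each PCIe device actually negotiated, Linux exposes it in sysfs. The sketch below is a rough Python helper (function names are made up) that lists devices whose current link speed or width differs from the maximum they report. It assumes the `current_link_speed`, `max_link_speed`, `current_link_width`, and `max_link_width` files exist under `/sys/bus/pci/devices`, and it skips devices that lack them.

```python
from pathlib import Path
from typing import Dict, List

ATTRS = (
    "current_link_speed",
    "max_link_speed",
    "current_link_width",
    "max_link_width",
)


def read_attr(dev: Path, name: str) -> str:
    """Read one sysfs attribute, returning '' if it is missing or unreadable."""
    try:
        return (dev / name).read_text().strip()
    except OSError:
        return ""


def downtrained_links(root: str = "/sys/bus/pci/devices") -> List[Dict[str, str]]:
    """List devices whose negotiated link differs from their reported maximum."""
    results: List[Dict[str, str]] = []
    base = Path(root)
    if not base.is_dir():
        return results
    for dev in sorted(base.iterdir()):
        info = {name: read_attr(dev, name) for name in ATTRS}
        if not all(info.values()):
            continue  # device does not expose PCIe link attributes
        if (info["current_link_speed"] != info["max_link_speed"]
                or info["current_link_width"] != info["max_link_width"]):
            info["device"] = dev.name  # PCI address, e.g. 0000:01:00.0
            results.append(info)
    return results


# Example use:
#   for link in downtrained_links():
#       print(link)
```

A mismatch is not proof of a problem, since some devices drop link speed on purpose to save power, and a matching link does not prove the HBA is stable. It is just a quick way to confirm what the BIOS setting actually resulted in.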

 

Thanks to TheArtofServer for tipping me off: https://www.youtube.com/watch?v=b0fAKG3qa6Q

