NeonMinne Posted January 5

Hey folks! Got a server that is crashing every few minutes now (not sure why) and immediately rebooting. I upgraded the CPU/RAM/MOBO/PSU recently, and it was humming along fine until today.

I turned on "Mirror Syslog to Flash" and checked the file after the most recent crash/reboot cycle. The last message is never an error; it just seems to reboot randomly without logging anything.

I think this is more PSU related than anything (it seemed to have issues before, but never this specific issue), but can anyone confirm or point me in another direction? It's really starting to worry me that something is wrong.

fenrir-diagnostics-20240105-1239.zip
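For anyone repeating the "check the last message before the reboot" step, here is a minimal Python sketch of one way to do it. It assumes the mirrored syslog keeps appending across boots and that each boot starts with a kernel line containing "Linux version " (the kernel's usual first message). The function name `lines_before_boots` and the `context` parameter are made up for illustration.

```python
import sys
from typing import List, Sequence

# Assumption: each boot begins with the kernel's "Linux version ..." banner.
BOOT_MARKER = "Linux version "


def lines_before_boots(lines: Sequence[str], context: int = 5) -> List[List[str]]:
    """Return, for every boot banner after the first line, the `context`
    lines that were logged immediately before it."""
    result: List[List[str]] = []
    for i, line in enumerate(lines):
        if BOOT_MARKER in line and i > 0:
            start = max(0, i - context)
            result.append([prev.rstrip("\n") for prev in lines[start:i]])
    return result


if __name__ == "__main__" and len(sys.argv) == 2:
    # Usage: python last_lines.py <path-to-mirrored-syslog>
    with open(sys.argv[1], errors="replace") as fh:
        for n, block in enumerate(lines_before_boots(fh.readlines()), start=1):
            print(f"--- before boot #{n} ---")
            print("\n".join(block))
```

If every block ends in ordinary, unremarkable messages, that matches the behaviour described above: the machine resets without the OS getting a chance to log anything.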
itimpi Posted January 5

An unexpected reboot is nearly always hardware. The commonest cause would be inadequate power, with CPU overheating probably being next on the list.
snowboardjoe Posted January 5

When it crashes, is it just automatically rebooting or stuck waiting for user intervention? If stuck, anything on the console?
NeonMinne (Author) Posted January 5

1 hour ago, snowboardjoe said:
    When it crashes, is it just automatically rebooting or stuck waiting for user intervention? If stuck, anything on the console?

It automatically reboots. I'm not even sure if it's fully crashing, as it seems to reboot super fast, but it takes the whole system down and resets the uptime timer.
NeonMinne (Author) Posted January 6

Little update:

- Swapped the PSU for an old, known-working one.
- Some settings (mostly around Docker) are sometimes reverting or changing on reboot:
  - It "forgot" that I set it to IPVLAN.
  - It "forgot" I set it to use a folder instead of a vdisk.

Still not sure what's going on. Memcheck also returned good values, temps are normal, and the CPU cooler is seated properly.
Valen Posted January 6

Verify your RAM. Use memtest to do so.
NeonMinne (Author) Posted January 6

16 minutes ago, Valen said:
    verify your ram. use memtest to do so

Sorry, that's what I meant above. Memtest returned no errors (granted, only 1 run). Should I run more?
NeonMinne (Author) Posted January 8 (edited)

Sorta using this to document troubleshooting in case anyone else hits this issue.

After more reboots and no errors at the end of the syslog, I double-checked that all the AMD C-state settings were off, as documented in this thread:

Unfortunately, that doesn't help. It seemed stable at first, but then it crashes and reboots again.

I started checking towards the beginning of the logs (in case anything relevant showed up there) and found this error listed:

    mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 5:

Googling this basically returns "everything under the sun could be the issue, but check RAM". At this point I'm running an extended MemTest to see if I can't find the little bug. If not, I honestly don't know the next steps aside from trying to RMA the board.

Edited January 8 by NeonMinne: added screenshot
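Since the machine-check line above was buried near the start of the log rather than at the end, a small filter can save some scrolling. This is only a sketch: it does a plain substring match on the `mce: [Hardware Error]` prefix seen in the post, and the function name `find_mce_lines` is hypothetical.

```python
import sys
from typing import Iterable, List, Tuple

# Prefix of the kernel's machine-check messages, as seen in the post above.
MCE_MARKER = "mce: [Hardware Error]"


def find_mce_lines(lines: Iterable[str]) -> List[Tuple[int, str]]:
    """Return (1-based line number, text) for every machine-check line."""
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(lines, start=1)
        if MCE_MARKER in line
    ]


if __name__ == "__main__" and len(sys.argv) == 2:
    # Usage: python find_mce.py <path-to-syslog>
    with open(sys.argv[1], errors="replace") as fh:
        for number, text in find_mce_lines(fh):
            print(f"{number}: {text}")
```

It only locates the lines; working out what a given bank number means for a particular CPU is a separate job.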
NeonMinne (Author, Solution) Posted January 19

Oh my god, I figured it out. Long story short, I didn't need any new hardware (anyone want some lightly used RAM/Mobo/PSU 😅?).

The issue came down to an apparent bug between AMD Ryzen and LSI HBA cards (Gen2 version). Apparently the CPU and HBA would lose contact with each other, causing a hard reset. I had to turn off autonegotiation and manually set the lanes in my mobo's PCI-E settings, and voila, it worked. I've been hammering it with reads/writes to test, with no crashes, so this seems to be the fix.

Thanks to TheArtofServer for tipping me off: https://www.youtube.com/watch?v=b0fAKG3qa6Q
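The fix itself lives in the BIOS, but on Linux you can at least see what each PCIe link negotiated. The sketch below reads the standard sysfs attributes `current_link_speed`, `max_link_speed`, `current_link_width` and `max_link_width` under `/sys/bus/pci/devices` and lists devices running below their maximum. It is an illustration, not the method used in this thread; the helper names are made up, and a link running below its maximum is not by itself proof of a fault.

```python
from pathlib import Path
from typing import NamedTuple, Optional


class LinkStatus(NamedTuple):
    current_speed: float  # GT/s
    max_speed: float      # GT/s
    current_width: int    # lanes
    max_width: int        # lanes


def _read(path: Path) -> Optional[str]:
    """Read a sysfs attribute, or return None if it is missing or unreadable."""
    try:
        return path.read_text().strip()
    except OSError:
        return None


def _speed_gts(text: str) -> float:
    # sysfs reports strings such as "8.0 GT/s PCIe"; keep the leading number.
    return float(text.split()[0])


def read_link_status(device_dir: Path) -> Optional[LinkStatus]:
    """Parse the four link attributes of one PCI device directory."""
    names = ("current_link_speed", "max_link_speed",
             "current_link_width", "max_link_width")
    raw = [_read(device_dir / name) for name in names]
    if any(value is None for value in raw):
        return None  # not a PCIe device, or attributes not exposed
    try:
        return LinkStatus(_speed_gts(raw[0]), _speed_gts(raw[1]),
                          int(raw[2]), int(raw[3]))
    except (ValueError, IndexError):
        return None  # e.g. the kernel reported an unknown speed


def is_downgraded(status: LinkStatus) -> bool:
    """True if the link runs slower or narrower than the device supports."""
    return (status.current_speed < status.max_speed
            or status.current_width < status.max_width)


if __name__ == "__main__":
    root = Path("/sys/bus/pci/devices")
    if root.is_dir():
        for dev in sorted(root.iterdir()):
            status = read_link_status(dev)
            if status is not None and is_downgraded(status):
                print(dev.name, status)
```

Running it before and after changing the PCI-E settings would show whether the HBA's link speed and width actually changed.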