
Server crash during stream. Machine Check Events detected



Hey everyone,

 

My server just crashed while I was streaming on Jellyfin, but it was a strange crash because the server didn't seem to fully restart. I'm not sure how to explain it, but I didn't hear it go through POST, with the chassis fans ramping up and then back down. I just heard the fans spin up for a moment and then settle again. However, Unraid did have to boot back up, and I was able to connect to it once it was showing the IP address.

 

The server is currently running a parity check, but Fix Common Problems shows the following error: "Machine Check Events detected on your server". The description reads: "Your server has detected hardware errors. You should install mcelog via the NerdPack plugin, post your diagnostics and ask for assistance on the Unraid forums. The output of mcelog (if installed) has been logged".

 

I tried to get more information on this error by installing NerdPack, but I can't find it in the app store. I believe it was deprecated? There is NerdTools, but I can't find the mcelog tool in it.

 

I have attached the diagnostics from when the UI was accessible, but I am not sure if that would have information as to what happened. Any idea what direction I should move with this?

tower-diagnostics-20230306-2238.zip

Link to comment
7 hours ago, JorgeB said:

Looks like a possible CPU problem:

 

Mar  6 22:36:18 Tower mcelog: CPU 3 on socket 0 received unknown error
Mar  6 22:36:18 Tower mcelog: Location: CPU 3 on socket 0

 

You can try and check the SEL (system event log) to see if there's more info there.

 

Apologies, how would I locate these logs? Are they the ones at http://tower/log/syslog?

 

If so, I have attached it, but I don't see anything that stands out. It seems to contain only entries from after the server reset.
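
For anyone else hunting for the MCE lines in a saved syslog, here is a minimal Python sketch that picks out the mcelog entries. The line format is taken from the two lines JorgeB quoted; the function name `find_mce_lines` and the third sample line are made up for illustration, and other mcelog messages may well be worded differently.

```python
import re

# Matches syslog lines shaped like the ones quoted in this thread, e.g.
# "Mar  6 22:36:18 Tower mcelog: CPU 3 on socket 0 received unknown error"
MCE_LINE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+[\d:]{8})\s+(?P<host>\S+)\s+mcelog:\s+(?P<msg>.*)$"
)
CPU_SOCKET = re.compile(r"CPU (?P<cpu>\d+) on socket (?P<socket>\d+)")


def find_mce_lines(lines):
    """Return one dict per line that was logged by mcelog."""
    events = []
    for line in lines:
        m = MCE_LINE.match(line.rstrip("\n"))
        if not m:
            continue  # not an mcelog line
        event = {"stamp": m["stamp"], "host": m["host"], "msg": m["msg"]}
        loc = CPU_SOCKET.search(m["msg"])
        if loc:
            event["cpu"] = int(loc["cpu"])
            event["socket"] = int(loc["socket"])
        events.append(event)
    return events


sample = [
    "Mar  6 22:36:18 Tower mcelog: CPU 3 on socket 0 received unknown error",
    "Mar  6 22:36:18 Tower mcelog: Location: CPU 3 on socket 0",
    # Hypothetical non-mcelog line, only here to show that it gets skipped.
    "Mar  6 22:36:19 Tower kernel: placeholder line",
]

events = find_mce_lines(sample)
for ev in events:
    print(ev["stamp"], "cpu", ev.get("cpu"), "socket", ev.get("socket"))
```

To run it against a real file, pass an open file object instead of `sample`. Note that a syslog that only starts after the reset will not contain anything logged before the crash.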

syslog

Link to comment
52 minutes ago, JorgeB said:

The SEL if it exists will be in the board BIOS, it can usually also be accessed over IPMI

Perfect, I figured out how to get the IPMI going.

 

The only entry dated this year is the first one below:

Event ID  Time Stamp           Sensor Name  Sensor Type       Description
251       12/04/2023 16:41:04  Unknown      [undefined]       undefined - Asserted
250       09/18/2021 15:05:04  Unknown      OS Critical Stop  undefined - Asserted - Asserted
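
A rough Python sketch for turning rows pasted from the IPMI event log into records, in case it helps to sort or filter them. It assumes columns are separated by two or more spaces (as in the paste above) and that dates are MM/DD/YYYY, which 09/18/2021 implies. The `future_dated` helper and the reference date of 7 March 2023 (the date on JorgeB's reply in this thread) are my own additions. An entry stamped later than the current date might mean the BMC clock is not set correctly, but that is a guess and not something confirmed here.

```python
import re
from datetime import datetime

# Rows as pasted from the IPMI event log page.
SEL_TEXT = """\
251            12/04/2023 16:41:04     Unknown            [undefined]            undefined - Asserted
250           09/18/2021 15:05:04     Unknown            OS Critical Stop     undefined - Asserted - Asserted
"""

FIELDS = ("event_id", "timestamp", "sensor_name", "sensor_type", "description")


def parse_sel(text):
    """Split each data row into the five columns of the event table."""
    rows = []
    for line in text.splitlines():
        parts = re.split(r"\s{2,}", line.strip())
        # Skip blank lines, the header row, or anything else that is not data.
        if len(parts) != len(FIELDS) or not parts[0].isdigit():
            continue
        row = dict(zip(FIELDS, parts))
        row["event_id"] = int(row["event_id"])
        # Assumption: MM/DD/YYYY HH:MM:SS.
        row["timestamp"] = datetime.strptime(row["timestamp"], "%m/%d/%Y %H:%M:%S")
        rows.append(row)
    return rows


def future_dated(rows, now):
    """Return the rows whose timestamp is later than `now`."""
    return [r for r in rows if r["timestamp"] > now]


rows = parse_sel(SEL_TEXT)
# Reference time is an assumption for this example (see the note above).
suspect = future_dated(rows, datetime(2023, 3, 7))
for r in suspect:
    print(r["event_id"], r["timestamp"].isoformat(sep=" "), r["description"])
```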

Link to comment
43 minutes ago, JorgeB said:

That doesn't look like much help. Try running with just CPU 1; it will probably need to be installed in socket 0.

 

Apologies, I don't follow. It's a dual-socket motherboard, and both sockets have an X5670 CPU in them.

 

Are you telling me to swap the CPUs in each socket? This server has been running for at least a couple of years in this config, if that is of any help.

Link to comment

Not hijacking this thread but adding my voice - this same thing happened to me today out of the blue. My server somehow seemed to reboot a few hours ago (though I don't recall hearing it despite sitting right next to it), and I also got MCE alerts. Array was stopped and configuration valid, but unclean shutdown detected. It's currently doing a Parity Check.

Server has been completely rock solid since I built it a few months ago. Been on 6.11.5 for a long time, too.

I am concerned that something else may be going on as there are more than a few reports in the last 24hrs of similar behavior.

Edited by horridwilting
Link to comment
On 3/7/2023 at 12:11 PM, JorgeB said:

No, the error mentions socket 0, but I assume you cannot just remove that CPU, since the board probably requires socket 0 to be populated. If so, remove the CPU from socket 0 and install the other CPU there.

 

Apologies for the delayed response. I noticed that in the IPMI I was unable to see any sensor information at all. Everything was showing as unknown/unavailable.

 

What I have done since is reflash the BIOS and firmware for my motherboard, and now I am seeing proper errors and data from the sensors. I have attached everything I can from the system information screen, plus a new diagnostics output from Unraid.

 

From what I can see, the IPMI log seems to be filling up with a bunch of CPU lines, but I am not 100% sure yet what they mean. I wanted to get the info posted here before trying to work that out.

 

On 3/7/2023 at 3:47 PM, horridwilting said:

Not hijacking this thread but adding my voice - this same thing happened to me today out of the blue. My server somehow seemed to reboot a few hours ago (though I don't recall hearing it despite sitting right next to it), and I also got MCE alerts. Array was stopped and configuration valid, but unclean shutdown detected. It's currently doing a Parity Check.

Server has been completely rock solid since I built it a few months ago. Been on 6.11.5 for a long time, too.

I am concerned that something else may be going on as there are more than a few reports in the last 24hrs of similar behavior.

 

Thank you for the reply. I'm not sure I would put much faith in my equipment, as I am still learning quite a bit and it looks like there is a ton more to learn. So maybe take my post with a grain of salt; it could just be happenstance.

IPMI Event Log.xlsx IPMI Sensor Readings + Threshholds.xlsx tower-diagnostics-20230312-1635.zip

Link to comment

Alright, quick update: I have swapped the CPUs between the two sockets after cleaning them up. I cleared the IPMI logs and have just got Unraid back up and running.

 

Let's see what comes up in the logs next, but from the basic googling I have done, it seems more likely that the PSU is starting to fail rather than the CPUs.

 

On that topic, does anyone know where I could get a replacement Seasonic SS-400H2U? From what I am seeing, it's around $400-$500 CAD "new". Or does anyone know if a Seasonic SS-600H2U would be swappable? I know that with desktop PSUs the cabling can't be trusted to be interchangeable because of variations between units, regardless of brand or model. Does the same apply to server PSUs?

Link to comment
