
[Solved] A runtime critical stop occurred


Go to solution Solved by VladDT,


Hello all,

I'm facing some troubles with my Unraid server crashing every few days, and I'm not even very sure how to figure out what's going wrong, let alone how to fix it. Here's what I know so far:

 

The server is a Dell PowerEdge R720xd. I believe I upgraded to Unraid 6.12.2 on the 8th of July - I am certain about the version, though not the date, which I've pieced together by looking through the lifecycle log.

 

On the 17th of July, the server became unresponsive and I was unable to connect to it by any means. The iDRAC log had an entry saying "A runtime critical stop occurred.", followed within the next few seconds by three separate "OEM software event"s. I was able to restart the server through the iDRAC by issuing a reset (warm boot) command. The only problem when it came back up was that the Docker image had become corrupt and I had to delete it and recreate all my containers. Looking at the log entries in more detail through IPMI tools, the runtime critical stop has the description "Run-time Critical Stop ; OEM Event Data2 code = 61h ; OEM Event Data3 code = 74h". The OEM software events have the descriptions "Fatal excep", "tion in int", and "errupt", in that order.

 

QUESTION: What do these codes mean? Does the "OEM" designation mean they're specific to Unraid? Google's top hit is this forum post, which the posters believed to be a memory error, but I don't know if that was gleaned from the event codes or from the memory errors that shortly preceded the runtime critical stop in the OP's log. I do not have any memory errors (correctable or otherwise) in my log.

 

Since then, every couple of days the same thing has happened: the server becomes unresponsive, a "runtime critical stop" appears in the log, and the system has to be reset through the iDRAC. At some point, I think on the 22nd, I upgraded to Unraid 6.12.3 (this date is mostly a guess; I haven't been able to distinguish between various commanded restarts). The only difference this made was that the Docker image didn't get corrupted every time the server crashed - which may just be blind luck anyway.

 

In terms of what I've tried, I've been able to preserve the syslogs for a couple of occurrences. These are in the attached file. The one named "syslog" was retained by mirroring the log to the flash drive. A runtime critical stop occurred at 02:37 on the 23rd in the period covered by that file, although you will see that there were no entries between just after midnight on the 23rd and 07:40 when I restarted.

 

QUESTION: Is there a way to increase syslog verbosity?
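Partially answering my own question, these are the knobs I know of; untested on my server, and all addresses below are placeholders for your own network:

```shell
# Raise the kernel console log level to maximum so more kernel messages reach
# the console and syslog (fields: console, default, minimum, boot-time level):
sysctl -w kernel.printk="7 4 1 7"
# or, for the console level alone:
dmesg -n 7

# A panic in interrupt context may never be flushed to a log file on disk.
# netconsole streams kernel messages over UDP as they are emitted, so the
# panic text survives on another machine:
modprobe netconsole netconsole=6665@192.168.1.10/eth0,6666@192.168.1.50/00:11:22:33:44:55
# ...and on the receiving machine:
#   nc -u -l -k 6666
```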

 

The file named "syslog-127.0.0.1.log" exists because I got fed up with "Fix Common Problems" moaning at me for mirroring the syslog to flash, so I instead captured it by sending the logs to Unraid's own syslog server. In the period it covers, runtime critical stops occurred at 08:25 on the 25th and 05:59 on the 26th. Again, there are no log entries particularly close to the crashes that suggest what might be happening.
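For what it's worth, the way I've been inspecting these files is to look at what the log was doing immediately before each crash, using the next boot's kernel banner as the cut-off. A rough sketch with stand-in data (the real files are in the attachment; "tower" and the timestamps here are made up):

```shell
# Build a stand-in for the mirrored syslog: normal chatter, a silent gap,
# then the "Linux version" banner that marks the post-crash reboot.
cat > /tmp/syslog.sample <<'EOF'
Jul 23 00:02:11 tower kernel: eth0: link up
Jul 23 00:05:43 tower emhttpd: spinning down /dev/sdb
Jul 23 07:40:02 tower kernel: Linux version 6.1.38-Unraid
EOF

# Print everything before the reboot marker, i.e. the last entries written
# before the freeze:
awk '/Linux version/ { exit } { print }' /tmp/syslog.sample
```

In my case that tail shows nothing unusual, which is itself consistent with a panic that never got flushed to disk.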

 

The other thing that I did, given the top Google hit mentioned above, was to run Memtest on Sunday. I left all the settings at their defaults, and it found no problems by the time it had run to completion.

 

QUESTION: Am I correct in my understanding that Memtest is not an exhaustive test, and a pass does not therefore absolutely guarantee that there is no failing DIMM in the system?

QUESTION: Are there different settings for Memtest that I should try that might make it more comprehensive?

 

Finally, what other information can I retrieve to help diagnose this problem? I have omitted the usual Unraid diagnostics package, as by the time the problem has occurred I'm unable to communicate with Unraid until it has restarted and obliterated any diagnostics.

 

QUESTION: Or could there be useful things in there? Would it be helpful to post a diagnostics package anyway?

QUESTION: Is there anything else I should try to investigate to help find out what's going wrong?

 

Thanks if you've made it all the way down to the bottom of this mammoth post, and thank you in advance for all your helpful suggestions and advice that will no doubt be forthcoming.

 

Edited by ScottAS2
Remove diagnostics
Link to comment
48 minutes ago, Richard Aarnink said:

I am starting to see a pattern: @halorrr, @Hibiki Houjou, and I are experiencing similar issues.
The system freezes randomly, with no change in the hardware. We are all on 6.12.2 or 6.12.3.

Yes, our four experiences do seem remarkably similar. Until I saw yours I assumed the 6.12.2 upgrade was a coincidence, given the initial nine days of stability, but now I'm much more suspicious of it. I'll give safe mode a try to rule out one more thing.

Link to comment

It started happening around 2023-04-29, when I downgraded and removed unnecessary hardware from my PC.

 

It did not crash every day; it randomly left the PC in a frozen state, with 2-3 weeks between occurrences.

This could be some hardware failure I don't know of and haven't been able to figure out.

 

The only thing I can think of is that the CPU is defective,

but that still doesn't explain the delay between occurrences.

 

I took some steps like swapping RAM modules around and re-seating the HBA, CPU, and RAM,

cleaning the connectors, and updating and resetting the BIOS; no overclocking profiles are active.

 

The only changes I applied when downgrading are:

 

PSU Swap:

I swapped the Seasonic PX-750 from my gaming PC with my unRAID server's Seasonic TX-1300.

 

CPU Change:

from an R5 2600 to an R5 5600G

 

Removed Expansion Cards:

Intel RAID Expander

(though it basically has nothing to do with it, as it was powered by Molex instead of through the PCIe connector)

 

Removed the MSI 1050 Ti GPU

 

HDD Changes:

 

from:
2x TOSHIBA MG09 18TB Enterprise (MG09ACA18TE)
6x TOSHIBA MG07 14TB Enterprise (MG07ACA14TE)
4x Western Digital RED 10 TB (WD100EFAX-68LHPN0)

 

to:

2x TOSHIBA MG09 18TB Enterprise (MG09ACA18TE)
1x TOSHIBA MG07 14TB Enterprise (MG07ACA14TE)
3x Western Digital Ultrastar 22 TB (WUH722222ALE6L4)

 

SSD:
No Change

 

RAM:

No Change

 

Motherboard:

No Change

 

When it freezes, the power/reset buttons don't work, but the power LED and NIC LEDs are active.

My CPU Wraith cooler was not active or lit.

I had to pull the wall cord for 30 seconds to be able to start the PC again.

 

For some reason nothing happened for about 6 weeks, and then suddenly I had a freeze while the system was active:

the PC was fully powered on, but Unraid was unresponsive.

And last week it froze again mid-parity-check,

which was within about 2 days, since a parity check takes 1-2 days.

Edited by Hibiki Houjou
Link to comment

So I've been up for two days, ten hours in safe mode with one significant weirdness. The parity check I was running cancelled itself a couple of hours before it was due to finish. It then ran another parity check for all of thirty seconds. Very weird. My end-of-the-month parity check is coming up soon. Let's see how that goes.

Link to comment
  • 4 weeks later...

I have the same issue over the past three days. From the iDRAC log:

 

 Not Applicable    OEM software event.

 Not Applicable    OEM software event.

 Not Applicable    OEM software event.

 Sun Aug 20 2023 15:37:53    A runtime critical stop occurred.

 Not Applicable    OEM software event.

 Not Applicable    OEM software event.

 Not Applicable    OEM software event.

 Thu Aug 17 2023 20:50:10    A runtime critical stop occurred.

 

Server is a PowerEdge T320. No output on the local VGA. No hardware changes. I did upgrade to 6.12.3 a while ago...

 

Luckily I have a redundant Pi-hole (DHCP/DNS) running on a Pi, so there is no "Internet is down" situation. But this is annoying nevertheless.

Link to comment
27 minutes ago, Daatta said:

I have the same issue over the past three days. From the iDRAC log:

 

 Not Applicable    OEM software event.

 Not Applicable    OEM software event.

 Not Applicable    OEM software event.

 Sun Aug 20 2023 15:37:53    A runtime critical stop occurred.

Yes, this looks just like what I've got. I managed to get uptime to over seven days, slowly bringing back Docker services as I grew more confident. A few hours after I brought up my ADS-B constellation, the server crashed. Thinking I'd found the culprit, I brought up everything except ADS-B, but the server crashed again a few days later. I'm therefore of the opinion that the probability of crashing increases the more heavily loaded the server is, but I still don't know an exact cause, let alone a solution. If anyone knows the answers to any of the questions I posed above, please speak up!

Link to comment

Have the same problems. 

3814 | Aug-28-2023 | 08:16:36 | Sensor #70       | OS Critical Stop         | Run-time Critical Stop ; OEM Event Data2 code = 61h ; OEM Event Data3 code = 74h

 

Server: HP ProLiant MicroServer Gen8.

At first, I thought it was a memory error, but there's nothing in the IML logs at all, and iLO says the memory is healthy.

It all started after upgrading to version 6.16.23.

 

After years of using Unraid, I'm really thinking of moving to an Asustor Lockerstor.

Link to comment

Currently my uptime is 7 days, 23 hours, 43 minutes, with no issues so far... I haven't done anything to decrease the load, but with a 10-core/20-thread Xeon the server is mostly idling at 7% CPU. I did forward syslog to an external server (an old QNAP) to capture events before they are lost to a server restart.

 

I have been using Unraid for 4.5 years, and so far this is the only serious fault I have witnessed. With old e-waste servers, that is a pretty good score in my book.

Link to comment
  • 2 weeks later...

So I've just passed seven days with almost everything up on 6.12.4. I wonder if it was related to the macvlan changes, since I am using that for Docker with a parent interface that is a bridge. Does anyone know where I would have been expected to see the "macvlan call traces and crashes" referred to in the release notes if that was the cause of my crashes?
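For anyone searching later: as far as I can tell, the traces, when they do get flushed, appear in the syslog as kernel WARNING blocks followed by a "Call Trace:" mentioning macvlan. A quick way to check a saved syslog for them; the sample lines below are paraphrased from other reports, not taken from my own logs:

```shell
# Stand-in syslog containing the general shape of a macvlan call trace:
cat > /tmp/syslog.macvlan <<'EOF'
Aug 30 01:00:00 tower kernel: WARNING: CPU: 3 PID: 0 at net/netfilter/nf_nat_core.c
Aug 30 01:00:00 tower kernel: Call Trace:
Aug 30 01:00:00 tower kernel:  macvlan_process_broadcast [macvlan]
EOF

# Count lines pointing at the macvlan issue from the release notes;
# any non-zero result is worth a closer look:
grep -icE 'call trace|macvlan' /tmp/syslog.macvlan
```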

Link to comment

I just got this same issue for the first time this morning:
[attached screenshot]

 

I'm also on 6.12.4, but one thing that is different is that last night I had an unsafe shutdown, which corrupted my docker.img; I rebuilt it around midnight and got all containers up and running.

 

About 6 hours later I see the server is dead with this same error.

 

Obviously all Docker settings reset to defaults, and mine is set to macvlan now.

I will also be trying ipvlan; I can't remember if that is what I used to have it on.
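Rather than trusting memory, the current driver can be read back from Docker directly; a sketch, assuming the custom network is the usual br0 (substitute whatever name `docker network ls` actually shows on your system):

```shell
# List all networks and the driver each one uses:
docker network ls

# Print just the driver (macvlan or ipvlan) of the custom network:
docker network inspect br0 --format '{{ .Driver }}'
```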

Link to comment
  • 2 weeks later...
  • ScottAS2 changed the title to [Solved] A runtime critical stop occurred
  • 5 weeks later...

I am also experiencing random crashes with an R730 over the last few months, and I see the same events in my iDRAC Lifecycle Logs:
 

  2023-10-31T16:31:34-0500 SEL9901 OEM software event.

  2023-10-31T16:31:31-0500 SEL9901 OEM software event.

  2023-10-31T16:31:27-0500 SEL9901 OEM software event.

  2023-10-31T16:31:24-0500 OSE0001 A runtime critical stop occurred.


Will try switching my Docker custom network type over to ipvlan and see if that resolves it.

Update:
Looks like it's a known issue that should be resolved in 6.12.4:

 

https://docs.unraid.net/unraid-os/release-notes/6.12.4/

 

Fix for macvlan call traces
The big news in this release is that we have resolved issues related to macvlan call traces and crashes!
The root of the problem is that macvlan used for custom Docker networks is unreliable when the parent interface is a bridge (like br0); it works best on a physical interface (like eth0) or a bond (like bond0). We believe this to be a longstanding kernel issue and have posted a bug report.
If you are getting call traces related to macvlan, as a first step we recommend navigating to Settings > Docker, switch to advanced view, and change the "Docker custom network type" from macvlan to ipvlan. This is the default configuration that Unraid has shipped with since version 6.11.5 and should work for most systems.
However, some users have reported issues with port forwarding from certain routers (Fritzbox) and reduced functionality with advanced network management tools (Ubiquiti) when in ipvlan mode.
For those users, we have a new method that reworks networking to avoid this. Tweak a few settings and your Docker containers, VMs, and WireGuard tunnels should automatically adjust to use them:
Settings > Network Settings > eth0 > Enable Bonding = Yes or No, either works with this solution
Settings > Network Settings > eth0 > Enable Bridging = No
Settings > Docker > Host access to custom networks = Enabled
Note: if you previously used the 2-nic docker segmentation method, you will also want to revert that:
Settings > Docker > custom network on interface eth0 or bond0 (i.e. make sure eth0/bond0 is configured for the custom network, not eth1/bond1)
When you Start the array, the host, VMs, and Docker containers will all be able to communicate, and there should be no more call traces!

 

Edited by ZombieLord
  • Like 1
Link to comment
