Figuring out what is causing a server crash



I am no guru when it comes to analyzing syslogs. However, I do have some observations about what you posted. You really need to tell us why you think these logs are significant. You also need to tell us why you edited the logs the way you did (as it appears you have). You also did not tell us how the logs were captured. Does the end of the log include the actual lockup?

 

Describe what you mean by the term 'unresponsive.' Do you lose GUI access or file access, or do VMs become unresponsive?

 

With that out of the way, there is a 'Syslog Server' built into all recent Unraid releases. You can read about setting it up here:

 

        https://forums.unraid.net/topic/46802-faq-for-unraid-v6/page/2/?tab=comments#comment-781601

 

Now you do have a lot of events in the logs of conditions that are probably not normal.  However, what is usually required to find the source of the problem is the event that triggered that sequence. 
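If it helps, here is a rough Python sketch of one way to pull out the lines leading up to the first trace in a saved syslog file. The marker strings and the function name are my own assumptions, not anything Unraid provides, so adjust them to whatever your log actually shows.

```python
from collections import deque

# Assumed marker strings (typical kernel trace text); adjust to your own log.
MARKERS = ("call trace", "kernel panic", "bug:", "oops")


def context_before_first_trace(lines, before=20):
    """Return up to `before` lines preceding the first marker line, plus that line."""
    recent = deque(maxlen=before)  # rolling window of the most recent lines
    for line in lines:
        if any(marker in line.lower() for marker in MARKERS):
            return list(recent) + [line]
        recent.append(line)
    return []  # no marker found in the log
```

To use it, read the saved syslog into a list of lines (for example with `open(path, errors="replace").read().splitlines()`) and print what the function returns. It only shows what was logged just before the first trace; deciding which of those lines is the real trigger still takes a human.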

 

EDIT: One more thing: Did you make any changes to your Unraid server (upgrade, new hardware, new Dockers, VMs, etc.) that preceded the onset of the problem?

Edited by Frank1940

One more voice to the crowd, I suppose -- the snipped logs do contain a good deal of panic information, but without a bit more context as to what was going on, timeframes, and when it crashed/rebooted/was powered off, it's just a bunch of crash information. Part of the reason we need more log data is context, as trurl mentioned -- what happened before, what happened after, what kept running, what didn't. Also, what you missed. Nobody's perfect, I'm not trying to be accusatory.

 

With so little information, we have to infer a lot, even including which version of unRAID you're running. That's very important, because 6.9.* has known issues with Docker macvlan causing crashes -- I didn't look through the entire log because honestly the lack of context makes it pretty useless, but at least one of your crashes looked similar to that issue. There are also known problems with certain Ryzen editions, and I've seen issues on certain Intel Xeon hardware (mine and others) caused by C-States...

 

Basically, we'd love to help, but we need you to untie our hands by uploading that diagnostics.zip file -- if you're curious about the security/privacy implications, A) we don't care enough to do anything malicious with the information, we're just here to help, and B) I gave a quick run-through of what's contained here, ignore the title;

 


@codefaux @trurl @Frank1940 Thank you for responding, and sorry my first post was so vague; I was at my wits' end trying to figure out what was going on. The server has been locking up roughly once every 7-9 days. The web UI becomes unresponsive, a bunch of kernel messages appear on the server monitor, and I have to restart the server to fix it.

I read up on the syslog server after the first crash, when I noticed that no logs were kept after the restart. The 2 logs I posted beforehand were from roughly the time of the crash. I've attached the diagnostics and the logs from the syslog server to this post.

I also learnt about the macvlan issue, but I'm unsure whether it is the only issue. I had set most of my Docker containers to use static IPs; since this post I have reverted them to bridge, with the exception of PiHole. Unfortunately I noticed some further errors in the log and managed to restart the server before it locked up. I have now disabled PiHole to see whether the issue really is the macvlan error, so I currently have no Docker containers with a static IP. Hoping you can take a look at the logs and tell me if it could be something else?

 

I haven't really made many hardware changes recently, however this is a recently put together build.

 

I think the errors that cause the crash look like this:

 

Please let me know if I can post any more info, and thank you for your time.

 

Edited by bk101

@JorgeB @trurl

Thanks for checking and posting those links. Just to confirm: were the 3 crashes all linked to the Macvlan issue? So defaulting back to bridge mode and disabling PiHole for the time being should have solved my issue?

The 3 crashes were from 14th April 21:00, 24th April 21:00 and 2nd May 18:20.
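For reference, a quick Python sketch of the gaps between those three crashes. The year isn't stated anywhere above, so 2021 is purely an assumption (it makes no difference to the April-to-May arithmetic), and `gaps` is just a helper name I made up.

```python
from datetime import datetime

YEAR = 2021  # assumption: the year is not given in the post

crashes = [
    datetime(YEAR, 4, 14, 21, 0),   # 14th April 21:00
    datetime(YEAR, 4, 24, 21, 0),   # 24th April 21:00
    datetime(YEAR, 5, 2, 18, 20),   # 2nd May 18:20
]


def gaps(times):
    """Return the time elapsed between each consecutive pair of timestamps."""
    return [later - earlier for earlier, later in zip(times, times[1:])]


for gap in gaps(crashes):
    print(gap)
# 10 days, 0:00:00
# 7 days, 21:20:00
```

So the gaps are 10 days and just under 8 days.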

47 minutes ago, JorgeB said:

There are macvlan call traces and those are known to end up crashing the server, I linked the last thread because there are some reports that just disabling host access to custom networks also fixes the issue for some.

Will this break my Swag proxy?

15 hours ago, bk101 said:

3 crashes were all linked to Macvlan issue?

One of the problems with trace-inducing crashes is that they present in slightly different ways from a single cause, at times. It can be difficult to guarantee, but I agree that the issue should be related.

 

Contextually, your swag proxy's breakage depends on where it's being accessed from. If your local network and/or other Docker containers are using it, they should be okay. The Unraid OS itself is the only device which should lose access to it.

 

I don't mean to be insulting, it's just that Should does a lot of work with technology, so unfortunately it's hard not to name drop. lol

 

 

8 hours ago, codefaux said:

One of the problems with trace-inducing crashes is that they present in slightly different ways from a single cause, at times. It can be difficult to guarantee, but I agree that the issue should be related.

 

Contextually, your swag proxy's breakage depends on where it's being accessed from. If your local network and/or other Docker containers are using it, they should be okay. The Unraid OS itself is the only device which should lose access to it.

 

I don't mean to be insulting, it's just that Should does a lot of work with technology, so unfortunately it's hard not to name drop. lol

 

 

I see, I guess my unraid doesn't need to access it. I just remember turning this on when setting up Swag. I could always turn it off and test if it breaks anything and then go from there.

 

Edit: Actually found host access to custom networks is already turned off, I have "Preserve user defined networks" turned on.

Edited by bk101

Okay so I'm getting a little foggy on the details bouncing around helping a bunch of people and things have been a bit rough the last few days so bear with me please, but --

 

First, please do me a favor - clarify the state we're in right now, as far as describing what your current problem symptom is, how it presents, and what you know. Also, a fresh diagnostics.zip as a recent reference. I know it's all in the thread but a summary post every now and again really helps, I'd appreciate if you could humor me.

 

Second, I combed your first syslog again. I saw three lines which escaped me before, but seem significant.

Apr 19 04:40:05 Frankenstein root: Fix Common Problems: Error: Unable to write to disk2
--
Apr 19 05:27:00 Frankenstein root: Fix Common Problems: Error: Unable to write to disk2
--
Apr 21 04:40:05 Frankenstein root: Fix Common Problems: Error: Unable to write to disk4
--

I don't remember you mentioning disk problems at any point, and I didn't SEE any disk problems (you have many disks so I'll admit I don't remember if I looked at each log individually) which is part of why I'm asking again for a summary and fresh diagnostics. Ideally that gets us somewhere better...
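In case it's useful, this is roughly how those lines could be pulled out and grouped per disk with a bit of Python. It's only a sketch: the regular expression is my guess at the line format based on the three lines above, and `write_errors_by_disk` is a made-up helper name, not part of Unraid or Fix Common Problems.

```python
import re
from collections import defaultdict

# Pattern assumed from the three sample lines: timestamp, hostname, then the message.
PATTERN = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+\S+\s+root: "
    r"Fix Common Problems: Error: Unable to write to (?P<disk>\S+)"
)


def write_errors_by_disk(lines):
    """Map each disk name to the timestamps of its 'Unable to write' errors."""
    found = defaultdict(list)
    for line in lines:
        match = PATTERN.match(line)
        if match:
            found[match.group("disk")].append(match.group("stamp"))
    return dict(found)


sample = [
    "Apr 19 04:40:05 Frankenstein root: Fix Common Problems: Error: Unable to write to disk2",
    "Apr 19 05:27:00 Frankenstein root: Fix Common Problems: Error: Unable to write to disk2",
    "Apr 21 04:40:05 Frankenstein root: Fix Common Problems: Error: Unable to write to disk4",
]
print(write_errors_by_disk(sample))
# {'disk2': ['Apr 19 04:40:05', 'Apr 19 05:27:00'], 'disk4': ['Apr 21 04:40:05']}
```

Run against the full syslog, this would show whether any other disks reported the same error and when, which might line up (or not) with the crash times.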
