[SOLVED] System (kernel?) always crashes after 1 - 10 days


jnk22


Hello,

 

I am having a problem with regular crashes of my Unraid server.

Those crashes usually happen after 1 - 10 days of runtime. I have attached 13 screenshots of the last screen, which I took using my mainboard's remote management web interface.

 

About 80% of the crashes happened between 02:00 AM and 09:00 AM, when not much was going on. Sometimes they also occurred during active usage of the server, though.

In the last ~6 months I have probably had a total of ~30 crashes of that kind. Unfortunately I did not have time to address them properly until now.

 

I ruled out a few things already:

  • corrupt appdata on a BTRFS cache drive (running array without any cache drive -> crashes remain)
  • OpenVPN AS Docker: disabled for a few weeks, same crashes

 

My system

  • Unraid 6.8.3 (Unraid DVB Edition)
  • Mainboard: Supermicro X11SPi-TF
  • CPU: Intel Xeon Silver 4210
  • RAM: 4x 32GB Samsung M393A4K40CB1-CRC DDR4-2400 regECC DIMM CL17 Single (not tested with memtest before installing Unraid)
  • GPU: 2x INNO3D GeForce GTX 1650 Single Slot
  • Cache drive: 1000GB Samsung 970 Evo Plus M.2 2280 PCIe 3.0
  • HDDs (Array): 4x WD 8TB Red

 

I bought most of the hardware brand new in March/April this year (except for some of the HDDs, which I already had).

I then extended the server with 2 more RAM sticks and 2 graphics cards in ~May. The problems were already happening before that, with only the 2 initial memory modules installed.

 

My mainboard has 2 network ports, currently used in bonding mode. I used one of those ports in single mode before, but the same crashes happened back then.

 

I have attached some diagnostics files when the server was active. Unfortunately I could not get those diagnostics files after a crash. However, some of those diagnostics logs seem to have the same kind of errors as well.

 

I also managed to log a "crash" while it seemed to be actively happening, but only by copying the log text itself from the web interface console, as I could no longer download any diagnostics through the web interface at that point:

kernel_error_2020-09-07.txt

 

 

Could these crashes be related to bad RAM? As those RAM modules are regECC modules, I would have expected that kind of error to look different though - but I have never used any regECC RAM before.

 

I am also using Pi-hole docker. As those errors usually contain something about "NAT" and IPv4/6, could this be the cause as well? Disabling Pi-hole for weeks would not be that easy though as I am using its DNS services for DNS resolution of my TV server and other important services as well..

 

Thanks in advance :)

jnk

 

diagnostics.zip screenshots.zip

iKVM_capture_2020-10-04.jpg

Edited by jnk22
Problem solved
18 hours ago, jnk22 said:

Unraid 6.8.3 (Unraid DVB Edition)

I think you mean the linuxserver build of Unraid with DVB drivers, or am I wrong?

You can also try my build, or build it yourself (prebuilt images are at the bottom of the first post).

 

 

 

At least I would try it with a stock installation of Unraid. ;)


Thanks for your help! :)

 

@ich777 Yes, it's the linuxserver.io build for my Digital Devices TV cards.

But I have already seen your awesome new plugin, and that is what I'll be using for Unraid 6.9.0 for sure :)

 

I managed to move my Pi-hole instance to an old Raspberry Pi 2. I feel like the Pi-hole docker could have been the reason for the crashes as well.

If the server is going to crash again in a few days though, I'll test a stock installation of Unraid in safe-mode without Docker/VM.

 

I'll keep you updated.

jnk22

39 minutes ago, jnk22 said:

I managed to move my Pi-hole instance to an old Raspberry Pi 2.

I don't think it's Pi-hole, because I also run it on my server. This sounds more like a hardware problem.

You can also build it with the official DigitalDevices drivers; you only have to change the value from 'libreelec' to 'digitaldevices'.

On 10/10/2020 at 11:55 AM, ich777 said:

I don't think it's Pi-hole, because I also run it on my server. This sounds more like a hardware problem.

You can also build it with the official DigitalDevices drivers; you only have to change the value from 'libreelec' to 'digitaldevices'.

Yea, unfortunately it was not Pi-hole 😅

 

To rule out the linuxserver.io DVB kernel (and also to test your awesome new docker) I have built a custom kernel with DVB and NVIDIA support today. Worked like a charm!

 

I have also temporarily removed the memory modules (slot 1+2) that I had initially installed.

When I checked how often the server had started (mostly due to crashes) before and after I inserted the 2 additional modules, I realized that before the upgrade it started/crashed about every 2nd day (over 30 days), and after the upgrade about once every 4.5 days (over 150 days). This could lead to the assumption that doubling the number of modules roughly halved the probability of a crash. 🤔
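The comparison above can be checked with a few lines of arithmetic. This is only a minimal sketch: it uses nothing but the figures quoted in the post, and it assumes a roughly constant crash rate within each period, which is an assumption made for illustration and not something that was measured.

```python
# Figures quoted in the post; a constant crash rate per period is assumed.
days_before = 30               # observation period with 2 memory modules
days_per_crash_before = 2.0    # about one start/crash every 2nd day

days_after = 150               # observation period with 4 memory modules
days_per_crash_after = 4.5     # about one start/crash every 4.5 days

# Implied number of starts/crashes in each period.
crashes_before = days_before / days_per_crash_before
crashes_after = days_after / days_per_crash_after

# How much longer the server ran between crashes after the upgrade.
improvement = days_per_crash_after / days_per_crash_before

print(f"before: ~{crashes_before:.0f} crashes in {days_before} days")
print(f"after:  ~{crashes_after:.0f} crashes in {days_after} days")
print(f"mean time between crashes improved by a factor of {improvement:.2f}")
```

The factor comes out at 2.25, which is in line with the "by 2" estimate. With so few data points it is only a rough indication and does not prove that the memory modules are the cause.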

 

If that doesn't help, I'll try a stock Unraid version as the next step.

 

Will keep you updated :)

12 hours ago, jnk22 said:

memory modules (slot 1+2)

Are you sure that you've put the memory modules in the slots recommended by the manufacturer? (Typically your manual has the right information.)

 

Have you applied any overclock or XMP profile to the memory?

If so, I recommend disabling it, because this is a server that is supposed to run 24/7, and that's not what XMP or any overclock is intended for.

In a server environment, stability is key (speed matters too, but not through overclocking).

  • 3 months later...

It seems that I finally have some good news about those crashes :)

 

On 10/12/2020 at 7:54 AM, ich777 said:

Are you sure that you've put the memory modules in the slots recommended by the manufacturer? (Typically your manual has the right information.)

 

Have you applied any overclock or XMP profile to the memory?

If so, I recommend disabling it, because this is a server that is supposed to run 24/7, and that's not what XMP or any overclock is intended for.

In a server environment, stability is key (speed matters too, but not through overclocking).

 

Yes, I have looked up the recommended slots for 2 and 4 modules and always placed them in that way.

There is no overclocking enabled as you are totally right about the stability aspect as well.

 

---

 

I have updated to the latest version 6.9.0-RC1 when it was released, but that did not solve it (so it was not solved by the new kernel - which I hoped would fix it).

 

After that I have done the following, which seems to have solved it:

 

- Upgraded to 6.9.0-RC2

- Disabled IPv6 networking for Unraid completely

 

Since then, the server has been up and running for 37+ days with no errors in the logs at all (the previous maximum was ~15 days).

I have not restarted the server since then, so I hope that it will stay the same for the next (RC3?) release.

 

I believe it is the IPv6 setting that stopped the crashes.

It might be due to some misconfiguration on my home network - or some Unraid related problem with the Unraid network settings in general.

 

I have also found a topic about a similar problem:

- https://forums.unraid.net/topic/70529-650-call-traces-when-assigning-ip-to-dockers/

 

It seems to be solved by using multiple NICs / setting up VLANs etc.:

- https://forums.unraid.net/topic/62107-network-isolation-in-unraid-64/?tab=comments#comment-609082&searchlight=1

 

 

As the current state (disabling IPv6 completely) is not what I want, I'll try those solutions in the future. However, my network does not really support VLANs due to my current hardware (FRITZ!Box router).

 

I'll report back as soon as I have done some more testing on those IPv6 and general network-related settings.

Edited by jnk22
  • 2 months later...

Unfortunately I still have those errors, so disabling IPv6 has not solved this error.

 

The errors occurred with all of these versions:

- 6.9.2

- 6.9.1

- 6.9.0

- 6.9.0-RCx

- 6.8.3

 

This table shows that kernel crashes still appear regularly, with a longest uptime of 22 days (due to an update):

grafik.png.1e94a158ccadc49c63f688e61cc49d9d.png

 

 

It seems that more users have been having this kind of crash since ~6.9.x:

 

I do not know if this thread is directly related to my problem, as most of those users seem to have their dockers running on "bridge" network only.

 

My docker instances use these 3 network types:

- bridge (only OpenVPN-AS - will try to replace this with WireGuard for now)

- host (2x)

- proxynet (21x - custom network, details attached as .txt file)
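A breakdown like the one above can be reproduced from the Docker CLI. The following is a minimal sketch, assuming the `docker` binary is on the PATH; the helper names `tally_network_modes` and `running_container_modes` are made up for this example and are not part of Unraid or Docker.

```python
import subprocess
from collections import Counter
from typing import Iterable, List


def tally_network_modes(modes: Iterable[str]) -> Counter:
    """Count how many containers use each network mode (blank lines ignored)."""
    return Counter(m.strip() for m in modes if m.strip())


def running_container_modes() -> List[str]:
    """Return the network mode of every running container via the Docker CLI."""
    ids = subprocess.run(
        ["docker", "ps", "-q"],
        capture_output=True, text=True, check=True,
    ).stdout.split()
    if not ids:
        return []
    # One line per container, e.g. "bridge", "host" or a custom network name.
    # Depending on the Docker version, the default network may show as "default".
    out = subprocess.run(
        ["docker", "inspect", "--format", "{{.HostConfig.NetworkMode}}", *ids],
        capture_output=True, text=True, check=True,
    ).stdout
    return out.splitlines()


# Usage on the server itself (requires Docker):
#   for mode, count in tally_network_modes(running_container_modes()).most_common():
#       print(f"{mode}: {count}x")
```

Keeping the counting separate from the CLI call makes it easy to re-run the tally after moving containers between networks and to compare the results.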

 

As I probably cannot set up VLANs due to hardware limitations, I'll try to remove the bridge- and host-dockers by replacing them with alternatives for now to further narrow down the cause.

 

I have attached a file with my docker-network settings and my Unraid network settings as well.

 

unraid_network-settings.png

latest_crash.txt

unraid_docker.png

docker-inspect-network-proxynet.txt

Edited by jnk22
  • 3 months later...

After 90+ days of uptime I can confirm that these changes seem to have solved my problem:

 

- Settings -> Docker -> Host access to custom networks: Disabled

- Use custom network "proxynet" for ALL docker instances (before: host + bridge + proxynet) - might be unrelated

 

It is probably due to disabling host access to custom networks as stated here:

 

Since then there is no call trace to be found in my logs anymore 🙂
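Checking a log for call traces can be scripted. This is a minimal sketch that counts lines containing the `Call Trace:` header the Linux kernel prints before a stack dump; the log path `/var/log/syslog` in the usage comment is an assumption and may differ on your system, and the function names are made up for this example.

```python
from typing import Iterable

# Header line the Linux kernel prints before a stack dump.
MARKER = "Call Trace:"


def count_call_traces(lines: Iterable[str]) -> int:
    """Return the number of lines that contain the kernel call-trace header."""
    return sum(1 for line in lines if MARKER in line)


def scan_log(path: str) -> int:
    """Count call traces in a log file, tolerating undecodable bytes."""
    with open(path, errors="replace") as fh:
        return count_call_traces(fh)


# Usage (path is an assumption, adjust as needed):
#   print(scan_log("/var/log/syslog"))
```

A count of 0 over a long uptime is what "no call trace in the logs" means here. Keep in mind that logs held in RAM are lost on reboot, so the check only covers the current uptime unless the syslog is mirrored elsewhere.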

 

I'll mark this thread as solved.

  • jnk22 changed the title to [SOLVED] System (kernel?) always crashes after 1 - 10 days
