[SOLVED] Recent Frequent Crashing (Not related to unRAID version or macvlan?)

Majawat · April 27, 2021

Honestly, I have no idea what's going on. But recently, I'm having unRAID crashes. I was on 6.9.1 with some crashes, then updated to 6.9.2. Attached are my diags after such a crash. Happened in 6.9.1 and now 6.9.2 as well. And from what I barely understand, I don't have any dockers with custom IPs so not a macvlan issue. Please correct me if I'm wrong on these assumptions.

Unfortunately I really only find out after the crash when I get a notification that a parity check has started. As a result, I'm not really sure of what's going on.

I'm also not physically near my hardware; it's at my brother's house (he has better internet). But I can IPMI into the box.

And of course, I'm not quite knowledgeable about all this stuff, so please let me know if you need more information or anything.

Much appreciated!

hathor-diagnostics-20210427-1015.zip

Edited April 29, 2021 by Majawat
solved!

JorgeB · April 27, 2021

Enable this then post that log after a crash.

Majawat · April 27, 2021

Here are new logs after turning on Mirror syslog to flash, after next crash.

hathor-diagnostics-20210427-1109.zip

JorgeB · April 27, 2021

1 hour ago, JorgeB said:

post that log after a crash.

Majawat · April 27, 2021

But I did have a crash before I grabbed the diags. I've had a few now since then; here's another diag file taken just after another crash, right after it came back up and started another parity check.

hathor-diagnostics-20210427-1326.zip

Majawat · April 27, 2021

I turned off all my Docker containers and my VMs, and it hasn't crashed in a while now. I'm going to slowly turn on one at a time and see what thing is doing it. I have a feeling it's my new-ish W7 VM, which I hope not.

Edited April 27, 2021 by Majawat
(VM isn't that new, just most recent change)

Majawat · April 27, 2021

I turned on two docker containers to help me test another issue: https://forums.unraid.net/topic/107528-docker-containers-become-slowunusable-during-large-data-movement/

Then everything was ok for a while there. Then I turned on a single VM (not the new-ish one, one I've had for a long time). And then pretty quickly got a crash. Specifically, I turned on the VM called Hraf. Though what's the likelihood that I'd choose the one VM with an issue? I'm guessing it's more that using any VM at all is causing the crashing... I'll try other ones and see what happens.

Edit: It crashed with just a file copy job going, nothing else running; no VMs, no Dockers. Though I think a parity check was going. I'm going to restart in Safe Mode and see what happens.

Edited April 28, 2021 by Majawat

JorgeB · April 28, 2021

12 hours ago, Majawat said:

But I did have a crash before I grabbed the diags.

What do you mean by crash then? Usually crash means the server is unresponsive and you can't even get diags. I don't see anything out of the ordinary logged in that syslog.

Majawat · April 28, 2021

I mean the whole server stops and restarts all on its own. A non-graceful shutdown. Then it comes back up, starts a parity check, and I download the diags.

Almost like a blue screen in Windows, but I don't see anything like that screen here.

ChatNoir · April 28, 2021

The diagnostics you grab after reboot will be with a clean log, probably not much to see.

Can you check the /logs folder of your flash drive? During an unclean shutdown Unraid tries to generate diagnostics before rebooting.

If there is a recent file, it might be helpful.

JorgeB · April 28, 2021

53 minutes ago, Majawat said:

I mean the whole server stops and restarts all on its own.

The diagnostics can't help with this; this can:

16 hours ago, JorgeB said:

Enable this then post that log after a crash.

Majawat · April 28, 2021

Oooh, I understand now. My understanding was the diagnostics grabbed the syslogs created by that setting. I get that it's a different file now. I'll post it in the morning (3am now). I'm also running a memtest now.

Thank you for your patience

Majawat · April 28, 2021

Ok, couldn't sleep, so I stopped the memtest and got the syslog and timestamps.

Pings showing it went down and when:

 5:27:14.78 Reply from 192.168.9.10: bytes=32 time<1ms TTL=64
 5:27:15.81 Reply from 192.168.9.10: bytes=32 time<1ms TTL=64
 5:27:16.82 Reply from 192.168.9.10: bytes=32 time<1ms TTL=64
 5:27:17.85 Reply from 192.168.9.10: bytes=32 time<1ms TTL=64
 5:27:22.80 Request timed out.
 5:27:32.80 Request timed out.
 ... (truncated)
 5:29:43.29 Request timed out.
 5:29:45.32 Request timed out.
 5:29:47.35 Reply from 192.168.9.10: bytes=32 time<1ms TTL=64
 5:29:49.38 Reply from 192.168.9.10: bytes=32 time=2ms TTL=64
 5:29:51.41 Reply from 192.168.9.10: bytes=32 time<1ms TTL=64
 5:29:53.44 Reply from 192.168.9.10: bytes=32 time<1ms TTL=64
 5:29:55.47 Reply from 192.168.9.10: bytes=32 time<1ms TTL=64
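
The outage window can be read off a ping log like the one above with a short script. This is a minimal Python sketch, assuming each line starts with an `H:MM:SS.ss` timestamp followed by either `Reply from ...` or `Request timed out.`, as in the excerpt. The names `parse_ping_line`, `outage_windows`, and `PING_SAMPLE` are hypothetical, and the sample is only a subset of the lines posted here.

```python
import re
from datetime import timedelta

# Timestamp at the start of the line, then the ping result text.
LINE_RE = re.compile(r"^\s*(\d{1,2}):(\d{2}):(\d{2}(?:\.\d+)?)\s+(.*)$")


def parse_ping_line(line):
    """Return (time_of_day, is_reply), or None for lines without a timestamp."""
    m = LINE_RE.match(line)
    if not m:
        return None
    hours, minutes, seconds, text = m.groups()
    t = timedelta(hours=int(hours), minutes=int(minutes), seconds=float(seconds))
    return t, text.startswith("Reply from")


def outage_windows(lines):
    """Return (last_reply_before, first_reply_after) pairs around each run of timeouts."""
    windows = []
    last_ok = None
    down = False
    for line in lines:
        parsed = parse_ping_line(line)
        if parsed is None:
            continue  # e.g. the "... (truncated)" marker
        t, ok = parsed
        if ok:
            if down and last_ok is not None:
                windows.append((last_ok, t))
            down = False
            last_ok = t
        else:
            down = True
    return windows


# Subset of the ping log posted in this thread.
PING_SAMPLE = [
    " 5:27:16.82 Reply from 192.168.9.10: bytes=32 time<1ms TTL=64",
    " 5:27:17.85 Reply from 192.168.9.10: bytes=32 time<1ms TTL=64",
    " 5:27:22.80 Request timed out.",
    " ... (truncated)",
    " 5:29:45.32 Request timed out.",
    " 5:29:47.35 Reply from 192.168.9.10: bytes=32 time<1ms TTL=64",
]

for start, end in outage_windows(PING_SAMPLE):
    print(f"no replies between {start} and {end}: {(end - start).total_seconds():.1f} s")
```

On this excerpt the gap between the last reply (5:27:17.85) and the first reply after the reboot (5:29:47.35) works out to 149.5 seconds. The timestamps come from the machine running the ping, so they may be offset from the server's own syslog clock.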

Attached is the syslog. At the time of crash, all I was doing was navigating from the Dashboard to the Main tab. No docker or VMs were started, and no parity check.

Apr 28 05:22:33 Hathor rsyslogd: [origin software="rsyslogd" swVersion="8.2002.0" x-pid="14293" x-info="https://www.rsyslog.com"] start
Apr 28 05:25:37 Hathor ntpd[2090]: kernel reports TIME_ERROR: 0x41: Clock Unsynchronized
Apr 28 05:30:23 Hathor root: Delaying execution of fix common problems scan for 10 minutes
Apr 28 05:30:23 Hathor unassigned.devices: Mounting 'Auto Mount' Devices...
Apr 28 05:30:23 Hathor emhttpd: Starting services...
Apr 28 05:30:23 Hathor emhttpd: shcmd (81): /etc/rc.d/rc.samba restart

It shows no logs immediately prior to that crash.
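
To look for the same kind of silence across a longer mirrored syslog, one option is to scan for large jumps between consecutive timestamps. The following is a rough Python sketch, not an Unraid tool: `find_gaps`, `SYSLOG_SAMPLE`, the 240-second threshold, and the assumed year (syslog lines carry no year) are all illustrative choices. It also relies on English month abbreviations, which is what `%b` parses in the default locale.

```python
import re
from datetime import datetime

# Classic syslog prefix, e.g. "Apr 28 05:22:33 " (the day may be space-padded).
STAMP_RE = re.compile(r"^([A-Z][a-z]{2}\s+\d{1,2}\s+\d{2}:\d{2}:\d{2})\s")


def find_gaps(lines, min_gap_seconds=240, year=2021):
    """Return (previous, next) timestamp pairs at least min_gap_seconds apart."""
    gaps = []
    prev = None
    for line in lines:
        m = STAMP_RE.match(line)
        if not m:
            continue  # skip continuation lines or anything without a timestamp
        # The year is an assumption supplied by the caller.
        ts = datetime.strptime(f"{year} {m.group(1)}", "%Y %b %d %H:%M:%S")
        if prev is not None and (ts - prev).total_seconds() >= min_gap_seconds:
            gaps.append((prev, ts))
        prev = ts
    return gaps


# Lines taken from the syslog excerpt posted above.
SYSLOG_SAMPLE = [
    'Apr 28 05:22:33 Hathor rsyslogd: [origin software="rsyslogd" swVersion="8.2002.0" x-pid="14293" x-info="https://www.rsyslog.com"] start',
    "Apr 28 05:25:37 Hathor ntpd[2090]: kernel reports TIME_ERROR: 0x41: Clock Unsynchronized",
    "Apr 28 05:30:23 Hathor root: Delaying execution of fix common problems scan for 10 minutes",
    "Apr 28 05:30:23 Hathor emhttpd: Starting services...",
]

for before, after in find_gaps(SYSLOG_SAMPLE):
    print(f"nothing logged between {before:%H:%M:%S} and {after:%H:%M:%S}")
```

With the default threshold this flags only the jump from 05:25:37 to 05:30:23. An idle server can also go minutes without logging, so a gap is only a candidate and should be compared against the ping outage times.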

Here are my syslog settings

image.png.df468f5aa35d57872da013c9cd79cfba.png

As I was gathering this information, it crashed again despite not using the system. I'm restarting the memtest, but for some reason it only shows 3 slots?

image.png.f45089bff7e63c9f2520243dc41c1ed5.png

syslog-192.168.9.10.log

JorgeB · April 28, 2021

Nothing being logged about the crash usually points to a hardware problem. One more thing you can try is to boot the server in safe mode with all Docker containers/VMs disabled and let it run as a basic NAS for a few days. If it still crashes, it's likely a hardware problem; if it doesn't, start turning the other services back on one by one.

ChatNoir · April 28, 2021

1 hour ago, Majawat said:

As I was gathering this information, it crashed again despite not using the system. I'm restarting the memtest, but for some reason it only shows 3 slots?

Are you using a single CPU ?

Majawat · April 28, 2021

1 minute ago, ChatNoir said:

Are you using a single CPU ?

No, dual xeons.

Interestingly, it does at least show the full capacity there: 48gb

Squid · April 28, 2021

You have ECC Memory. The Memtest that comes with Unraid will not find any memory errors. You need to create a boot stick with the updated one (google search. Licensing prevents LT from including it in the OS)

Majawat · April 28, 2021

1 hour ago, Squid said:

You need to create a boot stick with the updated one (google search. Licensing prevents LT from including it in the OS)

Is that the one from PassMark or the open source one? Or does it matter?

rodan5150 · April 28, 2021

The PassMark one is the one I've used to test ECC memory. Be sure to boot the Memtest stick in UEFI mode, not the traditional Memtest86 with the blue screen, which is the Legacy/BIOS boot. The stick will have both. The UEFI one is the one that did the trick for me testing ECC; I think the legacy one does not.

Majawat · April 29, 2021

I figured out what is the cause of the server shutting off: https://imgur.com/a/RGSmkJz. Burnt 24 pin connector. Not sure if it's the power supply or the motherboard's fault. But time for some replacement parts. At least motherboard and power supply it seems.

Wonder if I should replace it with some newer hardware, and if so, which...

JonathanM · April 29, 2021

1 minute ago, Majawat said:

Not sure if it's the power supply or the motherboard's fault.

Yes. That kind of damage is typically caused by a loose fit, where the metal on metal doesn't firmly hold, causing only a small spot to touch, causing high resistance and temperature. Technically the power supply end is what is supposed to provide the clamping and spring force, but the board tolerances probably didn't help. The motherboard could possibly be salvaged, by thoroughly cleaning the burnt parts on the metal posts inside the connector, but that would be a serious pain to do, probably with a very small bit on a dremel.
