[SOLVED] Recent Frequent Crashing (Not related to unRAID version or macvlan?)



Honestly, I have no idea what's going on, but recently I've been having unRAID crashes. I had some crashes on 6.9.1, then updated to 6.9.2, and the crashes have continued there as well. Attached are my diags, taken after one such crash. From what little I understand, I don't have any Docker containers with custom IPs, so this shouldn't be a macvlan issue. Please correct me if I'm wrong on these assumptions.

 

Unfortunately I really only find out after the crash when I get a notification that a parity check has started. As a result, I'm not really sure of what's going on.

 

I'm also not physically near my hardware; it's at my brother's house (he has better internet). But I can IPMI into the box.

And of course, I'm not quite knowledgeable about all this stuff, so please let me know if you need more information or anything.

 

Much appreciated!

 

hathor-diagnostics-20210427-1015.zip

Edited by Majawat
solved!
Link to comment

I turned off all my Docker containers and my VMs, and it hasn't crashed in a while now. I'm going to slowly turn them back on one at a time and see which one is causing it. I have a feeling it's my new-ish W7 VM, though I hope not.

Edited by Majawat
(VM isn't that new, just most recent change)
Link to comment

I turned on two docker containers to help me test another issue: https://forums.unraid.net/topic/107528-docker-containers-become-slowunusable-during-large-data-movement/

 

 

Everything was OK for a while after that. Then I turned on a single VM (not the new-ish one, but one I've had for a long time) and pretty quickly got a crash. Specifically, I turned on the VM called Hraf. Though what's the likelihood that I'd choose the one VM with an issue? I'm guessing it's more that running any VM at all is causing the crashing... I'll try other ones and see what happens.

 

Edit: It crashed with just a file copy job going, nothing else running; no VMs, no Dockers. Though I think a parity check was going. I'm going to restart in Safe Mode and see what happens.

Edited by Majawat
Link to comment
12 hours ago, Majawat said:

But I did have a crash before I grabbed the diags.

What do you mean by crash, then? Usually a crash means the server is unresponsive and you can't even get diags. I don't see anything out of the ordinary logged in that syslog.

Link to comment

I mean the whole server stops and restarts all on its own. A non-graceful shutdown. Then it comes back up, starts a parity check, and I download the diags. 

 

Almost like a blue screen in Windows, but I don't see anything like that screen here. 

Link to comment

The diagnostics you grab after a reboot will contain a fresh log, so there is probably not much to see.

Can you check the /logs folder of your flash drive? During an unclean shutdown Unraid tries to generate diagnostics before rebooting.

If there is a recent file, it might be helpful.
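
If it helps, here is a minimal Python sketch for listing the most recently modified files in that folder. It assumes Python is available wherever the flash drive can be read, and that the flash drive is mounted at /boot so the folder is /boot/logs; adjust the path if the drive is mounted elsewhere. The function name `newest_files` is made up for this example.

```python
from datetime import datetime
from pathlib import Path


def newest_files(folder, limit=5):
    """Return up to `limit` (path, modified_time) pairs, newest first."""
    root = Path(folder)
    if not root.is_dir():
        return []  # folder missing or not readable as a directory
    files = [p for p in root.iterdir() if p.is_file()]
    files.sort(key=lambda p: p.stat().st_mtime, reverse=True)
    return [(p, datetime.fromtimestamp(p.stat().st_mtime)) for p in files[:limit]]


if __name__ == "__main__":
    # Assumption: the flash drive is mounted at /boot, so its logs folder is /boot/logs.
    for path, modified in newest_files("/boot/logs"):
        print(f"{modified:%Y-%m-%d %H:%M:%S}  {path.name}")
```

A file whose modification time sits just before the unexpected reboot is the one worth attaching.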

Link to comment

Oooh, I understand now. My understanding was the diagnostics grabbed the syslogs created by that setting. I get that it's a different file now. I'll post it in the morning (3am now). I'm also running a memtest now. 

 

Thank you for your patience

Link to comment

Ok, couldn't sleep, so I stopped the memtest and got the syslog and timestamps.

 

Pings showing it went down and when:

 5:27:14.78 Reply from 192.168.9.10: bytes=32 time<1ms TTL=64
 5:27:15.81 Reply from 192.168.9.10: bytes=32 time<1ms TTL=64
 5:27:16.82 Reply from 192.168.9.10: bytes=32 time<1ms TTL=64
 5:27:17.85 Reply from 192.168.9.10: bytes=32 time<1ms TTL=64
 5:27:22.80 Request timed out.
 5:27:32.80 Request timed out.
 ... (truncated)
 5:29:43.29 Request timed out.
 5:29:45.32 Request timed out.
 5:29:47.35 Reply from 192.168.9.10: bytes=32 time<1ms TTL=64
 5:29:49.38 Reply from 192.168.9.10: bytes=32 time=2ms TTL=64
 5:29:51.41 Reply from 192.168.9.10: bytes=32 time<1ms TTL=64
 5:29:53.44 Reply from 192.168.9.10: bytes=32 time<1ms TTL=64
 5:29:55.47 Reply from 192.168.9.10: bytes=32 time<1ms TTL=64
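
For anyone who wants to pull the outage window out of a longer ping capture, here is a rough Python sketch. It assumes each line starts with an `H:MM:SS.ss` timestamp as in the output above and that a successful ping begins with `Reply from`; the function names and the shortened `SAMPLE` text are illustrative only.

```python
import re
from datetime import datetime

# Matches the timestamp prefix used in the ping output, e.g. "5:27:14.78".
LINE_RE = re.compile(r"^\s*(\d{1,2}:\d{2}:\d{2}\.\d+)\s+(.*)$")

# Shortened copy of the ping output, used only to demonstrate the functions.
SAMPLE = """\
 5:27:16.82 Reply from 192.168.9.10: bytes=32 time<1ms TTL=64
 5:27:17.85 Reply from 192.168.9.10: bytes=32 time<1ms TTL=64
 5:27:22.80 Request timed out.
 ... (truncated)
 5:29:45.32 Request timed out.
 5:29:47.35 Reply from 192.168.9.10: bytes=32 time<1ms TTL=64
"""


def parse_ping_log(text):
    """Return a list of (timestamp, is_reply) tuples from timestamped ping output."""
    samples = []
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if not match:
            continue  # skip blank lines and the "(truncated)" marker
        stamp = datetime.strptime(match.group(1), "%H:%M:%S.%f")
        samples.append((stamp, match.group(2).startswith("Reply from")))
    return samples


def outage_windows(samples):
    """Yield (last_reply, next_reply) pairs that bracket each run of timeouts."""
    last_reply = None
    in_outage = False
    for stamp, is_reply in samples:
        if is_reply:
            if in_outage and last_reply is not None:
                yield last_reply, stamp
            last_reply = stamp
            in_outage = False
        else:
            in_outage = True


if __name__ == "__main__":
    for start, end in outage_windows(parse_ping_log(SAMPLE)):
        seconds = (end - start).total_seconds()
        print(f"no replies between {start:%H:%M:%S} and {end:%H:%M:%S} ({seconds:.1f} s)")
```

On the output above, the last reply before the outage is at 5:27:17.85 and the first one after it is at 5:29:47.35, so the box was unreachable for about two and a half minutes.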

 

Attached is the syslog. At the time of crash, all I was doing was navigating from the Dashboard to the Main tab. No docker or VMs were started, and no parity check.

 

Apr 28 05:22:33 Hathor rsyslogd: [origin software="rsyslogd" swVersion="8.2002.0" x-pid="14293" x-info="https://www.rsyslog.com"] start
Apr 28 05:25:37 Hathor ntpd[2090]: kernel reports TIME_ERROR: 0x41: Clock Unsynchronized
Apr 28 05:30:23 Hathor root: Delaying execution of fix common problems scan for 10 minutes
Apr 28 05:30:23 Hathor unassigned.devices: Mounting 'Auto Mount' Devices...
Apr 28 05:30:23 Hathor emhttpd: Starting services...
Apr 28 05:30:23 Hathor emhttpd: shcmd (81): /etc/rc.d/rc.samba restart

 

It shows no logs immediately prior to that crash.
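
A quick way to spot a silent stretch like this in a longer syslog is to look for large gaps between consecutive timestamps. The Python sketch below is only an illustration: it assumes the classic `Mon DD HH:MM:SS` prefix seen in the excerpt, takes the year as 2021 because syslog lines omit it, and uses an arbitrary four-minute threshold. The names `parse_stamp` and `find_gaps` are made up for this example.

```python
from datetime import datetime, timedelta

ASSUMED_YEAR = 2021  # assumption: the timestamps omit the year, so one is supplied


def parse_stamp(line, year=ASSUMED_YEAR):
    """Parse the leading 'Apr 28 05:22:33' style timestamp of a syslog line."""
    return datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")


def find_gaps(lines, threshold=timedelta(minutes=4)):
    """Return (previous, current, gap) for consecutive entries further apart than threshold."""
    gaps = []
    previous = None
    for line in lines:
        try:
            current = parse_stamp(line)
        except ValueError:
            continue  # line does not start with a recognisable timestamp
        if previous is not None and current - previous > threshold:
            gaps.append((previous, current, current - previous))
        previous = current
    return gaps


if __name__ == "__main__":
    excerpt = [
        "Apr 28 05:22:33 Hathor rsyslogd: start",
        "Apr 28 05:25:37 Hathor ntpd[2090]: kernel reports TIME_ERROR: 0x41: Clock Unsynchronized",
        "Apr 28 05:30:23 Hathor root: Delaying execution of fix common problems scan for 10 minutes",
    ]
    for before, after, gap in find_gaps(excerpt):
        print(f"{before:%H:%M:%S} -> {after:%H:%M:%S}: nothing logged for {gap.total_seconds():.0f} s")
```

With the excerpt above, the only gap over four minutes is the one between 05:25:37 and 05:30:23, which brackets the window where the pings timed out.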

 

Here are my syslog settings

image.png.df468f5aa35d57872da013c9cd79cfba.png

 

 

As I was gathering this information, it crashed again even though I wasn't using the system. I'm restarting the memtest, but for some reason it only shows 3 slots?

image.png.f45089bff7e63c9f2520243dc41c1ed5.png

syslog-192.168.9.10.log

Link to comment

Nothing being logged about the crash usually points to a hardware problem. One more thing you can try is to boot the server in safe mode with all Docker containers and VMs disabled and let it run as a basic NAS for a few days. If it still crashes, it's likely a hardware problem; if it doesn't, start turning the other services back on one by one.

Link to comment

You have ECC memory. The Memtest that comes with Unraid will not find any memory errors. You need to create a boot stick with the updated version (a Google search will find it; licensing prevents LT from including it in the OS).

Link to comment

The PassMark one is what I've used to test ECC memory. Be sure to boot the Memtest stick in UEFI mode, not the traditional Memtest86 with the blue screen, which is the Legacy/BIOS boot. The stick will have both. The UEFI one is what did the trick for me when testing ECC; I think the legacy one does not.

Link to comment

I figured out the cause of the server shutting off: https://imgur.com/a/RGSmkJz. A burnt 24-pin connector. Not sure if it's the power supply's or the motherboard's fault, but it's time for some replacement parts: at least the motherboard and power supply, it seems.

 

Wonder if I should replace it with some newer hardware, and if so, which...

pinsburnt.jpg

moboburnt.jpg

Link to comment
  • Majawat changed the title to [SOLVED] Recent Frequent Crashing (Not related to unRAID version or macvlan?)
1 minute ago, Majawat said:

Not sure if it's the power supply or the motherboard's fault.

Yes. That kind of damage is typically caused by a loose fit, where the metal on metal doesn't firmly hold, causing only a small spot to touch, causing high resistance and temperature. Technically the power supply end is what is supposed to provide the clamping and spring force, but the board tolerances probably didn't help. The motherboard could possibly be salvaged, by thoroughly cleaning the burnt parts on the metal posts inside the connector, but that would be a serious pain to do, probably with a very small bit on a dremel.
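
To put rough numbers on the high-resistance point: the heat dissipated in a contact is P = I² × R, so a poor contact carrying the same current as a good one heats up in proportion to its resistance. The Python sketch below uses purely hypothetical values for the pin current and contact resistances; none of them were measured on this board.

```python
def contact_power_watts(current_amps, contact_resistance_ohms):
    """Power dissipated as heat in a single contact: P = I^2 * R."""
    return current_amps ** 2 * contact_resistance_ohms


# Hypothetical values for illustration only; none come from this thread.
PIN_CURRENT_A = 6.0        # assumed current through one pin
GOOD_CONTACT_OHMS = 0.005  # assumed resistance of a firmly seated contact
POOR_CONTACT_OHMS = 0.100  # assumed resistance of a loose, partially touching contact

if __name__ == "__main__":
    good = contact_power_watts(PIN_CURRENT_A, GOOD_CONTACT_OHMS)
    poor = contact_power_watts(PIN_CURRENT_A, POOR_CONTACT_OHMS)
    print(f"good contact: {good:.2f} W, poor contact: {poor:.2f} W")
```

Under these assumed values the loose contact dissipates twenty times the heat of the good one, all concentrated in a tiny spot inside the plastic housing.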

Link to comment
