Majawat Posted April 27, 2021 (edited)
Honestly, I have no idea what's going on, but recently I've been having unRAID crashes. I was on 6.9.1 with some crashes, then updated to 6.9.2, and it happens there as well. Attached are my diags after such a crash. From what I barely understand, I don't have any Docker containers with custom IPs, so it's not a macvlan issue. Please correct me if I'm wrong on these assumptions.
Unfortunately, I only find out after the crash, when I get a notification that a parity check has started, so I'm not really sure what's going on. I'm also not physically near my hardware; it's at my brother's house (he has better internet), but I can IPMI into the box. And of course, I'm not very knowledgeable about all this stuff, so please let me know if you need more information or anything. Much appreciated!
hathor-diagnostics-20210427-1015.zip
Edited April 29, 2021 by Majawat: solved!
JorgeB Posted April 27, 2021
Enable this, then post that log after a crash.
Majawat (Author) Posted April 27, 2021
Here are new logs after turning on Mirror syslog to flash, taken after the next crash.
hathor-diagnostics-20210427-1109.zip
JorgeB Posted April 27, 2021
1 hour ago, JorgeB said: "post that log after a crash."
Majawat (Author) Posted April 27, 2021
But I did have a crash before I grabbed the diags. I've had a few more since then. Here's another diag file, taken just after another crash, right after the server came back up and started another parity check.
hathor-diagnostics-20210427-1326.zip
Majawat (Author) Posted April 27, 2021 (edited)
I turned off all my Docker containers and my VMs, and it hasn't crashed in a while now. I'm going to slowly turn them on one at a time and see which one is doing it. I have a feeling it's my new-ish W7 VM, which I hope it isn't.
Edited April 27, 2021 by Majawat (VM isn't that new, just most recent change)
Majawat (Author) Posted April 27, 2021 (edited)
I turned on two Docker containers to help me test another issue: https://forums.unraid.net/topic/107528-docker-containers-become-slowunusable-during-large-data-movement/
Everything was OK for a while. Then I turned on a single VM (not the new-ish one, one I've had for a long time) and pretty quickly got a crash. Specifically, I turned on the VM called Hraf. Though what's the likelihood that I'd choose the one VM with an issue? I'm guessing it's more that running any VM causes the crashing... I'll try other ones and see what happens.
Edit: It crashed with just a file copy job going and nothing else running; no VMs, no Docker containers. Though I think a parity check was going. I'm going to restart in Safe Mode and see what happens.
Edited April 28, 2021 by Majawat
JorgeB Posted April 28, 2021
12 hours ago, Majawat said: "But I did have a crash before I grabbed the diags."
What do you mean by crash, then? Usually a crash means the server is unresponsive and you can't even get diags. I don't see anything out of the ordinary logged in that syslog.
Majawat (Author) Posted April 28, 2021
I mean the whole server stops and restarts all on its own. A non-graceful shutdown. Then it comes back up, starts a parity check, and I download the diags. Almost like a blue screen in Windows, but I don't see anything like that screen here.
ChatNoir Posted April 28, 2021
The diagnostics you grab after a reboot will have a clean log, so there's probably not much to see. Can you check the /logs folder of your flash drive? During an unclean shutdown, Unraid tries to generate diagnostics before rebooting. If there is a recent file, it might be helpful.
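If the flash drive's logs folder has a lot in it, a short script can pull out the most recent files. This is only a rough Python sketch: `newest_files` is a made-up helper name, and the `/boot/logs` path assumes the flash drive is mounted at /boot, so adjust it to wherever the flash contents are visible (a network share or the stick plugged into another machine works too).

```python
from pathlib import Path


def newest_files(folder, limit=5):
    """Return up to `limit` regular files in `folder`, newest first by modification time."""
    entries = [p for p in Path(folder).iterdir() if p.is_file()]
    return sorted(entries, key=lambda p: p.stat().st_mtime, reverse=True)[:limit]


if __name__ == "__main__":
    # Assumption: the flash drive is mounted at /boot; change this path if it is not.
    logs = Path("/boot/logs")
    if logs.is_dir():
        for path in newest_files(logs):
            print(path.name)
    else:
        print(f"{logs} not found; point the script at the flash drive's logs folder")
```

Anything with a timestamp close to the unexpected reboot is the file worth attaching.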
JorgeB Posted April 28, 2021
53 minutes ago, Majawat said: "I mean the whole server stops and restarts all on its own."
The diagnostics can't help with this, but this can:
16 hours ago, JorgeB said: "Enable this then post that log after a crash."
Majawat (Author) Posted April 28, 2021
Oooh, I understand now. I thought the diagnostics included the syslog created by that setting; I get now that it's a different file. I'll post it in the morning (it's 3am now). I'm also running a memtest now. Thank you for your patience.
Majawat (Author) Posted April 28, 2021
Ok, couldn't sleep, so I stopped the memtest and got the syslog and timestamps.

Pings showing that it went down and when:

5:27:14.78 Reply from 192.168.9.10: bytes=32 time<1ms TTL=64
5:27:15.81 Reply from 192.168.9.10: bytes=32 time<1ms TTL=64
5:27:16.82 Reply from 192.168.9.10: bytes=32 time<1ms TTL=64
5:27:17.85 Reply from 192.168.9.10: bytes=32 time<1ms TTL=64
5:27:22.80 Request timed out.
5:27:32.80 Request timed out.
... (truncated)
5:29:43.29 Request timed out.
5:29:45.32 Request timed out.
5:29:47.35 Reply from 192.168.9.10: bytes=32 time<1ms TTL=64
5:29:49.38 Reply from 192.168.9.10: bytes=32 time=2ms TTL=64
5:29:51.41 Reply from 192.168.9.10: bytes=32 time<1ms TTL=64
5:29:53.44 Reply from 192.168.9.10: bytes=32 time<1ms TTL=64
5:29:55.47 Reply from 192.168.9.10: bytes=32 time<1ms TTL=64

Attached is the syslog. At the time of the crash, all I was doing was navigating from the Dashboard to the Main tab. No Docker containers or VMs were started, and no parity check was running.

Apr 28 05:22:33 Hathor rsyslogd: [origin software="rsyslogd" swVersion="8.2002.0" x-pid="14293" x-info="https://www.rsyslog.com"] start
Apr 28 05:25:37 Hathor ntpd[2090]: kernel reports TIME_ERROR: 0x41: Clock Unsynchronized
Apr 28 05:30:23 Hathor root: Delaying execution of fix common problems scan for 10 minutes
Apr 28 05:30:23 Hathor unassigned.devices: Mounting 'Auto Mount' Devices...
Apr 28 05:30:23 Hathor emhttpd: Starting services...
Apr 28 05:30:23 Hathor emhttpd: shcmd (81): /etc/rc.d/rc.samba restart

It shows no log entries immediately prior to that crash. Here are my syslog settings.

As I was gathering this information, it crashed again despite my not using the system. I'm restarting the memtest, but for some reason it only shows 3 slots?
syslog-192.168.9.10.log
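For anyone reading later: one way to spot this kind of hard reset in a mirrored syslog is to look for a silent stretch followed by boot-time messages. Below is a minimal Python sketch of that idea, not anything Unraid ships. The function name `find_log_gaps`, the default year, and the 4-minute threshold are illustrative assumptions; the sample lines are the ones from the post above. A quiet but healthy server can also go minutes without logging, so a gap is only a hint about where to look.

```python
from datetime import datetime, timedelta


def find_log_gaps(lines, min_gap=timedelta(minutes=4), year=2021):
    """Return (previous_time, next_time, gap) for each silent stretch >= min_gap."""
    gaps = []
    previous = None
    for line in lines:
        # Classic syslog lines start with a 15-character "Mon DD HH:MM:SS" stamp.
        # The stamp carries no year, so one is supplied (an assumption).
        try:
            current = datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")
        except ValueError:
            continue  # skip lines that do not start with a timestamp
        if previous is not None and current - previous >= min_gap:
            gaps.append((previous, current, current - previous))
        previous = current
    return gaps


sample = [
    'Apr 28 05:22:33 Hathor rsyslogd: [origin software="rsyslogd" swVersion="8.2002.0" x-pid="14293" x-info="https://www.rsyslog.com"] start',
    "Apr 28 05:25:37 Hathor ntpd[2090]: kernel reports TIME_ERROR: 0x41: Clock Unsynchronized",
    "Apr 28 05:30:23 Hathor root: Delaying execution of fix common problems scan for 10 minutes",
    "Apr 28 05:30:23 Hathor unassigned.devices: Mounting 'Auto Mount' Devices...",
    "Apr 28 05:30:23 Hathor emhttpd: Starting services...",
    "Apr 28 05:30:23 Hathor emhttpd: shcmd (81): /etc/rc.d/rc.samba restart",
]

for start, end, gap in find_log_gaps(sample):
    print(f"silent from {start} to {end} ({gap})")
```

On the excerpt above this flags a single stretch, from 05:25:37 to 05:30:23, which brackets the ping outage (last reply at 5:27:17, replies again from 5:29:47). Nothing is logged on the way down, only the services starting again afterwards.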
JorgeB Posted April 28, 2021
Nothing being logged about the crash usually points to a hardware problem. One more thing you can try is to boot the server in safe mode with all Docker containers/VMs disabled and let it run as a basic NAS for a few days. If it still crashes, it's likely a hardware problem; if it doesn't, start turning on the other services one by one.
ChatNoir Posted April 28, 2021
1 hour ago, Majawat said: "As I was gathering this information, it crashed again despite not using the system. I'm restarting the memtest. but for some reason it only shows 3 slots?"
Are you using a single CPU?
Majawat (Author) Posted April 28, 2021
1 minute ago, ChatNoir said: "Are you using a single CPU ?"
No, dual Xeons. Interestingly, it does at least show the full capacity there: 48 GB.
Squid Posted April 28, 2021
You have ECC memory. The Memtest that comes with Unraid will not find any memory errors. You need to create a boot stick with the updated one (Google search; licensing prevents LT from including it in the OS).
Majawat (Author) Posted April 28, 2021
1 hour ago, Squid said: "You need to create a boot stick with the updated one (google search. Licensing prevents LT from including it in the OS)"
Is that the one from PassMark or the open source one? Or does it matter?
rodan5150 Posted April 28, 2021
The PassMark one is the one I've used to test ECC memory. Be sure to boot the Memtest stick in UEFI mode, not the traditional Memtest86 with the blue screen, which is the Legacy/BIOS boot. The stick will have both. The UEFI one is the one that did the trick for me when testing ECC; I think the legacy one does not.
Majawat (Author) Posted April 29, 2021
I figured out the cause of the server shutting off: https://imgur.com/a/RGSmkJz. Burnt 24-pin connector. Not sure if it's the power supply or the motherboard's fault, but it's time for some replacement parts. At least a motherboard and power supply, it seems. I wonder if I should replace it with some newer hardware, and if so, which...
JonathanM Posted April 29, 2021
1 minute ago, Majawat said: "Not sure if it's the power supply or the motherboard's fault."
Yes. That kind of damage is typically caused by a loose fit, where the metal-on-metal contact doesn't hold firmly, so only a small spot touches, which causes high resistance and heat. Technically the power supply end is what is supposed to provide the clamping and spring force, but the board tolerances probably didn't help. The motherboard could possibly be salvaged by thoroughly cleaning the burnt parts on the metal posts inside the connector, but that would be a serious pain to do, probably with a very small bit on a Dremel.