Crashing daily and have to cycle power just to get back to UI

randomjohn · November 29, 2020

v6.8.3 (registered)

Plugins: Fix Common Problems, Unassigned Devices, CA Backup / Restore Appdata, Community Applications, Dynamix S3 Sleep, Dynamix SSD TRIM, Dynamix System Buttons, Dynamix System Information, Dynamix System Statistics, Dynamix System Temperature, Nerd Tools, Unassigned Devices Plus

Dockers: binhex-sabnzbd, plexinc plex media server, linuxserver radarr/sonarr/lidarr

Hardware:

Model: Custom

M/B: ASUSTeK COMPUTER INC. PRIME Z390-A Version Rev 1.xx - s/n: 200873513103219

BIOS: American Megatrends Inc. Version 1602. Dated: 06/04/2020

CPU: Intel® Core™ i9-9900K CPU @ 3.60GHz

HVM: Enabled

IOMMU: Enabled

Cache: 512 KiB, 2048 KiB, 16384 KiB

Memory: 32 GiB DDR4 (max. installable capacity 64 GiB)

Network: bond0: fault-tolerance (active-backup), mtu 1500
eth0: 1000 Mbps, full duplex, mtu 1500

Kernel: Linux 4.19.107-Unraid x86_64

Diagnostics attached. Thanks in advance for your help.

tower-diagnostics-20201128-1928.zip

JorgeB · November 29, 2020

Try this and post that log after a crash. If you haven't yet, it's also a good idea to run memtest.

randomjohn · November 30, 2020

Not sure if the syslog worked, but I've attached it. MemTest looks good - although I couldn't run it from the Unraid boot - I had to go to the website and create a new USB with just that on it to run MemTest. Could that be part of the problem?

MemTest86-Report-20201129-111840.html syslog

JorgeB · November 30, 2020

4 hours ago, randomjohn said:

Could that be part of the problem?

Memtest only works if you boot in legacy mode; it won't work in UEFI mode. There's nothing in the syslog that I can see. One thing you can try is to boot the server in safe mode with all Docker containers/VMs disabled and let it run as a basic NAS for a few days. If it still crashes, it's likely a hardware problem; if it doesn't, start turning the other services back on one by one.

randomjohn · December 1, 2020

OK, thanks for that. I'm now getting an MCE warning from Fix Common Problems. Any chance you see something in the attached, or is there a specific MCE area?

(The machine is headless and I haven't moved it back to a monitor yet so unless there's an option to boot to safe mode from the GUI, I haven't been able to do that yet - I stupidly put it away after running memtest)

tower-diagnostics-20201201-1205.zip

JorgeB · December 1, 2020

IIRC they show up in the syslog with mcelog installed, but in any case it points to a hardware problem.
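For anyone wanting to pull those entries out of a saved syslog, here is a minimal Python sketch. It assumes the MCE-related lines look like the ones quoted elsewhere in this thread ("Machine Check Events", "mce: [Hardware Error]", "MCA:"); the function name `find_mce_lines` and the default path `/var/log/syslog` are assumptions for illustration, so adjust the path to wherever your syslog was saved.

```python
import re
import sys

# Substrings taken from the log excerpts quoted in this thread; other
# kernels or mcelog versions may word these lines differently.
MCE_PATTERN = re.compile(
    r"mce: \[Hardware Error\]|Machine Check Events|Hardware event|MCA:"
)


def find_mce_lines(lines):
    """Return the lines that look like machine-check (MCE) entries."""
    return [line.rstrip("\n") for line in lines if MCE_PATTERN.search(line)]


if __name__ == "__main__":
    # Hypothetical default path; pass the real syslog file as an argument.
    path = sys.argv[1] if len(sys.argv) > 1 else "/var/log/syslog"
    with open(path, errors="replace") as handle:
        for hit in find_mce_lines(handle):
            print(hit)
```

This only filters lines; it does not say which component is at fault.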

randomjohn · December 1, 2020

OK, I'm trying to sift through the syslog now. I did waste a lot of space with csrf_token errors from having browser sessions open across a reboot. This is the only thing I've found that looks like it might be on point and I can't make heads or tails of it:

Dec 1 12:47:07 Tower root: Fix Common Problems: Error: Machine Check Events detected on your server

Dec 1 12:47:07 Tower root: Hardware event. This is not a software error.

Dec 1 12:47:07 Tower root: MCE 0

Dec 1 12:47:07 Tower root: CPU 0 BANK 0 TSC 23a4c0e61fe6

Dec 1 12:47:07 Tower root: ADDR 1ffff8107baa3

Dec 1 12:47:07 Tower root: TIME 1606800346 Tue Dec 1 00:25:46 2020

Dec 1 12:47:07 Tower root: MCG status:

Dec 1 12:47:07 Tower root: MCi status:

Dec 1 12:47:07 Tower root: Corrected error

Dec 1 12:47:07 Tower root: Error enabled

Dec 1 12:47:07 Tower root: MCi_ADDR register valid

Dec 1 12:47:07 Tower root: MCA: Instruction CACHE Level-0 Instruction-Fetch Error

Dec 1 12:47:07 Tower root: STATUS 9400004000040150 MCGSTATUS 0

Dec 1 12:47:07 Tower root: MCGCAP c0e APICID 0 SOCKETID 0

Dec 1 12:47:07 Tower root: MICROCODE d6

Dec 1 12:47:07 Tower root: CPUID Vendor Intel Family 6 Model 158
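For readers trying to make sense of the STATUS line above, here is a rough Python sketch that unpacks the value by hand. It assumes the architectural IA32_MCi_STATUS bit layout and the "cache hierarchy" error-code format described in Intel's Software Developer's Manual; the function and table names are made up for illustration and are not part of mcelog.

```python
# Field positions follow the architectural IA32_MCi_STATUS layout in the
# Intel SDM; function and table names are illustrative only.
REQUEST = {0: "generic error", 1: "generic read", 2: "generic write",
           3: "data read", 4: "data write", 5: "instruction fetch",
           6: "prefetch", 7: "eviction", 8: "snoop"}
TRANSACTION = {0: "instruction", 1: "data", 2: "generic"}
LEVEL = {0: "level 0", 1: "level 1", 2: "level 2", 3: "generic level"}


def decode_mci_status(status: int) -> dict:
    """Split a 64-bit MCi_STATUS value into its main flag fields."""
    def bit(n: int) -> bool:
        return bool((status >> n) & 1)

    code = status & 0xFFFF  # MCA error code, bits 15:0
    info = {
        "valid": bit(63),
        "overflow": bit(62),
        "uncorrected": bit(61),
        "enabled": bit(60),
        "misc_valid": bit(59),
        "addr_valid": bit(58),
        "context_corrupt": bit(57),
        # Bits 52:38; only meaningful on CPUs that report corrected counts.
        "corrected_count": (status >> 38) & 0x7FFF,
        "mca_code": code,
    }
    # Cache hierarchy errors have the form 000F 0001 RRRR TTLL.
    if (code & 0xEF00) == 0x0100:
        request = REQUEST.get((code >> 4) & 0xF, "unknown request")
        transaction = TRANSACTION.get((code >> 2) & 0x3, "unknown type")
        level = LEVEL[code & 0x3]
        info["description"] = f"{transaction} cache, {level}, {request}"
    else:
        info["description"] = "not a cache hierarchy error (not decoded here)"
    return info


if __name__ == "__main__":
    for key, value in decode_mci_status(0x9400004000040150).items():
        print(f"{key}: {value}")
```

For the value in this log, the sketch reports a valid, enabled, corrected (not uncorrected) error with a valid address, a corrected count of 1, and an instruction-fetch error in the level 0 instruction cache, which agrees with what mcelog printed. It narrows the event down to the CPU's cache, but by itself it does not say whether the CPU, its settings, or something else is the root cause.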

JorgeB · December 2, 2020

Start by running memtest.

Wingold · December 2, 2020

Hello randomjohn, at the moment I have the same problem, in this post.

The same problem was caused by the Plex container and its maintenance run, so maybe you can disable the Docker temporarily and test whether there are further crashes.
I did this today too.

randomjohn · December 5, 2020

On 12/2/2020 at 2:52 AM, JorgeB said:

Start by running memtest.

I ran both Memtest86+ 5.01 (thanks for your heads up to boot in legacy mode) and MemTest 86 v0.4. They each ran for at least four passes and about 20 additional hours and returned no errors.

I've been running error-free in Safe Mode for ~24 hours now. Will be rather annoyed (but relieved) if it's the same problem @Wingold refers to - at least I'll know, but a significant part of the hardware purchase and Unraid installation was to up the horsepower for Plex streaming.

randomjohn · December 7, 2020

Any of this look interesting or relevant? Still trying to figure out what's going on with the ntp server.

Dec 6 17:56:21 Tower ntpd[30258]: kernel reports TIME_ERROR: 0x2041: Clock Unsynchronized
Dec 6 17:56:21 Tower ntpd[30258]: kernel reports TIME_ERROR: 0x2041: Clock Unsynchronized
Dec 6 18:02:00 Tower ntpd[30258]: kernel reports TIME_ERROR: 0x2041: Clock Unsynchronized
Dec 6 18:13:37 Tower kernel: mce: [Hardware Error]: Machine check events logged
Dec 6 18:19:58 Tower kernel: mce: [Hardware Error]: Machine check events logged
Dec 6 18:25:50 Tower kernel: python3[9144]: segfault at 689b401a4b3f ip 000014ebe5c85646 sp 000014ebd31f7160 error 4 in libpython3.8.so.1.0[14ebe5bc0000+1e7000]
Dec 6 18:25:50 Tower kernel: Code: f8 41 89 f1 42 ff 24 f1 0f 1f 40 00 4c 89 ee 4c 89 3e 4c 8d 6e 08 4d 85 ff 0f 84 62 e8 ff ff 44 8b b3 50 02 00 00 45 85 f6 0f <85> 6c b7 ff ff 44 8b 9b 4c 02 00 00 4c 89 d7 48 2b 3c 24 45 85 db
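As a side note on the ntpd lines, the 0x2041 value is the kernel's clock status word. The following Python sketch decodes it using the STA_* flag values from the Linux `timex.h` header; the function name `decode_timex_status` is an assumption made for this example.

```python
# STA_* flag values as defined in the Linux <linux/timex.h> header.
STA_FLAGS = {
    0x0001: "STA_PLL",        # PLL updates enabled
    0x0002: "STA_PPSFREQ",    # PPS frequency discipline enabled
    0x0004: "STA_PPSTIME",    # PPS time discipline enabled
    0x0008: "STA_FLL",        # FLL mode selected
    0x0010: "STA_INS",        # insert leap second
    0x0020: "STA_DEL",        # delete leap second
    0x0040: "STA_UNSYNC",     # clock unsynchronized
    0x0080: "STA_FREQHOLD",   # hold frequency
    0x0100: "STA_PPSSIGNAL",  # PPS signal present
    0x0200: "STA_PPSJITTER",  # PPS signal jitter exceeded
    0x0400: "STA_PPSWANDER",  # PPS signal wander exceeded
    0x0800: "STA_PPSERROR",   # PPS signal calibration error
    0x1000: "STA_CLOCKERR",   # clock hardware fault
    0x2000: "STA_NANO",       # nanosecond resolution
    0x4000: "STA_MODE",       # FLL mode (0 = PLL)
    0x8000: "STA_CLK",        # clock source
}


def decode_timex_status(status: int) -> list:
    """Return the names of the STA_* flags set in a status word."""
    return [name for flag, name in STA_FLAGS.items() if status & flag]


if __name__ == "__main__":
    print(decode_timex_status(0x2041))
```

For 0x2041 this gives STA_PLL, STA_UNSYNC and STA_NANO. STA_UNSYNC is the part ntpd reports as "Clock Unsynchronized"; STA_CLOCKERR (0x1000), the clock hardware fault flag, is not set. As far as I know this message is common while ntpd has not yet synchronized, so on its own it may be unrelated to the MCE and segfault entries.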

JorgeB · December 7, 2020

8 hours ago, randomjohn said:

Any of this look interesting or relevant?

They point to a hardware problem, but don't identify the component.

randomjohn · December 7, 2020

That's what I was afraid of. Any recommendations for good diagnostics? Weird that it would start after two months. Usually it's right away or after a long time. I did add and move fans, which significantly dropped internal temps but maybe that was too late.

I've attached the most-recent diagnostics, but I don't really know what I'm looking for and nothing leapt out at me.

Thanks for all your help.

tower-diagnostics-20201206-2253.zip

Edited December 7, 2020 by randomjohn

JorgeB · December 7, 2020

Anything can go bad at any time, if the diags don't point to the culprit basically you'd need to start swapping components around to find it.

randomjohn · December 11, 2020

I guess we can consider this one closed but not solved.

While it ran properly before, a BIOS update came out recently for my motherboard. I installed that, then upgraded from 6.8 stable to the Unraid test build (and now RC-1).
In reaction to some temp warnings I was getting, I significantly increased the overall system cooling, to where the NVMe drive is now about 10C cooler at idle. Mobo and CPU are about 5C cooler (~29C at idle).
Finally, I spent a lot of time tracking down NTP errors and switched out some of the default google servers for the NTP pool for my region.

I really hope I'm not jinxing anything, but I was at 4 days of error-free uptime when I posted this. Hopefully something in here helps the next poor soul.
