Crashing daily and have to cycle power just to get back to UI



v6.8.3 (registered)

Plugins: Fix Common Problems, Unassigned Devices, CA Backup / Restore Appdata, Community Applications, Dynamix S3 Sleep, Dynamix SSD TRIM, Dynamix System Buttons, Dynamix System Information, Dynamix System Statistics, Dynamix System Temperature, Nerd Tools, Unassigned Devices Plus

 

Dockers: binhex-sabnzbd, plexinc plex media server, linuxserver radarr/sonarr/lidarr

 

Hardware:

Model: Custom

M/B: ASUSTeK COMPUTER INC. PRIME Z390-A Version Rev 1.xx - s/n: 200873513103219

BIOS: American Megatrends Inc. Version 1602. Dated: 06/04/2020

CPU: Intel® Core™ i9-9900K CPU @ 3.60GHz

HVM: Enabled

IOMMU: Enabled

Cache: 512 KiB, 2048 KiB, 16384 KiB

Memory: 32 GiB DDR4 (max. installable capacity 64 GiB)

Network: bond0: fault-tolerance (active-backup), mtu 1500
 eth0: 1000 Mbps, full duplex, mtu 1500

Kernel: Linux 4.19.107-Unraid x86_64

 

Diagnostics attached. Thanks in advance for your help.

 

tower-diagnostics-20201128-1928.zip

4 hours ago, randomjohn said:

Could that be part of the problem?

Memtest only works if you boot in legacy mode; it won't work in UEFI mode. There's nothing in the syslog that I can see. One thing you can try is to boot the server in safe mode with all Docker containers and VMs disabled, and let it run as a basic NAS for a few days. If it still crashes, it's likely a hardware problem; if it doesn't, start turning the other services back on one by one.


OK, thanks for that. I'm now getting an MCE warning from Fix Common Problems. Any chance you see something in the attached, or is there a specific MCE area?

 

(The machine is headless and I haven't moved it back to a monitor yet so unless there's an option to boot to safe mode from the GUI, I haven't been able to do that yet - I stupidly put it away after running memtest)

tower-diagnostics-20201201-1205.zip


OK, I'm trying to sift through the syslog now. I did waste a lot of space with csrf_token errors from having browser sessions open across a reboot. This is the only thing I've found that looks like it might be on point and I can't make heads or tails of it:

 

Dec 1 12:47:07 Tower root: Fix Common Problems: Error: Machine Check Events detected on your server
Dec 1 12:47:07 Tower root: Hardware event. This is not a software error.
Dec 1 12:47:07 Tower root: MCE 0
Dec 1 12:47:07 Tower root: CPU 0 BANK 0 TSC 23a4c0e61fe6
Dec 1 12:47:07 Tower root: ADDR 1ffff8107baa3
Dec 1 12:47:07 Tower root: TIME 1606800346 Tue Dec 1 00:25:46 2020
Dec 1 12:47:07 Tower root: MCG status:
Dec 1 12:47:07 Tower root: MCi status:
Dec 1 12:47:07 Tower root: Corrected error
Dec 1 12:47:07 Tower root: Error enabled
Dec 1 12:47:07 Tower root: MCi_ADDR register valid
Dec 1 12:47:07 Tower root: MCA: Instruction CACHE Level-0 Instruction-Fetch Error
Dec 1 12:47:07 Tower root: STATUS 9400004000040150 MCGSTATUS 0
Dec 1 12:47:07 Tower root: MCGCAP c0e APICID 0 SOCKETID 0
Dec 1 12:47:07 Tower root: MICROCODE d6
Dec 1 12:47:07 Tower root: CPUID Vendor Intel Family 6 Model 158
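
For anyone else trying to make heads or tails of an MCE like this one: the STATUS value is a bit field, and the human-readable lines above it ("Corrected error", "Error enabled", "MCi_ADDR register valid", "Instruction CACHE Level-0 Instruction-Fetch Error") are just that field decoded. Below is a minimal Python sketch of the decoding, assuming the architectural IA32_MCi_STATUS layout from the Intel SDM. The function and table names are made up for illustration and are not part of mcelog or any library.

```python
# Sketch: decode an IA32_MCi_STATUS value as printed in the STATUS field.
# Assumes the architectural bit layout from the Intel SDM; all names here
# are illustrative and not taken from mcelog or any other tool.

FLAG_BITS = {
    63: "VAL (register holds a valid error)",
    62: "OVER (an earlier error was overwritten)",
    61: "UC (uncorrected error)",
    60: "EN (error reporting enabled)",
    59: "MISCV (MCi_MISC register valid)",
    58: "ADDRV (MCi_ADDR register valid)",
    57: "PCC (processor context corrupt)",
}

# Sub-fields of a cache hierarchy error code: 000F 0001 RRRR TTLL
REQUEST = {0: "generic", 1: "read", 2: "write", 3: "data read",
           4: "data write", 5: "instruction fetch", 6: "prefetch",
           7: "eviction", 8: "snoop"}
TRANSACTION = {0: "instruction", 1: "data", 2: "generic"}
LEVEL = {0: "level 0", 1: "level 1", 2: "level 2", 3: "generic level"}


def decode_mci_status(status: int) -> dict:
    """Return the flags and MCA error code carried in a status value."""
    flags = [name for bit, name in FLAG_BITS.items() if (status >> bit) & 1]
    mca_code = status & 0xFFFF  # bits 15:0
    result = {
        "flags": flags,
        "corrected": not ((status >> 61) & 1),       # UC bit clear
        "corrected_count": (status >> 38) & 0x7FFF,  # bits 52:38, if reported
        "mca_code": mca_code,
        "cache_error": None,
    }
    # Cache hierarchy errors match 000F 0001 RRRR TTLL (F is a filter bit).
    if (mca_code & 0xEF00) == 0x0100:
        result["cache_error"] = {
            "request": REQUEST.get((mca_code >> 4) & 0xF, "unknown"),
            "transaction": TRANSACTION.get((mca_code >> 2) & 0x3, "unknown"),
            "level": LEVEL[mca_code & 0x3],
        }
    return result


if __name__ == "__main__":
    decoded = decode_mci_status(0x9400004000040150)
    for key, value in decoded.items():
        print(key, "=", value)
```

Fed the STATUS value from the log, this reports the UC bit clear (a corrected error), a corrected-error count of 1, and an instruction-fetch error in the level-0 instruction cache, which agrees with the decoded text in the log.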

On 12/2/2020 at 2:52 AM, JorgeB said:

Start by running memtest.

I ran both Memtest86+ 5.01 (thanks for your heads up to boot in legacy mode) and MemTest 86 v0.4. They each ran for at least four passes and about 20 additional hours and returned no errors.

 

I've been running error-free in Safe Mode for ~24 hours now. Will be rather annoyed (but relieved) if it's the same problem @Wingold refers to - at least I'll know, but a significant part of the hardware purchase and Unraid installation was to up the horsepower for Plex streaming.

 


Any of this look interesting or relevant? Still trying to figure out what's going on with the ntp server.

 

Dec  6 17:56:21 Tower ntpd[30258]: kernel reports TIME_ERROR: 0x2041: Clock Unsynchronized
Dec  6 17:56:21 Tower ntpd[30258]: kernel reports TIME_ERROR: 0x2041: Clock Unsynchronized
Dec  6 18:02:00 Tower ntpd[30258]: kernel reports TIME_ERROR: 0x2041: Clock Unsynchronized
Dec  6 18:13:37 Tower kernel: mce: [Hardware Error]: Machine check events logged
Dec  6 18:19:58 Tower kernel: mce: [Hardware Error]: Machine check events logged
Dec  6 18:25:50 Tower kernel: python3[9144]: segfault at 689b401a4b3f ip 000014ebe5c85646 sp 000014ebd31f7160 error 4 in libpython3.8.so.1.0[14ebe5bc0000+1e7000]
Dec  6 18:25:50 Tower kernel: Code: f8 41 89 f1 42 ff 24 f1 0f 1f 40 00 4c 89 ee 4c 89 3e 4c 8d 6e 08 4d 85 ff 0f 84 62 e8 ff ff 44 8b b3 50 02 00 00 45 85 f6 0f <85> 6c b7 ff ff 44 8b 9b 4c 02 00 00 4c 89 d7 48 2b 3c 24 45 85 db
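
Since the syslog is long and mostly noise (the csrf_token errors, for instance), a small filter can help pull out just the lines of interest. This is a rough Python sketch, not an Unraid tool: the three patterns are simply taken from the lines quoted above, and the category names and the `summarize` function are made up for illustration.

```python
import re
from collections import Counter

# Patterns copied from the kinds of lines seen in the excerpt above.
# The category names are arbitrary labels chosen for this sketch.
PATTERNS = {
    "mce": re.compile(r"mce: \[Hardware Error\]"),
    "segfault": re.compile(r"segfault at [0-9a-f]+"),
    "ntp_unsync": re.compile(r"TIME_ERROR: 0x[0-9a-f]+"),
}


def summarize(lines):
    """Count how many lines match each pattern of interest."""
    counts = Counter()
    for line in lines:
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
    return counts


SAMPLE = [
    "Dec  6 17:56:21 Tower ntpd[30258]: kernel reports TIME_ERROR: 0x2041: Clock Unsynchronized",
    "Dec  6 17:56:21 Tower ntpd[30258]: kernel reports TIME_ERROR: 0x2041: Clock Unsynchronized",
    "Dec  6 18:02:00 Tower ntpd[30258]: kernel reports TIME_ERROR: 0x2041: Clock Unsynchronized",
    "Dec  6 18:13:37 Tower kernel: mce: [Hardware Error]: Machine check events logged",
    "Dec  6 18:19:58 Tower kernel: mce: [Hardware Error]: Machine check events logged",
    "Dec  6 18:25:50 Tower kernel: python3[9144]: segfault at 689b401a4b3f ip 000014ebe5c85646 "
    "sp 000014ebd31f7160 error 4 in libpython3.8.so.1.0[14ebe5bc0000+1e7000]",
]

if __name__ == "__main__":
    for name, count in summarize(SAMPLE).items():
        print(f"{name}: {count}")
```

To run it against a real log, replace `SAMPLE` with the lines of the syslog file from the extracted diagnostics zip, for example `open(path, errors="replace")`, where `path` is wherever that file ended up on your machine.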


That's what I was afraid of. Any recommendations for good diagnostics? Weird that it would start after two months. Usually it's right away or after a long time. I did add and move fans, which significantly dropped internal temps but maybe that was too late.

 

I've attached the most-recent diagnostics, but I don't really know what I'm looking for and nothing leapt out at me.

 

Thanks for all your help.

tower-diagnostics-20201206-2253.zip

Edited by randomjohn

I guess we can consider this one closed but not solved.

 

  • Although it had run properly before, a BIOS update recently came out for my motherboard, so I installed that. I then upgraded from 6.8 stable to the Unraid test build (and now RC-1).
  • In reaction to some temp warnings I was getting, I significantly increased the overall system cooling, to the point where the NVMe drive is now about 10 °C cooler at idle. Mobo and CPU are about 5 °C cooler (~29 °C at idle).
  • Finally, I spent a lot of time tracking down NTP errors and switched out some of the default Google servers for the NTP pool for my region.
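
On the NTP errors: the `0x2041` in the "kernel reports TIME_ERROR: 0x2041: Clock Unsynchronized" lines quoted earlier is the kernel clock status word, and it can be split into named flags. The Python sketch below assumes the STA_* bit values from the Linux `timex.h` header; the `decode_status` helper is a made-up name for illustration.

```python
# Sketch: split a kernel clock status word (as logged by ntpd) into flags.
# Bit values are the STA_* constants from Linux's timex.h; the helper name
# is illustrative only.

STA_FLAGS = {
    0x0001: "STA_PLL",        # PLL updates enabled
    0x0002: "STA_PPSFREQ",    # PPS frequency discipline enabled
    0x0004: "STA_PPSTIME",    # PPS time discipline enabled
    0x0008: "STA_FLL",        # FLL mode selected
    0x0010: "STA_INS",        # insert leap second
    0x0020: "STA_DEL",        # delete leap second
    0x0040: "STA_UNSYNC",     # clock unsynchronized
    0x0080: "STA_FREQHOLD",   # frequency held
    0x0100: "STA_PPSSIGNAL",  # PPS signal present
    0x0200: "STA_PPSJITTER",  # PPS jitter exceeded
    0x0400: "STA_PPSWANDER",  # PPS wander exceeded
    0x0800: "STA_PPSERROR",   # PPS calibration error
    0x1000: "STA_CLOCKERR",   # clock hardware fault
    0x2000: "STA_NANO",       # nanosecond resolution
    0x4000: "STA_MODE",       # FLL mode (vs. PLL)
    0x8000: "STA_CLK",        # clock source B (vs. A)
}


def decode_status(status: int) -> list:
    """Return the names of all STA_* flags set in a status word."""
    return [name for bit, name in STA_FLAGS.items() if status & bit]


if __name__ == "__main__":
    print(decode_status(0x2041))
```

For `0x2041` the flags set are STA_PLL, STA_UNSYNC and STA_NANO. STA_UNSYNC is the one behind the "Clock Unsynchronized" text, and STA_CLOCKERR (a clock hardware fault) is not set in that value.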

 

I really hope I'm not jinxing anything, but I was at 4 days of error-free uptime when I posted this. Hopefully something in here helps the next poor soul.

