randomjohn Posted November 29, 2020

v6.8.3 (registered)

Plugins: Fix Common Problems, Unassigned Devices, CA Backup / Restore Appdata, Community Applications, Dynamix S3 Sleep, Dynamix SSD TRIM, Dynamix System Buttons, Dynamix System Information, Dynamix System Statistics, Dynamix System Temperature, Nerd Tools, Unassigned Devices Plus

Dockers: binhex-sabnzbd, plexinc plex media server, linuxserver radarr/sonarr/lidarr

Hardware:
Model: Custom
M/B: ASUSTeK COMPUTER INC. PRIME Z390-A Version Rev 1.xx - s/n: 200873513103219
BIOS: American Megatrends Inc. Version 1602. Dated: 06/04/2020
CPU: Intel® Core™ i9-9900K CPU @ 3.60GHz
HVM: Enabled
IOMMU: Enabled
Cache: 512 KiB, 2048 KiB, 16384 KiB
Memory: 32 GiB DDR4 (max. installable capacity 64 GiB)
Network: bond0: fault-tolerance (active-backup), mtu 1500
eth0: 1000 Mbps, full duplex, mtu 1500
Kernel: Linux 4.19.107-Unraid x86_64

Diagnostics attached. Thanks in advance for your help.

tower-diagnostics-20201128-1928.zip
JorgeB Posted November 29, 2020

Try this and post that log after a crash. If you haven't yet, it's also a good idea to run memtest.
randomjohn Posted November 30, 2020

Not sure if the syslog worked, but I've attached it. MemTest looks good, although I couldn't run it from the Unraid boot. I had to go to the website and create a new USB with just that on it to run MemTest. Could that be part of the problem?

MemTest86-Report-20201129-111840.html
syslog
JorgeB Posted November 30, 2020

4 hours ago, randomjohn said: Could that be part of the problem?

Memtest only works if you boot in legacy mode; it won't in UEFI mode. There's nothing in the syslog that I can see. One thing you can try is to boot the server in safe mode with all Docker containers/VMs disabled and let it run as a basic NAS for a few days. If it still crashes, it's likely a hardware problem. If it doesn't, start turning the other services back on one by one.
randomjohn Posted December 1, 2020

OK, thanks for that. I'm now getting an MCE warning from Fix Common Problems. Any chance you see something in the attached, or is there a specific MCE area? (The machine is headless and I haven't moved it back to a monitor yet, so unless there's an option to boot to safe mode from the GUI, I haven't been able to do that yet. I stupidly put it away after running memtest.)

tower-diagnostics-20201201-1205.zip
JorgeB Posted December 1, 2020

IIRC they show up in the syslog with mcelog installed, but in any case it points to a hardware problem.
randomjohn Posted December 1, 2020

OK, I'm trying to sift through the syslog now. I did waste a lot of space with csrf_token errors from having browser sessions open across a reboot. This is the only thing I've found that looks like it might be on point, and I can't make heads or tails of it:

Dec 1 12:47:07 Tower root: Fix Common Problems: Error: Machine Check Events detected on your server
Dec 1 12:47:07 Tower root: Hardware event. This is not a software error.
Dec 1 12:47:07 Tower root: MCE 0
Dec 1 12:47:07 Tower root: CPU 0 BANK 0 TSC 23a4c0e61fe6
Dec 1 12:47:07 Tower root: ADDR 1ffff8107baa3
Dec 1 12:47:07 Tower root: TIME 1606800346 Tue Dec 1 00:25:46 2020
Dec 1 12:47:07 Tower root: MCG status:
Dec 1 12:47:07 Tower root: MCi status:
Dec 1 12:47:07 Tower root: Corrected error
Dec 1 12:47:07 Tower root: Error enabled
Dec 1 12:47:07 Tower root: MCi_ADDR register valid
Dec 1 12:47:07 Tower root: MCA: Instruction CACHE Level-0 Instruction-Fetch Error
Dec 1 12:47:07 Tower root: STATUS 9400004000040150 MCGSTATUS 0
Dec 1 12:47:07 Tower root: MCGCAP c0e APICID 0 SOCKETID 0
Dec 1 12:47:07 Tower root: MICROCODE d6
Dec 1 12:47:07 Tower root: CPUID Vendor Intel Family 6 Model 158
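For anyone else trying to read the STATUS value in a log like the one above, here is a minimal sketch that unpacks it by hand. It assumes the bit layout of IA32_MCi_STATUS and the cache-hierarchy error code format described in the Intel SDM; the function and field names are illustrative and are not part of mcelog. The corrected-error count field is only meaningful when the CPU reports CMCI support in MCGCAP, which is assumed here.

```python
# Sketch: decode an IA32_MCi_STATUS value as printed by mcelog ("STATUS ...").
# Bit positions are taken from the Intel SDM machine-check architecture chapter.
# All names below are illustrative (hypothetical), not an existing API.

REQUEST = {0: "ERR", 1: "RD", 2: "WR", 3: "DRD", 4: "DWR",
           5: "IRD", 6: "PREFETCH", 7: "EVICT", 8: "SNOOP"}
TRANSACTION = {0: "instruction", 1: "data", 2: "generic"}
LEVEL = {0: "L0", 1: "L1", 2: "L2", 3: "generic"}


def decode_mci_status(status: int) -> dict:
    """Split a 64-bit MCi_STATUS value into its architectural fields."""
    return {
        "valid": bool((status >> 63) & 1),            # VAL
        "overflow": bool((status >> 62) & 1),         # OVER
        "uncorrected": bool((status >> 61) & 1),      # UC (0 = corrected)
        "enabled": bool((status >> 60) & 1),          # EN
        "misc_valid": bool((status >> 59) & 1),       # MISCV
        "addr_valid": bool((status >> 58) & 1),       # ADDRV
        "context_corrupt": bool((status >> 57) & 1),  # PCC
        "corrected_count": (status >> 38) & 0x7FFF,   # bits 52:38, assumes CMCI support
        "model_specific_code": (status >> 16) & 0xFFFF,
        "mca_error_code": status & 0xFFFF,
    }


def decode_cache_error(code: int):
    """Decode a cache-hierarchy MCA code (000F 0001 RRRR TTLL), else None."""
    if (code & 0xEF00) != 0x0100:  # ignore the F (filtering) bit, bit 12
        return None
    return {
        "request": REQUEST.get((code >> 4) & 0xF, "unknown"),
        "transaction": TRANSACTION.get((code >> 2) & 0x3, "reserved"),
        "level": LEVEL[code & 0x3],
    }


if __name__ == "__main__":
    fields = decode_mci_status(0x9400004000040150)
    print(fields)
    print(decode_cache_error(fields["mca_error_code"]))
```

Run against the value in the post, the flags agree with what mcelog printed: the error is valid, corrected (UC is clear), enabled, and has a valid address, and the low 16 bits decode to an instruction fetch (IRD) on the level-0 instruction cache. That narrows the event to a CPU core's cache but does not say whether the root cause is the CPU itself, voltage, or temperature.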
JorgeB Posted December 2, 2020

Start by running memtest.
Wingold Posted December 2, 2020

Hello randomjohn, at the moment I have the same problem. In this post, the same problem was caused by the Plex container and its maintenance run. Maybe you can disable the Docker container temporarily and test whether there are further crashes. I did this today too.
randomjohn Posted December 5, 2020

On 12/2/2020 at 2:52 AM, JorgeB said: Start by running memtest.

I ran both Memtest86+ 5.01 (thanks for your heads-up to boot in legacy mode) and MemTest 86 v0.4. They each ran for at least four passes and about 20 additional hours and returned no errors. I've been running error-free in Safe Mode for ~24 hours now. I will be rather annoyed (but relieved) if it's the same problem @Wingold refers to. At least I'll know, but a significant part of the hardware purchase and Unraid installation was to up the horsepower for Plex streaming.
randomjohn Posted December 7, 2020

Any of this look interesting or relevant? Still trying to figure out what's going on with the NTP server.

Dec 6 17:56:21 Tower ntpd[30258]: kernel reports TIME_ERROR: 0x2041: Clock Unsynchronized
Dec 6 17:56:21 Tower ntpd[30258]: kernel reports TIME_ERROR: 0x2041: Clock Unsynchronized
Dec 6 18:02:00 Tower ntpd[30258]: kernel reports TIME_ERROR: 0x2041: Clock Unsynchronized
Dec 6 18:13:37 Tower kernel: mce: [Hardware Error]: Machine check events logged
Dec 6 18:19:58 Tower kernel: mce: [Hardware Error]: Machine check events logged
Dec 6 18:25:50 Tower kernel: python3[9144]: segfault at 689b401a4b3f ip 000014ebe5c85646 sp 000014ebd31f7160 error 4 in libpython3.8.so.1.0[14ebe5bc0000+1e7000]
Dec 6 18:25:50 Tower kernel: Code: f8 41 89 f1 42 ff 24 f1 0f 1f 40 00 4c 89 ee 4c 89 3e 4c 8d 6e 08 4d 85 ff 0f 84 62 e8 ff ff 44 8b b3 50 02 00 00 45 85 f6 0f <85> 6c b7 ff ff 44 8b 9b 4c 02 00 00 4c 89 d7 48 2b 3c 24 45 85 db
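On the NTP lines above: a small sketch for decoding the 0x2041 value, assuming (as I understand ntpd's message) that it is the kernel clock status word returned by ntp_adjtime. The flag values are the STA_* constants from the Linux header linux/timex.h; the helper name is made up for illustration.

```python
# Sketch: name the STA_* bits set in a kernel clock status word.
# Flag values come from <linux/timex.h>; decode_timex_status is a
# hypothetical helper, not part of ntpd or any library.

STA_FLAGS = {
    0x0001: "STA_PLL",        # PLL updates enabled
    0x0002: "STA_PPSFREQ",    # PPS frequency discipline enabled
    0x0004: "STA_PPSTIME",    # PPS time discipline enabled
    0x0008: "STA_FLL",        # FLL mode selected
    0x0010: "STA_INS",        # insert leap second
    0x0020: "STA_DEL",        # delete leap second
    0x0040: "STA_UNSYNC",     # clock unsynchronized
    0x0080: "STA_FREQHOLD",   # hold frequency
    0x0100: "STA_PPSSIGNAL",  # PPS signal present
    0x0200: "STA_PPSJITTER",  # PPS signal jitter exceeded
    0x0400: "STA_PPSWANDER",  # PPS signal wander exceeded
    0x0800: "STA_PPSERROR",   # PPS signal calibration error
    0x1000: "STA_CLOCKERR",   # clock hardware fault
    0x2000: "STA_NANO",       # nanosecond resolution
    0x4000: "STA_MODE",       # FLL (1) vs PLL (0) mode
    0x8000: "STA_CLK",        # clock source B (1) vs A (0)
}


def decode_timex_status(status: int) -> list:
    """Return the names of all STA_* flags set in the status word."""
    return [name for bit, name in STA_FLAGS.items() if status & bit]


if __name__ == "__main__":
    print(decode_timex_status(0x2041))
```

Under that assumption, 0x2041 breaks down into STA_PLL, STA_UNSYNC, and STA_NANO. The only notable bit is STA_UNSYNC, which matches the "Clock Unsynchronized" text; STA_CLOCKERR (the hardware-fault bit) is not set, so these lines on their own do not point at the same fault as the mce entries.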
JorgeB Posted December 7, 2020

8 hours ago, randomjohn said: Any of this look interesting or relevant?

They point to a hardware problem, but don't identify the component.
randomjohn Posted December 7, 2020

That's what I was afraid of. Any recommendations for good diagnostics? Weird that it would start after two months. Usually it's right away or after a long time. I did add and move fans, which significantly dropped internal temps, but maybe that was too late. I've attached the most recent diagnostics, but I don't really know what I'm looking for and nothing leapt out at me. Thanks for all your help.

tower-diagnostics-20201206-2253.zip

Edited December 7, 2020 by randomjohn
JorgeB Posted December 7, 2020

Anything can go bad at any time. If the diags don't point to the culprit, you'd basically need to start swapping components around to find it.
randomjohn Posted December 11, 2020

I guess we can consider this one closed but not solved. While it ran properly before, a BIOS update came out recently for my motherboard, so I installed that. Then I upgraded from 6.8 stable to an Unraid test build (and now RC-1). In reaction to some temp warnings I was getting, I significantly increased the overall system cooling, to where the NVMe drive is now about 10C cooler at idle. Mobo and CPU are about 5C cooler (~29C at idle). Finally, I spent a lot of time tracking down NTP errors and switched out some of the default Google servers for the NTP pool for my region. I really hope I'm not jinxing anything, but I was at 4 days of error-free uptime when I posted this. Hopefully something in here helps the next poor soul.