D'n'S137 Posted December 4, 2020 (edited)

Hello everyone,

I'm running Unraid 6.8.3 on an AMD Ryzen 7 1700 with 4x 2TB hard drives (1 parity) and 2 SSDs for cache: one is 128GB and the other is 256GB. For memory I have 32GB of DDR3 2144MHz non-ECC RAM. The mainboard is a Gigabyte B450 AORUS ELITE (works like a charm). For GPUs I have an old EVGA GTX 980TI Hybrid and an MSI GTX 710.

Plugins:
Nerdpack
ZFS (not in use)
NVIDIA UNRAID (not in use)
DenyHosts
userScripts
upnp-monitor
fix-common-problems
parity-check-tuning
gui-links (not in use)
dynamixsystemtemp
dynamixCacheDirs
dynamixActiveStreams
community applications
disableSecurityAMD
turboCache
moverTuning
dynamixSSDtrim

I have had some problems over the past few days and have tried many different things. Now I am at a point where I have to ask you for help.

My system had been running great for 85 days. Some weeks ago it started to randomly crash and become unresponsive: no WebGUI, no CLI, no reaction to the keyboard, and no reaction when pressing the Boot/Start button on the chassis.

I did my research and found out that AMD processors have to be set to "typical current idle" in the BIOS. I changed it and the system ran for some days. Since that change, the server randomly reboots instead. So I changed the CPU frequency tuning setting from "Auto" to a normal fixed value. It then worked for some days, and now it randomly reboots again.

I have no other ideas and it is getting a little annoying, so I hope someone can help me out at this point. Please see my diagnostics attached. In the syslog file you will notice there is a hardware error with the CPU, but I am not quite sure what this means in detail.

Greetings and stay healthy.

mps-diagnostics-20200524-1440.zip

Edited December 10, 2020 by D'n'S137: Some Corrections
D'n'S137 (Author) Posted December 9, 2020

Dec 4 22:22:52 MPS kernel: mce: [Hardware Error]: Machine check events logged
Dec 4 22:22:52 MPS kernel: mce: [Hardware Error]: CPU 1: Machine Check: 0 Bank 5: bea0000000000108
Dec 4 22:22:52 MPS kernel: mce: [Hardware Error]: TSC 0 ADDR 1ffff8114fef2 MISC d012000100000000 SYND 4d000000 IPID 500b000000000
Dec 4 22:22:52 MPS kernel: mce: [Hardware Error]: PROCESSOR 2:800f11 TIME 1607116951 SOCKET 0 APIC 2 microcode 8001138
Dec 4 22:22:52 MPS kernel: #2 #3 #4 #5 #6 #7 #8 #9 #10 #11 #12 #13 #14 #15
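For anyone wondering what the status value in the "Bank 5" line says, here is a minimal Python sketch that pulls the CPU, bank, and status word out of such a line and decodes only the top flag bits (63 to 57), which are part of the generic x86 machine-check status layout. The helper names `parse_mce_line` and `decode_status` are made up for this example, and the lower bits (the model-specific error code) are deliberately not interpreted.

```python
import re

# Generic x86 machine-check status flags, bits 63..57 of the status word.
# The lower bits are model-specific and are NOT decoded in this sketch.
STATUS_FLAGS = {
    63: "VAL",    # the register holds a valid error record
    62: "OVER",   # an earlier error record was overwritten
    61: "UC",     # the error was not corrected by hardware
    60: "EN",     # reporting of this error was enabled
    59: "MISCV",  # the MISC register holds valid data
    58: "ADDRV",  # the ADDR register holds valid data
    57: "PCC",    # processor context may be corrupt
}

# Matches the kernel line format seen in the syslog excerpt above.
MCE_LINE = re.compile(
    r"CPU (\d+): Machine Check: \d+ Bank (\d+): ([0-9a-fA-F]+)"
)


def decode_status(status):
    """Return the names of the flag bits set in a status word."""
    return [name for bit, name in STATUS_FLAGS.items() if (status >> bit) & 1]


def parse_mce_line(line):
    """Extract CPU, bank, and decoded flags from one mce syslog line."""
    match = MCE_LINE.search(line)
    if match is None:
        return None  # not a "Machine Check ... Bank ..." line
    cpu, bank, status_hex = match.groups()
    status = int(status_hex, 16)
    return {
        "cpu": int(cpu),
        "bank": int(bank),
        "status": status,
        "flags": decode_status(status),
    }


if __name__ == "__main__":
    line = ("Dec 4 22:22:52 MPS kernel: mce: [Hardware Error]: "
            "CPU 1: Machine Check: 0 Bank 5: bea0000000000108")
    print(parse_mce_line(line))
```

For the line from the syslog this reports the UC and PCC flags as set, which means the CPU logged an uncorrected error with possibly corrupted context. That would point to a hardware-level fault rather than an Unraid software problem, but it does not say which component is at fault.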
D'n'S137 (Author) Posted December 10, 2020

I backed up my data from the system (in Safe Mode with GUI) to another drive. To do so I had to take one of the cache drives out of my system, because otherwise I ran into a license issue.

I noticed that the system ran fine during the whole copying process. So I gave it a try, put the cache drive back in, and rebooted the system. After about an hour and a half the system crashed again.

I removed the cache drive again, rebooted into Safe Mode with GUI, started the array, and let the system run. After 17 hours the system crashed again, right next to me. The screen goes black, it beeps twice, and then it reboots without warning or any sign of a shutdown script.

Maybe it is a hardware error or a stability issue from the mainboard config.

I will do some further testing:
Run some memtests
Play with CPU voltage and clock in an attempt to stabilise it
Play with memory clocks (maybe slowing them down?)

Open questions for me:
Is it even possible that this is a GPU error?
If yes, how do I test the GPU?
Is there a benchmark software/OS to use for hardware tests?

Maybe someone can help me with one of those questions.

Kind regards
Kevek79 Posted December 10, 2020

I might not be of great help here, but I will tell you what I would do in your situation.

First of all, some questions:
What power supply do you use in the system? Are you sure it works correctly and that the power it delivers is sufficient?
Is the RAM you are using a set of 4 sticks, or did you have 2 sticks from one brand and add another pair from another manufacturer? In your log files I can see the manufacturer of the Channel B RAM but nothing for the Channel A RAM, which made me curious.

I am not a guru in these matters, but I would try a couple of things:
1. Remove any overclocking set in the BIOS, including XMP profiles for RAM, and test whether the error persists.
2. Run Memtest for at least 24h. Does it run through without errors?
3. Swap the power supply with a replacement, if one is available, and test whether the error persists.