Lime1028 Posted July 5, 2022

I'm back again with the same problem that's been plaguing me for months, but now with more data. To state the issue plainly, my server can't go more than about 10 hours without crashing. It was more like 2 hours a few weeks ago, but I've been able to stretch it out a bit longer now. Regardless, this makes it completely unusable.

Things I've tried:

1. Two different motherboards (MSI B450 Tomahawk MAX and ASRock B550 Pro4), with the newest BIOS on one and the second newest on the other (for some reason I can't get Unraid to boot with the newest MSI BIOS).
2. Three different sets of RAM, one of them being multi-bit ECC RAM.
3. Setting the "Typical Idle Power" setting in BIOS.
4. Disabling C-States in BIOS.
5. Adding the line "/usr/local/sbin/zenstates --c6-disable" to the "go" file in the config folder.

The one thing that worked, and falsely led me to believe I had solved the problem last time, was completely swapping the system out for an ancient Intel-based setup (old board/CPU/RAM I had lying around). After having no luck with the first motherboard (MSI) and trying everything I could think of, I figured the board was the issue, and all I had on hand was an old Intel board and a CPU to go along with it. I threw it in and it worked, so I concluded the board was the problem and RMA'd it. I ran the server for a week on the Intel system with no issues. MSI didn't actually check the board out and just sent me a refurbished board in a beat-up box. Rebuilding the system with the replacement board brought me back to square one.

I got my hands on another AM4 board and stability is arguably a bit better, but what has really made the difference so far is 4 and 5 on the above list, which increased the time before a crash from somewhere between 15 minutes and 2 hours up to about 9 or 10 hours. However, it's still crashing, and seeing as a parity check takes me about 26 hours, I can't even get through that.

At this point I have a couple of questions:

1. Any idea what might be causing this?
2. Is my CPU the problem (Ryzen 3600)? If so, is there any way I can go about testing it?
3. Any recommendations for RAM and motherboards that might solve my issue? (The official compatibility page seems many years out of date.)

I wasn't able to log the most recent crash with the ECC RAM, but I have the previous crash, which was with standard DDR4. Logs and diagnostics attached.

Thanks for any advice, help, or just for taking the time to read this.

tower-diagnostics-20220703-1622.zip
tower-syslog-20220703-2021.zip
Darksurf Posted July 6, 2022 (edited)

I'm running a Ryzen Threadripper 3970X with 128 GB of UECC memory in an ASRock Creator TRX40 board on the latest beta BIOS, with no stability issues. This could be any of several issues.

1. Are you updated to the latest BIOS version?
2. Do you have fTPM disabled or enabled? If enabled, you'll want the latest BIOS update that fixes an fTPM stuttering issue. https://www.amd.com/en/support/kb/faq/pa-410
3. What speed are you running your unbuffered ECC memory at? Don't expect more than 2933 MHz for ECC memory on Ryzen 3XXX or lower. Some modules only work at 2666 MHz.
4. If your memory speeds aren't the problem, check your memory timings. There can be multiple JEDEC settings for timings, or none at all, in which case you have to enter them manually to spec.
5. In the BIOS, have you disabled all the power-saving nonsense such as suspend to RAM, aggressive ASPM, ALPM, etc.? (I found that the aggressive power management implementation on my old Supermicro server board was a problem for my HDDs.)
6. If you've done all of the above, is your motherboard auto-overclocking the CPU or RAM? Disable auto-overclocking.

As for specifics, I need to know the exact hardware in the build, including the memory being used, what clock speeds and timings it's rated for, and what you have configured. Your logs here show normal G.Skill memory (non-ECC), and it's running at the wrong speed and voltage (F4-3600C16-8GVKC running at 2133 MHz and 1.2 V). I also hope you're using UDIMM and not RDIMM ECC, as RDIMM shouldn't work at all.

Getting SMBIOS data from sysfs.
SMBIOS 3.3.0 present.
Handle 0x0018, DMI type 17, 92 bytes
Memory Device
    Array Handle: 0x0010
    Error Information Handle: 0x0017
    Total Width: Unknown
    Data Width: Unknown
    Size: No Module Installed
    Form Factor: Unknown
    Set: None
    Locator: DIMM 0
    Bank Locator: P0 CHANNEL A
    Type: Unknown
    Type Detail: Unknown
    Speed: Unknown
    Manufacturer: Unknown
    Serial Number: Unknown
    Asset Tag: Not Specified
    Part Number: Unknown
    Rank: Unknown
    Configured Memory Speed: Unknown
    Minimum Voltage: Unknown
    Maximum Voltage: Unknown
    Configured Voltage: Unknown
    Memory Technology: Unknown
    Memory Operating Mode Capability: Unknown
    Firmware Version: Unknown
    Module Manufacturer ID: Unknown
    Module Product ID: Unknown
    Memory Subsystem Controller Manufacturer ID: Unknown
    Memory Subsystem Controller Product ID: Unknown
    Non-Volatile Size: None
    Volatile Size: None
    Cache Size: None
    Logical Size: None

Handle 0x001A, DMI type 17, 92 bytes
Memory Device
    Array Handle: 0x0010
    Error Information Handle: 0x0019
    Total Width: 64 bits
    Data Width: 64 bits
    Size: 8 GB
    Form Factor: DIMM
    Set: None
    Locator: DIMM 1
    Bank Locator: P0 CHANNEL A
    Type: DDR4
    Type Detail: Synchronous Unbuffered (Unregistered)
    Speed: 2133 MT/s
    Manufacturer: Unknown
    Serial Number: 00000000
    Asset Tag: Not Specified
    Part Number: F4-3600C16-8GVKC
    Rank: 1
    Configured Memory Speed: 2133 MT/s
    Minimum Voltage: 1.2 V
    Maximum Voltage: 1.2 V
    Configured Voltage: 1.2 V
    Memory Technology: DRAM
    Memory Operating Mode Capability: Volatile memory
    Firmware Version: Unknown
    Module Manufacturer ID: Bank 5, Hex 0xCD
    Module Product ID: Unknown
    Memory Subsystem Controller Manufacturer ID: Unknown
    Memory Subsystem Controller Product ID: Unknown
    Non-Volatile Size: None
    Volatile Size: 8 GB
    Cache Size: None
    Logical Size: None

Handle 0x001D, DMI type 17, 92 bytes
Memory Device
    Array Handle: 0x0010
    Error Information Handle: 0x001C
    Total Width: Unknown
    Data Width: Unknown
    Size: No Module Installed
    Form Factor: Unknown
    Set: None
    Locator: DIMM 0
    Bank Locator: P0 CHANNEL B
    Type: Unknown
    Type Detail: Unknown
    Speed: Unknown
    Manufacturer: Unknown
    Serial Number: Unknown
    Asset Tag: Not Specified
    Part Number: Unknown
    Rank: Unknown
    Configured Memory Speed: Unknown
    Minimum Voltage: Unknown
    Maximum Voltage: Unknown
    Configured Voltage: Unknown
    Memory Technology: Unknown
    Memory Operating Mode Capability: Unknown
    Firmware Version: Unknown
    Module Manufacturer ID: Unknown
    Module Product ID: Unknown
    Memory Subsystem Controller Manufacturer ID: Unknown
    Memory Subsystem Controller Product ID: Unknown
    Non-Volatile Size: None
    Volatile Size: None
    Cache Size: None
    Logical Size: None

Handle 0x001F, DMI type 17, 92 bytes
Memory Device
    Array Handle: 0x0010
    Error Information Handle: 0x001E
    Total Width: 64 bits
    Data Width: 64 bits
    Size: 8 GB
    Form Factor: DIMM
    Set: None
    Locator: DIMM 1
    Bank Locator: P0 CHANNEL B
    Type: DDR4
    Type Detail: Synchronous Unbuffered (Unregistered)
    Speed: 2133 MT/s
    Manufacturer: Unknown
    Serial Number: 00000000
    Asset Tag: Not Specified
    Part Number: F4-3600C16-8GVKC
    Rank: 1
    Configured Memory Speed: 2133 MT/s
    Minimum Voltage: 1.2 V
    Maximum Voltage: 1.2 V
    Configured Voltage: 1.2 V
    Memory Technology: DRAM
    Memory Operating Mode Capability: Volatile memory
    Firmware Version: Unknown
    Module Manufacturer ID: Bank 5, Hex 0xCD
    Module Product ID: Unknown
    Memory Subsystem Controller Manufacturer ID: Unknown
    Memory Subsystem Controller Product ID: Unknown
    Non-Volatile Size: None
    Volatile Size: 8 GB
    Cache Size: None
    Logical Size: None

Edited July 6, 2022 by Darksurf
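For anyone who wants to run the same check on their own diagnostics, here is a rough Python sketch of the comparison made above: it reads "Memory Device" (DMI type 17) records and flags modules whose configured speed is below the speed in their part number. It assumes the text has one field per line, the way dmidecode normally prints it. The helper names are made up for this example, and reading the rated speed out of the part number is an assumption that only fits G.Skill-style names like F4-3600C16-8GVKC. A module running at its JEDEC default rather than its XMP profile will be flagged too, so treat a hit as something to look at, not proof of a fault.

```python
import re

# Trimmed sample in the usual one-field-per-line dmidecode layout,
# using values from the dump above.
SAMPLE = """\
Handle 0x0018, DMI type 17, 92 bytes
Memory Device
    Size: No Module Installed
    Locator: DIMM 0
    Bank Locator: P0 CHANNEL A
    Part Number: Unknown
    Configured Memory Speed: Unknown
Handle 0x001A, DMI type 17, 92 bytes
Memory Device
    Size: 8 GB
    Locator: DIMM 1
    Bank Locator: P0 CHANNEL A
    Part Number: F4-3600C16-8GVKC
    Configured Memory Speed: 2133 MT/s
"""


def parse_memory_devices(text):
    """Return one dict of field -> value per 'Handle ...' record."""
    devices = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("Handle "):
            current = {}
            devices.append(current)
        elif current is not None and ": " in line:
            key, value = line.split(": ", 1)
            current[key] = value
    return devices


def rated_speed_from_part(part_number):
    """Guess the rated speed from the part number.

    Assumption: the four digits after 'F4-' are the rated speed,
    which fits G.Skill DDR4 naming but not other vendors.
    """
    match = re.match(r"F4-(\d{4})C\d+", part_number)
    return int(match.group(1)) if match else None


def speed_mismatches(devices):
    """List (bank, slot, configured, rated) for modules below rated speed."""
    found = []
    for dev in devices:
        match = re.match(r"(\d+) MT/s", dev.get("Configured Memory Speed", ""))
        if not match:
            continue  # empty slot or unknown speed
        configured = int(match.group(1))
        rated = rated_speed_from_part(dev.get("Part Number", ""))
        if rated is not None and configured < rated:
            found.append((dev.get("Bank Locator"), dev.get("Locator"), configured, rated))
    return found


if __name__ == "__main__":
    for bank, slot, configured, rated in speed_mismatches(parse_memory_devices(SAMPLE)):
        print(f"{bank} {slot}: configured {configured} MT/s, part number suggests {rated}")
```

To use it on a real system, replace SAMPLE with the memory section of your own dmidecode output. Empty slots are skipped because their configured speed is "Unknown".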
Lime1028 Posted July 7, 2022 (Author)

Thanks for the reply. I'm running the newest BIOS. I've tried with fTPM both disabled and enabled. RAM is running at its default 2666 MHz.

Ultimately, I now think the CPU is the issue. I managed to get the system stable enough to run prime95 for about 10 hours, and it looks like core 4 has some serious issues: every time, it's core 4 that fails. I only had one day left on my warranty, so I decided to RMA it. Let's hope that fixes the problems.
klipp01 Posted July 7, 2022

I have a Ryzen 5 3600 with an MSI X570-A PRO (MS-7C37), Version 3.0 motherboard and 32 GB of Corsair Vengeance LPX CMK16GX4M2B3200C16 RAM. It has been running for 2+ years and never crashed.

Sent from my SM-G998U using Tapatalk