Instability on Ryzen Based System


I'm back again with the same problem that's been plaguing me for months, but now with more data.

 

To state the issue plainly, my server can't go more than about 10 hours without crashing. It was more like 2 hours a few weeks ago, but I've been able to stretch it out a bit longer now. Regardless, this makes it completely unusable.

 

Things I've tried:

  1. Two different motherboards (MSI B450 Tomahawk MAX and ASRock B550 Pro4), with the newest BIOS on one and the second newest on the other (for some reason I can't get Unraid to boot with the newest MSI BIOS).
  2. Three different sets of RAM, one of them being Multi-bit ECC RAM.
  3. Setting the "Typical Idle Power" setting in BIOS.
  4. Disabling C-States in BIOS.
  5. Adding the line "/usr/local/sbin/zenstates --c6-disable" to the "go" file in the config folder.
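
For anyone wanting to repeat step 5 without editing the file by hand, here is a minimal sketch that appends the line only if it is not already there. The helper name `ensure_line` is made up for illustration, and the usual Unraid location `/boot/config/go` is an assumption; check where your own "go" file lives before running it.

```python
from pathlib import Path

# The exact line quoted in step 5 above.
C6_LINE = "/usr/local/sbin/zenstates --c6-disable"


def ensure_line(go_file: Path, line: str = C6_LINE) -> bool:
    """Append `line` to `go_file` unless it is already present.

    Returns True if the file was changed, False if the line was already there.
    (Hypothetical helper, not part of Unraid.)
    """
    text = go_file.read_text() if go_file.exists() else ""
    existing = [entry.strip() for entry in text.splitlines()]
    if line in existing:
        return False
    # Make sure the new entry starts on its own line.
    if text and not text.endswith("\n"):
        text += "\n"
    go_file.write_text(text + line + "\n")
    return True
```

Calling `ensure_line(Path("/boot/config/go"))` twice leaves a single copy of the line, so it is safe to re-run. The change only takes effect on the next boot, when the "go" file is executed.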

 

The one thing that had worked, and that falsely led me to believe I had solved the problem last time, was completely swapping the system out for an ancient Intel-based setup (old board/CPU/RAM I had lying around). After having no luck with the first motherboard (MSI) and trying everything I could think of, I figured the board was the issue, and all I had on hand was an old Intel board and a CPU to go along with it. I threw it in and it worked. So I concluded the board was the problem and RMA'd it. I ran the server for a week on the Intel system with no issues. MSI didn't actually check the board out and just sent me a refurbished board in a beat-up box. Rebuilding the system with the replacement board brought me back to square one. I got my hands on another AM4 board and stability is arguably a bit better, but what has really made the difference so far is items 4 and 5 on the above list, which have increased the time before a crash from somewhere between 15 minutes and 2 hours to about 9 or 10 hours. However, it's still crashing, and seeing as a parity check takes me about 26 hours, I can't even get through that.

 

At this point I have a couple of questions.

  1. Any idea what might be causing this?
  2. Is my CPU the problem (Ryzen 3600)? If it's possible, is there any way I can go about testing it?
  3. Recommendations for RAM and motherboards that might solve my issue? (official compatibility page seems many years out of date)

 

I wasn't able to log the most recent crash with the ECC RAM, but I have the previous crash which was just with standard DDR4. Logs and Diagnostics attached.

 

Thanks for any advice, help, or for just generally taking the time to read this.

tower-diagnostics-20220703-1622.zip tower-syslog-20220703-2021.zip

I'm running a Ryzen Threadripper 3970X with 128 GB of unbuffered ECC memory in an ASRock Creator TRX40 board, latest beta BIOS, with no stability issues. This could be any of several issues.

1. Are you updated to the latest BIOS version?

2. Do you have fTPM disabled or enabled? If enabled, you'll want the latest BIOS update that fixes an fTPM stuttering issue. https://www.amd.com/en/support/kb/faq/pa-410

3. What speed are you running your unbuffered ECC memory at? Don't expect more than 2933 MHz for ECC memory on Ryzen 3000-series or older. Some modules only work at 2666 MHz.

4. If your memory speeds aren't the problem, check your memory timings. There can be multiple JEDEC timing profiles, or none at all, in which case you have to enter the timings manually to spec.

5. In the BIOS, have you disabled all the power-saving nonsense such as suspend to RAM, aggressive ASPM, ALPM, etc.? (I've found the aggressive power management implementation on my old Supermicro server board was a problem for my HDDs.)

6. If you've done all the above, is your motherboard auto-overclocking the CPU or RAM? If so, disable auto-overclocking.

 

 

As for specifics, I need to know the exact hardware in the build, including the memory being used, what clock speeds and timings it's rated for, and what you have configured.

Your logs here show normal G.Skill memory (non-ECC), and it's running at the wrong speed and voltage (F4-3600C16-8GVKC running at 2133 MHz and 1.2 V). I also hope you're using UDIMM and not RDIMM ECC, as RDIMM shouldn't work at all.

Getting SMBIOS data from sysfs.
SMBIOS 3.3.0 present.

Handle 0x0018, DMI type 17, 92 bytes
Memory Device
    Array Handle: 0x0010
    Error Information Handle: 0x0017
    Total Width: Unknown
    Data Width: Unknown
    Size: No Module Installed
    Form Factor: Unknown
    Set: None
    Locator: DIMM 0
    Bank Locator: P0 CHANNEL A
    Type: Unknown
    Type Detail: Unknown
    Speed: Unknown
    Manufacturer: Unknown
    Serial Number: Unknown
    Asset Tag: Not Specified
    Part Number: Unknown
    Rank: Unknown
    Configured Memory Speed: Unknown
    Minimum Voltage: Unknown
    Maximum Voltage: Unknown
    Configured Voltage: Unknown
    Memory Technology: Unknown
    Memory Operating Mode Capability: Unknown
    Firmware Version: Unknown
    Module Manufacturer ID: Unknown
    Module Product ID: Unknown
    Memory Subsystem Controller Manufacturer ID: Unknown
    Memory Subsystem Controller Product ID: Unknown
    Non-Volatile Size: None
    Volatile Size: None
    Cache Size: None
    Logical Size: None

Handle 0x001A, DMI type 17, 92 bytes
Memory Device
    Array Handle: 0x0010
    Error Information Handle: 0x0019
    Total Width: 64 bits
    Data Width: 64 bits
    Size: 8 GB
    Form Factor: DIMM
    Set: None
    Locator: DIMM 1
    Bank Locator: P0 CHANNEL A
    Type: DDR4
    Type Detail: Synchronous Unbuffered (Unregistered)
    Speed: 2133 MT/s
    Manufacturer: Unknown
    Serial Number: 00000000
    Asset Tag: Not Specified
    Part Number: F4-3600C16-8GVKC
    Rank: 1
    Configured Memory Speed: 2133 MT/s
    Minimum Voltage: 1.2 V
    Maximum Voltage: 1.2 V
    Configured Voltage: 1.2 V
    Memory Technology: DRAM
    Memory Operating Mode Capability: Volatile memory
    Firmware Version: Unknown
    Module Manufacturer ID: Bank 5, Hex 0xCD
    Module Product ID: Unknown
    Memory Subsystem Controller Manufacturer ID: Unknown
    Memory Subsystem Controller Product ID: Unknown
    Non-Volatile Size: None
    Volatile Size: 8 GB
    Cache Size: None
    Logical Size: None

Handle 0x001D, DMI type 17, 92 bytes
Memory Device
    Array Handle: 0x0010
    Error Information Handle: 0x001C
    Total Width: Unknown
    Data Width: Unknown
    Size: No Module Installed
    Form Factor: Unknown
    Set: None
    Locator: DIMM 0
    Bank Locator: P0 CHANNEL B
    Type: Unknown
    Type Detail: Unknown
    Speed: Unknown
    Manufacturer: Unknown
    Serial Number: Unknown
    Asset Tag: Not Specified
    Part Number: Unknown
    Rank: Unknown
    Configured Memory Speed: Unknown
    Minimum Voltage: Unknown
    Maximum Voltage: Unknown
    Configured Voltage: Unknown
    Memory Technology: Unknown
    Memory Operating Mode Capability: Unknown
    Firmware Version: Unknown
    Module Manufacturer ID: Unknown
    Module Product ID: Unknown
    Memory Subsystem Controller Manufacturer ID: Unknown
    Memory Subsystem Controller Product ID: Unknown
    Non-Volatile Size: None
    Volatile Size: None
    Cache Size: None
    Logical Size: None

Handle 0x001F, DMI type 17, 92 bytes
Memory Device
    Array Handle: 0x0010
    Error Information Handle: 0x001E
    Total Width: 64 bits
    Data Width: 64 bits
    Size: 8 GB
    Form Factor: DIMM
    Set: None
    Locator: DIMM 1
    Bank Locator: P0 CHANNEL B
    Type: DDR4
    Type Detail: Synchronous Unbuffered (Unregistered)
    Speed: 2133 MT/s
    Manufacturer: Unknown
    Serial Number: 00000000
    Asset Tag: Not Specified
    Part Number: F4-3600C16-8GVKC
    Rank: 1
    Configured Memory Speed: 2133 MT/s
    Minimum Voltage: 1.2 V
    Maximum Voltage: 1.2 V
    Configured Voltage: 1.2 V
    Memory Technology: DRAM
    Memory Operating Mode Capability: Volatile memory
    Firmware Version: Unknown
    Module Manufacturer ID: Bank 5, Hex 0xCD
    Module Product ID: Unknown
    Memory Subsystem Controller Manufacturer ID: Unknown
    Memory Subsystem Controller Product ID: Unknown
    Non-Volatile Size: None
    Volatile Size: 8 GB
    Cache Size: None
    Logical Size: None
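
If you want to run the same check against your own diagnostics, here is a rough sketch that parses memory-device text in the format shown above and flags modules configured below the speed embedded in their part number. The function names are made up for illustration, and reading the rated speed out of the part number (e.g. `3600` in `F4-3600C16-8GVKC`) is an assumption that only fits G.Skill-style naming.

```python
import re
from typing import Dict, List, Optional, Tuple


def parse_memory_devices(dmi_text: str) -> List[Dict[str, str]]:
    """Split DMI type 17 text into one dict of "Key: Value" fields per device."""
    devices: List[Dict[str, str]] = []
    current: Optional[Dict[str, str]] = None
    for raw in dmi_text.splitlines():
        line = raw.strip()
        if line == "Memory Device":
            current = {}
            devices.append(current)
        elif not line or line.startswith("Handle "):
            current = None  # a blank line or new handle ends the record
        elif current is not None and ":" in line:
            key, _, value = line.partition(":")
            current[key.strip()] = value.strip()
    return devices


def rated_speed_from_part_number(part: str) -> Optional[int]:
    """Assumption: G.Skill DDR4 part numbers look like F4-<speed>C<latency>-..."""
    match = re.match(r"F4-(\d{4})C\d+", part)
    return int(match.group(1)) if match else None


def underclocked_modules(dmi_text: str) -> List[Tuple[str, str, int, int]]:
    """Return (bank, part number, configured MT/s, rated MT/s) for slow modules."""
    findings = []
    for dev in parse_memory_devices(dmi_text):
        if dev.get("Size") == "No Module Installed":
            continue  # empty slot
        configured = re.match(r"(\d+) MT/s", dev.get("Configured Memory Speed", ""))
        rated = rated_speed_from_part_number(dev.get("Part Number", ""))
        if configured and rated and int(configured.group(1)) < rated:
            findings.append((dev.get("Bank Locator", "?"), dev.get("Part Number", "?"),
                             int(configured.group(1)), rated))
    return findings
```

Fed the output above, this would report both populated slots (P0 CHANNEL A and P0 CHANNEL B) as 2133 MT/s against a rated 3600. Running below the rated speed is not a fault in itself; it just tells you which profile the board actually applied.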

 

Edited by Darksurf
Thanks for the reply.

 

I'm running the newest BIOS. I've tried with fTPM both disabled and enabled. The RAM is running at its default 2666 MHz.

 

Ultimately, I now think the CPU is the issue. I managed to get it stable enough to run Prime95 for about 10 hours, and it looks like core 4 has some serious issues: it's always core 4 that fails. I only had one day left on my warranty, so I decided to RMA it. Let's hope that fixes the problems.
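
For anyone who wants a quick per-core sanity check before reaching for Prime95, here is a rough sketch of the idea: pin the process to one core at a time, run the same deterministic workload, and see whether any core disagrees with the majority. This is not how Prime95 works and it stresses the CPU far less, so a clean result proves very little. The function names and the SHA-256 workload are my own assumptions, and `os.sched_setaffinity` is Linux-only.

```python
import hashlib
import os
from collections import Counter
from typing import List


def workload(rounds: int) -> str:
    """Deterministic CPU-bound work: repeatedly hash a fixed seed."""
    digest = b"seed"
    for _ in range(rounds):
        digest = hashlib.sha256(digest).digest()
    return digest.hex()


def find_inconsistent_cores(rounds: int = 200_000) -> List[int]:
    """Run the workload on each allowed core; return cores that disagree.

    The most common result is treated as the reference, so a single bad
    core stands out even if it happens to be the first one tested.
    """
    original = os.sched_getaffinity(0)  # cores this process may use
    results = {}
    try:
        for core in sorted(original):
            os.sched_setaffinity(0, {core})  # pin to one core
            results[core] = workload(rounds)
    finally:
        os.sched_setaffinity(0, original)  # always restore affinity
    reference, _count = Counter(results.values()).most_common(1)[0]
    return [core for core, digest in results.items() if digest != reference]
```

An empty list from `find_inconsistent_cores()` means every core produced the same digest. A core that shows up repeatedly across runs would be worth testing properly with a real stress tool.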

I have a Ryzen 5 3600 with an MSI X570-A PRO (MS-7C37), Version 3.0 motherboard and 32 GB of Corsair Vengeance LPX CMK16GX4M2B3200C16 RAM.

Been running for 2+ years and never a crash.


