Various issues after hardware changes

dlchamp · January 14, 2020

My unRAID server was originally running dual Xeon E5-2660s, but after fighting with its random freezes and crashes over a 2-3 month period, I decided enough was enough and wanted to go with something a little newer and more efficient.

I got a great deal on a BNIB Ryzen 7 1800X + MSI X370 Pro Carbon mobo. I did undervolt this chip. I also ran Prime95 and RealBench in Windows for a few days, then P95 again in a separate unRAID install for 2 days straight to make 100% certain this configuration was stable. And it was. I moved it into production and it ran for a bit over a week. I decided that, since I now had the option, I wanted to free up a SATA slot for another drive and move my cache to a large NVMe drive. This change was also meant to let me send Sab and QBitt downloads to the cache, freeing up array IO during those download and unpacking sessions. This seemed to work fine for another couple of days, but then I restarted my Plex container and it wouldn't start back up. I looked at the logs and it was spamming "Starting Plex Media server". Then I looked at my unRAID logs and they were spamming something about issues with my NVMe drive. I should have saved it, but I didn't have my logs mirrored or written to another location, so I rebooted and lost them.

I did some searching and found that the issue potentially came from downloading torrents to a BTRFS-formatted drive. I'm not familiar with BTRFS or why it's really a problem, but I backed up what I could, then formatted the cache drive to XFS and restored everything. This issue did force me to remove all Plex appdata and completely reinstall that container, because many of the .db files were corrupted and I would still get the "Starting Plex Media Server" spam until I did so. Fine, no big deal. But after this happened, I started having my log file written to my appdata folder so that it would persist through a reboot if something else started happening. Which it did. A few days later, I noticed my log was being spammed with:

Jan 12 06:45:02 Anton kernel: nvme 0000:01:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0000 address=0x00000000da1b0000 flags=0x0000]

Jan 12 06:45:34 Anton kernel: AMD-Vi: Event logged [IO_PAGE_FAULT device=01:00.0 domain=0x0000 address=0x00000000fad4e000 flags=0x0000]

Jan 12 06:45:43 Anton kernel: amd_iommu_report_page_fault: 194 callbacks suppressed

There were multiple of each, and they alternated in the log for about 2 minutes straight.
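
To see how long such a burst lasts and which device it involves, the mirrored syslog can be tallied per device and per minute. This is only a minimal sketch: the function name `count_page_faults` is made up for illustration, it assumes the two line formats quoted above are the only ones that matter, and it assumes Python 3 is available wherever the log is analyzed (a stock unRAID install may not include it).

```python
import re
from collections import Counter

# PCI address such as "01:00.0", optionally preceded by a domain ("0000:").
# Both log formats quoted above contain it, in different places.
PCI_ADDR_RE = re.compile(r"(?:[0-9a-f]{4}:)?([0-9a-f]{2}:[0-9a-f]{2}\.[0-7])")


def count_page_faults(lines):
    """Count IO_PAGE_FAULT events per (device, minute).

    Assumes classic syslog timestamps like "Jan 12 06:45:02", so the
    first 12 characters identify the minute.
    """
    counts = Counter()
    for line in lines:
        if "IO_PAGE_FAULT" not in line:
            continue  # skips e.g. the "callbacks suppressed" lines
        match = PCI_ADDR_RE.search(line)
        device = match.group(1) if match else "unknown"
        minute = line[:12]
        counts[(device, minute)] += 1
    return counts
```

Calling it on an open log file, for example `count_page_faults(open("syslog-192.168.99.253.log", errors="replace"))`, returns a `Counter` whose keys show whether every fault comes from the same device. Note that the kernel suppresses repeated reports, so the counts are a lower bound.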

As you can see this happened on the 12th and there wasn't a crash or any noticeable issues. Yesterday, the 13th, the server crashed. Nothing appears to have been written to the log about said crash, but on screen it said something about "Kernel panic - Shutting down CPUs with NMI"

This crash happened 3 times over the next few hours, but now it's been up for the last 12 hours.

I read something about IOMMU issues with Ryzen and NVMe drives, so I disabled it in the BIOS, but unRAID still shows it as enabled in the WebGUI.
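
One way to cross-check what the WebGUI reports is to ask the kernel directly. On Linux, an active IOMMU normally populates `/sys/kernel/iommu_groups` with numbered directories, so counting them gives a rough answer. This is a sketch under that assumption; the helper name `iommu_group_count` is hypothetical, and Python 3 is assumed to be installed.

```python
from pathlib import Path


def iommu_group_count(root="/sys/kernel/iommu_groups"):
    """Return the number of IOMMU groups the kernel exposes.

    Zero (or a missing directory) suggests the IOMMU is not active;
    any positive number suggests it still is, whatever the BIOS says.
    """
    path = Path(root)
    if not path.is_dir():
        return 0
    # Each group is a directory named with a plain integer.
    return sum(
        1 for entry in path.iterdir()
        if entry.is_dir() and entry.name.isdigit()
    )
```

If this still returns a positive number after the BIOS change, the setting either did not take effect or the board exposes it under a different name.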

I do not use VMs.

I am running the Nvidia build of 6.8 by LS.io as I have a P400 being used with my Plex container. (patiently waiting for their 6.8.1 release.)

BIOS is the latest non beta.

Diag only shows my logs as of the last reboot so I am attaching the logs that I have been mirroring since Jan 6th.

Aside: I've seen a "clock unsynchronized" error a few times, but I don't know how to handle that. Time and date are correct in BIOS and unRAID.

anton-diagnostics-20200114-0824.zip syslog-192.168.99.253.log

JorgeB · January 14, 2020

Ryzen on Linux can lock up due to issues with C-states. Make sure the BIOS is up to date, then look for "Power Supply Idle Control" (or similar) and set it to "Typical Current Idle" (or similar), or completely disable C-states.

More info here:
https://forums.unraid.net/bug-reports/prereleases/670-rc1-system-hard-lock-r354/
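
To confirm from the running system which idle states the kernel actually sees after a BIOS change, the cpuidle entries in sysfs can be listed. This is a rough sketch, assuming the standard Linux cpuidle layout under `/sys/devices/system/cpu/cpu0/cpuidle` and that Python 3 is installed; the function name `list_idle_states` is made up for illustration, and the state names depend on the idle driver in use.

```python
from pathlib import Path


def list_idle_states(cpuidle_dir="/sys/devices/system/cpu/cpu0/cpuidle"):
    """Return [(directory, state name), ...] for one CPU, in numeric order.

    An empty list means the directory is missing, for example because
    no cpuidle driver is loaded.
    """
    base = Path(cpuidle_dir)
    if not base.is_dir():
        return []
    found = []
    for entry in base.iterdir():
        suffix = entry.name[len("state"):]
        if entry.name.startswith("state") and suffix.isdigit():
            name_file = entry / "name"
            if name_file.is_file():
                label = name_file.read_text().strip()
                found.append((int(suffix), entry.name, label))
    # Sort numerically so state10 does not come before state2.
    return [(dirname, label) for _, dirname, label in sorted(found)]
```

With C-states disabled in the BIOS, one would typically expect fewer, shallower states in this list than before the change.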

testdasi · January 14, 2020

5 minutes ago, dlchamp said:

... I did undervolt this chip...

Why?

Instability is very hard to troubleshoot when a CPU is running outside its default range of parameters.

dlchamp · January 14, 2020

8 minutes ago, testdasi said:

Why?

Instability is very hard to troubleshoot when a CPU is running outside its default range of parameters.

Because the server is in a closet with less than stellar airflow, and my dual Xeons would make the room the closet is in extra toasty. I have a lot more experience with Ryzen CPUs, and making this chip run cooler overall was a simple task. It's not the cause of the instability, that I can guarantee.

I can reset everything to default just to rule it out as a possibility, but I really think the issue is elsewhere.

Edited January 14, 2020 by dlchamp
Adding more info

dlchamp · January 14, 2020

8 minutes ago, johnnie.black said:

Ryzen on Linux can lock up due to issues with C-states. Make sure the BIOS is up to date, then look for "Power Supply Idle Control" (or similar) and set it to "Typical Current Idle" (or similar), or completely disable C-states.

More info here:
https://forums.unraid.net/bug-reports/prereleases/670-rc1-system-hard-lock-r354/

I did read this. I did disable C-states in the BIOS. I do recall seeing another setting. I'll give it a try. I just don't fully understand why I had a week of stability and the issues only began after I swapped to NVMe.

dlchamp · January 15, 2020

22 hours ago, johnnie.black said:

Ryzen on Linux can lock up due to issues with C-states. Make sure the BIOS is up to date, then look for "Power Supply Idle Control" (or similar) and set it to "Typical Current Idle" (or similar), or completely disable C-states.

More info here:
https://forums.unraid.net/bug-reports/prereleases/670-rc1-system-hard-lock-r354/

I verified last night that C-states were disabled, but Power Supply Idle Control was left on auto. I went ahead and set that as well. I also loaded BIOS defaults before setting these, which removes my undervolt and rules that out as an issue.

dlchamp · January 15, 2020

22 hours ago, testdasi said:

Why?

Instability is very hard to troubleshoot when a CPU is running outside its default range of parameters.

I verified last night that C-states were disabled, but Power Supply Idle Control was left on auto. I went ahead and set that as well. I also loaded BIOS defaults before setting these, which removes my undervolt and rules that out as an issue.

I'll update here if/when there is another crash.

dlchamp · January 15, 2020

Another crash this morning. Nothing in the logs, but I did grab a picture of the monitor when it happened.

This morning after I made the changes and rebooted, I got this error on screen during the reboot process. After rebooting, it started up normally, but then crashed maybe 30 minutes later. When I went home for lunch to check it out I saw this error. Rebooted and it appears to be running, but I'm VPN'd in and monitoring it as much as I can.

Edited January 15, 2020 by dlchamp

jonp · January 16, 2020

Hi there,

I saw your email to our support team and wanted to take a moment to reply. After reviewing the thread and diagnostics (as well as the error messages you posted most recently), it's pretty clear to me that this isn't so much a software bug as something being amiss with the hardware/configuration. The fact that you're having so many different issues/error messages is a pretty good indication of this. If you were consistently getting the same error or gave us steps that could consistently reproduce an issue, there'd be something for us to investigate on the software side of things.

The first thing I would do is remove the NVMe drive and see if that returns the system to a stable state. If so, power/heat may be an issue and I would try returning the BIOS to defaults (except for disabling C-states as previously suggested) and again, try to recreate the problem.

Vr2Io · January 17, 2020

8 hours ago, jonp said:

The first thing I would do is remove the NVMe drive

Agreed, the AMD-Vi (IOMMU) error also points to the NVMe drive:

Jan 12 06:45:34 Anton kernel: AMD-Vi: Event logged [IO_PAGE_FAULT device=01:00.0

Some feedback, though not related to the crash:

- The memory speed setting (presumably auto) is quite slow, only 1600MT/s, even though the installed RAM is 8G 3200MT/s single-rank modules. I would expect it to run at 2400MT/s or 2666MT/s.

- Undervolting this chip: it may reduce heat under high load, but it will actually generate more heat under low load because the voltage is fixed.

Edited January 17, 2020 by Benson

Froberg · January 17, 2020

NVMe without active cooling would be a concern and could be the cause of the instability. Those bastards can get toasty really quickly.

dlchamp · January 17, 2020

9 hours ago, Froberg said:

NVMe without active cooling would be a concern and could be the cause of the instability. Those bastards can get toasty really quickly.

The motherboard has a built in heatsink for the NVMe which is installed.

dlchamp · January 17, 2020

11 hours ago, Benson said:

Agreed, the AMD-Vi (IOMMU) error also points to the NVMe drive:

Jan 12 06:45:34 Anton kernel: AMD-Vi: Event logged [IO_PAGE_FAULT device=01:00.0

Some feedback, though not related to the crash:

- The memory speed setting (presumably auto) is quite slow, only 1600MT/s, even though the installed RAM is 8G 3200MT/s single-rank modules. I would expect it to run at 2400MT/s or 2666MT/s.

- Undervolting this chip: it may reduce heat under high load, but it will actually generate more heat under low load because the voltage is fixed.

I removed the undervolt. BIOS has been set back to default after my initial post, which was followed by two crashes.

After removing the undervolt, I set C-states to disabled, set Power Supply Idle Control to Typical... (whatever it is), and then manually set the SOC voltage to 1.0 and set RAM back to 3200 by enabling the XMP profile. I did see it crash a couple more times after doing this, but it's now been up without issue since Wed. afternoon. However, I did update to 6.8.1 this morning, so it's been rebooted.

dlchamp · January 17, 2020

20 hours ago, jonp said:

Hi there,

I saw your email to our support team and wanted to take a moment to reply. After reviewing the thread and diagnostics (as well as the error messages you posted most recently), it's pretty clear to me that this isn't so much a software bug as something being amiss with the hardware/configuration. The fact that you're having so many different issues/error messages is a pretty good indication of this. If you were consistently getting the same error or gave us steps that could consistently reproduce an issue, there'd be something for us to investigate on the software side of things.

The first thing I would do is remove the NVMe drive and see if that returns the system to a stable state. If so, power/heat may be an issue and I would try returning the BIOS to defaults (except for disabling C-states as previously suggested) and again, try to recreate the problem.

Thanks for getting back to me!

Since the last crash I reported on Wednesday, the system has been stable and running. I did update to 6.8.1 this morning, but it's back up and running normally.

I'm continuing to monitor, but I'm going to start my rclone script that I've been using for backups to see if for some reason that forces a crash as that is the only thing missing since the crash on Wednesday.

JorgeB · January 17, 2020

1 minute ago, dlchamp said:

and set RAM back to 3200 by enabling the XMP profile.

You shouldn't overclock the RAM. Some Ryzen systems are known to be unstable or to corrupt data with overclocked RAM, so respect the max supported speeds for your config:

[Attached image: 38207429_1stgen.png]

dlchamp · January 17, 2020

5 minutes ago, johnnie.black said:

You shouldn't overclock the RAM. Some Ryzen systems are known to be unstable or to corrupt data with overclocked RAM, so respect the max supported speeds for your config:

Is that a Linux thing?

Running XMP profiles is pretty common. That very RAM came out of my gaming system that I ran at 3200 for the last year or so. I'll set it to auto if I get another crash. As of now, it's stable.

dlchamp · January 17, 2020

12 hours ago, Benson said:

Agreed, the AMD-Vi (IOMMU) error also points to the NVMe drive:

Jan 12 06:45:34 Anton kernel: AMD-Vi: Event logged [IO_PAGE_FAULT device=01:00.0

Some feedback, though not related to the crash:

- The memory speed setting (presumably auto) is quite slow, only 1600MT/s, even though the installed RAM is 8G 3200MT/s single-rank modules. I would expect it to run at 2400MT/s or 2666MT/s.

- Undervolting this chip: it may reduce heat under high load, but it will actually generate more heat under low load because the voltage is fixed.

Thing is though, I have IOMMU disabled in the BIOS.

I read that having it enabled was potentially a problem with certain NVMe controllers, so I did disable it, but unRAID is still seeing it as enabled. HVM, on the other hand, is showing as disabled even though it isn't disabled in the BIOS.

Edited January 17, 2020 by dlchamp

JorgeB · January 17, 2020

1 minute ago, dlchamp said:

Is that a Linux thing?

No, it's OS independent, though it's easier to notice in Unraid during a parity check, e.g.:

JorgeB · January 17, 2020

Running XMP profiles is pretty common.

Also, it's pretty common, but it's still an overclock. Servers and overclocks don't usually go well together. At least it's not something I would ever consider doing, but data integrity is important to me. If it keeps crashing, you should at least test for a few days without any overclock.

Squid · January 17, 2020

8 minutes ago, dlchamp said:

Running XMP profiles is pretty common.

Yes it is. And unfortunately all the manufacturers downplay the fact that it's an overclock

8 minutes ago, dlchamp said:

That very RAM came out of my gaming system that I ran at 3200

Not saying this is the case, but it is rather curious how many times people wind up rebooting a Windows box because "various weird things" happen that aren't necessarily fatal. And yet Windows is a rock solid platform that doesn't actually crash.

dlchamp · January 17, 2020

2 minutes ago, johnnie.black said:

Also, it's pretty common, but it's still an overclock. Servers and overclocks don't usually go well together. At least it's not something I would ever consider doing, but data integrity is important to me. If it keeps crashing, you should at least test for a few days without any overclock.

For sure. I removed the undervolt, but XMP is still enabled. I will set it to auto if another crash happens.

I ran a parity check after the migration to the new hardware, while the undervolt and RAM overclock were both applied, but of course, that doesn't mean they're not a problem.

dlchamp · January 17, 2020

2 minutes ago, Squid said:

Yes it is. And unfortunately all the manufacturers downplay the fact that it's an overclock

Not saying this is the case, but it is rather curious how many times people wind up rebooting a Windows box because "various weird things" happen that aren't necessarily fatal. And yet Windows is a rock solid platform that doesn't actually crash.

Well, that is a big difference here. My Windows machine gets shut down every night when I start getting ready for bed. This machine stays on 24/7, or as close to that as possible.

As I told the others, I will definitely remove the XMP profile if it decides to crash again.

dlchamp · January 20, 2020

I had another crash early Saturday night. I set the RAM back to auto and disabled XMP. It's back up and running, but here is the log that was written right before the crash. I've noticed the AMD-Vi error I was getting before hasn't appeared again since that one day, when it was spammed for about 2 minutes.

Pastebin - Log from 1/18, right before the crash happened.

dlchamp · January 21, 2020

On 1/17/2020 at 11:02 AM, Squid said:

Yes it is. And unfortunately all the manufacturers downplay the fact that it's an overclock

Not saying this is the case, but it is rather curious how many times people wind up rebooting a Windows box because "various weird things" happen that aren't necessarily fatal. And yet Windows is a rock solid platform that doesn't actually crash.

I saw your post here

And I just saw the same thing. I was spammed 39 emails within the last hour. Then, the server crashed, but I'm not home to be able to take a look at logs.

/bin/sh: line 1: 15204 Bus error /usr/local/emhttp/plugins/dynamix/scripts/monitor &> /dev/null
/bin/sh: line 1: 17382 Bus error /usr/local/emhttp/plugins/dynamix/scripts/monitor &> /dev/null
/bin/sh: line 1: 18658 Bus error /usr/local/emhttp/plugins/dynamix/scripts/monitor &> /dev/null
/bin/sh: line 1: 19841 Bus error /usr/local/emhttp/plugins/dynamix/scripts/monitor &> /dev/null
/bin/sh: line 1: 20412 Bus error /usr/local/emhttp/plugins/dynamix/scripts/monitor &> /dev/null
...
...

This is definitely a first.

Edit: Server has been down for roughly 45 minutes, and I'm still getting these emails.

Edited January 21, 2020 by dlchamp
