Kreavan Posted April 12, 2020

I've been trying to fix intermittent hard locks on my Threadripper build, but have not been successful. It has happened twice in the last two nights. I found some similarities in the syslog entries right before each lock. I'm not too familiar with those entries, though, and was hoping someone here might be able to look at them and recognize what they relate to. The two syslog files are just the snippets of activity right before the hard locks occur.

Build details:
ASRock X399 Phantom Gaming 6 (BIOS v1.30)
AMD Threadripper 2950X
32GB Corsair Vengeance LPX DDR4 3200
LSI 9207-8i SAS HBA (PCIe 3.0)
Samsung 860 QVO 1TB (x2 in RAID 1 as the cache)
Western Digital Red 8TB (x5) - main array
Western Digital Red 4TB (x2) - one as an unassigned drive

Thanks!

Attachments: syslog_4_11_2020.txt, syslog_4_12_2020.txt
Dissones4U Posted April 12, 2020 (edited)

36 minutes ago, Kreavan said:
The two syslog files are just the snippets

In general, snippets will not suffice; you should post your full diagnostics. Include any hardware or software changes (including physically moving the box) within four weeks of when the current issue started. Also include any steps you've taken to troubleshoot your rig, such as running memtest, or turning off and restarting each instance of the following one at a time:
- dockers
- VMs
- plugins

The thread below provides lots of information on where to start. Aside from that, I've linked to the FAQ regarding Ryzen-based servers.

Have you read the FAQ - What can I do to keep my Ryzen based server from crashing/locking up with Unraid?
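The "one at a time" elimination Dissones4U describes can be sketched as a small script. Everything here is illustrative, not from the thread: the container names are placeholders (list yours with `docker ps --format '{{.Names}}'`), and the DRY_RUN guard keeps it from touching anything until you flip it. The idea is to stop one container, then wait through the next crash window before stopping the next.

```shell
#!/bin/sh
# Rough sketch of one-at-a-time elimination. Container names below are
# placeholders; substitute your own. With DRY_RUN=1 nothing is stopped,
# the script only prints what it would do.
DRY_RUN=1   # set to 0 to actually issue docker commands
for name in plex sonarr radarr; do
  if [ "$DRY_RUN" -eq 1 ]; then
    echo "would stop: $name"
  else
    docker stop "$name"   # stop it, then observe overnight before continuing
  fi
done
```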
Kreavan Posted April 12, 2020

Here are more details as requested. My Unraid build is new (still on the trial version for 8 more days). This hard lock issue has been happening intermittently pretty much since day one. There have even been a few instances where my Netgear gigabit switch would die along with it, which was strange. I forgot to add to my hardware list: I also have an Nvidia P2000 GPU.

There's no consistency in the crashes. Sometimes it will go 5 days without an issue, and sometimes only 12-15 hrs. I have the following dockers installed, and the issue has happened even with all of them stopped:
- Plex
- Binhex DelugeVPN
- Binhex Sonarr
- Binhex Radarr
- Binhex Krusader
- Jackett

Plugins installed:
CA Auto Update Applications
Community Applications
CA Backup/Restore Appdata
CA Cleanup Appdata
Dynamix Active Streams
Dynamix Cache Directories
Dynamix Date Time
Dynamix SSD Trim
Dynamix System Information
Dynamix System Statistics
Dynamix System Temperature
Fix Common Problems
GPU Statistics
Nerd Tools
Preclear Disks
Unassigned Devices
Unassigned Devices Plus
Unraid Nvidia

I do not have any VMs installed or running.

Steps I've taken so far:
- Flashed the BIOS to the latest version (1.30)
- Disabled all power management features
- No overclocking at all
- Ran Memtest86 and it all came back fine
- Reseated the CPU and all cards
- Replaced the motherboard
- Tried both Ethernet ports
- Tried letting Unraid run idle with no dockers active

Attachments: Full syslog starting from 4_9_2020.txt, diagnostics-20200412-1246.zip
Dissones4U Posted April 12, 2020

49 minutes ago, Kreavan said:
My Unraid build is new (still on trial version for 8 more days).

Make sure your USB flash drive is on a USB 2 port; USB 3 is known to cause random crashing. I'll look at the diagnostics when I get the chance...
Kreavan Posted April 12, 2020

57 minutes ago, Dissones4U said:
Make sure your USB flash drive is on a USB 2 port; USB 3 is known to cause random crashing.

I did have the USB flash drive in a 3.1 slot; I've moved it to a USB 2.0 slot now. I also went into the BIOS and changed the "Power Supply Idle Control" setting from "Auto" to "Typical Current Idle", as I saw other forum posts mentioning that could be a problem. Back up and running for now. Hopefully that fixes it.
JorgeB Posted April 13, 2020

Also see this; overclocked RAM is known to cause stability issues (even data corruption).
Kreavan Posted April 14, 2020

On 4/13/2020 at 4:15 AM, johnnie.black said:
Also see this; overclocked RAM is known to cause stability issues (even data corruption).

I'm running 4 sticks in dual channel at 2133MHz, not overclocked.
Kreavan Posted April 14, 2020

So, the hard lock happened again last night at 2:54:28. I've attached the latest syslog.

Attachment: syslog_4_14_2020.txt
Dissones4U Posted April 14, 2020

As always, if anyone sees something wrong or misinterpreted, please feel free to correct me; this is my best guess...

6 hours ago, Kreavan said:
hard lock happened again

Apr 12 10:14:27 Atlantis kernel: resource sanity check: requesting [mem 0x000c0000-0x000fffff], which spans more than PCI Bus 0000:00 [mem 0x000c0000-0x000dffff window]
Apr 12 10:14:27 Atlantis kernel: caller _nv000908rm+0x1bf/0x1f0 [nvidia] mapping multiple BARs

There are lots of these, related to Nvidia. If I'm not mistaken, it means you've installed a modified version of unRaid and would need to seek support for that plugin here. Within that topic, someone says they have the same error and then the system gets unresponsive.

Apr 9 09:24:09 Atlantis kernel: pcieport 0000:40:01.1: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Receiver ID)
Apr 9 09:24:09 Atlantis kernel: pcieport 0000:40:01.1: device [1022:1453] error status/mask=00000040/00006000
Apr 9 09:24:09 Atlantis kernel: pcieport 0000:40:01.1: [ 6] BadTLP

There are lots of these too, and according to this thread they can cause a locked console. This thread on the unRaid forum says it may be fixed by moving the offending controller. Between this PCIe error and the PCI bus error involving Nvidia, I think this is where I'd start (your log is completely spammed with these). One final thread about the PCIe error mentions Nvidia too.
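As a quick way to gauge how badly a log is spammed with these AER messages, the error lines can be counted and grouped by the reporting port. This is only an illustrative sketch run against sample lines copied from the excerpt above; on the live server you would point the greps at /var/log/syslog instead of the sample file.

```shell
#!/bin/sh
# Count corrected PCIe (AER) errors and group them by reporting port.
# The sample lines below are copied from the syslog excerpt in this thread.
cat > /tmp/syslog.sample <<'EOF'
Apr 9 09:24:09 Atlantis kernel: pcieport 0000:40:01.1: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Receiver ID)
Apr 9 09:24:09 Atlantis kernel: pcieport 0000:40:01.1: device [1022:1453] error status/mask=00000040/00006000
Apr 9 09:24:09 Atlantis kernel: pcieport 0000:40:01.1: [ 6] BadTLP
EOF

echo "total PCIe Bus Errors: $(grep -c 'PCIe Bus Error' /tmp/syslog.sample)"
# Which PCIe port is reporting them (here, 0000:40:01.1):
grep -o 'pcieport [0-9a-f:.]*' /tmp/syslog.sample | sort | uniq -c
```

If one port dominates the counts, that points at the card to move or reseat, which is what the linked threads suggest.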
Kreavan Posted April 15, 2020

22 hours ago, Dissones4U said:
Between this PCIe error and the PCI bus error involving Nvidia, I think this is where I'd start (your log is completely spammed with these).

I've uninstalled the GPU Statistics plugin, as I saw it could be the cause of the constant errors like the following:

Apr 15 12:59:13 Atlantis kernel: resource sanity check: requesting [mem 0x000c0000-0x000fffff], which spans more than PCI Bus 0000:00 [mem 0x000c0000-0x000dffff window]
Apr 15 12:59:13 Atlantis kernel: caller _nv000908rm+0x1bf/0x1f0 [nvidia] mapping multiple BARs
Apr 15 13:01:11 Atlantis emhttpd: cmd: /usr/local/emhttp/plugins/dynamix.plugin.manager/scripts/plugin remove gpustat.plg

I also added "rcu_nocbs=0-31" to the kernel command line, as I saw in a video from SpaceInvaderOne that it helped stability on his Ryzen build.

Lastly, I noticed that in most instances of my hard locks, the server appeared to be doing a NIC check of some sort. It happened again last night, and in the process the Netgear switch my server is plugged into froze too. I read that some switches do not support bonding and that this could cause problems, so I've disabled bonding as well.

So far, 1 hr in, the resource sanity check error has not popped back up in the syslog. No lockups yet, but they usually tend to happen overnight anyway. Will post the results tomorrow. Thanks!
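For reference, on Unraid the flag goes on the kernel append line, usually edited from Main > Flash > Syslinux Configuration in the web GUI, which writes /boot/syslinux/syslinux.cfg. A sketch of what the boot stanza might look like with the flag added; the surrounding lines are illustrative of a default config, not copied from this server:

```text
label Unraid OS
  menu default
  kernel /bzimage
  append rcu_nocbs=0-31 initrd=/bzroot
```

The change only takes effect after a reboot.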
Kreavan Posted April 16, 2020

So, the server did not lock up overnight like it has previously. However, earlier today I did lose connectivity to the web GUI, and network activity seemed to cease completely on the server. I checked the syslog, and it seems docker0 had some issues:

Apr 16 10:28:05 Atlantis kernel: docker0: port 1(veth67de24f) entered disabled state
Apr 16 10:28:05 Atlantis kernel: vethd439256: renamed from eth0
Apr 16 10:28:06 Atlantis avahi-daemon[6289]: Interface veth67de24f.IPv6 no longer relevant for mDNS.
Apr 16 10:28:06 Atlantis avahi-daemon[6289]: Leaving mDNS multicast group on interface veth67de24f.IPv6 with address fe80::ec65:dbff:fe4a:32e2.
Apr 16 10:28:06 Atlantis kernel: docker0: port 1(veth67de24f) entered disabled state
Apr 16 10:28:06 Atlantis kernel: device veth67de24f left promiscuous mode
Apr 16 10:28:06 Atlantis kernel: docker0: port 1(veth67de24f) entered disabled state
Apr 16 10:28:06 Atlantis avahi-daemon[6289]: Withdrawing address record for fe80::ec65:dbff:fe4a:32e2 on veth67de24f.
Apr 16 10:28:06 Atlantis kernel: docker0: port 1(veth6f25b29) entered blocking state
Apr 16 10:28:06 Atlantis kernel: docker0: port 1(veth6f25b29) entered disabled state
Apr 16 10:28:06 Atlantis kernel: device veth6f25b29 entered promiscuous mode
Apr 16 10:28:06 Atlantis kernel: IPv6: ADDRCONF(NETDEV_UP): veth6f25b29: link is not ready
Apr 16 10:28:06 Atlantis kernel: docker0: port 1(veth6f25b29) entered blocking state
Apr 16 10:28:06 Atlantis kernel: docker0: port 1(veth6f25b29) entered forwarding state
Apr 16 10:28:06 Atlantis kernel: eth0: renamed from veth9d8220e
Apr 16 10:28:06 Atlantis kernel: IPv6: ADDRCONF(NETDEV_CHANGE): veth6f25b29: link becomes ready
Apr 16 10:28:08 Atlantis avahi-daemon[6289]: Joining mDNS multicast group on interface veth6f25b29.IPv6 with address fe80::80e3:9ff:fe4f:1367.
Apr 16 10:28:08 Atlantis avahi-daemon[6289]: New relevant interface veth6f25b29.IPv6 for mDNS.
Apr 16 10:28:08 Atlantis avahi-daemon[6289]: Registering new address record for fe80::80e3:9ff:fe4f:1367 on veth6f25b29.*.

How do I find out which one of my dockers is docker0?
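To see which veth interfaces the docker0 bridge was cycling, the names can be pulled out of the log. This sketch runs against sample lines copied from the excerpt above; on the live server you would grep /var/log/syslog, and note that veth names change every time a container restarts.

```shell
#!/bin/sh
# Extract the unique veth interface names that docker0 added/removed.
# The sample lines below are copied from the log excerpt in this thread.
cat > /tmp/docker0.sample <<'EOF'
Apr 16 10:28:05 Atlantis kernel: docker0: port 1(veth67de24f) entered disabled state
Apr 16 10:28:06 Atlantis kernel: docker0: port 1(veth6f25b29) entered blocking state
Apr 16 10:28:06 Atlantis kernel: docker0: port 1(veth6f25b29) entered forwarding state
EOF
grep -o 'veth[0-9a-f]*' /tmp/docker0.sample | sort -u
```

Here the excerpt shows one veth being torn down (veth67de24f) and a new one coming up (veth6f25b29), which is the normal pattern when a container restarts.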
Dissones4U Posted April 17, 2020

On 4/16/2020 at 2:14 PM, Kreavan said:
the server did not lock up overnight like it has previously

Good, that's a start.

On 4/16/2020 at 2:14 PM, Kreavan said:
How do I find out which one of my dockers is docker0?

docker0 isn't one of your dockers; it's simply the network interface that Docker creates so containers can reach each other and the internet. See here for a description.
Kreavan Posted April 17, 2020

I'm still experiencing random issues with network connectivity, as mentioned in my last post. Based on what I'm seeing in the syslog, I'm thinking it might have something to do with the Binhex-DelugeVPN docker, as I'm finding references to my VPN IP in some of the log lines before it happens. So, I've shut down that docker and will see if that stabilizes things.
Kreavan Posted April 19, 2020

The server hard locked sometime last night. The syslog doesn't show any new entries since the last time I checked it, so I'm a bit lost now. I'm guessing it's something in the BIOS, maybe. Any thoughts on what I could check or change there to stabilize my build? I'll try to take photos of my current BIOS settings and post them after my shift.
Kreavan Posted April 20, 2020

Attached are screenshots of my BIOS settings. I had to break them into separate zip files, as I couldn't upload them as a single large zip.

Attachments: 4.zip, 3.zip, 2.zip, 1.zip
Kreavan Posted April 21, 2020

Experienced another hard lock last night while I was asleep. This morning, I found the Global C-state option in the BIOS and changed it from "Auto" to "Disabled". Hoping this fixes the instability. I'll keep you posted.
Kreavan Posted April 22, 2020

My server hard crashed again last night. The trend seems to be that it hard crashes overnight while I'm asleep; it runs fine all day. I'm running out of ideas for what could be causing this.
meep Posted April 22, 2020

No help, but I'm in a similar boat with a similar system (ASRock Taichi X399, TR 2950X). The system was pretty solid for months, but I made a few changes over the past couple of weeks and it's been very unstable since. Like yours, it will run fine for a few hours, but then it will lock up overnight and take a few attempts to boot reliably. I ran a memtest today and it froze at 2 hrs, so that's likely a clue. I've removed the extra 32GB I added and will run again overnight to see if that's the culprit. Not a solution for you, I know, but I just wanted to say I empathise!
Kreavan Posted April 22, 2020

1 hour ago, meep said:
Not a solution for you, I know, but I just wanted to say I empathise!

I appreciate knowing that I'm not the only one. I think my next step will be to boot Windows 10 off a USB and run Ryzen Master. Maybe that'll give me some more info.
Kreavan Posted April 23, 2020

Results of my Windows 10 test: it loaded up fine. I launched Ryzen Master, and all the temps are normal. I also ran CPU-Z, Prime95, and Cinebench; I stress tested the machine and nothing crashed. One thing to note: I found out that "Global C-state" in the BIOS was still set to "Auto" when I thought I had disabled it. It is now turned off for sure, so I'll find out tomorrow whether another hard lock occurs overnight. Keeping my fingers crossed.
meep Posted April 24, 2020

So the new memory was the problem for me. I had 64GB in 4 DIMMs and added another 4x 8GB of otherwise identical chips, and the fun started! Since I removed them, all has been well with the world again. Now I just need to figure out whether it's all the new DIMMs, just one of them, or maybe one or more slots. I have days if not weeks of testing ahead, but at least I feel I'm on the right track.
Kreavan Posted April 25, 2020

OK, so I'm on day 3 of zero crashes. It looks like my issue has been fixed. Below is a summary of the changes that appear to have led to this stable result:

- Moved the Unraid USB to a 2.0 slot
- Flashed the BIOS to the latest version (1.30)
- Disabled all power management features
- No overclocking at all (everything default or Auto)
- Changed the BIOS setting "Power Supply Idle Control" to "Typical Current Idle"
- Uninstalled the GPU Statistics plugin
- Added "rcu_nocbs=0-31" to the kernel command line
- Disabled bonding and bridging in Network Settings
- Changed the BIOS setting "Global C-state" to "Disabled"
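One way to confirm that the rcu_nocbs flag actually took effect after a reboot is to check the running kernel's command line. This sketch uses a hard-coded sample value so it is runnable anywhere; on the server itself you would read /proc/cmdline instead, as the comment notes.

```shell
#!/bin/sh
# Verify the rcu_nocbs flag is present on the kernel command line.
# On the live server, use instead: cmdline=$(cat /proc/cmdline)
cmdline="BOOT_IMAGE=/bzimage rcu_nocbs=0-31 initrd=/bzroot"   # sample value
case "$cmdline" in
  *rcu_nocbs=*) echo "rcu_nocbs is set" ;;
  *)            echo "rcu_nocbs missing" ;;
esac
```

The same check catches the "thought I changed it but didn't" failure mode that happened with the Global C-state BIOS setting earlier in the thread, at least for anything visible from the running OS.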