[SOLVED] Troubleshooting hard lock issue - Please help



I've been trying to fix intermittent hard locks on my Threadripper build, without success.  It has happened twice in the last two nights.  I found some similarities in the syslog in the moments right before each lock.  I'm not too familiar with the entries though, and was hoping someone here might be able to look at them and recognize what they relate to.

The two syslog files are just the snippets of activity right before the hard locks occur.

Build details:
ASRock X399 Phantom Gaming 6 (BIOS v1.30)
AMD Threadripper 2950X
32GB Corsair Vengeance LPX DDR4 3200
9207-8i HBA SAS PCIe 3.0 adapter
Samsung 860 QVO 1TB (x2 in RAID 1 as cache drive)
Western Digital Red 8TB (x5) - Main Array
Western Digital Red 4TB (x2) - One as an Unassigned drive


Thanks!

syslog_4_11_2020.txt syslog_4_12_2020.txt

Link to comment
36 minutes ago, Kreavan said:

The two syslog files are just the snippets

In general, snippets will not suffice. You should post your full diagnostics and describe any hardware or software changes (including physically moving the box) made within four weeks of the issue starting. Also include any steps you've taken to troubleshoot your rig, including running memtest or turning off and restarting each of the following, one at a time:

  • dockers
  • VMs
  • plugins

The thread below provides lots of information on where to start. Aside from that I've linked to the FAQ regarding Ryzen based servers.

 

Have you read the FAQ - What can I do to keep my Ryzen based server from crashing/locking up with Unraid?

 

Edited by Dissones4U
Link to comment

Here are more details as requested.  My Unraid build is new (still on the trial version for 8 more days).  This hard lock issue has been happening intermittently pretty much since day one.  There have even been a few instances where my Netgear gigabit switch died along with it, which was strange.  I forgot to list it in my hardware build, but I also have an Nvidia P2000 GPU.  There is no consistency in the crashes: sometimes it will go 5 days without an issue, and sometimes only 12-15 hours.

 

I have the following dockers installed, and the issue has happened even with all of them stopped.

- Plex

- Binhex Delugevpn

- Binhex Sonarr

- Binhex Radarr

- Binhex Krusader

- Jackett

 

Plugins installed:

CA Auto Update Applications

Community Applications

CA Backup/Restore Appdata

CA Cleanup Appdata

Dynamix Active Streams

Dynamix Cache Directories

Dynamix Date Time

Dynamix SSD Trim

Dynamix System Information

Dynamix System Statistics

Dynamix System Temperature

Fix Common Problems

GPU Statistics

Nerd Tools

Preclear Disks

Unassigned Devices

Unassigned Devices Plus

Unraid Nvidia

 

I do not have any VMs installed or running.

 

Steps I've done so far:

- Flashed BIOS to the latest version (1.30)

- Disabled all power management features

- No overclocking at all

- Ran Memtest86 and it all came back fine

- Reseated CPU and all cards

- Replaced motherboard

- Tried both Ethernet ports

- Tried letting Unraid run idle with no dockers active

Full syslog starting from 4_9_2020.txt diagnostics-20200412-1246.zip

Link to comment
57 minutes ago, Dissones4U said:

Make sure your USB flash drive is on a USB2 port; USB3 is known to cause random crashing. I'll look at the diagnostics when I get the chance...

I did have the USB flash on a 3.1 slot.  I've moved it to a USB 2.0 slot now.

I also went into the BIOS and changed the "Power Supply Idle Control" setting from "Auto" to "Typical Current Idle" as I saw other forum posts mentioning that could be a problem.

Back up and running for now.  Hopefully that fixes it.

Link to comment

As always if anyone sees something wrong or misinterpreted please feel free to correct me, this is my best guess...

6 hours ago, Kreavan said:

hard lock happened again

Quote

Apr 12 10:14:27 Atlantis kernel: resource sanity check: requesting [mem 0x000c0000-0x000fffff], which spans more than PCI Bus 0000:00 [mem 0x000c0000-0x000dffff window]
Apr 12 10:14:27 Atlantis kernel: caller _nv000908rm+0x1bf/0x1f0 [nvidia] mapping multiple BARs

There are lots of these (↑), related to Nvidia. If I'm not mistaken, it means you've installed a modified version of unRaid and would need to seek support for that plugin here. Within that topic, someone reports the same error followed by the system becoming unresponsive.

Quote

Apr  9 09:24:09 Atlantis kernel: pcieport 0000:40:01.1: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Receiver ID)
Apr  9 09:24:09 Atlantis kernel: pcieport 0000:40:01.1:   device [1022:1453] error status/mask=00000040/00006000
Apr  9 09:24:09 Atlantis kernel: pcieport 0000:40:01.1:    [ 6] BadTLP                

There are lots of these too, and according to this thread they can cause a locked console. This thread on the unRaid forum says it may be fixed by moving the offending controller. Between this PCIe error and the PCI Bus message from Nvidia, I think this is where I'd start (your log is completely spammed with these). One final thread about the PCIe error mentions Nvidia too.
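
For anyone reading along, the status/mask pair in the PCIe line above can be decoded by hand. Below is a minimal sketch, assuming the standard bit layout of the PCIe AER Correctable Error Status register. Apart from BadTLP, which the kernel prints itself, the labels are my own shorthand and not kernel output, and the function names are made up for illustration.

```python
import re

# Bit positions in the PCIe AER Correctable Error Status register.
# Labels other than "BadTLP" are illustrative shorthand, not kernel strings.
CORRECTABLE_BITS = {
    0: "ReceiverError",
    6: "BadTLP",
    7: "BadDLLP",
    8: "ReplayNumRollover",
    12: "ReplayTimerTimeout",
    13: "AdvisoryNonFatal",
    14: "CorrectedInternal",
    15: "HeaderLogOverflow",
}

STATUS_RE = re.compile(r"error status/mask=([0-9a-fA-F]{8})/([0-9a-fA-F]{8})")


def decode_bits(value):
    """Return the labels of every known bit that is set in value."""
    return [name for bit, name in sorted(CORRECTABLE_BITS.items())
            if value & (1 << bit)]


def decode_line(line):
    """Pull status and mask out of a kernel AER line and decode both."""
    match = STATUS_RE.search(line)
    if match is None:
        return None
    status, mask = (int(group, 16) for group in match.groups())
    return {"status": decode_bits(status), "mask": decode_bits(mask)}


# Line taken from the syslog excerpt quoted in this thread.
line = ("Apr  9 09:24:09 Atlantis kernel: pcieport 0000:40:01.1:   "
        "device [1022:1453] error status/mask=00000040/00006000")
print(decode_line(line))
```

Status 00000040 has only bit 6 set, which matches the "[ 6] BadTLP" line the kernel prints right after it. The error is reported as Corrected, so the link recovered each time, but a steady stream of them still points at the slot, the card, or the link behind 0000:40:01.1.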

Link to comment

I've uninstalled the GPU Statistics plugin, as I saw it could be the cause of the following constantly repeating errors:

 

Apr 15 12:59:13 Atlantis kernel: resource sanity check: requesting [mem 0x000c0000-0x000fffff], which spans more than PCI Bus 0000:00 [mem 0x000c0000-0x000dffff window]
Apr 15 12:59:13 Atlantis kernel: caller _nv000908rm+0x1bf/0x1f0 [nvidia] mapping multiple BARs
Apr 15 13:01:11 Atlantis emhttpd: cmd: /usr/local/emhttp/plugins/dynamix.plugin.manager/scripts/plugin remove gpustat.plg

 

I also added "rcu_nocbs=0-31" to the kernel boot parameters, as I saw in a video from SpaceInvaderOne that it helped stability with his Ryzen build.
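
The 0-31 range is simply every logical CPU on the 2950X (16 cores, 32 threads with SMT on). Here is a small sketch of how the value is derived and how one might confirm it took effect after a reboot. Both helper names are hypothetical, and the sample command line in the comment is an assumption about what a typical boot line looks like, not something copied from the diagnostics.

```python
def rcu_nocbs_arg(logical_cpus):
    """Build an rcu_nocbs= argument covering CPUs 0..logical_cpus-1."""
    if logical_cpus < 1:
        raise ValueError("need at least one CPU")
    if logical_cpus == 1:
        return "rcu_nocbs=0"
    return f"rcu_nocbs=0-{logical_cpus - 1}"


def find_rcu_nocbs(cmdline):
    """Return the rcu_nocbs token from a kernel command line, or None."""
    for token in cmdline.split():
        if token.startswith("rcu_nocbs="):
            return token
    return None


# 32 logical CPUs on a Threadripper 2950X with SMT enabled.
print(rcu_nocbs_arg(32))  # rcu_nocbs=0-31

# Assumed example of a boot command line containing the parameter.
print(find_rcu_nocbs("BOOT_IMAGE=/bzimage rcu_nocbs=0-31 initrd=/bzroot"))
```

On the running server, passing the contents of /proc/cmdline to find_rcu_nocbs should return the token if the parameter was picked up at boot.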

Lastly, I noticed in most of the instances of my hard locks, the server appeared to be doing a NIC check of some sort.  It happened again last night and in the process of that, the Netgear switch my server is plugged into froze too.  I read that some switches do not support bonding and could cause problems.  So, I've disabled bonding as well.

So far, 1 hour in, the resource sanity check error has not popped back up in the syslog.  No lockups yet, but they usually tend to happen overnight anyway.  Will post the results tomorrow.
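
Rather than eyeballing the log, a quick way to confirm an error has really stopped is to count it per day. This is only a rough sketch: count_by_day is a made-up helper, and it assumes the classic syslog timestamp prefix ("Apr 15", "Apr  9") seen in the excerpts in this thread.

```python
from collections import Counter


def count_by_day(lines, needle):
    """Count lines containing needle, keyed by the 'Mon DD' syslog prefix."""
    counts = Counter()
    for line in lines:
        if needle in line:
            # The first six characters are e.g. 'Apr 15' or 'Apr  9';
            # split/join collapses the padding space in single-digit days.
            day = " ".join(line[:6].split())
            counts[day] += 1
    return counts


# Sample lines copied from the syslog excerpts posted in this thread.
sample = [
    "Apr 12 10:14:27 Atlantis kernel: resource sanity check: requesting "
    "[mem 0x000c0000-0x000fffff], which spans more than PCI Bus 0000:00 "
    "[mem 0x000c0000-0x000dffff window]",
    "Apr 15 12:59:13 Atlantis kernel: resource sanity check: requesting "
    "[mem 0x000c0000-0x000fffff], which spans more than PCI Bus 0000:00 "
    "[mem 0x000c0000-0x000dffff window]",
    "Apr 15 13:01:11 Atlantis emhttpd: cmd: /usr/local/emhttp/plugins/"
    "dynamix.plugin.manager/scripts/plugin remove gpustat.plg",
]

for day, hits in count_by_day(sample, "resource sanity check").items():
    print(day, hits)
```

Against the real log you would pass an open file instead of the sample list. A day with no entry in the result means the message did not appear that day.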

 

Thanks!

Link to comment

So, the server did not lock up overnight like it has previously.  However, earlier today I lost connectivity to the web GUI, and it seemed network activity had ceased completely on the server.  I checked the syslog and it seems docker0 had some issues.

Apr 16 10:28:05 Atlantis kernel: docker0: port 1(veth67de24f) entered disabled state
Apr 16 10:28:05 Atlantis kernel: vethd439256: renamed from eth0
Apr 16 10:28:06 Atlantis avahi-daemon[6289]: Interface veth67de24f.IPv6 no longer relevant for mDNS.
Apr 16 10:28:06 Atlantis avahi-daemon[6289]: Leaving mDNS multicast group on interface veth67de24f.IPv6 with address fe80::ec65:dbff:fe4a:32e2.
Apr 16 10:28:06 Atlantis kernel: docker0: port 1(veth67de24f) entered disabled state
Apr 16 10:28:06 Atlantis kernel: device veth67de24f left promiscuous mode
Apr 16 10:28:06 Atlantis kernel: docker0: port 1(veth67de24f) entered disabled state
Apr 16 10:28:06 Atlantis avahi-daemon[6289]: Withdrawing address record for fe80::ec65:dbff:fe4a:32e2 on veth67de24f.
Apr 16 10:28:06 Atlantis kernel: docker0: port 1(veth6f25b29) entered blocking state
Apr 16 10:28:06 Atlantis kernel: docker0: port 1(veth6f25b29) entered disabled state
Apr 16 10:28:06 Atlantis kernel: device veth6f25b29 entered promiscuous mode
Apr 16 10:28:06 Atlantis kernel: IPv6: ADDRCONF(NETDEV_UP): veth6f25b29: link is not ready
Apr 16 10:28:06 Atlantis kernel: docker0: port 1(veth6f25b29) entered blocking state
Apr 16 10:28:06 Atlantis kernel: docker0: port 1(veth6f25b29) entered forwarding state
Apr 16 10:28:06 Atlantis kernel: eth0: renamed from veth9d8220e
Apr 16 10:28:06 Atlantis kernel: IPv6: ADDRCONF(NETDEV_CHANGE): veth6f25b29: link becomes ready
Apr 16 10:28:08 Atlantis avahi-daemon[6289]: Joining mDNS multicast group on interface veth6f25b29.IPv6 with address fe80::80e3:9ff:fe4f:1367.
Apr 16 10:28:08 Atlantis avahi-daemon[6289]: New relevant interface veth6f25b29.IPv6 for mDNS.
Apr 16 10:28:08 Atlantis avahi-daemon[6289]: Registering new address record for fe80::80e3:9ff:fe4f:1367 on veth6f25b29.*.

How do I find out which one of my dockers is docker0?
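
For what it's worth, as far as I know docker0 is not one of the containers: it is the default bridge interface Docker creates on the host, and each vethXXXX name is the host end of one container's virtual Ethernet pair. The lines above look like one veth leaving the bridge and a new one joining, which would be consistent with a single container stopping and starting, though that is a guess from the log alone. A minimal sketch for pulling those port events out of a syslog (the regex and the bridge_events name are my own, not part of any tool):

```python
import re

# Matches kernel bridge port state lines such as:
# "... kernel: docker0: port 1(veth67de24f) entered disabled state"
PORT_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+ \d{2}:\d{2}:\d{2}) \S+ kernel: "
    r"(?P<bridge>\S+): port (?P<port>\d+)\((?P<veth>\w+)\) "
    r"entered (?P<state>\w+) state"
)


def bridge_events(lines):
    """Return (timestamp, bridge, veth, state) for each port state change."""
    events = []
    for line in lines:
        match = PORT_RE.match(line)
        if match:
            events.append((match["ts"], match["bridge"],
                           match["veth"], match["state"]))
    return events


# Lines copied from the excerpt above.
sample = [
    "Apr 16 10:28:05 Atlantis kernel: docker0: port 1(veth67de24f) entered disabled state",
    "Apr 16 10:28:06 Atlantis kernel: device veth67de24f left promiscuous mode",
    "Apr 16 10:28:06 Atlantis kernel: docker0: port 1(veth6f25b29) entered blocking state",
    "Apr 16 10:28:06 Atlantis kernel: docker0: port 1(veth6f25b29) entered forwarding state",
]

for event in bridge_events(sample):
    print(event)
```

This only shows which veth changed state and when. Lining the timestamps up against when each container was started, stopped, or auto-updated is probably the simplest way to tell which docker was behind a given veth.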

Link to comment

I'm still experiencing random issues with network connectivity as mentioned in my last post.  Based on what I'm seeing in the syslog, I'm thinking it might have something to do with the Binhex-DelugeVPN docker, as I'm finding references to my VPN IP in some of the log lines before it happens.

So, I've shut down that docker and will see if that stabilizes things.

Link to comment

The server hard locked sometime last night.  The syslog doesn't show any new entries since the last time I checked it, so I'm a bit lost now.  I guess it may be something in the BIOS.  Any thoughts on what I could check or change there to stabilize my build?  I'll try to take photos of my current BIOS settings and post them after my shift.

Link to comment

No help, but I'm in a similar boat with a similar system (Asrock Taichi X399, TR 2950X)

 

The system was pretty solid for months, but I made a few changes over the past couple of weeks and it's been very unstable since. Like you, it will run fine for a few hours, but then will lock up overnight, and will take a few attempts to boot reliably.

 

I ran a memtest today, and it froze at 2 hrs, so that's likely a clue. I've removed the extra 32GB I added and will run again overnight to see if that's the culprit.

 

Not a solution for you, I know, but just wanted to say I empathise!

 

 

Link to comment

I appreciate knowing I'm not the only one.  I think my next step will be to run Windows 10 off a USB drive and run Ryzen Master.  Maybe that'll give me some more info.

Link to comment

Results of my Windows 10 test: it loaded up fine.  I launched Ryzen Master and all the temps are normal.  I also ran CPU-Z, Prime95, and Cinebench to stress test the machine, and nothing crashed.

 

One thing to note, I did find out that "Global C-state" in the BIOS was still set to "Auto" when I thought I had disabled it.  It is now turned off for sure.  So, I'll find out tomorrow if another hard lock occurs overnight.  Keeping my fingers crossed.

Link to comment

So the new memory was the problem for me. I had 64GB in 4 DIMMs. I’d added another 4x 8GB DIMMs with otherwise identical chips, and the fun started! Since I removed them, all has been well with the world again.

 

Now I just need to figure out if it’s all the new DIMMs, just one of them, or maybe one or more slots. I have days if not weeks of testing ahead, but at least I feel I’m on the right track.

Link to comment

Ok, so I'm on day 3 of zero crashes.  Looks like my issue has been fixed.  Summary below of the changes that appear to have led me to this stable result.

 

- Moved Unraid USB to 2.0 slot.

- Flashed BIOS to latest version (1.30)

- Disabled all power management features

- No overclocking at all (Everything default or Auto)

- Changed BIOS setting "Power Supply Idle Control" to "Typical Current Idle"

- Uninstalled GPU Statistics plugin

- Added "rcu_nocbs=0-31" to the kernel boot parameters

- Disabled Bonding and Bridging in Network Settings

- Changed BIOS setting "Global C-state" to "Disabled"

Link to comment
  • JorgeB changed the title to [SOLVED] Troubleshooting hard lock issue - Please help
