Kreavan Posted April 12, 2020

I've been trying to fix intermittent hard locks on my Threadripper build, but have not been successful. It has happened twice in the last two nights. I found some similarities in the syslog entries right before each lock. I'm not too familiar with those entries, though, and was hoping someone here might be able to look at them and recognize what they relate to. The two syslog files are just the snippets of activity right before the hard locks occur.

Build details:
ASRock X399 Phantom Gaming 6 (BIOS v1.30)
AMD Threadripper 2950X
32GB Corsair Vengeance LPX DDR4 3200
LSI 9207-8i SAS HBA (PCIe 3.0)
Samsung 860 QVO 1TB (x2 in RAID 1 as the cache)
Western Digital Red 8TB (x5) - main array
Western Digital Red 4TB (x2) - one as an unassigned drive

Thanks!

Attachments: syslog_4_11_2020.txt, syslog_4_12_2020.txt
Dissones4U Posted April 12, 2020 (edited)

36 minutes ago, Kreavan said:
The two syslog files are just the snippets

In general, snippets will not suffice; you should post your full diagnostics. Include any hardware or software changes (including physically moving the box) within four weeks of when the current issue started. Also include any steps you've taken to troubleshoot your rig, such as running memtest, or turning off and restarting each instance of the following one at a time:
- dockers
- VMs
- plugins

The thread below provides lots of information on where to start. Aside from that, I've linked to the FAQ regarding Ryzen-based servers.

Have you read the FAQ - What can I do to keep my Ryzen based server from crashing/locking up with Unraid?
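The "one at a time" elimination Dissones4U describes can be sketched as a small script. Everything here is illustrative, not from the thread: the container names are placeholders (list yours with `docker ps --format '{{.Names}}'`), and the DRY_RUN guard keeps it from touching anything until you flip it. The idea is to stop one container, then wait through the next crash window before stopping the next.

```shell
#!/bin/sh
# Rough sketch of one-at-a-time elimination. Container names below are
# placeholders; substitute your own. With DRY_RUN=1 nothing is stopped,
# the script only prints what it would do.
DRY_RUN=1   # set to 0 to actually issue docker commands
for name in plex sonarr radarr; do
  if [ "$DRY_RUN" -eq 1 ]; then
    echo "would stop: $name"
  else
    docker stop "$name"   # stop it, then observe overnight before continuing
  fi
done
```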
Kreavan Posted April 12, 2020

Here are more details as requested. My Unraid build is new (still on the trial version for 8 more days). This hard lock issue has been happening intermittently pretty much since day one. There have even been a few instances where my Netgear gigabit switch would die along with it, which was strange. I forgot to add to my hardware list: I also have an Nvidia P2000 GPU.

There's no consistency in the crashes. Sometimes it will go 5 days without an issue, and sometimes only 12-15 hrs. I have the following dockers installed, and the issue has happened even with all of them stopped:
- Plex
- Binhex DelugeVPN
- Binhex Sonarr
- Binhex Radarr
- Binhex Krusader
- Jackett

Plugins installed:
CA Auto Update Applications
Community Applications
CA Backup/Restore Appdata
CA Cleanup Appdata
Dynamix Active Streams
Dynamix Cache Directories
Dynamix Date Time
Dynamix SSD Trim
Dynamix System Information
Dynamix System Statistics
Dynamix System Temperature
Fix Common Problems
GPU Statistics
Nerd Tools
Preclear Disks
Unassigned Devices
Unassigned Devices Plus
Unraid Nvidia

I do not have any VMs installed or running.

Steps I've taken so far:
- Flashed the BIOS to the latest version (1.30)
- Disabled all power management features
- No overclocking at all
- Ran Memtest86 and it all came back fine
- Reseated the CPU and all cards
- Replaced the motherboard
- Tried both Ethernet ports
- Tried letting Unraid run idle with no dockers active

Attachments: Full syslog starting from 4_9_2020.txt, diagnostics-20200412-1246.zip
Dissones4U Posted April 12, 2020

49 minutes ago, Kreavan said:
My Unraid build is new (still on trial version for 8 more days).

Make sure your USB flash drive is on a USB 2 port; USB 3 is known to cause random crashing. I'll look at the diagnostics when I get the chance...
Kreavan Posted April 12, 2020

57 minutes ago, Dissones4U said:
Make sure your USB flash drive is on a USB 2 port; USB 3 is known to cause random crashing.

I did have the USB flash drive in a 3.1 slot; I've moved it to a USB 2.0 slot now. I also went into the BIOS and changed the "Power Supply Idle Control" setting from "Auto" to "Typical Current Idle", as I saw other forum posts mentioning that could be a problem. Back up and running for now. Hopefully that fixes it.
JorgeB Posted April 13, 2020

Also see this; overclocked RAM is known to cause stability issues (even data corruption).
Kreavan Posted April 14, 2020

On 4/13/2020 at 4:15 AM, johnnie.black said:
Also see this; overclocked RAM is known to cause stability issues (even data corruption).

I'm running 4 sticks in dual channel at 2133MHz, not overclocked.
Kreavan Posted April 14, 2020

So, the hard lock happened again last night at 2:54:28. I've attached the latest syslog.

Attachment: syslog_4_14_2020.txt
Dissones4U Posted April 14, 2020

As always, if anyone sees something wrong or misinterpreted, please feel free to correct me; this is my best guess...

6 hours ago, Kreavan said:
hard lock happened again

Apr 12 10:14:27 Atlantis kernel: resource sanity check: requesting [mem 0x000c0000-0x000fffff], which spans more than PCI Bus 0000:00 [mem 0x000c0000-0x000dffff window]
Apr 12 10:14:27 Atlantis kernel: caller _nv000908rm+0x1bf/0x1f0 [nvidia] mapping multiple BARs

There are lots of these, related to Nvidia. If I'm not mistaken, it means you've installed a modified version of unRaid and would need to seek support for that plugin here. Within that topic, someone says they have the same error and then the system gets unresponsive.

Apr 9 09:24:09 Atlantis kernel: pcieport 0000:40:01.1: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Receiver ID)
Apr 9 09:24:09 Atlantis kernel: pcieport 0000:40:01.1: device [1022:1453] error status/mask=00000040/00006000
Apr 9 09:24:09 Atlantis kernel: pcieport 0000:40:01.1: [ 6] BadTLP

There are lots of these too, and according to this thread they can cause a locked console. This thread on the unRaid forum says it may be fixed by moving the offending controller. Between this PCIe error and the PCI bus error involving Nvidia, I think this is where I'd start (your log is completely spammed with these). One final thread about the PCIe error mentions Nvidia too.
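As a quick way to gauge how badly a log is spammed with these AER messages, the error lines can be counted and grouped by the reporting port. This is only an illustrative sketch run against sample lines copied from the excerpt above; on the live server you would point the greps at /var/log/syslog instead of the sample file.

```shell
#!/bin/sh
# Count corrected PCIe (AER) errors and group them by reporting port.
# The sample lines below are copied from the syslog excerpt in this thread.
cat > /tmp/syslog.sample <<'EOF'
Apr 9 09:24:09 Atlantis kernel: pcieport 0000:40:01.1: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Receiver ID)
Apr 9 09:24:09 Atlantis kernel: pcieport 0000:40:01.1: device [1022:1453] error status/mask=00000040/00006000
Apr 9 09:24:09 Atlantis kernel: pcieport 0000:40:01.1: [ 6] BadTLP
EOF

echo "total PCIe Bus Errors: $(grep -c 'PCIe Bus Error' /tmp/syslog.sample)"
# Which PCIe port is reporting them (here, 0000:40:01.1):
grep -o 'pcieport [0-9a-f:.]*' /tmp/syslog.sample | sort | uniq -c
```

If one port dominates the counts, that points at the card to move or reseat, which is what the linked threads suggest.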
Kreavan Posted April 15, 2020

22 hours ago, Dissones4U said:
Between this PCIe error and the PCI bus error involving Nvidia, I think this is where I'd start (your log is completely spammed with these).

I've uninstalled the GPU Statistics plugin, as I saw it could be the cause of the constant errors like the following:

Apr 15 12:59:13 Atlantis kernel: resource sanity check: requesting [mem 0x000c0000-0x000fffff], which spans more than PCI Bus 0000:00 [mem 0x000c0000-0x000dffff window]
Apr 15 12:59:13 Atlantis kernel: caller _nv000908rm+0x1bf/0x1f0 [nvidia] mapping multiple BARs
Apr 15 13:01:11 Atlantis emhttpd: cmd: /usr/local/emhttp/plugins/dynamix.plugin.manager/scripts/plugin remove gpustat.plg

I also added "rcu_nocbs=0-31" to the kernel command line, as I saw in a video from SpaceInvaderOne that it helped stability on his Ryzen build.

Lastly, I noticed that in most instances of my hard locks, the server appeared to be doing a NIC check of some sort. It happened again last night, and in the process the Netgear switch my server is plugged into froze too. I read that some switches do not support bonding and that this could cause problems, so I've disabled bonding as well.

So far, 1 hr in, the resource sanity check error has not popped back up in the syslog. No lockups yet, but they usually tend to happen overnight anyway. Will post the results tomorrow. Thanks!
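For reference, on Unraid the flag goes on the kernel append line, usually edited from Main > Flash > Syslinux Configuration in the web GUI, which writes /boot/syslinux/syslinux.cfg. A sketch of what the boot stanza might look like with the flag added; the surrounding lines are illustrative of a default config, not copied from this server:

```text
label Unraid OS
  menu default
  kernel /bzimage
  append rcu_nocbs=0-31 initrd=/bzroot
```

The change only takes effect after a reboot.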
Kreavan Posted April 16, 2020

So, the server did not lock up overnight like it has previously. However, earlier today I did lose connectivity to the web GUI, and network activity seemed to cease completely on the server. I checked the syslog, and it seems docker0 had some issues:

Apr 16 10:28:05 Atlantis kernel: docker0: port 1(veth67de24f) entered disabled state
Apr 16 10:28:05 Atlantis kernel: vethd439256: renamed from eth0
Apr 16 10:28:06 Atlantis avahi-daemon[6289]: Interface veth67de24f.IPv6 no longer relevant for mDNS.
Apr 16 10:28:06 Atlantis avahi-daemon[6289]: Leaving mDNS multicast group on interface veth67de24f.IPv6 with address fe80::ec65:dbff:fe4a:32e2.
Apr 16 10:28:06 Atlantis kernel: docker0: port 1(veth67de24f) entered disabled state
Apr 16 10:28:06 Atlantis kernel: device veth67de24f left promiscuous mode
Apr 16 10:28:06 Atlantis kernel: docker0: port 1(veth67de24f) entered disabled state
Apr 16 10:28:06 Atlantis avahi-daemon[6289]: Withdrawing address record for fe80::ec65:dbff:fe4a:32e2 on veth67de24f.
Apr 16 10:28:06 Atlantis kernel: docker0: port 1(veth6f25b29) entered blocking state
Apr 16 10:28:06 Atlantis kernel: docker0: port 1(veth6f25b29) entered disabled state
Apr 16 10:28:06 Atlantis kernel: device veth6f25b29 entered promiscuous mode
Apr 16 10:28:06 Atlantis kernel: IPv6: ADDRCONF(NETDEV_UP): veth6f25b29: link is not ready
Apr 16 10:28:06 Atlantis kernel: docker0: port 1(veth6f25b29) entered blocking state
Apr 16 10:28:06 Atlantis kernel: docker0: port 1(veth6f25b29) entered forwarding state
Apr 16 10:28:06 Atlantis kernel: eth0: renamed from veth9d8220e
Apr 16 10:28:06 Atlantis kernel: IPv6: ADDRCONF(NETDEV_CHANGE): veth6f25b29: link becomes ready
Apr 16 10:28:08 Atlantis avahi-daemon[6289]: Joining mDNS multicast group on interface veth6f25b29.IPv6 with address fe80::80e3:9ff:fe4f:1367.
Apr 16 10:28:08 Atlantis avahi-daemon[6289]: New relevant interface veth6f25b29.IPv6 for mDNS.
Apr 16 10:28:08 Atlantis avahi-daemon[6289]: Registering new address record for fe80::80e3:9ff:fe4f:1367 on veth6f25b29.*.

How do I find out which one of my dockers is docker0?
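To see which veth interfaces the docker0 bridge was cycling, the names can be pulled out of the log. This sketch runs against sample lines copied from the excerpt above; on the live server you would grep /var/log/syslog, and note that veth names change every time a container restarts.

```shell
#!/bin/sh
# Extract the unique veth interface names that docker0 added/removed.
# The sample lines below are copied from the log excerpt in this thread.
cat > /tmp/docker0.sample <<'EOF'
Apr 16 10:28:05 Atlantis kernel: docker0: port 1(veth67de24f) entered disabled state
Apr 16 10:28:06 Atlantis kernel: docker0: port 1(veth6f25b29) entered blocking state
Apr 16 10:28:06 Atlantis kernel: docker0: port 1(veth6f25b29) entered forwarding state
EOF
grep -o 'veth[0-9a-f]*' /tmp/docker0.sample | sort -u
```

Here the excerpt shows one veth being torn down (veth67de24f) and a new one coming up (veth6f25b29), which is the normal pattern when a container restarts.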
Dissones4U Posted April 17, 2020

On 4/16/2020 at 2:14 PM, Kreavan said:
the server did not lock up overnight like it has previously

Good, that's a start.

On 4/16/2020 at 2:14 PM, Kreavan said:
How do I find out which one of my dockers is docker0?

docker0 isn't one of your dockers; it's simply the network interface that Docker creates so containers can reach each other and the internet. See here for a description.
Kreavan Posted April 17, 2020

I'm still experiencing random issues with network connectivity, as mentioned in my last post. Based on what I'm seeing in the syslog, I'm thinking it might have something to do with the Binhex-DelugeVPN docker, as I'm finding references to my VPN IP in some of the log lines before it happens. So, I've shut down that docker and will see if that stabilizes things.
Kreavan Posted April 19, 2020

The server hard locked sometime last night. The syslog doesn't show any new entries since the last time I checked it, so I'm a bit lost now. I'm guessing it's something in the BIOS, maybe. Any thoughts on what I could check or change there to stabilize my build? I'll try to take photos of my current BIOS settings and post them after my shift.
Kreavan Posted April 20, 2020

Attached are screenshots of my BIOS settings. I had to break them into separate zip files, as I couldn't upload them as a single large zip.

Attachments: 4.zip, 3.zip, 2.zip, 1.zip
Kreavan Posted April 21, 2020

Experienced another hard lock last night while I was asleep. This morning, I found the Global C-state option in the BIOS and changed it from "Auto" to "Disabled". Hoping this fixes the instability. I'll keep you posted.
Kreavan Posted April 22, 2020

My server hard crashed again last night. The trend seems to be that it hard crashes overnight while I'm asleep; it runs fine all day. I'm running out of ideas for what could be causing this.
meep Posted April 22, 2020

No help, but I'm in a similar boat with a similar system (ASRock Taichi X399, TR 2950X). The system was pretty solid for months, but I made a few changes over the past couple of weeks and it's been very unstable since. Like yours, it will run fine for a few hours, but then it will lock up overnight and take a few attempts to boot reliably. I ran a memtest today and it froze at 2 hrs, so that's likely a clue. I've removed the extra 32GB I added and will run again overnight to see if that's the culprit. Not a solution for you, I know, but I just wanted to say I empathise!
Kreavan Posted April 22, 2020

1 hour ago, meep said:
Not a solution for you, I know, but I just wanted to say I empathise!

I appreciate knowing that I'm not the only one. I think my next step will be to boot Windows 10 off a USB and run Ryzen Master. Maybe that'll give me some more info.
Kreavan Posted April 23, 2020

Results of my Windows 10 test: it loaded up fine. I launched Ryzen Master, and all the temps are normal. I also ran CPU-Z, Prime95, and Cinebench; I stress tested the machine and nothing crashed. One thing to note: I found out that "Global C-state" in the BIOS was still set to "Auto" when I thought I had disabled it. It is now turned off for sure, so I'll find out tomorrow whether another hard lock occurs overnight. Keeping my fingers crossed.
meep Posted April 24, 2020

So the new memory was the problem for me. I had 64GB in 4 DIMMs and added another 4x 8GB of otherwise identical chips, and the fun started! Since I removed them, all has been well with the world again. Now I just need to figure out whether it's all the new DIMMs, just one of them, or maybe one or more slots. I have days if not weeks of testing ahead, but at least I feel I'm on the right track.
Kreavan Posted April 25, 2020

OK, so I'm on day 3 of zero crashes. It looks like my issue has been fixed. Below is a summary of the changes that appear to have led to this stable result:

- Moved the Unraid USB to a 2.0 slot
- Flashed the BIOS to the latest version (1.30)
- Disabled all power management features
- No overclocking at all (everything default or Auto)
- Changed the BIOS setting "Power Supply Idle Control" to "Typical Current Idle"
- Uninstalled the GPU Statistics plugin
- Added "rcu_nocbs=0-31" to the kernel command line
- Disabled bonding and bridging in Network Settings
- Changed the BIOS setting "Global C-state" to "Disabled"
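One way to confirm that the rcu_nocbs flag actually took effect after a reboot is to check the running kernel's command line. This sketch uses a hard-coded sample value so it is runnable anywhere; on the server itself you would read /proc/cmdline instead, as the comment notes.

```shell
#!/bin/sh
# Verify the rcu_nocbs flag is present on the kernel command line.
# On the live server, use instead: cmdline=$(cat /proc/cmdline)
cmdline="BOOT_IMAGE=/bzimage rcu_nocbs=0-31 initrd=/bzroot"   # sample value
case "$cmdline" in
  *rcu_nocbs=*) echo "rcu_nocbs is set" ;;
  *)            echo "rcu_nocbs missing" ;;
esac
```

The same check catches the "thought I changed it but didn't" failure mode that happened with the Global C-state BIOS setting earlier in the thread, at least for anything visible from the running OS.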