kharntiitar Posted June 21, 2020 (edited)

Version 6.8.3 2020-03-05

Hi all,

I've spent a bit of time narrowing this down, but in the interest of full disclosure I will explain everything I have done up until now, as I have made some big changes.

A couple of weeks ago, I upgraded my Unraid server to a Ryzen 7 2700X with an ASUS X470-Pro motherboard. I also added two M.2 NVMe drives. I used one of the NVMe drives as cache, and followed instructions here to move my appdata to the second NVMe drive.

After about 72 hours the server became unresponsive and, after a bit of troubleshooting, I found that the cache drive had become disconnected. This occurred twice more, each time after just a few hours, so I took that drive back to the shop, and they are testing it for me. I had a spare NVMe drive, which was a bit older and slower, and put that in as the cache drive.

Then I started getting different errors. I noticed that multiple instances of "runc" were using high CPU (> 100%), as were several dockers. Shutting down dockers was still possible, but it took about an hour (compared to a normal 30 seconds to 2 minutes).

I installed a second NIC and set up my dockers and VMs to run through it, but the problem occurred again: high "runc" and docker CPU utilisation. This time, though, I was able to communicate just fine with my VMs (suggesting it's not the entire network stack, and is definitely just eth0).

It was at this point I noticed something similar to the following:

Jun 21 10:10:48 kernel: igb 0000:07:00.0 eth0: igb: eth0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: None
Jun 21 10:10:58 kernel: igb 0000:07:00.0 eth0: igb: eth0 NIC Link is Down
Jun 21 10:11:01 kernel: igb 0000:07:00.0 eth0: igb: eth0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: None
Jun 21 10:16:47 apcupsd[21860]: Communications with UPS lost.
Jun 21 10:17:16 kernel: igb 0000:07:00.0 eth0: igb: eth0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: None
Jun 21 10:17:29 kernel: igb 0000:07:00.0 eth0: igb: eth0 NIC Link is Down

This was repeated throughout the syslog, and was echoed as interface up/down events on the switch that my server connects to.

Today I also found a trace in my syslog which would appear to be exactly what the issue is; however, I'm not sure exactly what it means or what I can do about it.

Jun 21 09:19:22 kernel: ------------[ cut here ]------------
Jun 21 09:19:22 kernel: NETDEV WATCHDOG: eth0 (igb): transmit queue 0 timed out
Jun 21 09:19:22 kernel: WARNING: CPU: 11 PID: 28598 at net/sched/sch_generic.c:465 dev_watchdog+0x161/0x1bb
Jun 21 09:19:22 kernel: Modules linked in: vhost_net tun vhost tap kvm_amd ccp kvm xt_CHECKSUM ipt_REJECT ip6table_mangle ip6table_nat nf_nat_ipv6 ip6table_filter ip6_tables wireguard ip6_udp_tunnel udp_tunnel iptable_raw iptable_mangle xt_nat veth macvlan ipt_MASQUERADE iptable_filter iptable_nat nf_nat_ipv4 nf_nat ip_tables xfs nfsd lockd grace sunrpc md_mod e1000e igb(O) edac_mce_amd crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel mpt3sas pcbc wmi_bmof mxm_wmi aesni_intel aes_x86_64 crypto_simd i2c_piix4 cryptd i2c_core nvme raid_class ahci k10temp libahci scsi_transport_sas glue_helper nvme_core wmi button [last unloaded: ccp]
Jun 21 09:19:22 kernel: CPU: 11 PID: 28598 Comm: runc Tainted: G O 4.19.107-Unraid #1
Jun 21 09:19:22 kernel: Hardware name: System manufacturer System Product Name/PRIME X470-PRO, BIOS 5406 11/13/2019
Jun 21 09:19:22 kernel: RIP: 0010:dev_watchdog+0x161/0x1bb
Jun 21 09:19:22 kernel: Code: 5f 94 00 00 75 39 48 89 ef c6 05 4e 5f 94 00 01 e8 a1 a8 fd ff 44 89 e9 48 89 ee 48 c7 c7 57 2a da 81 48 89 c2 e8 cd 0b af ff <0f> 0b eb 11 41 ff c5 48 81 c2 40 01 00 00 41 39 cd 75 95 eb 13 48
Jun 21 09:19:22 kernel: RSP: 0018:ffff8887fe8c3ea0 EFLAGS: 00010286
Jun 21 09:19:22 kernel: RAX: 0000000000000000 RBX: ffff8887f770e438 RCX: 0000000000000007
Jun 21 09:19:22 kernel: RDX: 0000000000000b7e RSI: 0000000000000002 RDI: ffff8887fe8d64f0
Jun 21 09:19:22 kernel: RBP: ffff8887f770e000 R08: 0000000000000003 R09: 0000000000000400
Jun 21 09:19:22 kernel: R10: 0000000000000000 R11: 0000000000000058 R12: ffff8887f770e41c
Jun 21 09:19:22 kernel: R13: 0000000000000000 R14: ffff8887f6f06940 R15: 000000000000000b
Jun 21 09:19:22 kernel: FS: 0000000000d6a880(0000) GS:ffff8887fe8c0000(0000) knlGS:0000000000000000
Jun 21 09:19:22 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jun 21 09:19:22 kernel: CR2: 000014fd54626880 CR3: 0000000242d90000 CR4: 00000000003406e0
Jun 21 09:19:22 kernel: Call Trace:
Jun 21 09:19:22 kernel:
Jun 21 09:19:22 kernel: call_timer_fn+0x18/0x7b
Jun 21 09:19:22 kernel: ? qdisc_reset+0xc0/0xc0
Jun 21 09:19:22 kernel: expire_timers+0x7e/0x8d
Jun 21 09:19:22 kernel: run_timer_softirq+0x72/0x120
Jun 21 09:19:22 kernel: ? enqueue_hrtimer.isra.0+0x23/0x27
Jun 21 09:19:22 kernel: ? __hrtimer_run_queues+0xdd/0x10b
Jun 21 09:19:22 kernel: ? ktime_get+0x44/0x95
Jun 21 09:19:22 kernel: __do_softirq+0xc9/0x1d7
Jun 21 09:19:22 kernel: irq_exit+0x5e/0x9d
Jun 21 09:19:22 kernel: smp_apic_timer_interrupt+0x80/0x93
Jun 21 09:19:22 kernel: apic_timer_interrupt+0xf/0x20
Jun 21 09:19:22 kernel:
Jun 21 09:19:22 kernel: RIP: 0010:prepend_path+0xb1/0x205
Jun 21 09:19:22 kernel: Code: 44 24 14 41 89 c2 eb 13 49 39 c6 74 4c 4d 8b 5e 18 4c 8d 68 20 49 89 c6 4c 89 db 48 8b 44 24 08 48 3b 58 08 74 7b 49 8b 55 00 <48> 39 da 74 09 4c 8b 5b 18 49 39 db 75 45 48 39 da 49 8b 46 10 74
Jun 21 09:19:22 kernel: RSP: 0018:ffffc9000d703cf8 EFLAGS: 00000283 ORIG_RAX: ffffffffffffff13
Jun 21 09:19:22 kernel: RAX: ffff88822463f088 RBX: ffff8887a0fb5140 RCX: ffffc9000d703da8
Jun 21 09:19:22 kernel: RDX: ffff8887a0fb5140 RSI: ffff88822463f088 RDI: ffffc9000d703da8
Jun 21 09:19:22 kernel: RBP: ffffc9000d703d68 R08: 0000000000000000 R09: ffffc9000d703da8
Jun 21 09:19:22 kernel: R10: 000000000008b216 R11: ffff8887a0fb5140 R12: 000000000053cd36
Jun 21 09:19:22 kernel: R13: ffff8881726b4920 R14: ffff8881726b4900 R15: ffffc9000d703d64
Jun 21 09:19:22 kernel: ? __dentry_path.part.0+0xa7/0x115
Jun 21 09:19:22 kernel: __d_path+0x59/0x86
Jun 21 09:19:22 kernel: seq_path_root+0x40/0x95
Jun 21 09:19:22 kernel: show_mountinfo+0xc5/0x260
Jun 21 09:19:22 kernel: seq_read+0x231/0x313
Jun 21 09:19:22 kernel: __vfs_read+0x32/0x132
Jun 21 09:19:22 kernel: ? __switch_to_asm+0x41/0x70
Jun 21 09:19:22 kernel: vfs_read+0xa4/0x124
Jun 21 09:19:22 kernel: ksys_read+0x60/0xb2
Jun 21 09:19:22 kernel: do_syscall_64+0x57/0xf2
Jun 21 09:19:22 kernel: entry_SYSCALL_64_after_hwframe+0x44/0xa9
Jun 21 09:19:22 kernel: RIP: 0033:0x4a41c0
Jun 21 09:19:22 kernel: Code: 8b 7c 24 10 48 8b 74 24 18 48 8b 54 24 20 49 c7 c2 00 00 00 00 49 c7 c0 00 00 00 00 49 c7 c1 00 00 00 00 48 8b 44 24 08 0f 05 <48> 3d 01 f0 ff ff 76 20 48 c7 44 24 28 ff ff ff ff 48 c7 44 24 30
Jun 21 09:19:22 kernel: RSP: 002b:000000c00005db10 EFLAGS: 00000202 ORIG_RAX: 0000000000000000
Jun 21 09:19:22 kernel: RAX: ffffffffffffffda RBX: 000000c00001c000 RCX: 00000000004a41c0
Jun 21 09:19:22 kernel: RDX: 0000000000001000 RSI: 000000c0000e3000 RDI: 0000000000000004
Jun 21 09:19:22 kernel: RBP: 000000c00005db60 R08: 0000000000000000 R09: 0000000000000000
Jun 21 09:19:22 kernel: R10: 0000000000000000 R11: 0000000000000202 R12: 0000000000000001
Jun 21 09:19:22 kernel: R13: 0000000000000075 R14: 0000000000895f8a R15: 0000000000000038
Jun 21 09:19:22 kernel: ---[ end trace ad1ca502756d72cd ]---

The NIC it is talking about is the onboard NIC. The lspci output is here:

07:00.0 Ethernet controller: Intel Corporation I211 Gigabit Network Connection (rev 03)
    Subsystem: ASUSTeK Computer Inc. I211 Gigabit Network Connection
    Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
    Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
    Latency: 0, Cache Line Size: 64 bytes
    Interrupt: pin A routed to IRQ 38
    Region 0: Memory at fc500000 (32-bit, non-prefetchable) [size=128K]
    Region 2: I/O ports at b000 [size=32]
    Region 3: Memory at fc520000 (32-bit, non-prefetchable) [size=16K]
    Capabilities: [40] Power Management version 3
        Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
        Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=1 PME-
    Capabilities: [50] MSI: Enable- Count=1/1 Maskable+ 64bit+
        Address: 0000000000000000 Data: 0000
        Masking: 00000000 Pending: 00000000
    Capabilities: [70] MSI-X: Enable+ Count=5 Masked-
        Vector table: BAR=3 offset=00000000
        PBA: BAR=3 offset=00002000
    Capabilities: [a0] Express (v2) Endpoint, MSI 00
        DevCap: MaxPayload 512 bytes, PhantFunc 0, Latency L0s <512ns, L1 <64us ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ SlotPowerLimit 0.000W
        DevCtl: CorrErr+ NonFatalErr+ FatalErr+ UnsupReq+ RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+ FLReset- MaxPayload 256 bytes, MaxReadReq 512 bytes
        DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr+ TransPend-
        LnkCap: Port #7, Speed 2.5GT/s, Width x1, ASPM L0s L1, Exit Latency L0s <2us, L1 <16us ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
        LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk+ ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
        LnkSta: Speed 2.5GT/s (ok), Width x1 (ok) TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
        DevCap2: Completion Timeout: Range ABCD, TimeoutDis+, LTR-, OBFF Not Supported AtomicOpsCap: 32bit- 64bit- 128bitCAS-
        DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled AtomicOpsCtl: ReqEn-
        LnkCtl2: Target Link Speed: 2.5GT/s, EnterCompliance- SpeedDis- Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS- Compliance De-emphasis: -6dB
        LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete-, EqualizationPhase1- EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
    Capabilities: [100 v2] Advanced Error Reporting
        UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
        UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
        UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
        CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr-
        CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
        AERCap: First Error Pointer: 00, ECRCGenCap+ ECRCGenEn- ECRCChkCap+ ECRCChkEn- MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
        HeaderLog: 00000000 00000000 00000000 00000000
    Capabilities: [140 v1] Device Serial Number 4c-ed-fb-ff-ff-7a-07-47
    Capabilities: [1a0 v1] Transaction Processing Hints
        Device specific mode supported
        Steering table in TPH capability structure
    Kernel driver in use: igb
    Kernel modules: igb

If I stop all dockers from the CLI (I've had to write a script and run it via screen, as I keep getting disconnected), everything goes back to normal, and I can start them again fine without rebooting the system. At least, until the next time it dies.

If anyone can point me in the right direction as to what I've done wrong (and I'm sure it's me, as I made a heap of changes), I'd really appreciate it. If I can provide more info or clarify something, just say the word.

Edited June 23, 2020 by kharntiitar (Mark as solved)
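For anyone chasing a similar problem, the link up/down entries and the NETDEV WATCHDOG line quoted in the post can be tallied with a short script rather than read by eye. The following is only a minimal sketch in Python: the regular expressions are written against the exact message shapes shown in the syslog excerpts in this thread, and the function name `summarize_link_events` is made up for illustration. Other drivers or kernel versions may word these messages differently.

```python
import re
from collections import Counter

# Matches igb link-state messages in the shape quoted in the post, e.g.
# "kernel: igb 0000:07:00.0 eth0: igb: eth0 NIC Link is Down"
LINK_RE = re.compile(
    r"kernel: igb \S+ (?P<iface>\w+): igb: \w+ NIC Link is (?P<state>Up|Down)"
)

# Matches the transmit-timeout warning quoted in the post, e.g.
# "NETDEV WATCHDOG: eth0 (igb): transmit queue 0 timed out"
WATCHDOG_RE = re.compile(
    r"NETDEV WATCHDOG: (?P<iface>\w+) \(\w+\): transmit queue \d+ timed out"
)


def summarize_link_events(lines):
    """Count link-up, link-down and tx-timeout events per interface.

    `lines` is any iterable of syslog lines. The result maps
    (interface, event) to a count, where event is "up", "down"
    or "tx_timeout".
    """
    counts = Counter()
    for line in lines:
        match = LINK_RE.search(line)
        if match:
            counts[(match.group("iface"), match.group("state").lower())] += 1
            continue
        match = WATCHDOG_RE.search(line)
        if match:
            counts[(match.group("iface"), "tx_timeout")] += 1
    return counts
```

To use it, open the syslog file and pass the file object straight in, for example `summarize_link_events(open("/var/log/syslog"))`; the path is an assumption and may differ on your system. A large number of "down" events on one interface, with few or none on the others, points at that NIC or whatever sits behind it rather than the whole network stack.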
kharntiitar (Author) Posted June 21, 2020

Of note, one other thing I did was disable C-states, in case the NVMe drive was somehow going to sleep. I haven't undone this change.
JorgeB Posted June 21, 2020

Since it's a Ryzen server, see here first:

https://forums.unraid.net/topic/46802-faq-for-unraid-v6/page/2/?tab=comments#comment-819173
kharntiitar (Author) Posted June 21, 2020

Thanks @johnnie.black

I note two things from that:

First, the maximum supported memory speed for 2nd gen is apparently not 3200, despite what the motherboard says (and what I've configured). I'll lower this and see how that goes.

Second, I also note from the link in that thread that I shouldn't be turning off C-states, and should instead leave them on auto. I'll do that as well.

Will let you know how I get on tomorrow.
kharntiitar (Author) Posted June 21, 2020

Purely out of interest, setting the memory to defaults changed the speed to 2133 MHz.
kharntiitar (Author) Posted June 22, 2020

So far so good: uptime is 21 hours 47 minutes since the RAM speed change. Since I'd previously had the same C-state settings, I doubt that will have made a difference, but who knows?! Will keep monitoring.
kharntiitar (Author) Posted June 23, 2020

Well, uptime is now 49 hours, so I'm going to go ahead and close this off as solved. Thank you very much @johnnie.black for your help and for pointing me to the solution!

TL;DR: dropping the RAM speed from 3200 MHz to (auto) 2133 MHz solved the issue.
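Since the fix came down to the configured RAM speed, it can be handy to confirm from the running system what speed the DIMMs actually ended up at after a BIOS change. Below is a rough sketch in Python that pulls the configured speed out of the text printed by `dmidecode -t memory`. It assumes the field is labelled either "Configured Memory Speed" or "Configured Clock Speed", which varies by dmidecode version. The function names are made up for illustration, and the speed limit is left as a parameter because the supported figure for a given CPU should come from the FAQ linked above, not from this sketch.

```python
import re

# Assumption: the field is labelled "Configured Memory Speed" or
# "Configured Clock Speed" depending on the dmidecode version.
# Slots reporting "Unknown" carry no number and are skipped.
SPEED_RE = re.compile(
    r"^\s*Configured (?:Memory|Clock) Speed:\s*(\d+)\s*(?:MT/s|MHz)\s*$",
    re.MULTILINE,
)


def configured_memory_speeds(dmidecode_text):
    """Return the configured speed of each reporting DIMM as integers."""
    return [int(value) for value in SPEED_RE.findall(dmidecode_text)]


def exceeds_limit(dmidecode_text, limit):
    """Return True if any DIMM is configured faster than `limit`."""
    return any(
        speed > limit for speed in configured_memory_speeds(dmidecode_text)
    )
```

The text to parse can be captured with `subprocess.run(["dmidecode", "-t", "memory"], capture_output=True, text=True, check=True).stdout`, which normally needs root. Passing that string and your CPU's supported speed to `exceeds_limit` gives a quick yes/no on whether the memory is running above it.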