[SOLVED] Crash within 18-24 hours


Version 6.8.3 2020-03-05

 

Hi all. I've spent a bit of time narrowing this down, but in the interest of full disclosure I'll explain everything I have done up until now, as I have made some big changes.

 

A couple of weeks ago, I upgraded my Unraid server to a Ryzen 7 2700X with an ASUS PRIME X470-PRO motherboard. I also added two M.2 NVMe drives.

I used one of the NVMe drives as cache, and followed instructions here to move my appdata to the second NVMe drive.

 

I noted that after about 72 hours I was having an issue where the server was unresponsive, and (after a bit of troubleshooting) the cache drive had become disconnected. This occurred twice more, after just a few hours, so I took that drive back to the shop, and they are testing it for me.

 

I had a spare NVMe drive, which was a bit older/slower, and put that in as the cache drive. Then I started getting different errors: multiple instances of "runc" were showing high CPU usage (over 100%), as were several dockers. Shutting down dockers was still possible, but it took about an hour (compared to the normal 30 seconds to 2 minutes).

 

I installed a second NIC, and set up my dockers and VMs to run through the second NIC, but it occurred again: high "runc" and docker CPU utilisation. This time I was able to communicate just fine with my VMs, which suggests it's not the entire network stack and is just eth0. It was at this point I noticed something similar to the following:

 

Jun 21 10:10:48 kernel: igb 0000:07:00.0 eth0: igb: eth0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: None
Jun 21 10:10:58 kernel: igb 0000:07:00.0 eth0: igb: eth0 NIC Link is Down
Jun 21 10:11:01 kernel: igb 0000:07:00.0 eth0: igb: eth0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: None
Jun 21 10:16:47 apcupsd[21860]: Communications with UPS lost.
Jun 21 10:17:16 kernel: igb 0000:07:00.0 eth0: igb: eth0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: None
Jun 21 10:17:29 kernel: igb 0000:07:00.0 eth0: igb: eth0 NIC Link is Down

This was repeated throughout the syslog, and was echoed as interface up/down events on the switch that my server connects to.
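
For anyone wanting to quantify this sort of flapping rather than eyeball it, here is a minimal sketch in Python that counts "NIC Link is Up/Down" lines per interface. It assumes syslog lines shaped like the ones quoted above (an optional hostname before "kernel:" is tolerated); the function names `link_events` and `summarize` are just illustrative, not part of any tool.

```python
import re
from collections import Counter

# Matches lines such as:
#   Jun 21 10:10:58 kernel: igb 0000:07:00.0 eth0: igb: eth0 NIC Link is Down
# An optional hostname between the timestamp and "kernel:" is also accepted.
LINK_RE = re.compile(
    r"^(\w{3}\s+\d+\s+[\d:]{8}) .*kernel: .* (\S+) NIC Link is (Up|Down)"
)


def link_events(lines):
    """Yield (timestamp, interface, state) for every link state line."""
    for line in lines:
        m = LINK_RE.match(line)
        if m:
            yield m.group(1), m.group(2), m.group(3)


def summarize(lines):
    """Return a Counter keyed by (interface, state)."""
    counts = Counter()
    for _, iface, state in link_events(lines):
        counts[(iface, state)] += 1
    return counts
```

Any iterable of lines works as input, so an open file handle for the syslog (commonly `/var/log/syslog`, though that path is an assumption about your setup) can be passed straight to `summarize`. A large number of Down events on one interface, with none on the other NIC, would back up the idea that the problem is specific to eth0.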

 

Today I also found a trace in my syslog which would appear to be exactly what the issue is; however, I'm not sure exactly what it means or what I can do about it.

 


Jun 21 09:19:22 kernel: ------------[ cut here ]------------
Jun 21 09:19:22 kernel: NETDEV WATCHDOG: eth0 (igb): transmit queue 0 timed out
Jun 21 09:19:22 kernel: WARNING: CPU: 11 PID: 28598 at net/sched/sch_generic.c:465 dev_watchdog+0x161/0x1bb
Jun 21 09:19:22 kernel: Modules linked in: vhost_net tun vhost tap kvm_amd ccp kvm xt_CHECKSUM ipt_REJECT ip6table_mangle ip6table_nat nf_nat_ipv6 ip6table_filter ip6_tables wireguard ip6_udp_tunnel udp_tunnel iptable_raw iptable_mangle xt_nat veth macvlan ipt_MASQUERADE iptable_filter iptable_nat nf_nat_ipv4 nf_nat ip_tables xfs nfsd lockd grace sunrpc md_mod e1000e igb(O) edac_mce_amd crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel mpt3sas pcbc wmi_bmof mxm_wmi aesni_intel aes_x86_64 crypto_simd i2c_piix4 cryptd i2c_core nvme raid_class ahci k10temp libahci scsi_transport_sas glue_helper nvme_core wmi button [last unloaded: ccp]
Jun 21 09:19:22 kernel: CPU: 11 PID: 28598 Comm: runc Tainted: G           O      4.19.107-Unraid #1
Jun 21 09:19:22 kernel: Hardware name: System manufacturer System Product Name/PRIME X470-PRO, BIOS 5406 11/13/2019
Jun 21 09:19:22 kernel: RIP: 0010:dev_watchdog+0x161/0x1bb
Jun 21 09:19:22 kernel: Code: 5f 94 00 00 75 39 48 89 ef c6 05 4e 5f 94 00 01 e8 a1 a8 fd ff 44 89 e9 48 89 ee 48 c7 c7 57 2a da 81 48 89 c2 e8 cd 0b af ff <0f> 0b eb 11 41 ff c5 48 81 c2 40 01 00 00 41 39 cd 75 95 eb 13 48
Jun 21 09:19:22 kernel: RSP: 0018:ffff8887fe8c3ea0 EFLAGS: 00010286
Jun 21 09:19:22 kernel: RAX: 0000000000000000 RBX: ffff8887f770e438 RCX: 0000000000000007
Jun 21 09:19:22 kernel: RDX: 0000000000000b7e RSI: 0000000000000002 RDI: ffff8887fe8d64f0
Jun 21 09:19:22 kernel: RBP: ffff8887f770e000 R08: 0000000000000003 R09: 0000000000000400
Jun 21 09:19:22 kernel: R10: 0000000000000000 R11: 0000000000000058 R12: ffff8887f770e41c
Jun 21 09:19:22 kernel: R13: 0000000000000000 R14: ffff8887f6f06940 R15: 000000000000000b
Jun 21 09:19:22 kernel: FS:  0000000000d6a880(0000) GS:ffff8887fe8c0000(0000) knlGS:0000000000000000
Jun 21 09:19:22 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jun 21 09:19:22 kernel: CR2: 000014fd54626880 CR3: 0000000242d90000 CR4: 00000000003406e0
Jun 21 09:19:22 kernel: Call Trace:
Jun 21 09:19:22 kernel: 
Jun 21 09:19:22 kernel: call_timer_fn+0x18/0x7b
Jun 21 09:19:22 kernel: ? qdisc_reset+0xc0/0xc0
Jun 21 09:19:22 kernel: expire_timers+0x7e/0x8d
Jun 21 09:19:22 kernel: run_timer_softirq+0x72/0x120
Jun 21 09:19:22 kernel: ? enqueue_hrtimer.isra.0+0x23/0x27
Jun 21 09:19:22 kernel: ? __hrtimer_run_queues+0xdd/0x10b
Jun 21 09:19:22 kernel: ? ktime_get+0x44/0x95
Jun 21 09:19:22 kernel: __do_softirq+0xc9/0x1d7
Jun 21 09:19:22 kernel: irq_exit+0x5e/0x9d
Jun 21 09:19:22 kernel: smp_apic_timer_interrupt+0x80/0x93
Jun 21 09:19:22 kernel: apic_timer_interrupt+0xf/0x20
Jun 21 09:19:22 kernel: 
Jun 21 09:19:22 kernel: RIP: 0010:prepend_path+0xb1/0x205
Jun 21 09:19:22 kernel: Code: 44 24 14 41 89 c2 eb 13 49 39 c6 74 4c 4d 8b 5e 18 4c 8d 68 20 49 89 c6 4c 89 db 48 8b 44 24 08 48 3b 58 08 74 7b 49 8b 55 00 <48> 39 da 74 09 4c 8b 5b 18 49 39 db 75 45 48 39 da 49 8b 46 10 74
Jun 21 09:19:22 kernel: RSP: 0018:ffffc9000d703cf8 EFLAGS: 00000283 ORIG_RAX: ffffffffffffff13
Jun 21 09:19:22 kernel: RAX: ffff88822463f088 RBX: ffff8887a0fb5140 RCX: ffffc9000d703da8
Jun 21 09:19:22 kernel: RDX: ffff8887a0fb5140 RSI: ffff88822463f088 RDI: ffffc9000d703da8
Jun 21 09:19:22 kernel: RBP: ffffc9000d703d68 R08: 0000000000000000 R09: ffffc9000d703da8
Jun 21 09:19:22 kernel: R10: 000000000008b216 R11: ffff8887a0fb5140 R12: 000000000053cd36
Jun 21 09:19:22 kernel: R13: ffff8881726b4920 R14: ffff8881726b4900 R15: ffffc9000d703d64
Jun 21 09:19:22 kernel: ? __dentry_path.part.0+0xa7/0x115
Jun 21 09:19:22 kernel: __d_path+0x59/0x86
Jun 21 09:19:22 kernel: seq_path_root+0x40/0x95
Jun 21 09:19:22 kernel: show_mountinfo+0xc5/0x260
Jun 21 09:19:22 kernel: seq_read+0x231/0x313
Jun 21 09:19:22 kernel: __vfs_read+0x32/0x132
Jun 21 09:19:22 kernel: ? __switch_to_asm+0x41/0x70
Jun 21 09:19:22 kernel: vfs_read+0xa4/0x124
Jun 21 09:19:22 kernel: ksys_read+0x60/0xb2
Jun 21 09:19:22 kernel: do_syscall_64+0x57/0xf2
Jun 21 09:19:22 kernel: entry_SYSCALL_64_after_hwframe+0x44/0xa9
Jun 21 09:19:22 kernel: RIP: 0033:0x4a41c0
Jun 21 09:19:22 kernel: Code: 8b 7c 24 10 48 8b 74 24 18 48 8b 54 24 20 49 c7 c2 00 00 00 00 49 c7 c0 00 00 00 00 49 c7 c1 00 00 00 00 48 8b 44 24 08 0f 05 <48> 3d 01 f0 ff ff 76 20 48 c7 44 24 28 ff ff ff ff 48 c7 44 24 30
Jun 21 09:19:22 kernel: RSP: 002b:000000c00005db10 EFLAGS: 00000202 ORIG_RAX: 0000000000000000
Jun 21 09:19:22 kernel: RAX: ffffffffffffffda RBX: 000000c00001c000 RCX: 00000000004a41c0
Jun 21 09:19:22 kernel: RDX: 0000000000001000 RSI: 000000c0000e3000 RDI: 0000000000000004
Jun 21 09:19:22 kernel: RBP: 000000c00005db60 R08: 0000000000000000 R09: 0000000000000000
Jun 21 09:19:22 kernel: R10: 0000000000000000 R11: 0000000000000202 R12: 0000000000000001
Jun 21 09:19:22 kernel: R13: 0000000000000075 R14: 0000000000895f8a R15: 0000000000000038
Jun 21 09:19:22 kernel: ---[ end trace ad1ca502756d72cd ]---
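
The key line in a trace like this is the "NETDEV WATCHDOG" one, which the kernel logs when a NIC transmit queue has not completed within the watchdog timeout. As a rough sketch (assuming the line formats shown above; the function name `watchdog_timeouts` is made up for illustration), the following Python pulls the interface, driver and queue out of each such warning, plus the `Comm:` process name reported alongside it. Note that `Comm:` is only the process that happened to be running on that CPU when the timer fired, so it is not necessarily the culprit.

```python
import re

# "NETDEV WATCHDOG: eth0 (igb): transmit queue 0 timed out"
WATCHDOG_RE = re.compile(
    r"NETDEV WATCHDOG: (?P<iface>\S+) \((?P<driver>[^)]+)\): "
    r"transmit queue (?P<queue>\d+) timed out"
)
# "CPU: 11 PID: 28598 Comm: runc Tainted: ..."
COMM_RE = re.compile(r"Comm: (?P<comm>\S+)")


def watchdog_timeouts(lines):
    """Return one dict per watchdog warning found in the given log lines."""
    results = []
    current = None  # the warning whose trace we are currently inside
    for line in lines:
        m = WATCHDOG_RE.search(line)
        if m:
            current = {
                "iface": m["iface"],
                "driver": m["driver"],
                "queue": int(m["queue"]),
                "comm": None,
            }
            results.append(current)
            continue
        if current is not None and current["comm"] is None:
            c = COMM_RE.search(line)
            if c:
                current["comm"] = c["comm"]
        if "end trace" in line:
            current = None  # trace finished; stop attributing lines to it
    return results
```

Running this over a full syslog gives a quick list of which interface and driver the timeouts belong to, which is handy when more than one NIC is installed.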

 

The NIC it refers to is the onboard NIC. The lspci output is here:

 

07:00.0 Ethernet controller: Intel Corporation I211 Gigabit Network Connection (rev 03)
        Subsystem: ASUSTeK Computer Inc. I211 Gigabit Network Connection
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0, Cache Line Size: 64 bytes
        Interrupt: pin A routed to IRQ 38
        Region 0: Memory at fc500000 (32-bit, non-prefetchable) [size=128K]
        Region 2: I/O ports at b000 [size=32]
        Region 3: Memory at fc520000 (32-bit, non-prefetchable) [size=16K]
        Capabilities: [40] Power Management version 3
                Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
                Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=1 PME-
        Capabilities: [50] MSI: Enable- Count=1/1 Maskable+ 64bit+
                Address: 0000000000000000  Data: 0000
                Masking: 00000000  Pending: 00000000
        Capabilities: [70] MSI-X: Enable+ Count=5 Masked-
                Vector table: BAR=3 offset=00000000
                PBA: BAR=3 offset=00002000
        Capabilities: [a0] Express (v2) Endpoint, MSI 00
                DevCap: MaxPayload 512 bytes, PhantFunc 0, Latency L0s <512ns, L1 <64us
                        ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ SlotPowerLimit 0.000W
                DevCtl: CorrErr+ NonFatalErr+ FatalErr+ UnsupReq+
                        RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+ FLReset-
                        MaxPayload 256 bytes, MaxReadReq 512 bytes
                DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr+ TransPend-
                LnkCap: Port #7, Speed 2.5GT/s, Width x1, ASPM L0s L1, Exit Latency L0s <2us, L1 <16us
                        ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
                LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk+
                        ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
                LnkSta: Speed 2.5GT/s (ok), Width x1 (ok)
                        TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
                DevCap2: Completion Timeout: Range ABCD, TimeoutDis+, LTR-, OBFF Not Supported
                         AtomicOpsCap: 32bit- 64bit- 128bitCAS-
                DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
                         AtomicOpsCtl: ReqEn-
                LnkCtl2: Target Link Speed: 2.5GT/s, EnterCompliance- SpeedDis-
                         Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
                         Compliance De-emphasis: -6dB
                LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete-, EqualizationPhase1-
                         EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
        Capabilities: [100 v2] Advanced Error Reporting
                UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
                CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr-
                CEMsk:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
                AERCap: First Error Pointer: 00, ECRCGenCap+ ECRCGenEn- ECRCChkCap+ ECRCChkEn-
                        MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
                HeaderLog: 00000000 00000000 00000000 00000000
        Capabilities: [140 v1] Device Serial Number 4c-ed-fb-ff-ff-7a-07-47
        Capabilities: [1a0 v1] Transaction Processing Hints
                Device specific mode supported
                Steering table in TPH capability structure
        Kernel driver in use: igb
        Kernel modules: igb

 

If I stop all dockers from the CLI (I've had to write a script and run it via screen, as I keep getting disconnected), everything goes back to normal, and I can start them again fine without rebooting the system. At least, until the next time it dies.
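
For reference, a stop-everything script along these lines could look roughly like the sketch below. This is not the actual script used here, just a minimal Python illustration built on the standard `docker ps -q` and `docker stop -t` commands; the `runner` parameter and the 60 second timeout are assumptions added so the logic can be exercised without Docker present.

```python
import subprocess


def run(cmd):
    """Run a command and return its stdout as text; raises on non-zero exit."""
    return subprocess.run(
        cmd, check=True, capture_output=True, text=True
    ).stdout


def stop_all_containers(runner=run, timeout_s=60):
    """Stop every running container one at a time; return the IDs handled."""
    # "docker ps -q" prints only the IDs of running containers, one per line.
    ids = runner(["docker", "ps", "-q"]).split()
    for cid in ids:
        # -t is how many seconds Docker waits before killing the container.
        runner(["docker", "stop", "-t", str(timeout_s), cid])
    return ids
```

Calling `stop_all_containers()` from inside a screen session means the loop keeps going even if the SSH connection drops partway through.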

 

 

If anyone might be able to point me in the right direction as to what I've done wrong (and I'm sure it's me, as I made a heap of changes), I'd really appreciate it. If I can provide more info, or clarify something, just say the word.

 

 

Edited by kharntiitar

Thanks @johnnie.black I note two things from that: 

 

First, the maximum supported memory speed for 2nd gen Ryzen is apparently not 3200, despite what the motherboard says (and what I've configured). I'll lower this and see how that goes.

 

Second, I also note from the link in that thread that I shouldn't be turning off C-states, and should instead leave them on auto. I'll do that as well.

 

Will let you know how I get on tomorrow :)
