[6.6.6] Random Hangs, Memory or Hardware Issues?


apazzy


Hi all,

 

I've been running Unraid on this same system since 2014 without any issues for the most part.

 

I upgraded from 6.6.5 to 6.7-rc2 this past Saturday and almost immediately started seeing system instability. Reverting to 6.6.5 did not resolve it, nor did upgrading to 6.6.6 stable.

 

I experienced something similar about two months ago and thought I had tracked it down to 1 of 2 memory sticks. I ended up taking them both out, as the issue was only present when both sticks were in. The remaining 6 RAM modules have been run through 7-pass memtests and came out fine, both individually and in groups in various slot configurations.

 

I haven't run any memtests since, and I'm starting to think that neither of the modules I removed was at fault and that it's a motherboard/CPU issue instead.

 

Below is a snippet of the first syslog tail I was able to capture. The hang seems to have happened around the time the mover was running. Attached are some diagnostics and syslog tails from the past few hangs.

 

Any assistance would be greatly appreciated.

 

FWIW, I hoped to try upgrading my hardware to Zen 2 in a few months... If there's even a band-aid someone could suggest to mitigate these hangs with what I've got, I would be extremely grateful.

 

Thanks!

 

Specs:

ASUS X99 Pro USB 3.1

Intel i7-5820k

6x 8GB DDR4 2400MHz

 

Feb  6 00:56:03 ghost kernel: BUG: unable to handle kernel paging request at ffffffff81345d07
Feb  6 00:56:03 ghost kernel: PGD 1e0e067 P4D 1e0e067 PUD 1e0f063 PMD 12000e1 
Feb  6 00:56:03 ghost kernel: Oops: 0003 [#1] SMP PTI
Feb  6 00:56:03 ghost kernel: CPU: 2 PID: 20054 Comm: umount Not tainted 4.18.20-unRAID #1
Feb  6 00:56:03 ghost kernel: Hardware name: ASUS All Series/X99-PRO/USB 3.1, BIOS 3801 08/10/2017
Feb  6 00:56:03 ghost kernel: RIP: 0010:native_queued_spin_lock_slowpath+0xf7/0x16d
Feb  6 00:56:03 ghost kernel: Code: 0f b1 0f 85 c0 75 e1 eb 7b 31 c9 eb 36 c1 e9 12 83 e0 03 ff c9 48 c1 e0 04 48 63 c9 48 05 c0 17 02 00 48 03 04 cd 00 27 da 81 <48> 89 10 8b 42 08 85 c0 75 04 f3 90 eb f5 48 8b 0a 48 85 c9 74 c9 
Feb  6 00:56:03 ghost kernel: RSP: 0018:ffffc90007bfbe00 EFLAGS: 00010282
Feb  6 00:56:03 ghost kernel: RAX: ffffffff81345d07 RBX: ffff880a99f15f80 RCX: 00000000000003ff
Feb  6 00:56:03 ghost kernel: RDX: ffff880c0f2a17c0 RSI: 00000000000c0000 RDI: ffff880a98cfdb80
Feb  6 00:56:03 ghost kernel: RBP: ffff880a99f15f58 R08: 0000000000000000 R09: ffffffff811b6c00
Feb  6 00:56:03 ghost kernel: R10: ffffea002a9631c0 R11: ffff880c0f2e0c01 R12: ffff880a98cfdb00
Feb  6 00:56:03 ghost kernel: R13: ffff880a99f15f00 R14: ffffc90007bfbe58 R15: ffff880a98cfdb80
Feb  6 00:56:03 ghost kernel: FS:  0000148642bcd780(0000) GS:ffff880c0f280000(0000) knlGS:0000000000000000
Feb  6 00:56:03 ghost kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Feb  6 00:56:03 ghost kernel: CR2: ffffffff81345d07 CR3: 0000000c0379e003 CR4: 00000000001606e0
Feb  6 00:56:03 ghost kernel: Call Trace:
Feb  6 00:56:03 ghost kernel: _raw_spin_lock+0x16/0x19
Feb  6 00:56:03 ghost kernel: shrink_dentry_list+0x75/0x185
Feb  6 00:56:03 ghost kernel: shrink_dcache_parent+0x58/0x82
Feb  6 00:56:03 ghost kernel: do_one_tree+0x9/0x2c
Feb  6 00:56:03 ghost kernel: shrink_dcache_for_umount+0x31/0x65
Feb  6 00:56:03 ghost kernel: generic_shutdown_super+0x19/0x10c
Feb  6 00:56:03 ghost kernel: kill_anon_super+0x9/0xe
Feb  6 00:56:03 ghost kernel: deactivate_locked_super+0x2f/0x61
Feb  6 00:56:03 ghost kernel: cleanup_mnt+0x40/0x5c
Feb  6 00:56:03 ghost kernel: task_work_run+0x77/0x8b
Feb  6 00:56:03 ghost kernel: exit_to_usermode_loop+0x46/0x96
Feb  6 00:56:03 ghost kernel: do_syscall_64+0xdf/0xe6
Feb  6 00:56:03 ghost kernel: entry_SYSCALL_64_after_hwframe+0x44/0xa9
Feb  6 00:56:03 ghost kernel: RIP: 0033:0x148642d09897
Feb  6 00:56:03 ghost kernel: Code: 66 0c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 0f 1f 44 00 00 31 f6 e9 09 00 00 00 66 0f 1f 84 00 00 00 00 00 b8 a6 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d c9 65 0c 00 f7 d8 64 89 01 48 
Feb  6 00:56:03 ghost kernel: RSP: 002b:00007ffc0d156418 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
Feb  6 00:56:03 ghost kernel: RAX: 0000000000000000 RBX: 00000000006072b0 RCX: 0000148642d09897
Feb  6 00:56:03 ghost kernel: RDX: 0000000000000001 RSI: 0000000000000000 RDI: 00000000006078e0
Feb  6 00:56:03 ghost kernel: RBP: 00000000006078e0 R08: 0000000000607010 R09: 0000000000000000
Feb  6 00:56:03 ghost kernel: R10: 0000000000000008 R11: 0000000000000246 R12: 0000000000000000
Feb  6 00:56:03 ghost kernel: R13: 0000148643487ed0 R14: 0000000000607490 R15: 0000000000000000
Feb  6 00:56:03 ghost kernel: Modules linked in: veth xt_nat macvlan xt_CHECKSUM iptable_mangle ipt_REJECT ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter ipt_MASQUERADE iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat ip_tables xfs md_mod tun bonding x86_pkg_temp_thermal intel_powerclamp coretemp hid_logitech_hidpp crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel pcbc aesni_intel aes_x86_64 crypto_simd cryptd glue_helper intel_cstate intel_uncore intel_rapl_perf i2c_i801 i2c_core e1000e wmi_bmof mxm_wmi ahci hid_logitech_dj libahci wmi pcc_cpufreq button [last unloaded: vhost]
Feb  6 00:56:03 ghost kernel: CR2: ffffffff81345d07
Feb  6 00:56:03 ghost kernel: ---[ end trace aadfac1be972fcc3 ]---
Feb  6 00:56:03 ghost kernel: RIP: 0010:native_queued_spin_lock_slowpath+0xf7/0x16d
Feb  6 00:56:03 ghost kernel: Code: 0f b1 0f 85 c0 75 e1 eb 7b 31 c9 eb 36 c1 e9 12 83 e0 03 ff c9 48 c1 e0 04 48 63 c9 48 05 c0 17 02 00 48 03 04 cd 00 27 da 81 <48> 89 10 8b 42 08 85 c0 75 04 f3 90 eb f5 48 8b 0a 48 85 c9 74 c9 
Feb  6 00:56:03 ghost kernel: RSP: 0018:ffffc90007bfbe00 EFLAGS: 00010282
Feb  6 00:56:03 ghost kernel: RAX: ffffffff81345d07 RBX: ffff880a99f15f80 RCX: 00000000000003ff
Feb  6 00:56:03 ghost kernel: RDX: ffff880c0f2a17c0 RSI: 00000000000c0000 RDI: ffff880a98cfdb80
Feb  6 00:56:03 ghost kernel: RBP: ffff880a99f15f58 R08: 0000000000000000 R09: ffffffff811b6c00
Feb  6 00:56:03 ghost kernel: R10: ffffea002a9631c0 R11: ffff880c0f2e0c01 R12: ffff880a98cfdb00
Feb  6 00:56:03 ghost kernel: R13: ffff880a99f15f00 R14: ffffc90007bfbe58 R15: ffff880a98cfdb80
Feb  6 00:56:03 ghost kernel: FS:  0000148642bcd780(0000) GS:ffff880c0f280000(0000) knlGS:0000000000000000
Feb  6 00:56:03 ghost kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Feb  6 00:56:03 ghost kernel: CR2: ffffffff81345d07 CR3: 0000000c0379e003 CR4: 00000000001606e0

 

 

 

Edited by apazzy
Link to comment

Update: it has happened twice since.

 

Yesterday morning (2/9) I encountered a BSOD in my Windows 10 VM with the error IRQL_NOT_LESS_OR_EQUAL. I wasn't able to collect any logs for that, but I immediately enabled logging after rebooting again.

 

This time the VM hit the error PAGE_FAULT_IN_NONPAGED_AREA, which remained on my screen until I rebooted this morning at ~12:30am CST. The crash happened some time on 2/9 between 7pm and 9pm CST. Services appeared to remain running until 2/10 ~3am CST, when I received alerts from the external uptime monitors I have reporting.

 

I'm continuing to investigate, but I haven't been able to figure out what might be causing this. In my research around these BSODs, people advised replacing the motherboard because of failing VRMs, or replacing the RAM (mine was tested two months ago, and I kept the modules that passed). Because the first hang seemed to coincide with the mover, I'm also wondering whether it's a bad SSD or some other disk issue.

 

Any assistance would be appreciated.

 

EDIT: I just saw the 6.7-RC3 release. If I encounter another crash while the mover is disabled, I might end up going that route and setting up a syslog server on my Raspberry Pi to collect logs (if I'm reading the release notes correctly). If anyone can assist and recommends otherwise, let me know.
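
For the Pi side of that idea, a minimal sketch of a UDP syslog receiver might look like the following. It assumes the server can be pointed at a remote host and sends plain UDP syslog datagrams, which is only a reading of the release notes and not something confirmed in this thread. The port number, the output file name, and the function names are all made up for illustration.

```python
import socketserver
from datetime import datetime

LOG_PATH = "unraid-syslog.log"  # hypothetical output file on the Pi
LISTEN_PORT = 5514              # hypothetical; the standard syslog port 514 needs root


def format_record(sender_ip, payload, now=None):
    """Turn one UDP syslog datagram into a single timestamped log line."""
    now = now or datetime.now()
    text = payload.decode("utf-8", errors="replace").strip()
    return f"{now.isoformat(timespec='seconds')} {sender_ip} {text}\n"


class SyslogHandler(socketserver.BaseRequestHandler):
    def handle(self):
        # For UDP servers, self.request is a (data, socket) pair.
        data, _sock = self.request
        # Reopen and close the file per datagram so each line is written out
        # promptly, even if the sender hangs right after sending it.
        with open(LOG_PATH, "a", encoding="utf-8") as log_file:
            log_file.write(format_record(self.client_address[0], data))


def serve():
    """Listen on all interfaces until interrupted."""
    with socketserver.UDPServer(("0.0.0.0", LISTEN_PORT), SyslogHandler) as server:
        server.serve_forever()
```

Calling `serve()` on the Pi starts the listener. Since the log lives on a different machine, the last lines before a hang should survive even when the server itself has to be hard reset.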

 

 

 

Edited by apazzy
Link to comment

One more crash. No diagnostics, but I saved the syslog via the terminal.

 

I'm really at a loss, and I'm afraid the only idea I have is to replace the motherboard/CPU/RAM together...

 

Feb 11 08:08:58 ghost kernel: BUG: unable to handle kernel paging request at 0000000010000000
Feb 11 08:08:58 ghost kernel: PGD 0 P4D 0 
Feb 11 08:08:58 ghost kernel: Oops: 0000 [#2] SMP PTI
Feb 11 08:08:58 ghost kernel: CPU: 2 PID: 13709 Comm: unraidd Tainted: G      D           4.18.20-unRAID #1
Feb 11 08:08:58 ghost kernel: Hardware name: ASUS All Series/X99-PRO/USB 3.1, BIOS 3801 08/10/2017
Feb 11 08:08:58 ghost kernel: RIP: 0010:handle_stripe+0x542/0x1226 [md_mod]
Feb 11 08:08:58 ghost kernel: Code: 24 40 0f 8e a1 00 00 00 49 69 c6 c8 00 00 00 4c 8b bc 03 38 01 00 00 4d 85 ff 0f 84 81 00 00 00 f6 84 03 31 01 00 00 01 74 77 <49> 8b 17 48 85 d2 48 89 94 03 38 01 00 00 74 02 0f 0b 4a 8b 54 f3 
Feb 11 08:08:58 ghost kernel: RSP: 0018:ffffc90006f43dc8 EFLAGS: 00010202
Feb 11 08:08:58 ghost kernel: RAX: 00000000000000c8 RBX: ffff880bd269d940 RCX: ffff880bd29e2c00
Feb 11 08:08:58 ghost kernel: RDX: 0000000000000000 RSI: 0000000000000008 RDI: ffff880bd269d97c
Feb 11 08:08:58 ghost kernel: RBP: 0000000000000000 R08: 0000000000000000 R09: ffffc90006f43db8
Feb 11 08:08:58 ghost kernel: R10: 0000000000000fe0 R11: ffff880bd25fb478 R12: ffff880bd269da70
Feb 11 08:08:58 ghost kernel: R13: 00000000ffffffff R14: 0000000000000001 R15: 0000000010000000
Feb 11 08:08:58 ghost kernel: FS:  0000000000000000(0000) GS:ffff880c0f280000(0000) knlGS:0000000000000000
Feb 11 08:08:58 ghost kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Feb 11 08:08:58 ghost kernel: CR2: 0000000010000000 CR3: 0000000001e0a004 CR4: 00000000001626e0
Feb 11 08:08:58 ghost kernel: Call Trace:
Feb 11 08:08:58 ghost kernel: unraidd+0xbc/0x123 [md_mod]
Feb 11 08:08:58 ghost kernel: ? md_open+0x2c/0x2c [md_mod]
Feb 11 08:08:58 ghost kernel: md_thread+0xcc/0xf1 [md_mod]
Feb 11 08:08:58 ghost kernel: ? wait_woken+0x68/0x68
Feb 11 08:08:58 ghost kernel: kthread+0x10b/0x113
Feb 11 08:08:58 ghost kernel: ? kthread_flush_work_fn+0x9/0x9
Feb 11 08:08:58 ghost kernel: ret_from_fork+0x35/0x40
Feb 11 08:08:58 ghost kernel: Modules linked in: veth macvlan dm_mod dax xt_CHECKSUM iptable_mangle ipt_REJECT ebtable_filter ebtables ip6table_filter ip6_tables vhost_net vhost tap xt_nat iptable_filter ipt_MASQUERADE iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat ip_tables xfs md_mod tun bonding x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel hid_logitech_hidpp pcbc aesni_intel aes_x86_64 crypto_simd cryptd wmi_bmof mxm_wmi glue_helper intel_cstate intel_uncore i2c_i801 intel_rapl_perf i2c_core e1000e hid_logitech_dj ahci libahci wmi pcc_cpufreq button
Feb 11 08:08:58 ghost kernel: CR2: 0000000010000000
Feb 11 08:08:58 ghost kernel: ---[ end trace 1439e4cee05b1063 ]---
Feb 11 08:08:58 ghost kernel: RIP: 0010:__d_lookup_rcu+0x5a/0x12f
Feb 11 08:08:58 ghost kernel: Code: 89 ea 41 89 ee d3 ea 48 8d 04 d0 48 8b 18 48 89 e8 48 c1 e8 20 48 89 04 24 48 83 e3 fe 48 85 db 0f 84 c4 00 00 00 4c 8d 6b f8 <44> 8b 63 fc 4c 39 7b 10 0f 85 aa 00 00 00 48 83 7b 08 00 0f 84 9f 
Feb 11 08:08:58 ghost kernel: RSP: 0018:ffffc90007fcfbc0 EFLAGS: 00010206
Feb 11 08:08:58 ghost kernel: RAX: 000000000000000e RBX: 0000000010000000 RCX: 0000000000000009
Feb 11 08:08:58 ghost kernel: RDX: 0000000000494168 RSI: ffffc90007fcfd50 RDI: ffff8806519130c0
Feb 11 08:08:58 ghost kernel: RBP: 0000000e9282d1e3 R08: ffffc90007fcfd50 R09: aba6e78d59af8a9c
Feb 11 08:08:58 ghost kernel: R10: ffffc90007fcfc2c R11: 8080808080808080 R12: ffffc90007fcfc88
Feb 11 08:08:58 ghost kernel: R13: 000000000ffffff8 R14: 000000009282d1e3 R15: ffff8806519130c0
Feb 11 08:08:58 ghost kernel: FS:  0000000000000000(0000) GS:ffff880c0f280000(0000) knlGS:0000000000000000
Feb 11 08:08:58 ghost kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Feb 11 08:08:58 ghost kernel: CR2: 0000000010000000 CR3: 0000000001e0a004 CR4: 00000000001626e0
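
To compare the oopses across several saved syslog tails, something like the following could pull the header fields (CPU, PID, process name, faulting function) out of each one. This is a rough sketch that assumes the 4.18-style line layout shown in the logs above; the function name and dictionary keys are made up for illustration.

```python
import re

# Patterns follow the oops lines pasted in this thread; other kernels may differ.
BUG_RE = re.compile(r"^(?P<ts>\w{3}\s+\d+ [\d:]{8}) \S+ kernel: BUG: (?P<msg>.+)$")
CPU_RE = re.compile(r"kernel: CPU: (?P<cpu>\d+) PID: (?P<pid>\d+) Comm: (?P<comm>\S+)")
RIP_RE = re.compile(r"kernel: RIP: 0010:(?P<sym>\S+)")  # 0010 = kernel code segment


def summarize_oopses(lines):
    """Return one dict per 'BUG:' line, filled in from the lines that follow it."""
    oopses = []
    current = None
    for line in lines:
        line = line.rstrip()
        match = BUG_RE.search(line)
        if match:
            current = {"time": match["ts"], "bug": match["msg"],
                       "cpu": None, "pid": None, "comm": None, "rip": None}
            oopses.append(current)
            continue
        if current is None:
            continue  # nothing to attach to until the first BUG: line
        match = CPU_RE.search(line)
        if match and current["cpu"] is None:
            current.update(cpu=int(match["cpu"]), pid=int(match["pid"]),
                           comm=match["comm"])
            continue
        match = RIP_RE.search(line)
        if match and current["rip"] is None:
            # Keep only the first kernel-mode RIP; the one repeated after
            # "end trace" is ignored.
            current["rip"] = match["sym"]
    return oopses
```

Feeding it the lines of a saved tail, for example `summarize_oopses(open("syslog-tail.txt"))` with a hypothetical file name, gives one summary per crash, which makes it easier to see whether the same CPU or the same code path keeps showing up.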

 

Link to comment

I have tried 6.6.x a couple of times now, first as an rc and now as stable, and I am getting random hangs. On 6.5.x I have had no problems since downgrading to keep the family happy, and I even had a 60-day uptime.

 

Ryzen 1700

32GB HyperX

ROG hero motherboard

 

These lockups seem to happen mainly overnight, when the system is not under load.

Edited by WABb
Link to comment
