TheSnotRocket Posted October 19, 2020
Hello folks. I've only been an Unraid user for a short time. The initial setup was rock solid, then things broke; I'm not sure what, why, or how. My config seems to have taken a turn for the worse, as I'm seeing system lockups every 2-3 days. The UI becomes unresponsive, and the system will eventually tank to the point where my only option is a hard power-off. I'm currently at that point again, where it looks like I'll have to hit the power button. I can't grab logs (I've been trying for an hour); the best I can do right now is a pastebin of my log screen: https://pastebin.com/PfmnEMZ7 Any insight would be fantastic.
TheSnotRocket Posted October 19, 2020 (author)
Screenshot of htop while all this is going on. I've killed off several Docker containers, since it seems one will peg the system and take 20-40 minutes to docker kill; then another container pegs the system, same story.
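A hedged sketch of the container hunt described above: assuming the standard Docker CLI, `docker stats` can rank containers by CPU so the one pegging the box stands out before you stop or kill it. The container names in the sample input are made up for illustration.

```shell
#!/bin/sh
# Sketch only (assumes the standard Docker CLI is available).

# Reads "name cpu%" pairs, as produced by:
#   docker stats --no-stream --format '{{.Name}} {{.CPUPerc}}'
# and prints the name of the busiest container.
busiest() {
  sort -k2 -gr | head -n 1 | cut -d' ' -f1
}

# Sample input in the shape docker stats would emit (names are illustrative):
printf '%s\n' 'binhex-plex 12.3%' 'binhex-sabnzbd 397.5%' 'emby 4.0%' | busiest
# prints: binhex-sabnzbd
# Then try a graceful `docker stop -t 60 <name>` before `docker kill <name>`.
```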
JorgeB Posted October 20, 2020
Start in safe mode and stop all dockers/VMs. Run it like that for a couple of days; if there are no issues, start turning the other services back on one by one.
TheSnotRocket Posted October 20, 2020 (author)
In the middle of my 36+ hour rebuild from the hard shutdown:

Oct 20 07:29:35 SAN kernel: WARNING: CPU: 16 PID: 11065 at lib/vsprintf.c:2231 vsnprintf+0x30/0x4e8
Oct 20 07:29:35 SAN kernel: Modules linked in: macvlan nvidia_uvm(O) xt_CHECKSUM ipt_REJECT ip6table_mangle ip6table_nat nf_nat_ipv6 iptable_mangle ip6table_filter ip6_tables vhost_net tun vhost tap veth xt_nat ipt_MASQUERADE iptable_filter iptable_nat nf_nat_ipv4 nf_nat ip_tables xfs md_mod ipmi_devintf bonding igb(O) nvidia_drm(PO) nvidia_modeset(PO) nvidia(PO) sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm drm_kms_helper crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel pcbc aesni_intel aes_x86_64 crypto_simd cryptd drm glue_helper mpt3sas intel_cstate intel_uncore intel_rapl_perf agpgart nvme syscopyarea input_leds sysfillrect nvme_core led_class ipmi_ssif sysimgblt wmi i2c_i801 i2c_core ahci raid_class joydev pcc_cpufreq fb_sys_fops libahci scsi_transport_sas button acpi_pad
Oct 20 07:29:35 SAN kernel: acpi_power_meter ipmi_si [last unloaded: igb]
Oct 20 07:29:35 SAN kernel: CPU: 16 PID: 11065 Comm: lsof Tainted: P O 4.19.107-Unraid #1
Oct 20 07:29:35 SAN kernel: Hardware name: Supermicro Super Server/X10SRH-CF, BIOS 3.2 11/22/2019
Oct 20 07:29:35 SAN kernel: RIP: 0010:vsnprintf+0x30/0x4e8
Oct 20 07:29:35 SAN kernel: Code: 41 54 55 53 48 83 ec 18 65 48 8b 04 25 28 00 00 00 48 89 44 24 10 31 c0 48 81 fe ff ff ff 7f 48 c7 44 24 08 00 00 00 00 76 07 <0f> 0b e9 8d 04 00 00 48 89 fd 49 89 fc 49 89 f6 48 01 f5 48 89 cb
Oct 20 07:29:35 SAN kernel: RSP: 0018:ffffc9002b4d3cf0 EFLAGS: 00010212
Oct 20 07:29:35 SAN kernel: RAX: 0000000000000000 RBX: ffff888775f2be80 RCX: ffffc9002b4d3d50
Oct 20 07:29:35 SAN kernel: RDX: ffffffff81d50697 RSI: 0000000080000000 RDI: ffffc90033403000
Oct 20 07:29:35 SAN kernel: RBP: ffffc9002b4d3da0 R08: 000000000000000a R09: ffff888000000000
Oct 20 07:29:35 SAN kernel: R10: ffff88a07fffadc0 R11: ffffea001cba39c8 R12: ffff889fbaf58580
Oct 20 07:29:35 SAN kernel: R13: ffff889fbaf58600 R14: 0000000000000000 R15: 0000000000000000
Oct 20 07:29:35 SAN kernel: FS: 000015409c1a2540(0000) GS:ffff889fffa00000(0000) knlGS:0000000000000000
Oct 20 07:29:35 SAN kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Oct 20 07:29:35 SAN kernel: CR2: 000014ba36d896a8 CR3: 0000001051794004 CR4: 00000000001626e0
Oct 20 07:29:35 SAN kernel: Call Trace:
Oct 20 07:29:35 SAN kernel: seq_vprintf+0x2b/0x3d
Oct 20 07:29:35 SAN kernel: seq_printf+0x4e/0x65
Oct 20 07:29:35 SAN kernel: seq_show+0xf5/0x13e
Oct 20 07:29:35 SAN kernel: seq_read+0x170/0x339
Oct 20 07:29:35 SAN kernel: __vfs_read+0x2e/0x134
Oct 20 07:29:35 SAN kernel: ? __se_sys_newfstat+0x3c/0x5f
Oct 20 07:29:35 SAN kernel: vfs_read+0xa1/0x122
Oct 20 07:29:35 SAN kernel: ksys_read+0x60/0xb4
Oct 20 07:29:35 SAN kernel: do_syscall_64+0x57/0xf2
Oct 20 07:29:35 SAN kernel: entry_SYSCALL_64_after_hwframe+0x44/0xa9
Oct 20 07:29:35 SAN kernel: RIP: 0033:0x15409c0c282e
Oct 20 07:29:35 SAN kernel: Code: c0 e9 f6 fe ff ff 50 48 8d 3d b6 5d 0a 00 e8 e9 fd 01 00 66 0f 1f 84 00 00 00 00 00 64 8b 04 25 18 00 00 00 85 c0 75 14 0f 05 <48> 3d 00 f0 ff ff 77 5a c3 66 0f 1f 84 00 00 00 00 00 48 83 ec 28
Oct 20 07:29:35 SAN kernel: RSP: 002b:00007ffdb8b16c18 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
Oct 20 07:29:35 SAN kernel: RAX: ffffffffffffffda RBX: 00000000004292c0 RCX: 000015409c0c282e
Oct 20 07:29:35 SAN kernel: RDX: 0000000000000400 RSI: 00000000004527d0 RDI: 0000000000000005
Oct 20 07:29:35 SAN kernel: RBP: 000015409c198420 R08: 0000000000000005 R09: 0000000000000000
Oct 20 07:29:35 SAN kernel: R10: 0000000000000000 R11: 0000000000000246 R12: 00000000004292c0
Oct 20 07:29:35 SAN kernel: R13: 000015409c197820 R14: 0000000000000d68 R15: 0000000000000d68
Oct 20 07:29:35 SAN kernel: ---[ end trace e6328e640efba0a5 ]---
Oct 20 07:30:02 SAN login[15329]: ROOT LOGIN on '/dev/pts/1'
Oct 20 07:30:59 SAN kernel: rcu: INFO: rcu_sched self-detected stall on CPU
Oct 20 07:30:59 SAN kernel: rcu: 6-....: (59999 ticks this GP) idle=1f2/1/0x4000000000000002 softirq=5976769/5977668 fqs=14467
Oct 20 07:30:59 SAN kernel: rcu: (t=60000 jiffies g=8040365 q=509963)
Oct 20 07:30:59 SAN kernel: NMI backtrace for cpu 6
Oct 20 07:30:59 SAN kernel: CPU: 6 PID: 11065 Comm: lsof Tainted: P W O 4.19.107-Unraid #1
Oct 20 07:30:59 SAN kernel: Hardware name: Supermicro Super Server/X10SRH-CF, BIOS 3.2 11/22/2019
Oct 20 07:30:59 SAN kernel: Call Trace:
Oct 20 07:30:59 SAN kernel: <IRQ>
Oct 20 07:30:59 SAN kernel: dump_stack+0x67/0x83
Oct 20 07:30:59 SAN kernel: nmi_cpu_backtrace+0x71/0x83
Oct 20 07:30:59 SAN kernel: ? lapic_can_unplug_cpu+0x8e/0x8e
Oct 20 07:30:59 SAN kernel: nmi_trigger_cpumask_backtrace+0x57/0xd7
Oct 20 07:30:59 SAN kernel: rcu_dump_cpu_stacks+0x91/0xbb
Oct 20 07:30:59 SAN kernel: rcu_check_callbacks+0x28f/0x58e
Oct 20 07:30:59 SAN kernel: ? tick_sched_handle.isra.5+0x2f/0x2f
Oct 20 07:30:59 SAN kernel: update_process_times+0x23/0x45
Oct 20 07:30:59 SAN kernel: tick_sched_timer+0x36/0x64
Oct 20 07:30:59 SAN kernel: __hrtimer_run_queues+0xb1/0x105
Oct 20 07:30:59 SAN kernel: hrtimer_interrupt+0xf4/0x20d
Oct 20 07:30:59 SAN kernel: smp_apic_timer_interrupt+0x79/0x91
Oct 20 07:30:59 SAN kernel: apic_timer_interrupt+0xf/0x20
Oct 20 07:30:59 SAN kernel: </IRQ>
Oct 20 07:30:59 SAN kernel: RIP: 0010:vsnprintf+0x32/0x4e8
Oct 20 07:30:59 SAN kernel: Code: 55 53 48 83 ec 18 65 48 8b 04 25 28 00 00 00 48 89 44 24 10 31 c0 48 81 fe ff ff ff 7f 48 c7 44 24 08 00 00 00 00 76 07 0f 0b <e9> 8d 04 00 00 48 89 fd 49 89 fc 49 89 f6 48 01 f5 48 89 cb 73 0a
Oct 20 07:30:59 SAN kernel: RSP: 0018:ffffc9002b4d3c00 EFLAGS: 00010212 ORIG_RAX: ffffffffffffff13
Oct 20 07:30:59 SAN kernel: RAX: 0000000000000000 RBX: ffff888775f2be80 RCX: ffffc9002b4d3c60
Oct 20 07:30:59 SAN kernel: RDX: ffffffff81d3bdf8 RSI: 00000000fff29951 RDI: ffffc900334d96af
Oct 20 07:30:59 SAN kernel: RBP: ffffc9002b4d3cb0 R08: 0000000000000081 R09: ffff8883921cf838
Oct 20 07:30:59 SAN kernel: R10: ffffc9002b4d3cc4 R11: ffffea0030932cc8 R12: ffffffff81d3bdf8
Oct 20 07:30:59 SAN kernel: R13: ffff8883921cf838 R14: ffff889fbb029188 R15: 0000000000000000
Oct 20 07:30:59 SAN kernel: ? invalid_op+0x14/0x20
Oct 20 07:30:59 SAN kernel: seq_vprintf+0x2b/0x3d
Oct 20 07:30:59 SAN kernel: seq_printf+0x4e/0x65
Oct 20 07:30:59 SAN kernel: ? vsnprintf+0x32/0x4e8
Oct 20 07:30:59 SAN kernel: show_mark_fhandle+0xba/0xe0
Oct 20 07:30:59 SAN kernel: ? seq_vprintf+0x2b/0x3d
Oct 20 07:30:59 SAN kernel: ? seq_printf+0x4e/0x65
Oct 20 07:30:59 SAN kernel: inotify_show_fdinfo+0x8c/0xcb
Oct 20 07:30:59 SAN kernel: seq_show+0x128/0x13e
Oct 20 07:30:59 SAN kernel: seq_read+0x170/0x339
Oct 20 07:30:59 SAN kernel: __vfs_read+0x2e/0x134
Oct 20 07:30:59 SAN kernel: ? __se_sys_newfstat+0x3c/0x5f
Oct 20 07:30:59 SAN kernel: vfs_read+0xa1/0x122
Oct 20 07:30:59 SAN kernel: ksys_read+0x60/0xb4
Oct 20 07:30:59 SAN kernel: do_syscall_64+0x57/0xf2
Oct 20 07:30:59 SAN kernel: entry_SYSCALL_64_after_hwframe+0x44/0xa9
Oct 20 07:30:59 SAN kernel: RIP: 0033:0x15409c0c282e
Oct 20 07:30:59 SAN kernel: Code: c0 e9 f6 fe ff ff 50 48 8d 3d b6 5d 0a 00 e8 e9 fd 01 00 66 0f 1f 84 00 00 00 00 00 64 8b 04 25 18 00 00 00 85 c0 75 14 0f 05 <48> 3d 00 f0 ff ff 77 5a c3 66 0f 1f 84 00 00 00 00 00 48 83 ec 28
Oct 20 07:30:59 SAN kernel: RSP: 002b:00007ffdb8b16c18 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
Oct 20 07:30:59 SAN kernel: RAX: ffffffffffffffda RBX: 00000000004292c0 RCX: 000015409c0c282e
Oct 20 07:30:59 SAN kernel: RDX: 0000000000000400 RSI: 00000000004527d0 RDI: 0000000000000005
Oct 20 07:30:59 SAN kernel: RBP: 000015409c198420 R08: 0000000000000005 R09: 0000000000000000
Oct 20 07:30:59 SAN kernel: R10: 0000000000000000 R11: 0000000000000246 R12: 00000000004292c0
Oct 20 07:30:59 SAN kernel: R13: 000015409c197820 R14: 0000000000000d68 R15: 0000000000000d68
Squid Posted October 20, 2020
14 hours ago, TheSnotRocket said: "I can't grab logs (been trying for an hour)"
How were you trying to grab them? From the webGUI, if it doesn't complete within 120 seconds, it never will. From a command prompt, run diagnostics and the zip file will be saved into the logs folder on the flash drive.
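To make the command-prompt route concrete, here's a minimal sketch. It assumes the zip lands in /boot/logs on the flash drive as described above; the helper name is made up for illustration.

```shell
#!/bin/sh
# Sketch: after running `diagnostics` from the console, the newest zip in
# the flash drive's logs folder is the one to post. Helper name is mine.
newest_diag() {
  # $1 = logs directory (on Unraid this would be /boot/logs)
  ls -t "$1"/*-diagnostics-*.zip 2>/dev/null | head -n 1
}

# Typical use on the server itself (not run here):
#   diagnostics && newest_diag /boot/logs
```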
TheSnotRocket Posted October 20, 2020 (author)
23 minutes ago, Squid said: "How were you trying to grab them? ..."
What I posted to pastebin was from the webGUI; that was the best I could get before I had to hard-down. I was unable to run the diagnostics command while in the hung state. Currently running in safe mode, with my parity check re-re-re-re-started. ETA: 1 day, 18 hours.
civic95man Posted October 20, 2020
Still, diagnostics right now could help shed some light on why it hung originally. At the least, it would give us an idea of your hardware (AMD needs some specific workarounds, for example).
TheSnotRocket Posted October 20, 2020 (author)
So, maybe this will help. This is one of the latest diagnostic logs I have. It's NOT from the pastebin in the OP of this post, but it still contains:

CPU: 11 PID: 16390 Comm: php-fpm Tainted: P W O 4.19.107-Unraid #1

When my server hangs like this, it's unable to create or collect log files. I've left the diagnostics command running from my BMC window for an hour or more; nothing happens.
san-diagnostics-20201005-1635.zip
TheSnotRocket Posted October 20, 2020 (author)
Here are my current diagnostics while running in safe mode: san-diagnostics-20201020-0943.zip
TheSnotRocket Posted October 20, 2020 (author)
Config-ish:
X10SRH-CF
Xeon E5-2660 v3
128GB RAM
24+2 disks, 215/2tb total
NVMe (unassigned) for a VM

What else would you like to know, or what additional information can I provide?
civic95man Posted October 20, 2020
I just looked at your older diagnostics from the 5th of October. Nothing stands out in the configuration, but your syslog is spammed with repeated drive connection/reset messages. I'm not sure whether that's an actual connection issue or your HBA card; you should look into upgrading its firmware. Your syslog is seriously filled with those messages. You also have a lot of kernel panics towards the end, which seemingly result in an OOM condition (odd, since you have so much RAM). Maybe try booting in safe mode and see if you're stable, then slowly enable dockers/VMs until you find the cause.
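For sizing up that kind of syslog spam, a rough sketch: strip the timestamps and count repeats so whatever is flooding the log rises to the top. It assumes the classic "Mon DD HH:MM:SS host message" syslog layout; the sample messages below are illustrative, not the exact strings from this server.

```shell
#!/bin/sh
# Rough sketch: collapse syslog timestamps and count duplicate messages,
# so the message flooding the log shows up first. Assumes the standard
# "Mon DD HH:MM:SS host message" layout.
spam_summary() {
  sed -E 's/^[A-Z][a-z]{2} +[0-9]+ [0-9:]{8} [^ ]+ //' | sort | uniq -c | sort -rn | head
}

# Illustrative input (made-up reset messages):
printf '%s\n' \
  'Oct 20 07:29:35 SAN kernel: sdb: attempting link reset' \
  'Oct 20 07:30:01 SAN kernel: sdb: attempting link reset' \
  'Oct 20 07:31:11 SAN kernel: sdb: attempting link reset' | spam_summary
# prints a count of 3 next to the repeated reset message
```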
TheSnotRocket Posted October 20, 2020 (author)
I think the connection resets etc. were due to me trying to spin down my SAS drives. I've since corrected that, but I'll keep an eye on it. I'm in safe mode as I type this, rebuilding the array; I'll feel more comfortable messing around once it's done. Right now everything (Docker containers and VMs) is off.
civic95man Posted October 20, 2020
26 minutes ago, TheSnotRocket said: "So, maybe this will help. This is one of the latest diagnostic logs I have ..."
The "Tainted" flag just means you're running an out-of-tree (OOT) driver that isn't officially supported by the kernel. The usual culprits are the Intel igb and Nvidia drivers. You may need to set up a syslog server so you can see what happens when you lose the system.
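One way to act on the syslog-server suggestion (a sketch, assuming an rsyslog-style receiver on another machine; recent Unraid releases also expose a built-in syslog server under Settings, though I'm not certain of the exact menu wording): add a forwarding rule so log lines leave the box before a lockup eats them. The IP address and file path are placeholders.

```shell
#!/bin/sh
# Sketch, assuming an rsyslog-style receiver on another machine. Appends a
# forwarding rule so syslog lines are shipped off the server in real time.
# IP address and config path are placeholders.
add_remote_syslog() {
  # $1 = rsyslog config file, $2 = remote syslog host
  printf '*.* @%s:514\n' "$2" >> "$1"   # single '@' = UDP; '@@' would be TCP
}

# e.g. add_remote_syslog /etc/rsyslog.d/remote.conf 192.168.1.50
```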
TheSnotRocket Posted October 20, 2020 (author)
Funny, I had both igb and Nvidia installed. I pulled the igb card because I wasn't actually using it, and the logs reported the card as unavailable anyway. The Nvidia card is still in the box for Plex and Emby decoding. What's interesting is that I don't see that tainted message until the system becomes super unstable; after a fresh reboot, and during the parity check, I don't see those issues.
TheSnotRocket Posted October 21, 2020 (author)
Sooo... not sure how to track this down. Fresh reboot; started a VM I have to use and three dockers: binhex-plex, EmbyServerBeta, and binhex-sabnzbd. The system was rock solid under heavy VM use for 15+ hours, with no errors in the logs at all. Then I pulled an nzb and got almost instant instability: errors, kernel taint messages, etc. Where should I look? It feels like the issues start when I hit the cache drives hard. The cache drives are SATA SSDs and are not attached to my backplane. I'm now headed toward a hard-down situation again; I ran diagnostics from the shell and let it run for an hour. No logs.
TheSnotRocket Posted October 21, 2020 (author)
I'm 95% sure my issue is something binhex-sabnzbd related.
TheSnotRocket Posted October 21, 2020 (author)
Changed from binhex-sabnzbd (binhex/arch-sabnzbd) to sabnzbd (linuxserver/sabnzbd) and have pulled down 200+ GB at 110 MB/s so far with ZERO issues. I would have tanked the server by now on binhex-sabnzbd.