TheSnotRocket Posted October 19, 2020
Hello folks. I've only been an Unraid user for a short time. The initial setup was rock solid, then things broke; I'm not sure what, why, or how. My config seems to have taken a turn for the worse, as I'm seeing system lockups every 2-3 days. The UI becomes unresponsive, and the system will eventually tank to the point where my only option is a hard power-off. I'm currently at that point again, where it looks like I'll have to hit the power button. I can't grab logs (I've been trying for an hour); the best I can do right now is a pastebin of my log screen: https://pastebin.com/PfmnEMZ7 Any insight would be fantastic.
TheSnotRocket Posted October 19, 2020 (author)
Screenshot of htop while all this is going on. I've killed off several Docker containers, since it seems one will peg the system and take 20-40 minutes to docker kill; then another container pegs the system, same story.
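A hedged sketch of the container hunt described above: assuming the standard Docker CLI, `docker stats` can rank containers by CPU so the one pegging the box stands out before you stop or kill it. The container names in the sample input are made up for illustration.

```shell
#!/bin/sh
# Sketch only (assumes the standard Docker CLI is available).

# Reads "name cpu%" pairs, as produced by:
#   docker stats --no-stream --format '{{.Name}} {{.CPUPerc}}'
# and prints the name of the busiest container.
busiest() {
  sort -k2 -gr | head -n 1 | cut -d' ' -f1
}

# Sample input in the shape docker stats would emit (names are illustrative):
printf '%s\n' 'binhex-plex 12.3%' 'binhex-sabnzbd 397.5%' 'emby 4.0%' | busiest
# prints: binhex-sabnzbd
# Then try a graceful `docker stop -t 60 <name>` before `docker kill <name>`.
```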
JorgeB Posted October 20, 2020
Start in safe mode and stop all dockers/VMs. Run it like that for a couple of days; if there are no issues, start turning the other services back on one by one.
TheSnotRocket Posted October 20, 2020 (author)
In the middle of my 36+ hour rebuild from the hard shutdown:

Oct 20 07:29:35 SAN kernel: WARNING: CPU: 16 PID: 11065 at lib/vsprintf.c:2231 vsnprintf+0x30/0x4e8
Oct 20 07:29:35 SAN kernel: Modules linked in: macvlan nvidia_uvm(O) xt_CHECKSUM ipt_REJECT ip6table_mangle ip6table_nat nf_nat_ipv6 iptable_mangle ip6table_filter ip6_tables vhost_net tun vhost tap veth xt_nat ipt_MASQUERADE iptable_filter iptable_nat nf_nat_ipv4 nf_nat ip_tables xfs md_mod ipmi_devintf bonding igb(O) nvidia_drm(PO) nvidia_modeset(PO) nvidia(PO) sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm drm_kms_helper crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel pcbc aesni_intel aes_x86_64 crypto_simd cryptd drm glue_helper mpt3sas intel_cstate intel_uncore intel_rapl_perf agpgart nvme syscopyarea input_leds sysfillrect nvme_core led_class ipmi_ssif sysimgblt wmi i2c_i801 i2c_core ahci raid_class joydev pcc_cpufreq fb_sys_fops libahci scsi_transport_sas button acpi_pad
Oct 20 07:29:35 SAN kernel: acpi_power_meter ipmi_si [last unloaded: igb]
Oct 20 07:29:35 SAN kernel: CPU: 16 PID: 11065 Comm: lsof Tainted: P O 4.19.107-Unraid #1
Oct 20 07:29:35 SAN kernel: Hardware name: Supermicro Super Server/X10SRH-CF, BIOS 3.2 11/22/2019
Oct 20 07:29:35 SAN kernel: RIP: 0010:vsnprintf+0x30/0x4e8
Oct 20 07:29:35 SAN kernel: Code: 41 54 55 53 48 83 ec 18 65 48 8b 04 25 28 00 00 00 48 89 44 24 10 31 c0 48 81 fe ff ff ff 7f 48 c7 44 24 08 00 00 00 00 76 07 <0f> 0b e9 8d 04 00 00 48 89 fd 49 89 fc 49 89 f6 48 01 f5 48 89 cb
Oct 20 07:29:35 SAN kernel: RSP: 0018:ffffc9002b4d3cf0 EFLAGS: 00010212
Oct 20 07:29:35 SAN kernel: RAX: 0000000000000000 RBX: ffff888775f2be80 RCX: ffffc9002b4d3d50
Oct 20 07:29:35 SAN kernel: RDX: ffffffff81d50697 RSI: 0000000080000000 RDI: ffffc90033403000
Oct 20 07:29:35 SAN kernel: RBP: ffffc9002b4d3da0 R08: 000000000000000a R09: ffff888000000000
Oct 20 07:29:35 SAN kernel: R10: ffff88a07fffadc0 R11: ffffea001cba39c8 R12: ffff889fbaf58580
Oct 20 07:29:35 SAN kernel: R13: ffff889fbaf58600 R14: 0000000000000000 R15: 0000000000000000
Oct 20 07:29:35 SAN kernel: FS: 000015409c1a2540(0000) GS:ffff889fffa00000(0000) knlGS:0000000000000000
Oct 20 07:29:35 SAN kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Oct 20 07:29:35 SAN kernel: CR2: 000014ba36d896a8 CR3: 0000001051794004 CR4: 00000000001626e0
Oct 20 07:29:35 SAN kernel: Call Trace:
Oct 20 07:29:35 SAN kernel: seq_vprintf+0x2b/0x3d
Oct 20 07:29:35 SAN kernel: seq_printf+0x4e/0x65
Oct 20 07:29:35 SAN kernel: seq_show+0xf5/0x13e
Oct 20 07:29:35 SAN kernel: seq_read+0x170/0x339
Oct 20 07:29:35 SAN kernel: __vfs_read+0x2e/0x134
Oct 20 07:29:35 SAN kernel: ? __se_sys_newfstat+0x3c/0x5f
Oct 20 07:29:35 SAN kernel: vfs_read+0xa1/0x122
Oct 20 07:29:35 SAN kernel: ksys_read+0x60/0xb4
Oct 20 07:29:35 SAN kernel: do_syscall_64+0x57/0xf2
Oct 20 07:29:35 SAN kernel: entry_SYSCALL_64_after_hwframe+0x44/0xa9
Oct 20 07:29:35 SAN kernel: RIP: 0033:0x15409c0c282e
Oct 20 07:29:35 SAN kernel: Code: c0 e9 f6 fe ff ff 50 48 8d 3d b6 5d 0a 00 e8 e9 fd 01 00 66 0f 1f 84 00 00 00 00 00 64 8b 04 25 18 00 00 00 85 c0 75 14 0f 05 <48> 3d 00 f0 ff ff 77 5a c3 66 0f 1f 84 00 00 00 00 00 48 83 ec 28
Oct 20 07:29:35 SAN kernel: RSP: 002b:00007ffdb8b16c18 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
Oct 20 07:29:35 SAN kernel: RAX: ffffffffffffffda RBX: 00000000004292c0 RCX: 000015409c0c282e
Oct 20 07:29:35 SAN kernel: RDX: 0000000000000400 RSI: 00000000004527d0 RDI: 0000000000000005
Oct 20 07:29:35 SAN kernel: RBP: 000015409c198420 R08: 0000000000000005 R09: 0000000000000000
Oct 20 07:29:35 SAN kernel: R10: 0000000000000000 R11: 0000000000000246 R12: 00000000004292c0
Oct 20 07:29:35 SAN kernel: R13: 000015409c197820 R14: 0000000000000d68 R15: 0000000000000d68
Oct 20 07:29:35 SAN kernel: ---[ end trace e6328e640efba0a5 ]---
Oct 20 07:30:02 SAN login[15329]: ROOT LOGIN on '/dev/pts/1'
Oct 20 07:30:59 SAN kernel: rcu: INFO: rcu_sched self-detected stall on CPU
Oct 20 07:30:59 SAN kernel: rcu: 6-....: (59999 ticks this GP) idle=1f2/1/0x4000000000000002 softirq=5976769/5977668 fqs=14467
Oct 20 07:30:59 SAN kernel: rcu: (t=60000 jiffies g=8040365 q=509963)
Oct 20 07:30:59 SAN kernel: NMI backtrace for cpu 6
Oct 20 07:30:59 SAN kernel: CPU: 6 PID: 11065 Comm: lsof Tainted: P W O 4.19.107-Unraid #1
Oct 20 07:30:59 SAN kernel: Hardware name: Supermicro Super Server/X10SRH-CF, BIOS 3.2 11/22/2019
Oct 20 07:30:59 SAN kernel: Call Trace:
Oct 20 07:30:59 SAN kernel: <IRQ>
Oct 20 07:30:59 SAN kernel: dump_stack+0x67/0x83
Oct 20 07:30:59 SAN kernel: nmi_cpu_backtrace+0x71/0x83
Oct 20 07:30:59 SAN kernel: ? lapic_can_unplug_cpu+0x8e/0x8e
Oct 20 07:30:59 SAN kernel: nmi_trigger_cpumask_backtrace+0x57/0xd7
Oct 20 07:30:59 SAN kernel: rcu_dump_cpu_stacks+0x91/0xbb
Oct 20 07:30:59 SAN kernel: rcu_check_callbacks+0x28f/0x58e
Oct 20 07:30:59 SAN kernel: ? tick_sched_handle.isra.5+0x2f/0x2f
Oct 20 07:30:59 SAN kernel: update_process_times+0x23/0x45
Oct 20 07:30:59 SAN kernel: tick_sched_timer+0x36/0x64
Oct 20 07:30:59 SAN kernel: __hrtimer_run_queues+0xb1/0x105
Oct 20 07:30:59 SAN kernel: hrtimer_interrupt+0xf4/0x20d
Oct 20 07:30:59 SAN kernel: smp_apic_timer_interrupt+0x79/0x91
Oct 20 07:30:59 SAN kernel: apic_timer_interrupt+0xf/0x20
Oct 20 07:30:59 SAN kernel: </IRQ>
Oct 20 07:30:59 SAN kernel: RIP: 0010:vsnprintf+0x32/0x4e8
Oct 20 07:30:59 SAN kernel: Code: 55 53 48 83 ec 18 65 48 8b 04 25 28 00 00 00 48 89 44 24 10 31 c0 48 81 fe ff ff ff 7f 48 c7 44 24 08 00 00 00 00 76 07 0f 0b <e9> 8d 04 00 00 48 89 fd 49 89 fc 49 89 f6 48 01 f5 48 89 cb 73 0a
Oct 20 07:30:59 SAN kernel: RSP: 0018:ffffc9002b4d3c00 EFLAGS: 00010212 ORIG_RAX: ffffffffffffff13
Oct 20 07:30:59 SAN kernel: RAX: 0000000000000000 RBX: ffff888775f2be80 RCX: ffffc9002b4d3c60
Oct 20 07:30:59 SAN kernel: RDX: ffffffff81d3bdf8 RSI: 00000000fff29951 RDI: ffffc900334d96af
Oct 20 07:30:59 SAN kernel: RBP: ffffc9002b4d3cb0 R08: 0000000000000081 R09: ffff8883921cf838
Oct 20 07:30:59 SAN kernel: R10: ffffc9002b4d3cc4 R11: ffffea0030932cc8 R12: ffffffff81d3bdf8
Oct 20 07:30:59 SAN kernel: R13: ffff8883921cf838 R14: ffff889fbb029188 R15: 0000000000000000
Oct 20 07:30:59 SAN kernel: ? invalid_op+0x14/0x20
Oct 20 07:30:59 SAN kernel: seq_vprintf+0x2b/0x3d
Oct 20 07:30:59 SAN kernel: seq_printf+0x4e/0x65
Oct 20 07:30:59 SAN kernel: ? vsnprintf+0x32/0x4e8
Oct 20 07:30:59 SAN kernel: show_mark_fhandle+0xba/0xe0
Oct 20 07:30:59 SAN kernel: ? seq_vprintf+0x2b/0x3d
Oct 20 07:30:59 SAN kernel: ? seq_printf+0x4e/0x65
Oct 20 07:30:59 SAN kernel: inotify_show_fdinfo+0x8c/0xcb
Oct 20 07:30:59 SAN kernel: seq_show+0x128/0x13e
Oct 20 07:30:59 SAN kernel: seq_read+0x170/0x339
Oct 20 07:30:59 SAN kernel: __vfs_read+0x2e/0x134
Oct 20 07:30:59 SAN kernel: ? __se_sys_newfstat+0x3c/0x5f
Oct 20 07:30:59 SAN kernel: vfs_read+0xa1/0x122
Oct 20 07:30:59 SAN kernel: ksys_read+0x60/0xb4
Oct 20 07:30:59 SAN kernel: do_syscall_64+0x57/0xf2
Oct 20 07:30:59 SAN kernel: entry_SYSCALL_64_after_hwframe+0x44/0xa9
Oct 20 07:30:59 SAN kernel: RIP: 0033:0x15409c0c282e
Oct 20 07:30:59 SAN kernel: Code: c0 e9 f6 fe ff ff 50 48 8d 3d b6 5d 0a 00 e8 e9 fd 01 00 66 0f 1f 84 00 00 00 00 00 64 8b 04 25 18 00 00 00 85 c0 75 14 0f 05 <48> 3d 00 f0 ff ff 77 5a c3 66 0f 1f 84 00 00 00 00 00 48 83 ec 28
Oct 20 07:30:59 SAN kernel: RSP: 002b:00007ffdb8b16c18 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
Oct 20 07:30:59 SAN kernel: RAX: ffffffffffffffda RBX: 00000000004292c0 RCX: 000015409c0c282e
Oct 20 07:30:59 SAN kernel: RDX: 0000000000000400 RSI: 00000000004527d0 RDI: 0000000000000005
Oct 20 07:30:59 SAN kernel: RBP: 000015409c198420 R08: 0000000000000005 R09: 0000000000000000
Oct 20 07:30:59 SAN kernel: R10: 0000000000000000 R11: 0000000000000246 R12: 00000000004292c0
Oct 20 07:30:59 SAN kernel: R13: 000015409c197820 R14: 0000000000000d68 R15: 0000000000000d68
Squid Posted October 20, 2020
14 hours ago, TheSnotRocket said: "I can't grab logs (been trying for an hour)"
How were you trying to grab them? From the webGUI, if it doesn't complete within 120 seconds, it never will. From a command prompt, run diagnostics and the zip file will be saved into the logs folder on the flash drive.
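To make the command-prompt route concrete, here's a minimal sketch. It assumes the zip lands in /boot/logs on the flash drive as described above; the helper name is made up for illustration.

```shell
#!/bin/sh
# Sketch: after running `diagnostics` from the console, the newest zip in
# the flash drive's logs folder is the one to post. Helper name is mine.
newest_diag() {
  # $1 = logs directory (on Unraid this would be /boot/logs)
  ls -t "$1"/*-diagnostics-*.zip 2>/dev/null | head -n 1
}

# Typical use on the server itself (not run here):
#   diagnostics && newest_diag /boot/logs
```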
TheSnotRocket Posted October 20, 2020 (author)
23 minutes ago, Squid said: "How were you trying to grab them? ..."
What I posted to pastebin was from the webGUI; that was the best I could get before I had to hard-down. I was unable to run the diagnostics command while in the hung state. Currently running in safe mode, with my parity check re-re-re-re-started. ETA: 1 day, 18 hours.
civic95man Posted October 20, 2020
Still, diagnostics right now could help shed some light on why it hung originally. At the least, it would give us an idea of your hardware (AMD needs some specific workarounds, for example).
TheSnotRocket Posted October 20, 2020 (author)
So, maybe this will help. This is one of the latest diagnostic logs I have. It's NOT from the pastebin in the OP of this post, but it still contains:

CPU: 11 PID: 16390 Comm: php-fpm Tainted: P W O 4.19.107-Unraid #1

When my server hangs like this, it's unable to create or collect log files. I've left the diagnostics command running from my BMC window for an hour or more; nothing happens.
san-diagnostics-20201005-1635.zip
TheSnotRocket Posted October 20, 2020 (author)
Here are my current diagnostics while running in safe mode: san-diagnostics-20201020-0943.zip
TheSnotRocket Posted October 20, 2020 (author)
Config-ish:
X10SRH-CF
Xeon E5-2660 v3
128GB RAM
24+2 disks, 215/2tb total
NVMe (unassigned) for a VM

What else would you like to know, or what additional information can I provide?
civic95man Posted October 20, 2020
I just looked at your older diagnostics from the 5th of October. Nothing stands out in the configuration, but your syslog is spammed with repeated drive connection/reset messages. I'm not sure whether that's an actual connection issue or your HBA card; you should look into upgrading its firmware. Your syslog is seriously filled with those messages. You also have a lot of kernel panics towards the end, which seemingly result in an OOM condition (odd, since you have so much RAM). Maybe try booting in safe mode and see if you're stable, then slowly enable dockers/VMs until you find the cause.
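For sizing up that kind of syslog spam, a rough sketch: strip the timestamps and count repeats so whatever is flooding the log rises to the top. It assumes the classic "Mon DD HH:MM:SS host message" syslog layout; the sample messages below are illustrative, not the exact strings from this server.

```shell
#!/bin/sh
# Rough sketch: collapse syslog timestamps and count duplicate messages,
# so the message flooding the log shows up first. Assumes the standard
# "Mon DD HH:MM:SS host message" layout.
spam_summary() {
  sed -E 's/^[A-Z][a-z]{2} +[0-9]+ [0-9:]{8} [^ ]+ //' | sort | uniq -c | sort -rn | head
}

# Illustrative input (made-up reset messages):
printf '%s\n' \
  'Oct 20 07:29:35 SAN kernel: sdb: attempting link reset' \
  'Oct 20 07:30:01 SAN kernel: sdb: attempting link reset' \
  'Oct 20 07:31:11 SAN kernel: sdb: attempting link reset' | spam_summary
# prints a count of 3 next to the repeated reset message
```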
TheSnotRocket Posted October 20, 2020 (author)
I think the connection resets etc. were due to me trying to spin down my SAS drives. I've since corrected that, but I'll keep an eye on it. I'm in safe mode as I type this, rebuilding the array; I'll feel more comfortable messing around once it's done. Right now everything (Docker containers and VMs) is off.
civic95man Posted October 20, 2020
26 minutes ago, TheSnotRocket said: "So, maybe this will help. This is one of the latest diagnostic logs I have ..."
The "Tainted" flag just means you're running an out-of-tree (OOT) driver that isn't officially supported by the kernel. The usual culprits are the Intel igb and Nvidia drivers. You may need to set up a syslog server so you can see what happens when you lose the system.
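One way to act on the syslog-server suggestion (a sketch, assuming an rsyslog-style receiver on another machine; recent Unraid releases also expose a built-in syslog server under Settings, though I'm not certain of the exact menu wording): add a forwarding rule so log lines leave the box before a lockup eats them. The IP address and file path are placeholders.

```shell
#!/bin/sh
# Sketch, assuming an rsyslog-style receiver on another machine. Appends a
# forwarding rule so syslog lines are shipped off the server in real time.
# IP address and config path are placeholders.
add_remote_syslog() {
  # $1 = rsyslog config file, $2 = remote syslog host
  printf '*.* @%s:514\n' "$2" >> "$1"   # single '@' = UDP; '@@' would be TCP
}

# e.g. add_remote_syslog /etc/rsyslog.d/remote.conf 192.168.1.50
```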
TheSnotRocket Posted October 20, 2020 (author)
Funny, I had both igb and Nvidia installed. I pulled the igb card because I wasn't actually using it, and the logs reported the card as unavailable anyway. The Nvidia card is still in the box for Plex and Emby decoding. What's interesting is that I don't see that tainted message until the system becomes super unstable; after a fresh reboot, and during the parity check, I don't see those issues.
TheSnotRocket Posted October 21, 2020 (author)
Sooo... not sure how to track this down. Fresh reboot; started a VM I have to use and three dockers: binhex-plex, EmbyServerBeta, and binhex-sabnzbd. The system was rock solid under heavy VM use for 15+ hours, with no errors in the logs at all. Then I pulled an nzb and got almost instant instability: errors, kernel taint messages, etc. Where should I look? It feels like the issues start when I hit the cache drives hard. The cache drives are SATA SSDs and are not attached to my backplane. I'm now headed toward a hard-down situation again; I ran diagnostics from the shell and let it run for an hour. No logs.
TheSnotRocket Posted October 21, 2020 (author)
I'm 95% sure my issue is something binhex-sabnzbd related.
TheSnotRocket Posted October 21, 2020 (author)
Changed from binhex-sabnzbd (binhex/arch-sabnzbd) to sabnzbd (linuxserver/sabnzbd) and have pulled down 200+ GB at 110 MB/s so far with ZERO issues. I would have tanked the server by now on binhex-sabnzbd.