lsof Tainted / system instability



Hello folks... 

 

I've been an unRaid user for a short time... the initial setup was rock solid... then things broke - not sure what, why, or how.

 

My config seems to have taken a turn for the worse: I'm seeing system lockups about every 2-3 days. The UI becomes unresponsive, and the system eventually tanks to the point where my only option is a hard power-off.

 

I'm currently at that point again where it looks like my system is about to need another hard power-off.

 

I can't grab logs (been trying for an hour) but the best I can do right now is a pastebin of my log screen.

 

https://pastebin.com/PfmnEMZ7

 

Any insight would be fantastic.

 

 


In the middle of my 36+ hour rebuild from the hard shutdown:

 

Oct 20 07:29:35 SAN kernel: WARNING: CPU: 16 PID: 11065 at lib/vsprintf.c:2231 vsnprintf+0x30/0x4e8
Oct 20 07:29:35 SAN kernel: Modules linked in: macvlan nvidia_uvm(O) xt_CHECKSUM ipt_REJECT ip6table_mangle ip6table_nat nf_nat_ipv6 iptable_mangle ip6table_filter ip6_tables vhost_net tun vhost tap veth xt_nat ipt_MASQUERADE iptable_filter iptable_nat nf_nat_ipv4 nf_nat ip_tables xfs md_mod ipmi_devintf bonding igb(O) nvidia_drm(PO) nvidia_modeset(PO) nvidia(PO) sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm drm_kms_helper crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel pcbc aesni_intel aes_x86_64 crypto_simd cryptd drm glue_helper mpt3sas intel_cstate intel_uncore intel_rapl_perf agpgart nvme syscopyarea input_leds sysfillrect nvme_core led_class ipmi_ssif sysimgblt wmi i2c_i801 i2c_core ahci raid_class joydev pcc_cpufreq fb_sys_fops libahci scsi_transport_sas button acpi_pad
Oct 20 07:29:35 SAN kernel: acpi_power_meter ipmi_si [last unloaded: igb]
Oct 20 07:29:35 SAN kernel: CPU: 16 PID: 11065 Comm: lsof Tainted: P O 4.19.107-Unraid #1
Oct 20 07:29:35 SAN kernel: Hardware name: Supermicro Super Server/X10SRH-CF, BIOS 3.2 11/22/2019
Oct 20 07:29:35 SAN kernel: RIP: 0010:vsnprintf+0x30/0x4e8
Oct 20 07:29:35 SAN kernel: Code: 41 54 55 53 48 83 ec 18 65 48 8b 04 25 28 00 00 00 48 89 44 24 10 31 c0 48 81 fe ff ff ff 7f 48 c7 44 24 08 00 00 00 00 76 07 <0f> 0b e9 8d 04 00 00 48 89 fd 49 89 fc 49 89 f6 48 01 f5 48 89 cb
Oct 20 07:29:35 SAN kernel: RSP: 0018:ffffc9002b4d3cf0 EFLAGS: 00010212
Oct 20 07:29:35 SAN kernel: RAX: 0000000000000000 RBX: ffff888775f2be80 RCX: ffffc9002b4d3d50
Oct 20 07:29:35 SAN kernel: RDX: ffffffff81d50697 RSI: 0000000080000000 RDI: ffffc90033403000
Oct 20 07:29:35 SAN kernel: RBP: ffffc9002b4d3da0 R08: 000000000000000a R09: ffff888000000000
Oct 20 07:29:35 SAN kernel: R10: ffff88a07fffadc0 R11: ffffea001cba39c8 R12: ffff889fbaf58580
Oct 20 07:29:35 SAN kernel: R13: ffff889fbaf58600 R14: 0000000000000000 R15: 0000000000000000
Oct 20 07:29:35 SAN kernel: FS: 000015409c1a2540(0000) GS:ffff889fffa00000(0000) knlGS:0000000000000000
Oct 20 07:29:35 SAN kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Oct 20 07:29:35 SAN kernel: CR2: 000014ba36d896a8 CR3: 0000001051794004 CR4: 00000000001626e0
Oct 20 07:29:35 SAN kernel: Call Trace:
Oct 20 07:29:35 SAN kernel: seq_vprintf+0x2b/0x3d
Oct 20 07:29:35 SAN kernel: seq_printf+0x4e/0x65
Oct 20 07:29:35 SAN kernel: seq_show+0xf5/0x13e
Oct 20 07:29:35 SAN kernel: seq_read+0x170/0x339
Oct 20 07:29:35 SAN kernel: __vfs_read+0x2e/0x134
Oct 20 07:29:35 SAN kernel: ? __se_sys_newfstat+0x3c/0x5f
Oct 20 07:29:35 SAN kernel: vfs_read+0xa1/0x122
Oct 20 07:29:35 SAN kernel: ksys_read+0x60/0xb4
Oct 20 07:29:35 SAN kernel: do_syscall_64+0x57/0xf2
Oct 20 07:29:35 SAN kernel: entry_SYSCALL_64_after_hwframe+0x44/0xa9
Oct 20 07:29:35 SAN kernel: RIP: 0033:0x15409c0c282e
Oct 20 07:29:35 SAN kernel: Code: c0 e9 f6 fe ff ff 50 48 8d 3d b6 5d 0a 00 e8 e9 fd 01 00 66 0f 1f 84 00 00 00 00 00 64 8b 04 25 18 00 00 00 85 c0 75 14 0f 05 <48> 3d 00 f0 ff ff 77 5a c3 66 0f 1f 84 00 00 00 00 00 48 83 ec 28
Oct 20 07:29:35 SAN kernel: RSP: 002b:00007ffdb8b16c18 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
Oct 20 07:29:35 SAN kernel: RAX: ffffffffffffffda RBX: 00000000004292c0 RCX: 000015409c0c282e
Oct 20 07:29:35 SAN kernel: RDX: 0000000000000400 RSI: 00000000004527d0 RDI: 0000000000000005
Oct 20 07:29:35 SAN kernel: RBP: 000015409c198420 R08: 0000000000000005 R09: 0000000000000000
Oct 20 07:29:35 SAN kernel: R10: 0000000000000000 R11: 0000000000000246 R12: 00000000004292c0
Oct 20 07:29:35 SAN kernel: R13: 000015409c197820 R14: 0000000000000d68 R15: 0000000000000d68
Oct 20 07:29:35 SAN kernel: ---[ end trace e6328e640efba0a5 ]---
Oct 20 07:30:02 SAN login[15329]: ROOT LOGIN on '/dev/pts/1'
Oct 20 07:30:59 SAN kernel: rcu: INFO: rcu_sched self-detected stall on CPU
Oct 20 07:30:59 SAN kernel: rcu: 6-....: (59999 ticks this GP) idle=1f2/1/0x4000000000000002 softirq=5976769/5977668 fqs=14467
Oct 20 07:30:59 SAN kernel: rcu: (t=60000 jiffies g=8040365 q=509963)
Oct 20 07:30:59 SAN kernel: NMI backtrace for cpu 6
Oct 20 07:30:59 SAN kernel: CPU: 6 PID: 11065 Comm: lsof Tainted: P W O 4.19.107-Unraid #1
Oct 20 07:30:59 SAN kernel: Hardware name: Supermicro Super Server/X10SRH-CF, BIOS 3.2 11/22/2019
Oct 20 07:30:59 SAN kernel: Call Trace:
Oct 20 07:30:59 SAN kernel: <IRQ>
Oct 20 07:30:59 SAN kernel: dump_stack+0x67/0x83
Oct 20 07:30:59 SAN kernel: nmi_cpu_backtrace+0x71/0x83
Oct 20 07:30:59 SAN kernel: ? lapic_can_unplug_cpu+0x8e/0x8e
Oct 20 07:30:59 SAN kernel: nmi_trigger_cpumask_backtrace+0x57/0xd7
Oct 20 07:30:59 SAN kernel: rcu_dump_cpu_stacks+0x91/0xbb
Oct 20 07:30:59 SAN kernel: rcu_check_callbacks+0x28f/0x58e
Oct 20 07:30:59 SAN kernel: ? tick_sched_handle.isra.5+0x2f/0x2f
Oct 20 07:30:59 SAN kernel: update_process_times+0x23/0x45
Oct 20 07:30:59 SAN kernel: tick_sched_timer+0x36/0x64
Oct 20 07:30:59 SAN kernel: __hrtimer_run_queues+0xb1/0x105
Oct 20 07:30:59 SAN kernel: hrtimer_interrupt+0xf4/0x20d
Oct 20 07:30:59 SAN kernel: smp_apic_timer_interrupt+0x79/0x91
Oct 20 07:30:59 SAN kernel: apic_timer_interrupt+0xf/0x20
Oct 20 07:30:59 SAN kernel: </IRQ>
Oct 20 07:30:59 SAN kernel: RIP: 0010:vsnprintf+0x32/0x4e8
Oct 20 07:30:59 SAN kernel: Code: 55 53 48 83 ec 18 65 48 8b 04 25 28 00 00 00 48 89 44 24 10 31 c0 48 81 fe ff ff ff 7f 48 c7 44 24 08 00 00 00 00 76 07 0f 0b <e9> 8d 04 00 00 48 89 fd 49 89 fc 49 89 f6 48 01 f5 48 89 cb 73 0a
Oct 20 07:30:59 SAN kernel: RSP: 0018:ffffc9002b4d3c00 EFLAGS: 00010212 ORIG_RAX: ffffffffffffff13
Oct 20 07:30:59 SAN kernel: RAX: 0000000000000000 RBX: ffff888775f2be80 RCX: ffffc9002b4d3c60
Oct 20 07:30:59 SAN kernel: RDX: ffffffff81d3bdf8 RSI: 00000000fff29951 RDI: ffffc900334d96af
Oct 20 07:30:59 SAN kernel: RBP: ffffc9002b4d3cb0 R08: 0000000000000081 R09: ffff8883921cf838
Oct 20 07:30:59 SAN kernel: R10: ffffc9002b4d3cc4 R11: ffffea0030932cc8 R12: ffffffff81d3bdf8
Oct 20 07:30:59 SAN kernel: R13: ffff8883921cf838 R14: ffff889fbb029188 R15: 0000000000000000
Oct 20 07:30:59 SAN kernel: ? invalid_op+0x14/0x20
Oct 20 07:30:59 SAN kernel: seq_vprintf+0x2b/0x3d
Oct 20 07:30:59 SAN kernel: seq_printf+0x4e/0x65
Oct 20 07:30:59 SAN kernel: ? vsnprintf+0x32/0x4e8
Oct 20 07:30:59 SAN kernel: show_mark_fhandle+0xba/0xe0
Oct 20 07:30:59 SAN kernel: ? seq_vprintf+0x2b/0x3d
Oct 20 07:30:59 SAN kernel: ? seq_printf+0x4e/0x65
Oct 20 07:30:59 SAN kernel: inotify_show_fdinfo+0x8c/0xcb
Oct 20 07:30:59 SAN kernel: seq_show+0x128/0x13e
Oct 20 07:30:59 SAN kernel: seq_read+0x170/0x339
Oct 20 07:30:59 SAN kernel: __vfs_read+0x2e/0x134
Oct 20 07:30:59 SAN kernel: ? __se_sys_newfstat+0x3c/0x5f
Oct 20 07:30:59 SAN kernel: vfs_read+0xa1/0x122
Oct 20 07:30:59 SAN kernel: ksys_read+0x60/0xb4
Oct 20 07:30:59 SAN kernel: do_syscall_64+0x57/0xf2
Oct 20 07:30:59 SAN kernel: entry_SYSCALL_64_after_hwframe+0x44/0xa9
Oct 20 07:30:59 SAN kernel: RIP: 0033:0x15409c0c282e
Oct 20 07:30:59 SAN kernel: Code: c0 e9 f6 fe ff ff 50 48 8d 3d b6 5d 0a 00 e8 e9 fd 01 00 66 0f 1f 84 00 00 00 00 00 64 8b 04 25 18 00 00 00 85 c0 75 14 0f 05 <48> 3d 00 f0 ff ff 77 5a c3 66 0f 1f 84 00 00 00 00 00 48 83 ec 28
Oct 20 07:30:59 SAN kernel: RSP: 002b:00007ffdb8b16c18 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
Oct 20 07:30:59 SAN kernel: RAX: ffffffffffffffda RBX: 00000000004292c0 RCX: 000015409c0c282e
Oct 20 07:30:59 SAN kernel: RDX: 0000000000000400 RSI: 00000000004527d0 RDI: 0000000000000005
Oct 20 07:30:59 SAN kernel: RBP: 000015409c198420 R08: 0000000000000005 R09: 0000000000000000
Oct 20 07:30:59 SAN kernel: R10: 0000000000000000 R11: 0000000000000246 R12: 00000000004292c0
Oct 20 07:30:59 SAN kernel: R13: 000015409c197820 R14: 0000000000000d68 R15: 0000000000000d68

Edited by TheSnotRocket
14 hours ago, TheSnotRocket said:

I can't grab logs (been trying for an hour)

How were you trying to grab them?  From the webGUI, if it doesn't complete within 120 seconds, it will never finish.

 

From a command prompt,

diagnostics

and the zip file will get saved into the logs folder on the flash drive

23 minutes ago, Squid said:

How were you trying to grab them?  From the webGUI, if it doesn't complete within 120 seconds, it will never finish.

 

From a command prompt,


diagnostics

and the zip file will get saved into the logs folder on the flash drive

 

What I posted to pastebin was from the webGUI - that's the best I could get at the time before I had to hard down.

 

I was unable to run the diagnostics command while the system was in the hung state.

 

Currently running in safemode - re-re-re-re-started my parity check.  ETA 1 day, 18 hours.


So... maybe this will help - this is one of the latest diagnostic logs I have. It's NOT from the pastebin in the OP of this post, but it still contains:

 

CPU: 11 PID: 16390 Comm: php-fpm Tainted: P        W  O      4.19.107-Unraid #1

 

When my server hangs like this, it's unable to create or collect log files... I've left the diagnostics command running from my BMC window for an hour or more and nothing happens.

 

san-diagnostics-20201005-1635.zip

Edited by TheSnotRocket

Well, I just looked at your older diagnostics from the 5th of October.  Nothing stands out in the configuration, but your syslog is seriously spammed with repeated drive connection errors and resets.  Not sure if that's an actual connection issue or your HBA card - you should look into upgrading its firmware.  You also have a lot of kernel panics towards the end, which seemingly result in an OOM condition (odd since you have so much RAM).  Maybe try booting in safemode and see if you're stable, then slowly enable dockers/VMs until you find the cause.


I think the connection resets etc. were due to me trying to spin down my SAS drives - I've since corrected that but will keep an eye on it.

 

I'm currently in safemode as I type this - rebuilding the array.  I'll feel more comfortable messing around when it's done.  Right now, everything (Docker and VMs) is off.

 

26 minutes ago, TheSnotRocket said:

So... maybe this will help - this is one of the latest diagnostic logs I have. It's NOT from the pastebin in the OP of this post, but it still contains:

 

CPU: 11 PID: 16390 Comm: php-fpm Tainted: P        W  O      4.19.107-Unraid #1

 

When my server hangs like this, it's unable to create or collect log files... I've left the diagnostics command running from my BMC window for an hour or more and nothing happens.

The "Tainted" flags mostly just mean you're running out-of-tree (OOT) drivers that aren't "officially" supported by the kernel: P is a proprietary module, O is an out-of-tree module, and W means a kernel warning has fired.  The usual culprits are the Intel igb and the Nvidia drivers.
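For reference, each letter in a taint string like "Tainted: P W O" is a separate flag. A quick lookup sketch (a small subset of flags, descriptions paraphrased from the kernel's taint documentation):

```python
# Decode the single-letter kernel taint flags seen in lines like
# "Comm: lsof Tainted: P W O". Subset of flags only; paraphrased meanings.
TAINT_FLAGS = {
    "P": "proprietary module loaded (e.g. the Nvidia driver)",
    "F": "module was force-loaded",
    "D": "kernel has died (oops/BUG) previously",
    "W": "kernel issued a WARNING earlier (like the vsnprintf trace)",
    "O": "out-of-tree module loaded (e.g. igb, nvidia)",
}

def decode_taint(tainted: str) -> list[str]:
    """Return a human-readable description for each flag letter."""
    return [TAINT_FLAGS.get(flag, f"unknown flag {flag!r}") for flag in tainted.split()]
```

So "Tainted: P W O" reads as proprietary module + a prior kernel WARNING + out-of-tree module. The W flag is only set after the first WARNING fires, which is why it shows up in the traces once the system starts misbehaving.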

 

You may need to set up a syslog server so you can see what happens when you lose the system.


Funny... I had igb and Nvidia installed.

 

Pulled my igb card because I wasn't actually using it and the logs reported that the card was unavailable anyway.

 

The Nvidia card is still running in the box for Plex and Emby decoding.

 

What's interesting... is that I don't get that tainted message until the system becomes super unstable. After a fresh reboot and during parity check, I don't see those issues.

 

Edited by TheSnotRocket

Sooo.. not sure how to track this down... 

 

Fresh reboot - started a VM I have to use, and 3 dockers - binhex-Plex, EmbyServerBeta and binhex-sabnzbd.

 

System rock solid - heavy use in a VM for 15+ hours.

 

No errors in the logs at all.

 

I pulled an nzb and got almost instant instability: errors, kernel taint messages, etc.

 

Where to look?  It feels like when I hit the cache drives hard, I start seeing these issues.  The cache drives are SATA SSDs and are not attached to my backplane.

 

I'm now headed back toward a hard-down situation again... I ran diagnostics from the shell and let it run for an hour.  No logs.

 

 

Edited by TheSnotRocket