Please help diagnose system lockup root cause



unRAID 6.7.2

Plugins shown in diagnostics

Hardware listed in my signature
 

My system locked up today while I wasn't actively interacting with it. I see OOM errors in the syslog but I'm not sure how to diagnose what was causing them. I've been having issues lately with the system locking up randomly. Yesterday I noticed call traces in the syslog but was unable to figure out what caused them. See the example below:

 

Jul 22 13:02:44 TowerMediaServ kernel: ------------[ cut here ]------------
Jul 22 13:02:44 TowerMediaServ kernel: NETDEV WATCHDOG: eth0 (r8169): transmit queue 0 timed out
Jul 22 13:02:44 TowerMediaServ kernel: WARNING: CPU: 5 PID: 0 at net/sched/sch_generic.c:461 dev_watchdog+0x15f/0x1b7
Jul 22 13:02:44 TowerMediaServ kernel: Modules linked in: xt_nat veth xt_CHECKSUM ipt_MASQUERADE ipt_REJECT ip6table_mangle ip6table_nat nf_nat_ipv6 iptable_mangle iptable_nat nf_nat_ipv4 nf_nat ip6table_filter ip6_tables iptable_filter ip_tables vhost_net tun vhost tap arc4 ecb md4 sha512_ssse3 sha512_generic cmac cifs ccm xfs md_mod nct6775 hwmon_vid bonding x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel pcbc aesni_intel aes_x86_64 crypto_simd cryptd glue_helper hid_logitech_hidpp intel_cstate intel_uncore intel_rapl_perf ahci libahci pcc_cpufreq ie31200_edac i2c_i801 hid_logitech_dj video button r8169 ftdi_sio i2c_core usbserial cdc_acm realtek 3w_9xxx backlight
Jul 22 13:02:44 TowerMediaServ kernel: CPU: 5 PID: 0 Comm: swapper/5 Not tainted 4.19.56-Unraid #1
Jul 22 13:02:44 TowerMediaServ kernel: Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./Z77 Pro3, BIOS P2.10 07/12/2013
Jul 22 13:02:44 TowerMediaServ kernel: RIP: 0010:dev_watchdog+0x15f/0x1b7
Jul 22 13:02:44 TowerMediaServ kernel: Code: 0b 06 97 00 00 75 36 4c 89 ef c6 05 ff 05 97 00 01 e8 8f b3 fd ff 89 e9 4c 89 ee 48 c7 c7 3e df d8 81 48 89 c2 e8 48 cd b1 ff <0f> 0b eb 0f ff c5 48 81 c2 40 01 00 00 39 cd 75 98 eb 13 48 8b 83
Jul 22 13:02:44 TowerMediaServ kernel: RSP: 0018:ffff88841f543ea0 EFLAGS: 00010286
Jul 22 13:02:44 TowerMediaServ kernel: RAX: 0000000000000000 RBX: ffff88841c9b4438 RCX: 0000000000000007
Jul 22 13:02:44 TowerMediaServ kernel: RDX: 000000000000096f RSI: 0000000000000002 RDI: ffff88841f5564f0
Jul 22 13:02:44 TowerMediaServ kernel: RBP: 0000000000000000 R08: 0000000000000003 R09: 0000000000020300
Jul 22 13:02:44 TowerMediaServ kernel: R10: 000000000000096e R11: 0000000000013510 R12: ffff88841c9b441c
Jul 22 13:02:44 TowerMediaServ kernel: R13: ffff88841c9b4000 R14: ffff888418adc080 R15: 0000000000000005
Jul 22 13:02:44 TowerMediaServ kernel: FS:  0000000000000000(0000) GS:ffff88841f540000(0000) knlGS:0000000000000000
Jul 22 13:02:44 TowerMediaServ kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jul 22 13:02:44 TowerMediaServ kernel: CR2: 000000000572fcb0 CR3: 0000000001e0a002 CR4: 00000000001626e0
Jul 22 13:02:44 TowerMediaServ kernel: Call Trace:
Jul 22 13:02:44 TowerMediaServ kernel: <IRQ>
Jul 22 13:02:44 TowerMediaServ kernel: call_timer_fn+0x18/0x7b
Jul 22 13:02:44 TowerMediaServ kernel: ? qdisc_reset+0xc0/0xc0
Jul 22 13:02:44 TowerMediaServ kernel: expire_timers+0x7f/0x8e
Jul 22 13:02:44 TowerMediaServ kernel: run_timer_softirq+0x72/0x120
Jul 22 13:02:44 TowerMediaServ kernel: ? hrtimer_init+0x2/0x2
Jul 22 13:02:44 TowerMediaServ kernel: ? hrtimer_wakeup+0x19/0x1c
Jul 22 13:02:44 TowerMediaServ kernel: ? __hrtimer_run_queues+0xbd/0x105
Jul 22 13:02:44 TowerMediaServ kernel: ? recalibrate_cpu_khz+0x1/0x1
Jul 22 13:02:44 TowerMediaServ kernel: ? ktime_get+0x3a/0x8d
Jul 22 13:02:44 TowerMediaServ kernel: __do_softirq+0xce/0x1e2
Jul 22 13:02:44 TowerMediaServ kernel: irq_exit+0x5e/0x9d
Jul 22 13:02:44 TowerMediaServ kernel: smp_apic_timer_interrupt+0x7e/0x91
Jul 22 13:02:44 TowerMediaServ kernel: apic_timer_interrupt+0xf/0x20
Jul 22 13:02:44 TowerMediaServ kernel: </IRQ>
Jul 22 13:02:44 TowerMediaServ kernel: RIP: 0010:cpuidle_enter_state+0xe8/0x141
Jul 22 13:02:44 TowerMediaServ kernel: Code: ff 45 84 ff 74 1d 9c 58 0f 1f 44 00 00 0f ba e0 09 73 09 0f 0b fa 66 0f 1f 44 00 00 31 ff e8 ae 0c be ff fb 66 0f 1f 44 00 00 <48> 2b 1c 24 b8 ff ff ff 7f 48 b9 ff ff ff ff f3 01 00 00 48 39 cb
Jul 22 13:02:44 TowerMediaServ kernel: RSP: 0018:ffffc90001937ea0 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff13
Jul 22 13:02:44 TowerMediaServ kernel: RAX: ffff88841f560b00 RBX: 0000025600146501 RCX: 000000000000001f
Jul 22 13:02:44 TowerMediaServ kernel: RDX: 0000025600146501 RSI: 0000000025a594f5 RDI: 0000000000000000
Jul 22 13:02:44 TowerMediaServ kernel: RBP: ffff88841f56b500 R08: 0000000000000002 R09: 00000000000203c0
Jul 22 13:02:44 TowerMediaServ kernel: R10: 0000000000287868 R11: 00000817aa7850e9 R12: 0000000000000004
Jul 22 13:02:44 TowerMediaServ kernel: R13: 0000000000000004 R14: ffffffff81e5a018 R15: 0000000000000000
Jul 22 13:02:44 TowerMediaServ kernel: do_idle+0x192/0x20e
Jul 22 13:02:44 TowerMediaServ kernel: cpu_startup_entry+0x6a/0x6c
Jul 22 13:02:44 TowerMediaServ kernel: start_secondary+0x197/0x1b2
Jul 22 13:02:44 TowerMediaServ kernel: secondary_startup_64+0xa4/0xb0
Jul 22 13:02:44 TowerMediaServ kernel: ---[ end trace 26a17b115aa8021d ]---
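To get a feel for how often this watchdog timeout recurs, something like the rough Python sketch below could be run over the syslog. The function name `count_watchdog_timeouts` is made up for illustration, and the regex assumes the message looks exactly like the `NETDEV WATCHDOG` line in the trace above; it is not an unRAID tool.

```python
import re
from collections import Counter

# Assumed format, copied from the trace above:
#   "... kernel: NETDEV WATCHDOG: eth0 (r8169): transmit queue 0 timed out"
WATCHDOG_RE = re.compile(
    r"NETDEV WATCHDOG: (?P<iface>\S+) \((?P<driver>[^)]+)\): "
    r"transmit queue (?P<queue>\d+) timed out"
)


def count_watchdog_timeouts(lines):
    """Count watchdog timeouts per (interface, driver) pair.

    `lines` is any iterable of syslog lines (hypothetical helper,
    not part of any existing tool).
    """
    counts = Counter()
    for line in lines:
        match = WATCHDOG_RE.search(line)
        if match:
            counts[(match["iface"], match["driver"])] += 1
    return counts
```

Passing it an open file, for example `count_watchdog_timeouts(open(path, errors="replace"))` with `path` pointing at the syslog inside the diagnostics zip, would show whether the timeouts are limited to eth0 and the r8169 driver or spread across interfaces.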

 

The last time I touched it was last night, when I was using the PlexMediaServer docker, a Windows 7 VM, and a Windows 8 VM with GPU passthrough running Blue Iris. I had also recently completed several file transfers using Krusader and invoked the mover, but this may not be relevant. I was seeing high CPU utilization and couldn't figure out what was causing it. I think I may have shut down my Windows 8 VM to make sure Plex could run smoothly, as I had a user watching something. That was the last time I interacted with it until I came home today and noticed my home automation system (which runs on the Win7 VM) wasn't working. Then I noticed nothing on my unRAID system was working: no VMs, dockers, webUI, or console. After realizing it was fully locked up, I pressed the power switch once hoping to initiate a graceful shutdown, which luckily appeared to work. After rebooting I browsed to my flash drive over the network and uploaded the diagnostics here. I appreciate any assistance!

towermediaserv-diagnostics-20190723-1724.zip


You're being bombed with this:

Jul 23 03:40:44 TowerMediaServ kernel: 3w-9xxx: scsi1: ERROR: (0x03:0x0101): Invalid command opcode:opcode=0xB7.
Jul 23 03:40:44 TowerMediaServ kernel: 3w-9xxx: scsi1: ERROR: (0x03:0x0101): Invalid command opcode:opcode=0x37.
Jul 23 03:42:45 TowerMediaServ kernel: 3w-9xxx: scsi1: ERROR: (0x03:0x0101): Invalid command opcode:opcode=0x85.

Off the top of my head, you should set the SMART controller type to 3ware for each disk attached to your 3ware controller (click on each attached disk in the Main tab).

 

It *may* also help to uninstall the preclear and statistics sender plugins when you're not actively using them.
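If you want to see how much of the syslog these errors account for, a rough Python sketch like this could tally them per opcode. The name `tally_invalid_opcodes` is made up for illustration, and the regex assumes the message format shown in the three lines above.

```python
import re
from collections import Counter

# Assumed format, copied from the syslog lines above:
#   "... kernel: 3w-9xxx: scsi1: ERROR: (0x03:0x0101): Invalid command opcode:opcode=0xB7."
OPCODE_RE = re.compile(
    r"3w-9xxx: (?P<host>scsi\d+): ERROR: .*"
    r"Invalid command opcode:opcode=(?P<opcode>0x[0-9A-Fa-f]+)"
)


def tally_invalid_opcodes(lines):
    """Count 'Invalid command opcode' errors per (SCSI host, opcode).

    `lines` is any iterable of syslog lines (hypothetical helper).
    Opcodes are lower-cased so 0xB7 and 0xb7 land in the same bucket.
    """
    counts = Counter()
    for line in lines:
        match = OPCODE_RE.search(line)
        if match:
            counts[(match["host"], match["opcode"].lower())] += 1
    return counts
```

Calling `tally_invalid_opcodes(open(path, errors="replace")).most_common()` on the syslog from the diagnostics zip (with `path` being wherever you extracted it) lists the opcodes from most to least frequent, which makes it easier to compare the error rate before and after changing a setting or removing a plugin.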

Thanks for the suggestion.  They are all already set to be 3ware.  I've had those errors for literally 2+ years and as far as I can tell they are benign.  Something about the controller not passing a certain command when it's trying to pull temperature data or something.  I can't rule them out as being related to my problem, but judging by how long they've been present when the machine was otherwise operating without issue, I think there is something else going on here.  

 

I'll try uninstalling preclear and the statistics sender plugin and report back if that improves anything.


The call traces seem to be related to the Win8 VM I have running Blue Iris. I've left it shut down for the past 4 days and the server has been running everything else without any issues. I'm not really sure how to go about figuring out what's wrong with the Win8 VM. Nothing has really changed configuration-wise, except maybe some Windows updates...
