One/Some Plugins seem to be crashing my system - but which?


meep

ISSUE SUMMARY
Long story* short, I recently changed and updated parts of my system, which meant I needed a fresh parity sync.

Every time I run this, after a few hours, my system crashes (details below), requiring a restart and rinse & repeat.

I tried booting in safe mode, and the system and sync ran perfectly (~9 hrs).

 

I deduce (maybe incorrectly) that one or several of the plugins is causing my system to become unstable. But which one?

 

My next step will be to try to identify the rogue plugin through 50% tests: I enable half of the plugins, see if I can get the system to crash, and whittle things down that way. However, if there is more than one culprit, that might not help, and it will take a long time.
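
For anyone following along, the halving approach above can be sketched roughly as below. This is only an illustration under strong assumptions: exactly one plugin is at fault, and a crash reliably reproduces whenever that plugin is enabled. The function name `find_culprit`, the `crashes` callback and the plugin names are hypothetical; in practice each call to `crashes` stands for a manual test run of several hours.

```python
from typing import Callable, List, Sequence


def find_culprit(plugins: Sequence[str],
                 crashes: Callable[[List[str]], bool]) -> str:
    """Narrow down a single crashing plugin by halving the candidate set.

    Assumes `plugins` is non-empty, exactly one plugin is the culprit,
    and `crashes(enabled)` returns True whenever the culprit is enabled.
    """
    candidates = list(plugins)
    while len(candidates) > 1:
        half = candidates[: len(candidates) // 2]
        # Enable only the first half; if the crash reproduces,
        # the culprit must be in that half, otherwise in the rest.
        if crashes(half):
            candidates = half
        else:
            candidates = candidates[len(half):]
    return candidates[0]


if __name__ == "__main__":
    # Hypothetical plugin names; "plugin_d" plays the culprit here.
    names = ["plugin_a", "plugin_b", "plugin_c", "plugin_d", "plugin_e"]
    print(find_culprit(names, lambda enabled: "plugin_d" in enabled))
```

With N plugins this needs about log2(N) test runs, so a dozen plugins would take three or four runs. If two plugins only crash in combination, the single-culprit assumption breaks and the result cannot be trusted, which is the caveat raised above.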

 

I post in the hope that someone more familiar with syslogs and diagnostics might be able to point me in the right direction.

 

WHAT'S THE PROBLEM?

Essentially, my system runs for a few hours under load (parity sync or basic rsync data copy tasks), but will then fall over with errors similar to this typical example:

 

Apr 15 22:30:45 UnRaid kernel: docker0: port 1(vethb543e50) entered disabled state
### [PREVIOUS LINE REPEATED 1 TIMES] ###
Apr 15 22:30:47 UnRaid kernel: device vethb543e50 left promiscuous mode
Apr 15 22:30:47 UnRaid kernel: docker0: port 1(vethb543e50) entered disabled state
Apr 15 22:35:02 UnRaid emhttpd: cmd: /usr/local/emhttp/plugins/dynamix/scripts/tail_log syslog
Apr 15 22:35:44 UnRaid kernel: BUG: unable to handle kernel paging request at ffff88982491f000
Apr 15 22:35:44 UnRaid kernel: PGD 2401067 P4D 2401067 PUD 1825578063 PMD 1825309063 PTE 800f00182491f163
Apr 15 22:35:44 UnRaid kernel: Oops: 0009 [#1] SMP NOPTI
Apr 15 22:35:44 UnRaid kernel: CPU: 1 PID: 44323 Comm: unraidd0 Tainted: G           O      4.19.107-Unraid #1
Apr 15 22:35:44 UnRaid kernel: Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./X399 Taichi, BIOS P3.90 12/04/2019
Apr 15 22:35:44 UnRaid kernel: RIP: 0010:raid6_avx24_gen_syndrome+0xe2/0x19c
Apr 15 22:35:44 UnRaid kernel: Code: c4 41 0d fc f6 c5 d5 db e8 c5 c5 db f8 c5 15 db e8 c5 05 db f8 c5 dd ef e5 c5 cd ef f7 c4 41 1d ef e5 c4 41 0d ef f7 48 8b 0a <c5> fd 6f 2c 01 c4 a1 7d 6f 3c 01 c4 21 7d 6f 2c 09 c4 21 7d 6f 3c
Apr 15 22:35:44 UnRaid kernel: RSP: 0018:ffffc90010907d50 EFLAGS: 00010202
Apr 15 22:35:44 UnRaid kernel: RAX: 0000000000000000 RBX: ffff888b83ad4c40 RCX: ffff88982491f000
Apr 15 22:35:44 UnRaid kernel: RDX: ffff888b83ad4c60 RSI: 0000000000000000 RDI: 0000000000000004
Apr 15 22:35:44 UnRaid kernel: RBP: 0000000000000004 R08: 0000000000000020 R09: 0000000000000040
Apr 15 22:35:44 UnRaid kernel: R10: 0000000000000060 R11: ffff888b83ad4c60 R12: ffff8897e2e39000
Apr 15 22:35:44 UnRaid kernel: R13: 0000000000001000 R14: ffff8897ec3a9000 R15: 0000000000000004
Apr 15 22:35:44 UnRaid kernel: FS:  0000000000000000(0000) GS:ffff88982ce40000(0000) knlGS:0000000000000000
Apr 15 22:35:44 UnRaid kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Apr 15 22:35:44 UnRaid kernel: CR2: ffff88982491f000 CR3: 0000001824e78000 CR4: 00000000003406e0
Apr 15 22:35:44 UnRaid kernel: Call Trace:
Apr 15 22:35:44 UnRaid kernel: raid6_generate_pq+0x7d/0xb0 [md_mod]
Apr 15 22:35:44 UnRaid kernel: unraidd+0xfae/0x136e [md_mod]
Apr 15 22:35:44 UnRaid kernel: ? __schedule+0x4f7/0x548
Apr 15 22:35:44 UnRaid kernel: ? md_thread+0xee/0x115 [md_mod]
Apr 15 22:35:44 UnRaid kernel: md_thread+0xee/0x115 [md_mod]
Apr 15 22:35:44 UnRaid kernel: ? wait_woken+0x6a/0x6a
Apr 15 22:35:44 UnRaid kernel: ? md_open+0x2c/0x2c [md_mod]
Apr 15 22:35:44 UnRaid kernel: kthread+0x10c/0x114
Apr 15 22:35:44 UnRaid kernel: ? kthread_park+0x89/0x89
Apr 15 22:35:44 UnRaid kernel: ret_from_fork+0x1f/0x40
Apr 15 22:35:44 UnRaid kernel: Modules linked in: iptable_mangle xt_nat veth ipt_MASQUERADE iptable_filter iptable_nat nf_nat_ipv4 nf_nat ip_tables md_mod xfs nfsd lockd grace sunrpc nct6775 hwmon_vid bonding igb(O) wmi_bmof mxm_wmi edac_mce_amd kvm_amd kvm mpt3sas crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel pcbc i2c_piix4 i2c_core aesni_intel aes_x86_64 crypto_simd cryptd raid_class scsi_transport_sas ahci k10temp ccp nvme glue_helper libahci nvme_core wmi button pcc_cpufreq acpi_cpufreq [last unloaded: md_mod]
Apr 15 22:35:44 UnRaid kernel: CR2: ffff88982491f000
Apr 15 22:35:44 UnRaid kernel: ---[ end trace 10a9007de2546ba5 ]---
Apr 15 22:35:44 UnRaid kernel: RIP: 0010:raid6_avx24_gen_syndrome+0xe2/0x19c
Apr 15 22:35:44 UnRaid kernel: Code: c4 41 0d fc f6 c5 d5 db e8 c5 c5 db f8 c5 15 db e8 c5 05 db f8 c5 dd ef e5 c5 cd ef f7 c4 41 1d ef e5 c4 41 0d ef f7 48 8b 0a <c5> fd 6f 2c 01 c4 a1 7d 6f 3c 01 c4 21 7d 6f 2c 09 c4 21 7d 6f 3c
Apr 15 22:35:44 UnRaid kernel: RSP: 0018:ffffc90010907d50 EFLAGS: 00010202
Apr 15 22:35:44 UnRaid kernel: RAX: 0000000000000000 RBX: ffff888b83ad4c40 RCX: ffff88982491f000
Apr 15 22:35:44 UnRaid kernel: RDX: ffff888b83ad4c60 RSI: 0000000000000000 RDI: 0000000000000004
Apr 15 22:35:44 UnRaid kernel: RBP: 0000000000000004 R08: 0000000000000020 R09: 0000000000000040
Apr 15 22:35:44 UnRaid kernel: R10: 0000000000000060 R11: ffff888b83ad4c60 R12: ffff8897e2e39000
Apr 15 22:35:44 UnRaid kernel: R13: 0000000000001000 R14: ffff8897ec3a9000 R15: 0000000000000004
Apr 15 22:35:44 UnRaid kernel: FS:  0000000000000000(0000) GS:ffff88982ce40000(0000) knlGS:0000000000000000
Apr 15 22:35:44 UnRaid kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Apr 15 22:35:44 UnRaid kernel: CR2: ffff88982491f000 CR3: 0000001824e78000 CR4: 00000000003406e0
Apr 15 22:35:44 UnRaid kernel: ------------[ cut here ]------------
Apr 15 22:35:44 UnRaid kernel: WARNING: CPU: 1 PID: 44323 at kernel/exit.c:778 do_exit+0x64/0x922
Apr 15 22:35:44 UnRaid kernel: Modules linked in: iptable_mangle xt_nat veth ipt_MASQUERADE iptable_filter iptable_nat nf_nat_ipv4 nf_nat ip_tables md_mod xfs nfsd lockd grace sunrpc nct6775 hwmon_vid bonding igb(O) wmi_bmof mxm_wmi edac_mce_amd kvm_amd kvm mpt3sas crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel pcbc i2c_piix4 i2c_core aesni_intel aes_x86_64 crypto_simd cryptd raid_class scsi_transport_sas ahci k10temp ccp nvme glue_helper libahci nvme_core wmi button pcc_cpufreq acpi_cpufreq [last unloaded: md_mod]
Apr 15 22:35:44 UnRaid kernel: CPU: 1 PID: 44323 Comm: unraidd0 Tainted: G      D    O      4.19.107-Unraid #1
Apr 15 22:35:44 UnRaid kernel: Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./X399 Taichi, BIOS P3.90 12/04/2019
Apr 15 22:35:44 UnRaid kernel: RIP: 0010:do_exit+0x64/0x922
Apr 15 22:35:44 UnRaid kernel: Code: 39 d0 75 1a 48 8b 48 10 48 8d 50 10 48 39 d1 75 0d 48 8b 50 20 48 83 c0 20 48 39 c2 74 0e 48 c7 c7 79 fc d2 81 e8 6f b2 03 00 <0f> 0b 65 8b 05 28 2a fc 7e 25 00 ff 1f 00 48 c7 c7 b9 03 d4 81 89
Apr 15 22:35:44 UnRaid kernel: RSP: 0018:ffffc90010907ee8 EFLAGS: 00010046
Apr 15 22:35:44 UnRaid kernel: RAX: 0000000000000024 RBX: ffff8897e3496c00 RCX: 0000000000000007
Apr 15 22:35:44 UnRaid kernel: RDX: 0000000000000000 RSI: 0000000000000002 RDI: ffff88982ce564f0
Apr 15 22:35:44 UnRaid kernel: RBP: 0000000000000009 R08: 000000000000000f R09: ffff8880000bdc00
Apr 15 22:35:44 UnRaid kernel: R10: 0000000000000000 R11: 0000000000000044 R12: ffff88982491f000
Apr 15 22:35:44 UnRaid kernel: R13: ffff8897e3496c00 R14: 0000000000000009 R15: 0000000000000009
Apr 15 22:35:44 UnRaid kernel: FS:  0000000000000000(0000) GS:ffff88982ce40000(0000) knlGS:0000000000000000
Apr 15 22:35:44 UnRaid kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Apr 15 22:35:44 UnRaid kernel: CR2: ffff88982491f000 CR3: 0000001824e78000 CR4: 00000000003406e0
Apr 15 22:35:44 UnRaid kernel: Call Trace:
Apr 15 22:35:44 UnRaid kernel: ? md_open+0x2c/0x2c [md_mod]
Apr 15 22:35:44 UnRaid kernel: ? kthread+0x10c/0x114
Apr 15 22:35:44 UnRaid kernel: rewind_stack_do_exit+0x17/0x20
Apr 15 22:35:44 UnRaid kernel: ---[ end trace 10a9007de2546ba6 ]---
Apr 15 22:35:51 UnRaid kernel: XFS (sdp1): Unmounting Filesystem
Apr 15 23:01:42 UnRaid emhttpd: shcmd (5654): /usr/sbin/hdparm -y /dev/nvme1n1
Apr 15 23:01:42 UnRaid root:  HDIO_DRIVE_CMD(standby) failed: Inappropriate ioctl for device
Apr 15 23:01:42 UnRaid root: 
Apr 15 23:01:42 UnRaid root: /dev/nvme1n1:
Apr 15 23:01:42 UnRaid root:  issuing standby command
Apr 15 23:01:42 UnRaid emhttpd: shcmd (5654): exit status: 25
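
To compare several crash logs quickly, something like the following could count which kernel function each trace faulted in. It is a rough sketch that assumes syslog lines shaped like the excerpt above (`<date> <host> kernel: RIP: 0010:<function>+0x...`). The names `RIP_RE` and `count_fault_sites` are made up for illustration.

```python
import re
from collections import Counter
from typing import Iterable

# Matches e.g. "... kernel: RIP: 0010:raid6_avx24_gen_syndrome+0xe2/0x19c"
# and captures the function name before the "+offset" part.
RIP_RE = re.compile(r"kernel: RIP: [0-9a-f]+:(\w+)\+0x")


def count_fault_sites(lines: Iterable[str]) -> Counter:
    """Count how often each function appears on a kernel 'RIP:' line."""
    sites = Counter()
    for line in lines:
        match = RIP_RE.search(line)
        if match:
            sites[match.group(1)] += 1
    return sites


if __name__ == "__main__":
    sample = [
        "Apr 15 22:35:44 UnRaid kernel: RIP: 0010:raid6_avx24_gen_syndrome+0xe2/0x19c",
        "Apr 15 22:35:44 UnRaid kernel: RIP: 0010:do_exit+0x64/0x922",
    ]
    for function, count in count_fault_sites(sample).most_common():
        print(function, count)
```

To run it against a saved log, pass an open file object to `count_fault_sites`; the log location depends on your setup. If every crash lands in the same function, that points at one code path; if the fault site moves around between crashes, that would be worth mentioning too.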

Sometimes the system will partially fail, in that disks become inaccessible or the UI flakes out, but it's sometimes possible to shut down from the command line. In other cases, the system freezes with blinking keyboard lights, and only a hard reset will do.

 

Switching to safe mode makes the problem go away.

 

I have attached a diagnostics archive for both the crash-state system, and for the working safe-mode system.

 

WHAT HAVE I TRIED?

The following have had no impact:

  • Upgrade from system 6.8.2 to 6.8.3
  • Disabled both Docker & VMs (individually & together)
  • Updated all plugins to current version
  • Optimised BIOS for stability (disabled C-states, reduced memory frequency and a few other bits & bobs)
  • Removed some smaller drives from array & new config
  • Removed some PCIe devices

 

 

*WHAT INSTIGATED ALL THIS?

I made a goodly number of changes to my system over the weekend including adding and shuffling PCIe devices, and adding extra RAM, some rewiring etc.

On reboot, my dual parity drives disappeared and, when I resurrected them, the system called for a parity sync. I started this, but during the sync, I noticed one of my array drives was offline, showing as unmountable and needing to be formatted.

It was a superblock issue on XFS (input/output error) and no amount of coaxing could get it back. Fortunately, I could mount the drive and rsync the contents off to a spare unassigned device.

 

During this process, my USB key also showed up as blacklisted, so in the space of a few hours, I lost parity, one of my array drives and my system 😞

 

I got the USB key back through a Windows repair and, with the bad disk's contents saved, tossed the offender out of the array. (I tried all the XFS repair stuff, a complete disk zero via dd and some other things, but I could not get that drive to format again.)

 

I transferred the saved files back to one of the other array disks and got all my data back online. I then re-added the parity disks and started the sync.

 

Throughout all of this, I was getting frequent crashes as described, and only when my 3rd parity sync failed did I move to safe mode. I've now got the array working and parity in place. I just need to figure out how I can get stability back with my plugins enabled.

 

Thanks for reading.

 

unraid-diagnostics-20200416-1825_safemode_stable.zip

unraid-diagnostics-20200415-2332_plugins_crash.zip
