Unraid 6.8.1 unresponsive and CPU overload


orem684


Earlier today I was using WireGuard to remotely access my LAN when my connection became unresponsive. When I arrived home I noticed that my CPU was showing as overloaded (almost all cores at about 100%), Docker and VMs were disabled, and all shares were inaccessible. I attempted a couple of hard shutdowns, but Unraid booted back up into the same state. Here is the section of the logs where an error first appears:

Jan 22 17:46:17 Tower kernel: ------------[ cut here ]------------
Jan 22 17:46:17 Tower kernel: kernel BUG at fs/btrfs/ctree.c:3246!
Jan 22 17:46:17 Tower kernel: invalid opcode: 0000 [#1] SMP NOPTI
Jan 22 17:46:17 Tower kernel: CPU: 1 PID: 86 Comm: kworker/u64:3 Not tainted 4.19.94-Unraid #1
Jan 22 17:46:17 Tower kernel: Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./X570M Pro4, BIOS P1.50 07/15/2019
Jan 22 17:46:17 Tower kernel: Workqueue: btrfs-endio-write btrfs_endio_write_helper
Jan 22 17:46:17 Tower kernel: RIP: 0010:btrfs_set_item_key_safe+0xc0/0x136
Jan 22 17:46:17 Tower kernel: Code: 00 4c 89 ef 48 8d 74 24 07 48 63 d2 48 6b d2 19 48 83 c2 65 e8 78 16 04 00 48 89 de 48 8d 7c 24 07 e8 95 f4 ff ff 85 c0 7f 02 <0f> 0b 48 8b 43 09 49 63 d4 b9 11 00 00 00 4c 89 ef 48 6b d2 19 48
Jan 22 17:46:17 Tower kernel: RSP: 0018:ffffc90001d6fbc0 EFLAGS: 00010286
Jan 22 17:46:17 Tower kernel: RAX: 00000000ffffffff RBX: ffffc90001d6fca5 RCX: 000000000000006c
Jan 22 17:46:17 Tower kernel: RDX: 0000000000000000 RSI: ffffc90001d6fca5 RDI: ffffc90001d6fb9f
Jan 22 17:46:17 Tower kernel: RBP: ffff8882f9ddeaf0 R08: 0000000000001000 R09: 0000160000000000
Jan 22 17:46:17 Tower kernel: R10: ffff888000000000 R11: 0000000000000000 R12: 000000000000003f
Jan 22 17:46:17 Tower kernel: R13: ffff8882ddf7e9d8 R14: 00000000000032c0 R15: ffff8883cd5f8000
Jan 22 17:46:17 Tower kernel: FS:  0000000000000000(0000) GS:ffff88840e640000(0000) knlGS:0000000000000000
Jan 22 17:46:17 Tower kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jan 22 17:46:17 Tower kernel: CR2: 000056445107a5a4 CR3: 000000033b2e2000 CR4: 00000000003406e0
Jan 22 17:46:17 Tower kernel: Call Trace:
Jan 22 17:46:17 Tower kernel: __btrfs_drop_extents+0x5e2/0xb12
Jan 22 17:46:17 Tower kernel: insert_reserved_file_extent.constprop.0+0x98/0x2cc
Jan 22 17:46:17 Tower kernel: btrfs_finish_ordered_io+0x317/0x5d2
Jan 22 17:46:17 Tower kernel: normal_work_helper+0xd0/0x1c7
Jan 22 17:46:17 Tower kernel: process_one_work+0x16e/0x24f
Jan 22 17:46:17 Tower kernel: worker_thread+0x1e2/0x2b8
Jan 22 17:46:17 Tower kernel: ? rescuer_thread+0x2a7/0x2a7
Jan 22 17:46:17 Tower kernel: kthread+0x10c/0x114
Jan 22 17:46:17 Tower kernel: ? kthread_park+0x89/0x89
Jan 22 17:46:17 Tower kernel: ret_from_fork+0x22/0x40
Jan 22 17:46:17 Tower kernel: Modules linked in: veth xt_CHECKSUM ipt_REJECT ip6table_mangle ip6table_nat nf_nat_ipv6 iptable_mangle ip6table_filter ip6_tables vhost_net tun vhost tap macvlan xt_nat ipt_MASQUERADE iptable_filter iptable_nat nf_nat_ipv4 nf_nat ip_tables xfs md_mod nct6775 hwmon_vid wireguard ip6_udp_tunnel udp_tunnel bonding edac_mce_amd kvm_amd kvm crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel pcbc igb aesni_intel aes_x86_64 crypto_simd cryptd k10temp i2c_piix4 i2c_algo_bit glue_helper wmi_bmof i2c_core ccp ahci libahci wmi button pcc_cpufreq acpi_cpufreq
Jan 22 17:46:17 Tower kernel: ---[ end trace 69d3dbdcd1db1e30 ]---
Jan 22 17:46:17 Tower kernel: RIP: 0010:btrfs_set_item_key_safe+0xc0/0x136
Jan 22 17:46:17 Tower kernel: Code: 00 4c 89 ef 48 8d 74 24 07 48 63 d2 48 6b d2 19 48 83 c2 65 e8 78 16 04 00 48 89 de 48 8d 7c 24 07 e8 95 f4 ff ff 85 c0 7f 02 <0f> 0b 48 8b 43 09 49 63 d4 b9 11 00 00 00 4c 89 ef 48 6b d2 19 48
Jan 22 17:46:17 Tower kernel: RSP: 0018:ffffc90001d6fbc0 EFLAGS: 00010286
Jan 22 17:46:17 Tower kernel: RAX: 00000000ffffffff RBX: ffffc90001d6fca5 RCX: 000000000000006c
Jan 22 17:46:17 Tower kernel: RDX: 0000000000000000 RSI: ffffc90001d6fca5 RDI: ffffc90001d6fb9f
Jan 22 17:46:17 Tower kernel: RBP: ffff8882f9ddeaf0 R08: 0000000000001000 R09: 0000160000000000
Jan 22 17:46:17 Tower kernel: R10: ffff888000000000 R11: 0000000000000000 R12: 000000000000003f
Jan 22 17:46:17 Tower kernel: R13: ffff8882ddf7e9d8 R14: 00000000000032c0 R15: ffff8883cd5f8000
Jan 22 17:46:17 Tower kernel: FS:  0000000000000000(0000) GS:ffff88840e640000(0000) knlGS:0000000000000000
Jan 22 17:46:17 Tower kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jan 22 17:46:17 Tower kernel: CR2: 000056445107a5a4 CR3: 000000033b2e2000 CR4: 00000000003406e0
Jan 22 17:46:54 Tower webGUI: Successful login user root from 192.168.1.81
Jan 22 17:47:04 Tower sSMTP[22829]: Creating SSL connection to host
Jan 22 17:47:04 Tower sSMTP[22829]: SSL connection using TLS_AES_256_GCM_SHA384
Jan 22 17:47:06 Tower sSMTP[22829]: Sent mail for ****** (221 2.0.0 closing connection y197sm201426pfc.79 - gsmtp) uid=0 username=root outbytes=665
Jan 22 17:47:10 Tower emhttpd: req (1): csrf_token=****************&title=System Log&cmd=/webGui/scripts/tail_log&arg1=syslog

This is my first time posting here, so please let me know if I should include additional information.
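
For anyone triaging a similar log, here is a minimal Python sketch (not an Unraid tool) that pulls the `kernel BUG at` location and the call-trace function names out of a syslog excerpt like the one above. The function name `summarize_oops` and both regular expressions are assumptions made for illustration; they are only matched against the line shapes shown in this post.

```python
import re

# Hypothetical helper, written against the line format in the log above.
BUG_RE = re.compile(r"kernel BUG at (?P<file>[^:\s]+):(?P<line>\d+)!")
# A call-trace frame looks like "kernel: func+0x5e2/0xb12"; a leading "? "
# marks an address the kernel considers unreliable.
FRAME_RE = re.compile(
    r"kernel:\s+(?P<maybe>\? )?(?P<func>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+\s*$"
)


def summarize_oops(log_text):
    """Return one dict per 'kernel BUG at' line with its reliable trace frames."""
    reports = []
    current = None
    in_trace = False
    for line in log_text.splitlines():
        bug = BUG_RE.search(line)
        if bug:
            current = {
                "file": bug.group("file"),
                "line": int(bug.group("line")),
                "frames": [],
            }
            reports.append(current)
            in_trace = False
            continue
        if current is None:
            continue
        if "Call Trace:" in line:
            in_trace = True
            continue
        if in_trace:
            frame = FRAME_RE.search(line)
            if frame is None:
                in_trace = False  # first non-frame line ends the trace
            elif not frame.group("maybe"):
                current["frames"].append(frame.group("func"))
    return reports
```

Run on the excerpt above, this should report `fs/btrfs/ctree.c` line 3246 with `__btrfs_drop_extents` as the first frame, which makes it quick to see that the crash is inside btrfs write completion rather than in Docker or WireGuard code.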

4 minutes ago, Squid said:

Post the entire diagnostics.zip file (Tools - Diagnostics)

I was able to start into safe mode to get the diagnostics.zip. I also found out that when I start the array again in safe mode I go back to the unresponsive state. But when I disable Docker before starting the array, I am able to get the shares back up and access most settings. I am currently running a parity check in safe mode.

tower-diagnostics-20200122-1926.zip

44 minutes ago, orem684 said:

I was able to start into safe mode to get the diagnostics.zip. I also found out that when I start the array again in safe mode I go back to the unresponsive state. But when I disable Docker before starting the array, I am able to get the shares back up and access most settings. I am currently running a parity check in safe mode.

tower-diagnostics-20200122-1926.zip 79.72 kB · 1 download

Ultimately what we're looking for is what happens when the array starts. Configure the syslog server to mirror the syslog to the flash drive (https://forums.unraid.net/topic/46802-faq-for-unraid-v6/page/2/?tab=comments#comment-781601), then start the array. If it's unresponsive after a few minutes, you'll have to reboot and then post the syslog.txt file that will be in the logs folder on the flash drive.
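
Once the mirrored syslog has survived a reboot, the interesting part is usually the last lines written before the hang. A minimal Python sketch for isolating them follows. It assumes the mirrored file keeps accumulating across reboots and that each boot starts with a kernel `Linux version` line; the function names and the default of 50 lines are hypothetical choices, not anything Unraid provides.

```python
def split_boots(log_text, marker="kernel: Linux version"):
    """Split a syslog into per-boot sessions at each kernel banner line."""
    sessions = []
    current = []
    for line in log_text.splitlines():
        # A new banner starts a new session (assumption: one banner per boot).
        if marker in line and current:
            sessions.append(current)
            current = []
        current.append(line)
    if current:
        sessions.append(current)
    return sessions


def tail_before_last_reboot(log_text, n=50):
    """Return the last n lines of the boot session preceding the newest one."""
    sessions = split_boots(log_text)
    if len(sessions) < 2:
        return []  # only one boot recorded, nothing precedes it
    return sessions[-2][-n:]
```

For example, `tail_before_last_reboot(open("syslog.txt").read())` would give the final lines of the session that ended in the forced reboot, which is the part worth posting first.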

10 hours ago, johnnie.black said:

A btrfs process is crashing, but I can't see any other btrfs problem on any of the file systems, so I would start by running memtest, since btrfs is very intolerant of memory errors. Also make sure your RAM isn't overclocked, and respect the maximum supported speed for your configuration.

 

[Image: 2ndgen.jpg]

memtest

I lowered my RAM overclock to 2400 per the table you posted and ran a memtest, which passed with 0 errors. Unfortunately, I booted back up to the same issue. The memtest and syslog are attached. Any other thoughts?

syslog

11 hours ago, johnnie.black said:

I don't see anything out of the ordinary in this syslog. Does it cover the crash?

So I think I figured out the issue. I noticed that the freezing only started after enabling Docker, so I decided to delete my image and build a new one. I brought back two of my containers from the template, and so far I've had about 12 hours of uptime with the two containers running. I'm guessing my original Docker image must have been corrupted. Thanks everyone for your help.

4 hours ago, EgyptianSnakeLegs said:

Let us know if that continues to be the solution.  I'm still having stability issues with my system, and this is something I hadn't considered.

So I just did a reboot and got out of safe mode with no issues. I am running three dockers (pi-hole, unifi controller and Plex) with no stability issues and no errors in the logs so far. I'll continue to slowly add my containers back while monitoring for stability.  I'll let you know if I see any problems.  

12 hours ago, orem684 said:

I decided to delete my image and build a new one.

The Docker image could have been the issue. I did see a couple of btrfs crashes earlier, but no specific filesystem was mentioned. I still recommend running RAM at non-overclocked speeds, especially with Ryzen; there have been previous cases here of instability and even data corruption.
