Freeze / crash issue resulting in network lock-up and BTRFS corruption?



Now something happened, but I'm not sure it's the same issue. The network is still up and the machine is still reachable, but all cores except one are stuck at 100% iowait, the mover is running but doesn't seem to make progress, and the syslog spits out this error:

 

Jun  5 00:39:29 Nethub shutdown[1443]: shutting down for system reboot
Jun  5 00:40:33 Nethub kernel: rcu: INFO: rcu_sched self-detected stall on CPU
Jun  5 00:40:33 Nethub kernel: rcu:     5-....: (240002 ticks this GP) idle=566/1/0x4000000000000002 softirq=18902205/18902205 fqs=58516
Jun  5 00:40:33 Nethub kernel: rcu:      (t=240004 jiffies g=37787229 q=503044)
Jun  5 00:40:33 Nethub kernel: Sending NMI from CPU 5 to CPUs 4:
Jun  5 00:40:33 Nethub kernel: NMI backtrace for cpu 4
Jun  5 00:40:33 Nethub kernel: CPU: 4 PID: 30342 Comm: kworker/u16:2 Tainted: G    B D W         4.19.107-Unraid #1
Jun  5 00:40:33 Nethub kernel: Hardware name: MSI MS-7A63/Z270 GAMING PRO CARBON (MS-7A63), BIOS 1.90 07/03/2018
Jun  5 00:40:33 Nethub kernel: Workqueue: btrfs-endio-write btrfs_endio_write_helper
Jun  5 00:40:33 Nethub kernel: RIP: 0010:native_queued_spin_lock_slowpath+0x11e/0x171
Jun  5 00:40:33 Nethub kernel: Code: 48 03 04 cd 20 37 db 81 48 89 10 8b 42 08 85 c0 75 04 f3 90 eb f5 48 8b 0a 48 85 c9 74 c9 0f 0d 09 8b 07 66 85 c0 74 04 f3 90 <eb> f5 41 89 c0 66 45 31 c0 44 39 c6 74 0a 48 85 c9 c6 07 01 75 1b
Jun  5 00:40:33 Nethub kernel: RSP: 0018:ffffc9000ce77908 EFLAGS: 00000202
Jun  5 00:40:33 Nethub kernel: RAX: 0000000000140101 RBX: ffff88880b9a8a00 RCX: 0000000000000000
Jun  5 00:40:33 Nethub kernel: RDX: ffff88884eb20740 RSI: 0000000000140000 RDI: ffff88880b9a8b60
Jun  5 00:40:33 Nethub kernel: RBP: ffff8881076c61a0 R08: 0000000000000005 R09: 0000000000000000
Jun  5 00:40:33 Nethub kernel: R10: ffff88880b9a8b60 R11: ffff88884e405301 R12: ffff888535e4f130
Jun  5 00:40:33 Nethub kernel: R13: ffff8882e09824e0 R14: ffff8888475a2000 R15: 0000000000000000
Jun  5 00:40:33 Nethub kernel: FS:  0000000000000000(0000) GS:ffff88884eb00000(0000) knlGS:0000000000000000
Jun  5 00:40:33 Nethub kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jun  5 00:40:33 Nethub kernel: CR2: 0000153317b04000 CR3: 0000000001e0a002 CR4: 00000000003606e0
Jun  5 00:40:33 Nethub kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Jun  5 00:40:33 Nethub kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Jun  5 00:40:33 Nethub kernel: Call Trace:
Jun  5 00:40:33 Nethub kernel: _raw_spin_lock+0x16/0x19
Jun  5 00:40:33 Nethub kernel: btrfs_add_delayed_tree_ref+0x214/0x2a4
Jun  5 00:40:33 Nethub kernel: btrfs_alloc_tree_block+0x483/0x510
Jun  5 00:40:33 Nethub kernel: alloc_tree_block_no_bg_flush+0x45/0x4d
Jun  5 00:40:33 Nethub kernel: __btrfs_cow_block+0x143/0x4ee
Jun  5 00:40:33 Nethub kernel: btrfs_cow_block+0x105/0x113
Jun  5 00:40:33 Nethub kernel: btrfs_search_slot+0x330/0x84a
Jun  5 00:40:33 Nethub kernel: btrfs_lookup_file_extent+0x47/0x61
Jun  5 00:40:33 Nethub kernel: __btrfs_drop_extents+0x16f/0xb12
Jun  5 00:40:33 Nethub kernel: ? next_state+0x9/0x13
Jun  5 00:40:33 Nethub kernel: ? __set_extent_bit+0x280/0x430
Jun  5 00:40:33 Nethub kernel: insert_reserved_file_extent.constprop.0+0x98/0x2cc
Jun  5 00:40:33 Nethub kernel: btrfs_finish_ordered_io+0x317/0x5d2
Jun  5 00:40:33 Nethub kernel: normal_work_helper+0xd0/0x1c7
Jun  5 00:40:33 Nethub kernel: process_one_work+0x16e/0x24f
Jun  5 00:40:33 Nethub kernel: worker_thread+0x1e2/0x2b8
Jun  5 00:40:33 Nethub kernel: ? rescuer_thread+0x2a7/0x2a7
Jun  5 00:40:33 Nethub kernel: kthread+0x10c/0x114
Jun  5 00:40:33 Nethub kernel: ? kthread_park+0x89/0x89
Jun  5 00:40:33 Nethub kernel: ret_from_fork+0x35/0x40
Jun  5 00:40:33 Nethub kernel: NMI backtrace for cpu 5
Jun  5 00:40:33 Nethub kernel: CPU: 5 PID: 32176 Comm: kworker/u16:1 Tainted: G    B D W         4.19.107-Unraid #1
Jun  5 00:40:33 Nethub kernel: Hardware name: MSI MS-7A63/Z270 GAMING PRO CARBON (MS-7A63), BIOS 1.90 07/03/2018
Jun  5 00:40:33 Nethub kernel: Workqueue: btrfs-endio-write btrfs_endio_write_helper
Jun  5 00:40:33 Nethub kernel: Call Trace:
Jun  5 00:40:33 Nethub kernel: <IRQ>
Jun  5 00:40:33 Nethub kernel: dump_stack+0x67/0x83
Jun  5 00:40:33 Nethub kernel: nmi_cpu_backtrace+0x71/0x83
Jun  5 00:40:33 Nethub kernel: ? lapic_can_unplug_cpu+0x97/0x97
Jun  5 00:40:33 Nethub kernel: nmi_trigger_cpumask_backtrace+0x57/0xd4
Jun  5 00:40:33 Nethub kernel: rcu_dump_cpu_stacks+0x8b/0xb4
Jun  5 00:40:33 Nethub kernel: rcu_check_callbacks+0x296/0x5a0
Jun  5 00:40:33 Nethub kernel: update_process_times+0x24/0x47
Jun  5 00:40:33 Nethub kernel: tick_sched_timer+0x36/0x64
Jun  5 00:40:33 Nethub kernel: __hrtimer_run_queues+0xb7/0x10b
Jun  5 00:40:33 Nethub kernel: ? tick_sched_handle.isra.0+0x2f/0x2f
Jun  5 00:40:33 Nethub kernel: hrtimer_interrupt+0xf4/0x20e
Jun  5 00:40:33 Nethub kernel: smp_apic_timer_interrupt+0x7b/0x93
Jun  5 00:40:33 Nethub kernel: apic_timer_interrupt+0xf/0x20
Jun  5 00:40:33 Nethub kernel: </IRQ>
Jun  5 00:40:33 Nethub kernel: RIP: 0010:native_queued_spin_lock_slowpath+0x6b/0x171
Jun  5 00:40:33 Nethub kernel: Code: 42 f0 8b 07 30 e4 09 c6 f7 c6 00 ff ff ff 74 0e 81 e6 00 ff 00 00 75 1a c6 47 01 00 eb 14 85 f6 74 0a 8b 07 84 c0 74 04 f3 90 <eb> f6 66 c7 07 01 00 c3 48 c7 c2 40 07 02 00 65 48 03 15 80 6a f8
Jun  5 00:40:33 Nethub kernel: RSP: 0018:ffffc9000ef4f908 EFLAGS: 00000202 ORIG_RAX: ffffffffffffff13
Jun  5 00:40:33 Nethub kernel: RAX: 0000000000140101 RBX: ffff88880b9a8a00 RCX: 0000000000004000
Jun  5 00:40:33 Nethub kernel: RDX: 0000000000000001 RSI: 0000000000000001 RDI: ffff88880b9a8b60
Jun  5 00:40:33 Nethub kernel: RBP: ffff88880226b8f0 R08: 0000000000000005 R09: 0000000000000000
Jun  5 00:40:33 Nethub kernel: R10: ffff88880b9a8b60 R11: ffff88884e405301 R12: ffff88852d64ced8
Jun  5 00:40:33 Nethub kernel: R13: ffff8881971515b0 R14: ffff8888475a2000 R15: 0000000000000000
Jun  5 00:40:33 Nethub kernel: _raw_spin_lock+0x16/0x19
Jun  5 00:40:33 Nethub kernel: btrfs_add_delayed_tree_ref+0x214/0x2a4
Jun  5 00:40:33 Nethub kernel: btrfs_alloc_tree_block+0x483/0x510
Jun  5 00:40:33 Nethub kernel: alloc_tree_block_no_bg_flush+0x45/0x4d
Jun  5 00:40:33 Nethub kernel: __btrfs_cow_block+0x143/0x4ee
Jun  5 00:40:33 Nethub kernel: btrfs_cow_block+0x105/0x113
Jun  5 00:40:33 Nethub kernel: btrfs_search_slot+0x330/0x84a
Jun  5 00:40:33 Nethub kernel: btrfs_lookup_file_extent+0x47/0x61
Jun  5 00:40:33 Nethub kernel: __btrfs_drop_extents+0x16f/0xb12
Jun  5 00:40:33 Nethub kernel: ? next_state+0x9/0x13
Jun  5 00:40:33 Nethub kernel: ? __set_extent_bit+0x280/0x430
Jun  5 00:40:33 Nethub kernel: insert_reserved_file_extent.constprop.0+0x98/0x2cc
Jun  5 00:40:33 Nethub kernel: btrfs_finish_ordered_io+0x317/0x5d2
Jun  5 00:40:33 Nethub kernel: normal_work_helper+0xd0/0x1c7
Jun  5 00:40:33 Nethub kernel: process_one_work+0x16e/0x24f
Jun  5 00:40:33 Nethub kernel: worker_thread+0x1e2/0x2b8
Jun  5 00:40:33 Nethub kernel: ? rescuer_thread+0x2a7/0x2a7
Jun  5 00:40:33 Nethub kernel: kthread+0x10c/0x114
Jun  5 00:40:33 Nethub kernel: ? kthread_park+0x89/0x89
Jun  5 00:40:33 Nethub kernel: ret_from_fork+0x35/0x40
Jun  5 00:40:33 Nethub kernel: Sending NMI from CPU 5 to CPUs 6:
Jun  5 00:40:33 Nethub kernel: NMI backtrace for cpu 6
Jun  5 00:40:33 Nethub kernel: CPU: 6 PID: 30933 Comm: kworker/u16:0 Tainted: G    B D W         4.19.107-Unraid #1
Jun  5 00:40:33 Nethub kernel: Hardware name: MSI MS-7A63/Z270 GAMING PRO CARBON (MS-7A63), BIOS 1.90 07/03/2018
Jun  5 00:40:33 Nethub kernel: Workqueue: btrfs-endio-write btrfs_endio_write_helper
Jun  5 00:40:33 Nethub kernel: RIP: 0010:native_queued_spin_lock_slowpath+0x63/0x171
Jun  5 00:40:33 Nethub kernel: Code: 2f 08 b8 00 01 00 00 0f 42 f0 8b 07 30 e4 09 c6 f7 c6 00 ff ff ff 74 0e 81 e6 00 ff 00 00 75 1a c6 47 01 00 eb 14 85 f6 74 0a <8b> 07 84 c0 74 04 f3 90 eb f6 66 c7 07 01 00 c3 48 c7 c2 40 07 02
Jun  5 00:40:33 Nethub kernel: RSP: 0018:ffffc9000d90f9e0 EFLAGS: 00000202
Jun  5 00:40:33 Nethub kernel: RAX: 0000000000000101 RBX: ffff888536993850 RCX: ffffc9000d90fb28
Jun  5 00:40:33 Nethub kernel: RDX: 0000000000000001 RSI: 0000000000000001 RDI: ffff888536993888
Jun  5 00:40:33 Nethub kernel: RBP: ffff88877e7452f8 R08: ffff88839f210ba0 R09: ffffc9000d90fb2c
Jun  5 00:40:33 Nethub kernel: R10: ffff88880b9a8b60 R11: 0000000000000000 R12: ffff88880b9a8b78
Jun  5 00:40:33 Nethub kernel: R13: ffff888536993888 R14: ffffc9000d90fb28 R15: 0000000000000000
Jun  5 00:40:33 Nethub kernel: FS:  0000000000000000(0000) GS:ffff88884eb80000(0000) knlGS:0000000000000000
Jun  5 00:40:33 Nethub kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jun  5 00:40:33 Nethub kernel: CR2: 0000153317b04000 CR3: 0000000001e0a002 CR4: 00000000003606e0
Jun  5 00:40:33 Nethub kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Jun  5 00:40:33 Nethub kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Jun  5 00:40:33 Nethub kernel: Call Trace:
Jun  5 00:40:33 Nethub kernel: _raw_spin_lock+0x16/0x19
Jun  5 00:40:33 Nethub kernel: update_existing_head_ref.isra.0+0x32/0x111
Jun  5 00:40:33 Nethub kernel: add_delayed_ref_head.isra.0+0x102/0x189
Jun  5 00:40:33 Nethub kernel: btrfs_add_delayed_tree_ref+0x231/0x2a4
Jun  5 00:40:33 Nethub kernel: btrfs_free_tree_block+0x86/0x1dd
Jun  5 00:40:33 Nethub kernel: __btrfs_cow_block+0x4a0/0x4ee
Jun  5 00:40:33 Nethub kernel: btrfs_cow_block+0x105/0x113
Jun  5 00:40:33 Nethub kernel: btrfs_search_slot+0x330/0x84a
Jun  5 00:40:33 Nethub kernel: btrfs_lookup_csum+0x4d/0x130
Jun  5 00:40:33 Nethub kernel: ? _cond_resched+0x1b/0x1e
Jun  5 00:40:33 Nethub kernel: ? kmem_cache_alloc+0xdf/0xeb
Jun  5 00:40:33 Nethub kernel: btrfs_csum_file_blocks+0x8b/0x563
Jun  5 00:40:33 Nethub kernel: add_pending_csums+0x40/0x5b
Jun  5 00:40:33 Nethub kernel: btrfs_finish_ordered_io+0x3d2/0x5d2
Jun  5 00:40:33 Nethub kernel: normal_work_helper+0xd0/0x1c7
Jun  5 00:40:33 Nethub kernel: process_one_work+0x16e/0x24f
Jun  5 00:40:33 Nethub kernel: worker_thread+0x1e2/0x2b8
Jun  5 00:40:33 Nethub kernel: ? rescuer_thread+0x2a7/0x2a7
Jun  5 00:40:33 Nethub kernel: kthread+0x10c/0x114
Jun  5 00:40:33 Nethub kernel: ? kthread_park+0x89/0x89
Jun  5 00:40:33 Nethub kernel: ret_from_fork+0x35/0x40
 

I tried to reboot the NAS, but then all cores were pinned at 80% iowait and 20% system, and even though it logged "System reboot NOW", it never rebooted. I also tried to generate diagnostics, but they never finished.

syslog.log
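
In case it helps with catching the next occurrence earlier: since the lock-up seems to build up as iowait climbs, something like the rough sketch below could be left running over SSH to watch per-core iowait. It only samples /proc/stat and prints the deltas; the 2-second interval is an arbitrary choice.

#!/usr/bin/env python3
"""Rough per-CPU iowait monitor: samples /proc/stat repeatedly and prints the
percentage of time each core spent in iowait during each interval."""
import time

def read_cpu_times():
    # Returns {cpu_name: (iowait_ticks, total_ticks)} parsed from /proc/stat.
    times = {}
    with open("/proc/stat") as f:
        for line in f:
            fields = line.split()
            if fields and fields[0].startswith("cpu") and fields[0] != "cpu":
                values = list(map(int, fields[1:]))
                # Field order: user nice system idle iowait irq softirq ...
                times[fields[0]] = (values[4], sum(values))
    return times

INTERVAL = 2  # seconds between samples; arbitrary choice

while True:
    before = read_cpu_times()
    time.sleep(INTERVAL)
    after = read_cpu_times()
    report = []
    for cpu, (io_new, total_new) in after.items():
        io_old, total_old = before[cpu]
        delta_total = (total_new - total_old) or 1
        report.append(f"{cpu}={100 * (io_new - io_old) / delta_total:.0f}%")
    print("iowait:", " ".join(report))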


This time it happened again, and as usual it locked up the network as well. I guess if the 10G card is at fault, it's a driver issue; otherwise it has nothing to do with the 10G card.

 

The only remaining components that I haven't swapped yet are the drives, the PSU and the Unraid install itself.


I have the same issue.

There is something interesting in my case.

My Unraid server usually crashes and brings down the whole network during the weekly parity check.

I use a 1G switch in my network. When the Unraid server crashes, the whole network goes down. I think it's a layer-2 problem rather than a layer-3 one, because other devices on the same network cannot even ping each other. Once the Unraid server is unplugged, the network recovers; if I plug it back in, the network instantly crashes again.
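
A rough way to tell a layer-2 (switch/ARP) problem from a layer-3 (routing) one, run from another machine on the LAN, might be something like the sketch below. The target address 192.168.1.1 is only an assumption; substitute your own gateway or any host on the same subnet.

#!/usr/bin/env python3
"""Rough layer-2 vs layer-3 check: ping a host on the local subnet, then see
whether its MAC address was resolved in the neighbour (ARP) table."""
import subprocess

TARGET = "192.168.1.1"  # assumed gateway / any host on the same subnet

# Layer-3-ish check: does an ICMP echo get through at all?
ping = subprocess.run(["ping", "-c", "3", "-W", "2", TARGET],
                      capture_output=True, text=True)
print("ping ok" if ping.returncode == 0 else "ping failed")

# Layer-2 check: did ARP resolution produce a MAC (lladdr) for the target?
neigh = subprocess.run(["ip", "neigh", "show", TARGET],
                       capture_output=True, text=True)
entry = neigh.stdout.strip()
if "lladdr" in entry:
    print("ARP resolved:", entry)
else:
    print("no ARP entry -> looks like a layer-2 problem:", entry or "(empty)")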

Anyone got any ideas? It really drives me crazy :(

24 minutes ago, utopiafar said:

I have the same issue.

Can you post the hardware specs of your setup? If it is entirely different from mine, maybe we can rule out hardware as the cause completely...

I have already swapped all components except the drives, the PSU and the Unraid install itself on the stick.

 

 

Can you post your current diagnostics file, so we can compare installed plugins etc?

nethub-diagnostics-20200615-1146.zip

10 minutes ago, permissionBRICK said:

Can you post the hardware specs of your setup? If it is entirely different from mine, maybe we can rule out hardware as the cause completely...

I have already swapped all components except the drives, the PSU and the Unraid install itself on the stick.

 

 

Can you post your current diagnostics file, so we can compare installed plugins etc?

nethub-diagnostics-20200615-1146.zip

unraid-diagnostics-20200615-0959 (1).zip


The issue happened again this weekend; at exactly 2020-06-13 17:10 the network went down. This time I found an error in the NAS's syslog on the syslog sync server. The error is timestamped about 2 minutes before the lockdown occurred, and since it is the only message after hours of nothing, and the lockdown gradually gets worse until it takes out the entire network, it might very well be the cause, or at least related to the issue:

syslog.log
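
For anyone else digging through a mirrored syslog after a crash, something like the sketch below can pull out just the lines from the minutes leading up to the lock-up. The file path, cutoff time and window are placeholders for your own setup (classic syslog timestamps carry no year, so one has to be assumed).

#!/usr/bin/env python3
"""Print every syslog line from the N minutes before a known lock-up time.
Path, cutoff and window are placeholders -- adjust for your own setup."""
from datetime import datetime, timedelta

LOGFILE = "syslog.log"                  # mirrored log from the syslog server
CUTOFF = datetime(2020, 6, 13, 17, 10)  # when the network went down
WINDOW = timedelta(minutes=10)          # how far back to look

with open(LOGFILE, errors="replace") as f:
    for line in f:
        # Classic syslog format: "Jun  5 00:39:29 host ..." (first 15 chars)
        try:
            stamp = datetime.strptime(line[:15], "%b %d %H:%M:%S")
        except ValueError:
            continue  # skip lines without a parseable timestamp
        stamp = stamp.replace(year=CUTOFF.year)  # syslog has no year field
        if CUTOFF - WINDOW <= stamp <= CUTOFF:
            print(line.rstrip())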

  • 4 weeks later...

I thought the problem was fixed this time when I replaced the 10G card with an SFP+ card, but it happened again. This time, with the SFP+ card, it didn't take down the network and the server was still accessible at first, but I got the same CPU stall errors in the syslog again, and after a few minutes the server stopped responding. I have found several other topics, and nobody who has encountered these CPU stall errors seems to have any solution other than downgrading, but I have been getting these issues for several versions now, so I have no idea if I can downgrade that far...

 

Also, I managed to get a snapshot of netdata when it happened; one of the cores seems to stall on SOFTIRQ, while another one stalls on SYSTEM.
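
Regarding the core stuck on SOFTIRQ: a rough way to see which softirq type is actually firing is to diff /proc/softirqs over a short interval, roughly like the sketch below; the 2-second interval is an arbitrary choice.

#!/usr/bin/env python3
"""Diff /proc/softirqs over a short interval to see which softirq type
(TIMER, NET_RX, SCHED, ...) is firing hardest on each CPU."""
import time

def read_softirqs():
    # Returns (cpu_names, {softirq_name: [per-cpu counts]}).
    with open("/proc/softirqs") as f:
        lines = f.read().splitlines()
    cpus = lines[0].split()
    counts = {}
    for line in lines[1:]:
        name, *values = line.split()
        counts[name.rstrip(":")] = list(map(int, values))
    return cpus, counts

INTERVAL = 2  # seconds; arbitrary choice

cpus, before = read_softirqs()
time.sleep(INTERVAL)
_, after = read_softirqs()

for name, new in after.items():
    deltas = [n - o for n, o in zip(new, before[name])]
    if any(deltas):
        per_cpu = " ".join(f"{cpu}:{d}" for cpu, d in zip(cpus, deltas))
        print(f"{name:>10} {per_cpu}")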

 

Anyone got any more ideas what I can do to try and fix this?

 

Up to now I have replaced every single hardware component except the hard drives, the USB key and the PSU, and the issue persists.

 

Do I need to reinstall Unraid from scratch on a new USB key?

 

syslog.log
image.thumb.png (netdata snapshot)

  • 1 month later...
  • 2 months later...
On 8/27/2020 at 9:22 AM, permissionBRICK said:

Update: It's looking like I finally managed to fix the issue. The last thing I changed was uninstalling everything network-related from Nerd Tools and updating the rest. The server has been up without issues since the last post.

Were you able to identify which Nerd Tools package was causing the instability?

