
Unraid 6.9 keeps crashing



Hello,

 

I have been using Unraid for over 4 years now and have been very satisfied with it. However, in the past 2 months I have run into a problem that makes my server almost unusable.

 

This problem also existed on Unraid 6.8-rc4; I updated to 6.9.0-beta1 in the hope that it would fix the problem.

 

The problem is that 8-72 hours after booting, the server becomes unresponsive. Ping attempts fail. The machine itself is still running, but I cannot interact with Unraid in any way. This means that roughly once a day I have to force-restart the server, which is not a viable option going forward.

 

I've looked through the syslog and it appears that whenever this happens, the following output is logged every 3 minutes:

 

May 16 00:54:29 NAS kernel: rcu: INFO: rcu_sched self-detected stall on CPU
May 16 00:54:29 NAS kernel: rcu: 	6-....: (599938 ticks this GP) idle=6ee/1/0x4000000000000002 softirq=92762467/92762467 fqs=149960 
May 16 00:54:29 NAS kernel: 	(t=600010 jiffies g=105284805 q=28641)
May 16 00:54:29 NAS kernel: NMI backtrace for cpu 6
May 16 00:54:29 NAS kernel: CPU: 6 PID: 3531 Comm: du Tainted: G      D           5.5.8-Unraid #1
May 16 00:54:29 NAS kernel: Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./C2750D4I, BIOS P2.90 01/26/2016
May 16 00:54:29 NAS kernel: Call Trace:
May 16 00:54:29 NAS kernel: <IRQ>
May 16 00:54:29 NAS kernel: dump_stack+0x64/0x7c
May 16 00:54:29 NAS kernel: ? lapic_can_unplug_cpu+0x8e/0x8e
May 16 00:54:29 NAS kernel: nmi_cpu_backtrace+0x73/0x85
May 16 00:54:29 NAS kernel: nmi_trigger_cpumask_backtrace+0x56/0xd3
May 16 00:54:29 NAS kernel: rcu_dump_cpu_stacks+0x89/0xb0
May 16 00:54:29 NAS kernel: rcu_sched_clock_irq+0x1e4/0x513
May 16 00:54:29 NAS kernel: update_process_times+0x1f/0x3d
May 16 00:54:29 NAS kernel: tick_sched_timer+0x33/0x62
May 16 00:54:29 NAS kernel: __hrtimer_run_queues+0xb7/0x10b
May 16 00:54:29 NAS kernel: ? tick_sched_do_timer+0x39/0x39
May 16 00:54:29 NAS kernel: hrtimer_interrupt+0x8d/0x160
May 16 00:54:29 NAS kernel: smp_apic_timer_interrupt+0x6a/0x7a
May 16 00:54:29 NAS kernel: apic_timer_interrupt+0xf/0x20
May 16 00:54:29 NAS kernel: </IRQ>
May 16 00:54:29 NAS kernel: RIP: 0010:native_queued_spin_lock_slowpath+0x9b/0x1f2
May 16 00:54:29 NAS kernel: Code: b6 c0 c1 e0 08 89 c2 8b 07 30 e4 09 d0 a9 00 01 ff ff 89 44 24 04 74 0c 0f ba e0 08 72 1e c6 47 01 00 eb 18 85 c0 74 0a 8b 07 <84> c0 74 04 f3 90 eb f6 66 c7 07 01 00 e9 2b 01 00 00 48 c7 c0 00
May 16 00:54:29 NAS kernel: RSP: 0018:ffffc9000db2fe20 EFLAGS: 00000202 ORIG_RAX: ffffffffffffff13
May 16 00:54:29 NAS kernel: RAX: 0000000000180101 RBX: ffff8880ad4abcc0 RCX: 000000000000001d
May 16 00:54:29 NAS kernel: RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff8881e6db5c00
May 16 00:54:29 NAS kernel: RBP: ffff8880ad4ab900 R08: 0000000000000000 R09: 0000000000000000
May 16 00:54:29 NAS kernel: R10: ffff8880955dd500 R11: ffff888050864b10 R12: ffff8880955dd7e8
May 16 00:54:29 NAS kernel: R13: 000000000000001d R14: 0000000000038800 R15: ffff8881e6db5c00
May 16 00:54:29 NAS kernel: queued_spin_lock_slowpath+0x7/0xa
May 16 00:54:29 NAS kernel: do_raw_spin_lock+0x38/0x52
May 16 00:54:29 NAS kernel: fuse_prepare_release+0x63/0xd2
May 16 00:54:29 NAS kernel: fuse_release_common+0x32/0x83
May 16 00:54:29 NAS kernel: fuse_dir_release+0xd/0x10
May 16 00:54:29 NAS kernel: __fput+0x108/0x1d1
May 16 00:54:29 NAS kernel: task_work_run+0x77/0x88
May 16 00:54:29 NAS kernel: prepare_exit_to_usermode+0xa6/0x126
May 16 00:54:29 NAS kernel: entry_SYSCALL_64_after_hwframe+0x44/0xa9
May 16 00:54:29 NAS kernel: RIP: 0033:0x15493e186fb3
May 16 00:54:29 NAS kernel: Code: e9 47 ff ff ff b8 ff ff ff ff e9 3d ff ff ff 0f 1f 84 00 00 00 00 00 64 8b 04 25 18 00 00 00 85 c0 75 14 b8 03 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 45 c3 0f 1f 40 00 48 83 ec 18 89 7c 24 0c e8
May 16 00:54:29 NAS kernel: RSP: 002b:00007ffc8c6bf508 EFLAGS: 00000246 ORIG_RAX: 0000000000000003
May 16 00:54:29 NAS kernel: RAX: 0000000000000000 RBX: 0000000000433f20 RCX: 000015493e186fb3
May 16 00:54:29 NAS kernel: RDX: 0000000000000000 RSI: 0000000000000008 RDI: 0000000000000008
May 16 00:54:29 NAS kernel: RBP: 0000000000000005 R08: 0000000000000000 R09: 0000000000000006
May 16 00:54:29 NAS kernel: R10: 000000001ea355fb R11: 0000000000000246 R12: 0000000000000005
May 16 00:54:29 NAS kernel: R13: 0000000000444830 R14: 0000000000433f20 R15: 0000000000000005

 

I've also attached the full syslog in case anyone wants to read through it. 
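For anyone searching their own logs for this symptom, a small filter helps pull the stall reports out of the noise. This is just a convenience sketch of mine - the function takes the log file as an argument because the syslog location varies (mine was mirrored to flash):

```shell
# rcu_stalls: print RCU stall reports plus the "Comm:" lines that name
# the task running on the stalled CPU (pass the syslog file as $1)
rcu_stalls() {
  grep -E 'rcu_sched self-detected stall|Comm:' "$1"
}
```

In my case this immediately showed that the stalled task was `du` every time.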

 

I would very much appreciate any help you could give me! I am using this server for my company and am desperate for a fix.

 

Thank you!

syslog


Thanks for your response. I looked at that, but the problem is that the logs in the diagnostics zip are stored in RAM. So when I force-restart, everything gets wiped and the logs are completely empty.

 

Do you know how I can prevent them from getting erased and store them somehow? The only way I got my syslog was by enabling logging to flash.
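For anyone else hitting this: besides mirroring to flash, the other thing that survives a hard reset is shipping logs off the box entirely. As far as I can tell, Unraid's syslog-server setting is standard remote syslog forwarding; outside the GUI, the equivalent rsyslog rule is a single line (the address below is a placeholder for whatever machine collects the logs):

```
# /etc/rsyslog.conf - forward all messages to a remote collector on UDP 514
# (192.168.1.50 is a placeholder; use @@ instead of @ for TCP)
*.* @192.168.1.50:514
```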

 

Here's an old diagnostics zip I downloaded immediately after the first time I encountered the problem. Hopefully it'll be of use.

 

 

nas-diagnostics-20200429-1020.zip


Thanks, but I'm a bit confused. That article is about the syslog server, which I did enable - but all it gives me is the syslog, which I already posted. I've looked through the diagnostics zip, and most of the files appear to be basic info about the config, so they wouldn't really change; I attached the full syslog in my earlier post.

 

I'm new to Unraid debugging because it has always just worked, so I apologize for these questions! But I think I've uploaded everything relevant now, right?

 

I'm going to reboot into safe mode just to eliminate any possibility of plugin problems, as some of them are deprecated. Unfortunately, because of the nature of the problem, I won't really know if it works until next week.
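For reference, here's roughly how I checked which plugins were installed before rebooting. Unraid keeps .plg files on the flash drive (under /boot/config/plugins, as I understand it); this sketch takes the directory as an argument:

```shell
# list_plugins: print the .plg files in a plugin directory, one per line
# (on Unraid this is normally /boot/config/plugins on the flash drive)
list_plugins() {
  for f in "$1"/*.plg; do
    [ -e "$f" ] && basename "$f"
  done
}
```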

 

Nevertheless, thank you so much for your help!

4 minutes ago, sv3ndev said:

I've looked through the diagnostics.zip and it appears as if most of them are basic info about the config, so they wouldn't really change

Not a question of whether they changed. Just trying to get a more complete picture. You didn't tell us anything about your hardware, and rather than pulling teeth to get that information, diagnostics makes it easier for everyone. Other changes a user makes to their configuration can also lead to problems. Many things can be diagnosed by looking at files other than the syslog.

 

If you had been running Ryzen, for example, I would have had other suggestions for that since there are some tweaks that can help with those CPUs and crashes.

3 minutes ago, sv3ndev said:

I uploaded everything relevant now, right?

yes

 

 

23 minutes ago, trurl said:

Not a question of whether they changed. Just trying to get a more complete picture. You didn't tell us anything really about your hardware, and rather than pulling teeth to get that information, diagnostics makes it easier for everyone. And other things a user can do to their configuration can lead to problems. Many things can get diagnosed by looking at things other than syslog

 

Ah okay, thanks! The thing is, this system ran perfectly nonstop for 4 years, and the last thing I changed was adding a VM 2 months ago, long before any problems started. So I'd be surprised if it's a configuration problem, since I wouldn't expect one to appear out of nowhere. That's why I'm so confused - I mainly work in IT security, so this kind of debugging isn't my specialty, and because I didn't change anything I don't have a lead on where to begin.


As an update for anyone with a similar problem: I've been running the server in Safe Mode with the GUI enabled for 2 days now, so far without a problem. It seems the cause was most likely a deprecated or otherwise malfunctioning plugin.

34 minutes ago, sv3ndev said:

It seems as if the problem was most likely a deprecated or otherwise malfunctioning plugin. 

The diagnostics didn't cover enough time for FCP (Fix Common Problems) to have listed things, but it would have told you:

 

Resilio.plg - 2016.09.17.1  --- Unknown and shouldn't be installed (In truth though this is a PhAzE plugin which has been removed from CA because it's not supported, not updated, and very likely will cause problems)

newransomware.bait.plg - 2018.07.02 - Deprecated

ca.backup.plg - 2017.10.28 - Deprecated - Known to cause issues under certain circumstances and replaced by a v2 version

dynamix.cache.dirs.plg - 2018.12.04 - Way way way out of date

13 minutes ago, Squid said:

The diagnostics didn't cover enough time for FCP to have listed things, but it would have been telling you

 

Resilio.plg - 2016.09.17.1  --- Unknown and shouldn't be installed (In truth though this is a PhAzE plugin which has been removed from CA because it's not supported, not updated, and very likely will cause problems)

newransomware.bait.plg - 2018.07.02 - Deprecated

ca.backup.plg - 2017.10.28 - Deprecated - Known to cause issues under certain circumstances and replaced by a v2 version

dynamix.cache.dirs.plg - 2018.12.04 - Way way way out of date

Thanks! I have to admit I've been using Unraid mainly as a "set it and forget it" NAS, and I clearly have some maintenance work to do. I'll leave it running in safe mode a while longer to make sure it's really stable, and then start re-enabling the plugins that are still maintained. I'll remove those four for good, though - they're more trouble than they're worth.

30 minutes ago, trurl said:

Do you have Notifications setup to alert you immediately by email or other agent as soon as a problem is detected?

Yes, I have email notifications configured. I used to get them all the time because my drives ran a bit warm, but since fixing that I barely see any. It never warned me of an impending problem; the CPU stall messages would just suddenly appear in the syslog, nothing else.
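Since the machine stops answering pings entirely once it hangs, no on-host notification can fire - the only alert that can catch this state has to come from outside. A minimal sketch of a check that a second machine could run from cron (the function name and wording are mine):

```shell
# check_host: ping a host once; print OK, or an ALERT line suitable for
# piping into mail or a notifier, from a cron job on another machine
check_host() {
  if ping -c 1 -W 2 "$1" >/dev/null 2>&1; then
    echo "OK: $1 responding"
  else
    echo "ALERT: $1 unreachable"
  fi
}
```

E.g. `*/5 * * * * check_host nas | grep ALERT | mail -s "NAS down" you@example.com` style, adapted to whatever notifier you use.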

Edited by sv3ndev
