Unraid Server Crashing 2x this week



Hi Everyone-

 

Woke up today to find that my Unraid server was unresponsive. I took a picture of what the server was displaying on my monitor and attached the diagnostics.

 

The last time this happened, 4 days ago, it reported that my two btrfs cache drives were unmountable and had to be reformatted. I was able to recover my cache data (Plex metadata and binhex krusader Docker data) and restore it to the server. I ran a parity check that finished yesterday with no new sync errors.

 

When it occurred this morning, I couldn't initiate a clean shutdown because the keyboard was unresponsive and I couldn't access the server remotely. When the server came back up, it started a parity check, which I cancelled. This time, the cache drives are still mounted and part of the pool.

 

In the past week, I've run memtest for 10 passes (>24 hours) and reseated the RAM and SATA cables. I haven't the slightest idea why this keeps happening. A little guidance would be helpful. Thank you.

PXL_20210608_141905941.jpg

Server Down.zip

7 minutes ago, JorgeB said:

Diags after rebooting are not much help for this, you can try enabling syslog mirror to flash then post that log after a crash.

 

OK. Does this option write to my USB boot drive?

 

I added an additional USB thumb drive to my server. Local syslog is disabled, remote syslog is blank, and mirror syslog to flash is set to Yes. Does that look OK?

13 hours ago, JorgeB said:

Diags after rebooting are not much help for this, you can try enabling syslog mirror to flash then post that log after a crash.

 

It happened again tonight. I had to power-cycle the server.

 

This is what I see in syslog before the kernel panic:

 

Jun  8 19:12:30 Nasgard emhttpd: spinning down /dev/sdm
Jun  8 19:57:31 Nasgard emhttpd: spinning down /dev/sdm
Jun  8 20:42:32 Nasgard emhttpd: spinning down /dev/sdm
Jun  8 20:43:27 Nasgard kernel: rcu: INFO: rcu_sched self-detected stall on CPU
Jun  8 20:43:27 Nasgard kernel: rcu:     3-....: (59999 ticks this GP) idle=0fe/1/0x4000000000000000 softirq=1334521/1334530 fqs=14397 
Jun  8 20:43:27 Nasgard kernel:     (t=60001 jiffies g=3697701 q=125290)
Jun  8 20:43:27 Nasgard kernel: NMI backtrace for cpu 3
Jun  8 20:43:27 Nasgard kernel: CPU: 3 PID: 24985 Comm: 7 Not tainted 5.10.28-Unraid #1
Jun  8 20:43:27 Nasgard kernel: Hardware name: Gigabyte Technology Co., Ltd. B450 AORUS PRO WIFI/B450 AORUS PRO WIFI-CF, BIOS F60e 12/09/2020
Jun  8 20:43:27 Nasgard kernel: Call Trace:
Jun  8 20:43:27 Nasgard kernel: <IRQ>
Jun  8 20:43:27 Nasgard kernel: dump_stack+0x6b/0x83
Jun  8 20:43:27 Nasgard kernel: ? lapic_can_unplug_cpu+0x8e/0x8e
Jun  8 20:43:27 Nasgard kernel: nmi_cpu_backtrace+0x7d/0x8f
Jun  8 20:43:27 Nasgard kernel: nmi_trigger_cpumask_backtrace+0x56/0xd3
Jun  8 20:43:27 Nasgard kernel: rcu_dump_cpu_stacks+0x9f/0xc6
Jun  8 20:43:27 Nasgard kernel: rcu_sched_clock_irq+0x1ec/0x543
Jun  8 20:43:27 Nasgard kernel: ? trigger_load_balance+0x5a/0x1ca
Jun  8 20:43:27 Nasgard kernel: update_process_times+0x50/0x6e
Jun  8 20:43:27 Nasgard kernel: tick_sched_timer+0x36/0x64
Jun  8 20:43:27 Nasgard kernel: __hrtimer_run_queues+0xb7/0x10b
Jun  8 20:43:27 Nasgard kernel: ? tick_sched_do_timer+0x39/0x39
Jun  8 20:43:27 Nasgard kernel: hrtimer_interrupt+0x8d/0x15b
Jun  8 20:43:27 Nasgard kernel: __sysvec_apic_timer_interrupt+0x5d/0x68
Jun  8 20:43:27 Nasgard kernel: asm_call_irq_on_stack+0x12/0x20
Jun  8 20:43:27 Nasgard kernel: </IRQ>
Jun  8 20:43:27 Nasgard kernel: sysvec_apic_timer_interrupt+0x71/0x95
Jun  8 20:43:27 Nasgard kernel: asm_sysvec_apic_timer_interrupt+0x12/0x20
Jun  8 20:43:27 Nasgard kernel: RIP: 0010:xas_descend+0x45/0x49
Jun  8 20:43:27 Nasgard kernel: Code: 44 c6 08 48 89 77 18 4c 89 c7 e8 77 ff ff ff 84 c0 74 13 49 c1 e8 02 44 89 c1 45 89 c0 49 83 c0 04 4e 8b 44 c6 08 41 88 49 12 <4c> 89 c0 c3 4c 8b 4f 18 48 89 fe 41 f6 c1 03 75 6b 4d 85 c9 4c 8b
Jun  8 20:43:27 Nasgard kernel: RSP: 0018:ffffc90001467ba8 EFLAGS: 00000246
Jun  8 20:43:27 Nasgard kernel: RAX: 0000000000000000 RBX: ffff888106325370 RCX: 0000000000000007
Jun  8 20:43:27 Nasgard kernel: RDX: 0000000000000000 RSI: ffff888100686480 RDI: ffffea000464dd80
Jun  8 20:43:27 Nasgard kernel: RBP: 0000000000000b47 R08: ffffea000464dd80 R09: ffffc90001467bb8
Jun  8 20:43:27 Nasgard kernel: R10: ffffc90001467bb8 R11: ffff88842e8e2400 R12: ffffea000464dd80
Jun  8 20:43:27 Nasgard kernel: R13: ffff88810005c480 R14: 0000000000000b47 R15: 00000000ffffffe5
Jun  8 20:43:27 Nasgard kernel: ? xas_descend+0x2a/0x49
Jun  8 20:43:27 Nasgard kernel: xas_load+0x2d/0x39
Jun  8 20:43:27 Nasgard kernel: find_get_entry+0x57/0xba
Jun  8 20:43:27 Nasgard kernel: find_lock_entry+0x15/0x4a
Jun  8 20:43:27 Nasgard kernel: shmem_getpage_gfp.isra.0+0xf6/0x543
Jun  8 20:43:27 Nasgard kernel: shmem_getpage+0x12/0x14
Jun  8 20:43:27 Nasgard kernel: shmem_file_read_iter+0xaa/0x22f
Jun  8 20:43:27 Nasgard kernel: generic_file_splice_read+0xf0/0x15e
Jun  8 20:43:27 Nasgard kernel: splice_direct_to_actor+0xe4/0x1cd
Jun  8 20:43:27 Nasgard kernel: ? generic_file_splice_read+0x15e/0x15e
Jun  8 20:43:27 Nasgard kernel: do_splice_direct+0x94/0xbd
Jun  8 20:43:27 Nasgard kernel: do_sendfile+0x185/0x24f
Jun  8 20:43:27 Nasgard kernel: __do_sys_sendfile64+0x81/0xa7
Jun  8 20:43:27 Nasgard kernel: do_syscall_64+0x5d/0x6a
Jun  8 20:43:27 Nasgard kernel: entry_SYSCALL_64_after_hwframe+0x44/0xa9
Jun  8 20:43:27 Nasgard kernel: RIP: 0033:0x951dca


Full syslog attached.

syslog
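A mirrored syslog can run to thousands of lines, so it can help to locate where a stall or call trace begins before posting it. The following is a minimal Python sketch, not an Unraid tool: the marker phrases are assumptions (taken from the excerpt above plus common kernel messages), and the log path is simply whatever copy of the syslog you pass on the command line.

```python
import re
import sys
from typing import Iterable, List, Tuple

# Phrases that often mark the start of a kernel problem report.
# Illustrative only (an assumption), not an exhaustive list.
KERNEL_MARKERS = re.compile(
    r"rcu_sched self-detected stall|Call Trace:|Kernel panic|general protection fault"
)


def find_kernel_events(lines: Iterable[str]) -> List[Tuple[int, str]]:
    """Return (1-based line number, line text) for each line matching a marker."""
    hits = []
    for number, line in enumerate(lines, start=1):
        text = line.rstrip("\n")
        if KERNEL_MARKERS.search(text):
            hits.append((number, text))
    return hits


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical usage: python find_events.py /path/to/copied/syslog
    with open(sys.argv[1], encoding="utf-8", errors="replace") as log:
        for number, text in find_kernel_events(log):
            print(f"{number}: {text}")
```

Printing line numbers rather than the whole trace keeps the output short; you can then open the log at those positions and copy the surrounding block into a post.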

18 minutes ago, SpeedyRS2 said:

I had the same problems since I upgraded from 6.8.3 to 6.9.2: constant crashes (every day). The syslog was not helpful.
Yesterday I downgraded back to 6.8.3. I need Unraid and Docker for my smart home, so stability is very important here.

 

Just downgraded to 6.8.3. Oddly enough, it unassigned my cache drives, but they're back up now.

 

After a bit of snooping, I'm seeing reports of similar kernel errors on 6.9.2.

 

I'm running a parity check now and will see how it goes. Let me know if you find an update that works for you.

 

18 minutes ago, omartian said:

 

Just downgraded to 6.8.3. Oddly enough, it unassigned my cache drives, but they're back up now.

 

My cache drive was also not mounted, and all my Docker containers were gone. At that moment I almost had a heart attack, but then I realized that the cache drive was unassigned.
After assigning it, everything was fine. In any case, I will not attempt any more system updates for now; stability is too important for me.

16 hours ago, Staite said:

I had similar crashes on my new server, one after six days, then another after nine days of running.

I disabled all my Docker containers that used the br0 network, and it has now been running for almost 17 days without a crash.

crash.jpg

I've had similar experiences. I had the macvlan crashes with a "CALL TRACE" in the syslog, every day since I moved to 6.9.2 (coming directly from 6.8.3). I removed all fixed IP assignments from Docker containers on the br0 network.
Then there was no crash for over a week, but suddenly the crashes were back, every day. I had no idea what the reason was; there was no more CALL TRACE in the log. That was when I downgraded back to 6.8.3. Since that downgrade, Unraid has been stable again.

 

So I don't think these crashes are hardware-related.

57 minutes ago, SpeedyRS2 said:

So I don't think these crashes are hardware-related.

 

I'm hoping that's the case for me. I'm currently running a parity check with zero errors so far, and the system has not crashed in the last 36 hours. Fingers crossed.

On 6/11/2021 at 1:47 AM, SpeedyRS2 said:

That was when I downgraded back to 6.8.3. Since that downgrade, Unraid has been stable again.

 

The server has been up for over 4 days without crashing. I'm hoping the latest update was the cause. Let me know if you have success with a future update.
