Unraid Crashing


pfields
Go to solution Solved by pfields,


Hello, 

 

First of all, I know it's not truly random, but at the moment I can't find any rhyme or reason for the crashes. After the first crash I checked the physical screen and it was completely black; nothing was responding, and I got it working again with a reboot. This morning I woke up to an Nginx 500 error when I tried to open the GUI and had to reboot again.

 

I took the server to the office this morning to try to diagnose what's happening. I updated the BIOS and restarted. I left it idle for about an hour and it seems to have crashed again: no SSH, no GUI, no shares, but some text on the screen.

 

[Attached image: screen-text.jpg]

 

I have attached diagnostics below, but they were generated after the reboot. The syslog was being mirrored to flash but contains very little information from before the crash. Could the kernel time error be causing this?

 

tower-diagnostics-20211207-1127.zip

 

syslog

 

Any help would be much appreciated. 

 

Thanks

Link to comment

I set the Power Supply Idle Control to Typical Current Idle as suggested in the article and rebooted. 

 

Now the server seems to be in a weird state: it's apparently working OK when I check the physical screen, and I can ping it, but there is no SSH, no GUI and no Docker apps.

 

Is there something else I should do with the C-States? 

Link to comment

It works to begin with, but I have just reproduced the error twice: the server starts out fine, then becomes unresponsive over the network. It responds to ping, but nothing else works or loads.

 

The server hasn't crashed, as I can still log in on the physical machine. I logged in and then did nothing for 10-15 minutes, at which point it was unresponsive. The syslog shows nothing after my successful login.
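
One way to narrow down the "pings but nothing loads" state is to record which TCP services still accept connections while it is happening. The following is only a minimal sketch: the port numbers (22 for SSH, 80 and 443 for the web GUI) are assumed defaults that are not confirmed anywhere in this thread, and the function names are made up for illustration.

```python
import socket

# Assumed default ports; adjust to match the actual server configuration.
ASSUMED_PORTS = {"ssh": 22, "gui-http": 80, "gui-https": 443}


def tcp_port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


def probe(host, ports=ASSUMED_PORTS, timeout=3.0):
    """Map each service name to whether its port accepted a connection."""
    return {name: tcp_port_open(host, port, timeout) for name, port in ports.items()}
```

Running `probe("tower")` from another machine (the hostname is a placeholder) while the server is in the bad state would show whether the ports are refusing connections outright or simply timing out, which are two different failure modes.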

Link to comment

Hmm, it seems it might be the same crash as before: when I went into the BIOS to revert the change, it was already set back to Auto. So the BIOS setting is reverting to Auto every time.

 

It's a Gigabyte B450 AORUS M (rev. 1.1), if anyone knows why this option would keep reverting to Auto. It's not a CMOS problem, as the date and time are being retained.

Link to comment

OK, I've worked out the issue with 'Typical Current Idle' on the Gigabyte board: there are two ways to reach the option in the BIOS, and a change made one way resets while a change made the other way sticks.

 

The server doesn't seem to crash like before, but I keep getting restarts. I will log in at some random point and see the uptime at 5 minutes even though the server has been on for an hour, for example.

 

The only errors or warnings I can see in the log are the following:

 

Dec 14 13:36:03 Tower kernel: mce: [Hardware Error]: Machine check events logged
Dec 14 13:36:03 Tower kernel: mce: [Hardware Error]: CPU 8: Machine Check: 0 Bank 5: bea0000000000108
Dec 14 13:36:03 Tower kernel: mce: [Hardware Error]: TSC 0 ADDR 1ffff81064b1e MISC d012000100000000 SYND 4d000000 IPID 500b000000000 
Dec 14 13:36:03 Tower kernel: mce: [Hardware Error]: PROCESSOR 2:800f11 TIME 1639488944 SOCKET 0 APIC 1 microcode 8001138
Dec 14 13:36:03 Tower kernel: floppy0: no floppy controllers found
Dec 14 13:36:03 Tower kernel: random: 7 urandom warning(s) missed due to ratelimiting
Dec 14 13:36:03 Tower kernel: ACPI Warning: SystemIO range 0x0000000000000B00-0x0000000000000B08 conflicts with OpRegion 0x0000000000000B00-0x0000000000000B0F (\GSA1.SMBI) (20200925/utaddress-204)
Dec 14 13:36:07 Tower rpc.statd[1976]: Failed to read /var/lib/nfs/state: Success
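
The Bank 5 status word in the log above (`bea0000000000108`) can be partly unpacked by hand. The sketch below assumes the generic x86 machine-check status layout for bits 63 to 57 and treats the low 16 bits as the raw error code; it does not attempt to decode the vendor-specific bits or say which component failed, and the function name is purely illustrative.

```python
# Generic x86 machine-check status flags (bit position -> short name).
# Only the architectural top bits are listed; vendor-specific bits are ignored.
MCI_STATUS_FLAGS = {
    63: "VAL",    # register holds a valid error
    62: "OVER",   # an earlier error was overwritten
    61: "UC",     # error was not corrected by hardware
    60: "EN",     # error reporting was enabled
    59: "MISCV",  # MISC register contents are valid
    58: "ADDRV",  # ADDR register contents are valid
    57: "PCC",    # processor context may be corrupt
}


def decode_mci_status(status):
    """Return the set flag names (highest bit first) and the raw error code."""
    flags = [
        name
        for bit, name in sorted(MCI_STATUS_FLAGS.items(), reverse=True)
        if (status >> bit) & 1
    ]
    return {"flags": flags, "error_code": status & 0xFFFF}


decoded = decode_mci_status(0xBEA0000000000108)
print(decoded["flags"], hex(decoded["error_code"]))
```

Under those assumptions the value has UC and PCC set, meaning an uncorrected error after which the processor state could not be trusted. That would at least be consistent with the unexpected restarts, though it does not by itself identify the cause.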

 

What should I do going forward in order to try and diagnose the restarts? 

 

Cheers

Link to comment

The server crashed again after about an hour and a half: black screen on the physical monitor, nothing responding.

 

Can anyone decipher the following? Also, why am I getting these Clock Unsynchronized errors?

 

 

Quote

Dec 14 14:02:34 Tower ntpd[2031]: kernel reports TIME_ERROR: 0x41: Clock Unsynchronized
Dec 14 15:19:31 Tower kernel: rcu: INFO: rcu_sched detected expedited stalls on CPUs/tasks: { 9-... 11-... } 62078 jiffies s: 289 root: 0x1/.
Dec 14 15:19:31 Tower kernel: rcu: blocking rcu_node structures: l=1:0-15:0xa00/.
Dec 14 15:19:31 Tower kernel: Task dump for CPU 9:
Dec 14 15:19:31 Tower kernel: task:Plex Media Serv state:R  running task     stack:    0 pid: 4850 ppid:  3545 flags:0x00000328
Dec 14 15:19:31 Tower kernel: Call Trace:
Dec 14 15:19:31 Tower kernel: ? smp_call_function_many_cond+0x272/0x285
Dec 14 15:19:31 Tower kernel: ? smp_call_function_many_cond+0x250/0x285
Dec 14 15:19:31 Tower kernel: ? flush_tlb_func_common.constprop.0+0xcc/0xcc
Dec 14 15:19:31 Tower kernel: ? native_flush_tlb_local+0x10/0x17
Dec 14 15:19:31 Tower kernel: ? __flush_tlb_others+0x5/0x8
Dec 14 15:19:31 Tower kernel: ? flush_tlb_mm_range+0xba/0xc0
Dec 14 15:19:31 Tower kernel: ? tlb_flush_mmu_tlbonly+0x6d/0x92
Dec 14 15:19:31 Tower kernel: ? tlb_flush_mmu+0xc/0x65
Dec 14 15:19:31 Tower kernel: ? tlb_finish_mmu+0x27/0x54
Dec 14 15:19:31 Tower kernel: ? madvise_free_single_vma+0x151/0x175
Dec 14 15:19:31 Tower kernel: ? find_vma+0xe/0x54
Dec 14 15:19:31 Tower kernel: ? find_vma_prev+0xf/0x3b
Dec 14 15:19:31 Tower kernel: ? do_madvise+0x578/0x86d
Dec 14 15:19:31 Tower kernel: ? __seccomp_filter+0x185/0x368
Dec 14 15:19:31 Tower kernel: ? __x64_sys_madvise+0x21/0x24
Dec 14 15:19:31 Tower kernel: ? __x64_sys_madvise+0x21/0x24
Dec 14 15:19:31 Tower kernel: ? do_syscall_64+0x5d/0x6a
Dec 14 15:19:31 Tower kernel: ? entry_SYSCALL_64_after_hwframe+0x44/0xa9
Dec 14 15:19:31 Tower kernel: Task dump for CPU 11:
Dec 14 15:19:31 Tower kernel: task:Plex Media Serv state:R  running task     stack:    0 pid: 4851 ppid:  3545 flags:0x00000328
Dec 14 15:19:31 Tower kernel: Call Trace:
Dec 14 15:19:31 Tower kernel: ? tlb_finish_mmu+0x27/0x54
Dec 14 15:19:31 Tower kernel: ? madvise_free_single_vma+0x151/0x175
Dec 14 15:19:31 Tower kernel: ? find_vma+0xe/0x54
Dec 14 15:19:31 Tower kernel: ? find_vma_prev+0xf/0x3b
Dec 14 15:19:31 Tower kernel: ? do_madvise+0x578/0x86d
Dec 14 15:19:31 Tower kernel: ? __seccomp_filter+0x185/0x368
Dec 14 15:19:31 Tower kernel: ? __x64_sys_madvise+0x21/0x24
Dec 14 15:19:31 Tower kernel: ? __x64_sys_madvise+0x21/0x24
Dec 14 15:19:31 Tower kernel: ? do_syscall_64+0x5d/0x6a
Dec 14 15:19:31 Tower kernel: ? entry_SYSCALL_64_after_hwframe+0x44/0xa9
Dec 14 15:22:31 Tower kernel: rcu: INFO: rcu_sched detected expedited stalls on CPUs/tasks: { 9-... 11-... } 242302 jiffies s: 289 root: 0x1/.
Dec 14 15:22:31 Tower kernel: rcu: blocking rcu_node structures: l=1:0-15:0xa00/.
Dec 14 15:22:31 Tower kernel: Task dump for CPU 9:
Dec 14 15:22:31 Tower kernel: task:Plex Media Serv state:R  running task     stack:    0 pid: 4850 ppid:  3545 flags:0x00000328
Dec 14 15:22:31 Tower kernel: Call Trace:
Dec 14 15:22:31 Tower kernel: ? smp_call_function_many_cond+0x272/0x285
Dec 14 15:22:31 Tower kernel: ? smp_call_function_many_cond+0x250/0x285
Dec 14 15:22:31 Tower kernel: ? flush_tlb_func_common.constprop.0+0xcc/0xcc
Dec 14 15:22:31 Tower kernel: ? native_flush_tlb_local+0x10/0x17
Dec 14 15:22:31 Tower kernel: ? __flush_tlb_others+0x5/0x8
Dec 14 15:22:31 Tower kernel: ? flush_tlb_mm_range+0xba/0xc0
Dec 14 15:22:31 Tower kernel: ? tlb_flush_mmu_tlbonly+0x6d/0x92
Dec 14 15:22:31 Tower kernel: ? tlb_flush_mmu+0xc/0x65
Dec 14 15:22:31 Tower kernel: ? tlb_finish_mmu+0x27/0x54
Dec 14 15:22:31 Tower kernel: ? madvise_free_single_vma+0x151/0x175
Dec 14 15:22:31 Tower kernel: ? find_vma+0xe/0x54
Dec 14 15:22:31 Tower kernel: ? find_vma_prev+0xf/0x3b
Dec 14 15:22:31 Tower kernel: ? do_madvise+0x578/0x86d
Dec 14 15:22:31 Tower kernel: ? __seccomp_filter+0x185/0x368
Dec 14 15:22:31 Tower kernel: ? __x64_sys_madvise+0x21/0x24
Dec 14 15:22:31 Tower kernel: ? __x64_sys_madvise+0x21/0x24
Dec 14 15:22:31 Tower kernel: ? do_syscall_64+0x5d/0x6a
Dec 14 15:22:31 Tower kernel: ? entry_SYSCALL_64_after_hwframe+0x44/0xa9
Dec 14 15:22:31 Tower kernel: Task dump for CPU 11:
Dec 14 15:22:31 Tower kernel: task:Plex Media Serv state:R  running task     stack:    0 pid: 4851 ppid:  3545 flags:0x00000328
Dec 14 15:22:31 Tower kernel: Call Trace:
Dec 14 15:22:31 Tower kernel: ? tlb_finish_mmu+0x27/0x54
Dec 14 15:22:31 Tower kernel: ? madvise_free_single_vma+0x151/0x175
Dec 14 15:22:31 Tower kernel: ? find_vma+0xe/0x54
Dec 14 15:22:31 Tower kernel: ? find_vma_prev+0xf/0x3b
Dec 14 15:22:31 Tower kernel: ? do_madvise+0x578/0x86d
Dec 14 15:22:31 Tower kernel: ? __seccomp_filter+0x185/0x368
Dec 14 15:22:31 Tower kernel: ? __x64_sys_madvise+0x21/0x24
Dec 14 15:22:31 Tower kernel: ? __x64_sys_madvise+0x21/0x24
Dec 14 15:22:31 Tower kernel: ? do_syscall_64+0x5d/0x6a
Dec 14 15:22:31 Tower kernel: ? entry_SYSCALL_64_after_hwframe+0x44/0xa9
Dec 14 15:25:31 Tower kernel: rcu: INFO: rcu_sched detected expedited stalls on CPUs/tasks: { 9-... 11-... } 422527 jiffies s: 289 root: 0x1/.
Dec 14 15:25:31 Tower kernel: rcu: blocking rcu_node structures: l=1:0-15:0xa00/.
Dec 14 15:25:31 Tower kernel: Task dump for CPU 9:
Dec 14 15:25:31 Tower kernel: task:Plex Media Serv state:R  running task     stack:    0 pid: 4850 ppid:  3545 flags:0x00000328
Dec 14 15:25:31 Tower kernel: Call Trace:
Dec 14 15:25:31 Tower kernel: ? smp_call_function_many_cond+0x26c/0x285
Dec 14 15:25:31 Tower kernel: ? smp_call_function_many_cond+0x250/0x285
Dec 14 15:25:31 Tower kernel: ? flush_tlb_func_common.constprop.0+0xcc/0xcc
Dec 14 15:25:31 Tower kernel: ? native_flush_tlb_local+0x10/0x17
Dec 14 15:25:31 Tower kernel: ? __flush_tlb_others+0x5/0x8
Dec 14 15:25:31 Tower kernel: ? flush_tlb_mm_range+0xba/0xc0
Dec 14 15:25:31 Tower kernel: ? tlb_flush_mmu_tlbonly+0x6d/0x92
Dec 14 15:25:31 Tower kernel: ? tlb_flush_mmu+0xc/0x65
Dec 14 15:25:31 Tower kernel: ? tlb_finish_mmu+0x27/0x54
Dec 14 15:25:31 Tower kernel: ? madvise_free_single_vma+0x151/0x175
Dec 14 15:25:31 Tower kernel: ? find_vma+0xe/0x54
Dec 14 15:25:31 Tower kernel: ? find_vma_prev+0xf/0x3b
Dec 14 15:25:31 Tower kernel: ? do_madvise+0x578/0x86d
Dec 14 15:25:31 Tower kernel: ? __seccomp_filter+0x185/0x368
Dec 14 15:25:31 Tower kernel: ? __x64_sys_madvise+0x21/0x24
Dec 14 15:25:31 Tower kernel: ? __x64_sys_madvise+0x21/0x24
Dec 14 15:25:31 Tower kernel: ? do_syscall_64+0x5d/0x6a
Dec 14 15:25:31 Tower kernel: ? entry_SYSCALL_64_after_hwframe+0x44/0xa9
Dec 14 15:25:31 Tower kernel: Task dump for CPU 11:
Dec 14 15:25:31 Tower kernel: task:Plex Media Serv state:R  running task     stack:    0 pid: 4851 ppid:  3545 flags:0x00000328
Dec 14 15:25:31 Tower kernel: Call Trace:
Dec 14 15:25:31 Tower kernel: ? tlb_finish_mmu+0x27/0x54
Dec 14 15:25:31 Tower kernel: ? madvise_free_single_vma+0x151/0x175
Dec 14 15:25:31 Tower kernel: ? find_vma+0xe/0x54
Dec 14 15:25:31 Tower kernel: ? find_vma_prev+0xf/0x3b
Dec 14 15:25:31 Tower kernel: ? do_madvise+0x578/0x86d
Dec 14 15:25:31 Tower kernel: ? __seccomp_filter+0x185/0x368
Dec 14 15:25:31 Tower kernel: ? __x64_sys_madvise+0x21/0x24
Dec 14 15:25:31 Tower kernel: ? __x64_sys_madvise+0x21/0x24
Dec 14 15:25:31 Tower kernel: ? do_syscall_64+0x5d/0x6a
Dec 14 15:25:31 Tower kernel: ? entry_SYSCALL_64_after_hwframe+0x44/0xa9
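
The three stall reports above carry a jiffies counter, which gives a rough idea of how long CPUs 9 and 11 had been stuck at each report. The following sketch estimates the kernel tick rate from the log itself (the difference in jiffies divided by the difference in wall-clock time) rather than assuming it. The year is not present in syslog timestamps and is supplied as an assumption, and the helper names are illustrative only.

```python
import re
from datetime import datetime

# The three "expedited stalls" header lines, copied from the log above.
LOG_LINES = [
    "Dec 14 15:19:31 Tower kernel: rcu: INFO: rcu_sched detected expedited stalls on CPUs/tasks: { 9-... 11-... } 62078 jiffies s: 289 root: 0x1/.",
    "Dec 14 15:22:31 Tower kernel: rcu: INFO: rcu_sched detected expedited stalls on CPUs/tasks: { 9-... 11-... } 242302 jiffies s: 289 root: 0x1/.",
    "Dec 14 15:25:31 Tower kernel: rcu: INFO: rcu_sched detected expedited stalls on CPUs/tasks: { 9-... 11-... } 422527 jiffies s: 289 root: 0x1/.",
]

STALL_RE = re.compile(
    r"^(\w{3} +\d+ \d\d:\d\d:\d\d) .*detected expedited stalls.*\} (\d+) jiffies"
)


def parse_stalls(lines, year=2021):
    """Return (timestamp, jiffies) pairs for each stall report line."""
    stalls = []
    for line in lines:
        match = STALL_RE.search(line)
        if match:
            stamp = datetime.strptime(f"{year} {match.group(1)}", "%Y %b %d %H:%M:%S")
            stalls.append((stamp, int(match.group(2))))
    return stalls


def estimate_hz(stalls):
    """Estimate ticks per second from the first and last stall reports."""
    (t0, j0), (t1, j1) = stalls[0], stalls[-1]
    return (j1 - j0) / (t1 - t0).total_seconds()


stalls = parse_stalls(LOG_LINES)
hz = estimate_hz(stalls)
for stamp, jiffies in stalls:
    print(stamp.time(), f"stalled for roughly {jiffies / hz:.0f} s")
```

The counter grows by about 180,000 between reports that are 180 seconds apart, so the tick rate works out to roughly 1000 per second. On that basis the same two Plex threads had been stuck for about a minute at the first report and about seven minutes at the last.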
 

 

Link to comment
5 hours ago, XceRpt said:

A long shot, but always worth checking as far as random crashes are concerned: double-check your RAM speed settings. I ran for nearly a year, then suddenly started having crashes. Changing the speed to the stock non-OC speed of 2133 instead of 3200 stopped the crashes for me.

 

 

 

I set XMP Profile 1 to 2400 MHz this morning, but it has just crashed.

 

There are always a few lines in the syslog just before it goes dead.

 

Dec 15 08:21:04 Tower nmbd[2659]: [2021/12/15 08:21:04.787064,  0] ../../source3/nmbd/nmbd_become_lmb.c:397(become_local_master_stage2)
Dec 15 08:21:04 Tower nmbd[2659]:   *****
Dec 15 08:21:04 Tower nmbd[2659]:   
Dec 15 08:21:04 Tower nmbd[2659]:   Samba name server TOWER is now a local master browser for workgroup WORKGROUP on subnet 172.17.0.1
Dec 15 08:21:04 Tower nmbd[2659]:   
Dec 15 08:21:04 Tower nmbd[2659]:   *****
Dec 15 08:25:54 Tower ntpd[2029]: kernel reports TIME_ERROR: 0x41: Clock Unsynchronized

 

I assume this has nothing to do with the time settings in Unraid, because NTP within Unraid shows the right time and I can communicate correctly with time1.google.com.
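
For what it's worth, the `0x41` in the ntpd message looks like the kernel's clock status word. Assuming it uses the `STA_*` flag values from the Linux `timex` interface, it can be split into its individual bits as sketched below (the function name is illustrative).

```python
# Kernel clock status flags as defined for the Linux timex interface.
STA_FLAGS = {
    0x0001: "STA_PLL",
    0x0002: "STA_PPSFREQ",
    0x0004: "STA_PPSTIME",
    0x0008: "STA_FLL",
    0x0010: "STA_INS",
    0x0020: "STA_DEL",
    0x0040: "STA_UNSYNC",
    0x0080: "STA_FREQHOLD",
    0x0100: "STA_PPSSIGNAL",
    0x0200: "STA_PPSJITTER",
    0x0400: "STA_PPSWANDER",
    0x0800: "STA_PPSERROR",
    0x1000: "STA_CLOCKERR",
    0x2000: "STA_NANO",
    0x4000: "STA_MODE",
    0x8000: "STA_CLK",
}


def decode_timex_status(status):
    """Return the names of all status flags set in the given value."""
    return [name for bit, name in STA_FLAGS.items() if status & bit]


print(decode_timex_status(0x41))
```

Under that assumption, `0x41` is `STA_PLL` plus `STA_UNSYNC`: the kernel clock is not currently flagged as synchronised. That state is commonly seen for a while after ntpd starts, so on its own it may well be unrelated to the crashes.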

Link to comment
