pfields Posted December 7, 2021
Hello,
First of all, I know the crashes aren't truly random, but at the moment I can't find any rhyme or reason for them. After the first crash I checked the physical screen: it was completely black with no sign of life and nothing was working. A reboot got it going again. This morning I woke up to an Nginx 500 error when I tried to open the GUI and had to reboot again.
I took the server to the office this morning to try to diagnose what's happening. I updated the BIOS and restarted, then left it idle for about an hour, and it seems to have crashed again: no SSH, no GUI, no shares, but some text on the screen.
I have attached diagnostics below, but they were generated after the reboot. The syslog was being mirrored to flash and contains very little information from before the crash. Could the kernel time error be causing this?
tower-diagnostics-20211207-1127.zip syslog
Any help would be much appreciated.
Thanks
JorgeB Posted December 7, 2021
Is the power supply control option correctly set?
pfields Posted December 7, 2021
I set the Power Supply Idle Control to Typical Current Idle as suggested in the article and rebooted. Now the server seems to be in a weird state: it's apparently working OK when I check the physical screen and I can ping it, but there is no SSH, no GUI, and no Docker apps. Is there something else I should do with the C-states?
JorgeB Posted December 7, 2021
Almost certainly unrelated to that, but you can try reverting the setting to make sure.
pfields Posted December 7, 2021
Well, it works to begin with, but I have just reproduced the error twice. It works initially, then becomes unresponsive over the network. It responds to ping, but nothing else works or loads. The server hasn't crashed, as I can still log in on the physical machine. I logged in and then didn't do anything for 10-15 minutes, at which point it was unresponsive; the syslog shows nothing after my successful login.
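The state described above (ping answers, nothing else loads) can be confirmed from another machine, since ping replies come from the kernel's network stack while SSH and the GUI are ordinary userspace services. Below is a minimal Python sketch of such a check. The service names and port numbers in `SERVICES` are assumptions for illustration (22, 80 and 443 are only the usual defaults), and the host name in the closing comment is hypothetical.

```python
import socket


def tcp_port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


def probe_services(host: str, ports: dict, timeout: float = 3.0) -> dict:
    """Map each service name to whether its TCP port accepted a connection."""
    return {name: tcp_port_open(host, port, timeout) for name, port in ports.items()}


# Assumed default ports; adjust to whatever the server actually uses.
SERVICES = {"ssh": 22, "http": 80, "https": 443}

# Example call (hypothetical host name):
# print(probe_services("tower.local", SERVICES))
```

If every entry comes back `False` while ping still answers, that matches the symptom in the post: the kernel is alive but the services are no longer accepting connections.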
JorgeB Posted December 7, 2021
And it works normally if you revert the setting?
pfields Posted December 7, 2021
Hmm, it seems it might be the same crash as before, because I just went to revert the change and found it was already set back to Auto. So the BIOS setting is reverting to Auto every time. It's a Gigabyte B450 AORUS M (rev. 1.1), if anyone knows why this option would keep reverting to Auto. It's not the CMOS, as the date and time are being retained.
JorgeB Posted December 7, 2021
Look for a BIOS update.
pfields Posted December 7, 2021
I'm on the most current version, 62d. I even tried downgrading to revision 61, but it seems to have the same behaviour. I've logged a ticket with Gigabyte.
pfields Posted December 14, 2021
OK, I've worked out the 'Typical Current Idle' issue with Gigabyte: there are two ways to access the option in the BIOS, and one resets while the other sticks.
The server doesn't seem to crash like before, but I keep getting restarts. I will log in at random times and see the uptime at 5 minutes even though the server has been on for an hour, for example. The only errors or warnings I can see in the log are the following:
Dec 14 13:36:03 Tower kernel: mce: [Hardware Error]: Machine check events logged
Dec 14 13:36:03 Tower kernel: mce: [Hardware Error]: CPU 8: Machine Check: 0 Bank 5: bea0000000000108
Dec 14 13:36:03 Tower kernel: mce: [Hardware Error]: TSC 0 ADDR 1ffff81064b1e MISC d012000100000000 SYND 4d000000 IPID 500b000000000
Dec 14 13:36:03 Tower kernel: mce: [Hardware Error]: PROCESSOR 2:800f11 TIME 1639488944 SOCKET 0 APIC 1 microcode 8001138
Dec 14 13:36:03 Tower kernel: floppy0: no floppy controllers found
Dec 14 13:36:03 Tower kernel: random: 7 urandom warning(s) missed due to ratelimiting
Dec 14 13:36:03 Tower kernel: ACPI Warning: SystemIO range 0x0000000000000B00-0x0000000000000B08 conflicts with OpRegion 0x0000000000000B00-0x0000000000000B0F (\GSA1.SMBI) (20200925/utaddress-204)
Dec 14 13:36:07 Tower rpc.statd[1976]: Failed to read /var/lib/nfs/state: Success
What should I do going forward to try to diagnose the restarts?
Cheers
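One way to get a little more out of the mce line above is to split the 64-bit status value into its flag bits. The Python sketch below does that for the `Bank 5: bea0000000000108` entry. It assumes the generic x86 machine-check status layout for bits 63 to 57 and treats the low 16 bits as the error code; what the bank number and error code mean is specific to the CPU vendor and is not decoded here. The function name `decode_mce` and the regular expression are illustrative, not part of any tool mentioned in the thread.

```python
import re

# Flag bits 63..57 of the machine-check status register
# (assumed generic x86 layout; vendor-specific bits are ignored).
STATUS_FLAGS = {
    63: "VAL",    # register holds a valid error
    62: "OVER",   # an earlier error was overwritten
    61: "UC",     # error was not corrected by hardware
    60: "EN",     # reporting was enabled for this error
    59: "MISCV",  # MISC register holds valid data
    58: "ADDRV",  # ADDR register holds valid data
    57: "PCC",    # processor context may be corrupted
}

MCE_LINE = re.compile(
    r"CPU (\d+): Machine Check: \d+ Bank (\d+): ([0-9a-fA-F]{16})"
)


def decode_mce(line: str):
    """Extract CPU, bank, set flags and error code from an mce syslog line."""
    match = MCE_LINE.search(line)
    if match is None:
        return None
    cpu, bank, status_hex = match.groups()
    status = int(status_hex, 16)
    flags = [name for bit, name in STATUS_FLAGS.items() if (status >> bit) & 1]
    return {
        "cpu": int(cpu),
        "bank": int(bank),
        "flags": flags,
        "error_code": status & 0xFFFF,
    }


line = ("Dec 14 13:36:03 Tower kernel: mce: [Hardware Error]: "
        "CPU 8: Machine Check: 0 Bank 5: bea0000000000108")
print(decode_mce(line))
```

Under those assumptions the entry has UC and PCC set, meaning an uncorrected error that may have corrupted processor state, which would be consistent with a machine that resets itself rather than logging and carrying on.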
JorgeB Posted December 14, 2021
It's detecting a hardware issue. I'm not sure whether it's a serious one, but if it's restarting it probably is.
pfields Posted December 14, 2021
The server crashed again after about 1h30: black screen on the physical monitor, nothing responding. Can anyone decipher the following, and why am I getting these Clock Unsynchronized errors?
Dec 14 14:02:34 Tower ntpd[2031]: kernel reports TIME_ERROR: 0x41: Clock Unsynchronized
Dec 14 15:19:31 Tower kernel: rcu: INFO: rcu_sched detected expedited stalls on CPUs/tasks: { 9-... 11-... } 62078 jiffies s: 289 root: 0x1/.
Dec 14 15:19:31 Tower kernel: rcu: blocking rcu_node structures: l=1:0-15:0xa00/.
Dec 14 15:19:31 Tower kernel: Task dump for CPU 9:
Dec 14 15:19:31 Tower kernel: task:Plex Media Serv state:R running task stack: 0 pid: 4850 ppid: 3545 flags:0x00000328
Dec 14 15:19:31 Tower kernel: Call Trace:
Dec 14 15:19:31 Tower kernel: ? smp_call_function_many_cond+0x272/0x285
Dec 14 15:19:31 Tower kernel: ? smp_call_function_many_cond+0x250/0x285
Dec 14 15:19:31 Tower kernel: ? flush_tlb_func_common.constprop.0+0xcc/0xcc
Dec 14 15:19:31 Tower kernel: ? native_flush_tlb_local+0x10/0x17
Dec 14 15:19:31 Tower kernel: ? __flush_tlb_others+0x5/0x8
Dec 14 15:19:31 Tower kernel: ? flush_tlb_mm_range+0xba/0xc0
Dec 14 15:19:31 Tower kernel: ? tlb_flush_mmu_tlbonly+0x6d/0x92
Dec 14 15:19:31 Tower kernel: ? tlb_flush_mmu+0xc/0x65
Dec 14 15:19:31 Tower kernel: ? tlb_finish_mmu+0x27/0x54
Dec 14 15:19:31 Tower kernel: ? madvise_free_single_vma+0x151/0x175
Dec 14 15:19:31 Tower kernel: ? find_vma+0xe/0x54
Dec 14 15:19:31 Tower kernel: ? find_vma_prev+0xf/0x3b
Dec 14 15:19:31 Tower kernel: ? do_madvise+0x578/0x86d
Dec 14 15:19:31 Tower kernel: ? __seccomp_filter+0x185/0x368
Dec 14 15:19:31 Tower kernel: ? __x64_sys_madvise+0x21/0x24
Dec 14 15:19:31 Tower kernel: ? __x64_sys_madvise+0x21/0x24
Dec 14 15:19:31 Tower kernel: ? do_syscall_64+0x5d/0x6a
Dec 14 15:19:31 Tower kernel: ? entry_SYSCALL_64_after_hwframe+0x44/0xa9
Dec 14 15:19:31 Tower kernel: Task dump for CPU 11:
Dec 14 15:19:31 Tower kernel: task:Plex Media Serv state:R running task stack: 0 pid: 4851 ppid: 3545 flags:0x00000328
Dec 14 15:19:31 Tower kernel: Call Trace:
Dec 14 15:19:31 Tower kernel: ? tlb_finish_mmu+0x27/0x54
Dec 14 15:19:31 Tower kernel: ? madvise_free_single_vma+0x151/0x175
Dec 14 15:19:31 Tower kernel: ? find_vma+0xe/0x54
Dec 14 15:19:31 Tower kernel: ? find_vma_prev+0xf/0x3b
Dec 14 15:19:31 Tower kernel: ? do_madvise+0x578/0x86d
Dec 14 15:19:31 Tower kernel: ? __seccomp_filter+0x185/0x368
Dec 14 15:19:31 Tower kernel: ? __x64_sys_madvise+0x21/0x24
Dec 14 15:19:31 Tower kernel: ? __x64_sys_madvise+0x21/0x24
Dec 14 15:19:31 Tower kernel: ? do_syscall_64+0x5d/0x6a
Dec 14 15:19:31 Tower kernel: ? entry_SYSCALL_64_after_hwframe+0x44/0xa9
Dec 14 15:22:31 Tower kernel: rcu: INFO: rcu_sched detected expedited stalls on CPUs/tasks: { 9-... 11-... } 242302 jiffies s: 289 root: 0x1/.
Dec 14 15:22:31 Tower kernel: rcu: blocking rcu_node structures: l=1:0-15:0xa00/.
Dec 14 15:22:31 Tower kernel: Task dump for CPU 9:
Dec 14 15:22:31 Tower kernel: task:Plex Media Serv state:R running task stack: 0 pid: 4850 ppid: 3545 flags:0x00000328
Dec 14 15:22:31 Tower kernel: Call Trace:
Dec 14 15:22:31 Tower kernel: ? smp_call_function_many_cond+0x272/0x285
Dec 14 15:22:31 Tower kernel: ? smp_call_function_many_cond+0x250/0x285
Dec 14 15:22:31 Tower kernel: ? flush_tlb_func_common.constprop.0+0xcc/0xcc
Dec 14 15:22:31 Tower kernel: ? native_flush_tlb_local+0x10/0x17
Dec 14 15:22:31 Tower kernel: ? __flush_tlb_others+0x5/0x8
Dec 14 15:22:31 Tower kernel: ? flush_tlb_mm_range+0xba/0xc0
Dec 14 15:22:31 Tower kernel: ? tlb_flush_mmu_tlbonly+0x6d/0x92
Dec 14 15:22:31 Tower kernel: ? tlb_flush_mmu+0xc/0x65
Dec 14 15:22:31 Tower kernel: ? tlb_finish_mmu+0x27/0x54
Dec 14 15:22:31 Tower kernel: ? madvise_free_single_vma+0x151/0x175
Dec 14 15:22:31 Tower kernel: ? find_vma+0xe/0x54
Dec 14 15:22:31 Tower kernel: ? find_vma_prev+0xf/0x3b
Dec 14 15:22:31 Tower kernel: ? do_madvise+0x578/0x86d
Dec 14 15:22:31 Tower kernel: ? __seccomp_filter+0x185/0x368
Dec 14 15:22:31 Tower kernel: ? __x64_sys_madvise+0x21/0x24
Dec 14 15:22:31 Tower kernel: ? __x64_sys_madvise+0x21/0x24
Dec 14 15:22:31 Tower kernel: ? do_syscall_64+0x5d/0x6a
Dec 14 15:22:31 Tower kernel: ? entry_SYSCALL_64_after_hwframe+0x44/0xa9
Dec 14 15:22:31 Tower kernel: Task dump for CPU 11:
Dec 14 15:22:31 Tower kernel: task:Plex Media Serv state:R running task stack: 0 pid: 4851 ppid: 3545 flags:0x00000328
Dec 14 15:22:31 Tower kernel: Call Trace:
Dec 14 15:22:31 Tower kernel: ? tlb_finish_mmu+0x27/0x54
Dec 14 15:22:31 Tower kernel: ? madvise_free_single_vma+0x151/0x175
Dec 14 15:22:31 Tower kernel: ? find_vma+0xe/0x54
Dec 14 15:22:31 Tower kernel: ? find_vma_prev+0xf/0x3b
Dec 14 15:22:31 Tower kernel: ? do_madvise+0x578/0x86d
Dec 14 15:22:31 Tower kernel: ? __seccomp_filter+0x185/0x368
Dec 14 15:22:31 Tower kernel: ? __x64_sys_madvise+0x21/0x24
Dec 14 15:22:31 Tower kernel: ? __x64_sys_madvise+0x21/0x24
Dec 14 15:22:31 Tower kernel: ? do_syscall_64+0x5d/0x6a
Dec 14 15:22:31 Tower kernel: ? entry_SYSCALL_64_after_hwframe+0x44/0xa9
Dec 14 15:25:31 Tower kernel: rcu: INFO: rcu_sched detected expedited stalls on CPUs/tasks: { 9-... 11-... } 422527 jiffies s: 289 root: 0x1/.
Dec 14 15:25:31 Tower kernel: rcu: blocking rcu_node structures: l=1:0-15:0xa00/.
Dec 14 15:25:31 Tower kernel: Task dump for CPU 9:
Dec 14 15:25:31 Tower kernel: task:Plex Media Serv state:R running task stack: 0 pid: 4850 ppid: 3545 flags:0x00000328
Dec 14 15:25:31 Tower kernel: Call Trace:
Dec 14 15:25:31 Tower kernel: ? smp_call_function_many_cond+0x26c/0x285
Dec 14 15:25:31 Tower kernel: ? smp_call_function_many_cond+0x250/0x285
Dec 14 15:25:31 Tower kernel: ? flush_tlb_func_common.constprop.0+0xcc/0xcc
Dec 14 15:25:31 Tower kernel: ? native_flush_tlb_local+0x10/0x17
Dec 14 15:25:31 Tower kernel: ? __flush_tlb_others+0x5/0x8
Dec 14 15:25:31 Tower kernel: ? flush_tlb_mm_range+0xba/0xc0
Dec 14 15:25:31 Tower kernel: ? tlb_flush_mmu_tlbonly+0x6d/0x92
Dec 14 15:25:31 Tower kernel: ? tlb_flush_mmu+0xc/0x65
Dec 14 15:25:31 Tower kernel: ? tlb_finish_mmu+0x27/0x54
Dec 14 15:25:31 Tower kernel: ? madvise_free_single_vma+0x151/0x175
Dec 14 15:25:31 Tower kernel: ? find_vma+0xe/0x54
Dec 14 15:25:31 Tower kernel: ? find_vma_prev+0xf/0x3b
Dec 14 15:25:31 Tower kernel: ? do_madvise+0x578/0x86d
Dec 14 15:25:31 Tower kernel: ? __seccomp_filter+0x185/0x368
Dec 14 15:25:31 Tower kernel: ? __x64_sys_madvise+0x21/0x24
Dec 14 15:25:31 Tower kernel: ? __x64_sys_madvise+0x21/0x24
Dec 14 15:25:31 Tower kernel: ? do_syscall_64+0x5d/0x6a
Dec 14 15:25:31 Tower kernel: ? entry_SYSCALL_64_after_hwframe+0x44/0xa9
Dec 14 15:25:31 Tower kernel: Task dump for CPU 11:
Dec 14 15:25:31 Tower kernel: task:Plex Media Serv state:R running task stack: 0 pid: 4851 ppid: 3545 flags:0x00000328
Dec 14 15:25:31 Tower kernel: Call Trace:
Dec 14 15:25:31 Tower kernel: ? tlb_finish_mmu+0x27/0x54
Dec 14 15:25:31 Tower kernel: ? madvise_free_single_vma+0x151/0x175
Dec 14 15:25:31 Tower kernel: ? find_vma+0xe/0x54
Dec 14 15:25:31 Tower kernel: ? find_vma_prev+0xf/0x3b
Dec 14 15:25:31 Tower kernel: ? do_madvise+0x578/0x86d
Dec 14 15:25:31 Tower kernel: ? __seccomp_filter+0x185/0x368
Dec 14 15:25:31 Tower kernel: ? __x64_sys_madvise+0x21/0x24
Dec 14 15:25:31 Tower kernel: ? __x64_sys_madvise+0x21/0x24
Dec 14 15:25:31 Tower kernel: ? do_syscall_64+0x5d/0x6a
Dec 14 15:25:31 Tower kernel: ? entry_SYSCALL_64_after_hwframe+0x44/0xa9
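The stall reports in the post above are long, but the useful part of each is its first line: which CPUs are stuck and for how many jiffies. The Python sketch below pulls those fields out of syslog text. The regular expression is written against the exact wording of the lines quoted in the post and is an assumption about the format in general; `parse_stalls` is a hypothetical helper name.

```python
import re

# Matches the header line of an expedited RCU stall report as quoted above.
STALL = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+ \d\d:\d\d:\d\d) \S+ kernel: rcu: INFO: "
    r"\w+ detected expedited stalls on CPUs/tasks: "
    r"\{(?P<cpus>[^}]*)\} (?P<jiffies>\d+) jiffies"
)


def parse_stalls(lines):
    """Return (timestamp, [cpu numbers], jiffies) for each stall header line."""
    reports = []
    for line in lines:
        match = STALL.search(line)
        if match:
            # CPU entries look like "9-..." inside the braces.
            cpus = [int(cpu) for cpu in re.findall(r"(\d+)-", match["cpus"])]
            reports.append((match["stamp"], cpus, int(match["jiffies"])))
    return reports


SAMPLE = [
    "Dec 14 15:19:31 Tower kernel: rcu: INFO: rcu_sched detected expedited stalls on CPUs/tasks: { 9-... 11-... } 62078 jiffies s: 289 root: 0x1/.",
    "Dec 14 15:19:31 Tower kernel: Task dump for CPU 9:",
    "Dec 14 15:22:31 Tower kernel: rcu: INFO: rcu_sched detected expedited stalls on CPUs/tasks: { 9-... 11-... } 242302 jiffies s: 289 root: 0x1/.",
    "Dec 14 15:25:31 Tower kernel: rcu: INFO: rcu_sched detected expedited stalls on CPUs/tasks: { 9-... 11-... } 422527 jiffies s: 289 root: 0x1/.",
]

for stamp, cpus, jiffies in parse_stalls(SAMPLE):
    print(stamp, cpus, jiffies)
```

In the quoted log the same two CPUs (9 and 11) appear in all three reports, and the counter grows by roughly 180,000 jiffies between reports logged 180 seconds apart. That pattern is consistent with one stall that never cleared rather than three separate events, though this reading is an inference from the numbers, not something the log states.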
XceRpt Posted December 15, 2021
A long shot, but always worth checking where random crashes are concerned: double-check your RAM speed settings. I ran for nearly a year, then suddenly started having crashes. Changing the speed to the stock non-OC speed of 2133 instead of 3200 stopped the crashes for me.
pfields Posted December 15, 2021
I set XMP Profile 1 this morning, at 2400 MHz, but it has just crashed. There are always a few lines in the syslog before it goes dead:
Dec 15 08:21:04 Tower nmbd[2659]: [2021/12/15 08:21:04.787064, 0] ../../source3/nmbd/nmbd_become_lmb.c:397(become_local_master_stage2)
Dec 15 08:21:04 Tower nmbd[2659]: *****
Dec 15 08:21:04 Tower nmbd[2659]:
Dec 15 08:21:04 Tower nmbd[2659]: Samba name server TOWER is now a local master browser for workgroup WORKGROUP on subnet 172.17.0.1
Dec 15 08:21:04 Tower nmbd[2659]:
Dec 15 08:21:04 Tower nmbd[2659]: *****
Dec 15 08:25:54 Tower ntpd[2029]: kernel reports TIME_ERROR: 0x41: Clock Unsynchronized
I assume this has nothing to do with the time settings in Unraid, because NTP within Unraid shows the right time and I can communicate correctly with time1.google.com.
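The `0x41` in the ntpd line is a bit mask of kernel clock status flags, so it can be decoded rather than guessed at. The Python sketch below does so using the `STA_*` constants as I understand them from the Linux `timex.h` header; treat the table as an assumption and check it against the header on the system in question. `decode_time_status` and `status_from_log` are hypothetical helper names.

```python
import re

# Kernel clock status bits (assumed to match Linux's timex.h).
STA_FLAGS = {
    0x0001: "STA_PLL",        # phase-locked loop updates enabled
    0x0002: "STA_PPSFREQ",
    0x0004: "STA_PPSTIME",
    0x0008: "STA_FLL",
    0x0010: "STA_INS",
    0x0020: "STA_DEL",
    0x0040: "STA_UNSYNC",     # clock is marked unsynchronized
    0x0080: "STA_FREQHOLD",
    0x0100: "STA_PPSSIGNAL",
    0x0200: "STA_PPSJITTER",
    0x0400: "STA_PPSWANDER",
    0x0800: "STA_PPSERROR",
    0x1000: "STA_CLOCKERR",
    0x2000: "STA_NANO",
    0x4000: "STA_MODE",
    0x8000: "STA_CLK",
}


def decode_time_status(status: int):
    """Return the names of all status bits set in the given value."""
    return [name for bit, name in STA_FLAGS.items() if status & bit]


def status_from_log(line: str):
    """Pull the hex status out of an ntpd 'TIME_ERROR' syslog line."""
    match = re.search(r"TIME_ERROR: (0x[0-9a-fA-F]+)", line)
    return int(match.group(1), 16) if match else None


line = ("Dec 15 08:25:54 Tower ntpd[2029]: "
        "kernel reports TIME_ERROR: 0x41: Clock Unsynchronized")
print(decode_time_status(status_from_log(line)))
```

With that table, `0x41` is `STA_PLL` plus `STA_UNSYNC`: the kernel is reporting that its clock has not yet been marked as synchronized, with none of the error bits such as `STA_CLOCKERR` set. On that reading the message describes clock discipline state and does not by itself point to a hardware fault.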
pfields Posted December 15, 2021
As a precaution, I just purchased a new USB drive to make sure the kernel panics aren't caused by the USB drive. Can anyone make anything of the kernel panic below?
pfields Posted December 15, 2021
I turned off syslog mirroring this morning and opted to use a remote syslog server instead, and I haven't had a crash since. This leads me to believe that ordering a new USB key is a good bet in the short term.
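For anyone without a syslog server to hand, a throwaway receiver is enough to capture the last lines before a crash on another machine. The Python sketch below is a minimal one, assuming the sender is configured to use UDP. The port 5514 and the file name in the closing comment are hypothetical choices (the standard syslog port, 514, normally needs root to bind), and `make_server` and `format_record` are illustrative names.

```python
import socketserver
from datetime import datetime


def format_record(sender_ip: str, payload: bytes, now=None) -> str:
    """Build one log line: receive time, sender address, raw syslog text."""
    now = now or datetime.now()
    text = payload.decode("utf-8", errors="replace").rstrip("\r\n\x00")
    return f"{now.isoformat(timespec='seconds')} {sender_ip} {text}\n"


class SyslogHandler(socketserver.BaseRequestHandler):
    def handle(self):
        # For UDP servers, self.request is a (data, socket) pair.
        payload, _sock = self.request
        # Open and close the file per message so each line reaches disk
        # promptly, even if the receiver itself is interrupted.
        with open(self.server.log_path, "a", encoding="utf-8") as log:
            log.write(format_record(self.client_address[0], payload))


def make_server(host: str, port: int, log_path: str) -> socketserver.UDPServer:
    """Create a UDP syslog receiver that appends every message to log_path."""
    server = socketserver.UDPServer((host, port), SyslogHandler)
    server.log_path = log_path
    return server


# To run (hypothetical port and file name):
# make_server("0.0.0.0", 5514, "tower-syslog.log").serve_forever()
```

Because the receiver stamps each line with its own clock, the time of the last message received gives a rough time of the crash even when the server's own log is lost.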
Solution
pfields Posted December 18, 2021
I was still suffering from crashes after changing the USB drive. In the end I swapped the Ryzen 7 1700 for a Ryzen 5 2600 and haven't had a crash in over 19 hours. It seems Unraid is still not playing well with first-gen Ryzen, even with the BIOS options set.