K1ng0011 Posted December 15, 2020 (edited)
I am running Unraid 6.8.3 and have been having a lot of problems with crashes lately, about once every few days. When a crash occurs I lose the web GUI, SSH, and the console. I have a monitor hooked up to my server and I want to run the command "tail /var/log/syslog -f" in the console to figure out what is going on. However, it appears I am doing something wrong or my console is having issues. When I boot up my server, the console freezes as soon as I type in my username, and I can no longer type in the console until I reboot the server. The web GUI and everything else continue to work until the server crashes again.
Motherboard: MSI Pro Carbon X370
RAM: 16GB DDR4
CPU: Ryzen 5 2600X
GPU: Nvidia 1660
Edited December 15, 2020 by K1ng0011: Adding Information
JorgeB Posted December 15, 2020
Start with this: https://forums.unraid.net/topic/46802-faq-for-unraid-v6/?do=findComment&comment=819173
If all is as it should be and it still crashes, enable the syslog mirror/server.
K1ng0011 (Author) Posted December 15, 2020
JorgeB, thank you for the link. I applied that BIOS setting last night, along with turning off global C-states, before I posted on the forum. Hopefully that will help resolve my issues. Also, my RAM and processor are not overclocked, and the XMP profile is turned off for the RAM. Since I can't use the console, I started an SSH session on a separate computer with the command "tail /var/log/syslog -f" running.
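A note on keeping that SSH output: a terminal running tail only helps if its scrollback survives, so it can be worth writing the stream to a file on the second computer as it arrives. The following is a minimal Python sketch of that idea, not something posted in the thread. The function name `save_stream`, the script name, and the output file name are all hypothetical, and it assumes Python 3 is available on the separate computer.

```python
import sys


def save_stream(stream, out_path):
    """Append every line read from `stream` to `out_path`.

    The file is flushed after each line so that the last messages
    received before the server locks up are already on disk.
    Returns the number of lines written.
    """
    count = 0
    with open(out_path, "a", encoding="utf-8") as out:
        for line in stream:
            out.write(line)
            out.flush()  # hand each line to the OS immediately
            count += 1
    return count


if __name__ == "__main__" and len(sys.argv) == 2:
    # Hypothetical usage: read the piped log from stdin until the pipe closes.
    save_stream(sys.stdin, sys.argv[1])
```

Assuming the script is saved as save_syslog.py, it could be fed from the same SSH command, for example: ssh root@Tower "tail -f /var/log/syslog" | python3 save_syslog.py tower-syslog.log (the host name and file names here are placeholders). Unraid's built-in syslog mirror/server, which JorgeB suggested, serves the same purpose without an open SSH session.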
K1ng0011 (Author) Posted December 16, 2020
I think I found an issue. My Unraid server has not crashed yet, but I looked at the output of "tail /var/log/syslog -f" running in an SSH session and I am seeing a lot of BTRFS errors. From what I have read, I have some kind of corruption of the BTRFS filesystem on my 1TB NVMe cache drive. This could be what has been causing my issues, or it could just be a symptom of my having to hard power off the server whenever the whole system locks up. I am backing up my appdata; when that has completed I will format the drive, then remove and recreate the docker image.
Dec 15 17:47:51 Tower kernel: BTRFS: error (device nvme0n1p1) in __btrfs_free_extent:6805: errno=-117 unknown
Dec 15 17:47:51 Tower kernel: BTRFS info (device nvme0n1p1): forced readonly
Dec 15 17:47:51 Tower kernel: BTRFS: error (device nvme0n1p1) in btrfs_run_delayed_refs:2935: errno=-117 unknown
Dec 15 17:47:51 Tower kernel: print_req_error: I/O error, dev loop2, sector 0
Dec 15 17:47:51 Tower kernel: BTRFS error (device loop2): bdev /dev/loop2 errs: wr 0, rd 0, flush 1, corrupt 0, gen 0
Dec 15 17:47:51 Tower kernel: BTRFS warning (device loop2): chunk 13631488 missing 1 devices, max tolerance is 0 for writeable mount
Dec 15 17:47:51 Tower kernel: BTRFS: error (device loop2) in write_all_supers:3717: errno=-5 IO failure (errors while submitting device barriers.)
Dec 15 17:47:51 Tower kernel: BTRFS error (device nvme0n1p1): pending csums is 16384
Dec 15 17:47:51 Tower kernel: BTRFS info (device loop2): forced readonly
Dec 15 17:47:51 Tower kernel: BTRFS warning (device loop2): Skipping commit of aborted transaction.
Dec 15 17:47:51 Tower kernel: BTRFS: error (device loop2) in cleanup_transaction:1860: errno=-5 IO failure
Dec 15 20:08:26 Tower kernel: BTRFS error (device loop2): bdev /dev/loop2 errs: wr 603, rd 0, flush 1, corrupt 0, gen 0
Dec 15 20:08:26 Tower kernel: loop: Write error at byte offset 3941613568, length 4096.
Dec 15 20:08:26 Tower kernel: print_req_error: I/O error, dev loop2, sector 7698464
Dec 15 20:08:26 Tower kernel: BTRFS error (device loop2): bdev /dev/loop2 errs: wr 604, rd 0, flush 1, corrupt 0, gen 0
Dec 15 20:08:26 Tower kernel: print_req_error: I/O error, dev loop2, sector 7697872
Dec 15 20:08:26 Tower kernel: BTRFS error (device loop2): bdev /dev/loop2 errs: wr 605, rd 0, flush 1, corrupt 0, gen 0
Dec 15 20:08:56 Tower kernel: loop: Write error at byte offset 3687092224, length 4096.
Dec 15 20:08:56 Tower kernel: print_req_error: I/O error, dev loop2, sector 7201352
Dec 15 20:08:56 Tower kernel: BTRFS error (device loop2): bdev /dev/loop2 errs: wr 606, rd 0, flush 1, corrupt 0, gen 0
Dec 15 20:08:56 Tower kernel: loop: Write error at byte offset 3941310464, length 4096.
Dec 15 20:08:56 Tower kernel: print_req_error: I/O error, dev loop2, sector 7697872
Dec 15 20:08:56 Tower kernel: BTRFS error (device loop2): bdev /dev/loop2 errs: wr 607, rd 0, flush 1, corrupt 0, gen 0
Dec 15 20:08:56 Tower kernel: loop: Write error at byte offset 3941613568, length 4096.
Dec 15 20:08:56 Tower kernel: print_req_error: I/O error, dev loop2, sector 7698464
Dec 15 20:08:56 Tower kernel: BTRFS error (device loop2): bdev /dev/loop2 errs: wr 608, rd 0, flush 1, corrupt 0, gen 0
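For anyone reading a similar log: the "errs: wr N, rd N, flush N, corrupt N, gen N" fields in the lines above are per-device error counters, and here the write counter on /dev/loop2 climbs from 0 to 608 after nvme0n1p1 is forced read-only. A rough Python sketch for pulling those numbers out of a saved syslog follows. The function names are made up for illustration, and the regular expressions are written against the exact message format quoted in this post, so other kernel versions may need adjustments.

```python
import re

# Matches the counter line as it appears in the log above, e.g.
# "... bdev /dev/loop2 errs: wr 603, rd 0, flush 1, corrupt 0, gen 0"
ERRS_RE = re.compile(
    r"bdev (?P<bdev>\S+) errs: wr (?P<wr>\d+), rd (?P<rd>\d+), "
    r"flush (?P<flush>\d+), corrupt (?P<corrupt>\d+), gen (?P<gen>\d+)"
)

# Matches "BTRFS info (device nvme0n1p1): forced readonly"
READONLY_RE = re.compile(r"BTRFS info \(device ([^)]+)\): forced readonly")


def latest_btrfs_counters(lines):
    """Return {device: {counter: value}} from the last counter line per device."""
    latest = {}
    for line in lines:
        m = ERRS_RE.search(line)
        if m:
            fields = m.groupdict()
            bdev = fields.pop("bdev")
            latest[bdev] = {name: int(value) for name, value in fields.items()}
    return latest


def forced_readonly_devices(lines):
    """Return devices that were forced read-only, in order of first appearance."""
    devices = []
    for line in lines:
        m = READONLY_RE.search(line)
        if m and m.group(1) not in devices:
            devices.append(m.group(1))
    return devices
```

Both functions take any iterable of log lines, so an open file object works, for example `latest_btrfs_counters(open("syslog.txt"))` with a hypothetical saved copy of the log. Nonzero counters show that writes failed; they do not by themselves say whether the drive or the earlier hard power-offs are to blame.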
K1ng0011 (Author) Posted December 16, 2020
Last night I formatted my M.2 NVMe drive as XFS; it was BTRFS before. I will watch for further errors via the SSH session I have going to see if the server locks up again, and I will post my findings. Sometimes it takes a few days, so I may not post for a while.
K1ng0011 (Author) Posted December 16, 2020
Also, I have an LSI 9201-16i. It was on older firmware and BIOS (F:19.00.00.00 B:7.37.00.00). I just loaded the newest BIOS and firmware on the LSI card. I don't think this has been causing my lockups, but I might as well rule it out as a potential cause.
K1ng0011 (Author) Posted December 18, 2020
Well, that did not take as long as I thought. The server did not hard lock up. However, I woke up this morning and my dockers had stopped working. I looked at the logs and the only errors I see are call traces. I will post a diagnostic later today. Based on what they are showing, it appears to be related to the NIC and possibly macvlan. On some of my docker containers I run a separate IP address from the host IP. I am not sure whether this is because the separate 10gig NIC I have has compatibility issues with macvlan. I think there are two options I can try at this point: disable the dockers with custom IPs, or remove my PCIe 10gig NIC and see whether I get the same issue with the custom IP dockers enabled using the onboard 1 gig NIC.
K1ng0011 (Author) Posted December 20, 2020
My errors seem to be related to call traces. I am not experienced enough to tell you exactly what they mean, but I do see a lot of network-related events and I think it is related to the macvlan issue some people experience with some hardware. I have referenced another thread below where Hoopster reported similar issues. I have currently removed my 10gig PCIe NIC and I am running off my motherboard NIC. We will see if that makes any difference in the call traces. If that does not work I will disable all dockers with separate IPs. I am trying to narrow down what is causing the issue.
10 Gig NIC: ASUS XG-C100C
Call Trace Issues Thread:
Dec 19 14:15:10 Tower kernel: resource sanity check: requesting [mem 0x000c0000-0x000fffff], which spans more than PCI Bus 0000:00 [mem 0x000c0000-0x000dffff window]
Dec 19 14:15:10 Tower kernel: caller _nv000745rm+0x1af/0x200 [nvidia] mapping multiple BARs
Dec 19 14:15:12 Tower kernel: resource sanity check: requesting [mem 0x000c0000-0x000fffff], which spans more than PCI Bus 0000:00 [mem 0x000c0000-0x000dffff window]
Dec 19 14:15:12 Tower kernel: caller _nv000745rm+0x1af/0x200 [nvidia] mapping multiple BARs
Dec 19 17:54:27 Tower webGUI: Successful login user root from 10.45.45.123
Dec 19 23:30:03 Tower webGUI: Successful login user root from 10.45.45.123
Dec 19 23:52:25 Tower kernel: resource sanity check: requesting [mem 0x000c0000-0x000fffff], which spans more than PCI Bus 0000:00 [mem 0x000c0000-0x000dffff window]
Dec 19 23:52:25 Tower kernel: caller _nv000745rm+0x1af/0x200 [nvidia] mapping multiple BARs
Dec 20 01:44:14 Tower kernel: WARNING: CPU: 3 PID: 4818 at net/netfilter/nf_conntrack_core.c:945 __nf_conntrack_confirm+0x97/0x6b4
Dec 20 01:44:14 Tower kernel: Modules linked in: tun xt_nat macvlan nvidia_uvm(O) xt_CHECKSUM ipt_MASQUERADE ipt_REJECT ip6table_mangle ip6table_nat nf_nat_ipv6 iptable_mangle iptable_nat nf_nat_ipv4 nf_nat ip6table_filter ip6_tables iptable_filter ip_tables xfs md_mod nct6775 hwmon_vid edac_mce_amd nvidia_drm(PO) nvidia_modeset(PO) nvidia(PO) drm_kms_helper drm crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel pcbc aesni_intel mpt3sas aes_x86_64 crypto_simd cryptd agpgart atlantic syscopyarea sysfillrect glue_helper i2c_piix4 k10temp mxm_wmi nvme i2c_core sysimgblt raid_class wmi_bmof fb_sys_fops ahci scsi_transport_sas libahci nvme_core pcc_cpufreq wmi button acpi_cpufreq [last unloaded: ccp]
Dec 20 01:44:14 Tower kernel: CPU: 3 PID: 4818 Comm: avahi-daemon Tainted: P O 4.19.107-Unraid #1
Dec 20 01:44:14 Tower kernel: Hardware name: Micro-Star International Co., Ltd. MS-7A32/X370 GAMING PRO CARBON (MS-7A32), BIOS 1.L0 01/21/2019
Dec 20 01:44:14 Tower kernel: RIP: 0010:__nf_conntrack_confirm+0x97/0x6b4
Dec 20 01:44:14 Tower kernel: Code: c1 ed 20 89 2c 24 e8 5f fb ff ff 8b 54 24 04 89 ef 89 c6 41 89 c4 e8 8f f9 ff ff 84 c0 75 b9 49 8b 86 80 00 00 00 a8 08 74 25 <0f> 0b 44 89 e6 89 ef 45 31 ff e8 5d f1 ff ff be 00 02 00 00 48 c7
Dec 20 01:44:14 Tower kernel: RSP: 0018:ffff88840eac38f0 EFLAGS: 00010202
Dec 20 01:44:14 Tower kernel: RAX: 0000000000000188 RBX: ffff888103a72a00 RCX: 0000000038e3383a
Dec 20 01:44:14 Tower kernel: RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffffffff81e08fc8
Dec 20 01:44:14 Tower kernel: RBP: 0000000000002272 R08: ffff88811265a6b0 R09: 000000008101c937
Dec 20 01:44:14 Tower kernel: R10: 0000000000000158 R11: ffffffff81e91080 R12: 0000000000000071
Dec 20 01:44:14 Tower kernel: R13: ffffffff81e91080 R14: ffff88811265a640 R15: ffff88811265a698
Dec 20 01:44:14 Tower kernel: FS: 000014820b965b80(0000) GS:ffff88840eac0000(0000) knlGS:0000000000000000
Dec 20 01:44:14 Tower kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Dec 20 01:44:14 Tower kernel: CR2: 0000154f18532870 CR3: 000000040a900000 CR4: 00000000003406e0
Dec 20 01:44:14 Tower kernel: Call Trace:
Dec 20 01:44:14 Tower kernel: <IRQ>
Dec 20 01:44:14 Tower kernel: ipv4_confirm+0xaf/0xb7
Dec 20 01:44:14 Tower kernel: nf_hook_slow+0x37/0x96
Dec 20 01:44:14 Tower kernel: ip_local_deliver+0xa9/0xd7
Dec 20 01:44:14 Tower kernel: ? ip_sublist_rcv_finish+0x53/0x53
Dec 20 01:44:14 Tower kernel: ip_sabotage_in+0x38/0x3e
Dec 20 01:44:14 Tower kernel: nf_hook_slow+0x37/0x96
Dec 20 01:44:14 Tower kernel: ip_rcv+0x8e/0xbe
Dec 20 01:44:14 Tower kernel: ? ip_rcv_finish_core.isra.0+0x2e2/0x2e2
Dec 20 01:44:14 Tower kernel: __netif_receive_skb_one_core+0x4d/0x69
Dec 20 01:44:14 Tower kernel: netif_receive_skb_internal+0x79/0x94
Dec 20 01:44:14 Tower kernel: br_pass_frame_up+0x123/0x145
Dec 20 01:44:14 Tower kernel: ? br_port_flags_change+0x29/0x29
Dec 20 01:44:14 Tower kernel: br_handle_frame_finish+0x335/0x37a
Dec 20 01:44:14 Tower kernel: ? ipt_do_table+0x5b6/0x603 [ip_tables]
Dec 20 01:44:14 Tower kernel: ? br_pass_frame_up+0x145/0x145
Dec 20 01:44:14 Tower kernel: br_nf_hook_thresh+0xa3/0xc3
Dec 20 01:44:14 Tower kernel: ? br_pass_frame_up+0x145/0x145
Dec 20 01:44:14 Tower kernel: br_nf_pre_routing_finish+0x239/0x260
Dec 20 01:44:14 Tower kernel: ? br_pass_frame_up+0x145/0x145
Dec 20 01:44:14 Tower kernel: ? nf_nat_ipv4_in+0x1d/0x64 [nf_nat_ipv4]
Dec 20 01:44:14 Tower kernel: br_nf_pre_routing+0x2fc/0x321
Dec 20 01:44:14 Tower kernel: ? br_nf_forward_ip+0x352/0x352
Dec 20 01:44:14 Tower kernel: nf_hook_slow+0x37/0x96
Dec 20 01:44:14 Tower kernel: br_handle_frame+0x290/0x2d3
Dec 20 01:44:14 Tower kernel: ? br_pass_frame_up+0x145/0x145
Dec 20 01:44:14 Tower kernel: ? br_handle_local_finish+0xe/0xe
Dec 20 01:44:14 Tower kernel: __netif_receive_skb_core+0x4a9/0x7db
Dec 20 01:44:14 Tower kernel: ? udp_gro_receive+0x4c/0x134
Dec 20 01:44:14 Tower kernel: __netif_receive_skb_one_core+0x31/0x69
Dec 20 01:44:14 Tower kernel: netif_receive_skb_internal+0x79/0x94
Dec 20 01:44:14 Tower kernel: napi_gro_receive+0x42/0x76
Dec 20 01:44:14 Tower kernel: aq_ring_rx_clean+0x32e/0x35c [atlantic]
Dec 20 01:44:14 Tower kernel: ? hw_atl_b0_hw_ring_rx_receive+0x129/0x1f5 [atlantic]
Dec 20 01:44:14 Tower kernel: aq_vec_poll+0xee/0x17d [atlantic]
Dec 20 01:44:14 Tower kernel: net_rx_action+0x10b/0x274
Dec 20 01:44:14 Tower kernel: __do_softirq+0xce/0x1e2
Dec 20 01:44:14 Tower kernel: irq_exit+0x5e/0x9d
Dec 20 01:44:14 Tower kernel: do_IRQ+0xaf/0xcd
Dec 20 01:44:14 Tower kernel: common_interrupt+0xf/0xf
Dec 20 01:44:14 Tower kernel: </IRQ>
Dec 20 01:44:14 Tower kernel: RIP: 0010:fput+0x6/0x77
Dec 20 01:44:14 Tower kernel: Code: ff 53 31 ff 48 87 3d 1c 75 0f 01 48 85 ff 74 0d 48 8b 1f e8 66 fe ff ff 48 89 df eb ee 5b c3 e9 5a fe ff ff 53 f0 48 ff 4f 38 <75> 6d 48 89 fb 65 48 8b 3c 25 40 5c 01 00 65 8b 05 b0 72 ec 7e a9
Dec 20 01:44:14 Tower kernel: RSP: 0018:ffffc900020b7ac0 EFLAGS: 00000202 ORIG_RAX: ffffffffffffffd8
Dec 20 01:44:14 Tower kernel: RAX: 0000000000000282 RBX: ffffc900020b7c80 RCX: ffff88840844cb68
Dec 20 01:44:14 Tower kernel: RDX: dead000000000200 RSI: 0000000000000282 RDI: ffff8883c84bd200
Dec 20 01:44:14 Tower kernel: RBP: ffff8883736d2000 R08: ffff8883c84bd300 R09: ffff8883c84bd300
Dec 20 01:44:14 Tower kernel: R10: ffffffff8115478e R11: ffff88840ba3ec80 R12: 0000000000000001
Dec 20 01:44:14 Tower kernel: R13: ffffc900020b7c80 R14: ffffc900020b7c50 R15: 0000000000000000
Dec 20 01:44:14 Tower kernel: ? generic_pipe_buf_confirm+0x3/0x3
Dec 20 01:44:14 Tower kernel: poll_freewait+0x3e/0x87
Dec 20 01:44:14 Tower kernel: do_sys_poll+0x39f/0x426
Dec 20 01:44:14 Tower kernel: ? udp_rmem_release+0x47/0x10b
Dec 20 01:44:14 Tower kernel: ? _copy_to_user+0x22/0x28
Dec 20 01:44:14 Tower kernel: ? put_cmsg+0xaa/0xf5
Dec 20 01:44:14 Tower kernel: ? __skb_recv_udp+0x16b/0x27d
Dec 20 01:44:14 Tower kernel: ? compat_poll_select_copy_remaining+0x118/0x118
Dec 20 01:44:14 Tower kernel: ? compat_poll_select_copy_remaining+0x118/0x118
Dec 20 01:44:14 Tower kernel: ? compat_poll_select_copy_remaining+0x118/0x118
Dec 20 01:44:14 Tower kernel: ? compat_poll_select_copy_remaining+0x118/0x118
Dec 20 01:44:14 Tower kernel: ? compat_poll_select_copy_remaining+0x118/0x118
Dec 20 01:44:14 Tower kernel: ? compat_poll_select_copy_remaining+0x118/0x118
Dec 20 01:44:14 Tower kernel: ? compat_poll_select_copy_remaining+0x118/0x118
Dec 20 01:44:14 Tower kernel: ? compat_poll_select_copy_remaining+0x118/0x118
Dec 20 01:44:14 Tower kernel: ? compat_poll_select_copy_remaining+0x118/0x118
Dec 20 01:44:14 Tower kernel: __se_sys_poll+0x55/0xd1
Dec 20 01:44:14 Tower kernel: do_syscall_64+0x57/0xf2
Dec 20 01:44:14 Tower kernel: entry_SYSCALL_64_after_hwframe+0x44/0xa9
Dec 20 01:44:14 Tower kernel: RIP: 0033:0x14820ba71e63
Dec 20 01:44:14 Tower kernel: Code: 49 8b 45 10 5d 41 5c 41 5d 41 5e c3 66 2e 0f 1f 84 00 00 00 00 00 90 64 8b 04 25 18 00 00 00 85 c0 75 14 b8 07 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 55 c3 0f 1f 40 00 48 83 ec 28 89 54 24 1c 48
Dec 20 01:44:14 Tower kernel: RSP: 002b:00007fff908cd478 EFLAGS: 00000246 ORIG_RAX: 0000000000000007
Dec 20 01:44:14 Tower kernel: RAX: ffffffffffffffda RBX: 0000000000426180 RCX: 000014820ba71e63
Dec 20 01:44:14 Tower kernel: RDX: 0000000000000064 RSI: 000000000000000b RDI: 0000000000449f90
Dec 20 01:44:14 Tower kernel: RBP: 000014820b965b00 R08: 0000000000000000 R09: 0000000000000006
Dec 20 01:44:14 Tower kernel: R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000426230
Dec 20 01:44:14 Tower kernel: R13: 000000000042ad10 R14: 00000000004229b0 R15: 0000000000000000
Dec 20 01:44:14 Tower kernel: ---[ end trace 32125fa20ad6dba7 ]---
Dec 20 04:00:01 Tower kernel: mdcmd (92): check
Dec 20 04:00:01 Tower kernel: md: recovery thread: check P Q ...
Dec 20 04:00:24 Tower root: /var/lib/docker: 20.6 GiB (22055636992 bytes) trimmed on /dev/loop2
Dec 20 04:00:24 Tower root: /mnt/cache: 766.1 GiB (822554726400 bytes) trimmed on /dev/nvme0n1p1
Dec 20 04:40:01 Tower apcupsd[4765]: apcupsd exiting, signal 15
Dec 20 04:40:01 Tower apcupsd[4765]: apcupsd shutdown succeeded
Dec 20 04:40:04 Tower apcupsd[24328]: apcupsd 3.14.14 (31 May 2016) slackware startup succeeded
Dec 20 04:40:04 Tower apcupsd[24328]: NIS server startup succeeded
Dec 20 04:51:46 Tower crond[2019]: exit status 1 from user root /usr/local/sbin/mover &> /dev/null
Dec 20 08:52:29 Tower dhcpcd[6763]: br0: failed to renew DHCP, rebinding
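A call trace like the one above is easier to compare across crashes once it is reduced to two things: where the WARNING fired, and which loadable modules show up in the stack frames (the names in square brackets, such as atlantic, which is the in-kernel driver for Aquantia-based NICs). The Python sketch below does that for a saved syslog. The function name `summarize_traces` is hypothetical, and the patterns assume the 4.19-era message format shown in this post.

```python
import re

# "WARNING: CPU: 3 PID: 4818 at net/netfilter/nf_conntrack_core.c:945 ..."
WARNING_RE = re.compile(r"kernel: WARNING: CPU: \d+ PID: \d+ at (?P<where>\S+)")

# "---[ end trace 32125fa20ad6dba7 ]---"
END_RE = re.compile(r"kernel: ---\[ end trace")

# A stack frame that belongs to a module: "aq_vec_poll+0xee/0x17d [atlantic]"
FRAME_MODULE_RE = re.compile(r"\+0x[0-9a-f]+/0x[0-9a-f]+ \[(\w+)\]")


def summarize_traces(lines):
    """Return one dict per WARNING block: source location plus module names.

    Modules are listed in order of first appearance, which follows the
    stack from the innermost frame outward.
    """
    traces = []
    current = None
    for line in lines:
        start = WARNING_RE.search(line)
        if start:
            current = {"where": start.group("where"), "modules": []}
            continue
        if current is None:
            continue  # not inside a trace, ignore the line
        if END_RE.search(line):
            traces.append(current)
            current = None
            continue
        frame = FRAME_MODULE_RE.search(line)
        if frame and frame.group(1) not in current["modules"]:
            current["modules"].append(frame.group(1))
    return traces
```

A module appearing in the frames only shows that the packet passed through that code on its way to the warning; it does not prove the driver is at fault. It does give a quick way to check whether later traces still involve the same NIC driver after a hardware change.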
K1ng0011 (Author) Posted December 28, 2020
I was getting call trace errors every two days or so. As of this post I have had 7 days and 20 hours of uptime with no further call traces. I removed my 10 gig PCIe NIC and removed several plugins, keeping only the ones I have to have. I have had the "Host access to custom networks" setting enabled under the advanced docker settings, along with dockers having their own IP addresses. I am going to let it run for another week. Then I will enable the "AMD Power Supply Idle Control" BIOS setting and the global C-states option and let it run for another two weeks. After that I will reinstall my plugins and let that run for two weeks. If I encounter no further issues, the problem is likely related specifically to my NIC.
K1ng0011 (Author) Posted January 4, 2021
Well, it has been about 15 days. No further call traces, errors, or lockups. I logged into the BIOS, set the power supply idle control setting back to "auto", and enabled the global C-states option by also setting it to auto. If there are no further errors or issues I will start to reinstall my plugins. I will update this post in two weeks unless I get an error or a hard lockup again.
K1ng0011 (Author) Posted January 21, 2021
I am back. No further call traces, errors, or lockups. I have had the power supply idle control setting and global C-states both set to auto, and this has not caused any further issues. At this point I think the issue is related to my 10Gig NIC somehow. However, I will install some plugins, see if that causes any issues, and report back in another two weeks.
K1ng0011 (Author) Posted February 13, 2021
No further issues after installing my plugins. It does appear that this is somehow related to my PCIe 10 gig NIC (ASUS XG-C100C). The problems stopped when I removed it from my server. For reference, at this point I have had 30 days of uptime with no further hard locks, errors, or call traces.