rwdesigner Posted December 26, 2021 (edited)

Howdy! I've really been enjoying Unraid, and have recently migrated my install to new hardware.

Previously, I was using a Supermicro board with dual Xeons, and it was working pretty well, with the occasional freeze-up. I was trying to use the Unraid server for many things, but with the occasional freeze-ups, I started removing tasks, drives, and hardware to get down to just:
- Quadro P2000
- 400GB PCIe NVMe (cache)
- two 16TB HDDs

Software was all Dockers: Plex, Tautulli, and Pi-hole.

I wanted to move to a board with a newer generation processor with Quick Sync, an M.2 SSD, and also something that was quieter. So, I upgraded to:
- Intel Core i7-11700K
- ASUS PRIME B560-PLUS
- 16GB DDR4 2133MHz (4GB x 4)
- Noctua NH-U9S
- SAMSUNG 980 PRO M.2 2280 2TB PCIe Gen 4.0 x4, NVMe (cache)
- Quadro P2000 (still installed)

The hardware upgrade came with some issues. The old motherboard had IPMI, and I had set up nerdtools to check it for stats and report to the dashboard. The new motherboard doesn't have IPMI, and the log file was filling up with IPMI failures to connect or something like that. So, I removed the IPMI tool. I also successfully migrated the cache drive from the 400GB PCIe NVMe to the SAMSUNG 980 PRO M.2 2280 2TB PCIe Gen 4.0.

Unfortunately, I'm still having issues where the system will abruptly reboot. The last time, I was just running a parity check and Dockers were disabled.

To troubleshoot, I've:
- Run MEMTEST86 - 2 passes - 0 errors (2 hrs)
- Replaced SATA cables to HDDs
- Removed all other SATA SSDs; the only drives connected are: USB (Unraid), M.2 cache drive, 2x 16TB HDDs
- Disabled Pi-hole, because some of my research pointed toward "dockers using custom br0 network interface can cause kernel panics"
- Enabled IPv6 yesterday while trying to get OpenVPN working (it was previously disabled... maybe due to previous troubleshooting with this issue?)

Currently: removed the USB wireless keyboard (maybe this was causing the kernel panic when it went into sleep mode?) and attached a PS/2 keyboard. Dockers disabled. Running a parity check.

..... and it crashed/rebooted again with no Dockers running, while it was just running a parity check....

I really feel like just starting over and seeing if that fixes anything... maybe just get a new test Unraid USB going with some test HDDs and see if that configuration crashes....

TLDR: Unraid system randomly kernel panics.

Please and thank you for the help. I'm at my wit's end here =\

Edited December 26, 2021 by rwdesigner
rwdesigner Posted December 27, 2021 (Author)

Attachment: helix-diagnostics-20211226-1643.zip
JorgeB Posted December 27, 2021

Enable the syslog server and post the resulting log after a crash.
rwdesigner Posted December 27, 2021 (Author)

Thank you for the reply! The syslog server is enabled. I created a new share named syslogdata, which is only on the cache and is public. However, the share is still empty, so it seems the syslog server is not writing to the share and I'm not getting any logs...
itimpi Posted December 28, 2021

When setting up the syslog server, what did you put into the "Remote syslog server" field? It needs to be the address of the Unraid server itself for it to act as both client and server.
rwdesigner Posted December 28, 2021 (Author)

Thank you. I had the remote server field blank...

- Local Server - Enabled - UDP - 514
- Local Syslog Folder - syslogdata
- Rotation - Disabled
- Remote syslog server - (IP of Unraid machine) - UDP - 514
- Mirror to flash - No

OK, running a parity check. I'll see if any logs make it to the syslogdata share. Thank you again.
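One way to check that a syslog server configured as above is actually receiving messages, before waiting for the next crash, is to send it a test message from another machine. The following is only a minimal sketch: the tag `syslogtest`, the helper names, and the bare-bones "<PRI>tag: text" message layout are assumptions made for illustration, and nothing here is part of Unraid itself. It uses plain UDP on port 514 to match the settings listed in the post.

```python
import socket


def build_syslog_message(text, facility=1, severity=6, tag="syslogtest"):
    """Build a bare-bones BSD-syslog-style payload: <PRI>tag: text.

    PRI is facility * 8 + severity; facility 1 (user) and severity 6 (info)
    give <14>. The tag is a made-up label to make the line easy to find.
    """
    pri = facility * 8 + severity
    return f"<{pri}>{tag}: {text}".encode("utf-8")


def send_test_message(host, port=514, text="syslog test message"):
    """Send one UDP datagram to the syslog server and return the payload sent."""
    payload = build_syslog_message(text)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (host, port))
    return payload
```

Calling `send_test_message("<ip of unraid machine>")` and then looking for a line containing `syslogtest` in the syslogdata share would show whether logging works. UDP gives no delivery confirmation, so an empty share after the call suggests a configuration problem rather than proving one.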
rwdesigner Posted December 28, 2021 (Author)

Well, I guess it crashed and rebooted, because the uptime is only 2 hours... Looks like logging worked... thank you.

Attachment: syslog-192.168.10.7.log
rwdesigner Posted December 31, 2021 (Author)

If anyone has any suggestions, I sure would appreciate it. Thank you.
JorgeB Posted December 31, 2021

There's nothing obvious logged, which is kind of expected if it's a hardware problem. One thing you can try is to boot the server in safe mode with all Dockers/VMs disabled and let it run as a basic NAS for a few days. If it still crashes, it's likely a hardware problem; if it doesn't, start turning on the other services one by one.
rwdesigner Posted December 31, 2021 (Author)

Thank you, Jorge. The system has been up for 3 days with no Dockers/VMs running. But if I try to just do a parity check, I know it will crash. I'll try doing a parity check in safe mode and see what happens...
rwdesigner Posted December 31, 2021 (Author)

Less than an hour later...

Kernel panic - not syncing : Fatal exception in interrupt

If this is not a software issue with Unraid, jeez, I don't even know where to start. Try a bootable USB with SeaTools and run a SMART check there? If that crashes... it's the CPU/MB/RAM/HDD, right?

The CPU, MB, and PSU are new. The HDDs are about a year old. The RAM is used, but it passed a couple of hours of memtest.

Thank you for the help.
rwdesigner Posted December 31, 2021 (Author)

Is there any chance this is related to the system time?

Quote:
Dec 31 12:50:54 Helix ntpd[2023]: kernel reports TIME_ERROR: 0x41: Clock Unsynchronized
Dec 31 13:14:59 Helix kernel: md: recovery thread: P corrected, sector=891215144
Dec 31 13:14:59 Helix kernel: md: recovery thread: P corrected, sector=891215152
Dec 31 13:14:59 Helix kernel: md: recovery thread: P corrected, sector=891220760
Dec 31 13:23:26 Helix kernel: general protection fault, probably for non-canonical address 0x7063742f30373409: 0000 [#1] SMP NOPTI
Dec 31 13:23:26 Helix kernel: CPU: 10 PID: 0 Comm: swapper/10 Not tainted 5.10.28-Unraid #1
Dec 31 13:23:26 Helix kernel: Hardware name: ASUS System Product Name/PRIME B560-PLUS, BIOS 0820 04/27/2021
Dec 31 13:23:26 Helix kernel: RIP: 0010:bio_endio+0x50/0xc7
Dec 31 13:23:26 Helix kernel: Code: 01 75 0b 48 8b 45 08 48 85 c0 75 17 eb 2d 48 83 7d 58 00 74 ee 48 89 ef e8 02 8e 02 00 84 c0 75 e2 eb 7b 48 8b 80 a8 03 00 00 <48> 8b 78 28 48 85 ff 74 08 48 89 ee e8 a2 7a 01 00 48 81 7d 38 20
Dec 31 13:23:26 Helix kernel: RSP: 0018:ffffc90000334eb0 EFLAGS: 00010286
Dec 31 13:23:26 Helix kernel: RAX: 7063742f30373409 RBX: ffff888104ef7800 RCX: 0000000000000001
Dec 31 13:23:26 Helix kernel: RDX: 0000000000000000 RSI: ffff8881039a0fa0 RDI: ffff8881039a0f28
Dec 31 13:23:26 Helix kernel: RBP: ffff8881039a0f28 R08: ffff888104ef7800 R09: 0000000000000200
Dec 31 13:23:26 Helix kernel: R10: 0000000000000002 R11: ffffffff8251e750 R12: 000000000006f000
Dec 31 13:23:26 Helix kernel: R13: 0000000000039000 R14: 0000000000000000 R15: 0000000000001000
Dec 31 13:23:26 Helix kernel: FS: 0000000000000000(0000) GS:ffff88844f680000(0000) knlGS:0000000000000000
Dec 31 13:23:26 Helix kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Dec 31 13:23:26 Helix kernel: CR2: 0000150982217ff8 CR3: 000000000400a006 CR4: 0000000000770ee0
Dec 31 13:23:26 Helix kernel: PKRU: 55555554
Dec 31 13:23:26 Helix kernel: Call Trace:
Dec 31 13:23:26 Helix kernel: <IRQ>
Dec 31 13:23:26 Helix kernel: blk_update_request+0x1f9/0x2ad
Dec 31 13:23:26 Helix kernel: scsi_end_request+0x22/0xda
Dec 31 13:23:26 Helix kernel: scsi_io_completion+0x146/0x3bf
Dec 31 13:23:26 Helix kernel: blk_done_softirq+0x7c/0x99
Dec 31 13:23:26 Helix kernel: __do_softirq+0xc4/0x1c2
Dec 31 13:23:26 Helix kernel: asm_call_irq_on_stack+0xf/0x20
Dec 31 13:23:26 Helix kernel: </IRQ>
Dec 31 13:23:26 Helix kernel: do_softirq_own_stack+0x2c/0x39
Dec 31 13:23:26 Helix kernel: __irq_exit_rcu+0x45/0x80
Dec 31 13:23:26 Helix kernel: common_interrupt+0x119/0x12e
Dec 31 13:23:26 Helix kernel: asm_common_interrupt+0x1e/0x40
Dec 31 13:23:26 Helix kernel: RIP: 0010:arch_local_irq_enable+0x7/0x8
Dec 31 13:23:26 Helix kernel: Code: 00 48 83 c4 28 4c 89 e0 5b 5d 41 5c 41 5d 41 5e 41 5f c3 9c 58 0f 1f 44 00 00 c3 fa 66 0f 1f 44 00 00 c3 fb 66 0f 1f 44 00 00 <c3> 55 8b af 28 04 00 00 b8 01 00 00 00 45 31 c9 53 45 31 d2 39 c5
Dec 31 13:23:26 Helix kernel: RSP: 0018:ffffc9000014bea0 EFLAGS: 00000246
Dec 31 13:23:26 Helix kernel: RAX: ffff88844f6a2380 RBX: 0000000000000003 RCX: 000000000000001f
Dec 31 13:23:26 Helix kernel: RDX: 0000000000000000 RSI: 00000000238e38e3 RDI: 0000000000000000
Dec 31 13:23:26 Helix kernel: RBP: ffffe8ffffc99000 R08: 0000021c3524ab48 R09: 000000000000023b
Dec 31 13:23:26 Helix kernel: R10: 0000000000000252 R11: 071c71c71c71c71c R12: 0000021c3524ab48
Dec 31 13:23:26 Helix kernel: R13: ffffffff820c5dc0 R14: 0000000000000003 R15: 0000000000000000
Dec 31 13:23:26 Helix kernel: cpuidle_enter_state+0x101/0x1c4
Dec 31 13:23:26 Helix kernel: cpuidle_enter+0x25/0x31
Dec 31 13:23:26 Helix kernel: do_idle+0x1a6/0x214
Dec 31 13:23:26 Helix kernel: cpu_startup_entry+0x18/0x1a
Dec 31 13:23:26 Helix kernel: secondary_startup_64_no_verify+0xb0/0xbb
Dec 31 13:23:26 Helix kernel: Modules linked in: xfs md_mod ip6table_filter ip6_tables iptable_filter ip_tables x_tables bonding wmi_bmof x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel aesni_intel crypto_simd nvme cryptd i2c_i801 nvme_core i2c_smbus i2c_core video input_leds glue_helper led_class ahci wmi e1000e libahci backlight thermal acpi_pad button fan
Dec 31 13:23:26 Helix kernel: ---[ end trace 3004778bc7b6ba3c ]---
Dec 31 13:23:26 Helix kernel: RIP: 0010:bio_endio+0x50/0xc7
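When a mirrored syslog runs to thousands of lines, it can help to pull out only the entries that mark a crash, like the general protection fault quoted above. The sketch below is illustrative only: the marker strings are taken from the messages posted in this thread, the function and constant names are invented for the example, and the regular expression assumes the "Mon DD HH:MM:SS host source: message" layout seen in the quoted log.

```python
import re

# Substrings that appear in the crash output quoted in this thread.
CRASH_MARKERS = (
    "general protection fault",
    "Kernel panic",
    "Call Trace:",
    "end trace",
)

# Assumed layout: "Dec 31 13:23:26 Helix kernel: message text"
LINE_RE = re.compile(r"^(\w{3}\s+\d+\s[\d:]{8})\s(\S+)\s([^:]+):\s?(.*)$")


def find_crash_lines(lines):
    """Return (timestamp, message) pairs for lines containing a crash marker."""
    hits = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # skip lines that do not follow the assumed layout
        timestamp, _host, _source, message = match.groups()
        if any(marker in message for marker in CRASH_MARKERS):
            hits.append((timestamp, message))
    return hits
```

Passing an open log file (for example the syslog-192.168.10.7.log attachment, opened with `open(path, errors="replace")`) to `find_crash_lines` returns the timestamps to look at first. It only narrows the search; it does not say anything about the cause.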
JorgeB Posted January 1, 2022

15 hours ago, rwdesigner said: "But if I try to just do a parity check, I know it will crash."

This can sometimes happen with some kernel/hardware combinations. Try updating to v6.10-rc2; it uses a much newer kernel.
rwdesigner Posted January 1, 2022 (Author)

I feel like I'm in crazy land here... When trying to update, I see two lines: one with the currently installed version (6.9.2) and one with 6.9.1. I click "Branch - Next" (that line disappears and all I see is 6.9.1 with the Restore option). Then, after I click "Check for Updates", the popup window appears and in the background I see v6.10-rc2 with the Install button under Status, but when I close the popup window, that changes back to "6.9.2 - Up to Date". I tried this with the array both started and stopped.

Okay, after trying the same procedure about 10 times, the Install button persisted after I closed the popup window... weird weird weird...

I'll report back on whether upgrading helps. Thank you.
rwdesigner Posted January 1, 2022 (Author)

Updated to v6.10-rc2. Rebooted. Crashed again while doing a parity check... Humm..
rwdesigner Posted January 1, 2022 (Author, edited)

- Installed Win10 on a SATA SSD and updated Windows
- Downloaded and installed SeaTools for Windows
- SMART test - both drives passed
- Short DST - both drives passed
- Short Generic Read Test - both drives passed
- Long Generic Read Test - both drives passed

Edited January 2, 2022 by rwdesigner (long test finished)
JorgeB Posted January 2, 2022

Unlikely to be a disk issue, but could be hardware related, board, RAM, etc.
rwdesigner Posted January 2, 2022 (Author)

6 hours ago, JorgeB said: "Unlikely to be a disk issue, but could be hardware related, board, RAM, etc."

I know how to run tests for the HDDs, RAM, and CPU, but I have no idea how to run any motherboard tests. And if the system is fully functional under Windows, perhaps that indicates a compatibility issue between Unraid and this hardware. That is very frustrating, because Unraid has a reputation for not caring about what hardware it is running on.

The long generic HDD test has been running for almost 24 hours. I don't see any issues reported yet.

I just bought all of this new hardware, and damn is this disheartening... I guess I'll try setting up a new Unraid install with different HDDs and see if this piece of junk continues to crash. Maybe it's something corrupt in the Unraid files/config....

I would pay for help at this point...
Vr2Io Posted January 2, 2022

7 hours ago, JorgeB said: "Unlikely to be a disk issue"

Agreed.

On 1/1/2022 at 2:51 AM, rwdesigner said: "But if I try to just do a parity check, I know it will crash."

That's actually good: it crashes right away, which makes troubleshooting much easier than with an intermittent fault. I would use some dummy disks for testing.

BTW, it still looks like a memory issue. I suggest running memtest for longer, or simply testing with a single RAM module.
rwdesigner Posted January 2, 2022 (Author)

30 minutes ago, Vr2Io said: "BTW, it still looks like a memory issue. I suggest running memtest for longer, or simply testing with a single RAM module."

Since the RAM is older/used, I decided to order some RAM from the motherboard's QVL and see if that resolves the issue. Thank you.
rwdesigner Posted January 14, 2022 (Author, Solution)

Well, I have good news to report. After purchasing some RAM (Corsair Dominator Platinum 16GB (2x8GB) DDR4 Gen6 3200MHz) from the motherboard's QVL, the parity check completed successfully! No crashing. No reboots.

Quote:
Last check completed on Fri 14 Jan 2022 10:02:52 AM CST (today)
Finding 40 errors
Duration: 21 hours, 41 minutes, 54 seconds. Average speed: 204.8 MB/sec

Uptime: 1 day, 2 hours, 45 minutes!

Just started Dockers and... I'll report back, but it seems that:

SOLUTION: Install RAM which is listed on the motherboard's QVL
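As a quick sanity check on the quoted parity-check result, the reported average speed can be reproduced from the duration and the drive size. This is a back-of-the-envelope sketch: it assumes the nominal capacity of a 16TB drive (16e12 bytes) and decimal megabytes, and the function name is made up for the example.

```python
def parity_check_speed_mb_s(capacity_bytes, hours, minutes, seconds):
    """Average speed in decimal MB/s for reading capacity_bytes in the given time."""
    duration_s = hours * 3600 + minutes * 60 + seconds
    return capacity_bytes / duration_s / 1e6


# 16 TB nominal capacity (assumption), 21 h 41 min 54 s from the quoted result.
speed = parity_check_speed_mb_s(16e12, 21, 41, 54)
print(f"{speed:.1f} MB/s")  # prints "204.8 MB/s"
```

The figure matches the 204.8 MB/sec in the report, which is consistent with the check having covered the full 16TB without interruption.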