Shomesomesho Posted December 11, 2023 (edited)
Hello everyone, I'm hoping to get some help with my random reboot problem. I just built this system in November and I've been struggling to solve this on my own.
My server seems to crash or reboot itself after anywhere from 5 to 72 hours of uptime. I have the array set not to start on boot, so I come back to find the server just sitting there, waiting for me to start the array, with notifications about unclean shutdowns. It seems more likely to occur when I'm doing tasks with a lot of disk activity, like a Parity Sync or running Mover (with many TBs to move).
The server is plugged into an APC UPS with fresh batteries that report good health, and I have NUT set up with a shutdown method.
So far, I've done the following:
Changed the CPU from an i5-13500 to an i7-13700K (upgrade unrelated to this issue).
Removed an old disk with questionable SMART data, without alleviation.
Replaced the TIM on my HBA.
Reseated the RAM.
I'm currently running memtest, but do not have results yet. I mirrored the syslog to flash and will attach it.
System Specs:
CPU: Intel Core i5-13500 > Intel Core i7-13700K
Motherboard: ASRock Z690 Extreme ATX LGA1700
Memory: G.Skill Ripjaws V 64 GB (4 x 16 GB) DDR4-3200 CL16
Appdata Storage: Western Digital Black SN850X 2 TB M.2-2280 PCIe 4.0 X4 NVMe Solid State Drive
Download Cache: 2x SAMSUNG 870 QVO 2.5", 1x SAMSUNG 850 EVO 2.5"
PSU: SeaSonic FOCUS Plus Platinum 750 W 80+ Platinum
HBA: LSI 9300-16i SAS in IT Mode
Please let me know if I've missed anything important or if more info is needed. Thank you.
Attached: syslog
Edited December 11, 2023 by Shomesomesho
JorgeB Posted December 11, 2023
If memtest doesn't find anything, try running the server with just one stick of RAM; if it still crashes, try a different stick. That will basically rule out bad RAM or a board issue with all 4 DIMMs loaded.
Shomesomesho Posted December 11, 2023 (edited)
Thank you @JorgeB, I will try that. I realize this isn't an Unraid-specific question, but how long, or for how many passes, would you advise running memtest?
Edit: I'm also seeing these messages in my log file:
Dec 11 01:11:11 Unraid kernel: mpt3sas_cm1: SAS host is non-operational !!!!
Dec 11 01:11:12 Unraid kernel: mpt3sas_cm0 fault info from func: mpt3sas_base_make_ioc_ready
Dec 11 01:11:12 Unraid kernel: mpt3sas_cm0: fault_state(0x2667)!
Dec 11 01:11:12 Unraid kernel: mpt3sas_cm0: sending diag reset !!
Dec 11 01:11:12 Unraid kernel: mpt3sas_cm1 fault info from func: mpt3sas_base_make_ioc_ready
Dec 11 01:11:12 Unraid kernel: mpt3sas_cm1: fault_state(0x2667)!
Dec 11 01:11:12 Unraid kernel: mpt3sas_cm1: sending diag reset !!
And then right after the above messages:
Dec 11 01:13:36 Unraid kernel: Linux version 6.1.49-Unraid (root@Develop-612) (gcc (GCC) 12.2.0, GNU ld version 2.40-slack151) #1 SMP PREEMPT_DYNAMIC Wed Aug 30 09:42:35 PDT 2023
Dec 11 01:13:36 Unraid kernel: Command line: BOOT_IMAGE=/bzimage initrd=/bzroot
Dec 11 01:13:36 Unraid kernel: x86/split lock detection: #AC: crashing the kernel on kernel split_locks and warning on user-space split_locks
Dec 11 01:13:36 Unraid kernel: BIOS-provided physical RAM map:
Was this the reboot/crash point?
Edited December 11, 2023 by Shomesomesho
JorgeB Posted December 11, 2023
1 hour ago, Shomesomesho said: SAS host is non-operational !!!!
This is an HBA problem. Make sure it's well seated and sufficiently cooled; you can also try a different PCIe slot.
1 hour ago, Shomesomesho said: Dec 11 01:13:36 Unraid kernel: x86/split lock detection: #AC: crashing the kernel on kernel split_locks and warning on user-space split_locks Dec 11 01:13:36 Unraid kernel: BIOS-provided physical RAM map:
These should not be a problem.
1 hour ago, Shomesomesho said: but how long/how many passes would you advise memtest run?
Ideally 24 hours, but 2 or 3 passes are usually enough to find issues if the problem is big enough.
Shomesomesho Posted December 11, 2023 (edited)
11 minutes ago, JorgeB said: This is an HBA problem. Make sure it's well seated and sufficiently cooled; you can also try a different PCIe slot.
I'll try moving it to a different slot. The card has a fan mounted directly on the heatsink, but I've not found a way to monitor temps.
11 minutes ago, JorgeB said: These should not be a problem.
Ah, sorry. I meant: are these the lines logged when Unraid first boots, showing that this was the point at which the server rebooted?
11 minutes ago, JorgeB said: Ideally 24 hours, but 2 or 3 passes are usually enough to find issues if the problem is big enough.
Thank you for all your help.
Edited December 11, 2023 by Shomesomesho
JorgeB Posted December 11, 2023
1 hour ago, Shomesomesho said: Dec 11 01:13:36 Unraid kernel: Linux version 6.1.49-Unraid (root@Develop-612) (gcc (GCC) 12.2.0, GNU ld version 2.40-slack151) #1 SMP
Yes, this line means a new boot.
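The check discussed above (find the kernel banner line that marks a new boot, then look at what was logged just before it) can be sketched in Python. This is only an illustration under stated assumptions: the function name `find_reboots`, the two regular expressions, and the 10-line context window are made up for this example and are not part of Unraid or the mpt3sas driver. The sample lines are copied from the log excerpt in this thread.

```python
import re
from collections import deque
from typing import Iterable, List, Tuple

# Assumption: a fresh boot is marked by the kernel banner line, as in the thread.
BOOT_MARKER = re.compile(r"kernel: Linux version ")
# Assumption: HBA trouble shows up as mpt3sas lines like the ones quoted in the thread.
HBA_FAULT = re.compile(r"mpt3sas_cm\d+.*(non-operational|fault_state|diag reset)")


def find_reboots(lines: Iterable[str], context: int = 10) -> List[Tuple[str, List[str]]]:
    """Return (boot_line, hba_fault_lines_just_before_it) for every boot banner."""
    recent: deque = deque(maxlen=context)  # the last `context` lines seen so far
    reboots = []
    for raw in lines:
        line = raw.rstrip("\n")
        if BOOT_MARKER.search(line):
            faults = [prev for prev in recent if HBA_FAULT.search(prev)]
            reboots.append((line, faults))
        recent.append(line)
    return reboots


if __name__ == "__main__":
    sample = [
        "Dec 11 01:11:11 Unraid kernel: mpt3sas_cm1: SAS host is non-operational !!!!",
        "Dec 11 01:11:12 Unraid kernel: mpt3sas_cm0: fault_state(0x2667)!",
        "Dec 11 01:11:12 Unraid kernel: mpt3sas_cm0: sending diag reset !!",
        "Dec 11 01:13:36 Unraid kernel: Linux version 6.1.49-Unraid (root@Develop-612)",
        "Dec 11 01:13:36 Unraid kernel: Command line: BOOT_IMAGE=/bzimage initrd=/bzroot",
    ]
    for boot_line, faults in find_reboots(sample):
        print("Boot detected:", boot_line)
        for fault in faults:
            print("  preceded by:", fault)
```

To run it against a real log, pass an open file instead of the sample list, for example `find_reboots(open("syslog", errors="replace"))`, where `syslog` stands for wherever the mirrored log was saved. A boot banner with no fault lines in front of it simply comes back with an empty list.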
Shomesomesho Posted December 13, 2023
Memtest passed with no errors. I moved the HBA to a different PCIe slot and it's been up for 24 hours so far. If it crashes again, I'll remove 3 sticks of RAM and test each one individually.
Shomesomesho Posted December 15, 2023
3 days, 14 hours of uptime since just moving the HBA out of the PCIe 5.0 slot and into the PCIe 4.0 slot. I don't want to get hopeful too soon, but this seems to have been the problem. Can anyone explain why the 5.0 slot would have an issue with my HBA? Should I have set the 5.0 slot in the BIOS to 3.0 link speed? Should I be setting the 4.0 slot to 3.0 link speed?
JorgeB Posted December 15, 2023
10 minutes ago, Shomesomesho said: Can anyone explain why the 5.0 slot would have an issue with my HBA?
I can't.
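The thread leaves the "why" unanswered, but the link speed each slot actually negotiated can at least be inspected from Linux. The sketch below reads the standard PCI sysfs attributes `current_link_speed`, `max_link_speed`, `current_link_width`, and `max_link_width`. The helper names `read_attr` and `link_status` are made up for this example, and nothing here was run on the poster's system; it only reports values and does not change any BIOS or link setting.

```python
from pathlib import Path
from typing import Dict, Optional

ATTRS = ("current_link_speed", "max_link_speed", "current_link_width", "max_link_width")


def read_attr(dev: Path, name: str) -> Optional[str]:
    """Read one sysfs attribute, returning None if it is missing or unreadable."""
    try:
        return (dev / name).read_text().strip()
    except OSError:
        return None


def link_status(sysfs_root: str = "/sys/bus/pci/devices") -> Dict[str, Dict[str, Optional[str]]]:
    """Map each PCI address that exposes link attributes to its reported values."""
    root = Path(sysfs_root)
    if not root.is_dir():
        return {}
    status = {}
    for dev in sorted(root.iterdir()):
        # Devices without a readable current_link_speed are skipped.
        if read_attr(dev, "current_link_speed") is None:
            continue
        status[dev.name] = {name: read_attr(dev, name) for name in ATTRS}
    return status


if __name__ == "__main__":
    for address, info in link_status().items():
        print(address, info)
```

Comparing the current and maximum values for the HBA's PCI address before and after moving the card would show whether the link trained at a different speed or width in each slot.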
Shomesomesho Posted January 4 (Solution)
On 12/11/2023 at 11:41 AM, JorgeB said: you can also try a different PCIe slot.
The system hasn't had a similar crash since. I'm going to say this was the problem. Thank you for your help!