Unraid random reboots/crashes


Solved by Shomesomesho

Hello everyone, I'm hoping to get some help with my random reboot problem. I built this system in November and I've been struggling to solve this on my own. My server seems to crash or reboot itself after anywhere from 5 to 72 hours of uptime. I have the array set to not start on boot, so I come back to find the server just sitting there, waiting for me to start the array, with notifications about unclean shutdowns. It seems more likely to occur during tasks with a lot of disk activity, like a Parity Sync or running Mover (with many TBs to move).

The server is plugged into an APC UPS with fresh batteries, which report good health, and I have NUT set up with a shutdown method.

 

So far, I’ve done the following:

  • Changed the CPU from an i5 13500 to an i7 13700k (upgrade unrelated to this issue).
  • Removed an old disk with questionable SMART data, without alleviation.
  • Replaced the TIM on my HBA.
  • Reseated the RAM.
  • I'm currently running memtest, but do not have results yet.

 

I mirrored the syslog to flash and will attach it.

 

System Specs:

  • CPU: Intel Core i5-13500, since replaced with an Intel Core i7-13700K.
  • ASRock Z690 Extreme ATX LGA1700 Motherboard
  • G.Skill Ripjaws V 64 GB (4 x 16 GB) DDR4-3200 CL16 Memory
  • Appdata Storage: Western Digital Black SN850X 2 TB M.2-2280 PCIe 4.0 X4 NVMe Solid State Drive
  • Download Cache: 2x SAMSUNG 870 QVO 2.5" and 1x SAMSUNG 850 EVO 2.5"

  • PSU: SeaSonic FOCUS Plus Platinum 750 W 80+ Platinum
  • HBA: LSI 9300-16i SAS in IT Mode.

 

Please let me know if I’ve missed anything important or if more info is needed. Thank you.

              

syslog

Edited by Shomesomesho
Link to comment

Thank you @JorgeB, I will try that. I realize this isn't an Unraid-specific question, but how long (or how many passes) would you advise running memtest?

 

Edit: I'm also seeing these messages in my log file:

Dec 11 01:11:11 Unraid kernel: mpt3sas_cm1: SAS host is non-operational !!!!
Dec 11 01:11:12 Unraid kernel: mpt3sas_cm0 fault info from func: mpt3sas_base_make_ioc_ready
Dec 11 01:11:12 Unraid kernel: mpt3sas_cm0: fault_state(0x2667)!
Dec 11 01:11:12 Unraid kernel: mpt3sas_cm0: sending diag reset !!
Dec 11 01:11:12 Unraid kernel: mpt3sas_cm1 fault info from func: mpt3sas_base_make_ioc_ready
Dec 11 01:11:12 Unraid kernel: mpt3sas_cm1: fault_state(0x2667)!
Dec 11 01:11:12 Unraid kernel: mpt3sas_cm1: sending diag reset !!

 

And then right after the above messages:

Dec 11 01:13:36 Unraid kernel: Linux version 6.1.49-Unraid (root@Develop-612) (gcc (GCC) 12.2.0, GNU ld version 2.40-slack151) #1 SMP PREEMPT_DYNAMIC Wed Aug 30 09:42:35 PDT 2023
Dec 11 01:13:36 Unraid kernel: Command line: BOOT_IMAGE=/bzimage initrd=/bzroot
Dec 11 01:13:36 Unraid kernel: x86/split lock detection: #AC: crashing the kernel on kernel split_locks and warning on user-space split_locks
Dec 11 01:13:36 Unraid kernel: BIOS-provided physical RAM map:

 

Was this the reboot/crash point?
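One way to check this is to scan the mirrored syslog for the kernel banner and print the entry logged just before it. A minimal Python sketch, assuming each boot starts with a `kernel: Linux version` line as in the excerpt above; `find_reboot_points` is a hypothetical helper name, and the sample lines are abbreviated copies of the log entries quoted in this post:

```python
import re

# Assumption: a fresh boot is marked by the kernel banner line
# "kernel: Linux version ...", as seen in the excerpt above.
BOOT_BANNER = re.compile(r"kernel: Linux version ")


def find_reboot_points(lines):
    """Return (last_line_before_boot, boot_banner_line) pairs.

    The first element is None if the banner is the first line seen.
    """
    points = []
    previous = None
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        if BOOT_BANNER.search(line):
            points.append((previous, line))
        previous = line
    return points


# Abbreviated sample taken from the log lines quoted in this post.
sample = [
    "Dec 11 01:11:12 Unraid kernel: mpt3sas_cm1: sending diag reset !!",
    "Dec 11 01:13:36 Unraid kernel: Linux version 6.1.49-Unraid (root@Develop-612)",
    "Dec 11 01:13:36 Unraid kernel: Command line: BOOT_IMAGE=/bzimage initrd=/bzroot",
]

for before, banner in find_reboot_points(sample):
    print("last entry before boot:", before)
    print("boot banner:", banner)
```

To run it against a real file, pass an open file handle instead of `sample`, for example `find_reboot_points(open(path, errors="replace"))`, where `path` points at the mirrored syslog. A gap between the two timestamps with no shutdown messages in between would suggest an unclean reboot at that point.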

Edited by Shomesomesho
Link to comment
1 hour ago, Shomesomesho said:

SAS host is non-operational !!!!

This is a HBA problem, make sure it's well seated and sufficiently cooled, you can also try a different PCIe slot.
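To gauge how often the controller faulted before each crash, the mirrored syslog can be scanned for the mpt3sas messages quoted above. A minimal Python sketch, assuming the messages keep the exact wording shown in the quoted log; `summarize_hba_faults` and both regexes are illustrative names, not part of any Unraid tool:

```python
import re
from collections import Counter

# Matches lines such as "kernel: mpt3sas_cm1: fault_state(0x2667)!"
FAULT = re.compile(r"kernel: (mpt3sas_cm\d+): fault_state\((0x[0-9a-fA-F]+)\)")
# Matches lines such as "kernel: mpt3sas_cm1: SAS host is non-operational !!!!"
NON_OPERATIONAL = re.compile(r"kernel: (mpt3sas_cm\d+): SAS host is non-operational")


def summarize_hba_faults(lines):
    """Count fault_state codes and non-operational events per controller."""
    faults = Counter()            # keyed by (controller, fault code)
    non_operational = Counter()   # keyed by controller
    for line in lines:
        match = FAULT.search(line)
        if match:
            faults[(match.group(1), match.group(2))] += 1
            continue
        match = NON_OPERATIONAL.search(line)
        if match:
            non_operational[match.group(1)] += 1
    return faults, non_operational


# Log lines copied from the post quoted above.
sample = [
    "Dec 11 01:11:11 Unraid kernel: mpt3sas_cm1: SAS host is non-operational !!!!",
    "Dec 11 01:11:12 Unraid kernel: mpt3sas_cm0 fault info from func: mpt3sas_base_make_ioc_ready",
    "Dec 11 01:11:12 Unraid kernel: mpt3sas_cm0: fault_state(0x2667)!",
    "Dec 11 01:11:12 Unraid kernel: mpt3sas_cm0: sending diag reset !!",
    "Dec 11 01:11:12 Unraid kernel: mpt3sas_cm1: fault_state(0x2667)!",
]

faults, non_operational = summarize_hba_faults(sample)
for (controller, code), count in sorted(faults.items()):
    print(f"{controller}: fault_state {code} x{count}")
for controller, count in sorted(non_operational.items()):
    print(f"{controller}: non-operational x{count}")
```

Comparing these counts before and after reseating or moving the card gives a rough indication of whether the change helped.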

 

1 hour ago, Shomesomesho said:

Dec 11 01:13:36 Unraid kernel: x86/split lock detection: #AC: crashing the kernel on kernel split_locks and warning on user-space split_locks
Dec 11 01:13:36 Unraid kernel: BIOS-provided physical RAM map:

These should not be a problem.

 

1 hour ago, Shomesomesho said:

but how long/how many passes would you advise memtest run?

Ideally 24 hours, but 2 or 3 passes are usually enough to find issues if the problem is big enough.

Link to comment
11 minutes ago, JorgeB said:

This is a HBA problem, make sure it's well seated and sufficiently cooled, you can also try a different PCIe slot.

I'll try moving it to a different slot. The card has a fan mounted directly on the heatsink, but I've not found a way to monitor temps.

 

11 minutes ago, JorgeB said:

These should not be a problem.

Ah, sorry. I meant: are these the lines logged when Unraid first boots, indicating that this was the point at which the server rebooted?

 

11 minutes ago, JorgeB said:

Ideally 24 hours, but 2 or 3 passes are usually enough to find issues if the problem is big enough.

Thank you for all your help. :)

Edited by Shomesomesho
Link to comment

3 days, 14 hours of uptime since moving the HBA from the PCIe 5.0 slot to the PCIe 4.0 slot. I don't want to get hopeful too soon, but this seems to have been the problem. Can anyone explain why the 5.0 slot would have an issue with my HBA? Should I have set the 5.0 slot to 3.0 link speed in the BIOS? Should I be setting the 4.0 slot to 3.0 link speed?
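To see what link speed the card actually negotiated in a given slot, Linux exposes `current_link_speed` and `max_link_speed` under each PCI device's sysfs directory. A minimal Python sketch, assuming the kernel provides those attributes for the HBA; the helper names and the device address in the usage note are hypothetical:

```python
import re
from pathlib import Path

# Per-lane transfer rate in GT/s for each PCIe generation.
GT_PER_GEN = {2.5: 1, 5.0: 2, 8.0: 3, 16.0: 4, 32.0: 5}


def pcie_generation(link_speed):
    """Map a string like '8.0 GT/s PCIe' to its PCIe generation, or None."""
    match = re.match(r"\s*(\d+(?:\.\d+)?)\s*GT/s", link_speed)
    if not match:
        return None
    return GT_PER_GEN.get(float(match.group(1)))


def read_link_status(device_dir):
    """Read negotiated and maximum link speed from a PCI device's sysfs dir."""
    device_dir = Path(device_dir)
    current = (device_dir / "current_link_speed").read_text().strip()
    maximum = (device_dir / "max_link_speed").read_text().strip()
    return {
        "current": current,
        "current_gen": pcie_generation(current),
        "max": maximum,
        "max_gen": pcie_generation(maximum),
    }


# Parsing examples using the string format the kernel typically reports.
for text in ("8.0 GT/s PCIe", "16.0 GT/s PCIe", "unknown"):
    print(repr(text), "->", pcie_generation(text))
```

On the server this could be called as `read_link_status("/sys/bus/pci/devices/0000:01:00.0")`, with the address replaced by the HBA's own, which `lspci` lists. If the current generation is lower than the maximum, or differs between slots, that would be worth including in a follow-up post.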

Link to comment