New system locking up on 6.9.1 without logging anything



I recently picked up a Supermicro X8DTH-6 system to upgrade my unRAID server, and I'm running into problems where it locks up within a few days. No events get logged by IPMI, nothing gets sent to my syslog server, and nothing shows up on the console. It's unresponsive over the network or via the console when this happens. I've tested maxing out the CPU cores for a while, and sensors show no problems with heat or the power supply. The last time this happened, I tried doing the Alt-SysRq trick to see if I could get more diagnostic info out of it, but that didn't seem to do anything.
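For anyone trying the same Alt-SysRq trick: it only works if the kernel has magic SysRq enabled, which is controlled by /proc/sys/kernel/sysrq (0 is off, 1 is everything, anything else is a bitmask of allowed functions). Below is a rough sketch for checking and enabling it. The function names and the SYSRQ_FILE variable are made up for illustration, and on a hard hang where the kernel has stopped servicing interrupts, SysRq may not respond regardless of this setting.

```shell
# Path to the kernel's SysRq control file; the override exists only for illustration.
SYSRQ_FILE="${SYSRQ_FILE:-/proc/sys/kernel/sysrq}"

# Print whether magic SysRq is disabled, fully enabled, or partly enabled.
sysrq_status() {
    local value
    value="$(cat "$SYSRQ_FILE")" || return 1
    case "$value" in
        0) echo "disabled" ;;
        1) echo "all functions enabled" ;;
        *) echo "partially enabled (bitmask $value)" ;;
    esac
}

# Enable every SysRq function until the next reboot (needs root on a real system).
sysrq_enable_all() {
    echo 1 > "$SYSRQ_FILE"
}
```

Running sysrq_status before the next hang at least rules out the case where the key combination was being ignored on purpose.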

 

I was hopeful after reading the thread about kernel panics caused by a macvlan bug, but it looked like that only (or mostly?) affected people running docker containers with static IPs on br0. This machine isn't running docker or VMs yet, because there's no license on this flash drive (my licensed copy is running on a temporary server while I try to get this sorted out). The machine is sitting idle, but it's still hanging every couple of days.

 

The machine itself is an X8DTH-6 with dual Xeon E5645 CPUs, 96GB of ECC DDR3 RAM, a built-in LSI 2008-based SAS HBA, and an additional LSI 2008-based HBA in a PCIe slot. Both HBAs are flashed to IT mode with the most recent firmware, and each is connected to a SAS expander (one LSI, one HP) with some random drives attached for testing. There's an Nvidia Quadro P400 and a GeForce 750 Ti in it, and a generic USB3 card. It has two built-in NICs running in active-backup mode (besides the dedicated IPMI NIC). It has an IPMI BMC with remote KVM, and the sensors all look okay.

 

I'm not sure what the next step should be here. I can try rolling back to 6.8, or turning off bridging on the NIC(s), or installing a separate copy of Linux onto a drive and running various burn-in tests. Am I missing something?

tower2-diagnostics-20210405-0845.zip


Interesting; it looks like that is enabled by default on this board, so I can try turning it off. Were you also using an X8DTH board?

 

 

Meanwhile, I did further testing. With the machine sitting idle, the array stopped, the unRAID dashboard page open, and a ping hitting it every second, it failed after about 13 hours. I then pulled all the PCIe cards out of it (no GPUs or extra HBAs) and booted without plugins; that time it failed after about 5 hours. Next I turned off bridging and bonding of the NICs and made no other changes. It ran for 3.5 days, at which point unRAID 6.9.2 was released. I updated to that, put the cards back in, and turned the bridging/bonding settings back on. It's still up after almost 3 days so far.
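The once-a-second ping can be turned into a timestamped log so the moment of the hang is on record. This is only a sketch of that idea, not the exact command used above: the function names, the INTERVAL variable, and the log format are invented here, and the ping flags assume the Linux iputils version (-c for count, -W for timeout in seconds).

```shell
# Seconds between probes; one second matches the test described above.
INTERVAL="${INTERVAL:-1}"

# Send one ping with a 1-second timeout; succeeds only if the host answers.
probe_once() {
    ping -c 1 -W 1 "$1" > /dev/null 2>&1
}

# Probe the host repeatedly and append a timestamped up/down line to the log.
# The third argument caps the number of probes; 0 (the default) runs until interrupted.
watch_host() {
    local host="$1" logfile="$2" max="${3:-0}" n=0 state
    while [ "$max" -eq 0 ] || [ "$n" -lt "$max" ]; do
        if probe_once "$host"; then state=up; else state=down; fi
        printf '%s %s %s\n' "$(date '+%Y-%m-%d %H:%M:%S')" "$host" "$state" >> "$logfile"
        n=$((n + 1))
        sleep "$INTERVAL"
    done
}
```

Run from another machine as something like watch_host tower2 /tmp/tower2-ping.log (hostname and path are placeholders); the first "down" line gives the time of the lockup to within a second or two.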

 

 

So it could still be random and intermittent, but it is tracking with those network features. Seems weird since most reports of that bug dealt with using docker or KVM bridging interfaces, but who knows.

 

I think if it stays up another day or so, I'll try switching services over to the new board and try it out for real.

 

 

  • 3 weeks later...

I'm gonna cry.

 

I just searched the forum for X8DTH -- my motherboard -- intending to see if an SAS expander worked with the onboard controller.

 

I've been fighting instability: kernel panics and random crashes in random kernel submodules. (Tip: run dmesg -w over ssh. You'll see kernel messages as they arrive, right up until it hits the ditch, and since it's ssh the terminal won't go blank when the machine reboots.) I tried everything. I swapped in known good RAM, a known good CPU pair, a known good PSU... In the BIOS I tried limiting RAM speed, dropping the multiplier, disabling cores, disabling virtualization, and on and on.
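The dmesg -w tip can be taken one step further by saving the stream to a file on the machine doing the watching, so the last messages survive even if the terminal gets closed. A minimal sketch, assuming key-based ssh access to the server; the function name, host, and log path are placeholders, and the timestamp is the local clock at the moment each line arrives.

```shell
# Follow the remote kernel log over ssh, prefix each line with the local time,
# and append it to a local file while still showing it on the terminal.
capture_kernel_log() {
    local host="$1" logfile="$2" line
    ssh "$host" dmesg -w | while IFS= read -r line; do
        printf '%s %s\n' "$(date '+%Y-%m-%d %H:%M:%S')" "$line"
    done | tee -a "$logfile"
}
```

For example, capture_kernel_log root@tower /tmp/tower-dmesg.log would keep running until the ssh session drops, and the tail of the file is whatever the kernel managed to say last.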

 

Finally, as a last-ditch effort, I used sysfs to disable C-states from within unRAID, because I didn't think to turn them off in the BIOS.

 

 

for state in /sys/devices/system/cpu/cpu*/cpuidle/state*/disable; do echo 1 > "$state"; done

 

 

It's been stable since. Granted that was only earlier today, but normally by now it would've done something stupid.
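For anyone who wants to see what that one-liner changed, or undo it without rebooting, here is a rough sketch built around the same sysfs files. The function names and the CPU_SYSFS variable are made up for illustration. Values written this way are runtime-only, so they are gone after a reboot unless the command is re-run at startup or the states are turned off in the BIOS.

```shell
# Root of the per-CPU sysfs tree; the override exists only so this can be tried on a copy.
CPU_SYSFS="${CPU_SYSFS:-/sys/devices/system/cpu}"

# Print each idle state's name and disable flag for cpu0 (0 = in use, 1 = disabled).
show_cstates() {
    local dir
    for dir in "$CPU_SYSFS"/cpu0/cpuidle/state*; do
        [ -d "$dir" ] || continue
        printf '%s %s disable=%s\n' "${dir##*/}" "$(cat "$dir/name")" "$(cat "$dir/disable")"
    done
}

# Write 1 (disable) or 0 (re-enable) to every idle state on every CPU (needs root).
set_cstates() {
    local value="$1" f
    for f in "$CPU_SYSFS"/cpu*/cpuidle/state*/disable; do
        [ -e "$f" ] || continue
        echo "$value" > "$f"
    done
}
```

show_cstates before and after set_cstates 1 confirms the flags actually flipped, and set_cstates 0 puts things back for an A/B comparison.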

 

Thank you all for confirming my suspicions.

 

......crap, that means I have to put the faster CPUs back in now...

On 4/28/2021 at 3:50 AM, cros13 said:

netfilter kernel panic

 

On 4/12/2021 at 1:14 PM, cros13 said:

Nobody ever responded to my threads

I'm new here, and I may not be able to help, but if you start a new thread and post a diagnostics.zip with it, I'll at least take a look. We're all volunteers here, and if nobody responds it's generally because there's not enough information or there's plenty of information but nobody has any ideas. I'll comment either way this time, but if I don't see your post, let me know.

Edited by codefaux