tbone Posted April 5, 2021

I recently picked up a Supermicro X8DTH-6 system to upgrade my unRAID server, and I'm running into problems where it locks up within a few days. No events get logged by IPMI, nothing gets sent to my syslog server, and nothing shows up on the console. When this happens, it's unresponsive both over the network and at the console. I've tested maxing out the CPU cores for a while, and sensors show no problems with heat or the power supply. The last time this happened, I tried the Alt-SysRq trick to see if I could get more diagnostic info out of it, but that didn't seem to do anything.

I was hopeful after reading the thread about kernel panics caused by a bug in macvlan, but it looked like that maybe only (or mostly?) affected people running docker containers with static IPs on br0. This machine isn't running docker or VMs yet, because there's no license on this flash drive (my licensed copy is running on a temporary server while I try to get this sorted out). The machine is obviously idle, but it's still hanging every couple of days.

The machine itself is an X8DTH-6 with dual Xeon E5645 CPUs, 96GB of ECC DDR3 RAM, a built-in LSI 2008-based SAS HBA, and an additional LSI 2008-based HBA in a PCIe slot. They're each flashed to IT mode with the most recent firmware. They're each connected to a SAS expander (one LSI, one HP) with some random drives attached for testing. There's an Nvidia Quadro P400 and a GeForce 750 Ti in it, and a generic USB3 card. It has two built-in NICs running in active-backup mode (besides the dedicated IPMI NIC). It has an IPMI BMC with remote KVM, and the sensors all look okay.

I'm not sure what the next step should be here. I can try rolling back to 6.8, or turning off bridging on the NIC(s), or installing a separate copy of Linux onto a drive and running various burn-in tests. Am I missing something?

tower2-diagnostics-20210405-0845.zip
JorgeB Posted April 5, 2021

One thing you can try is to boot the server in safe mode with all docker/VMs disabled and let it run as a basic NAS for a few days. If it still crashes, it's likely a hardware problem; if it doesn't, start turning on the other services one by one.
Alwezhpy Posted April 12, 2021

I had this problem when I first started using Unraid and someone helped me by suggesting I turn Global C-States in my CMOS from Auto to Disable. Worked like a champ. Good luck.
tbone (Author) Posted April 12, 2021

Interesting; it looks like that is enabled by default on this board, so I can try turning it off. Were you also using an X8DTH board?

Meanwhile, I did further testing. The machine was sitting idle with the array stopped, with the unRAID dashboard page open, while I pinged it every second. It failed after about 13 hours, so I pulled all the PCIe cards out of it (no GPUs or extra HBAs) and booted without plugins. Then it failed after about 5 hours.

Next I turned off bridging and bonding of the NICs and made no other changes. It ran for 3.5 days, at which point unRAID 6.9.2 was released. I updated to that, put the cards back in, and turned the bridging/bonding settings back on. It's still up after almost 3 days so far.

So it could still be random and intermittent, but it is tracking with those network features. Seems weird, since most reports of that bug dealt with using docker or KVM bridging interfaces, but who knows. I think if it stays up another day or so, I'll try switching services over to the new board and try it out for real.
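For anyone repeating the ping-every-second test described in the post above, here is a minimal sketch of a heartbeat logger that records when the machine stops answering, so the time of a silent lockup can be read back afterwards. It is not what tbone actually ran. The function names `probe` and `heartbeat` are hypothetical, and the `ping -c 1 -W 1` flags assume the Linux iputils version of `ping` (where `-W` is a timeout in seconds).

```shell
#!/bin/sh
# probe HOST: succeed if HOST answers a single ping within 1 second.
# Assumes Linux iputils ping; other ping implementations treat -W differently.
probe() {
    ping -c 1 -W 1 "$1" >/dev/null 2>&1
}

# heartbeat HOST LOGFILE: append one line of the form
#   YYYY-MM-DD HH:MM:SS HOST up|down
# to LOGFILE, depending on whether the probe succeeded.
heartbeat() {
    if probe "$1"; then
        state=up
    else
        state=down
    fi
    printf '%s %s %s\n' "$(date '+%Y-%m-%d %H:%M:%S')" "$1" "$state" >> "$2"
}
```

Run from a second machine, something like `while true; do heartbeat tower2 /tmp/heartbeat.log; sleep 1; done` would produce the log (`tower2` and the log path are placeholders). The first `down` line gives an approximate lockup time to compare against IPMI and syslog timestamps.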
cros13 Posted April 12, 2021

I've had the same issue with an X10SRA-F for the last 2 years and had no solution and several rounds of corrupted cache drives due to the kernel panics. Nobody ever responded to my threads so I've never tried the C-States fix Alwezhpy suggested.
bonienl Posted April 12, 2021

I have the Supermicro X10SRA-F motherboard too, rock solid and zero issues. Make sure you use the latest BMC and BIOS versions.
tbone (Author) Posted April 12, 2021

Well, I just turned off the C-States in the BIOS to test it like that for a while. Offhand it looks like the system has the same power consumption when idle as before, so there's no downside there.
codefaux Posted April 28, 2021

I'm gonna cry. I just searched the forum for X8DTH -- my motherboard -- intending to see if an SAS expander worked with the onboard controller.

I've been fighting instability. Kernel panics, random crashes in random kernel submodules (tip: run dmesg -w over ssh; you'll see kernel logs as they tick right until it hits the ditch, and since it's ssh the terminal won't go blank when it reboots), and I tried everything. Swapped to known good RAM, known good CPU pair, known good PSU... I tried limiting RAM speed in the BIOS, dropping multiplier, disabling cores, virtualization, and on and on.

Finally, as a last-ditch effort, I used sysfs to disable C-states from within unRAID, because I didn't think to turn them off in the BIOS:

    for cpus in /sys/devices/system/cpu/cpu*/cpuidle/state*/disable; do echo 1 > "$cpus"; done

It's been stable since. Granted that was only earlier today, but normally by now it would've done something stupid. Thank you all for confirming my suspicions.

......crap, that means I have to put the faster CPUs back in now...
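The one-liner in the post above writes `1` to every cpuidle `disable` file. As a sketch only, the same idea can be wrapped in two small functions that first show the current state and then set it, which also makes it easy to undo by writing `0`. The function names `show_cstates` and `set_cstates` are hypothetical, and the optional root argument exists only so the functions can be tried against a scratch directory. Writes to sysfs need root and do not survive a reboot, so the setting would have to be reapplied at boot.

```shell
#!/bin/sh
# show_cstates [ROOT]: list every cpuidle state under ROOT with its name
# and current "disable" value (1 = disabled, 0 = enabled).
show_cstates() {
    root=${1:-/sys/devices/system/cpu}
    for d in "$root"/cpu[0-9]*/cpuidle/state*; do
        [ -d "$d" ] || continue   # glob matched nothing
        printf '%s name=%s disable=%s\n' \
            "$d" "$(cat "$d/name" 2>/dev/null)" "$(cat "$d/disable")"
    done
}

# set_cstates [ROOT] [VALUE]: write VALUE (default 1, i.e. disable) to the
# "disable" file of every cpuidle state under ROOT, like the one-liner above.
set_cstates() {
    root=${1:-/sys/devices/system/cpu}
    value=${2:-1}
    for f in "$root"/cpu[0-9]*/cpuidle/state*/disable; do
        [ -e "$f" ] || continue   # no cpuidle support, or glob matched nothing
        echo "$value" > "$f"
    done
}
```

Calling `show_cstates` before and after `set_cstates` confirms whether the writes took effect; `set_cstates /sys/devices/system/cpu 0` re-enables the states without a reboot.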
cros13 Posted April 28, 2021

Tried the C-States BIOS setting on the X10SRA-F. System was stable for about 10 days... and then the same netfilter kernel panic I've been getting for the last 18 months.
codefaux Posted April 29, 2021

On 4/28/2021 at 3:50 AM, cros13 said: "netfilter kernel panic"

On 4/12/2021 at 1:14 PM, cros13 said: "Nobody ever responded to my threads"

I'm new here, and I may not be able to help, but if you start a new thread and post a diagnostics.zip with it, I'll at least take a look. We're all volunteers here, and if nobody responds it's generally because there's not enough information or there's plenty of information but nobody has any ideas. I'll comment either way this time, but if I don't see your post, let me know.