Help please! New Supermicro X10SRA-F build - hard crash every couple of days

al_uk · January 4, 2016

Happy new year all.

I built a new server just before Christmas. It has a PCIe USB card and also a quad tuner card passed through to a Windows SageTV VM. There are 6 Windows VMs and 4 Docker containers running, including a Blue Iris CCTV VM. It idles at around 20% CPU and 30% memory usage.

ACS Override is set as "off"

I am getting hard crashes every couple of days where the system simply hangs and needs to be reset. The console is frozen and does not respond to any input. The IPMI still works and I need to do a "reset" to bring the system back. Network connectivity is instantly lost. I can't see any logs.

The most recent time this happened, I had just plugged in a new USB device (a Z-Wave stick) and had gone to the KVM GUI to stop the Windows VM that I wanted to assign the device to. I clicked "Stop" and the system hung at that point.

Previous times it has hung overnight when I was asleep and no-one else was making any changes.

I've included the diagnostics - and also output from lsusb.

I've also attached the IPMI screenshot. It shows "bad/missing sense data". I am not sure whether this happened at the time of the crash, or beforehand.

Any pointers much appreciated. Thank you.

MB - Supermicro X10SRA-F

CPU - Intel Xeon E5-2620 V3

RAM - 64GB. Samsung 16GB DDR4 2133 ECC REG. Model M393A2G40DB0-CPB

PSU - Seasonic 400W fanless

SSD - Cache - 1TB 850 Samsung EVO

SSD - VMs- 1TB 850 Samsung EVO

Storage - 4 x 8TB Seagate ST8000AS0002 HDD

Case - Fractal Design Define R5

Additional USB - Heise HPU-300NC USB3 2P PCIE

TV tuner card - TBS6285

tower-diagnostics-20160104-1814.zip

Frank1940 · January 4, 2016

Since this is a new build, have you run Memtest on your RAM?

I would also suggest that you ask a Moderator to move this thread to the 'General Support (6)' section, as it will probably get seen by a lot more people who could assist. If you have a UPS connected to this server, see what the power draw is when the server is highly loaded. (400 watts might be a bit small...)

al_uk · January 4, 2016

Request to mods made. Thanks.

Yes I ran a default Memtest which took about 4 hours and didn't throw up any errors.

Cheers.

al_uk · January 4, 2016

Power draw at the plug is around 100 watts when it is busy. That was with an extra couple of old 500GB drives in as well.

al_uk · January 6, 2016

Anyone any ideas?

kaboooom2000uk · January 6, 2016

This may not be a solution, but have you ensured that you have updated the BIOS and IPMI Firmware? I had to do this on my X8QB6-F to get the IPMI to work as it had a bug.

al_uk · January 6, 2016

Thanks Kaboom. Yes, I updated the IPMI firmware. The BIOS was already on version 1.00b.

Is there any more detailed logging I should enable in Unraid, which will persist across reboots/crashes?

kaboooom2000uk · January 6, 2016

Well, I would look at dmesg; hopefully it will point to something if it's kernel or hardware related. Failing that, you could try looking in /var/log. I'm pretty new to this myself and also facing some issues of my own.
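As a rough illustration of what "looking at dmesg" can mean in practice, here is a minimal Python sketch that picks out lines worth a closer look from saved kernel log text. The keyword list is my own assumption, not an official set of kernel messages, and the sample lines are copied from the syslog posted later in this thread.

```python
import re

# Assumed, illustrative keyword list; tune it for your own logs.
SUSPECT = re.compile(r"(error|fail|stall|timeout|reset|call trace)", re.IGNORECASE)

def suspect_lines(log_text):
    """Return the lines of a kernel log that contain any suspect keyword."""
    return [line for line in log_text.splitlines() if SUSPECT.search(line)]

# Sample input: four lines taken from the syslog posted later in this thread.
sample = "\n".join([
    "Feb 17 22:23:08 Tower kernel: device vnet5 left promiscuous mode",
    "Feb 17 22:24:51 Tower kernel: usb 5-1: reset full-speed USB device number 4 using xhci_hcd",
    "Feb 17 22:25:56 Tower kernel: INFO: rcu_preempt detected stalls on CPUs/tasks: { 6} (detected by 4, t=60002 jiffies, g=3157173, c=3157172, q=205177)",
    "Feb 17 22:25:56 Tower kernel: Call Trace:",
])

for line in suspect_lines(sample):
    print(line)
```

A match only means the line deserves a look; USB resets, for example, also show up during normal VM start-up with passed-through devices.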

al_uk · January 7, 2016

System has been running fine for 2 days. Just now I went to stop a VM, and the host crashed after a few seconds.

After a reboot, starting and stopping VMs works fine. The problem only appears after the system has been up for some time.

Here is dmesg. It seems to show some issues with one of the SSDs but I'm not sure if that is because of the crash, or causing the crash.

I'm really not sure what to try next...

dmesg.txt

al_uk · January 13, 2016

I replaced the PSU with a Seasonic 520W fanless version and the issue remains.

I can replicate the crash if I leave the system for a few hours, and then reboot my home automation vm, and spin up the drives at around the same time using the spin-up button on the gui. All the VMs have usb devices passed through except the SageTV vm which has the PCIe card passed through.

I think the issue goes away if I take the Tuner card out.

Here are the BIOS settings. Does anyone see any problems, or have any other suggestions? It is very frustrating having an expensive machine sat in bits and not being able to use it.

Thanks for any suggestions.

al_uk · January 13, 2016

more bios pics

al_uk · January 13, 2016

more bios pics

al_uk · January 15, 2016

Anyone any ideas?

I tried with ACS override switched on, but still same issue.

I'm currently trying with the Homeseer VM not passing through any USB devices, as the host only seems to hang when that VM is rebooted. That VM has an RFXtrx device, which is an FTDI FT232 USB-to-serial adapter.

al_uk · January 16, 2016

Just had another lockup when rebooting the VM but with no USB devices attached.

I was running tail on the console and I think I just caught it before the 1st error scrolled off the screen.

al_uk · January 16, 2016

2nd screenshot

al_uk · January 17, 2016

Tried again, with xHCI disabled in the BIOS.

I managed to capture some more messages by recording the IPMI before rebooting the VM that causes the crash. Here are the screenshots.

essjay · January 18, 2016

Is the motherboard dodgy? I'm looking at this very same board for a build with the 2620 v3 and spotted this review, which, along with your issues, has me concerned.

http://www.newegg.com/Product/Product.aspx?Item=N82E16813182959

Where did you get the board from? I'm looking to get one in the UK/Europe.

al_uk · January 18, 2016

That's a worrying review.

MB came from Novatech. I'll contact them to see if I can swap it out.

Everything else came from Scan.

essjay · January 18, 2016

Thanks for the tip on Novatech.

That review does sound a lot like the issues you are having so maybe it's a dodgy board. I'll follow the thread and see how you get on.

essjay · January 21, 2016

Any update? Did you manage to return the board?

I'm still on the fence about this board or the ASRock X99 WS (£260 from Amazon)

al_uk · January 21, 2016

Some good news hopefully:-

I contacted Lime support and scheduled a troubleshooting session with Jonp on Tuesday. John was extremely helpful.

The outcome of the session was a series of changes to test.

The change that made the difference was adding iommu=pt to syslinux.cfg.
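For anyone wanting to make the same change, here is a minimal Python sketch of the edit. It assumes a stock-style syslinux.cfg in which the boot entry has an indented `append initrd=/bzroot` line; that layout and the helper name `add_kernel_param` are my assumptions, so check your own file before relying on it. The sketch only transforms text and does not touch the flash drive.

```python
def add_kernel_param(cfg_text, param="iommu=pt"):
    """Insert `param` right after 'append' on every append line that lacks it.

    Note: this touches every boot entry in the file, not just the default one.
    """
    out = []
    for line in cfg_text.splitlines():
        parts = line.split()
        if parts and parts[0].lower() == "append" and param not in parts[1:]:
            # Preserve the original indentation of the append line.
            indent = line[: len(line) - len(line.lstrip())]
            line = indent + " ".join([parts[0], param] + parts[1:])
        out.append(line)
    return "\n".join(out) + "\n"

# Assumed example of a boot entry; your file may differ.
example_cfg = (
    "label unRAID OS\n"
    "  menu default\n"
    "  kernel /bzimage\n"
    "  append initrd=/bzroot\n"
)

print(add_kernel_param(example_cfg))
```

Running the function twice gives the same result as running it once, so it is safe to re-apply. The new parameter only takes effect after a reboot.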

Since I did that on Tuesday night, the machine has not crashed.

I have put the system back to the original 400W PSU, with 7 HDDs and 3 SSDs, a TV tuner card, a USB PCIe card, and all USB sockets occupied.

I'll report back if the machine does crash. I'll breathe easier after a week!

Incidentally I did install Windows 10 on the bare metal as a test. I only ran it for a couple of hours but it didn't crash. John's view was that the hardware was fine, and that if I replaced the MB, the new one would do exactly the same. I think I would have struggled to return it to the supplier as faulty...

al_uk · February 18, 2016

Just to update: the system crashed after 25 days, and then again a few hours after the reboot. I did have a syslog tail running to the flash drive and captured the following. It seems to indicate a CPU stall. I'm hoping 6.2 fixes it; otherwise I guess I'll have to replace the motherboard with another model.

Feb 17 22:21:59 Tower autofan: Highest disk temp is 27°C, adjusting fan speed from: 56 (21% @ 877rpm) to: OFF (0% @ 877rpm)

Feb 17 22:23:08 Tower kernel: virbr0: port 8(vnet5) entered disabled state

Feb 17 22:23:08 Tower avahi-daemon[21435]: Withdrawing workstation service for vnet5.

Feb 17 22:23:08 Tower kernel: device vnet5 left promiscuous mode

Feb 17 22:23:08 Tower kernel: virbr0: port 8(vnet5) entered disabled state

Feb 17 22:23:08 Tower kernel: pl2303 1-7.4:1.0: pl2303 converter detected

Feb 17 22:23:08 Tower kernel: usb 1-7.4: pl2303 converter now attached to ttyUSB0

Feb 17 22:23:08 Tower kernel: ftdi_sio 1-12:1.0: FTDI USB Serial Device converter detected

Feb 17 22:23:08 Tower kernel: usb 1-12: Detected FT232RL

Feb 17 22:23:08 Tower kernel: usb 1-12: FTDI USB Serial Device converter now attached to ttyUSB1

Feb 17 22:24:49 Tower kernel: device vnet5 entered promiscuous mode

Feb 17 22:24:49 Tower kernel: virbr0: port 8(vnet5) entered listening state

Feb 17 22:24:51 Tower kernel: vfio-pci 0000:00:1b.0: enabling device (0000 -> 0002)

Feb 17 22:24:51 Tower kernel: pl2303 ttyUSB0: pl2303 converter now disconnected from ttyUSB0

Feb 17 22:24:51 Tower kernel: pl2303 1-7.4:1.0: device disconnected

Feb 17 22:24:51 Tower kernel: ftdi_sio ttyUSB1: FTDI USB Serial Device converter now disconnected from ttyUSB1

Feb 17 22:24:51 Tower kernel: ftdi_sio 1-12:1.0: device disconnected

Feb 17 22:24:51 Tower kernel: usb 5-1: reset full-speed USB device number 4 using xhci_hcd

Feb 17 22:24:52 Tower kernel: usb 1-7.4: reset full-speed USB device number 14 using xhci_hcd

Feb 17 22:24:52 Tower kernel: usb 1-12: reset full-speed USB device number 8 using xhci_hcd

Feb 17 22:24:53 Tower kernel: kvm: zapping shadow pages for mmio generation wraparound

Feb 17 22:25:04 Tower kernel: virbr0: port 8(vnet5) entered learning state

Feb 17 22:25:19 Tower kernel: virbr0: topology change detected, propagating

Feb 17 22:25:19 Tower kernel: virbr0: port 8(vnet5) entered forwarding state

Feb 17 22:25:56 Tower kernel: INFO: rcu_preempt detected stalls on CPUs/tasks: { 6} (detected by 4, t=60002 jiffies, g=3157173, c=3157172, q=205177)

Feb 17 22:25:56 Tower kernel: Task dump for CPU 6:

Feb 17 22:25:56 Tower kernel: qemu-system-x86 R running task 0 8376 1 0x00000008

Feb 17 22:25:56 Tower kernel: ffff8810383e6000 ffff880a70763c78 000000018141bcb6 0000000000000000

Feb 17 22:25:56 Tower kernel: 0000000000000050 0000000000000000 0000000000000050 ffff8808c6e44a20

Feb 17 22:25:56 Tower kernel: 0000000000000050 ffff880a70763ca8 ffffffff810380c7 ffff880a70763ca8

Feb 17 22:25:56 Tower kernel: Call Trace:

Feb 17 22:25:56 Tower kernel: [<ffffffff810380c7>] ? setup_msi_irq+0x2b/0x8e

Feb 17 22:25:56 Tower kernel: [<ffffffff8141bdd7>] ? intel_msi_alloc_irq+0xa6/0xb0

Feb 17 22:25:56 Tower kernel: [<ffffffff8141d17a>] ? irq_remapping_setup_msi_irqs+0xd7/0x1e4

Feb 17 22:25:56 Tower kernel: [<ffffffff8100e8c3>] ? arch_setup_msi_irqs+0xa/0xc

Feb 17 22:25:56 Tower kernel: [<ffffffff8138dbec>] ? pci_enable_msi_range+0x1f6/0x294

Feb 17 22:25:56 Tower kernel: [<ffffffff8148604f>] ? vfio_msi_disable+0xb5/0xb5

Feb 17 22:25:56 Tower kernel: [<ffffffff814861a8>] ? vfio_pci_set_msi_trigger+0x159/0x271

Feb 17 22:25:56 Tower kernel: [<ffffffff814866fe>] ? vfio_pci_set_irqs_ioctl+0x92/0x9c

Feb 17 22:25:56 Tower kernel: [<ffffffff81485255>] ? vfio_pci_ioctl+0x397/0x7be

Feb 17 22:25:56 Tower kernel: [<ffffffffa12aeedb>] ? kvm_vm_ioctl+0x33a/0x60b [kvm]

Feb 17 22:25:56 Tower kernel: [<ffffffff81481550>] ? vfio_device_fops_unl_ioctl+0x1e/0x28

Feb 17 22:25:56 Tower kernel: [<ffffffff8110c316>] ? do_vfs_ioctl+0x367/0x421

Feb 17 22:25:56 Tower kernel: [<ffffffff81114033>] ? __fget+0x6c/0x78

Feb 17 22:25:56 Tower kernel: [<ffffffff8110c409>] ? SyS_ioctl+0x39/0x64

Feb 17 22:25:56 Tower kernel: [<ffffffff815f71ee>] ? system_call_fastpath+0x12/0x71

Feb 17 22:27:05 Tower autofan: Highest disk temp is 27°C, adjusting fan speed from: 56 (21% @ 877rpm) to: OFF (0% @ 868rpm)
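To make the stall line above easier to read, here is a small Python sketch that pulls out which CPU stalled and which CPU reported it. The regular expression is fitted only to the single line captured above and is an assumption about the format; other kernel versions may word the message differently.

```python
import re

# Pattern fitted to the one stall line in the log above (assumed format).
STALL = re.compile(
    r"rcu_\w+ detected stalls on CPUs/tasks: \{([^}]*)\} "
    r"\(detected by (\d+), t=(\d+) jiffies"
)

def parse_rcu_stall(line):
    """Return stalled CPUs, reporting CPU and stall time in jiffies, or None."""
    m = STALL.search(line)
    if m is None:
        return None
    return {
        "stalled_cpus": [int(tok) for tok in m.group(1).split() if tok.isdigit()],
        "detected_by": int(m.group(2)),
        "jiffies": int(m.group(3)),
    }

stall_line = (
    "Feb 17 22:25:56 Tower kernel: INFO: rcu_preempt detected stalls on "
    "CPUs/tasks: { 6} (detected by 4, t=60002 jiffies, g=3157173, "
    "c=3157172, q=205177)"
)

print(parse_rcu_stall(stall_line))
```

For the line above this reports CPU 6 as stalled and CPU 4 as the one that noticed. The task dump that follows shows qemu-system-x86 running on CPU 6, with vfio_pci_set_msi_trigger and pci_enable_msi_range in the call trace.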

al_uk · October 9, 2018

This system has been fine since Feb 2016 on version 6.2. But now it has crashed 3 times over the last 2 weeks. The crashing started on v6.5.3, and I had another crash after upgrading to 6.6.1 last week.

Same CPU, Memory, PSU, Motherboard. A few additional drives. The PSU is a Seasonic 400W Platinum SS-400FL2.

The case is different - now one of these. https://www.logic-case.com/products/rackmount-chassis/3u/3u-server-case-w-16x-35-hot-swappable-satasas-drive-bays-minisas-atx-psu-sc-316-atx/

I attach diagnostics from the boot after the crash and a screenshot of the IPMI when it locked up. I don't know if the error shown is related to the lockup or not.

Any suggestions, or should I think about changing the MB and CPU?

Thanks.

tower-diagnostics-20181009-2201.zip

al_uk · October 13, 2018

Hi, anyone got any suggestions? Cheers!

JorgeB · October 13, 2018

Is it stable if you go back to say 6.4.0? You need to find out if something on your hardware went bad or if it is a kernel/microcode issue.
