Help please! New Supermicro X10SRA-F build - hard crash every couple of days


al_uk


Happy new year all.

 

I built a new server just before Christmas. It has a PCIe USB card and a quad tuner card, both passed through to a Windows SageTV VM. There are 6 Windows VMs (including a Blue Iris CCTV VM) and 4 Docker containers running. It idles at around 20% CPU and 30% memory usage.

 

ACS Override is set to "off".

 

I am getting hard crashes every couple of days where the system simply hangs and needs to be reset. The console is frozen and does not respond to any input. The IPMI still works and I need to do a "reset" to bring the system back. Network connectivity is instantly lost. I can't see any logs.

 

The most recent time this happened, I had just plugged in a new USB device (a Z-Wave stick) and had gone to the KVM GUI to stop the Windows VM that I wanted to assign the device to. I clicked "Stop" and the system hung at that point.

 

On previous occasions it has hung overnight, when I was asleep and no one else was making any changes.

 

I've included the diagnostics - and also output from lsusb.

 

I've also attached the IPMI screenshot. It shows "bad/missing sense data". I am not sure whether this happened at the time of the crash, or beforehand.

 

Any pointers much appreciated. Thank you.

 

 

MB - Supermicro X10SRA-F

CPU - Intel Xeon E5-2620 V3

RAM - 64GB. Samsung 16GB DDR4 2133 ECC REG. Model M393A2G40DB0-CPB

PSU - Seasonic 400W fanless

SSD - Cache - 1TB 850 Samsung EVO

SSD - VMs- 1TB 850 Samsung EVO

Storage - 4 x 8TB Seagate ST8000AS0002 HDD

Case - Fractal Design Define R5

Additional USB - Heise HPU-300NC USB3 2P PCIE

TV tuner card  - TBS6285

tower-diagnostics-20160104-1814.zip

Image.jpg.4681c75c56118971f521dd3d544c6cc0.jpg

Link to comment

Since this is a new build, have you run a memtest on your RAM?

 

I would also suggest asking a Moderator to move this thread to the 'General Support (6)' section, as it will probably be seen by a lot more people who could assist. If you have a UPS connected to this server, check what the power draw is when the server is heavily loaded. (400 watts might be a bit small...)

 

Link to comment

System has been running fine for 2 days. Just now I went to stop a VM, and the host crashed after a few seconds.

 

After a reboot, starting and stopping VMs works fine. The problem only appears once the system has been up for some time.

 

Here is dmesg. It seems to show some issues with one of the SSDs but I'm not sure if that is because of the crash, or causing the crash.

 

I'm really not sure what to try next...

 

dmesg.txt

Link to comment

I replaced the PSU with a Seasonic 520W fanless version and the issue remains.

 

I can replicate the crash if I leave the system for a few hours, then reboot my home automation VM and spin up the drives at around the same time using the spin-up button in the GUI. All the VMs have USB devices passed through except the SageTV VM, which has the PCIe card passed through.

 

I think the issue goes away if I take the tuner card out.

 

Here are the BIOS settings. Does anyone see any problems, or have any other suggestions? It is very frustrating having an expensive machine sitting in bits and not being able to use it.

 

Thanks for any suggestions.

 

Image.jpg.31ca8a4d848944249b279450fff43890.jpg

Image_2.jpg.96c85fbca261f24017e38c2de03a25c1.jpg

Image_3.jpg.8f9953b4933fcd9991df4522c4e358b0.jpg

Image_4.jpg.05589c697b3fffae0a3438a7dc7bc0f3.jpg

Link to comment

Anyone any ideas?

 

I tried with ACS override switched on, but still same issue.

 

I'm currently trying with the Homeseer VM not passing through any USB devices, as it only seems to hang when that VM is rebooted. That VM has an RFXtrx device, which is an FTDI FT232 USB-to-serial adapter.

 

 

 

 

 

Link to comment

That's a worrying review.

 

MB came from Novatech. I'll contact them to see if I can swap it out.

Everything else came from Scan.

 

Thanks for the tip on Novatech.

 

That review does sound a lot like the issues you are having so maybe it's a dodgy board. I'll follow the thread and see how you get on.

Link to comment

Some good news hopefully:-

 

I contacted Lime support and scheduled a troubleshooting session with Jonp on Tuesday. John was extremely helpful.

 

The outcome of the session was a series of changes to test.

 

The change that made the difference was adding the iommu=pt kernel parameter to syslinux.cfg.
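For anyone wanting to try the same thing: kernel parameters such as iommu=pt go on the append line of a boot entry in syslinux.cfg. Below is a rough Python sketch of that edit, not the exact change made in the session. The sample file contents, the label name "unRAID OS" and the helper name add_kernel_param are all assumptions for illustration, so check them against your own syslinux.cfg before changing anything.

```python
# Hypothetical sample of a syslinux.cfg; the real file on your flash drive may differ.
SAMPLE = """default /syslinux/menu.c32
label unRAID OS
  menu default
  kernel /bzimage
  append initrd=/bzroot
label unRAID OS Safe Mode (no plugins)
  kernel /bzimage
  append initrd=/bzroot unraidsafemode
"""


def add_kernel_param(cfg_text, param="iommu=pt", label="unRAID OS"):
    """Return cfg_text with `param` added to the append line of one boot entry.

    Only the stanza whose `label` line matches `label` exactly is touched,
    and the parameter is not added twice.
    """
    in_target = False
    out = []
    for line in cfg_text.splitlines():
        words = line.split()
        if words and words[0].lower() == "label":
            # A new stanza starts here; remember whether it is the one we want.
            in_target = " ".join(words[1:]) == label
        elif (in_target and words and words[0].lower() == "append"
              and param not in words[1:]):
            indent = line[: len(line) - len(line.lstrip())]
            line = indent + " ".join(["append", param] + words[1:])
        out.append(line)
    # Preserve the trailing newline if the input had one.
    return "\n".join(out) + ("\n" if cfg_text.endswith("\n") else "")


if __name__ == "__main__":
    print(add_kernel_param(SAMPLE))
```

The function only works on a string, so it can be tried on a copy of the file first. A reboot is needed before a changed kernel parameter takes effect.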

 

Since I did that on Tuesday night, the machine has not crashed.

 

I have put the system back to the original 400W PSU, with 7 HDDs and 3 SSDs, a TV tuner card, a USB PCIe card, and all USB sockets occupied.

 

I'll report back if the machine does crash. I'll breathe easier after a week!

 

Incidentally I did install Windows 10 on the bare metal as a test. I only ran it for a couple of hours but it didn't crash. John's view was that the hardware was fine, and that if I replaced the MB, the new one would do exactly the same. I think I would have struggled to return it to the supplier as faulty...

Link to comment
  • 4 weeks later...

Just to update: the system crashed after 25 days, and then again a few hours after the reboot. I did have a syslog tail running to the flash drive and captured the following. It seems to indicate a CPU stall. I'm hoping 6.2 fixes it; otherwise I guess I'll have to replace the motherboard with another model.

 

Feb 17 22:21:59 Tower autofan: Highest disk temp is 27°C, adjusting fan speed from: 56 (21% @ 877rpm) to: OFF (0% @ 877rpm)
Feb 17 22:23:08 Tower kernel: virbr0: port 8(vnet5) entered disabled state
Feb 17 22:23:08 Tower avahi-daemon[21435]: Withdrawing workstation service for vnet5.
Feb 17 22:23:08 Tower kernel: device vnet5 left promiscuous mode
Feb 17 22:23:08 Tower kernel: virbr0: port 8(vnet5) entered disabled state
Feb 17 22:23:08 Tower kernel: pl2303 1-7.4:1.0: pl2303 converter detected
Feb 17 22:23:08 Tower kernel: usb 1-7.4: pl2303 converter now attached to ttyUSB0
Feb 17 22:23:08 Tower kernel: ftdi_sio 1-12:1.0: FTDI USB Serial Device converter detected
Feb 17 22:23:08 Tower kernel: usb 1-12: Detected FT232RL
Feb 17 22:23:08 Tower kernel: usb 1-12: FTDI USB Serial Device converter now attached to ttyUSB1
Feb 17 22:24:49 Tower kernel: device vnet5 entered promiscuous mode
Feb 17 22:24:49 Tower kernel: virbr0: port 8(vnet5) entered listening state
Feb 17 22:24:49 Tower kernel: virbr0: port 8(vnet5) entered listening state
Feb 17 22:24:51 Tower kernel: vfio-pci 0000:00:1b.0: enabling device (0000 -> 0002)
Feb 17 22:24:51 Tower kernel: pl2303 ttyUSB0: pl2303 converter now disconnected from ttyUSB0
Feb 17 22:24:51 Tower kernel: pl2303 1-7.4:1.0: device disconnected
Feb 17 22:24:51 Tower kernel: ftdi_sio ttyUSB1: FTDI USB Serial Device converter now disconnected from ttyUSB1
Feb 17 22:24:51 Tower kernel: ftdi_sio 1-12:1.0: device disconnected
Feb 17 22:24:51 Tower kernel: usb 5-1: reset full-speed USB device number 4 using xhci_hcd
Feb 17 22:24:52 Tower kernel: usb 1-7.4: reset full-speed USB device number 14 using xhci_hcd
Feb 17 22:24:52 Tower kernel: usb 1-12: reset full-speed USB device number 8 using xhci_hcd
Feb 17 22:24:53 Tower kernel: kvm: zapping shadow pages for mmio generation wraparound
Feb 17 22:25:04 Tower kernel: virbr0: port 8(vnet5) entered learning state
Feb 17 22:25:19 Tower kernel: virbr0: topology change detected, propagating
Feb 17 22:25:19 Tower kernel: virbr0: port 8(vnet5) entered forwarding state
Feb 17 22:25:56 Tower kernel: INFO: rcu_preempt detected stalls on CPUs/tasks: { 6} (detected by 4, t=60002 jiffies, g=3157173, c=3157172, q=205177)
Feb 17 22:25:56 Tower kernel: Task dump for CPU 6:
Feb 17 22:25:56 Tower kernel: qemu-system-x86 R  running task        0  8376      1 0x00000008
Feb 17 22:25:56 Tower kernel: ffff8810383e6000 ffff880a70763c78 000000018141bcb6 0000000000000000
Feb 17 22:25:56 Tower kernel: 0000000000000050 0000000000000000 0000000000000050 ffff8808c6e44a20
Feb 17 22:25:56 Tower kernel: 0000000000000050 ffff880a70763ca8 ffffffff810380c7 ffff880a70763ca8
Feb 17 22:25:56 Tower kernel: Call Trace:
Feb 17 22:25:56 Tower kernel: [<ffffffff810380c7>] ? setup_msi_irq+0x2b/0x8e
Feb 17 22:25:56 Tower kernel: [<ffffffff8141bdd7>] ? intel_msi_alloc_irq+0xa6/0xb0
Feb 17 22:25:56 Tower kernel: [<ffffffff8141d17a>] ? irq_remapping_setup_msi_irqs+0xd7/0x1e4
Feb 17 22:25:56 Tower kernel: [<ffffffff8100e8c3>] ? arch_setup_msi_irqs+0xa/0xc
Feb 17 22:25:56 Tower kernel: [<ffffffff8138dbec>] ? pci_enable_msi_range+0x1f6/0x294
Feb 17 22:25:56 Tower kernel: [<ffffffff8148604f>] ? vfio_msi_disable+0xb5/0xb5
Feb 17 22:25:56 Tower kernel: [<ffffffff814861a8>] ? vfio_pci_set_msi_trigger+0x159/0x271
Feb 17 22:25:56 Tower kernel: [<ffffffff814866fe>] ? vfio_pci_set_irqs_ioctl+0x92/0x9c
Feb 17 22:25:56 Tower kernel: [<ffffffff81485255>] ? vfio_pci_ioctl+0x397/0x7be
Feb 17 22:25:56 Tower kernel: [<ffffffffa12aeedb>] ? kvm_vm_ioctl+0x33a/0x60b [kvm]
Feb 17 22:25:56 Tower kernel: [<ffffffff81481550>] ? vfio_device_fops_unl_ioctl+0x1e/0x28
Feb 17 22:25:56 Tower kernel: [<ffffffff8110c316>] ? do_vfs_ioctl+0x367/0x421
Feb 17 22:25:56 Tower kernel: [<ffffffff81114033>] ? __fget+0x6c/0x78
Feb 17 22:25:56 Tower kernel: [<ffffffff8110c409>] ? SyS_ioctl+0x39/0x64
Feb 17 22:25:56 Tower kernel: [<ffffffff815f71ee>] ? system_call_fastpath+0x12/0x71
Feb 17 22:27:05 Tower autofan: Highest disk temp is 27°C, adjusting fan speed from: 56 (21% @ 877rpm) to: OFF (0% @ 868rpm)
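If the tailed syslog grows large, the stall reports can be pulled out with a small script. This is only a sketch: the regular expression is written against the exact "rcu_preempt detected stalls" line captured above, other kernel versions word this message differently, and the function name find_rcu_stalls is made up for illustration.

```python
import re

# Matches the stall line format seen in the captured log above (an assumption:
# other kernels may print this message in a different layout).
STALL_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+[\d:]+)\s+\S+\s+kernel:\s+INFO:\s+rcu_\w+"
    r" detected stalls on CPUs/tasks: \{(?P<cpus>[^}]*)\}"
    r" \(detected by (?P<by>\d+), t=(?P<jiffies>\d+) jiffies"
)


def find_rcu_stalls(lines):
    """Return one dict per RCU stall report found in an iterable of syslog lines."""
    stalls = []
    for line in lines:
        m = STALL_RE.search(line)
        if m:
            stalls.append({
                "time": m.group("stamp"),
                # Kept as raw tokens, since the braces can list CPUs or tasks.
                "stalled_cpus": m.group("cpus").split(),
                "detected_by": int(m.group("by")),
                "jiffies": int(m.group("jiffies")),
            })
    return stalls


# Two lines copied from the log above, used as sample input.
SAMPLE_LOG = [
    "Feb 17 22:25:19 Tower kernel: virbr0: port 8(vnet5) entered forwarding state",
    "Feb 17 22:25:56 Tower kernel: INFO: rcu_preempt detected stalls on CPUs/tasks: "
    "{ 6} (detected by 4, t=60002 jiffies, g=3157173, c=3157172, q=205177)",
]

if __name__ == "__main__":
    for stall in find_rcu_stalls(SAMPLE_LOG):
        print(stall)
```

To run it against a saved log, pass an open file object instead of SAMPLE_LOG, since the function accepts any iterable of lines.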

 

 

Link to comment
  • 2 years later...

This system has been fine since Feb 2016 on version 6.2. But now it has crashed 3 times over the last 2 weeks. The crashing started on v6.5.3, and I had another crash after upgrading to 6.6.1 last week.

 

Same CPU, Memory, PSU, Motherboard. A few additional drives. The PSU is a Seasonic 400W Platinum SS-400FL2.

 

The case is different - now one of these. https://www.logic-case.com/products/rackmount-chassis/3u/3u-server-case-w-16x-35-hot-swappable-satasas-drive-bays-minisas-atx-psu-sc-316-atx/

 

I attach diagnostics from the boot after the crash and a screenshot of the IPMI when it locked up. I don't know if the error shown is related to the lockup or not.

 

Any suggestions, or should I think about changing the MB and CPU?

 

Thanks.

Screenshot 2018-10-07 21.14.44.png

tower-diagnostics-20181009-2201.zip

Link to comment
