al_uk Posted January 4, 2016
Happy new year all.
I built a new server just before Christmas. It has a PCIe USB card, and also a quad tuner card passed through to a Windows SageTV VM. There are 6 Windows VMs and 4 dockers running, including a Blueiris CCTV VM. It idles at around 20% CPU and 30% memory usage. ACS Override is set to "off".
I am getting hard crashes every couple of days where the system simply hangs and needs to be reset. The console is frozen and does not respond to any input, and network connectivity is lost instantly. The IPMI still works, and I need to do a "reset" from there to bring the system back. I can't see any logs.
The most recent time this happened, I had just plugged in a new USB device (a Z-Wave stick) and had gone to the KVM GUI to stop the Windows VM that I wanted to assign the device to. I clicked "Stop" and the system hung at that point. Previous times it has hung overnight when I was asleep and no one else was making any changes.
I've included the diagnostics and the output from lsusb. I've also attached the IPMI screenshot. It shows "bad/missing sense data". I am not sure whether this happened at the time of the crash or beforehand.
Any pointers much appreciated. Thank you.
MB - Supermicro X10SRA-F
CPU - Intel Xeon E5-2620 V3
RAM - 64GB. Samsung 16GB DDR4 2133 ECC REG. Model M393A2G40DB0-CPB
PSU - Seasonic 400W fanless
SSD - Cache - 1TB 850 Samsung EVO
SSD - VMs - 1TB 850 Samsung EVO
Storage - 4 x 8TB Seagate ST8000AS0002 HDD
Case - Fractal Design Define R5
Additional USB - Heise HPU-300NC USB3 2P PCIE
TV tuner card - TBS6285
Attached: tower-diagnostics-20160104-1814.zip
Frank1940 Posted January 4, 2016
Since this is a new build, have you run Memtest on your RAM? I would also suggest that you ask a Moderator to move this thread to the 'General Support (6)' section, as it will probably be seen by a lot more people who could assist. If you have a UPS connected to this server, see what the power draw is when the server is highly loaded. (400 watts might be a bit small...)
al_uk Posted January 4, 2016
Request to mods made. Thanks. Yes, I ran a default Memtest, which took about 4 hours and didn't throw up any errors. Cheers.
al_uk Posted January 4, 2016
Power draw at the plug is around 100 watts when it is busy. That was with an extra couple of old 500GB drives in as well.
al_uk Posted January 6, 2016
Anyone any ideas?
kaboooom2000uk Posted January 6, 2016
This may not be a solution, but have you ensured that you have updated the BIOS and IPMI firmware? I had to do this on my X8QB6-F to get the IPMI to work, as it had a bug.
al_uk Posted January 6, 2016
Thanks Kaboom. Yes, I updated the IPMI firmware. The BIOS was already on version 1.00b. Is there any more detailed logging I should enable in Unraid which will persist across reboots and crashes?
kaboooom2000uk Posted January 6, 2016
Well, I would look at dmesg; hopefully it will point to something if the problem is kernel or hardware related. Failing that, you could try looking in /var/log. I'm pretty new to this myself and also facing some issues of my own.
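For anyone wading through a long dmesg or syslog dump, a small filter can help narrow down what to read first. This is only a rough sketch: the keyword list and the function name `suspicious_lines` are assumptions made up for illustration, not anything Unraid provides, and the sample lines are taken from the syslog excerpt posted later in this thread.

```python
import re

# Hypothetical keyword list (an assumption); extend it for your own hardware.
SUSPECT = re.compile(r"error|fail|stall|timeout|call trace", re.IGNORECASE)


def suspicious_lines(log_text):
    """Return the lines of a dmesg/syslog dump that match a suspect keyword."""
    return [line for line in log_text.splitlines() if SUSPECT.search(line)]


# Sample lines copied from the syslog excerpt later in the thread.
sample = "\n".join([
    "usb 1-12: FTDI USB Serial Device converter now attached to ttyUSB1",
    "INFO: rcu_preempt detected stalls on CPUs/tasks: { 6}",
    "virbr0: port 8(vnet5) entered forwarding state",
])

for line in suspicious_lines(sample):
    print(line)
```

A keyword match is only a hint about where to start reading; a hit does not prove that line is the cause of a hang.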
al_uk Posted January 7, 2016
The system had been running fine for 2 days. Just now I went to stop a VM, and the host crashed after a few seconds. After a reboot, starting and stopping VMs works fine. The problem only appears after the system has been up for some time.
Here is dmesg. It seems to show some issues with one of the SSDs, but I'm not sure whether that is a result of the crash or the cause of it. I'm really not sure what to try next...
Attached: dmesg.txt
al_uk Posted January 13, 2016
I replaced the PSU with a Seasonic 520W fanless version and the issue remains.
I can replicate the crash if I leave the system for a few hours, then reboot my home automation VM and spin up the drives at around the same time using the spin-up button in the GUI. All the VMs have USB devices passed through except the SageTV VM, which has the PCIe card passed through. I think the issue goes away if I take the tuner card out.
Here are the BIOS settings. Does anyone see any problems, or have any other suggestions? It is very frustrating having an expensive machine sat in bits and not being able to use it. Thanks for any suggestions.
al_uk Posted January 13, 2016
More BIOS pics.
al_uk Posted January 13, 2016
More BIOS pics.
al_uk Posted January 15, 2016
Anyone any ideas? I tried with ACS override switched on, but I still get the same issue. I'm currently trying with the Homeseer VM not passing through any USB devices, as the system only seems to hang when that VM is rebooted. That VM has an RFXTRx device, which is an FTDI FT232 USB-to-serial adapter.
al_uk Posted January 16, 2016
Just had another lockup when rebooting the VM, this time with no USB devices attached. I was running tail on the console, and I think I caught it just before the first error scrolled off the screen.
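Since messages on the console can scroll away or be lost on reset, one option is to keep copying the live log to storage that survives a reboot. Below is a minimal sketch of that idea, not a tested Unraid tool: the helper name `copy_new_lines` is hypothetical, and on a real system the source would presumably be the live syslog and the destination a file on the flash drive (both assumptions). The demo uses temporary files so it runs anywhere.

```python
import os
import tempfile


def copy_new_lines(src_path, dst_path, offset=0):
    """Append everything in src_path past `offset` to dst_path.

    Returns the new offset so the next call only copies what was added since.
    """
    with open(src_path, "rb") as src:
        src.seek(offset)
        chunk = src.read()
        new_offset = src.tell()
    if chunk:
        with open(dst_path, "ab") as dst:
            dst.write(chunk)
            dst.flush()
            # Push the data to the device so it is not lost if the host hangs.
            os.fsync(dst.fileno())
    return new_offset


# Demo: temporary files stand in for the live log and the persistent copy.
with tempfile.TemporaryDirectory() as tmp:
    live = os.path.join(tmp, "syslog")
    saved = os.path.join(tmp, "syslog-copy.txt")
    with open(live, "w") as f:
        f.write("first line\n")
    pos = copy_new_lines(live, saved)
    with open(live, "a") as f:
        f.write("second line\n")
    pos = copy_new_lines(live, saved, pos)
    with open(saved) as f:
        print(f.read(), end="")
```

In practice the function would be called in a loop with a short sleep. The `fsync` call is the important part, since a hard hang gives the kernel no chance to flush buffered writes.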
al_uk Posted January 16, 2016
2nd screenshot.
al_uk Posted January 17, 2016
Tried again, with xHCI disabled in the BIOS. I managed to capture some more messages by recording the IPMI console before rebooting the VM that causes the crash. Here are the screenshots.
essjay Posted January 18, 2016
Is the motherboard dodgy? I'm looking at this very same board for a build with the 2620v3 and spotted this review which, along with your issues, has me concerned: http://www.newegg.com/Product/Product.aspx?Item=N82E16813182959
Where did you get the board from? I'm looking to get one in the UK/Europe.
al_uk Posted January 18, 2016
That's a worrying review. MB came from Novatech. I'll contact them to see if I can swap it out. Everything else came from Scan.
essjay Posted January 18, 2016
Thanks for the tip on Novatech. That review does sound a lot like the issues you are having, so maybe it's a dodgy board. I'll follow the thread and see how you get on.
essjay Posted January 21, 2016
Any update? Did you manage to return the board? I'm still on the fence about this board or the ASRock X99 WS (£260 from Amazon).
al_uk Posted January 21, 2016
Some good news, hopefully. I contacted Lime support and scheduled a troubleshooting session with Jonp on Tuesday. Jon was extremely helpful. The outcome of the session was a series of changes to test. The change that made the difference was adding iommu=pt to syslinux.cfg.
Since I did that on Tuesday night, the machine has not crashed. I have put the system back on the original 400W PSU, with 7 HDDs and 3 SSDs, a TV tuner card, a USB PCIe card, and all USB sockets occupied. I'll report back if the machine does crash. I'll breathe easier after a week!
Incidentally, I did install Windows 10 on the bare metal as a test. I only ran it for a couple of hours, but it didn't crash. Jon's view was that the hardware was fine, and that if I replaced the MB, the new one would do exactly the same. I think I would have struggled to return it to the supplier as faulty...
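For anyone wanting to try the same change, here is a hedged sketch of what "adding iommu=pt to syslinux.cfg" amounts to: the parameter goes on the kernel `append` line. The helper `add_kernel_param` is hypothetical, and the sample config is an assumed, typical boot entry rather than the actual file from this server. Back up the real file before editing it, and note that the change only takes effect after a reboot.

```python
def add_kernel_param(cfg_text, param="iommu=pt"):
    """Add `param` to every `append` line that does not already contain it."""
    out = []
    for line in cfg_text.splitlines():
        words = line.split()
        if words and words[0].lower() == "append" and param not in words:
            # Keep the original indentation and place the parameter
            # directly after the `append` keyword.
            indent = line[: len(line) - len(line.lstrip())]
            line = indent + " ".join([words[0], param] + words[1:])
        out.append(line)
    return "\n".join(out)


# Assumed example of a boot entry; the real syslinux.cfg may differ.
sample_cfg = "\n".join([
    "label unRAID OS",
    "  menu default",
    "  kernel /bzimage",
    "  append initrd=/bzroot",
])

print(add_kernel_param(sample_cfg))
```

The function is idempotent, so running it twice does not add the parameter a second time.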
al_uk Posted February 18, 2016
Just to update, the system crashed after 25 days. And then again a few hours after reboot. I did have a syslog tail running to the flash drive and captured the following. It seems to indicate a CPU stall. I'm hoping 6.2 fixes it, otherwise I guess I'll have to replace the motherboard with another model.
Feb 17 22:21:59 Tower autofan: Highest disk temp is 27°C, adjusting fan speed from: 56 (21% @ 877rpm) to: OFF (0% @ 877rpm)
Feb 17 22:23:08 Tower kernel: virbr0: port 8(vnet5) entered disabled state
Feb 17 22:23:08 Tower avahi-daemon[21435]: Withdrawing workstation service for vnet5.
Feb 17 22:23:08 Tower kernel: device vnet5 left promiscuous mode
Feb 17 22:23:08 Tower kernel: virbr0: port 8(vnet5) entered disabled state
Feb 17 22:23:08 Tower kernel: pl2303 1-7.4:1.0: pl2303 converter detected
Feb 17 22:23:08 Tower kernel: usb 1-7.4: pl2303 converter now attached to ttyUSB0
Feb 17 22:23:08 Tower kernel: ftdi_sio 1-12:1.0: FTDI USB Serial Device converter detected
Feb 17 22:23:08 Tower kernel: usb 1-12: Detected FT232RL
Feb 17 22:23:08 Tower kernel: usb 1-12: FTDI USB Serial Device converter now attached to ttyUSB1
Feb 17 22:24:49 Tower kernel: device vnet5 entered promiscuous mode
Feb 17 22:24:49 Tower kernel: virbr0: port 8(vnet5) entered listening state
Feb 17 22:24:49 Tower kernel: virbr0: port 8(vnet5) entered listening state
Feb 17 22:24:51 Tower kernel: vfio-pci 0000:00:1b.0: enabling device (0000 -> 0002)
Feb 17 22:24:51 Tower kernel: pl2303 ttyUSB0: pl2303 converter now disconnected from ttyUSB0
Feb 17 22:24:51 Tower kernel: pl2303 1-7.4:1.0: device disconnected
Feb 17 22:24:51 Tower kernel: ftdi_sio ttyUSB1: FTDI USB Serial Device converter now disconnected from ttyUSB1
Feb 17 22:24:51 Tower kernel: ftdi_sio 1-12:1.0: device disconnected
Feb 17 22:24:51 Tower kernel: usb 5-1: reset full-speed USB device number 4 using xhci_hcd
Feb 17 22:24:52 Tower kernel: usb 1-7.4: reset full-speed USB device number 14 using xhci_hcd
Feb 17 22:24:52 Tower kernel: usb 1-12: reset full-speed USB device number 8 using xhci_hcd
Feb 17 22:24:53 Tower kernel: kvm: zapping shadow pages for mmio generation wraparound
Feb 17 22:25:04 Tower kernel: virbr0: port 8(vnet5) entered learning state
Feb 17 22:25:19 Tower kernel: virbr0: topology change detected, propagating
Feb 17 22:25:19 Tower kernel: virbr0: port 8(vnet5) entered forwarding state
Feb 17 22:25:56 Tower kernel: INFO: rcu_preempt detected stalls on CPUs/tasks: { 6} (detected by 4, t=60002 jiffies, g=3157173, c=3157172, q=205177)
Feb 17 22:25:56 Tower kernel: Task dump for CPU 6:
Feb 17 22:25:56 Tower kernel: qemu-system-x86 R running task 0 8376 1 0x00000008
Feb 17 22:25:56 Tower kernel: ffff8810383e6000 ffff880a70763c78 000000018141bcb6 0000000000000000
Feb 17 22:25:56 Tower kernel: 0000000000000050 0000000000000000 0000000000000050 ffff8808c6e44a20
Feb 17 22:25:56 Tower kernel: 0000000000000050 ffff880a70763ca8 ffffffff810380c7 ffff880a70763ca8
Feb 17 22:25:56 Tower kernel: Call Trace:
Feb 17 22:25:56 Tower kernel: [<ffffffff810380c7>] ? setup_msi_irq+0x2b/0x8e
Feb 17 22:25:56 Tower kernel: [<ffffffff8141bdd7>] ? intel_msi_alloc_irq+0xa6/0xb0
Feb 17 22:25:56 Tower kernel: [<ffffffff8141d17a>] ? irq_remapping_setup_msi_irqs+0xd7/0x1e4
Feb 17 22:25:56 Tower kernel: [<ffffffff8100e8c3>] ? arch_setup_msi_irqs+0xa/0xc
Feb 17 22:25:56 Tower kernel: [<ffffffff8138dbec>] ? pci_enable_msi_range+0x1f6/0x294
Feb 17 22:25:56 Tower kernel: [<ffffffff8148604f>] ? vfio_msi_disable+0xb5/0xb5
Feb 17 22:25:56 Tower kernel: [<ffffffff814861a8>] ? vfio_pci_set_msi_trigger+0x159/0x271
Feb 17 22:25:56 Tower kernel: [<ffffffff814866fe>] ? vfio_pci_set_irqs_ioctl+0x92/0x9c
Feb 17 22:25:56 Tower kernel: [<ffffffff81485255>] ? vfio_pci_ioctl+0x397/0x7be
Feb 17 22:25:56 Tower kernel: [<ffffffffa12aeedb>] ? kvm_vm_ioctl+0x33a/0x60b [kvm]
Feb 17 22:25:56 Tower kernel: [<ffffffff81481550>] ? vfio_device_fops_unl_ioctl+0x1e/0x28
Feb 17 22:25:56 Tower kernel: [<ffffffff8110c316>] ? do_vfs_ioctl+0x367/0x421
Feb 17 22:25:56 Tower kernel: [<ffffffff81114033>] ? __fget+0x6c/0x78
Feb 17 22:25:56 Tower kernel: [<ffffffff8110c409>] ? SyS_ioctl+0x39/0x64
Feb 17 22:25:56 Tower kernel: [<ffffffff815f71ee>] ? system_call_fastpath+0x12/0x71
Feb 17 22:27:05 Tower autofan: Highest disk temp is 27°C, adjusting fan speed from: 56 (21% @ 877rpm) to: OFF (0% @ 868rpm)
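The key entry in that capture is the rcu_preempt stall report, which names the stalled CPU (6), the CPU that noticed it (4), and how long the stall had lasted in jiffies. As a rough sketch, and assuming the message keeps the format shown in the log above, the fields can be pulled out of a captured syslog like this. The function name `parse_stall` and the returned dictionary keys are made up for illustration.

```python
import re

# Pattern written against the exact message format seen in the log above;
# other kernel versions may word the report differently (an assumption).
STALL = re.compile(
    r"rcu_preempt detected stalls on CPUs/tasks: \{([^}]*)\} "
    r"\(detected by (\d+), t=(\d+) jiffies"
)


def parse_stall(line):
    """Return stall details from one syslog line, or None if it is not a stall report."""
    m = STALL.search(line)
    if m is None:
        return None
    # Keep only plain CPU numbers; any other tokens inside the braces are skipped.
    cpus = [int(tok) for tok in m.group(1).split() if tok.isdigit()]
    return {
        "stalled_cpus": cpus,
        "detected_by": int(m.group(2)),
        "jiffies": int(m.group(3)),
    }


line = (
    "Feb 17 22:25:56 Tower kernel: INFO: rcu_preempt detected stalls on "
    "CPUs/tasks: { 6} (detected by 4, t=60002 jiffies, g=3157173, "
    "c=3157172, q=205177)"
)
print(parse_stall(line))
```

Running this over each saved syslog would show whether the same CPU stalls every time. In the capture above, the task on the stalled CPU is qemu-system-x86 and the call trace passes through the vfio MSI interrupt setup functions.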
al_uk Posted October 9, 2018
This system has been fine since Feb 2016 on version 6.2, but now it has crashed 3 times over the last 2 weeks. The crashing started on v6.5.3, and I had another crash after upgrading to 6.6.1 last week.
Same CPU, memory, PSU and motherboard, plus a few additional drives. The PSU is a Seasonic 400W Platinum SS-400FL2. The case is different, now one of these: https://www.logic-case.com/products/rackmount-chassis/3u/3u-server-case-w-16x-35-hot-swappable-satasas-drive-bays-minisas-atx-psu-sc-316-atx/
I attach diagnostics from the boot after the crash and a screenshot of the IPMI when it locked up. I don't know if the error shown is related to the lockup or not. Any suggestions, or should I think about changing the MB and CPU? Thanks.
Attached: tower-diagnostics-20181009-2201.zip
al_uk Posted October 13, 2018
Hi, anyone got any suggestions? Cheers!
JorgeB Posted October 13, 2018
Is it stable if you go back to, say, 6.4.0? You need to find out whether something in your hardware went bad or whether it is a kernel/microcode issue.