August 23, 2014

Sorry for posting here, but I cannot post in the defect section.

unRAID OS Version: 6b6, still present in 6b7

Description: The server crashes randomly. Only a hard reset brings it back to life.

How to reproduce: Start a KVM VM and wait; the server crashes at a random time. I don't have a step-by-step guide to reproduce it, but if I start the VM, the server crashes within 6 h. Without a VM the server lasts for weeks, although it still crashes occasionally.

VM start:
virsh create /mnt/cache/vm/Debian_VM/debian.xml

debian.xml:
<domain type='kvm'>
  <name>Debian</name>
  <uuid>ba6891bd-32cf-2236-2525-6e12bb490e9e</uuid>
  <memory unit='GB'>1</memory>
  <currentMemory unit='GB'>1</currentMemory>
  <vcpu>2</vcpu>
  <os>
    <type>hvm</type>
    <loader>/usr/share/qemu/bios-256k.bin</loader>
    <boot dev='hd'/>
  </os>
  <features>
    <acpi/>
    <apic eoi='on'/>
    <pae/>
  </features>
  <clock offset='localtime'>
    <timer name='rtc' tickpolicy='catchup'/>
    <timer name='pit' tickpolicy='delay'/>
    <timer name='hpet' present='yes'/>
  </clock>
  <on_poweroff>destroy</on_poweroff>
  <on_reboot>restart</on_reboot>
  <on_crash>restart</on_crash>
  <devices>
    <emulator>/usr/bin/qemu-system-x86_64</emulator>
    <disk type='file' device='cdrom'>
      <driver name='qemu' type='raw'/>
      <source file='/mnt/cache/vm/Debian_VM/debian-7.6.0-amd64-netinst.iso'/>
      <backingStore/>
      <target dev='hdc' bus='ide'/>
      <readonly/>
    </disk>
    <disk type='file' device='disk'>
      <driver name='qemu' type='qcow2' cache='none'/>
      <source file='/mnt/cache/vm/Debian_VM/debian.qcow2'/>
      <backingStore/>
      <target dev='vda' bus='virtio'/>
    </disk>
    <interface type='bridge'>
      <source bridge='br0'/>
      <model type='virtio'/>
    </interface>
    <!--serial type='pty'>
      <source path='/dev/pts/2'/>
      <target port='0'/>
    </serial>
    <console type='pty' tty='/dev/pts/2'>
      <source path='/dev/pts/2'/>
      <target type='serial' port='0'/>
    </console-->
    <graphics type='vnc' port='5901' sharePolicy='allow-exclusive' keymap='de'>
      <listen type='address' address='192.168.178.11'/>
    </graphics>
    <video>
      <model type='cirrus' vram='9216' heads='1'>
        <acceleration accel3d='yes' accel2d='yes'/>
      </model>
    </video>
    <filesystem type='mount' accessmode='passthrough'>
      <source dir='/mnt/cache/'/>
      <target dir='cache'/>
    </filesystem>
  </devices>
</domain>

Other information:

Hardware:
MB: ASUS AM1M-A (90MB0IR0-M0EAY0)
CPU: AMD Athlon 5350, 4x 2.05GHz
RAM: 4GB Corsair
SATA Expansion: Digitus DS-30104-1, PCIe x2

Disks (array):
Parity: WDC_WD30EZRX-00DC0B0_WD-WMC1T1615443 (sde) 2930266532, temp 32 °C, size 3 TB, used/free -, reads 6.494.887, writes 37, errors 0
Disk 1: WDC_WD30EZRX-00DC0B0_WD-WMC1T1982270 (sdb) 2930266532, temp 33 °C, size 3 TB, used 2,97 TB, free 29,9 GB, reads 8.415.129, writes 12, errors 0, FS reiserfs, /mnt/disk1
Disk 2: SAMSUNG_HD204UI_S2H7JX0B300872 (sdd) 1953514552, temp 31 °C, size 2 TB, used 654 GB, free 1,35 TB, reads 846.733, writes 31, errors 0, FS reiserfs, /mnt/disk2
Total (size/used/free excludes parity): temp 32 °C, size 5 TB, used 3,62 TB, free 1,38 TB, reads 15.756.749, writes 80, errors 0

Disks (cache):
Cache: FUJITSU_MHY2200BH_K42WT7B26SSN (sdc) 195360952, temp 31 °C, size 200 GB, used 118 GB, free 81,7 GB, reads 315, writes 79, errors 0, FS btrfs

syslog:
Aug 23 16:36:25 Tower kernel: cgroup: docker (2184) created nested cgroup for controller "memory" which has incomplete hierarchy support. Nested cgroups may change behavior in the future.
Aug 23 16:36:25 Tower kernel: cgroup: "memory" requires setting use_hierarchy to 1 on the root
Aug 23 16:36:25 Tower kernel: netlink: 1 bytes leftover after parsing attributes in process `docker'.
Aug 23 16:36:25 Tower kernel: device veth8408 entered promiscuous mode
Aug 23 16:36:25 Tower avahi-daemon[1867]: Withdrawing workstation service for veth2fb5.
Aug 23 16:36:25 Tower kernel: docker0: port 1(veth8408) entered forwarding state
Aug 23 16:36:25 Tower kernel: docker0: port 1(veth8408) entered forwarding state
Aug 23 16:36:25 Tower php: ownCloud
Aug 23 16:36:40 Tower kernel: docker0: port 1(veth8408) entered forwarding state
Aug 23 16:36:43 Tower sshd[2247]: Accepted password for root from 192.168.178.51 port 63078 ssh2
Aug 23 16:39:16 Tower sshd[2287]: Accepted password for root from 192.168.178.51 port 63091 ssh2
Aug 23 16:39:44 Tower kernel: device vnet0 entered promiscuous mode
Aug 23 16:39:45 Tower kernel: br0: port 2(vnet0) entered listening state
Aug 23 16:39:45 Tower kernel: br0: port 2(vnet0) entered listening state
Aug 23 16:39:59 Tower kernel: kvm [2304]: vcpu0 ignored rdmsr: 0xc0010001
Aug 23 16:39:59 Tower kernel: kvm [2304]: vcpu0 ignored rdmsr: 0xc0010002
Aug 23 16:39:59 Tower kernel: kvm [2304]: vcpu0 ignored rdmsr: 0xc0010003
Aug 23 16:39:59 Tower kernel: kvm [2304]: vcpu0 unimplemented perfctr wrmsr: 0xc0010004 data 0xabcd
Aug 23 16:40:00 Tower kernel: br0: port 2(vnet0) entered learning state
Aug 23 16:40:15 Tower kernel: br0: topology change detected, propagating
Aug 23 16:40:15 Tower kernel: br0: port 2(vnet0) entered forwarding state
Aug 23 17:04:35 Tower kernel: mdcmd (35): spindown 1
Aug 23 17:33:06 Tower kernel: mdcmd (36): spindown 0
Aug 23 17:33:06 Tower kernel: mdcmd (37): spindown 2

Attached: picture of the monitor output.

I hope this little bit of info helps; I am happy to provide more.

Regards
Owel
August 23, 2014

Wow, this is probably one of the most thorough reports I've seen! Thank you! I will have to investigate and attempt to replicate on my end as well. I may have some alternate settings for you to try in your XML file to improve this for you. I will get back to you next week...
August 23, 2014 Author

Thanks for your fast reply. On my side I will try to get more info. Perhaps I can catch an error in the syslog. The VM is a big factor, but I also had the crash when no VM was running... I will also try to start unRAID in safe mode and see if the error happens. Do you need any more info?
August 25, 2014 Author

Yesterday I had some time, so I investigated further.

Memtest: 4 h without any error
Prime: without error
unRAID Safe Mode: was running fine, then crashed after ~10 h. No error visible in syslog.
SMART: ran 'smartctl -t short /dev/sdX' on all drives, everything fine

Any suggestions which tests I can do?

Owel
August 25, 2014

For 'statistical robustness', you should repeat these tests several more times, and greatly lengthen the memory test. The crash points seem somewhat random, and you only have a few data points, so you don't really know yet the probable run times under different conditions and tests. For example, is the 10 hour runtime for Safe Mode about normal, or does it sometimes crash after only 10 minutes, and other times run over 30 hours?

It may not be a memory issue, but you haven't ruled memory issues out yet. And because this still strongly implicates memory, that has to be ruled out first, before looking at other causes. The fact that running with VMs appears to crash within 6 hours, and running without VMs seems to run indefinitely, does not rule out a memory issue, because memory problems can be very tricky. Running a significantly different configuration exercises the memory differently, and one configuration may happen to be the one that particularly exercises a marginal bit of memory, in just the right way to cause it to fail.

I'm sure you can see that testing memory for only 4 hours is not enough, when you are seeing crashes after 10 hours. I would test it at least twice as long as the longest runtime duration. I'm not saying it IS a memory problem, but you need to rule it out first before we can definitively say it isn't.

Once you've essentially proved it isn't memory, then I would try the Prime test again, for the same much longer duration. If you do determine that Safe Mode generally crashes in 8 to 12 hours, then any hardware test (Memtest, Prime, etc.) needs to run at least as long, preferably 2 to 3 times as long.

Another thing to check is overheating. Running with VMs *may* be running the machine harder. In particular, check the bridge chipsets after running a while under a good load. Make sure any fans on them are operational. If you can come up with a specific configuration or test with high repeatability of runtime before crash (the shorter the better of course), then try opening the case up and placing a strong house fan just outside, and see if it makes any difference at all. If no difference, then it is probably not a temperature issue.

Once the basics are eliminated, there are other items to check. I have to admit this is the first time I have heard of a Digitus SATA card, and that makes me suspicious of it (perhaps wrongly!). If it is truly well known and well tested, then I apologize. But since it seems new, it may be buggy. I would check for a firmware update for it. For thoroughness, it would be nice if you could try running without it and testing, but I don't know if that is possible for you.

I am trying to help, really, not just make life harder for you!
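As a rough illustration of the rule of thumb above (run a hardware test at least twice as long as the longest observed runtime before a crash, preferably 2 to 3 times as long), here is a minimal Python sketch. The function name and the sample runtimes are made up for illustration; only the 2x and 3x factors come from the post.

```python
def recommended_test_hours(observed_runtimes_h, factor=2.0):
    """Return how long a hardware test should run, in hours.

    observed_runtimes_h: uptime in hours before each observed crash.
    factor: multiple of the longest observed runtime
            (the post suggests at least 2, preferably 2 to 3).
    """
    if not observed_runtimes_h:
        raise ValueError("need at least one observed runtime")
    if factor < 1:
        raise ValueError("factor should be at least 1")
    return factor * max(observed_runtimes_h)


# Hypothetical data points loosely based on the thread:
# roughly 6 h with a VM running, roughly 10 h in Safe Mode.
runtimes = [6, 10]
minimum = recommended_test_hours(runtimes, factor=2)
preferred = recommended_test_hours(runtimes, factor=3)
print(f"Run Memtest/Prime for at least {minimum:.0f} h, preferably {preferred:.0f} h")
```

With only two data points the result is a lower bound at best; every further crash you log should be added to the list before deciding how long to test.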
August 25, 2014 Author

You are not making my life harder, this weird server crash does!!

I can now make the server crash within minutes. But it is still random behavior, and I get no error log!

Case 1: start server, start ownCloud docker, log in to OC, sync calendar on Android phone, log in to Tower via SSH -> server crashes within 5 min.

Case 2: start server, stop array, idle for 5 h, start array, start OC docker, try to log in -> server crashes.

No VM running, just the docker. But sometimes this docker runs for hours.

In real life I'm a computer engineer, doing computer stuff all day, including Linux etc. But I have no clue what is causing the error.

The memory ran in a different system for about a year with no problems. New are the MB, CPU and SATA expansion. The MB has only 2 SATA ports, so I cannot test without the Digitus. I have an old Atom board with 4 SATA ports here, but it is not able to do virtualization (will Docker work without VT-x?).

Memtest is now running. No error so far.
August 25, 2014

"... But I have no clue what is causing the error."

Me neither. And you have a lot of possibilities!

"The memory ran in a different system with about a year. no problems."

As I'm sure you know, memory that runs fine in one system may not run well in another, because of differing voltages and other electrical characteristics, and differing timings and other settings. If that weren't true, motherboard manufacturers wouldn't have to provide lists of approved memory for their boards, thereby disallowing otherwise perfectly good memory chips.
August 26, 2014

Do a Memtest with SMP enabled. That way the memory subsystem will be stressed by multithreaded access requests. If it passes a full test cycle of that, then it is fine. I've had instances where my server was failing under load while the memory passed the default Memtest, and it wasn't until I ran it with SMP enabled that it failed. Changed the RAM and no problems since. The problem RAM is now in my HTPC and works fine there.
August 26, 2014 Author

Did a Memtest for about 13 h with SMP: 14 full cycles without an error. Is there some kind of test to verify the SATA expansion card? Any more suggestions?
August 26, 2014

You have only 4GB memory; any chance you can increase that? The screenshot doesn't show anything useful. When it crashes, is this what spews out every time? How about at the very beginning, is it possible to catch that?
August 26, 2014 Author

So far I have no more RAM available at home. I wanted to increase the RAM anyway once the VMs and Dockers were running fine, so if it helps to find the error, I can buy it now. Why not.

When it crashes there is either a freeze of the current screen, or the output shown in the screenshot, in which the white and the blue letters change rapidly. I'm not able to catch what happens right before, but as far as I can see, there is absolutely nothing beforehand that indicates an error is about to occur.

I ran tail -f /var/log/syslog in an SSH session -> no relevant error output. The rest I have already posted.

So do you have any other things for me to test?
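One possible way to keep the last syslog lines across a hard reset is to mirror the log to persistent storage and force each write to disk, instead of only watching `tail -f` in an SSH session. The following is a minimal Python sketch of that idea, not an unRAID feature. The function names and the destination path are assumptions chosen for illustration, and it assumes Python is available on the server.

```python
import os
import time


def copy_new_bytes(src_path, dst_path, offset=0):
    """Append everything written to src_path since `offset` to dst_path.

    The destination is flushed and fsync'ed so the data should reach the
    disk even if the machine locks up right afterwards.
    Returns the new offset to pass to the next call.
    """
    # If the source shrank (e.g. it was rotated), start again from the top.
    if os.path.getsize(src_path) < offset:
        offset = 0
    with open(src_path, "rb") as src:
        src.seek(offset)
        chunk = src.read()
        new_offset = src.tell()
    if chunk:
        with open(dst_path, "ab") as dst:
            dst.write(chunk)
            dst.flush()
            os.fsync(dst.fileno())
    return new_offset


def follow(src_path, dst_path, interval_s=1.0):
    """Poll src_path forever, mirroring new content to dst_path."""
    offset = 0
    while True:
        offset = copy_new_bytes(src_path, dst_path, offset)
        time.sleep(interval_s)


# Example call (both paths are assumptions, adjust to your system):
# follow("/var/log/syslog", "/boot/syslog-mirror.txt")
```

A kernel that hangs instantly may still not get a message into syslog at all, so an empty mirror after a crash would itself be a hint that the failure is at the hardware level.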
August 27, 2014

unRAID OS Version: 6b7

Description: I'm having random crashes too; the difference is that I'm running unRAID in a VM on Arch Linux.

How to reproduce: Boot Arch Linux, start the unRAID VM, start the Dockers (couchpotato, nzbdrone, sabnzbd, mediabrowser). After a random number of hours (anywhere from 2 to 12) I can't access the web GUI or the Docker GUIs. virsh list says UNRAID is running, but after destroying and restarting it I notice it had been dead for a long time. Also, the mediabrowser Docker is not running properly: XBMB3C doesn't recognize it anymore, whereas in beta 6 it was running fine.

Other information:
Setup: http://lime-technology.com/forum/index.php?topic=33942.0
Dom0: Arch Linux KVM (libvirtd and qemu), XBMC with HDMI on i5 4430 HD4600 (running from a USB drive in order to pass through SATA)
DomU1: unRAID 6.0 beta7, Dockers, PCI passthrough of the MB SATA controller
CPU: i5 4430
MB: Gigabyte B85N
GPU: none (Intel HD4600)
RAM: 8 GB
HDD: 4TB 3.5" XFS, 1TB 3.5" XFS, 80GB 2.5" BTRFS (from an old laptop, for Dockers)

Attachment: syslog.txt
September 1, 201411 yr Author Hi, I've added new RAM (8GB) to my server and it is running for 3Days without a Crash. Including OC-Docker and Debian VM. Seems to be a odd behavior. I'm not realy satisfied. I keep looking for the error. @TOM: You suggested to upgrade my RAM, can you explain me why? Seems working...
September 8, 2014 Author

A week has passed with no random server crash?! So I think it was the RAM. I put the "defective" RAM into another machine: no problems...
September 8, 2014

"unRAID OS Version: 6b7 Description: I'm having random crashes too, the difference is that I'm running unRAID in a VM from Arch Linux. ..."

What is your host OS / kernel? We may have outpaced our underlying stack. We're running 3.16.0 in beta 8.

EDIT: Marking "solved" as the OP confirmed this to be an issue with bad RAM.
This topic is now archived and is closed to further replies.