[DEFECT] Random Server Crash

August 23, 2014

Sorry for posting here, but I cannot post in the defect section.

unRAID OS Version: 6b6, still present in 6b7

Description: The server crashes randomly. Only a hard reset brings it back to life.

How to reproduce: Start a KVM VM and wait; the server crashes at a random point. I don't have a step-by-step guide to reproduce it, but if I start the VM the server will crash within 6 h. Without a VM the server lasts for weeks, although it still crashes occasionally.

VM start:

 virsh create /mnt/cache/vm/Debian_VM/debian.xml

debian.xml:

<domain type='kvm'>
  <name>Debian</name>
  <uuid>ba6891bd-32cf-2236-2525-6e12bb490e9e</uuid>

  <memory unit='GB'>1</memory>
  <currentMemory unit='GB'>1</currentMemory>

  <vcpu>2</vcpu>

  <os>
    <type>hvm</type>
    <loader>/usr/share/qemu/bios-256k.bin</loader>
    <boot dev='hd'/>
  </os>

  <features>
    <acpi/>
    <apic eoi='on'/>
    <pae/>
  </features>

  <clock offset='localtime'>
    <timer name='rtc' tickpolicy='catchup'/>
    <timer name='pit' tickpolicy='delay'/>
    <timer name='hpet' present='yes'/>
  </clock>


  <on_poweroff>destroy</on_poweroff>
  <on_reboot>restart</on_reboot>
  <on_crash>restart</on_crash>

  <devices>

    <emulator>/usr/bin/qemu-system-x86_64</emulator>

    <disk type='file' device='cdrom'>
      <driver name='qemu' type='raw'/>
      <source file='/mnt/cache/vm/Debian_VM/debian-7.6.0-amd64-netinst.iso'/>
      <backingStore/>
      <target dev='hdc' bus='ide'/>
      <readonly/>
    </disk>


    <disk type='file' device='disk'>
      <driver name='qemu' type='qcow2' cache='none'/>
      <source file='/mnt/cache/vm/Debian_VM/debian.qcow2'/>
      <backingStore/>
      <target dev='vda' bus='virtio'/>
    </disk>


    <interface type='bridge'>
      <source bridge='br0'/>
      <model type='virtio'/>
    </interface>

    <!--serial type='pty'>
      <source path='/dev/pts/2'/>
      <target port='0'/>
    </serial>

    <console type='pty' tty='/dev/pts/2'>
      <source path='/dev/pts/2'/>
      <target type='serial' port='0'/>
    </console-->


    <graphics type='vnc' port='5901' sharePolicy='allow-exclusive' keymap='de'>
      <listen type='address' address='192.168.178.11'/>
    </graphics>


    <video>
      <model type='cirrus' vram='9216' heads='1'>
        <acceleration accel3d='yes' accel2d='yes'/>
      </model>
    </video>



    <filesystem type='mount' accessmode='passthrough'>
      <source dir='/mnt/cache/'/>
      <target dir='cache'/>
    </filesystem>

  </devices>
</domain>

Other information:

Hardware:

MB: ASUS AM1M-A (90MB0IR0-M0EAY0)

CPU: AMD Athlon 5350, 4x 2.05GHz

RAM: 4GB Corsair

SATA Expansion: Digitus DS-30104-1, PCIe x2

DISKs:

Array Device  Identification                                         Temp.  Size    Used     Free     Reads       Writes  Errors  FS
Parity        WDC_WD30EZRX-00DC0B0_WD-WMC1T1615443 (sde) 2930266532  32 °C  3 TB    -        -        6.494.887   37      0
Disk 1        WDC_WD30EZRX-00DC0B0_WD-WMC1T1982270 (sdb) 2930266532  33 °C  3 TB    2,97 TB  29,9 GB  8.415.129   12      0       reiserfs
Disk 2        SAMSUNG_HD204UI_S2H7JX0B300872 (sdd) 1953514552        31 °C  2 TB    654 GB   1,35 TB  846.733     31      0       reiserfs
Total         (Size/Used/Free excludes Parity)                       32 °C  5 TB    3,62 TB  1,38 TB  15.756.749  80      0

Cache Device  Identification                                         Temp.  Size    Used     Free     Reads       Writes  Errors  FS
Cache         FUJITSU_MHY2200BH_K42WT7B26SSN (sdc) 195360952         31 °C  200 GB  118 GB   81,7 GB  315         79      0       btrfs

syslog:

Aug 23 16:36:25 Tower kernel: cgroup: docker (2184) created nested cgroup for controller "memory" which has incomplete hierarchy support. Nested cgroups may change behavior in the future.
Aug 23 16:36:25 Tower kernel: cgroup: "memory" requires setting use_hierarchy to 1 on the root
Aug 23 16:36:25 Tower kernel: netlink: 1 bytes leftover after parsing attributes in process `docker'.
Aug 23 16:36:25 Tower kernel: device veth8408 entered promiscuous mode
Aug 23 16:36:25 Tower avahi-daemon[1867]: Withdrawing workstation service for veth2fb5.
Aug 23 16:36:25 Tower kernel: docker0: port 1(veth8408) entered forwarding state
Aug 23 16:36:25 Tower kernel: docker0: port 1(veth8408) entered forwarding state
Aug 23 16:36:25 Tower php: ownCloud
Aug 23 16:36:40 Tower kernel: docker0: port 1(veth8408) entered forwarding state
Aug 23 16:36:43 Tower sshd[2247]: Accepted password for root from 192.168.178.51 port 63078 ssh2
Aug 23 16:39:16 Tower sshd[2287]: Accepted password for root from 192.168.178.51 port 63091 ssh2
Aug 23 16:39:44 Tower kernel: device vnet0 entered promiscuous mode
Aug 23 16:39:45 Tower kernel: br0: port 2(vnet0) entered listening state
Aug 23 16:39:45 Tower kernel: br0: port 2(vnet0) entered listening state
Aug 23 16:39:59 Tower kernel: kvm [2304]: vcpu0 ignored rdmsr: 0xc0010001
Aug 23 16:39:59 Tower kernel: kvm [2304]: vcpu0 ignored rdmsr: 0xc0010002
Aug 23 16:39:59 Tower kernel: kvm [2304]: vcpu0 ignored rdmsr: 0xc0010003
Aug 23 16:39:59 Tower kernel: kvm [2304]: vcpu0 unimplemented perfctr wrmsr: 0xc0010004 data 0xabcd
Aug 23 16:40:00 Tower kernel: br0: port 2(vnet0) entered learning state
Aug 23 16:40:15 Tower kernel: br0: topology change detected, propagating
Aug 23 16:40:15 Tower kernel: br0: port 2(vnet0) entered forwarding state
Aug 23 17:04:35 Tower kernel: mdcmd (35): spindown 1
Aug 23 17:33:06 Tower kernel: mdcmd (36): spindown 0
Aug 23 17:33:06 Tower kernel: mdcmd (37): spindown 2

Attached picture of monitor output

I hope this information helps. I am happy to provide more info.

Regards Owel

August 23, 2014

Wow, this is probably one of the most thorough reports I've seen! Thank you! I will have to investigate and attempt to replicate on my end as well. I may have some alternate settings for you to try on your XML file to improve this for you. I will get back to you next week...

August 23, 2014

Author

Thanks for your fast reply.

On my side I will try to get more info. Perhaps I can capture an error in the syslog.

The VM is a major factor, but I have also had the crash when no VM was running...

I will also try to start unRAID in Safe Mode and see if the error happens.

Do you need any more info?

August 25, 2014

Author

Yesterday I had some time, so I investigated further.

Memtest: 4h without any error

Prime: without error

Unraid Safe mode: was running fine, after ~10h, crash. No error visible in syslog.

SMART: did a 'smartctl -t short /dev/sdX' on all drives, everything fine
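A minimal sketch of how the per-drive SMART check could be scripted, assuming the device names from the disk table earlier in the thread (sdb, sdc, sdd, sde). The helper names are hypothetical. It only builds and prints the command lines; `smartctl -t short` starts the self-test and `smartctl -l selftest` reads the self-test log once the test has finished.

```python
# Device names as listed in the disk table of the original report.
DEVICES = ["sdb", "sdc", "sdd", "sde"]


def start_test_commands(devices):
    """Build one 'smartctl -t short' command per drive (starts the self-test)."""
    return [["smartctl", "-t", "short", f"/dev/{dev}"] for dev in devices]


def read_log_commands(devices):
    """Build one 'smartctl -l selftest' command per drive (reads the result log)."""
    return [["smartctl", "-l", "selftest", f"/dev/{dev}"] for dev in devices]


if __name__ == "__main__":
    # Print the commands instead of running them, so the sketch is safe
    # to execute on a machine without smartctl or without these drives.
    for cmd in start_test_commands(DEVICES):
        print(" ".join(cmd))
    # A short self-test takes a few minutes; read the log only afterwards.
    for cmd in read_log_commands(DEVICES):
        print(" ".join(cmd))
```

To actually run the commands, each list can be passed to `subprocess.run` as root on the server.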

Any suggestions which tests I can do?

Owel

August 25, 2014

For 'statistical robustness', you should repeat these tests several more times, and greatly lengthen the memory test. The crash points seem somewhat random, and you only have a few data points, so you don't really know yet the probable run times under different conditions and tests. For example, is the 10 hour runtime for Safe Mode about normal, or does it sometimes crash after only 10 minutes, and other times run over 30 hours?

It may not be a memory issue, but you haven't ruled memory issues out yet. And because this still strongly implicates memory, that has to be ruled out first, before looking at other causes. The fact that running with VM's appears to crash within 6 hours, and running without VM's seems to run indefinitely, does not rule out a memory issue, because memory problems can be very tricky. Running a significantly different configuration exercises the memory differently, and one configuration may happen to be the one that particularly exercises a marginal bit of memory, in just the right way to cause it to fail. I'm sure you can see that testing memory for only 4 hours is not enough, when you are seeing crashes after 10 hours. I would test it at least twice as long as the longest runtime duration. I'm not saying it IS a memory problem, but you need to rule it out first before we can definitively say it isn't.

Once you've essentially proved it isn't memory, then I would try the Prime test again, for the same much longer duration. If you do determine that Safe Mode generally crashes in 8 to 12 hours, then any hardware test (Memtest, Prime, etc) needs to run at least as long, preferably 2 to 3 times as long.
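The rule of thumb above (run each hardware test 2 to 3 times as long as the longest observed time-to-crash) can be sketched as a small calculation. This is only an illustration: the function names are hypothetical, and the sample runtimes are the two rough figures mentioned so far in the thread (about 6 h with a VM, about 10 h in Safe Mode).

```python
from statistics import mean


def summarize(crash_hours):
    """Summarize the observed hours-to-crash collected so far."""
    return {
        "runs": len(crash_hours),
        "shortest": min(crash_hours),
        "longest": max(crash_hours),
        "mean": mean(crash_hours),
    }


def recommended_test_hours(crash_hours, low_factor=2.0, high_factor=3.0):
    """Return (minimum, preferred) test duration in hours.

    Both values are multiples of the longest observed runtime before a crash.
    """
    if not crash_hours:
        raise ValueError("need at least one observed runtime")
    longest = max(crash_hours)
    return (low_factor * longest, high_factor * longest)


if __name__ == "__main__":
    observed = [6.0, 10.0]  # rough runtimes reported in the thread, in hours
    print(summarize(observed))
    print(recommended_test_hours(observed))  # prints (20.0, 30.0)
```

With only two data points the summary says very little, which is the point of repeating the runs: each additional crash time narrows down what a "normal" runtime is.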

Another thing to check is overheating. Running with VM's *may* be running the machine harder. In particular, check the bridge chipsets after running awhile under a good load. Make sure any fans on them are operational. If you can come up with a specific configuration or test with high repeatability of runtime before crash (the shorter the better of course), then try opening the case up and placing a strong house fan just outside, and see if it makes any difference at all. If no difference, then probably not a temp issue.

Once the basics are eliminated, there are other items to check. I have to admit this is the first time I have heard of a Digitus SATA card, and that makes me suspicious of it (perhaps wrongly!). If it is truly well known and well tested, then I apologize. But since it seems new, it may be buggy. I would check for a firmware update for it. For thoroughness, it would be nice if you could try running without it and testing, but I don't know if that is possible for you.

I am trying to help, really, not just make life harder for you!

August 25, 2014

Author

You are not making my life harder, this weird server crash does!!

I can now make the server crash within minutes. But the behavior is still random, and I get no error log!

start server, start owncloud docker, login to OC, sync calendar on android phone, login to tower via ssh, -> server crash within 5 min.

start server, stop array, idle for 5h, start array start OC docker, trying to login -> server crash

no VM running, just the docker. But sometimes, this docker runs for hours.

In real life I'm a computer engineer, doing all day computer stuff, including linux etc. But I have no clue what is causing the error.

The memory ran in a different system for about a year with no problems.

The MB, CPU, and SATA expansion card are new.

The MB has only 2 SATA ports, so I cannot test without the Digitus. I have an old Atom board with 4 SATA ports here, but it cannot do virtualization (will Docker work without VT-x?).

memtest is now running. No error so far.

August 25, 2014

... But I have no clue what is causing the error.

Me neither. And you have a lot of possibilities!

The memory ran in a different system for about a year with no problems.

As I'm sure you know, memory that runs fine in one system may not run well in another, because of differing voltages and other electrical characteristics, and differing timings and other settings. If that weren't true, motherboard manufacturers wouldn't have to provide lists of approved memory for their boards, thereby disallowing otherwise perfectly good memory chips.

August 26, 2014

Do a Memtest with smp enabled. That way the memory subsystem will be stressed by multithreaded access requests. If it passes a full test cycle of that then it is fine.

I've had instances where my server was failing under load even though the memory passed the default Memtest; it wasn't until I ran it with smp enabled that it failed.

Changed the RAM and no problems since. The problem RAM is in my HTPC and works fine there.

August 26, 2014

Author

Did a memtest for about 13h with smp.

14 full cycles without an error.

Is there a kind of test to verify the sata expansion card?

Any more suggestions?

August 26, 2014

You have only 4GB memory, any chance you can increase that?

The screenshot doesn't show anything useful. When it crashes is this what spews out every time? How about at the very beginning - possible to catch that?

August 26, 2014

Author

I have no more RAM available at home right now.

But I wanted to increase the RAM anyway once VMs and Dockers were running fine. So if it helps to find the error, I can buy it now. Why not.

When it crashes, there is either a freeze of the current screen, or the output shown in the screenshot, in which the letters, including the blue ones, change rapidly.

I'm not able to catch what happens right before, but as far as I can see, there is absolutely nothing beforehand that indicates an error is about to occur.

Ran tail -f /var/log/syslog in an ssh session -> no relevant error output. The rest I have already posted.
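A minimal sketch of how a captured syslog could be scanned for typical kernel error signatures instead of being read by eye. The pattern list is an assumption (a handful of strings that commonly appear in Linux kernel error output), not a complete catalogue, and the function name is hypothetical. The sample lines are taken from the syslog excerpt in the original report.

```python
import re

# Assumed, non-exhaustive list of strings that often accompany kernel trouble.
SUSPICIOUS = re.compile(
    r"(panic|Oops|BUG:|Call Trace|Machine Check|mce:|I/O error|segfault)",
    re.IGNORECASE,
)


def suspicious_lines(lines):
    """Return the log lines that match any of the assumed error patterns."""
    return [line.rstrip("\n") for line in lines if SUSPICIOUS.search(line)]


if __name__ == "__main__":
    sample = [
        "Aug 23 16:39:59 Tower kernel: kvm [2304]: vcpu0 ignored rdmsr: 0xc0010001",
        "Aug 23 16:40:15 Tower kernel: br0: port 2(vnet0) entered forwarding state",
        "Aug 23 17:04:35 Tower kernel: mdcmd (35): spindown 1",
    ]
    print(suspicious_lines(sample))  # prints []
```

The same function can be fed an open file, for example `suspicious_lines(open("/var/log/syslog"))`. An empty result on the posted excerpt matches what was reported: the log shows nothing unusual before the crash.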

So do you have any other things to test for me?

August 27, 2014

unRAID OS Version: 6b7

Description: I'm having random crashes too; the difference is that I'm running unRAID in a VM on Arch Linux.

How to reproduce: Boot Arch Linux - start the unRAID VM - start Dockers (couchpotato, nzbdrone, sabnzbd, mediabrowser) - after a random number of hours (anywhere from 2 to 12) I can't access the web GUI or the Docker GUIs... virsh list says UNRAID is running, but after I destroy and restart it I can see that it had been dead for a long time.

Also, the mediabrowser Docker is not running properly; XBMB3C doesn't recognize it anymore. In beta 6 it was running fine.

Other information:

Setup: http://lime-technology.com/forum/index.php?topic=33942.0

Dom0 – Arch Linux

KVM (libvirtd and QEMU)

XBMC with HDMI on i5 4430 HD4600

(Running from USB drive in order to passthrough SATA)

DomU1 – unRAID 6.0 beta7 - Dockers

PCI Passthrough of MB SATA controller

CPU: i5 4430

MB: Gigabyte B85N

GPU: none (Intel HD4600)

RAM: 8GB

HDD: 4TB 3.5" XFS, 1TB 3.5" XFS, 80GB 2.5" BTRFS(from old laptop for dockers)

syslog.txt

September 1, 2014

Author

Hi,

I've added new RAM (8GB) to my server and it has been running for 3 days without a crash.

Including the OC Docker and the Debian VM.

This seems to be odd behavior. I'm not really satisfied, so I'll keep looking for the error.

@TOM: You suggested upgrading my RAM, can you explain why? It seems to be working...

September 8, 2014

Author

A week has passed with no random server crash?!

So I think it was the RAM. I put the "defect" RAM into another machine: no problems...

September 8, 2014

unRAID OS Version: 6b7

Description: I'm having random crashes too; the difference is that I'm running unRAID in a VM on Arch Linux.

What is your host OS / kernel? We may have outpaced our underlying stack. We're running 3.16.0 in beta 8.

EDIT: Marking "solved" as the OP confirmed this to be an issue with bad RAM.
