Frequent Total Crash?

February 19, 2011

I think I may be losing my mind here. My Unraid server seems to crash randomly, fairly frequently, and completely - i.e. no console control at all, not even responsiveness from the keyboard, which is the most frustrating part! I've run memtest overnight with zero errors, and my drives haven't shown any errors either. And because it's a full-on crash, I can't even recover the syslog file. I'm starting to think it may be motherboard related, but I don't know how to check for sure.

Anyone have any insight?

February 19, 2011

See this post: http://lime-technology.com/forum/index.php?topic=9880.0

Capture your syslog in a telnet window with the tail command. Then we can see what happens just before it crashes.
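Since a telnet window only keeps what is still on screen, it may also help to save copies of the log somewhere that survives a hard reset. This is only a sketch: the helper name save_syslog_tail and the destination file name are made up for illustration, and it assumes the flash drive is mounted at /boot.

```shell
# save_syslog_tail [SRC] [DEST] [LINES]
# Copy the last LINES lines of SRC to DEST, then flush the write to disk.
save_syslog_tail() {
    src="${1:-/var/log/syslog}"
    dest="${2:-/boot/syslog-last.txt}"   # assumed location on the flash drive
    lines="${3:-200}"
    tail -n "$lines" "$src" > "$dest" && sync
}

# Example (not run here): refresh the copy every 30 seconds until the crash.
# while true; do save_syslog_tail; sleep 30; done
```

Writing to the flash drive this often adds wear, so this is something to run only while chasing a crash.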

February 19, 2011

Author

Okay, so here's a crash that just happened. I'm using Unraid 4.7 and had just set the time on the machine. I then went to check out the cache drive and realized that the mover script hadn't run overnight as it was supposed to, so I initiated a manual move through the standard WebGUI. During the move, the machine became unresponsive, so I tried to go to the main page of the WebGUI to see if there was actually any movement happening, and the server crashed.

Tower login: root

Linux 2.6.32.9-unRAID.

root@Tower:~# tail -f /var/log/syslog

Feb 19 06:45:02 Tower emhttp: shcmd (36): /usr/sbin/hdparm -S0 /dev/sdb >/dev/null

Feb 19 06:45:02 Tower kernel: mdcmd (16): spinup 0

Feb 19 06:45:02 Tower kernel: mdcmd (17): spinup 1

Feb 19 06:45:02 Tower kernel: mdcmd (18): spinup 2

Feb 19 07:45:06 Tower kernel: mdcmd (19): spindown 0

Feb 19 07:45:07 Tower kernel: mdcmd (20): spindown 1

Feb 19 07:45:08 Tower kernel: mdcmd (21): spindown 2

Feb 19 15:00:03 Tower in.telnetd[2036]: connect from 192.168.1.100 (192.168.1.100)

Feb 19 15:00:05 Tower login[2037]: invalid password for `UNKNOWN' on `pts/0' from `192.168.1.100'

Feb 19 15:00:07 Tower login[2037]: ROOT LOGIN on `pts/0' from `192.168.1.100'

Feb 19 15:00:59 Tower emhttp: Spinning up all drives...

Feb 19 15:00:59 Tower emhttp: shcmd (37): /usr/sbin/hdparm -S0 /dev/sdb >/dev/null

Feb 19 15:00:59 Tower kernel: mdcmd (22): spinup 0

Feb 19 15:00:59 Tower kernel: mdcmd (23): spinup 1

Feb 19 15:00:59 Tower kernel: mdcmd (24): spinup 2

Feb 19 15:01:49 Tower emhttp: shcmd (38): ln -sf /usr/share/zoneinfo/America/New_York /etc/localtime-copied-from

Feb 19 15:01:49 Tower emhttp: shcmd (39): cp /etc/localtime-copied-from /etc/localtime

Feb 19 10:01:49 Tower emhttp: shcmd (40): /etc/rc.d/rc.ntpd restart 2>&1 | logger

Feb 19 10:01:50 Tower ntpd[2075]: ntpd [email protected] Wed Jan 14 23:46:25 UTC 2009 (1)

Feb 19 10:01:50 Tower logger: Stopping NTP daemon...Starting NTP daemon: /usr/sbin/ntpd -g

Feb 19 10:01:50 Tower ntpd[2076]: precision = 1.000 usec

Feb 19 10:01:50 Tower ntpd[2076]: ntp_io: estimated max descriptors: 1024, initial socket boundary: 16

Feb 19 10:01:50 Tower ntpd[2076]: Listening on interface #0 wildcard, 0.0.0.0#123 Disabled

Feb 19 10:01:50 Tower ntpd[2076]: Listening on interface #1 lo, 127.0.0.1#123 Enabled

Feb 19 10:01:50 Tower ntpd[2076]: Listening on interface #2 eth0, 192.168.1.107#123 Enabled

Feb 19 10:01:50 Tower ntpd[2076]: kernel time sync status 0040

Feb 19 10:01:28 Tower ntpd[2076]: synchronized to 216.234.161.11, stratum 2

Feb 19 10:01:28 Tower ntpd[2076]: time reset -31.759717 s

Feb 19 10:01:28 Tower emhttp: shcmd (41): /usr/sbin/hdparm -y /dev/sdb >/dev/null

Feb 19 10:01:32 Tower emhttp: shcmd (42): /usr/local/sbin/set_ncq sda 1 >/dev/null

Feb 19 10:01:32 Tower emhttp: shcmd (43): /usr/local/sbin/set_ncq sdd 1 >/dev/null

Feb 19 10:01:32 Tower emhttp: shcmd (44): /usr/local/sbin/set_ncq sdc 1 >/dev/null

Feb 19 10:01:32 Tower emhttp: shcmd (45): /usr/local/sbin/set_ncq sdb 1 >/dev/null

Feb 19 10:01:32 Tower kernel: mdcmd (25): set md_num_stripes 1280

Feb 19 10:01:32 Tower kernel: mdcmd (26): set md_write_limit 768

Feb 19 10:01:32 Tower kernel: mdcmd (27): set md_sync_window 288

Feb 19 10:01:32 Tower kernel: mdcmd (28): set spinup_group 0 0

Feb 19 10:01:32 Tower kernel: mdcmd (29): set spinup_group 1 0

Feb 19 10:01:32 Tower kernel: mdcmd (30): set spinup_group 2 0

Feb 19 10:05:47 Tower ntpd[2076]: synchronized to 216.234.161.11, stratum 2

Feb 19 10:10:36 Tower emhttp: shcmd (46): /usr/local/sbin/mover 2>&1 | logger &

Feb 19 10:10:36 Tower logger: mover started

Feb 19 10:10:36 Tower logger: ./MOVIES/Movies to Move/Up/Up.mkv

Feb 19 10:10:36 Tower logger: .d..t...... ./

Feb 19 10:10:36 Tower logger: .d..t...... MOVIES/

Feb 19 10:10:36 Tower logger: .d..t...... MOVIES/Movies to Move/

Feb 19 10:10:36 Tower logger: cd+++++++++ MOVIES/Movies to Move/Up/

Feb 19 10:10:36 Tower logger: >f+++++++++ MOVIES/Movies to Move/Up/Up.mkv

Connection closed by foreign host.

February 20, 2011

Author

Okay, so I just had another crash, and this time the logger caught it. Looks like I ran out of memory? My machine has 2 GB minus 128 MB for the onboard video.

When this crash occurred, I had opened a telnet session to move a folder from one location on the cache drive to a second location on the cache drive - it was a video file, so it was on the larger side (if that makes a difference).

I've attached the syslog in .zip format.

syslog.zip
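For anyone checking a captured syslog for the same symptom, a rough sketch of how the out-of-memory lines could be pulled out. The function name find_oom_lines is hypothetical, and the two search phrases are an assumption about how the kernel words these messages; the exact text varies by kernel version.

```shell
# find_oom_lines FILE
# Print syslog lines that mention the kernel's out-of-memory handling.
# The patterns are assumptions; adjust them to match your kernel's wording.
find_oom_lines() {
    grep -i -E 'out of memory|oom-killer' "$1"
}

# Example (not run here):
# find_oom_lines /var/log/syslog
```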

February 20, 2011

Author

And yet another crash while we had a movie streaming from the server...

This is getting to be quite frustrating. Last weekend I did a fresh install of Unraid on the flash drive, I ran memtest overnight, I checked all the cables (sata, network, etc.), moved off of the switch and onto the router, and still, I'm getting crashes. What gives?!

February 20, 2011

Did you disable any and all addons first? You could have something that's writing to the in-memory ram-based filesystem. That could easily explain your out of memory conditions.

Please post the contents of your "GO" file for verification.
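One way to look for something filling the RAM-based filesystem, offered as a sketch rather than a definitive procedure. The helper name biggest_dirs is made up, and it assumes the root filesystem (/) is the one held in memory.

```shell
# biggest_dirs [DIR] [COUNT]
# List the COUNT largest directories under DIR (sizes in KiB). The -x flag
# keeps du on DIR's own filesystem, so mounted array disks are not counted.
biggest_dirs() {
    du -xk "${1:-/}" 2>/dev/null | sort -rn | head -n "${2:-10}"
}

# Example (not run here): overall memory, root filesystem usage, then the
# directories that take up the most space on it.
# free -m
# df -h /
# biggest_dirs / 15
```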

February 20, 2011

you are running out of memory constantly

which is crashing the server

can you tell us which add-ons you are using?

is any process in top running high?

EDIT: typing at same time as BRiT

February 20, 2011

Author

Re: Addons:

UnMenu (invoked by me)

Clean Powerdown (UnMenu package)

SABNzbd

CouchPotato

Sickbeard

Here's my Go Script:

#!/bin/bash
# Start the Management Utility
/usr/local/sbin/emhttp &

# determine if cache drive online, retry upto 7 times
for i in 0 1 2 3 4 5 6 7
do
    if [ ! -d /mnt/cache ]
    then
        sleep 15
    fi
done

# If Cache drive is online, start SABnzbd, Sickbeard, and CouchPotato
if [ -d /mnt/cache ]; then
    cd /mnt/cache/.custom
    installpkg /boot/packages/SABnzbdDependencies-2.1-i486-unRAID.tgz
    python /mnt/cache/.custom/sabnzbd/SABnzbd.py -d
    python /mnt/cache/.custom/sickbeard/SickBeard.py --daemon
    python /mnt/cache/.custom/couchpotato/CouchPotato.py -d
fi

cd /boot/packages && find . -name '*.auto_install' -type f -print | sort | xargs -n1 sh -c
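A side note on the wait loop in the script above: as posted it makes eight passes (0 through 7) and keeps looping after /mnt/cache appears, which is harmless because it only sleeps while the directory is missing. A possible variant that stops as soon as the directory exists is sketched below; wait_for_dir is a hypothetical helper, not part of Unraid.

```shell
# wait_for_dir DIR [TRIES] [DELAY]
# Return 0 as soon as DIR exists; return 1 if it is still missing after
# TRIES checks spaced DELAY seconds apart.
wait_for_dir() {
    dir="$1"; tries="${2:-8}"; delay="${3:-15}"
    i=0
    while [ "$i" -lt "$tries" ]; do
        [ -d "$dir" ] && return 0
        i=$((i + 1))
        sleep "$delay"
    done
    [ -d "$dir" ]   # one last check after the final sleep
}

# Example (not run here):
# if wait_for_dir /mnt/cache; then echo "cache online"; fi
```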

February 20, 2011

Author

In the interest of providing as much information as possible, here's a "Fresh Syslog" taken right after I restarted the machine this morning (I canceled parity check).

Syslog_Fresh.zip

February 21, 2011

Author

Would adding more RAM remedy my issues, or is there a chance I could still run out of memory?

February 21, 2011

If a process is allocating memory, but not freeing it, it will run out of memory eventually anyways. It will just take longer.

Joe L.
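To see whether one process keeps growing, its resident memory can be sampled over time. This is a minimal sketch that assumes a Linux /proc filesystem; rss_kb and log_rss are made-up helper names.

```shell
# rss_kb PID
# Print the resident memory (VmRSS, in kB) of one process, read from /proc.
rss_kb() {
    awk '/^VmRSS:/ { print $2 }' "/proc/$1/status"
}

# log_rss PID [SECONDS] [SAMPLES]
# Print one timestamped sample per interval while the process is alive.
log_rss() {
    pid="$1"; interval="${2:-60}"; samples="${3:-10}"
    n=0
    while [ "$n" -lt "$samples" ] && [ -d "/proc/$pid" ]; do
        echo "$(date '+%b %d %H:%M:%S') pid=$pid rss_kb=$(rss_kb "$pid")"
        n=$((n + 1))
        sleep "$interval"
    done
}

# Example (not run here): sample a process once a minute for an hour.
# log_rss 1234 60 60
```

A value that climbs steadily and never comes back down would point at that process as the one allocating without freeing.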

February 21, 2011

Author

So any advice, tips, troubleshoot, or am I doomed to have to continue hard resetting this thing twice a day?

EDIT - okay, so I did some searching, should I just try a completely fresh install for a while?

February 21, 2011

Your best bet is to comment out all but the emhttp line in your "go" script, do NOT run the add-ons, and see if you still crash. If you don't, you can re-enable them one by one to determine which one is eating up your memory.

Joe L.
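Commenting out the add-on lines could be done by hand, or roughly like the sketch below. The helper name disable_addons and the "#OFF# " marker are made up, and /boot/config/go as the location of the go script is an assumption not stated in this thread.

```shell
# disable_addons GO_FILE
# Print GO_FILE with every line commented out except the shebang and other
# comment lines, blank lines, and the line that starts emhttp.
disable_addons() {
    sed -e '/^#/b' -e '/^[[:space:]]*$/b' -e '/emhttp/b' -e 's/^/#OFF# /' "$1"
}

# Example (not run here): write the result to a new file and review it
# before replacing the real script.
# disable_addons /boot/config/go > /boot/config/go.noaddons
```

Removing the "#OFF# " prefix from one group of lines at a time then re-enables the add-ons one by one.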

February 21, 2011

Author

Sorry, got impatient and went ahead with a clean format on the flash drive. Got the server up and running without any add-ons whatsoever, got my shares set up, and then ran TOP. While I had TOP running, I went through the web gui to manually move the files from my cache drive to my array, and, naturally, crashed. Here's a screenshot of the TOP when I crashed.

How is it possible that the Move function can cause me to crash so badly? The files are large (avg of 4GB) but I thought the system would be able to manage. Does this mean I should get rid of the cache drive altogether?

February 21, 2011

i have been thinking about your case for some time now ... but i have no clue what is happening to your machine

i move constantly 8gb + files from my cache to my array

never have problems

my top is much busier than yours... when rsyncing it sometimes goes up to 7

only thing i can think of is that you are somehow moving stuff to your memory and not to your disks

but i have no clue how that could happen...

so i will have to leave you in the hands of more experienced people on the forum

maybe for good order list your setup components ...

maybe there is a known problem with your hardware ...

February 21, 2011

I like how you are now trying to isolate the cause of the crash by disabling add-ons. It is a very smart way to go about it.

The only time I've seen a consistent crash when accessing a specific disk (either for reading or writing) is when the file-system on it had some corruption.

You can easily un-assign the cache and eliminate it as a possibility. It will let you see if the remaining components in your server can run without crashing.

You can also perform a file-system-check of each of the file-systems on your disks (all but the parity disk, as it has no file-system). To do that you would need to follow the steps outlined here in the wiki:

http://lime-technology.com/wiki/index.php?title=Check_Disk_Filesystems

basically, after stopping samba and un-mounting the disk being checked you would type

reiserfsck --check /dev/md1

and then

reiserfsck --check /dev/md2

etc.

Joe L.
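For a server with several data disks, the list of check commands can be generated rather than typed. The sketch below only prints the commands and does not run them; print_fsck_commands is a hypothetical helper. The samba stop and un-mount steps from the wiki still apply, and the cache drive is not an md device, so it is not covered here.

```shell
# print_fsck_commands N
# Print (but do not run) one read-only reiserfsck command per data disk,
# /dev/md1 through /dev/mdN.
print_fsck_commands() {
    n=1
    while [ "$n" -le "$1" ]; do
        echo "reiserfsck --check /dev/md$n"
        n=$((n + 1))
    done
}

# Example (not run here): a server with two data disks.
# print_fsck_commands 2
```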

February 21, 2011

Author

I'll try your suggestion Joe. I'm wondering if the cache drive is just bad.

As for hardware:

Motherboard: Gigabyte MA790GPT-UD3H (HPA disabled)

CPU: AMD Athlon II x4 630 (not overclocked)

RAM: G Skill 2GB (don't remember model type...I've run memtest, and it's good though)

PSU: Corsair HX750

Case: Norco 4020

Flash: 8GB SanDisk Cruzer

Parity: 2TB WD20EADS

Drive 1: 2TB WD20EADS

Drive 2: 2TB WD20EARS

Cache: 300GB Seagate ST332

February 23, 2011

Author

All disks checked out fine. Ran a SMART report on them as well, and they all checked out.

February 24, 2011

Author

So figure this one out, completely bare install of Unraid 4.7 that was not connected to the network at all - i.e. no patch cable connected - set the console to pre-clear the cache drive (since SMART came back clean) and it froze/crashed/hung/became totally unresponsive.

This HAS to be a hardware issue, right!? But WHICH hardware do I start with? I don't have extra RAM or an extra motherboard lying around. Arrrgghhhh!

February 24, 2011

Start with memory. If it is not set up correctly, with the correct voltage, clock speed, and timing, all bets are off. Some BIOSes get it right, many do not, especially with premium RAM.

After checking that the memory voltage, timing, and clock speed are set correctly in the BIOS, run a memory test overnight.

March 5, 2011

Author

I think I may have found the culprit, but I'm unsure of how to fix it. I had kept all devices except my main computer from accessing the Server and all was good. After a few days of running without any problems, I thought I would try to watch a movie using my WDLX-TV Live (using brad's firmware upgrade). Well, that only seems to be able to read shares as NFS. So, I set up the NFS shares with *(rw) and then the crashes began again.

I'm guessing something from the WDLX-TV Live is writing or "communicating" with the Server and causing these crashes. I tried to set up the shares as read only - I assumed the correct command was *(r), but that didn't work, and neither did *(w). So now I'm a bit stumped; the WDLX-TV Live is my main media player and primary reason why I'm using the Unraid server, so I'm looking for any advice anyone may have.

For reference, I haven't reconnected my cache drive, and am running no other add-ons.
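On the read-only attempt: in standard Linux NFS export syntax the read-only option is spelled ro and read-write is rw, so neither r nor w is a recognized option. Whether Unraid 4.7 passes the share's export string through unchanged is an assumption here, and nfs_export_rule below is a made-up helper that only illustrates the two valid spellings.

```shell
# nfs_export_rule MODE
# Print an export rule for all hosts; MODE must be "ro" or "rw".
nfs_export_rule() {
    case "$1" in
        ro|rw) echo "*($1)" ;;
        *)     echo "unsupported option: $1 (expected ro or rw)" >&2
               return 1 ;;
    esac
}

# Example (not run here):
# nfs_export_rule ro
```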

March 6, 2011

After an overnight memory test passes, capture the syslog again. It might take a few tries but eventually we should see the cause.
