February 19, 2011
I think I may be losing my mind here. My Unraid server seems to crash randomly, fairly frequently, and completely - i.e. no console control at all, not even responsiveness from the keyboard, which is the most frustrating part! I've run memtest overnight with zero errors, and my drives haven't shown any errors either. And because it's a full-on crash, I can't even recover the syslog file. I'm starting to think it may be motherboard related, but I don't know how to check for sure. Anyone have any insight?
February 19, 2011
See this post: http://lime-technology.com/forum/index.php?topic=9880.0
Capture your syslog in a telnet window with the tail command. Then we can see what happens just before it crashes.
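As a rough sketch of the capture step suggested here: the interactive command is `tail -f /var/log/syslog`, which is what gets run later in this thread. The helper below is only an illustration; the name `save_log_tail` and the default destination file are hypothetical, chosen on the assumption that a copy on the flash drive (mounted at /boot) survives a hard reset.

```shell
#!/bin/bash
# Interactive capture, run from a telnet session on another machine:
#   tail -f /var/log/syslog
# The lines printed before the hang remain visible in that window.

# Hypothetical helper: copy the last N lines of a log to another file
# and flush it to disk. All default paths are assumptions.
save_log_tail() {
    local src="${1:-/var/log/syslog}"
    local dest="${2:-/boot/syslog-tail.txt}"
    local lines="${3:-200}"
    tail -n "$lines" "$src" > "$dest" && sync
}
```

Called with no arguments, `save_log_tail` would copy the last 200 lines of /var/log/syslog; whether that is enough context depends on how chatty the log is before a crash.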
February 19, 2011 Author
Okay, so here's a crash that just happened. I'm using Unraid 4.7 and had just set the time on the machine. I then went to check on the cache drive and realized that the mover script hadn't run overnight as it was supposed to, so I initiated a manual move through the standard WebGUI. During the move, the machine became unresponsive, so I tried to go to the main page of the WebGUI to see if there was actually any movement happening, and the server crashed.

Tower login: root
Linux 2.6.32.9-unRAID.
root@Tower:~# tail -f /var/log/syslog
Feb 19 06:45:02 Tower emhttp: shcmd (36): /usr/sbin/hdparm -S0 /dev/sdb >/dev/null
Feb 19 06:45:02 Tower kernel: mdcmd (16): spinup 0
Feb 19 06:45:02 Tower kernel: mdcmd (17): spinup 1
Feb 19 06:45:02 Tower kernel: mdcmd (18): spinup 2
Feb 19 07:45:06 Tower kernel: mdcmd (19): spindown 0
Feb 19 07:45:07 Tower kernel: mdcmd (20): spindown 1
Feb 19 07:45:08 Tower kernel: mdcmd (21): spindown 2
Feb 19 15:00:03 Tower in.telnetd[2036]: connect from 192.168.1.100 (192.168.1.100)
Feb 19 15:00:05 Tower login[2037]: invalid password for `UNKNOWN' on `pts/0' from `192.168.1.100'
Feb 19 15:00:07 Tower login[2037]: ROOT LOGIN on `pts/0' from `192.168.1.100'
Feb 19 15:00:59 Tower emhttp: Spinning up all drives...
Feb 19 15:00:59 Tower emhttp: shcmd (37): /usr/sbin/hdparm -S0 /dev/sdb >/dev/null
Feb 19 15:00:59 Tower kernel: mdcmd (22): spinup 0
Feb 19 15:00:59 Tower kernel: mdcmd (23): spinup 1
Feb 19 15:00:59 Tower kernel: mdcmd (24): spinup 2
Feb 19 15:01:49 Tower emhttp: shcmd (38): ln -sf /usr/share/zoneinfo/America/New_York /etc/localtime-copied-from
Feb 19 15:01:49 Tower emhttp: shcmd (39): cp /etc/localtime-copied-from /etc/localtime
Feb 19 10:01:49 Tower emhttp: shcmd (40): /etc/rc.d/rc.ntpd restart 2>&1 | logger
Feb 19 10:01:50 Tower ntpd[2075]: ntpd [email protected] Wed Jan 14 23:46:25 UTC 2009 (1)
Feb 19 10:01:50 Tower logger: Stopping NTP daemon...Starting NTP daemon: /usr/sbin/ntpd -g
Feb 19 10:01:50 Tower ntpd[2076]: precision = 1.000 usec
Feb 19 10:01:50 Tower ntpd[2076]: ntp_io: estimated max descriptors: 1024, initial socket boundary: 16
Feb 19 10:01:50 Tower ntpd[2076]: Listening on interface #0 wildcard, 0.0.0.0#123 Disabled
Feb 19 10:01:50 Tower ntpd[2076]: Listening on interface #1 lo, 127.0.0.1#123 Enabled
Feb 19 10:01:50 Tower ntpd[2076]: Listening on interface #2 eth0, 192.168.1.107#123 Enabled
Feb 19 10:01:50 Tower ntpd[2076]: kernel time sync status 0040
Feb 19 10:01:28 Tower ntpd[2076]: synchronized to 216.234.161.11, stratum 2
Feb 19 10:01:28 Tower ntpd[2076]: time reset -31.759717 s
Feb 19 10:01:28 Tower emhttp: shcmd (41): /usr/sbin/hdparm -y /dev/sdb >/dev/null
Feb 19 10:01:32 Tower emhttp: shcmd (42): /usr/local/sbin/set_ncq sda 1 >/dev/null
Feb 19 10:01:32 Tower emhttp: shcmd (43): /usr/local/sbin/set_ncq sdd 1 >/dev/null
Feb 19 10:01:32 Tower emhttp: shcmd (44): /usr/local/sbin/set_ncq sdc 1 >/dev/null
Feb 19 10:01:32 Tower emhttp: shcmd (45): /usr/local/sbin/set_ncq sdb 1 >/dev/null
Feb 19 10:01:32 Tower kernel: mdcmd (25): set md_num_stripes 1280
Feb 19 10:01:32 Tower kernel: mdcmd (26): set md_write_limit 768
Feb 19 10:01:32 Tower kernel: mdcmd (27): set md_sync_window 288
Feb 19 10:01:32 Tower kernel: mdcmd (28): set spinup_group 0 0
Feb 19 10:01:32 Tower kernel: mdcmd (29): set spinup_group 1 0
Feb 19 10:01:32 Tower kernel: mdcmd (30): set spinup_group 2 0
Feb 19 10:05:47 Tower ntpd[2076]: synchronized to 216.234.161.11, stratum 2
Feb 19 10:10:36 Tower emhttp: shcmd (46): /usr/local/sbin/mover 2>&1 | logger &
Feb 19 10:10:36 Tower logger: mover started
Feb 19 10:10:36 Tower logger: ./MOVIES/Movies to Move/Up/Up.mkv
Feb 19 10:10:36 Tower logger: .d..t...... ./
Feb 19 10:10:36 Tower logger: .d..t...... MOVIES/
Feb 19 10:10:36 Tower logger: .d..t...... MOVIES/Movies to Move/
Feb 19 10:10:36 Tower logger: cd+++++++++ MOVIES/Movies to Move/Up/
Feb 19 10:10:36 Tower logger: >f+++++++++ MOVIES/Movies to Move/Up/Up.mkv
Connection closed by foreign host.
February 20, 2011 Author
Okay, so I just had another crash, and this time the logger caught it. Looks like I ran out of memory? My machine has 2 GB minus 128 MB for the onboard video. When this crash occurred, I had opened a telnet session to move a folder from one location on the cache drive to a second location on the cache drive. It was a video file, so it was on the larger side (if that makes a difference). I've attached the syslog in .zip format (syslog.zip).
February 20, 2011 Author
And yet another crash while we had a movie streaming from the server... This is getting to be quite frustrating. Last weekend I did a fresh install of Unraid on the flash drive, ran memtest overnight, checked all the cables (SATA, network, etc.), and moved off of the switch and onto the router, and still I'm getting crashes. What gives?!
February 20, 2011
Did you disable any and all add-ons first? You could have something that's writing to the in-memory, RAM-based filesystem. That could easily explain your out-of-memory conditions. Please post the contents of your "GO" file for verification.
February 20, 2011
You are constantly running out of memory, which is crashing the server. Can you tell us which add-ons you are using? Is any process in top running high?
EDIT: typing at the same time as BRiT
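One way to watch for the memory exhaustion described here is to record free memory at intervals, so the trend is visible even after a hard crash. The following is a minimal sketch, not a tool anyone in this thread used: the function name `mem_summary` is hypothetical, and it assumes a Linux-style /proc/meminfo with `MemFree:` and `Cached:` lines reported in kB.

```shell
#!/bin/bash
# Hypothetical helper: print free and cached memory in MB,
# read from a meminfo-format file (default: /proc/meminfo).
mem_summary() {
    local meminfo="${1:-/proc/meminfo}"
    awk '/^MemFree:/ {free = $2}
         /^Cached:/  {cached = $2}
         END {printf "free_mb=%d cached_mb=%d\n", free / 1024, cached / 1024}' "$meminfo"
}
```

Appending the output of `mem_summary` to a file on the flash drive once a minute, followed by `sync`, would leave a record of how quickly memory was disappearing before the last crash; the log file name and interval are up to the reader.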
February 20, 2011 Author
Re: Addons:
UnMenu (invoked by me)
Clean Powerdown (UnMenu package)
SABNzbd
CouchPotato
Sickbeard

Here's my Go Script:

#!/bin/bash
# Start the Management Utility
/usr/local/sbin/emhttp &

# determine if cache drive online, retry upto 7 times
for i in 0 1 2 3 4 5 6 7
do
    if [ ! -d /mnt/cache ]
    then
        sleep 15
    fi
done

# If Cache drive is online, start SABnzbd, Sickbeard, and CouchPotato
if [ -d /mnt/cache ]; then
    cd /mnt/cache/.custom
    installpkg /boot/packages/SABnzbdDependencies-2.1-i486-unRAID.tgz
    python /mnt/cache/.custom/sabnzbd/SABnzbd.py -d
    python /mnt/cache/.custom/sickbeard/SickBeard.py --daemon
    python /mnt/cache/.custom/couchpotato/CouchPotato.py -d
fi

cd /boot/packages && find . -name '*.auto_install' -type f -print | sort | xargs -n1 sh -c
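The wait loop in the script above runs eight checks (0 through 7) and only sleeps while /mnt/cache is missing, so it works, but its intent is easy to misread. A possible variant, offered as a sketch only, wraps the same idea in a function that returns as soon as the directory appears. The name `wait_for_dir` and its default values are hypothetical, and this is unrelated to the cause of the crashes.

```shell
#!/bin/bash
# Hypothetical helper: wait until a directory exists, checking up to TRIES times
# with DELAY seconds between checks.
# Returns 0 as soon as the directory is present, non-zero if it never appears.
wait_for_dir() {
    local dir="$1" tries="${2:-8}" delay="${3:-15}"
    local i
    for ((i = 0; i < tries; i++)); do
        [ -d "$dir" ] && return 0
        sleep "$delay"
    done
    [ -d "$dir" ]
}
```

In a go script this could be used as `if wait_for_dir /mnt/cache; then ... fi`, replacing both the retry loop and the separate directory test.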
February 20, 2011 Author
In the interest of providing as much information as possible, here's a "Fresh Syslog" taken right after I restarted the machine this morning (I canceled the parity check). Attached: Syslog_Fresh.zip
February 21, 2011 Author
Would adding more RAM remedy my issues, or is there a chance I could still run out of memory?
February 21, 2011
If a process is allocating memory, but not freeing it, it will run out of memory eventually anyways. It will just take longer.
Joe L.
February 21, 2011 Author
So any advice, tips, or troubleshooting steps, or am I doomed to continue hard resetting this thing twice a day?
EDIT - okay, so I did some searching; should I just try a completely fresh install for a while?
February 21, 2011
Your best bet is to comment out all but the emhttp line in your "go" script, do NOT run the add-ons, and see if you still crash. Then, if not, you can enable them one by one and determine which one is eating up your memory.
Joe L.
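To illustrate the "comment out all but the emhttp line" step, here is a small sketch that prints a copy of a go script with every other command line commented out. The function name `disable_addons` is hypothetical, and the path in the usage note is an assumption about where the go script lives; this thread does not state it. The function writes to standard output and never modifies the original file.

```shell
#!/bin/bash
# Hypothetical helper: print a go script with every line commented out
# except existing comments, blank lines, and the line that starts emhttp.
disable_addons() {
    sed -e '/^#/b' \
        -e '/^[[:space:]]*$/b' \
        -e '/emhttp/b' \
        -e 's/^/# /' "$1"
}
```

For example, `disable_addons /boot/config/go > /tmp/go.minimal` would produce a stripped-down copy to review before swapping it in; add-ons can then be re-enabled one at a time by removing the leading `# ` from the relevant lines.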
February 21, 2011 Author
Sorry, got impatient and went ahead with a clean format on the flash drive. Got the server up and running without any add-ons whatsoever, got my shares set up, and then ran top. While I had top running, I went through the web GUI to manually move the files from my cache drive to my array, and, naturally, crashed. Here's a screenshot of top when I crashed. How is it possible that the Move function can cause me to crash so badly? The files are large (4 GB on average) but I thought the system would be able to manage. Does this mean I should get rid of the cache drive altogether?
February 21, 2011
I have been thinking about your case for some time now, but I have no clue what is happening to your machine. I constantly move 8 GB+ files from my cache to my array and never have problems. My top is much busier than yours; when rsyncing, it sometimes goes up to 7.
The only thing I can think of is that you are somehow moving stuff to your memory and not to your disks, but I have no clue how that could happen, so I will have to leave you in the hands of more experienced people on the forum.
For good order, maybe list your setup components; maybe there is a known problem with your hardware.
February 21, 2011
I like how you are now trying to isolate the cause of the crash by disabling add-ons. It is a very smart way to go about it.

The only time I've seen a consistent crash when accessing a specific disk (either for reading or writing) is when the file-system on it had some corruption. You can easily un-assign the cache and eliminate it as a possibility. It will let you see if the remaining components in your server can run without crashing.

You can also perform a file-system check of each of the file-systems on your disks (all but the parity disk, as it has no file-system). To do that you would need to follow the steps outlined here in the wiki: http://lime-technology.com/wiki/index.php?title=Check_Disk_Filesystems

Basically, after stopping samba and un-mounting the disk being checked you would type
reiserfsck --check /dev/md1
and then
reiserfsck --check /dev/md2
etc.

Joe L.
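The per-disk check described above can be sketched as a dry run that only prints the commands for each data disk, leaving the reader to run them by hand. The function name `fsck_plan` is hypothetical. Stopping Samba is deliberately not included, since the exact command is given on the wiki page rather than in this post.

```shell
#!/bin/bash
# Hypothetical dry-run helper: print the un-mount and check commands
# for each data disk number given as an argument. Nothing is executed.
# Samba must already be stopped; see the wiki page for that step.
fsck_plan() {
    local n
    for n in "$@"; do
        printf 'umount /dev/md%s\n' "$n"
        printf 'reiserfsck --check /dev/md%s\n' "$n"
    done
}
```

Running `fsck_plan 1 2` lists the commands for the two data disks in a three-drive array like the one in this thread; the parity disk is skipped because it has no file-system.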
February 21, 2011 Author
I'll try your suggestion, Joe. I'm wondering if the cache drive is just bad. As for hardware:
Motherboard: Gigabyte MA790GPT-UD3H (HPA disabled)
CPU: AMD Athlon II x4 630 (not overclocked)
RAM: G Skill 2GB (don't remember model type...I've run memtest, and it's good though)
PSU: Corsair HX750
Case: Norco 4020
Flash: 8GB SanDisk Cruzer
Parity: 2TB WD20EADS
Drive 1: 2TB WD20EADS
Drive 2: 2TB WD20EARS
Cache: 300GB Seagate ST332
February 23, 2011 Author
All disks checked out fine. Ran a SMART report on them as well, and they all checked out.
February 24, 2011 Author
So figure this one out: on a completely bare install of Unraid 4.7 that was not connected to the network at all - i.e. no patch cable connected - I set the console to pre-clear the cache drive (since SMART came back clean) and it froze/crashed/hung/became totally unresponsive. This HAS to be a hardware issue, right!? But WHICH hardware do I start with? I don't have extra RAM or an extra motherboard lying around. Arrrgghhhh!
February 24, 2011
Start with memory. If it is not set up correctly, with the correct voltage, clock speed, and timing, all bets are off. Some BIOSes get it right, many do not, especially with premium RAM. After checking that the memory voltage, timing, and clock speed are set correctly in the BIOS, run a memory test overnight.
March 5, 2011 Author
I think I may have found the culprit, but I'm unsure how to fix it. I had kept all devices except my main computer from accessing the server and all was good. After a few days of running without any problems, I thought I would try to watch a movie using my WDLX-TV Live (using brad's firmware upgrade). Well, that only seems to be able to read shares over NFS. So, I set up the NFS shares with *(rw) and then the crashes began again. I'm guessing something from the WDLX-TV Live is writing to or "communicating" with the server and causing these crashes. I tried to set up the shares as read-only. I assumed the correct command was *(r), but that didn't work, and neither did *(w). So now I'm a bit stumped; the WDLX-TV Live is my main media player and the primary reason why I'm using the Unraid server, so I'm looking for any advice anyone may have. For reference, I haven't reconnected my cache drive, and am running no other add-ons.
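On the read-only question: in the standard Linux exports(5) syntax, the access options are `rw` (read-write) and `ro` (read-only); there is no single-letter `r` or `w` option. The sketch below assumes the unRAID export field accepts that same syntax, which is not confirmed anywhere in this thread, and the function name `nfs_export_rule` is hypothetical. It only builds the rule string; whether a read-only export stops the crashes is a separate question.

```shell
#!/bin/bash
# Hypothetical helper: build an NFS export rule such as "*(ro)".
# Only the two access modes from exports(5) are accepted.
nfs_export_rule() {
    local host="${1:-*}" mode="$2"
    case "$mode" in
        rw|ro) printf '%s(%s)\n' "$host" "$mode" ;;
        *)     echo "unsupported mode: $mode (use rw or ro)" >&2
               return 1 ;;
    esac
}
```

For example, `nfs_export_rule '*' ro` yields the read-only counterpart of the `*(rw)` rule used above, and passing `r` or `w` is rejected.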
March 6, 2011
After an overnight memory test passes, capture the syslog again. It might take a few tries but eventually we should see the cause.
This topic is now archived and is closed to further replies.