
Server 'Stuck'


Stucco


A drive had write errors yesterday, but I suspected it was because I was just moving things around.  I did a full diag test on it and it came back fine.  I re-added it to its previous spot and started a data rebuild this morning.  It was proceeding as usual.  When I got home after work I could no longer access the web gui, the samba shares, telnet or unmenu.  I can ping by IP but not by hostname.  The four 2TB drives (the largest) are still rebuilding, based on their activity lights.

 

I started looking at how to restart the webgui or the hostname or something.  I stumbled across the powerdown scripts and thought they might be handy for when I did have to restart the server, so I went over to the server and logged in to use wget and install the script.  I went to the root dir 'cd /' and typed 'ls' and it came back with one entry, 'Killed'.  I typed 'ls -al' and it booted me and took me back to the login prompt.  Now when I try to login it says "Linux 2.6.32.9-unRAID.  Welcome to Linux 2.6.32.9-unRAID (ttyl)" and then immediately redisplays the login prompt.

 

So now I have no gui, no telnet, no hostname, no direct login, a drive rebuilding (that appears to still be active), no proper way to shut down (for when the drive finishes) and no access to the syslog to attach to this post.

 

Hopefully there is a good answer!

Link to comment

The answer is simple: something has used up all available RAM, and the kernel's out-of-memory (OOM) killer is killing off whatever processes it thinks it can to free more.
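If you can still get a shell, one way to check for this is to look for the OOM killer's messages in the syslog. This is only a sketch: it assumes the log lives at /var/log/syslog and that the kernel logged lines containing "Out of memory" or "oom-killer". The helper name count_oom_events is made up for illustration.

```shell
# Hypothetical helper: count syslog lines that look like OOM-killer activity.
# Prints the number of matching lines (0 if none were found).
count_oom_events() {
    grep -c -i -E 'out of memory|oom-killer' "$1"
}

# Possible usage on the server (paths are assumptions):
#   count_oom_events /var/log/syslog
#   free -m    # shows how much RAM is currently used and free
```

A non-zero count would support the out-of-memory theory, but it does not say what used the memory in the first place.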

 

You'll be lucky if you can get anything to cleanly stop unless you can free some RAM.  If it is your syslog, you might be able to do something like:

cp /dev/null /var/log/syslog

 

Of course, that will wipe away any clues as to what filled it as it will be then 0 bytes long.
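To keep at least some of those clues, the tail of the log could be saved somewhere persistent before emptying it. A rough sketch under stated assumptions: save_and_truncate is a hypothetical helper, and /boot (the flash drive) is just one possible destination. With the OOM killer active, tail itself may get killed, so this is not guaranteed to work.

```shell
# Hypothetical helper: save the last N lines of a log, then empty the log.
#   $1 = log file, $2 = destination file, $3 = number of lines (default 200)
save_and_truncate() {
    log="$1"
    dest="$2"
    lines="${3:-200}"
    # Copy the tail first; give up without truncating if the copy fails.
    tail -n "$lines" "$log" > "$dest" || return 1
    # Empty the log in place (same effect as cp /dev/null on it).
    : > "$log"
}

# Possible usage (destination path is an assumption):
#   save_and_truncate /var/log/syslog /boot/syslog-tail.txt 500
```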

 

You basically can only reboot... 

 

When something like this happens to me, it usually kills my login session, so I can't do anything.

 

Joe L.

Link to comment

I waited till the activity lights were out and hard shut down the system.  It rebooted, and immediately froze at the login prompt.  I had to hard reboot again.  Upon reboot disk3 activity light has been on solid for 10 minutes (disk2 was the one rebuilding).  Here is the current status:

 

unledtsv.png

 

Attached is my syslog.  Why does disk2 show negative free space?  Why is the parity check moving at such a slow pace?  Normally it is between 9 MB/s and 30 MB/s.

syslog-2011-11-18.zip

Link to comment

Also, how much memory should I expect unraid to use?  I have 2GB in the server currently.  I thought that would be enough.
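For anyone wanting to watch memory on their own box, the numbers can be read straight out of /proc/meminfo. A minimal sketch; mem_kb is a made-up helper name, and the second argument exists only so that it can be pointed at a saved copy of the file.

```shell
# Hypothetical helper: print one field of /proc/meminfo in kB.
#   $1 = field name without the colon (e.g. MemTotal, MemFree)
#   $2 = file to read (defaults to /proc/meminfo)
mem_kb() {
    awk -v key="$1:" '$1 == key { print $2 }' "${2:-/proc/meminfo}"
}

# Possible usage:
#   echo "total: $(mem_kb MemTotal) kB, free: $(mem_kb MemFree) kB"
```

Note that a low MemFree on its own is normal on Linux, since spare RAM is used for disk cache; it is the OOM messages in the syslog that indicate real trouble.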

 

I do have sabnzbd and sickbeard installed and running off the cache drive, however, they were shut down during the drive rebuild when this memory problem occurred.

Link to comment

Tried to stop the array.  It refreshed to show unmounting on all drives, and then after about 10 minutes the web gui stopped responding.  I was able to successfully run the powerdown script.

Booted back up and it still showed negative space for drive2.  drive3 was still staying lit significantly longer than other drives.  Tried to stop the array to remove drive2 and rebuild again, but the stop failed again.

Link to comment

Ran memtest this AM; it froze after 1 pass.  Removed one 1GB stick, leaving the other 1GB stick in place.  It then ran through 4 passes.

 

Removed cache drive from system so unraid and SAB will not start.  Booted system.  Again, disk2 shows negative space and disk3 is reading very slowly (although faster than yesterday).

 

Syslog attached.

 

I suspect I am going to have to rebuild disk2 again, but I'd like to figure out what is wrong with disk3 first.

syslog-2011-11-18_1.zip

Link to comment

Removing the cache drive may not stop sabnzbd from starting.

Link to comment

I ran the full diagnostics on disk3 and it came back clean.

 

I put a spare drive in place of disk2, brought up the server and started a data rebuild.  It still showed the same negative free space for disk2, so I don't think a disk2 rebuild is going to fix this problem.  I stopped the rebuild.  Please advise.

Link to comment

5 or 6 times in a row now I have booted and started the rebuild of disk2.  It reaches maybe 3-6% and then completely locks up.

 

I have tried replacing disk2 with the original drive and a second drive.  It is the same result in both cases.

 

I ran memtest all night and had 25 passes, 0 failures.

Link to comment

Linux 2.6.32.9-unRAID.
root@Tower:~# tail -f /var/log/syslog
Nov 27 22:07:21 Tower emhttp: shcmd (26): cp /etc/exports- /etc/exports
Nov 27 22:07:21 Tower emhttp: shcmd (27): mkdir /mnt/user
Nov 27 22:07:21 Tower emhttp: shcmd (28): /usr/local/sbin/shfs /mnt/user  -o noatime,b
Nov 27 22:07:22 Tower emhttp: get_config_idx: fopen /boot/config/shares/incomplete.cfg
Nov 27 22:07:22 Tower emhttp: shcmd (29): killall -HUP smbd
Nov 27 22:07:22 Tower emhttp: shcmd (30): /etc/rc.d/rc.nfsd restart | logger
Nov 27 22:08:01 Tower in.telnetd[1654]: connect from 192.168.1.101 (192.168.1.101)
Nov 27 22:08:05 Tower login[1655]: ROOT LOGIN  on `pts/0' from `192.168.1.101'
Nov 27 22:08:38 Tower unmenu-status: Starting unmenu web-server
Nov 27 22:08:38 Tower init: Re-reading inittab

 

All of this came up near boot time.  Nothing else came up before it froze a few minutes later.

 

Specs:

MB: ASUS P5N-E SLI http://www.newegg.com/Product/Product.aspx?Item=N82E16813131142

CPU: Intel Core 2 Duo E4300 http://www.newegg.com/Product/Product.aspx?Item=N82E16819115013

RAM: 1GB PQI MAD42GUOE-X2 http://www.newegg.com/Product/Product.aspx?Item=N82E16820141345 - 1GB in, 1GB removed

Controller Card: Rosewill RC-218 http://www.newegg.com/Product/Product.aspx?Item=N82E16816132018

Hard Drive Array: 2 X Supermicro CSE-M35T-1B 5x3 http://www.newegg.com/Product/Product.aspx?Item=N82E16817121405

Video: Diamond Stealth 3D 2000 Pro 2MB+ v3.01 (PCI vid card, very old)

Power Supply: Rosewill RV2-700 http://www.newegg.com/Product/Product.aspx?Item=N82E16817182173

Case: AZZA Helios 910 http://www.newegg.com/Product/Product.aspx?Item=N82E16811517007

CPU Fan: Arctic Freezer Pro 7 http://www.newegg.com/Product/Product.aspx?Item=N82E16835186134

 

I booted it again to see if I could catch anything else in the syslog.  It did not complete the boot cycle.  It died while samba was starting.

 

CPU temp looks fine so I don't think it is an overheating problem (33C).

 

Thank you for the help!!!!

Link to comment

After a hard shutdown unRaid (Linux) is trying to replay journaled transactions to the disks. While this is happening, depending on your version, the web GUI can be unresponsive for 1/2 hour or more. If you have unmenu installed, you CAN bring that up and see the array status.

 

I believe the lengthy delays are partially caused by running the parity check at the same time journaled transactions are being replayed, but I'd recommend giving the system an extended time to come up (at least 1/2 hour) before forcing a second hard reboot.
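One way to avoid rebooting too early is to poll for the web GUI from another machine with a generous timeout. A sketch under assumptions: wait_until is a hypothetical helper, and the URL http://tower/ is only a guess based on the hostname in the syslog above, so substitute your own.

```shell
# Hypothetical helper: retry a command until it succeeds or a timeout passes.
#   $1 = timeout in seconds, $2 = seconds between attempts, rest = command
# Returns 0 as soon as the command succeeds, 1 if the timeout is reached.
wait_until() {
    timeout="$1"
    interval="$2"
    shift 2
    waited=0
    while ! "$@"; do
        [ "$waited" -ge "$timeout" ] && return 1
        sleep "$interval"
        waited=$((waited + interval))
    done
    return 0
}

# Possible usage: give the server 30 minutes, checking every 30 seconds.
#   wait_until 1800 30 wget -q -O /dev/null http://tower/ \
#       && echo "web GUI is back" \
#       || echo "still not responding after 30 minutes"
```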

Link to comment

I removed all disks from the supermicro arrays and booted the system.  It locked within 20 minutes.  I can't imagine it is a bad supermicro backplane after that.

 

I unplugged the power to all disks and booted the system.  It ran for 1.5 hours and did not lock up.

 

I plugged in the power to parity, disk1 and disk2 and just booted the system.  I will reply with the result.

Link to comment

Normally I have 7 drives.  6 of them are WD Caviar Green, and my cache is an old 300GB Maxtor whose speed I am not sure of.  I have enough space to reduce the number of drives if needed; however, I don't think I can until I get through the rebuild.  I have a spare PSU sitting around, an LC-A350ATX, and some standard -> sata power converters, but I don't have enough to do all the drives.

 

I'm not sure exactly what you mean by the double vs. single 12 volt rail.  Can you elaborate a little so I know what I need to replace my existing one with?

 

Thank you!!

Link to comment

Archived

This topic is now archived and is closed to further replies.

