Server 'Stuck'

A drive had write errors yesterday, but I suspected it was because I was just moving things around.  I did a full diag test on it and it came back fine.  I re-added it to its previous spot and started a data rebuild this morning.  It was proceeding as usual.  When I got home after work I could no longer access the web gui, the samba shares, telnet or unmenu.  I can ping by IP but not by hostname.  The four 2TB drives (the largest) are still rebuilding, judging by their activity lights.

 

I started looking at how to restart the webgui or the hostname or something.  I stumbled across the powerdown scripts and thought they might be handy for when I did have to restart the server, so I went over to the server and logged in to use wget and install the script.  I went to the root dir with 'cd /' and typed 'ls', and it came back with one entry, 'Killed'.  I typed 'ls -al' and it booted me and took me back to the login prompt.  Now when I try to log in it says "Linux 2.6.32.9-unRAID.  Welcome to Linux 2.6.32.9-unRAID (tty1)" and then immediately redisplays the login prompt.

 

So now I have no gui, no telnet, no hostname, no direct login, a drive rebuilding (that appears to still be active), no proper way to shut down (for when the drive finishes) and no access to the syslog to attach to this post.

 

Hopefully there is a good answer!

The answer is simple: something has used up all available RAM, and the kernel's out-of-memory killer is killing off whatever it thinks it can to free more.

 

You'll be lucky if you can get anything to stop cleanly unless you can free some RAM.  If it is your syslog that has grown to fill memory (on unRAID the root filesystem, including /var/log, lives in RAM), you might be able to do something like:

cp /dev/null /var/log/syslog

 

Of course, that will wipe away any clues as to what filled it, as the file will then be 0 bytes long.
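
If a login session is still alive, it may be possible to keep those clues by copying the syslog somewhere persistent before emptying it. What follows is only a sketch, not a tested unRAID procedure: the helper name save_and_truncate and the destination filename are made up for illustration, and it assumes the flash drive mounted at /boot is writable and that cp itself is not killed by the out-of-memory killer.

```shell
#!/bin/sh
# Hypothetical helper (not part of unRAID): copy a log file to a
# persistent location, then empty the original to free the RAM it uses.
# Usage: save_and_truncate <logfile> <backup-destination>
save_and_truncate() {
    # Keep a copy first so the evidence is not lost; stop if the copy fails.
    cp "$1" "$2" || return 1
    # Truncate in place; same effect as: cp /dev/null "$1"
    : > "$1"
}

# Example, with assumed paths (commented out so nothing runs by accident):
#   free -m                    # how much RAM is left
#   ls -lh /var/log/syslog     # how large the syslog has grown
#   save_and_truncate /var/log/syslog /boot/syslog-saved.txt
```

The copy is done before the truncation on purpose: if the copy fails, the original log is left untouched.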

 

You basically can only reboot... 

 

When something like this happens to me, it usually kills my login session, so I can't do anything.

 

Joe L.

  • Author

Is it likely that my drive rebuild will finish?  The activity lights are still going like crazy.

  • Author

I waited till the activity lights were out and hard shut down the system.  It rebooted and immediately froze at the login prompt, so I had to hard reboot again.  Upon reboot, disk3's activity light has been on solid for 10 minutes (disk2 was the one rebuilding).  Here is the current status:

 

unledtsv.png

 

Attached is my syslog.  Why does disk2 show negative free space?  Why is the parity check moving at such a slow pace?  Normally it is between 9 MB/s and 30 MB/s.

syslog-2011-11-18.zip

  • Author

Also, how much memory should I expect unraid to use?  I have 2GB in the server currently.  I thought that would be enough.

 

I do have sabnzbd and sickbeard installed and running off the cache drive, however, they were shut down during the drive rebuild when this memory problem occurred.

  • Author

Any time I do anything that hits the drives, like refreshing the main web gui page, all the drives flash their activity lights, but disk3 stays lit for much longer (1-5s) before the page responds.

Also, how much memory should I expect unraid to use?  I have 2GB in the server currently.  I thought that would be enough.

I run in 512 Meg.  It is plenty for basic file server operations.

  • Author

I tried to stop the array.  The page refreshed to show all drives unmounting, and then after about 10 minutes the web gui stopped responding.  I was able to run the powerdown script successfully.

I booted back up and it still showed negative free space for disk2.  disk3 was still staying lit significantly longer than the other drives.  I tried to stop the array to remove disk2 and rebuild again, but the stop failed again.

  • Author

Ran memtest this AM; it froze after 1 pass.  Removed one 1GB stick, leaving the other 1GB stick in place, and it ran through 4 passes.

 

Removed cache drive from system so unraid and SAB will not start.  Booted system.  Again, disk2 shows negative space and disk3 is reading very slowly (although faster than yesterday).

 

Syslog attached.

 

I suspect I am going to have to rebuild disk2 again, but I'd like to figure out what is wrong with disk3 first.

syslog-2011-11-18_1.zip

Removing the cache drive may not stop sabnzbd from starting.

  • Author

That's how mine is set up.  My go script checks for the existence of /mnt/cache and starts conditionally on that.
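
For readers unfamiliar with that kind of check, a conditional start can be sketched roughly as below. This is not the actual go script from this server: the helper name start_if_present is hypothetical and the SABnzbd path in the example is an assumption. Note that a directory test only proves the directory exists, not that a drive is mounted there, which may be what the reply above is hinting at.

```shell
#!/bin/sh
# Hypothetical helper: run a command only if a directory exists.
# Usage: start_if_present <dir> <command> [args...]
start_if_present() {
    dir="$1"
    shift
    if [ -d "$dir" ]; then
        # Directory is there, so run the remaining arguments as a command.
        "$@"
    else
        echo "skipping: $dir not present" >&2
        return 1
    fi
}

# Example for a go script (the install path is an assumption):
#   start_if_present /mnt/cache python /mnt/cache/sabnzbd/SABnzbd.py -d
```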

 

Anyone have any idea what is going on with disk2?

  • Author

I also have a spare 2TB drive I can swap in if that will help.

  • Author

I ran the full diagnostics on disk3 and it came back clean.

 

I put a spare drive in place of disk2, brought up the server and started a data rebuild.  It still showed the same negative free space for disk2, so I don't think a disk2 rebuild is going to fix this problem.  I stopped the rebuild.  Please advise.

  • Author

I got maybe 10% into the rebuild of disk2 and then my server failed to respond via smb or telnet or direct keyboard interaction.

 

I rebooted and it immediately began the rebuild again.  Syslog is attached.

syslog-2011-11-26.zip

  • Author

5 or 6 times in a row now I have booted and started the rebuild of disk2.  It reaches maybe 3-6% and then completely locks up.

 

I have tried replacing disk2 with the original drive and a second drive.  It is the same result in both cases.

 

I ran memtest all night and had 25 passes, 0 failures.

Run

 

tail -f /var/log/syslog

 

while you are attempting the rebuild and see what it shows as the server is crashing.

 

Please also list out ALL of your hardware specs.

  • Author

Linux 2.6.32.9-unRAID.
root@Tower:~# tail -f /var/log/syslog
Nov 27 22:07:21 Tower emhttp: shcmd (26): cp /etc/exports- /etc/exports
Nov 27 22:07:21 Tower emhttp: shcmd (27): mkdir /mnt/user
Nov 27 22:07:21 Tower emhttp: shcmd (28): /usr/local/sbin/shfs /mnt/user  -o noatime,b
Nov 27 22:07:22 Tower emhttp: get_config_idx: fopen /boot/config/shares/incomplete.cfg
Nov 27 22:07:22 Tower emhttp: shcmd (29): killall -HUP smbd
Nov 27 22:07:22 Tower emhttp: shcmd (30): /etc/rc.d/rc.nfsd restart | logger
Nov 27 22:08:01 Tower in.telnetd[1654]: connect from 192.168.1.101 (192.168.1.101)
Nov 27 22:08:05 Tower login[1655]: ROOT LOGIN  on `pts/0' from `192.168.1.101'
Nov 27 22:08:38 Tower unmenu-status: Starting unmenu web-server
Nov 27 22:08:38 Tower init: Re-reading inittab

 

All of this came up near boot time.  Nothing else came up before it froze a few minutes later.

 

Specs:

MB: ASUS P5N-E SLI http://www.newegg.com/Product/Product.aspx?Item=N82E16813131142

CPU: Intel Core 2 Duo E4300 http://www.newegg.com/Product/Product.aspx?Item=N82E16819115013

RAM: 1GB PQI MAD42GUOE-X2 http://www.newegg.com/Product/Product.aspx?Item=N82E16820141345 - 1GB in, 1GB removed

Controller Card: Rosewill RC-218 http://www.newegg.com/Product/Product.aspx?Item=N82E16816132018

Hard Drive Array: 2 X Supermicro CSE-M35T-1B 5x3 http://www.newegg.com/Product/Product.aspx?Item=N82E16817121405

Video: Diamond Stealth 3D 2000 Pro 2MB+ v3.01 (PCI vid card, very old)

Power Supply: Rosewill RV2-700 http://www.newegg.com/Product/Product.aspx?Item=N82E16817182173

Case: AZZA Helios 910 http://www.newegg.com/Product/Product.aspx?Item=N82E16811517007

CPU Fan: Arctic Freezer Pro 7 http://www.newegg.com/Product/Product.aspx?Item=N82E16835186134

 

I booted it again to see if I could catch anything else in the syslog.  It did not complete the boot cycle.  It died while samba was starting.

 

CPU temp looks fine so I don't think it is an overheating problem (33C).

 

Thank you for the help!!!!

After a hard shutdown, unRAID (Linux) tries to replay journaled transactions to the disks.  While this is happening, depending on your version, the web GUI can be unresponsive for half an hour or more.  If you have unmenu installed, you CAN bring that up and see the array status.

 

I believe the lengthy delays are partially caused by running the parity check at the same time journaled transactions are being replayed, but I'd recommend giving the system an extended time to come up (at least half an hour) before forcing a second hard reboot.

  • Author

When it locks it will not respond by the web gui, telnet, unmenu, or direct input to the keyboard (i.e. attempted login).  Also when it locks all disk activity immediately ceases.

  • Author

If some of my hardware is incompatible I would love to buy replacements today on Cyber Monday.  Tonight when I boot it I will leave it in the locked state for at least one hour to verify the above problem is not actually occurring.

 

Thanks for the help!

It might be a bad slot in the Supermicro backplane.

 

Try removing those from the equation and hooking up all drives outside of that.

  • Author

Will do tonight and reply with the results.  Thanks!

  • Author

I removed all disks from the Supermicro enclosures and booted the system.  It locked within 20 minutes.  I can't imagine it is a bad Supermicro backplane after that.

 

I unplugged the power to all disks and booted the system.  It ran for 1.5 hours and did not lock up.

 

I plugged in the power to parity, disk1 and disk2 and just booted the system.  I will reply with the result.

How many drives do you have in the system?

 

I noticed that the PSU you are using has two +12V rails rated at 28 A each.  This is not ideal for an unRAID machine, as you really want a single +12V rail.

  • Author

Normally I have 7 drives.  6 of them are WD Caviar Green and my cache is an old 300GB Maxtor whose speed I am not sure of.  I have enough space to reduce the number of drives if needed; however, I don't think I can until I get through the rebuild.  I have a spare PSU sitting around, an LC-A350ATX, and some standard-to-SATA power adapters, but I don't have enough to do all the drives.

 

I'm not sure exactly what you mean by the double vs. single 12 volt rail.  Can you elaborate a little so I know what I need to replace my existing one with?

 

Thank you!!
